Wesley Dean

Metrics, Intent, and the Drift Problem in AI-Assisted Development

In my news feed, I recently saw a TechCrunch post about "tokenmaxxing" and how using LLM token consumption as a measure of developer productivity wasn't having the desired effect. The article pointed out that the metrics tried over the years have all been ineffective, to varying degrees, at quantifying how productive developers are. Those metrics include the number of lines of code (LoC) added to a project, the number of commits pushed to a repository, code acceptance rates, and many more.

The goal of these metrics is to reduce the value of a developer to a number that can be easily compared with other numbers. These numbers can be added to a spreadsheet, tracked over time, turned into graphs, and presented as objective signals of progress.

That approach is understandable, but it's also incomplete.

The Limits of Measuring Output

The issue is not that the metrics exist, but what they attempt to measure. Most commonly used development metrics focus on output:

  • lines of code written
  • commits pushed
  • pull requests merged
  • tokens consumed

What they do not capture is intent.

Software is not merely a collection of artifacts; it's the expression of decisions made under constraints. It reflects trade-offs, assumptions, priorities, and context. Those elements are not directly visible in output metrics.

When a system measures output without capturing intent, behavior adapts. This is not a sign of manipulation or laziness. It is a natural response to incentives.

If a developer is told that their performance will be evaluated based on a single number, it is reasonable to expect that they will optimize for that number. If lines of code are rewarded, code will become more verbose. If commit counts are rewarded, commits will become smaller and more frequent. If token usage is rewarded, interaction patterns will change accordingly.

The system produces exactly what it is designed to produce.
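As a small illustration of how a lines-of-code metric can reward verbosity, here is a minimal Bash sketch. The functions `max_concise`, `max_verbose`, and `loc_of` are hypothetical names invented for this example; they do not come from any project or report mentioned in this post.

```shell
#!/usr/bin/env bash
# Hypothetical example: two functions with identical behavior.
# Each prints the larger of two integers.

max_concise() {
  echo $(( $1 > $2 ? $1 : $2 ))
}

max_verbose() {
  local first="$1"
  local second="$2"
  local result
  if [ "$first" -gt "$second" ]; then
    result="$first"
  else
    result="$second"
  fi
  echo "$result"
}

# Print the number of lines in a function's definition as Bash renders it
# with `declare -f`. The arithmetic expansion strips any padding that some
# `wc` implementations add.
loc_of() {
  echo $(( $(declare -f "$1" | wc -l) ))
}

echo "max_concise: $(loc_of max_concise) lines, result $(max_concise 3 7)"
echo "max_verbose: $(loc_of max_verbose) lines, result $(max_verbose 3 7)"
```

Both functions return the same answer, but a metric that counts lines scores the verbose one higher. Nothing about the number reveals that the extra lines add no value.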

What Metrics Miss

Consider the team member who serves as the connective tissue across a group. They mentor others, resolve misunderstandings, and help maintain alignment. Their contributions are substantial. They are also difficult to quantify.

Consider the architect who spends hours working through system design at a whiteboard, identifying constraints, and preventing downstream failure modes. That work may not produce immediate artifacts, yet it shapes everything that follows.

Consider the DevSecOps professional who invests days building pipelines that ensure code is secure, reliable, and maintainable. Their work reduces incidents that never occur.

These contributions are real, valuable, and largely invisible to output-based metrics.

When Metrics Compete

The situation becomes more complex when multiple metrics are applied at once.

On one team, I held the senior DevSecOps role under a tight deadline. I was expected to mentor junior engineers while also delivering critical work. At one point, I asked management a simple question:

"I can teach them or I can do the work. I do not have the ability to do both at the same time. What is the priority?"

The response was, "They are both top priorities."

That answer created a system with no stable solution.

When expectations are misaligned and metrics pull in different directions, people are forced into trade-offs. Those trade-offs are rarely visible in the metrics themselves; instead, they show up in stress, in reduced quality, and in decisions made under pressure.

AI Amplifies the Problem

Recent studies highlight an emerging pattern.

A GitClear report (January 2026) found that developers who regularly used AI exhibited significantly higher code churn.

A Faros AI report (March 2026) showed similar results, with substantial increases in lines of code added and removed.

A Jellyfish analysis (Q1 2026) found that throughput increased, but at a dramatically higher token cost.

Taken together, these results suggest a consistent pattern:

AI increases the rate at which code is produced and modified, but it also increases the rate at which that code is reinterpreted.

That distinction matters.

AI systems operate by predicting plausible continuations. When asked to modify or improve code, they rely on patterns present in the input. They do not operate from an understanding of the original intent unless that intent has been made explicit.

Each interaction becomes an opportunity for reinterpretation; without a clear representation of intent, those reinterpretations can drift.

The Drift Problem

Drift appears as a sequence of reasonable changes, not as obvious failure:

  • a function is simplified
  • a validation step is consolidated
  • a retry loop is optimized
  • a data structure is normalized

Each change may be locally defensible, may pass review, and may even improve readability or consistency.

Over time, however, the system can move away from the decisions that shaped it. This is the same phenomenon observed in repeated transformations elsewhere:

Local plausibility does not guarantee global fidelity.

Metrics that reward output tend to reinforce this pattern; they encourage changes that are visible and measurable, but they do not distinguish between changes that preserve intent and changes that gradually erode it.

AI accelerates the cycle.

Documentation as Intent Preservation

This is where documentation becomes essential. It is a mechanism for preserving why the code exists in its current form, not just what the code does.

That includes:

  • constraints that must not be violated
  • trade-offs that were intentionally accepted
  • edge cases that influenced the design
  • failure modes that must be handled carefully

When that information is absent, both humans and AI systems are left to infer intent from structure alone. That inference is often reasonable. It is not reliable.

When documentation captures intent clearly, it changes the interaction.

A human reviewer can compare implementation to stated purpose. An AI system can operate within defined boundaries rather than default assumptions. Future changes must contend with explicit reasoning rather than implicit patterns.
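To make that concrete, here is a minimal sketch of a retry loop whose documentation records a constraint and the reason behind it. Everything in it is hypothetical: the function name `retry_with_delay`, the `RETRY_DELAY` variable, and the rate-limiting scenario are assumptions made for illustration, not taken from a real project.

```shell
#!/usr/bin/env bash

## @fn retry_with_delay()
## @brief run a command until it succeeds or the attempts run out
## @details HYPOTHETICAL EXAMPLE.  The pause between attempts is deliberate:
## we assume the upstream service rate-limits clients, so retrying
## immediately can turn one transient failure into a lockout.
##
## Constraint: do not remove or shorten the pause to "optimize" the loop.
## Trade-off: we accept slower failure in exchange for not hammering the
## upstream service.
##
## The pause is $RETRY_DELAY seconds; if that's not configured, the
## default is 2.
## @param max_attempts the maximum number of times to run the command
## @param command the command (and its arguments) to run
## @retval 0 (True) if the command succeeded on some attempt
## @retval 1 (False) if every attempt failed
retry_with_delay() {
  local max_attempts="$1"
  shift
  local attempt=1

  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Pause only when another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "${RETRY_DELAY:-2}"
    fi
    attempt=$((attempt + 1))
  done

  return 1
}
```

A call such as `retry_with_delay 3 some_command` would behave the same with the `sleep` removed, as far as most tests can tell. The doc block is what lets a reviewer, human or AI, see that removing the pause violates a stated constraint and is not a harmless simplification.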

Example Bash Function Documentation

Here's an example function documentation block from a project of mine:

## @fn verify_requirements()
## @brief verify that the local files we'll need are present and accessible
## @details This will verify that the certificate and private portion of
## the key are available and readable.  It doesn't help if we can connect and
## upload one file if the other doesn't exist or isn't readable.  So, to make
## sure we have everything we need, we make sure stuff's there.  If something
## isn't there or isn't readable, we want to report that back as soon as
## possible.
##
## The certificate is specified using the $CERTIFICATE variable; if that's
## not configured, the default is $DOMAIN.pem
##
## The private key is specified using the $KEY variable; if that's not
## configured, the default is $DOMAIN.key
##
## The extra step of verifying if the files are readable is because they are
## often stored in a location that only root can access.
##
## We return a result code of 0 (True) if everything's good to go; otherwise,
## we return a non-zero code indicating what's wrong.
## @retval 0 (True) if everything exists and is readable
## @retval 1 (False) if the certificate is missing
## @retval 2 (False) if the certificate is unreadable
## @retval 3 (False) if the private key is missing
## @retval 4 (False) if the private key is unreadable
## @par Examples
## @code
## verify_requirements || exit 1
## @endcode

For more context, take a look at the source code. There's also a sample ADR in my template repository.
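To show how an implementation can be checked against a documentation block like that one, here is a minimal sketch of a function that satisfies the documented return codes. It is not the project's actual implementation. It also assumes the private key default follows the same pattern as the certificate default, that is, `$DOMAIN.key`.

```shell
#!/usr/bin/env bash

# Sketch only: written from the documented contract, not copied from the
# project's source.
verify_requirements() {
  # Documented default for the certificate: $DOMAIN.pem
  local certificate="${CERTIFICATE:-${DOMAIN}.pem}"
  # ASSUMPTION: the key default mirrors the certificate default ($DOMAIN.key)
  local key="${KEY:-${DOMAIN}.key}"

  # Check existence before readability so each failure maps to exactly one
  # documented return code.
  [ -e "$certificate" ] || return 1
  [ -r "$certificate" ] || return 2
  [ -e "$key" ] || return 3
  [ -r "$key" ] || return 4

  return 0
}
```

Because the doc block lists one return code per failure, a reviewer can walk the four checks and confirm that each maps to the stated `@retval`. If a later change merged the existence and readability checks into a single test, the mismatch with the documentation would be visible right away.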

From Metrics to Meaning

Documentation does something that output metrics cannot do: it captures meaning.

Metrics can tell you how much code changed, but they can't tell you whether those changes aligned with the original goals of the system or whether an important constraint was preserved or quietly removed.

Documentation, when done well, provides that missing dimension. It allows teams to evaluate work not only in terms of quantity, but in terms of alignment with intent.

Knowledge, Discernment, and Action

At this point, the boundary becomes clear. Metrics provide signals, but they do not provide understanding.

Documentation preserves intent, but it does not enforce correctness.

AI can generate, refactor, and optimize, but it does not assume responsibility.

That responsibility remains with people.

Knowledge includes metrics, documentation, and generated artifacts.

Discernment evaluates whether those artifacts remain aligned with the intended design.

Action determines what happens next: accept, revise, reject, or stop.
