Velocity Is Not a Productivity Metric and Never Was

The most damaging measurement in modern software delivery was never designed for the job it's now being asked to do.

There is a metric on a dashboard somewhere in your organization that is quietly destroying your team. It has the patina of rigor: it's a number, it goes up and down, you can chart it over time, you can compare it across teams. Leadership likes it because it feels like a measure of output, and output is the thing leadership is supposed to manage. Engineers tolerate it because pushing back tends to look like making excuses. The metric is velocity, and the way most organizations use it is not just unhelpful. It actively produces the dysfunction it claims to measure.

This isn't a controversial position among people who study software measurement. It's not even a controversial position among the people who popularized the practice. Mike Cohn, Ron Jeffries, and other prominent early voices on the mechanics of agile estimation have said, repeatedly and on the record, that velocity is not a productivity metric and was never intended to be used as one. And yet here we are, twenty years later, with quarterly business reviews where engineering leaders justify headcount based on velocity charts. Something has gone badly wrong, and like everything else in this series, it's a structural problem, not a knowledge problem.

This post is the autopsy. What velocity was supposed to be, what it became, why it can't do the job it's been assigned, and what to measure instead if you actually want to know whether your engineering organization is effective.

What velocity was actually for

The original purpose of story points and velocity was modest, local, and entirely about helping a single team make near-term commitments to itself. Here's the mechanic, stripped of mythology.

A team is about to plan its next iteration. The team needs to decide how much work to commit to. Estimating each piece of work in hours is a fool's errand — software estimation in absolute time units is famously unreliable, because the time a piece of work takes depends on context the estimator doesn't have when they're estimating. So instead, the team estimates work in relative units. This piece of work is about the same size as that other piece. This one is roughly twice as big. This one is so unclear we shouldn't estimate it at all until we know more. The unit of relative size is the story point. It's a unit invented specifically to avoid having to estimate in absolute time.

Over a few iterations, the team observes how many story points it actually completes per cycle. That number is the velocity. The team can then use it to plan the next iteration: if we usually get 30 points done per sprint, and the candidate work for this sprint adds up to 65 points, we should not commit to all of it.
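The planning arithmetic is simple enough to sketch in a few lines of Python. This is an illustration only: the sprint history, story names, and point values are invented, the three-sprint averaging window is an assumption, and real teams plan by conversation rather than by script.

```python
from statistics import mean

def observed_velocity(completed_points, window=3):
    """Average points completed over the most recent `window` sprints."""
    return mean(completed_points[-window:])

def plan_sprint(backlog, capacity):
    """Take stories in priority order until the next one would exceed capacity."""
    committed, total = [], 0
    for name, points in backlog:
        if total + points > capacity:
            break
        committed.append(name)
        total += points
    return committed, total

# Hypothetical history: points actually completed in the last three sprints.
history = [28, 31, 31]

# Hypothetical candidate work in priority order; it adds up to 65 points.
backlog = [
    ("checkout-retry", 13),
    ("audit-log", 8),
    ("search-filters", 8),
    ("sso-login", 21),
    ("export-csv", 15),
]

velocity = observed_velocity(history)
committed, total = plan_sprint(backlog, velocity)
print(f"velocity={velocity}, committed={committed}, points={total}")
```

The output of the calculation is a commitment, not a score. Nothing in it says whether 30 is a good number or a bad one.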

That's the whole purpose. Velocity is a self-calibration tool for a single team's near-term commitment-making. It exists because absolute estimation is unreliable, and relative estimation plus historical observation gives a team a more honest picture of its own near-term capacity than any other available technique.

Several things are notable about this purpose, and worth saying explicitly because they are exactly the things modern usage violates.

It's local to one team. The story points one team uses are calibrated against that team's particular work, particular composition, and particular context. They are not units of work in any objective sense. Two teams' story points are no more comparable than two people's "out of ten" pain ratings.

It's about commitment, not performance. Velocity helps a team avoid over-committing. It does not measure how productive the team is. A team that does fewer points in a sprint because the work was harder, or because someone was on leave, or because the team chose to invest in a refactor, has not become less productive. It has become more honest about its capacity.

It's a private number. The original literature is explicit: velocity should not be visible to anyone outside the team. The moment it becomes visible to people who can use it to compare, evaluate, or pressure the team, the team starts gaming it. That happens not because engineers are dishonest, but because Goodhart's Law applies: when a measure becomes a target, it ceases to be a good measure.

It's not a measure of value. Story points estimate effort, not impact. A team can spend a sprint doing high-velocity work that delivers nothing of value, or low-velocity work that ships the most important thing the company has shipped all year. Velocity has no opinion about which is which. It can't, by construction.

That's the original practice. Modest, local, internal, effort-not-value. Now look at how velocity is used in your organization.

What it became

The conversion is depressingly familiar at this point in the series, because it's the same conversion that ate standups, retrospectives, and every other agile practice. It happens in roughly four stages.

Stage one: leadership asks for visibility. The metric is local and private, but a director somewhere wants to know "how the teams are doing." Engineering managers, in good faith, surface velocity. It's the number they have. It feels measurable. It can be put on a slide.

Stage two: comparison creeps in. Once velocity is visible above the team level, comparison is inevitable. Team A does 40 points per sprint. Team B does 25. Why is Team B slower? Are they less productive? Should we look at whether they have the right people? Team B's manager, watching this conversation happen, learns to inflate estimates. Team A's manager learns the same lesson by induction. Within two quarters, both teams' velocities are calibrated to whatever number sounds defensible at QBR.

Stage three: velocity becomes a target. Leadership begins setting velocity expectations. "We need to see Team B at 40 points." "Team C's velocity has been flat — what's the plan to increase it?" The metric, which was supposed to be a description of what a team was capable of, becomes a number the team is responsible for hitting. Engineers learn that the safe move is to refuse small stories (low points, less velocity), inflate medium stories (more points, higher velocity), and avoid any work that doesn't have a clear point value (refactoring, technical debt, learning) because it doesn't show up on the chart.

Stage four: the metric becomes the work. Sprint planning is now structured around hitting the velocity number, not delivering the right outcome. Stories get sliced thinner so that "completion" can happen more often. Carryover is treated as failure, so teams commit only to work they are sure will fit, which keeps the number looking healthy because the easy work gets done while the hard work never gets started. The dashboard looks better than it ever has. The product is shipping less of what matters than it ever has.

By stage four, velocity has fully inverted from its original purpose. It started as a tool to help teams be honest with themselves about their capacity. It ended as a tool that forces teams to lie to leadership about everything else.

Why velocity can't do the job it's been assigned

Set aside the historical autopsy for a moment and consider the question on its merits. Could velocity, even in principle, function as a productivity metric?

The answer is no, and the reasons are structural, not ideological. There are at least four of them.

Reason one: the unit is incoherent across teams. A story point is a relative measure within one team's calibration. There is no conversion factor between teams. Team A's 5-point story might be Team B's 13-point story even if both teams are doing identical work, because the teams have calibrated to different baselines, and that calibration is not even particularly stable within a team over time. Comparing velocities across teams is not just unreliable — it is a category error, like comparing one country's GDP in dollars to another's in calories. The numbers can be put next to each other, but the comparison doesn't mean what people think it means.
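A toy model shows why the comparison breaks. In this Python sketch, two teams complete the same five stories, but each team sizes work relative to its own reference story. The "true effort" numbers, the two baselines, and the Fibonacci-style scale are all assumptions made for illustration; real teams never see a true-effort column, which is the whole point.

```python
# Hypothetical effort of five identical stories, in units no team can observe directly.
work = [2, 6, 6, 10, 16]

# A common point scale; the exact scale a team uses is an assumption here.
SCALE = [1, 2, 3, 5, 8, 13, 21]

def estimate(effort, baseline):
    """Size a story relative to the team's own 1-point reference, snapped to the scale."""
    ratio = effort / baseline
    return min(SCALE, key=lambda p: abs(p - ratio))

# Team A's reference "1-point" story is twice the size of Team B's.
team_a_points = [estimate(e, baseline=2) for e in work]
team_b_points = [estimate(e, baseline=1) for e in work]

print("Team A:", team_a_points, "velocity", sum(team_a_points))
print("Team B:", team_b_points, "velocity", sum(team_b_points))
```

Both teams did exactly the same work. The gap between their velocities comes entirely from where each team happened to set its baseline.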

Reason two: it measures effort, not value. Even within a single team, a sprint where 40 points were completed is not necessarily a more productive sprint than one where 20 points were completed. The 20-point sprint might have shipped the feature that drove the quarter's revenue. The 40-point sprint might have shipped a redesign that no user asked for. Velocity has no way of distinguishing these cases because it was never designed to. Using it as a productivity proxy means treating "amount of work done" as equivalent to "value created," which is the same fallacy that makes "lines of code" a famously bad measure.

Reason three: it's trivially gameable, and Goodhart's Law guarantees it will be gamed. The moment velocity becomes a target, engineers can hit it without doing anything different in their actual work — just by re-calibrating their estimates upward. This is not malicious. It's the natural response of a rational person to a measurement system that punishes honesty. The gaming is invisible because the metric has no external check. Nobody can verify whether a 5-point story "should" have been 3 points, because story points were never anchored to anything verifiable in the first place.
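The re-calibration move can be made concrete with a small Python sketch. The estimates and the one-step-up inflation rule below are invented for illustration; real inflation is gradual and rarely this tidy.

```python
# Assumed point scale for the example.
SCALE = (1, 2, 3, 5, 8, 13, 21)

def inflate(points):
    """Bump an estimate one step up the scale, capped at the top value."""
    i = SCALE.index(points)
    return SCALE[min(i + 1, len(SCALE) - 1)]

# Hypothetical sprint: the same five stories, estimated honestly and then defensively.
honest_estimates = [3, 5, 5, 8, 3]
inflated_estimates = [inflate(p) for p in honest_estimates]

honest_velocity = sum(honest_estimates)
inflated_velocity = sum(inflated_estimates)
stories_done = len(honest_estimates)

print(f"stories completed: {stories_done} in both cases")
print(f"velocity: {honest_velocity} -> {inflated_velocity}")
```

The chart shows a large jump in velocity. The number of stories completed, and the work inside them, did not change at all.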

Reason four: it actively distorts the work. A team optimizing for velocity will systematically avoid work that doesn't fit the system. Refactoring doesn't produce points. Helping another team doesn't produce points. Talking to a customer doesn't produce points. Killing a bad feature doesn't produce points. Pausing to think doesn't produce points. All of these activities are part of healthy engineering work. A velocity-targeted team learns to skip them — not because the team is bad, but because the system is selecting against them. The metric drives behavior, and the behavior is worse than what existed before the metric was tracked.

Combine these four properties and you have a metric that is incomparable across teams, indifferent to value, gameable by anyone using it, and actively hostile to several of the activities that actually make engineering organizations effective. This is not a productivity metric in any sense the word "productivity" can support. Calling it one is a category error that has cost the industry an enormous amount of misdirected effort.

"But we have to measure something"

This is the response we get every time we have this conversation with engineering leaders, and it deserves a serious answer because the underlying instinct is correct. Engineering leaders are accountable for their organizations. They need ways to know whether things are going well or badly. They need numbers they can put in front of executives who will, reasonably, ask whether the engineering investment is paying off. The instinct to measure is not the problem. The problem is measuring the wrong thing because it's the easy thing.

There is a more honest set of measurements available. They are harder to collect, harder to chart, and harder to compare across teams. That difficulty is precisely why they're more useful: ease of charting and comparison is what corrupted velocity in the first place.

The shift is from measuring output to measuring outcomes and effectiveness. A few examples of what that looks like in practice.

Instead of measuring how much work a team completed, measure how often the work the team completed produced the result it was supposed to produce. If the team shipped a feature designed to improve activation, did activation improve? If the team shipped a refactor designed to reduce incident frequency, did incidents go down? The unit here is "predicted outcome achieved," and a healthy engineering organization should be able to tell you, for any non-trivial piece of work it shipped in the last quarter, what the work was supposed to do and whether it did that.
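One way to keep this honest is to record the prediction alongside the work and tally the results afterward. The Python sketch below is a minimal illustration under assumed names: the `ShippedWork` record, its fields, and the sample entries are hypothetical, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShippedWork:
    name: str
    predicted_outcome: str
    achieved: Optional[bool]  # None means nobody checked whether it worked

def outcome_hit_rate(items):
    """Share of measured work that produced its predicted outcome, or None if nothing was measured."""
    measured = [w for w in items if w.achieved is not None]
    if not measured:
        return None
    return sum(w.achieved for w in measured) / len(measured)

def unmeasured(items):
    """Names of shipped work whose outcome was never checked."""
    return [w.name for w in items if w.achieved is None]

# Hypothetical quarter.
quarter = [
    ShippedWork("onboarding-checklist", "activation improves", True),
    ShippedWork("queue-refactor", "fewer incidents", False),
    ShippedWork("saved-searches", "repeat visits increase", True),
    ShippedWork("settings-redesign", "fewer support tickets", None),
]

rate = outcome_hit_rate(quarter)
gaps = unmeasured(quarter)
print(f"hit rate: {rate:.0%} of measured work; never measured: {gaps}")
```

The unmeasured list is at least as informative as the hit rate. Work that shipped with no check on its outcome is work nobody can learn from.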

Instead of measuring sprint-over-sprint velocity, measure cycle time from decision to evidence. How long does it take, on average, between the moment your team decides to build something and the moment you have real evidence about whether it worked? This is a measure of feedback-loop health, which — as we argued in the AI post — is the actual bottleneck for most software organizations now. Healthy teams measure this in days or weeks. Unhealthy teams cannot measure it at all because they don't close the loop.
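As a rough sketch of what tracking this could look like, the Python below takes pairs of dates (when the decision was made, when evidence arrived) and reports the median gap plus the number of loops never closed. The dates are invented, and the choice of median over mean is an assumption made so that one slow initiative doesn't dominate the figure.

```python
from datetime import date
from statistics import median

def decision_to_evidence(records):
    """Return (median days from decision to evidence, count of loops never closed)."""
    closed = [(evidence - decided).days for decided, evidence in records if evidence is not None]
    open_loops = sum(1 for _, evidence in records if evidence is None)
    return (median(closed) if closed else None), open_loops

# Hypothetical initiatives: (date decided, date real evidence arrived or None).
records = [
    (date(2024, 3, 4), date(2024, 3, 18)),
    (date(2024, 3, 11), date(2024, 4, 22)),
    (date(2024, 4, 1), None),
    (date(2024, 4, 8), date(2024, 4, 29)),
]

median_days, open_loops = decision_to_evidence(records)
print(f"median decision-to-evidence: {median_days} days; loops never closed: {open_loops}")
```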

Instead of measuring tickets closed, measure things killed. Specifically: in the last quarter, how many features, projects, or initiatives did your organization stop doing because evidence suggested they weren't working? An organization that never kills anything is not making real decisions; it's just doing whatever was on the roadmap until the roadmap runs out. The ratio of "things shipped" to "things killed" is one of the most diagnostic numbers an engineering leader can track.
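The ratio itself is trivial to compute. The one case worth handling explicitly is the organization that killed nothing. In this Python sketch the function name and the sample counts are hypothetical, and returning `None` for zero kills is a design assumption: it flags the quarter as undefined rather than reporting an impressive-looking number.

```python
def shipped_to_killed(shipped, killed):
    """Ratio of things shipped to things killed; None when nothing was killed."""
    if killed == 0:
        return None
    return shipped / killed

# Hypothetical quarters.
print(shipped_to_killed(12, 3))  # 4.0
print(shipped_to_killed(9, 0))   # None: nothing was stopped, so there is no ratio to report
```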

Instead of comparing teams' velocities, measure teams' decision latency. When a team identifies that something needs to change — a feature is failing, an architecture is wrong, a process is broken — how long does it take to actually change it? This measures organizational responsiveness, which is closer to what "agility" actually meant in the original sense. A team with high velocity and high decision latency is shipping a lot of work that nobody can course-correct. That is not productive; it's the opposite of productive.

These measurements are all harder than velocity. They require talking to customers, instrumenting outcomes, defining what success looks like before you ship, and being willing to count the things you stopped doing. They're harder for the same reason they're useful: the friction of collecting them is friction against the dysfunction that velocity hides.

What this means for engineering leaders

If you're running an engineering organization right now, the practical question is what to do on Monday. The temptation is to keep the velocity dashboard running while quietly de-emphasizing it, on the theory that you don't want to make sudden changes. This is the wrong move. The dashboard is shaping behavior whether you discuss it or not. As long as it exists, your teams are gaming it, even if nobody on the team can articulate that they're doing so.

The move that actually changes things is to publicly retire velocity as a leadership-visible metric — say it out loud, in writing, in front of the people who used to consume it — and replace it with two or three outcome-focused measurements that the organization commits to actually tracking and discussing. The specific measurements matter less than the act of replacement. The act of replacement signals that the organization is no longer optimizing for "amount of work shipped" and is now optimizing for "did the work matter."

This is harder than it sounds, because the velocity dashboard was doing something real for leadership. It was providing the feeling of measurement, the artifact of accountability. Removing it without replacing the underlying need is like canceling a meeting without addressing why it was being held. The need for visibility doesn't go away. It has to be redirected to a measurement that's actually pointing at the right thing.

This is, candidly, where most engineering leaders we work with get stuck. Replacing velocity is conceptually easy. Replacing the organizational machinery that velocity was feeding — the QBRs, the headcount justifications, the team comparisons, the executive reporting — is the actual work. It requires rebuilding the relationship between engineering and the rest of the business around a different unit of accountability. That rebuilding is what we spend most of our time on at Sierra Agility, and it's the part that almost never shows up in the framework literature, because frameworks don't engage with how organizations actually allocate trust and authority.

The reframe

Velocity is not a productivity metric. It was never designed to be one. Twenty years of treating it as one have produced a generation of engineering organizations that are simultaneously over-measured and under-informed — drowning in dashboards, with no idea whether the work they're shipping is actually working.

The way out is not a better velocity. There is no version of velocity that does the job it's been assigned, because the job was incoherent from the start. The way out is a different question. Not "how much did we do?" but "did what we did matter?" That question is harder to answer. It is also the only question worth answering. Engineering organizations that learn to answer it have a structural advantage over the ones that don't, and that advantage is widening every quarter as the build phase compresses and outcome differences become more visible.

You can keep tracking velocity if you want. Just stop pretending it's telling you something it isn't. And if you want to know whether your engineering organization is actually effective, start measuring the things that effectiveness is actually made of.

Next in the series: the SAFe critique is right. The conclusion is wrong.
