Should You Track Which Engineer Introduces the Most Bugs?

June 10, 2026 · 11 min read · by Amar Tripathi

Short answer: probably not. Per-person bug counts are one of the most tempting developer performance metrics and one of the least useful. The moment you put "bugs introduced by engineer" on a dashboard, two things happen. The number stops measuring what you think it measures, and your best engineers start optimizing for the metric instead of the codebase.

I have run engineering teams for over a decade, and I have watched this exact metric get proposed in roughly every other planning cycle. Usually by someone smart, usually with good intentions, usually right after a painful production incident. The instinct makes sense. Bugs cost money. Bugs have authors. Therefore, count bugs per author and fix the people at the top of the list.

The logic fails at almost every step. Here is why, and here is what to track instead.

Why is per-person bug tracking so tempting?

Because it feels like accountability, and accountability is genuinely scarce in software organizations.

When a checkout flow breaks at 2 a.m., somebody wrote the line that broke it. Git knows who. Your incident tracker knows the cost. Joining those two tables takes an afternoon and produces a chart that looks like insight. If you are a director getting pressure about quality, that chart is hard to resist.

It also borrows credibility from places where individual metrics work. Sales has quota attainment. Support has tickets resolved. Surely engineering can have defects introduced?

The difference is that a closed sales deal is an independent event with a clear owner. A bug is a property of a system. It emerges from the code that was there before, the tests that were not there, the review that missed it, the deadline that compressed it, and the requirements that hid it. The commit author is just the person who happened to be standing closest when all of that converged.

What does git blame actually measure?

This is where most git blame analytics quietly fall apart. People treat blame output as "who caused this code to behave this way." What it actually records is "whose commit last touched this line of text." Those are very different claims.

Consider how authorship gets scrambled in a normal, healthy codebase:

Refactors reassign blame wholesale. An engineer renames a module, extracts a function, or moves code between files. Git now shows them as the author of hundreds of lines containing logic they never wrote and may not fully understand. If one of those lines harbors a five-year-old bug, your dashboard charges it to the refactorer. You have just created an incentive to never clean anything up.

Formatting commits do the same thing at scale. Run Prettier or Black across a repository and one person now "owns" half the codebase. Most teams know to use --ignore-revs-file for this, but most homegrown bug-attribution dashboards do not.

Pairing and mobbing have one committer. Two engineers design a solution together, one types it. The typist eats the bug count. Co-authored-by trailers help, but only when people remember to add them, and they remember intermittently.

AI-generated code breaks the model completely. In 2026 a large share of committed lines are drafted by a coding assistant and accepted by a human. The human is the author of record, but the causal story is "the model proposed it, the reviewer approved it, the tests allowed it." Charging that defect to one person's quality score is not measurement. It is bookkeeping fiction.

The riskiest work attracts the most bugs. Your strongest engineer takes the gnarly migration, the payment integration, the concurrency fix. Your most cautious engineer updates copy and bumps dependencies. Guess whose defect count looks worse. A naive bugs-per-engineer chart systematically punishes the people doing the hardest work.

Git blame is a fine forensic tool for answering "who should I ask about this line?" As an input to developer performance metrics, it measures proximity, not culpability.
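
Two of the distortions above, formatting commits and single-committer pairing, can be partly corrected in tooling. Here is a minimal Python sketch of both corrections. The function names are mine, and the file name .git-blame-ignore-revs is a common convention that I am assuming, not something git requires. Refactors, AI-drafted code, and risk-weighted assignments are untouched by either fix.

```python
import re

# Matches trailers such as "Co-authored-by: Name <email>", case-insensitively.
_CO_AUTHOR = re.compile(
    r"^co-authored-by:\s*(.+?)\s*<([^>]+)>\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def blame_command(path, ignore_revs_file=".git-blame-ignore-revs"):
    """Build a git blame call that skips commits listed in ignore_revs_file.

    The default file name is an assumed convention; pass your own if it differs.
    """
    return ["git", "blame", "--ignore-revs-file", ignore_revs_file,
            "--line-porcelain", path]

def credited_authors(author_email, message):
    """Return the commit author plus any co-authors named in trailers."""
    emails = [author_email.lower()]
    for _name, email in _CO_AUTHOR.findall(message):
        email = email.lower()
        if email not in emails:  # keep first occurrence, drop duplicates
            emails.append(email)
    return emails

# Hypothetical commit message from a pairing session.
example_message = """Fix rounding in invoice totals

Co-authored-by: Sam Lee <sam@example.com>
Co-authored-by: Ana Ruiz <ana@example.com>
"""
print(blame_command("billing/invoice.py"))
print(credited_authors("kim@example.com", example_message))
```

Even with both corrections applied, the output still answers "who touched this" and not "who caused this."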

What happens when you put the number on a dashboard?

Goodhart's law: when a measure becomes a target, it ceases to be a good measure. With bug attribution, the decay is fast and predictable.

Engineers are professional optimizers. Give them a function and they will minimize it. Within a quarter you will see:

Reluctance to touch legacy code, because touching it means owning its failures
Bug reports quietly reclassified as "expected behavior" or "improvement requests"
Negotiations in triage about whether something is really a bug, and whether it really counts as a regression
Senior engineers declining risky-but-necessary work that juniors then absorb, badly
Blame-aware commits: splitting changes so the dangerous line lands in someone else's diff

None of this reduces defects. All of it reduces the honesty of your data, which is the one thing an engineering analytics program cannot survive.

There is a deeper cost too. Psychological safety research, including Google's Project Aristotle work, keeps landing on the same finding: teams that can discuss failure openly outperform teams that cannot. A bug leaderboard is a machine for making failure discussion unsafe. People stop saying "I think I broke this" in standup. Incidents take longer to resolve because the first instinct is self-defense, not diagnosis.

What should you measure instead?

The useful move is to change the unit of analysis. Stop asking "which person generates bugs?" and start asking "which files, patterns, and processes generate bugs?" Pattern-level engineering analytics give you almost everything the person-level version promised, without the gaming and the fear.

Four families of metrics earn their keep:

Hot file analysis. A small number of files generate a wildly disproportionate share of defects. This is one of the most replicated findings in empirical software research, popularized by Adam Tornhill's Your Code as a Crime Scene and supported by Microsoft's studies of change-prone modules. When the same file shows up in finding after finding, the file is the problem: too much responsibility, too little test coverage, too many people editing it under pressure. That is a refactoring candidate with a business case attached. Tools like Diffwise surface this automatically by tracking which files accumulate the most review findings across every PR, so the "we should really rewrite the order service" conversation comes with data instead of vibes.
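
To make the idea concrete, here is a rough sketch of a hot file report, assuming findings are available as simple records with a "file" field. The record shape, file names, and function name are hypothetical, and this is not a description of how Diffwise or any other tool computes it.

```python
from collections import Counter

def hot_files(findings, top_n=3):
    """Rank files by finding count and report each file's share of all findings."""
    counts = Counter(f["file"] for f in findings)
    total = sum(counts.values())
    return [(path, n, n / total) for path, n in counts.most_common(top_n)]

# Hypothetical review findings; note that no author field is needed.
findings = (
    [{"file": "orders/service.py"}] * 4
    + [{"file": "billing/invoice.py"}] * 2
    + [{"file": "ui/banner.py"}, {"file": "auth/session.py"}]
)

for path, n, share in hot_files(findings, top_n=2):
    print(f"{path}: {n} findings ({share:.0%} of total)")
```

The unit of analysis is the file, so the report works without ever reading an author name.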

Recurring anti-pattern categories. If missing error handling shows up once, it is a mistake. If it shows up in 60 percent of your repositories, it is a gap in your standards, your onboarding, or your linting. Cross-repo anti-pattern tracking turns a pile of individual review comments into an organizational signal: this category keeps recurring, fix it once at the platform level. Diffwise flags categories that appear in more than half your repos as org-wide issues for exactly this reason. The fix for an org-wide pattern is a lint rule or a shared library, not a performance conversation.
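
A sketch of the "more than half your repos" rule, assuming each finding carries a "repo" and a "category" field. The field names, category labels, and the 0.5 default are illustrative assumptions; I am not claiming this is Diffwise's implementation.

```python
from collections import defaultdict

def org_wide_categories(findings, repo_count, threshold=0.5):
    """Return categories seen in more than `threshold` of all repositories."""
    repos_by_category = defaultdict(set)
    for f in findings:
        repos_by_category[f["category"]].add(f["repo"])
    return sorted(
        category
        for category, repos in repos_by_category.items()
        if len(repos) / repo_count > threshold
    )

# Hypothetical findings across four repositories.
cross_repo_findings = [
    {"repo": "api", "category": "missing-error-handling"},
    {"repo": "web", "category": "missing-error-handling"},
    {"repo": "jobs", "category": "missing-error-handling"},
    {"repo": "api", "category": "sql-injection"},
    {"repo": "web", "category": "sql-injection"},
    {"repo": "cli", "category": "n-plus-one-query"},
]

print(org_wide_categories(cross_repo_findings, repo_count=4))
```

Counting distinct repositories, not raw findings, keeps one noisy repo from looking like an organizational problem.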

Resolution velocity. How long does a critical finding stay open? Is mean time to fix trending up or down? This measures the system's responsiveness to known problems, which is a far better quality signal than who created the problem. A team that introduces bugs and fixes them in hours is healthier than a team that introduces fewer bugs and ships known issues for weeks.
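
One simple way to compute this, sketched under the assumption that each finding records a severity, an opened timestamp, and a fixed timestamp (None while still open). All field names and data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def mean_hours_to_fix(findings):
    """Mean hours from opened to fixed, per severity. Open findings are skipped."""
    hours = defaultdict(list)
    for f in findings:
        if f["fixed_at"] is None:
            continue  # still open: no fix time to measure yet
        delta = f["fixed_at"] - f["opened_at"]
        hours[f["severity"]].append(delta.total_seconds() / 3600)
    return {severity: mean(values) for severity, values in hours.items()}

# Hypothetical findings.
tracked = [
    {"severity": "critical", "opened_at": datetime(2026, 6, 1, 9), "fixed_at": datetime(2026, 6, 1, 13)},
    {"severity": "critical", "opened_at": datetime(2026, 6, 2, 10), "fixed_at": datetime(2026, 6, 2, 12)},
    {"severity": "critical", "opened_at": datetime(2026, 6, 3, 9), "fixed_at": None},
    {"severity": "low", "opened_at": datetime(2026, 6, 1, 9), "fixed_at": datetime(2026, 6, 3, 9)},
]

print(mean_hours_to_fix(tracked))
```

Skipping open findings flatters the number, so a real report should show the age of still-open findings next to it.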

Review turnaround. Time from PR opened to first meaningful review. Slow reviews cause big batches, big batches cause risky merges, risky merges cause defects. This is a process metric with a clear lever attached, and nobody has to be ranked to improve it.
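
A minimal sketch of the measurement, assuming "first meaningful review" means the first review submitted by someone other than the PR author. That definition, the record shape, and the names are my assumptions; your team may count only approvals or substantive comments.

```python
from datetime import datetime
from statistics import median

def first_review_hours(pr):
    """Hours from PR opened to the first review by a non-author, or None."""
    times = [at for reviewer, at in pr["reviews"] if reviewer != pr["author"]]
    if not times:
        return None
    return (min(times) - pr["opened_at"]).total_seconds() / 3600

def median_turnaround(prs):
    """Team-level median turnaround in hours; unreviewed PRs are excluded."""
    hours = [h for h in map(first_review_hours, prs) if h is not None]
    return median(hours) if hours else None

# Hypothetical pull requests; "a", "b", "c" stand in for account names.
prs = [
    {"author": "a", "opened_at": datetime(2026, 6, 1, 9),
     "reviews": [("a", datetime(2026, 6, 1, 9, 30)), ("b", datetime(2026, 6, 1, 12))]},
    {"author": "b", "opened_at": datetime(2026, 6, 1, 10),
     "reviews": [("c", datetime(2026, 6, 1, 11))]},
    {"author": "c", "opened_at": datetime(2026, 6, 1, 8), "reviews": []},
]

print(median_turnaround(prs))
```

The author field is used only to discard self-comments. The result is a single team-level number.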

Notice what these have in common. They all point at something you can fix without firing anyone: a file, a pattern, a queue. Person-level metrics point at someone you can blame. Pattern-level code quality metrics point at work you can do.

Person-level vs pattern-level metrics

Metric	What it claims to measure	What it actually measures	Failure mode	Better alternative
Bugs per engineer	Individual code quality	Who touched risky code last	Gaming, fear, avoidance of hard work	Hot file analysis
Lines of code per dev	Output	Verbosity and churn	Padding, copy-paste, resistance to deletion	PR throughput at team level
Commits per day	Activity	Commit habits	Micro-commits, busywork	Cycle time per change
Git blame on incidents	Root cause	Last editor of the failing line	Refactor and formatting misattribution	Anti-pattern category tracking
Individual fix rate	Diligence	Task assignment luck	Cherry-picking easy fixes	Resolution velocity by severity
Bug count rankings	Accountability	Risk appetite of assigned work	Best engineers avoid critical systems	Review turnaround and finding trends

Is individual-level data ever appropriate?

Yes, in narrow contexts, and the context is the whole game.

Private coaching. If you are managing someone and you notice their PRs repeatedly trip the same category of finding, say, SQL injection patterns or unhandled promise rejections, that is genuinely useful for a one-on-one. "I noticed the last three reviews flagged input validation, want to pair on it?" is good management. The same data on a team dashboard is a public shaming instrument. Same number, opposite effect.

Self-reflection. Engineers looking at their own patterns is healthy. Plenty of strong developers keep a private list of their recurring mistakes. Opt-in, self-directed, invisible to others.

Bus factor analysis. Knowing that one person authored 90 percent of the billing service is not a performance question, it is a risk question. Low bus factor means knowledge concentration, and the response is documentation and pairing, not praise or blame. This is the one place ownership data belongs in planning conversations, because the concern is "what happens if they win the lottery," not "are they good."
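
As a rough illustration of the risk framing, here is a sketch that computes how concentrated authorship of a component is, assuming you already have one author name per line (for example, parsed from blame output). The names and the 80 percent threshold are hypothetical.

```python
from collections import Counter

def ownership_concentration(line_authors):
    """Return (top_author, share_of_lines), or None for an empty input."""
    if not line_authors:
        return None
    counts = Counter(line_authors)
    author, lines = counts.most_common(1)[0]
    return author, lines / len(line_authors)

# Hypothetical per-line authorship for a billing service.
billing_lines = ["dana"] * 9 + ["lee"]

author, share = ownership_concentration(billing_lines)
if share >= 0.8:  # assumed threshold for "knowledge concentration"
    print(f"Concentration risk: {share:.0%} of lines trace to one person")
```

The printed message deliberately omits the name. The output is a prompt to schedule documentation and pairing, and the name matters only to whoever arranges that.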

The rule I give new managers: individual data flows toward the individual and their direct manager, in private, framed as growth. It never flows into rankings, dashboards, calibration packets, or promotion math. The minute it does, Goodhart shows up and the data rots.

Where do LinearB, Swarmia, Jellyfish, and DORA fit?

The serious engineering analytics vendors mostly agree with this framing, which is worth saying plainly because people assume they sell stack-ranking tools.

DORA metrics (deployment frequency, lead time for changes, change failure rate, time to restore) are explicitly team-level and system-level. The DORA research program has been clear for years that using these to compare individuals is a misuse. LinearB and Swarmia both publish guidance against individual leaderboards and orient their products around team workflow bottlenecks. Swarmia's founders have written specifically about why developer stack-ranking fails. Jellyfish focuses on aligning engineering work with business priorities, which means tracking where effort is allocated, not policing individual output. If you adopt any of these, read their own docs on metric misuse. The vendors are often more careful than their customers.

Where Diffwise sits in this landscape: it is a code review layer rather than a delivery analytics platform. Forty-plus specialist agents review every PR, findings get tracked as Fixed, Still Open, or New across pushes, and the analytics that come out the other end are deliberately pattern-shaped: hot files, recurring anti-pattern categories across repos, resolution velocity by severity, repo health. There is no bugs-per-engineer view, and that is a design decision, not a missing feature. The free tier covers 50 reviews a month on 3 repos, and Pro is a flat $19 per month (or $9 if you bring your own API key), so a team can test whether pattern-level signals change their conversations before committing to anything.

How do you roll this out without creating fear?

A few things that have worked for me:

Announce the unit of analysis up front. Tell the team explicitly: we track files, patterns, and process times. We do not track individuals. Then keep the promise, including in the messy week after a bad incident when someone senior asks "but who wrote it?"

Use blameless incident reviews as the cultural anchor. If incident retros stay blameless, the analytics program inherits that trust. If retros turn into trials, no dashboard design will save you.

Act on the patterns visibly. The fastest way to build trust in engineering analytics is to fix something it found. Refactor the hottest file. Add the lint rule for the recurring category. When the team sees metrics turn into roadmap items instead of performance reviews, they stop being defensive about the data.

Keep individual signals in one-on-ones. If your tooling can show that one engineer keeps hitting the same finding category, treat it like a code review comment scaled up: specific, private, paired with an offer to help.

FAQ

Can git blame tell me who caused a production bug?

It can tell you who last edited the failing line, which is a starting point for investigation, not a verdict. Refactors, formatting commits, pair programming, and AI-assisted code all separate the commit author from the causal author. Use blame to find someone to ask, not someone to charge.

What is the single best metric for code quality?

There is no single one, but if forced to pick: resolution velocity on high-severity findings. It captures whether your system detects problems and how fast it responds to them, and it is hard to game without actually fixing things.

Are DORA metrics individual performance metrics?

No. All four DORA metrics describe a team's delivery system. The DORA researchers themselves warn against using them to rank individuals, and doing so produces the same Goodhart effects as bug counts.

How do I measure developer productivity without surveilling developers?

Measure the system around them: cycle time, review turnaround, deployment frequency, finding resolution rates, and hot files. Pair that with qualitative signals from one-on-ones and developer experience surveys. If a metric requires naming a person to be useful, treat it as a coaching input, not a dashboard.

Does Diffwise rank engineers by bugs introduced?

No. Diffwise tracks findings at the file, pattern, and repository level: hot files, cross-repo anti-pattern categories, and resolution velocity by severity. The intent is to show where the codebase generates problems, not who to blame for them.