Code Reviews Do Find Bugs

kqr

, published 2024-08-06

Tags:

There’s some 2015 research out of Microsoft titled Code Reviews Do Not Find Bugs1¹ Code Reviews Do Not Find Bugs; How the Current Code Review Best Practice Slows Us Down; Czerwonka, Greiler, Tilford; IEEE International Conference on Software Engineering; 2015. which seems strangely named because reviewers do find bugs.

Here’s what the authors say:

Contrary to the often stated primary goal of code reviews, they often do not find functionality defects that should block a code submission. Only about 15 % of comments provided by reviewers indicate a possible defect, much less a blocking defect.

This is a misleading statistic, because it says nothing about the defect detection rate. Only about 15 % of smokers get lung cancer, but that does not mean we can ignore smoking as a cause – smoking is the cause of more than 80 % of lung cancer cases. The fact that 15 % of code review comments are about defects is not a statement about a lack of detected defects, but rather a statement about how reviewers also write a lot of comments about other things – and we will see more about that later.

This does just not support the idea that reviewers do not find defects. For example, if the reviews in their data had an average of four comments each, then more than half of the reviews did get comments about defects!

But we don’t need to speculate on how many comments there were in their data, because there has been previous research on this. We have known for a while that code review (as well as pair programming) finds an additional 60 % of defects for only a 15 % increase in time investment.2² Balancing Agility and Discipline; Boehm, Turner, Booch, Cockburn, Pyster; Addison–Wesley Professional; 2003

At this point, we also have more detailed data on the effectiveness of code review. During the first 60 minutes of code review of the day, the reviewer finds roughly one defect per ten minutes of reviewing – as long as they review less than about 50 lines of code per ten minutes.3³ Making Software: What Really Works, and Why We Believe It; Oram & Wilson; O’Reilly Media; 2010. In other words, reviews are most effective on small chunks at a time, and mustn’t expand to cover large fractions of the workday.

What other activity can you imagine where a single developer can uncover a defect every 10 minutes? Certainly not through manual testing – not even all forms of automated testing beat this rate, given that the total cost of automated testing of a component quickly exceeds 10 minutes of time investment. Code review is ridiculously effective. Of all the quality measures we might have in place, code review is the last one I’d get rid of.4⁴ But if we produce more than about 300 lines of code per developer and day, I would be tempted to let the excess lines of code go unreviewed, and focus reviewing on the most important 300 lines per developer and day.

The Microsoft authors also report on how code review is remarkably effective for learning the codebase.

The usefulness of code review comments – as judged by the author of a code change – is positively correlated with reviewers’ experience. Without prior exposure to the part of code base being reviewed, on average only 33 % of any reviewer’s comments are deemed useful by the author of a change. However, reviewers typically learn very fast. When reviewing the same part of code base for the third time, the usefulness ratio increases to about 67 % of their comments. By the fourth time, it is equivalent to the project’s long-term average.

Granted, they don’t say enough to determine whether “the project’s long-term average” is at a comparatively low or high familiarity level, but it sounds to me like only four code reviews are enough to get people relatively familiar with a completely new part of the codebase. Assuming a reasonably-sized change (50–100 lines of code?) and a reasonable pace of reviewing for a beginner (10–50 lines of code per 10 minutes?) this would mean about 1–2 hours of code review is enough to get familiar with a codebase.

That’s a cheat code. Exploration and exploitation are always in conflict; for software engineering, this means knowledge sharing and speed are in conflict. If we put someone experienced with a component on a task, they will finish quickly but we won’t spread knowledge about that component. If we put someone inexperienced on the task, the task is likely to take much longer but we do get knowledge sharing. Or! We ask the inexperienced person to review the code of the experienced person. This takes virtually no time (increases development time by 15 %) and they become familiar with the component as a side effect. We get both knowledge sharing and speed.5⁵ Not to mention that sometimes we want to share knowledge about a component which does not need changes made to it. We can then have people review past changes and get feedback on their reviews from an experienced member of the team.

The main thesis of the Microsoft paper seems to be that code review is not worth the time spent on it. We have already seen that when the review load is properly managed, code review is highly effective both in terms of finding defects and learning the codebase.

The Microsoft paper goes on to say other things that appear to contradict their thesis, like

Feedback on long-term code maintainability covers 50 % of code review comments. Sure, but isn’t this a good thing? Long-term code maintainability is in my experience a large problem in organisations that don’t do peer review.
The review is an opportunity for both author and reviewer to prove themselves. This seems like a potential motivator, and not a problem.
Code review usefulness is negatively correlated with change size. This is a feature, not a bug. It is an incentive to keep changes small and atomic.
Developers spend six hours per week reviewing. This is a bit too much (remembering the “first hour of the day” rule from before) but it should only result in a marginal decrease in productivity – code review ought to still appear to be a clear win.

I suspect the real problem with code review in the authors’ experience is one of the last points they raise:

The median review turnaround time is 24 hours. The research on code review has been mostly done in lab settings and not in integrated on-the-job situations, so it has not emphasised turnaround time. But having half of the reviews take more than 24 hours seems – in my experience – like way too much. When I did the maths on this at a previous job, the median turnaround time was 6 hours, and that was borderline painful (which was why I ran the numbers in the first place.)

In the end, I agree with the authors that we should not blindly copy best practises (i.e. cargo cult practices) but the reported experience in this case does not appear to follow best practices, so before judging best practices we should first make sure we have followed them.

If they are impractical to follow in a large organisation, that would be an important observation worth publishing, rather than claiming they don’t work when followed.