✍️ Essay

Code Review in the AI Era: How Should We Do It?

by Tony Cho · 15 min read · Read the Korean original

TL;DR

Code review was always a problem. No time, no people, no process. AI tripled code output but our review capacity stayed the same. Some say kill code review, some say redesign it, some say keep it no matter what. I lay out each camp's arguments and ask what we really need to review in the AI era.

Intro

When I worked as CTO, one of the first “official duties” to disappear from my plate was code review. Until the team grew large enough, I ran code review for the backend team myself. But once I was CTO of a 20-plus-person org, things shifted. The positional title of R&D Department Head started to outweigh the functional title of CTO, and my management moved from code to people.

That doesn’t mean I stopped doing code review entirely after that. The data engineering team was small and had no lead, so for over a year I personally ran their code reviews and study sessions. With newly hired juniors, regardless of their position, I sat down for study and review with them. It was the kind of thing senior engineers, buried in their own workload, couldn’t easily pick up. Sometimes I’d spend several weeks doing 1-on-1 reviews and feedback with a junior who was struggling to write code. Since none of this counted as “official work,” it was hard to fit any of it into my actual working hours.

Honestly, even calling code review an “official duty” is a stretch. We never really nailed down the process for how to do it, and that part is still hard today. Back then I tried to build a culture where developers rotated through reviewing each other’s code. Some people couldn’t see why they should review code that wasn’t their own. Others were enthusiastic to a fault, but reviewed so aggressively that, despite all their energy, they made the room flinch.

Review styles were all over the place. For backend (and the same went for data and frontend), I focused on the application layer: code architecture, whether the code was sufficiently human-readable, whether it broke our internal conventions. I gave feedback on better ways to write the code and offered guidelines around unnecessary duplication or design issues with real tradeoffs. But other team leads went further: they pulled the branch locally, ran it, and reviewed the final quality end to end. Most of that was on the client side. When you have time, the latter is usually the better review. But it costs so much time that as workload grows, doing it properly becomes impossible.

The problem, in the end, was time. I’ve seen plenty of companies with great review processes, but in a tech startup where survival means a deadline every single day, we’d often code in the morning and ship in the afternoon. Skilled developers who were both willing and able to review code (and kind enough to do it) were a small group. Eventually, PRs skipped review and got merged for the unavoidable reason of “schedule,” and between juniors who hadn’t built up a strong mental model of code architecture and seniors who’d handed their code quality over to the monster called scheduling, the codebase itself slowly became a monster too.

The problem, in the end, was people.


We’ve now entered the era of AI agentic engineering, or as I called it earlier, the era of not reading code, and a wave of tools has shown up to solve the problems of the previous era. It started with CodeRabbit, then Codex and Claude Code began posting PR messages and line-by-line comments. In other words, code review without human dependency became possible.

The catch is that the PRs going up are also written without human dependency.

There’s a ghost story you hear sometimes: a junior who doesn’t really know what they’re doing pushes AI-generated code, and a senior pulls an all-nighter touching every line to fix it. I can’t verify the story, but it shows up often enough. It sounds like distrust of AI, but really it’s distrust of juniors who don’t know what they’re using. A sorcerer’s apprentice who can normally cast a fireball suddenly gets an infinite mana source (AI), starts throwing hellfire around with no experience or magical knowledge, and ends up in a kind of magical possession.

Either way, this conversation keeps coming up because in any organization where individuals are personally responsible for their code, even AI-produced code carries that same personal responsibility. So plenty of teams are losing sleep over this. Individual productivity has been amplified massively by AI, but team-level or company-level output hasn’t visibly jumped in many cases. Part of the reason might be that team collaboration is, at the bottom, a question of accountability.

There’s an interesting data point. In a study of more than 10,000 developers, teams with high AI adoption merged 98% more PRs but spent 91% more time on review. The individual got faster; the team got slower. The code review bottleneck didn’t disappear. It just got bigger.

So one question remains. In the AI era, how should we do code review?


We Still Don’t Trust Code That No Human Has Reviewed

Let’s start with the most intuitive position. “No matter how good AI gets, we can’t trust code that a human hasn’t read.”

Simon Willison recently put together a list of agentic engineering anti-patterns, and the very first one was exactly this. “Don’t dump unreviewed code on your collaborators.” Just because an agent generated hundreds or thousands of lines of code for you doesn’t mean you should send it up as a PR. That’s offloading the actual work onto your teammates. In his words: “They could just prompt an agent themselves. So what value are you actually adding?”

That’s a sharp question. And it lands for me. The most frustrating moment back in my CTO days was when a PR came up and the person who pushed it couldn’t explain their own code. The same is true in the AI era. Actually, it’s worse in the AI era, because the agent will write a plausible-looking PR description for you. If someone pushes a PR with an agent-written description they haven’t even read themselves (and I sometimes feel that temptation too), that’s just rude to the reviewer.

Kent Beck, one of the engineers I respect most, takes a similar stance. I introduced his Augmented Coding philosophy earlier in How a 15-year CTO does vibe coding, and the core idea is the same. The faster AI generates code, the more important testing and review become — not less. As the cost of generation approaches zero, the source of value shifts from generation to verification.

Addy Osmani put his finger on the same point. “The unsolved problem isn’t generation, it’s verification. That’s where engineering judgment becomes your highest-leverage skill.” AI is good at making code. Whether that code is correct, whether it fits your system, whether it’s still maintainable in six months: that judgment is still on people. At least for now.

The core of this position is clear. No matter how well AI generates code, responsibility for that code lands on a human in the end. If you’re responsible, you have to verify. If you verify, you have to review. Logically airtight.

That said, there’s an uncomfortable truth here. Are there enough people with the time and skill to do that “review”? The way I lived it as CTO, code review getting pushed out of “official work” wasn’t a question of willpower. It was a question of reality. If AI tripled code output and review capacity stayed flat, this position is correct, but whether it’s actually sustainable is another question.


The Era of Humans Reviewing Code Is Over

On the other side, there’s a much more aggressive claim. “The era of humans reviewing code is over. Or rather, it has to be over.”

Bryan Finster recently applied the Nyquist-Shannon Sampling Theorem to this problem, and the analogy is more persuasive than it sounds. The original is a communications theorem: to accurately reconstruct a signal, you have to sample at more than twice the highest frequency present. Apply that to software and you get this. If your defect-detection rate can’t keep up with your code-production rate, you don’t miss problems occasionally. You miss them systematically.

AI produces code at a high frequency. Manual code review is a low-frequency sampling mechanism. We’ve raised the production frequency without raising the feedback frequency. That’s the definition of undersampling, and undersampling means you miss things. Not occasionally. Reliably.
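The undersampling claim can be made concrete with a toy model. This is my own sketch, not Finster’s: assume review capacity is a fixed number of lines per day, and any produced line beyond that capacity is simply never sampled.

```python
def unreviewed_fraction(produced_loc: float, reviewed_loc: float) -> float:
    """Toy model: review covers at most `reviewed_loc` of the
    `produced_loc` lines written; everything beyond that is never
    sampled at all."""
    covered = min(reviewed_loc / produced_loc, 1.0)
    return 1.0 - covered

# Review capacity fixed at 1,000 lines/day:
before = unreviewed_fraction(1_000, 1_000)  # production matches review capacity
after = unreviewed_fraction(3_000, 1_000)   # AI triples production
```

With the numbers from the TL;DR (output tripled, review capacity flat), two thirds of the code goes effectively unsampled. That is the “not occasionally, reliably” miss the analogy predicts.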

The data from SmartBear’s analysis of Cisco Systems teams backs this up. Human reviewer defect-detection rates fall off a cliff once you cross 400 lines. Yet a single AI prompt can spit out 600 lines. PRs over 400 lines aren’t reviews. They’re rubber stamps. This matches my CTO experience exactly. Under deadline pressure, PR reviews became a formality, and even strong developers fell into “skim it, LGTM” mode. The AI era only made it worse.

A company called StrongDM pushed this logic to the extreme. In their “Software Factory,” humans don’t write code, and humans don’t review code. What humans do is define intent, curate scenarios, and set constraints. After that, agents do everything: they generate code, validate scenarios in a behavior-clone environment from a third-party service called Digital Twin Universe, and iterate until the scenarios pass. Validation has replaced code review.

When I first saw this, I’ll admit my reaction was “does this actually work?” But Simon Willison watched this team’s demo and wrote it up on his blog, and Wharton’s Ethan Mollick and Y Combinator’s Garry Tan both took notice. Stanford Law School’s CodeX even published an analysis titled “Built by Agents, Tested by Agents, Trusted by Whom?” The title is direct. If agents build it and agents test it, who can trust it? When the same kind of AI writes the code and the same kind of AI tests it, both can miss the same thing. And when this software blows up in production, with no human author, who carries the responsibility? Nobody has answered that question yet. StrongDM is using this approach in production anyway — and they’re a security infrastructure company. That’s why this experiment is hard to dismiss.

If StrongDM is the extreme, Salesforce went for a realistic middle ground. After adopting AI-assisted coding, code volume rose roughly 30%, and PRs touching 20 files and 1,000 changed lines became routine. More worryingly, review time on the largest PRs actually started dropping. That was a signal that reviewers had stopped meaningfully wrestling with the changes. Salesforce built a system called Prizm and rearchitected the review process itself. Not “let’s add an AI reviewer,” but an admission that the diff-centric review model itself doesn’t work in the AI era. They introduced a new approach called Intent Reconstruction.

People in this camp share a common line. “AI didn’t remove the safety net. AI just exposed that the safety net was always relying on individual heroics.” That’s Bryan Finster’s framing, and it stings. Letting code review fall off the official duties list as CTO, depending on one enthusiastic reviewer, merging PRs because the schedule said so — all of it was evidence that the safety net was, in fact, riding on heroes.


So What Should We Be Reviewing?

“We have to keep code review” is true. “Humans can’t review everything” is also true. So what exactly are we supposed to do?

latent.space’s Ankit Jain gave the cleanest frame on this question. Shift from code review to intent review. Instead of reading a 500-line diff line by line, review the spec, the acceptance criteria, and the constraints.

In this frame, the spec becomes the source of truth, and the code becomes the artifact of the spec. The human role moves from “did we write this correctly?” to “are we solving the right problem under the right constraints?” The most valuable human judgment gets applied before the code is generated, not after.

This isn’t a new concept. It’s the same thing Behavior-Driven Development has been arguing for years. Before coding, the team gets together and defines “how this feature should behave” in executable scenarios, and those scenarios become the acceptance tests. The reason it never went fully mainstream is that writing the spec felt like extra work. With agents, the equation flips. The spec stops being extra work and becomes the default artifact.

Ankit Jain says trust has to be stacked in layers. Like a Swiss cheese model, where no single gate catches everything, so you stack imperfect filters until the holes don’t line up. Letting multiple agents try different approaches and picking the best is layer one. Deterministic guardrails like tests and type checks are layer two. Humans defining acceptance criteria up front is layer three. Stack on top of that fine-grained agent permissions and adversarial validation (one agent makes it, another tries to break it), and you get five layers of filters.
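The layered structure is easy to sketch in code. This is a minimal illustration of the Swiss cheese idea, not Jain’s implementation; the gate names are hypothetical stand-ins for the five layers above (multi-agent drafting, deterministic checks, acceptance criteria, permissions, adversarial validation).

```python
from typing import Callable

# A "gate" inspects a change and returns blocking findings (empty list = pass).
Gate = Callable[[str], list[str]]

def swiss_cheese_review(change: str, gates: list[Gate]) -> tuple[bool, list[str]]:
    """Run every imperfect filter and accumulate findings; the change
    passes only if no layer objects. No single gate is trusted alone."""
    findings = [f for gate in gates for f in gate(change)]
    return (len(findings) == 0, findings)
```

The design point is that every gate runs even after one fails: you want all the holes mapped, not just the first one hit.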

On the practical side, Qodo’s predictions for AI code review patterns in 2026 are worth a closer look.

First, context-first review. Before opening the diff, pull cross-repo usage, prior PRs, and architecture docs automatically and treat context as required input. Context was the hardest part of review for me as CTO. Sometimes I’d spend half the review time just figuring out “why is this code shaped this way?”

Second, severity-driven review. Findings get classified as must-fix, recommended, or minor suggestion. If you’ve ever had a bot drop 37 comments about whitespace while missing a null check that would take down production, you understand instantly why this matters.
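The triage logic behind severity-driven review fits in a few lines. This is a hypothetical sketch, not Qodo’s actual schema: classify each finding, and let only the top tier block the merge.

```python
from enum import Enum

class Severity(Enum):
    MUST_FIX = "must-fix"
    RECOMMENDED = "recommended"
    MINOR = "minor suggestion"

def merge_blockers(findings: list[tuple[Severity, str]]) -> list[str]:
    """Only must-fix findings hold up the PR; everything else becomes
    a non-blocking comment instead of 37 lines of whitespace noise."""
    return [msg for sev, msg in findings if sev is Severity.MUST_FIX]
```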

Third, specialist-agent review. Asking one generalist model to play security expert, performance engineer, and staff SWE simultaneously is too much. You separate out a security agent, a performance agent, and a correctness agent, each analyzing in its own domain, and a coordinator builds the unified report. This connects directly to “decomposition” from the 9 skills of agentic engineering. Breaking one giant task into specialist domains.
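The fan-out/fan-in shape of specialist-agent review looks something like this. The agents here are trivial hypothetical stubs (real ones would call a model); the point is the coordinator pattern, not the detection logic.

```python
# Hypothetical specialist reviewers; each reports findings in its own domain.
def security_agent(diff: str) -> list[str]:
    return ["hardcoded secret"] if "API_KEY=" in diff else []

def performance_agent(diff: str) -> list[str]:
    return ["N+1 query pattern"] if "query inside loop" in diff else []

def coordinate(diff: str, agents: dict) -> dict:
    """The coordinator fans the diff out to each specialist and
    assembles their findings into one unified report, keyed by domain."""
    return {name: agent(diff) for name, agent in agents.items()}
```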

Bryan Finster reached a similar conclusion. After automation handles everything else, the list of things humans should genuinely block a merge on comes down to two.

One is tribal knowledge. Integration quirks, historical decisions, the “we tried that and it broke billing” context. The kind of thing that lives only in people’s heads and isn’t documented anywhere. Long term, this should also move into docs and Architectural Decision Records and be enforced by tooling. Short term, you need “the person who knows where the bodies are buried,” and their job is context review, not syntax review.

The other is regulated paths. In environments where separation of duties is a compliance requirement, changes to sensitive areas need a second human approval. That’s not negotiable. But it’s no reason to apply the same bar to every PR.

The tooling is shifting too. CodeRabbit supports GitHub, GitLab, Bitbucket, and Azure, giving it the broadest platform reach. Greptile indexes the entire codebase to attempt the deepest level of bug detection. GitHub Copilot Code Review hit one million users within a month of launch. If you’re already on Copilot, friction is near zero, but because it’s diff-based, it tends to miss architecture-level issues. Whatever tool you pick, the principle is the same. Hand off what AI can catch (syntax, style, simple logic bugs, security patterns) to AI, and keep humans on what AI absolutely can’t catch (intent, context, business judgment, tribal knowledge).

So the answer to “what should we be reviewing?” lands here. Review intent, not code. Review the spec, not the diff. Review the context, not the syntax.


Closing

If I’m being honest, I don’t personally review the code I write with AI right now.

I play the QA role instead. I test mostly for whether the thing actually behaves as intended, and I only look at the code when something goes wrong. I run manual QA tests myself, and if a problem is reproducible in code, I amplify the case into an integration test so it doesn’t recur. API scenario tests, external integrations, and UX testing still have real limits, and the truth is that quality still depends on how diligently I do the hands-on work myself.

As I laid out above, the era of reading code line by line is already gone. But responsibility for the code hasn’t gone away. The shape of the responsibility has changed. From the person who writes the code to the person who confirms the code does what it’s supposed to. From reviewer to verifier.

So what does that “verifier” role concretely look like?

I think today’s AI-native engineer or full-stack builder might need to play the role the previous era’s PM played, by themselves. Especially on product quality. Define the requirements, set the acceptance criteria, hand the implementation to the agent, verify the result, and monitor in production. That’s clearly different from the traditional developer role. The reason I emphasized “definition of done” and “observability” in the 9 skills of agentic engineering was the same context.

This isn’t entirely new either. Even in the previous era, there were two kinds of developers. Developers who treated their work as done once the code merged, and developers who deployed afterward, monitored, and tested with their own hands in production. The latter were much better engineers. More importantly, they were responsible engineers.

As the era shifts, the cost of producing code is approaching zero. In a world where an agent can build three versions of a feature in an hour, “the ability to write code well” no longer differentiates anyone. What differentiates is the ability to judge whether the code actually solves the real problem, the ability to respond when production breaks, and most of all, the willingness to stand by code that goes out under your name, all the way through.

Can you take responsibility for your own code?

FAQ

Can we eliminate code review entirely in the AI era?
It's hard to eliminate completely. StrongDM removed human review from their pipeline, but tribal knowledge and regulated paths still require human judgment that no agent can replicate.
What matters most when reviewing AI-generated code?
Review the intent, not the code. Instead of reading the diff line by line, review the spec, the acceptance criteria, and the constraints. Focus on the business context and architectural calls that AI struggles to catch.
What works in practice to reduce the code review bottleneck?
Hand off syntax, style, and trivial bugs to AI review tools (CodeRabbit, Greptile, Copilot, etc.) and keep humans on intent, context, and business judgment. Stack layers like a Swiss cheese model: automated tests, specialist-agent reviews, and human-defined acceptance criteria.
About the author

Tony Cho

Indie Hacker, Product Engineer, and Writer

A developer who builds products and writes retrospectives. Writes about AI coding, agent workflows, startup product development, team building, and leadership.



Reactions

If you've read this far, leave a note. Reactions, pushback, questions — all welcome.


