✍️ Essay

In an Era That Doesn't Read Code, What Should an Engineer Read?

by Tony Cho · 23 min read · View the Korean original

TL;DR

An Anthropic study found that developers using AI scored 17 percent lower on a learning quiz. The real point of the research, though, is that how you use AI determines the outcome. From Ben Shoemaker's harness engineering and Steve Yegge's declaration that hand-coding is over to Jeremy Utley's AI-as-mirror theory: in an era that doesn't read code, what is the engineer actually supposed to read?

Developers Who Used AI Learned 17 Percent Less

The more you use AI, the dumber you get. That’s roughly what one piece of research suggests. In a paper from Anthropic, developers who completed a coding task with AI assistance scored 17 percent lower on a quiz than those who worked without AI. The experiment was about learning a new library, and the AI-assisted group finished the task but never really understood the concepts behind that library.

When I first saw the number, I wasn’t uncomfortable so much as curious — and intrigued. The fact that it came straight from the team behind Claude Code was interesting on its own, and I wanted to see what kind of experiment led them to that conclusion.

You do have to read this study with some context, though. The experiment was a short 35-minute coding task, and the model used was GPT-4o, which by today’s standards already feels like the previous generation. The setup was also far from real work. Using AI on a project that runs for months in a real job is a completely different situation from finishing a task with a brand-new library inside 35 minutes.

But the deeper point of the research lies somewhere else. What the team actually found was not simply “AI use reduces learning.” It was that the same AI produced wildly different outcomes from one person to the next. Some people offloaded the entire code to AI and copy-pasted, while others asked AI only about the concepts and wrote the code themselves. The first group finished fastest but learned the least. The second group ran into more errors but scored far higher on the quiz.

Used the wrong way, we may stop growing.

That, I think, is what this research is really trying to say.


From Reading Code to Directing It

While I was sitting with these results, I read an interesting piece. An engineer named Ben Shoemaker wrote “In Defense of Not Reading the Code,” and the title alone was provocative. It set off a heated debate on Hacker News with more than 200 comments.

His core point goes like this. “I don’t read code line by line anymore. Instead, I read specs, I read tests, I read architecture.” The way he checks correctness has changed. He writes specs first, tags each requirement with a verification method, layers automated tests, linters, and security scans into a harness, and then leaves code generation to an AI agent. Instead of code review, he proposed a new approach he calls harness engineering.
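Shoemaker's article describes the idea rather than a concrete implementation, but the shape of a harness can be sketched in a few lines: each spec requirement carries a tag and an automated verification method, and the harness runs those checks in place of line-by-line human review. Everything below (the names, the `validate` stand-in, the structure) is a hypothetical illustration, not his code.

```python
# Hypothetical sketch of "harness engineering": spec requirements are
# tagged with automated verification methods, and the harness runs those
# checks instead of a human reading every line of the diff.
from dataclasses import dataclass
from typing import Callable


def validate(s: str) -> bool:
    # Stand-in for an AI-generated implementation under verification.
    return bool(s.strip())


@dataclass
class Requirement:
    req_id: str                  # tag carried over from the spec, e.g. "REQ-1"
    description: str
    verify: Callable[[], bool]   # automated check: a test, a linter, a scan...


def run_harness(reqs: list[Requirement]) -> dict[str, bool]:
    """Run every requirement's verification method and report pass/fail."""
    return {r.req_id: r.verify() for r in reqs}


requirements = [
    Requirement("REQ-1", "rejects blank input", lambda: validate("  ") is False),
    Requirement("REQ-2", "accepts normal input", lambda: validate("ok") is True),
]

results = run_harness(requirements)
print(results)  # {'REQ-1': True, 'REQ-2': True}
```

The point of the structure is that the spec, not the diff, is the artifact a human reads: a failing `req_id` tells you which promise was broken without opening the generated code first.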

Looking back, I had been moving in the same direction. On a recent project where I leaned on AI for code, the things I poured the most effort into were not the code itself but the test harness, context files like AGENTS.md, custom commands, and skill definitions. I wrote about this process in How a 15-year CTO Vibe-Codes. Instead of reading every line, I had naturally shifted toward checking whether the tests passed and whether the architectural constraints were honored. Reading Ben’s piece, I felt a small relief: this wasn’t just me.

The interesting part is that around the same time, OpenAI was saying something similar. Three engineers, with Codex agents alone, produced a million lines of code and shipped a product that hundreds of internal users now depend on, and what they invested in wasn’t code quality but the harness around the code: documentation, dependency rules, linters, test infrastructure, observability. What they did not invest in was reading code line by line. Watching this, I felt the direction was set. We’re moving from reading code to reading the environment that makes the code come out right.

So what does this actually change? Evan Armstrong frames the shift in a bigger picture. In his words, code itself is becoming a commodity. Commoditized here means code generation is no longer a scarce specialist skill but a general resource anyone can reach. A large share of GitHub commits are already AI-generated, and that share is growing fast. Generating code has been commoditized, but governing code in production (knowing what should exist, what data it connects to, who is allowed to change it) has not. He calls this the context layer. It’s organizational tacit knowledge becoming software. It’s the kind of organizational knowledge that tells the agent what to do, in what order, and whether it’s allowed. Building software is no longer the hard part. Telling the system what to build is.

I feel this in my own work. When I work with AI these days, the hardest thing isn’t writing the code, it’s defining clearly “what we should build.” When the spec is fuzzy, no matter how smart the AI is, the result drifts off. The quality of instruction sets the quality of output.

The Codex deep-dive showed the same pattern. Engineers on the Codex team have effectively become agent managers. In one tab, a code review runs; in another, a feature is being implemented; in a third, a security audit is in progress. They manage four to eight parallel agents at once, and they use a file called AGENTS.md to teach each agent how to find its way around the codebase, what test commands to run, and what the project’s standards are. If the README is for humans, AGENTS.md is for AI.
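The essay doesn't reproduce OpenAI's actual AGENTS.md, but the kind of file it describes might look like this minimal, hypothetical sketch (paths, commands, and rules are invented for illustration):

```markdown
# AGENTS.md (hypothetical sketch)

## Project layout
- `api/`: HTTP handlers. Never touch the DB directly; go through `repo/`.
- `repo/`: all persistence code lives here.

## Commands
- Run tests: `make test`
- Lint before every commit: `make lint`

## Standards
- Every new endpoint starts from a failing test (TDD).
- Domain terms are defined in `GLOSSARY.md`; use them verbatim in code.
```

The file reads like onboarding notes for a new hire, which is exactly the point: it is the README's counterpart, written for an agent instead of a person.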

And Steve Yegge gave this whole movement its bluntest name. As a 40-year veteran software engineer, he declared that “the era of hand-coding is over” and laid out an eight-level scale of AI adoption. Level 1 is not using AI at all. From Level 4, you stop reading the diff. At Level 6, you spin up multiple agents. At Level 8, you build your own orchestrator to coordinate them. In his words, when he sees someone open the IDE, carefully review the code, and then check it in, he feels sad for them. They’re some of the best engineers he knows, but that way of working will leave them behind.

If I’m honest, I tried to place myself on the scale. Somewhere between Level 6 and 7. I don’t completely skip the diff, but for anything that isn’t core logic, I judge by whether the tests pass. Six months ago I needed to verify every line myself before I could relax. These days I more often trust the harness and move on. My field of view has shifted from reading every line to validating the system as a whole.

Yegge’s framing is provocative, but looking back, I had already been moving in that direction. From reading and writing code by hand, to defining specs, directing agents, and verifying results. The role of the engineer is clearly shifting.

But Is Writing Specs Well Enough? Finish-Line Game vs. Compounding Game

Here’s where we have to push one step further. All of this (defining specs, building harnesses, writing AGENTS.md carefully, handing things to agents) sits on top of a hidden assumption. The assumption is that “if the spec is good, the result will be good.” When you stop and look at it, that’s a fairly risky assumption.

Kent Beck calls this The Finish Line Game. You need software that does X, you reach X, and you’re done. Spec-driven development hides exactly this assumption inside it: that we are playing a finish-line game.

Are we, though? What we usually play is The Compounding Game. The first thing you build becomes a resource for the next thing, and that next thing becomes a resource for the one after. The product keeps evolving, the codebase keeps stacking up, and today’s architectural decision opens or closes possibilities six months out. Unless it’s a one-off script, software development is fundamentally a compounding game.

That distinction landed hard for me. I recently fell for the illusion that things were “going great” while quickly stamping out features with AI. The feature was done, but when I tried to build on top of it later, the structure couldn’t carry it. I’d crossed the finish line, but the compounding wasn’t there. A textbook case.

A line from Kent Beck stuck with me: “You can’t win the compounding game with a better agent.md file.” No matter how carefully you write AGENTS.md, no matter how well you orchestrate agents, at some point system complexity exceeds AI’s capacity. At that moment, with so much value still left to earn, the game ends. Sharpening the tools of the finish-line game doesn’t change the nature of the compounding game.

Whether it’s harness engineering or agent management, the point isn’t simply “to build this feature well right now.” The point is to design the system so it compounds. To make today’s code a resource for tomorrow, and today’s architecture the foundation for the next feature. That’s the engineer’s role, and you can’t delegate it to an agent.


AI Is a Mirror

A real question shows up here. Everyone is using the same AI, so why do the results diverge so wildly?

Stanford professor Jeremy Utley, who has taught creativity for 16 years, hits exactly this point.

“AI is a mirror. For the person who wants to be lazy, it will help them be lazy. For the person who wants to be sharper, it will help them be sharper.”

That single sentence sums up everything I’ve experienced with AI.

Let me give my own example. I’ve practiced TDD (test-driven development) for a long time, and I’m someone who cares about DDD (domain-driven design) and architecture. When I work with AI on code, that background shows up directly. When I tell the AI, “Write the test first. Follow the Red-Green-Refactor cycle,” the AI follows the TDD flow. When I say, “Let’s define the bounded contexts of this domain first,” the AI starts from domain modeling. When I make the architectural decisions first and then ask for an implementation inside those constraints, the quality of the result is visibly different.
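The Red-Green-Refactor cycle mentioned above can be made concrete with a minimal sketch (the function and test names here are mine, not from any of the cited work): write the failing test first, then just enough implementation to turn it green.

```python
# Red: the tests exist before the implementation does, so they fail first.
def test_slugify_lowercases_and_joins_with_hyphens():
    assert slugify("Hello World") == "hello-world"

def test_slugify_strips_surrounding_whitespace():
    assert slugify("  Hi there ") == "hi-there"

# Green: the smallest implementation that makes both tests pass.
def slugify(title: str) -> str:
    return "-".join(title.strip().lower().split())

# Refactor: with the tests green, the internals can be reshaped safely.
test_slugify_lowercases_and_joins_with_hyphens()
test_slugify_strips_surrounding_whitespace()
```

Telling an AI “write the test first, follow Red-Green-Refactor” produces exactly this ordering: the test pins the behavior down before any generated implementation can drift.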

What about the opposite? When I just toss out, “Make this feature for me,” AI gives me code that runs. The structure, though, is a mess. There are no tests, the error handling is sloppy, and the code shows zero thought for future maintenance. The AI isn’t being dumb. It just doesn’t care about the parts I didn’t care about.

It’s a kind of pair programming. AI is at the keyboard and I’m next to it, steering. If I don’t know where we’re going, AI goes anywhere. AI only finds a real path when I’m able to say, “No, not that way.”

Looking at the six AI-use patterns Anthropic found, the picture gets even sharper. The three lowest-scoring patterns (AI delegation, gradual AI dependence, and iterative AI debugging) all shared “cognitive offloading.” People handed over the act of thinking itself. They delegated entire code generation, started by asking questions and then handed everything over, or even leaned on AI for debugging. They finished fastest and learned the least.

The three highest-scoring patterns were different: understand-after-generation, hybrid code-and-explanation, and conceptual exploration. The conceptual-exploration pattern in particular stood out. This group asked the AI only about concepts and wrote the code themselves. They ran into more errors, but they solved them on their own, and they were still the second-fastest group overall, right after AI delegation. You can understand and still be fast.

The understand-after-generation pattern was also striking. This group had AI generate the code, then asked follow-up questions about the code. “Why does this part work this way?” “What’s the intent of this pattern?” On the surface they look almost identical to the AI-delegation group, but a single step (a check on understanding) split the outcomes completely.

Same tools, same model, same task. Different results. The problem isn’t the tool, it’s the person using it.

If AI works like a mirror at the individual level, what happens at the org level? A Berkeley study gives an interesting answer. UC Berkeley researchers observed a 200-person tech company for eight months, and once AI made it possible for non-developers to write code, something funny happened. PMs wrote code. Researchers did engineering. The result? Engineers spent more time reviewing and fixing their colleagues’ AI code. They spent more time coaching the colleagues who were “vibe coding” and finishing the half-done PRs.

When you think about it, this is the mirror again. AI looks like it’s filling in someone’s gap, but in practice that gap simply shows up in another form. PMs gained the ability to write code with AI, but judging the quality of that code, and cleaning it up when needed, still came down to engineers who knew code deeply.

One more story from my own experience. The more carefully I write AGENTS.md or the project’s context files, the visibly better the AI’s output gets. When I explicitly write down the reasons behind the project’s architectural decisions, coding conventions, and definitions of domain terms, AI produces remarkably consistent code inside that context. Without that context, AI defaults to internet-average code. With rich context, AI behaves like a member of my team.

In the end, AI is a tool that amplifies what you already have. If I bring a good architectural sense, AI amplifies that. If I have a sense for testing, AI amplifies that too. But it does not give me what I don’t already have.


The Mirror’s Limit: It Can’t Reflect What You Don’t Have

There’s an uncomfortable truth to face here. If AI is a mirror, there has to be something for it to reflect.

I know TDD, so I can ask the AI to do TDD. I know how to model a domain, so I can ask the AI to define bounded contexts. But the areas I don’t know? No matter how good AI gets, if I don’t know what I don’t know, I can’t even ask for it.

Without a deep understanding of security, even if I have AI run a security review, I can’t judge whether the result is sufficient. Without a feel for performance optimization, I won’t catch the performance issues in the code AI produces. Without knowing database design principles, I can’t evaluate whether the schema AI proposes is appropriate.

AI maximizes my strengths, but it leaves my blind spots untouched. It can even get more dangerous. Because AI ships results so quickly, I might race in the wrong direction faster while never realizing the problem sitting in my blind spot.

The “gradual AI dependence” pattern from the Anthropic study is exactly this. People start out trying to understand by asking questions, but as they get more comfortable, they end up delegating everything, and they reach a state where they don’t know what they don’t know. That’s why concept retention failed completely on the second task. The thought “AI will handle it anyway” is the same “AI delegation” pattern that scored lowest in the study. Fastest to finish, least to learn.

In the Berkeley study, when non-developers wrote code with AI, the same thing happened. PMs gained the ability to write code with AI, but with no eye for code quality, engineers had to clean up. AI lowered the bar to writing code, but it did not hand over the ability to judge code quality.

This applies to me too. In backend architecture, where I’m strong, AI becomes a genuinely powerful colleague. In frontend, where I’m weaker, I catch myself unable to evaluate AI’s output properly. A React component AI built will “work,” but I sometimes can’t tell whether the pattern is good or bad.

The Dracula Effect: What AI Drains

The mirror’s limit isn’t only about blind spots in knowledge. The Dracula effect that Steve Yegge talks about also makes sense in this context. Coding with AI lets you accomplish a huge amount, but it drains your mental energy in proportion. Simon Willison said the same thing: “The productivity boost from LLMs is exhausting. If you run two or three projects in parallel, even one or two hours of work uses up nearly a full day of mental energy.” Steve put it more directly: someone running vibe coding at full speed caps out at about three productive hours a day. Even so, that’s 100 times more productive than working without AI.

I have a similar experience. With AI, in two or three hours of focused coding I can finish work that used to take days. After that, my head genuinely stops working. The cognitive load is a different kind from the usual one, and it hits harder. When I’m writing code by hand, the act of typing while thinking gives me a natural rest. With AI, I’m constantly judging, verifying, and steering. AI does the producing. The thinking is all on me.

A mirror reflects only the person standing in front of it. With no one in front, it reflects nothing.


So How Do We Use It

If the question is how to use this mirror well, Jeremy Utley’s core principle is simple. “Don’t demand answers from AI; have a conversation with it.” Better yet, don’t ask AI questions; let AI ask the questions.

There’s a prompt he recommends.

“You are an AI expert. Ask me one question at a time until you have enough context about my workflow, my responsibilities, my KPIs, and my goals.”

Why this works: most people use AI like a Google search box. You type a keyword and expect an answer. But an LLM is not a search engine. It’s a partner in a conversation. The richer the context I provide, the richer the response.

Jeremy especially emphasizes voice input. Our brains, trained on the Google search box, automatically switch into “keyword mode” when we see an input field. The moment you start typing, the pressure of “what should I write first?” kicks in, and the need to organize your thoughts ends up narrowing the possibilities. When you speak, you can ramble. Putting down the burden of needing to be smart is where a real conversation begins.

I tried this myself, and the difference is real. When typing, I first start asking, “How do I phrase the question to get a good answer?” When speaking, “I have this problem, here’s the context, and I want to try this” comes out naturally. You shift from keyword mode to conversation mode.

In coding, context engineering is the heart of it. What you’re really designing is “what information AI needs in full to do my request properly.” Jeremy proposes a simple test. Take your prompt and your documents, and hand them to a colleague across the hall. If that colleague can’t do the task, it shouldn’t surprise you that AI can’t either. AI can’t read your mind. What most people realize is, “Oh, I was expecting AI to read my mind.”

I felt this when I applied it to my project. When I write down the reasons behind a project’s architectural decisions in AGENTS.md, document coding conventions concretely, and build a glossary of domain terms, AI produces remarkably consistent code. When I explain to AI why this project uses event sourcing or why this layer doesn’t access the DB directly, AI generates code that fits inside that context. Without context, AI imitates the average of the internet. With rich context, AI works like a member of my team.

Anyway — back to the point. Kent Beck has another idea worth bringing in here. He says “invest in futures as much as in features.” Futures means the set of all the things you’ll be able to build next. If the feature is what you’re building right now, futures are the expansion possibilities the system has after that feature ships. Whether the code structure makes the next feature easy to bolt on, whether the architecture leaves room for new requirements: those are futures.

I think the same applies to context engineering. If you only put the context for the feature you’re building now into AGENTS.md, AI will build that feature. But that’s where it ends. Where the system can go next, in which direction it can grow: that field of view also has to be in the design, or futures don’t survive. If you stay buried in only what you know and what you’re building right now, the feature gets done, but extensibility dies. Providing rich context means putting both the present context and the system’s future possibilities in your line of sight.

In the end, all of this converges on a perspective shift: “not as a tool, but as a teammate.” In Jeremy’s research, when low-performers and high-performers in AI use were compared, the biggest gap wasn’t technical skill. It was attitude toward AI. Low-performers treated AI as a tool. High-performers treated it as a teammate. Treat it as a tool and you stop at average results. Treat it as a teammate and you give feedback, you coach, you pull better results out of it.

When you hand work to a junior, you say, “Ask me anytime if you have questions.” You have to give AI the same permission. If you ask, “Before you start, let me know if there’s information you need to do this well,” AI will say, “I need recent sales numbers to write this email. Could you tell me how many of this SKU sold in Q2?” instead of immediately drafting a sales email. That’s the difference between a tool and a teammate.

The same goes for coding. Instead of “Build this feature,” try, “I’m going to build this feature. Let me describe the current architectural context, and you propose an approach first. If there are edge cases I’m missing, point them out.” The result changes visibly.


What Doesn’t Change

Reading this far, you might ask, “So we don’t have to read code anymore?” The answer is complicated.

Ben Shoemaker also admitted as much. For safety-critical systems, security-sensitive services, and major architectural decisions, you do have to read the code. The analogy he gave was good: the “children of the magenta line” story from aviation. Pilots who came to depend on the automated flight path (the magenta line) lost their ability to judge when to switch to manual. The lesson wasn’t “Don’t use the autopilot.” The lesson was, use the autopilot, but keep the ability to intervene.

Reading code, I think, is the same. The need to read every line is dropping. But the ability to read matters more than ever. When something goes wrong, when all the tests pass but the product behaves strangely, when multiple agents fail to debug a failure, the moment comes when you have to read and understand the code yourself.

Knowing how to read and choosing not to is a completely different story from not being able to read.

Think again about the highest-scoring “conceptual exploration” pattern in the Anthropic study. This group did not have AI write the code. They asked about concepts and wrote the code themselves. They ran into more errors but solved them on their own. Why was this pattern the most effective? Because they could read and write code. That ability gave them the option to ask AI about concepts and check their own understanding.

Not long ago I hit a strange bug in production. The code came out of a plan I’d refined dozens of times with AI feedback. All the tests passed, and when I asked AI to debug, the answer came back: “Looks fine to me.” So I opened the code and walked through it line by line, and I found a bug in the exception-handling logic where the fallback value was being replaced by an unintended default. When an exception fired, the code was supposed to fall through to a safe default, but the default itself was set wrong, so under specific conditions the wrong result came out. AI had looked at this code dozens of times and missed it. That’s when I realized something: the ability to look at AI-generated code and ask, “Is this really right?”, the ability to hold the system’s full flow in your head, the sense that catches the gap between AI declaring “all done” and the thing actually not working — those abilities all connect. Critical thinking, logical thinking, attention to detail. You can’t grow them in isolation. They grow together inside the experience of working deeply with code.

Honestly, as AI advances faster, I feel the value of these fundamentals only goes up. Reading code used to be “a given.” Now that AI writes code for us, the ability to read and judge code properly has become a differentiator. Model half-life has dropped from four months to two, and every time a new model lands, some people say, “This is the limit.” The curve doesn’t stop. In a world where the tools change this fast, the ones who survive aren’t the people who are fluent in a specific tool. They’re the people who can quickly evaluate and use whatever tool shows up. Critical thinking, the eye to see the system as a whole, a sense for quality: none of these change whether the model becomes GPT-10 or Claude 20.

Even the Codex team merges non-core code on AI review alone, but for core agents and open-source components, they insist on careful human code review. The fact that we’re in an era where we don’t write code doesn’t mean the ability to read code has become unnecessary. The ability to read it properly when reading is required has become rarer, and more valuable.


Closing

Let me return to the original question. In an era where AI writes the code, what is the engineer supposed to read?

Time spent reading code will shrink. In its place, you read specs, architecture, test results, and domain context. Code is becoming “implementation detail,” and our attention is shifting to higher abstraction layers.

In the middle of this shift, though, an essential thing doesn’t change. AI maximizes a person’s temperament, tendencies, and abilities. The lazy become lazier, the critical become more critical, the creative become more creative. The person who deeply understands code uses AI to build a deeper system. The person who doesn’t know code uses AI to build something that runs but breaks easily.

Just because we live in an era that doesn’t read code doesn’t mean we can put down the ability to read. Even in a world where AI does the reading, knowing what to read is still on us. To be someone with something worth reflecting in the mirror: that, I think, is the heart of what an engineer is in this era. AI will faithfully amplify the image, for better or worse.


FAQ

Does using AI lower a developer's ability to learn?
According to Anthropic's research, it depends on how you use it. If you hand entire pieces of code over to AI, learning drops. But people who only asked the AI about concepts and wrote the code themselves (the 'conceptual exploration' pattern) were both fast and got the highest learning scores.
What is context engineering?
It's the practice of designing what information AI needs in order to do a request properly. When you explicitly write down things like AGENTS.md, the reasons behind architectural decisions, coding conventions, and domain vocabulary, AI starts producing consistent output that fits your team's context.
Do we still need to be able to read code in the AI era?
Yes. The need to read every line is shrinking, but the ability to read code matters more than ever. When all the tests pass but the product behaves strangely, or when AI fails to fix a bug, you eventually have to open the code yourself and judge it.

About the author

Tony Cho

Indie Hacker, Product Engineer, and Writer

A developer who builds products and writes retrospectives. Writes about AI coding, agent workflows, startup product development, team building, and leadership.



Reactions

If you've read this far, leave a note. Reactions, pushback, questions — all welcome.
