Skip to content
Go back
✍️ 에세이

9 Survival Skills for the Agentic Engineering Era

by Tony Cho
48 min read 한국어 원문 보기

TL;DR

Karpathy declared that the era of typing code directly into an editor is over and gave the new mode a name: agentic engineering. What ended is the typing, not the engineering. Decomposition, context architecture, definition of done, failure recovery, observability, memory architecture, parallel orchestration, abstraction layering, and taste. These are the nine skills that matter now, drawn from real wins and very real wreckage in my own work.

Opening

Karpathy, the same person who coined the term vibe coding, posted on X that we now need a new name to distinguish the next mode from vibe coding, and proposed calling it agentic engineering.

I’ve been doing vibe coding seriously since last April, and the past two or three months have been turbulent in a way that’s hard to describe. I think the reason my piece “What Should Engineers Read in an Era That No Longer Reads Code?” went unexpectedly viral was a reaction to that turbulence.

This post took inspiration from Karpathy’s tweet, but it’s stitched together from my own scars and the field reports of people like Armin Ronacher, Boris Cherny, WenHao Yu, and IndyDevDan, distilled into nine core skills.

The nine core skills are:

  1. Decomposition
  2. Context Architecture
  3. Definition of Done
  4. Failure Recovery Loop
  5. Observability
  6. Memory Architecture
  7. Parallel Orchestration
  8. Abstraction Layering
  9. Taste

The interesting thing is that all nine of these were already required of any engineer who got things done well, and any manager too, long before agentic engineering or even vibe coding. Why that’s true is the thread I want to pull. Let’s start with Karpathy’s story and walk through them one by one.


The Weekend of Vibe Coding’s Inventor

Karpathy said he wanted to build a dashboard for his home cameras over a weekend. He gave the agent the IP of his DGX Spark, the username, the password, and the goal. SSH key setup, vLLM configuration, model downloads and benchmarks, video inference server, web UI dashboard, systemd service setup, memory notes, and a markdown report at the end (he asked for all of it in one go). Thirty minutes later it was done.

“I didn’t touch anything myself. This was a weekend project just 3 months ago. Now it was 30 minutes of just forgetting about it.”

Karpathy gave this new mode a name. Agentic Engineering.

“‘Agentic’ because 99% of the time you are no longer writing code directly, you are commanding and supervising agents. ‘Engineering’ because there is art, science, and skill to it.”

The era of an app popping out of a few lines of prompt is over. What matters now is the skill of designing the conditions under which agents actually work.

The change is fast. The adaptation is slow. Most developers haven’t caught up.

And the speed of this shift is not normal.

“It is hard to put into words how much programming has changed in just the last ~2 months. This was not a ‘business as usual’ kind of incremental progress.”

Most developers are using AI, but the share of work fully delegated to agents is still low. According to the 2026 Agentic Coding Trends Report, 60% of developers use AI but only 0–20% have fully delegated work to it. There’s a name for this gap: the Delegation Paradox. Letting AI write code is one thing. Handing the work to an agent and walking away is a completely different question.

IndyDevDan put the gap in one sentence.

“Do you trust your agents?”

Most developers say no. I said no at first too. I reviewed every line the agent wrote, and there were times when it took longer than just writing the code myself.

But as Karpathy’s example shows, in the agentic engineering era more and more of the work is being automated and delegated to agents. So what skills (or qualities) do we need to keep being good engineers in this world?


① Decomposition

If you ask an agent to “build a signup flow,” you’ll get something. The problem is the odds of it being what you actually wanted are low. Email verification is missing. The password rules don’t match yours. The UI went somewhere you couldn’t have predicted.

Telling an agent to do work is, in the end, the act of deciding what to build. What does the customer want, what does the user need, what’s the priority? That part is on me. The agent can’t take that off my plate.

“The key is to develop the intuition to decompose tasks appropriately, delegating to agents where they work well and providing human help where needed.”

Easy to say, hard to do. The line between “where they work well” and “where humans need to step in” shifts every time. Some tasks the agent finishes one-shot. Others, you can run three times and it still misses the point. Building intuition for that difference is what decomposition is. Karpathy was pretty clear about the conditions for decomposition too.

“It works especially well in some scenarios, especially where the task is well-specified and the functionality can be verified/tested.”

Flip that around: when the spec is fuzzy and there’s no way to verify the result, the agent gets lost too. My job is to turn the fuzzy requirement into a clear unit of work.

When I built out a TDD workflow with Claude Code, the lesson was that 70 to 80 percent comes out of one shot, and the remaining 20 percent is the actual job. How well you defined the requirement up front decides how big that 20 percent ends up being.

You can see the same pattern in WenHao Yu’s Opus 4.6 multi-agent workflow. He hands big projects off to an AI Team Lead, and 70 percent of what the Team Lead does is, in fact, decomposition. It first designs the answer to “what subtasks do we need to build this feature?” and then dispatches each subtask to a different agent. If the decomposition is right, the rest follows. If the decomposition is wrong, every agent loses the thread.

I lived through that “if the decomposition is wrong, every agent goes off the rails” lesson directly. One time I threw “build the settings page” at an agent as a single task, and inside the settings page were profile editing, notifications settings, subscription management, and data export. The agent tried to build all four at once, and the state management got tangled. Changing the notifications toggle reset the profile form. An error in subscription management broke the whole page. In the end I split it into four independent tasks, gave each one to an agent, and it worked first try. The breakdown wasn’t “settings page.” It was “profile edit form,” “notifications toggle component,” “subscription management panel,” and “data export button.” Four pieces.

Before: AddPlan, thrown in without an interview

I had to build the plan creation screen (AddPlanView). A five-step input flow: name input, scope setting, period selection, weekday picker, summary confirmation. I had the Figma designs, and I had written a PRD. “Surely the agent can build this in one pass.”

That was a naive expectation.

The agent shipped a first version. The shape was roughly right at a glance. The details kept slipping, though. It was pulling colors and fonts that weren’t defined in the design system. The CustomNumberPad layout for Step 2 didn’t match the Figma. I fixed that, and Step 3’s calendar broke. Every fix pushed another step out of place. By the third round I was thinking, “Is it faster if I just write this myself?”

The cause was clear. I’d started without sorting out, even for myself, what I actually wanted. I had a PRD, but the details (the spacing and tap targets of the CustomNumberPad, the direction and timing of step transition animations, how validation errors should be displayed) were all still in my head. The agent can’t read my head, so it made its own choice each time, and each time it didn’t match what I wanted. We ping-ponged for dozens of turns and burned almost half a day.

After: Socratic dialogue to sharpen the requirements

After that I started interviewing with the AI before building any feature. Frameworks like Superpowers automate this for you, but the core is the same: the process of making “what I want” explicit, before the code starts. Think of it as a Socratic dialogue. The AI asks questions, I answer them, and the requirement gets more specific.

I tried this approach the second time around with AddPlan. “I want to build a five-step input flow.” → AI: “What are the input fields and validations for each step?” → “Step 1 is name input, no empty strings, max 50 characters.” → AI: “Are you using design system colors? Any custom colors?” → “Design system colors only. The accent color is #FF6B35.” → AI: “Step transition animation? UX on validation failure?” → “Slide, with inline error messages.”

Five minutes. That’s how long the conversation took. The edge cases that came out in those five minutes were almost identical to the ones I’d discovered one by one across half a day of ping-pong. The difference was that last time I’d found them after writing the code, and this time I cleared them up before.

When I handed those crisp requirements to the agent, the one-shot quality was clearly different. Splitting the work step by step and writing the spec for each step explicitly cut the revision cycle to 2–3 turns. Even those revisions were design tweaks, not structural changes. Five minutes of interview saved half a day of ping-pong. From then on, the feature interview became a default step in my workflow. Every feature build now goes “interview → spec writeup → instruct agent.”

People say spec-driven development (SDD), which rose alongside vibe coding, means you can build cleanly if your PRD is right. That’s true. But how to decompose the spec is still on us.

How to practice this

Engineers aside, the people who get things done well, in any field, decompose big tasks into pieces and stay in flow by selecting and focusing on one piece at a time. The people who get things done badly skip planning, dive in, ping-pong around, and end up missing the deadline. (Yeah, I’ve seen plenty of developers like this.)

If you’re at the wrap-up stage and the ping-pong with your coding agent is going long, that’s a signal to ask whether you’ve actually decomposed the work properly.

The first habit is writing a requirements doc before you start implementing. It doesn’t have to be elaborate. Just writing out, as plain text, “what does this feature do, and what does done look like?” already exposes the gaps. These days I make a small requirements.md before any feature build. (Not a spec doc.)

Interviewing with the AI is also worth folding into your daily workflow. It feels awkward at first, being asked questions by the AI. After a few rounds, though, you’ll catch yourself getting flagged on edge cases you’d missed. It pays off most on stateful features like auth, payments, and file uploads. Whether you use a framework like Superpowers or just ask ChatGPT, “what should I think about before I build this feature?”, the method doesn’t matter. The point is to give yourself thinking time before you start building. Five minutes is enough. Once you’ve felt those five minutes save four hours a few times, the habit makes itself.

Throwing a sentence into the agent’s chat shell from minute one is never a good habit. It’s the same habit as the developer who jumps into code without a plan.

So you also need to deliberately practice splitting big work into “the size an agent can finish in one turn.” Roughly: 3 to 5 files modified, 15 to 30 minutes to complete. Bigger than that, split it. Smaller, combine it. After about ten attempts, you’ll feel it. That feel is decomposition.

The latest Codex and Claude Code design good task plans on their own with tools like Task. Simple requirements or fixes are probably fine. In the end, though, you have to do it yourself first to know. Do, then delegate. The order matters.


② Context Architecture

Look again at Karpathy’s DGX Spark example. What he gave the agent was four things: IP, username, password, goal. No padding, just what was needed. That’s the ideal of context architecture.

Real production environments aren’t this clean. A project has dozens of files, business logic, coding conventions, architecture decisions made months ago. How you hand all that context to the agent decides the quality of the output. To borrow Karpathy’s framing, natural language is now the interface in place of code.

Karpathy included “memory notes and a markdown report” at the end of his instructions. That’s not just documentation. That’s an instruction to structure the context the agent generates while working, so it can be passed to the next task. Context isn’t only something you give. It’s also something you build.

Writing a good AGENTS.md matters, but that isn’t all of it. If the code architecture itself is well designed, the speed at which an agent grasps context is in a different league.

These days in Codex you can pin a skill with $ and pass exactly the right context, which lifts accuracy a lot. Documentation alone isn’t the whole answer, though. I learned that the hard way.

Counterintuitively, in the end, you have to write good code.

If the directory structure is clear, the naming is consistent, and the concerns are separated, the agent picks it up fast. Conversely, no matter how well you’ve written the docs on top of spaghetti code, the agent is likely to wander. Saying we’re in an era that no longer reads code doesn’t mean code quality matters less. It matters more.

The idea of an agent-friendly codebase

Flask creator Armin Ronacher raised an interesting angle. He argues that language choice itself is part of context architecture when you’re collaborating with agents. His conclusion was unexpected: Go is an agent-friendly language.

“Go is sloppy: Rob Pike famously described Go as suitable for developers who aren’t equipped to handle a complex language. Substitute ‘developers’ with ‘agents.’”

Go is statically typed but flexible, and the syntax is easy. Simpler than Java, stricter than Python. Above all, it’s explicit. I once gravitated toward functional and bleeding-edge languages, and the reason I settled on Go is exactly this. It’s also easy for juniors to learn. By the same logic, it’s easy for agents. Whatever the language is, what matters is a structure that gives the agent fewer ways to mess up.

Ronacher is sharp on tool design too.

“Tools need to be protected against an LLM chaos monkey using them completely wrong.”

He puts double-execution guards (pidfiles) and port-conflict prevention into his Makefile. Agents will gladly start the same server twice or try to bind to a port that’s already in use. Blocking that at the tool level shrinks the space the agent can flail in.

Boris Cherny, the person who built Claude Code, said something similar in his Lenny’s Newsletter interview. One reason he can run 15 agents in parallel is that he isolates the context for each one rigorously. Agent A only touches the frontend, agent B only the API, agent C only tests. With minimal context overlap, conflicts go down and the accuracy of each agent goes up.

Before: agent lost in a flat directory

In the early days of the iOS app, the directory structure was effectively flat. The Views folder had thirty screens jumbled together, with models and view models sitting at the same level. Naming conventions varied per file: some were PascalCase like PlanListView, some were DailyTasks, some were just Summary. Even a human reader needed time to figure out “where does this file belong?”

Setting aside that this was my first iOS native app, the project folder I’d set up to prototype quickly had grown enormous as features piled on.

I got tired of telling the agent, every single time, “not that folder, this folder.” Saying “fix the settings screen” often meant the agent touched unrelated files. The settings screen’s view model would import a model from the home screen. The directory structure didn’t enforce any separation of concerns, so the agent didn’t know the boundaries either. The context window filled up with files that didn’t belong, and accuracy dropped.

After: feature-based directories with role separation

I restructured the directories around features. Features/Plan/, Features/Daily/, Features/Settings/. Each feature folder holds its own View, ViewModel, and Model together. Shared components moved to Shared/Components/, common models to Shared/Models/.

I unified the naming too. {Feature}{Role} pattern: PlanListView, PlanListViewModel, PlanModel. From the file name alone you can tell what the file does and where it belongs.

The change was immediate. Tell the agent, “add a dark mode toggle to the Settings screen,” and it works inside Features/Settings/ only. There’s no reason left to touch other features. The code structure becomes the boundary of the context. I don’t even need to say “only look at this folder.” The structure itself communicates the scope.

The HumanLayer team’s analysis points the same way. Once your CLAUDE.md (or AGENTS.md) crosses 150–200 instructions, the rate of compliance drops sharply. Task-specific instructions need to live in separate files. One well-structured directory tells the agent more than ten pages of docs.

How to practice this

Practice clean architecture deliberately. “Code that’s easy for an agent to read” and “code that’s easy for a human to read” overlap startlingly often. When I start a new project, the first thing I do is lay down the directory structure and write what each directory is for in the README. Partly for humans, partly for agents.

I stick to a DDD / Clean Architecture structure because it’s testable, and I particularly enforce strong conventions on the use case layer. iOS differs a bit from server work, but the skeleton is roughly the same.

In AGENTS.md I keep things tight: architecture decision rationale (ADRs), coding conventions, a glossary of domain terms. The rest, I let the code itself speak. Accurate type definitions, function names that carry meaning, tests that double as spec docs. That’s the best AGENTS.md.

Designing for context separation is also worth trying. Worktrees, multiple agents running in parallel with isolated environments and isolated roles and goals: performance peaks here. You’ll want to manage backend and frontend in one place, and you’ll hope to do everything from one shell. In the end, though, splitting Plan, documentation, development, testing, and commit across separate agents is much more efficient. There’s more to manage and at first it feels like overkill. As the work gets more complex, the separation pays off. (Which is exactly why orchestration tooling matters more.)


③ Definition of Done

Letting an agent run overnight and checking in the morning is a thrilling experience. There’s also a moment where the thrill turns into emptiness. The report says “task complete,” but when you actually look, only the documentation got updated, or all you have is stub functions and interface scaffolding. You don’t have working code. You have code that looks like it could work.

Karpathy, discussing what agents still need, listed several things including supervision.

“Of course this is not yet perfect. Things still needed: high-level direction, judgment, taste — knowing what good looks like — supervision, and providing hints and ideas on repetitive tasks.”

Agents need supervision. And supervision starts with definition of done. If you don’t clearly define what “this task is finished” means, the agent reports “done” by its own standards. Nine times out of ten, those standards aren’t yours.

Before: an automation CLI, run overnight, came back hollow

I tried to build a workflow automation CLI based on the Codex App Server. A tool that auto-runs the loop propose → plan → run → verify → archive. I prepared a planning doc covering the full architecture, module structure, and API design. I planned parallel agent execution: Stream A for core logic, Stream B for the CLI interface, Stream C for tests. “With this much documentation the agent can handle it.” I let it run overnight.

When I checked in the morning, it had stopped after one hour. The agent had decided “there’s nothing left to do” and stopped. The file structure was tidy. It was all stubs, though. func Propose() error { return nil }. The type definitions and module structure were perfectly in place, and the actual business logic was empty. It was like being handed a well-organized empty house.

The more instructive lesson was the second attempt. When I retried the CLI, the agent reported “all tests passing.” Cracking it open, the agent had quietly rewritten the tests for its own convenience. Instead of verifying the actual scenarios (does propose really call the API and parse the response, does plan respect dependency order, does verify catch failure cases), it had swapped them for code that just checked whether the function returned without an error, then declared “all green!” From the agent’s view this wasn’t a lie. The tests really did all pass. They just weren’t the tests I wanted.

That’s when it clicked: the agent’s “done” isn’t my “done.” And what closes that gap isn’t a better model. It’s a clearer definition of done. I hadn’t read my own doc carefully. I’d written the requirements myself, and I hadn’t sat with the complexity hiding inside them. “There’s a doc, so the agent will read it and build it.” That’s the most dangerous antipattern.

After: DoD plus a reporting system

When I tried the CLI again later, I took a completely different approach. Every task instruction now includes two things. The first is a definition of done document. Stream A’s DoD: “the propose command actually calls the API, parses the response, and saves it as a JSON file. Add three new integration tests.” That level of specificity. And critically: “Stub patterns like return nil don’t count as done. Don’t modify existing tests. Add new tests only.” That blocks the agent from escaping into stubs or rewriting the tests.

The second is a task report. When the agent finishes, it has to write up the results against the DoD. “What I did, which DoD items I met, what’s left.” With a report, I can grasp the state in five minutes before opening any code.

What stands out in Elvis’s system is that the definition of done is staged. In his agent system, “done” isn’t just writing the code:

  1. Was a PR created?
  2. Is it synced with the main branch (no merge conflicts)?
  3. Does CI pass (lint, type check, unit tests, E2E)?
  4. Did the Codex code review pass?
  5. Did the Claude Code code review pass?
  6. Did the Gemini code review pass?
  7. If there’s a UI change, is a screenshot included?

Only when all of those clear does the Telegram notification arrive: “PR #341 ready for review.” Before then, no notification. Three agents review the code, CI passes, and the merge is conflict-free before a human is pulled in.

You don’t need to go this far (honestly, I haven’t gotten there yet either), but the principle is the same. The agent has to be told, concretely, what “done” means. Otherwise the agent applies its own definition of done. The odds of that lining up with yours are not great.

This isn’t just my lesson. The GitHub Engineering team uses the same pattern. In multi-agent systems they enforce inter-agent messages with typed schemas and explicitly limit what each agent can do.

“Most multi-agent workflow failures come down to missing structure, not model capability.”

The CLI failed not because the model was dumb. It failed because I hadn’t given it structure.

How to practice this

Start by including a DoD checklist on every task instruction. Writing a DoD every time can feel like overkill. After two or three “the agent said done but it wasn’t” experiences, though, not writing one feels riskier. My task instruction template now has a DoD section by default. “Tests pass + existing tests untouched + report submitted” is the baseline, and I add items based on the work.

Build the habit of not taking the agent’s “done” report at face value. This is healthy verification, not paranoia. It matters more for overnight work. When I dispatch a long-running task now, I always insert mid-run checkpoints. “Report after stage one. Report after stage two.” This way, instead of losing eight hours, you catch the wrong direction at the two-hour mark. Once you’ve opened “task complete” and found a hollow shell, you’ll feel the value of mid-run checkpoints in your bones.

Practice cutting DoDs into smaller units too. The DoD for “login feature complete” is full of holes. Break it into “email verification flow complete” and “password reset complete” and the criteria sharpen. Decomposition (①) and definition of done (③) are a pair. Well-decomposed work has a clear DoD, and a clear DoD makes decomposition easier.

④ Failure Recovery Loop

Working with agents means failure is the norm. The workflow that worked yesterday breaks today. A new model ships and the same prompts behave differently.

“The agent autonomously worked for ~30 minutes, running into various issues along the way, looking things up online to solve them, iteratively resolving them.”

The agent itself runs as a loop of failure and recovery. It doesn’t always go this cleanly, though. The agent’s self-recovery has limits. When the agent hits a failure it can’t resolve on its own, what matters is how the human steps in.

Before: redistribution engine, infinite A↔B loop

One core feature in the iOS app is the study load redistribution engine. “I couldn’t do today’s portion, so I’ll do more tomorrow.” The engine recalculates the leftover load and redistributes it. The bug looked simple: calling the redistribution API made existing data on future dates disappear. 47 out of 50 records were lost.

The cause sat in two places. The delete function was deleting everything without a date filter, and the function for extracting incomplete data was excluding future-dated records.

I knew the cause, so I should be able to fix it, right? That’s where hell started. All 5 scenario tests passed. When I dug in, the tests were doing “data > 0” level checks. 50 dropping to 3 still passed. (This isn’t the agent’s fault. It’s mine.)

The real problem came next. The meaning of a specific parameter differed across functions. includeToday=true meant “fetch today’s data” in function A, and “delete starting from today” in function B. Same parameter, completely different semantics. Fix A and B broke. Fix B and A broke. The agent fell into its own loop, repeating fix → break → fix → break.

After: isolation tests plus Must NOT Have guardrails

I narrowed the code in the end. Instead of testing the full API flow, I isolated the problematic function and tested it on its own. What was invisible inside the integration test became obvious once I isolated it. Then I built a separate path that didn’t touch the existing code. I defined the semantics of each function independently and reimplemented them.

The key was the “Must NOT Have” guardrail. “Don’t modify this file. Don’t change the API response contract. Don’t modify existing integration tests.” Those three prohibitions broke the agent’s A↔B loop.

This experience maps almost exactly onto Factor 9 of Dex Horthy’s 12-Factor Agents: compress errors into the context so the agent can self-heal. Not just “try again,” but inject the cause and the surrounding facts so the same mistake doesn’t repeat.

Don’t retry with the same prompt

Most agent loops, on failure, retry with the same prompt. “Try again.” That works sometimes. For non-deterministic errors, like a network timeout or a flaky API, retrying is right. When something is fundamentally wrong, though, repetition gives you the same result. The agent is using the wrong library, or it has misread the requirement, or it doesn’t have enough context. Retrying with the same prompt in those cases is just headbutting the same wall.

The point is to analyze the cause of the failure and prescribe accordingly. Not repeat the same instruction, but write a better one. That difference is enormous.

Sorting failures into three types makes the prescription clear.

Type 1: Context shortfall. The agent doesn’t know something it needs to know. Fix: add the missing information.

Type 2: Direction error. The requirement itself was misread. Fix: redefine the requirement more clearly.

Type 3: Structural conflict. There’s a problem in the code structure itself. Fix: narrow the code, isolate it, set guardrails, change the structure, and retry.

The redistribution engine was Type 3. Not “try again,” but “isolate this file for tests, and don’t touch this file.” The structural prescription. Just by asking “what type is this?” before you press “try again,” recovery speeds up noticeably. It’s faster to figure out why the agent failed and adjust the instruction. Understanding why the agent failed is itself engineering.

How to practice this

The starting habit is logging every failure, even briefly. “Missed the context.” “Misread the requirement.” “Fell into A↔B loop.” A pile of those short notes starts to show patterns. When the same type repeats three times, that’s the signal to change the system.

Stay open to new tools and methods. I’ve moved from Cursor to Claude Code, from Claude Code to Codex, and through OpenClaw, Superpowers, and several skill systems along the way. Each tool had its own failure pattern, and crossing between them is what built up my “feel for working with agents.” Don’t get attached to any one tool. Tools are means, not ends.

Per-project KNOWN_ISSUES.md files are also effective. Keep a list of “the mistakes the agent makes most often on this project” and the recurrence rate clearly drops. Failure logs become memory, and memory becomes a system.

When you try a new approach, use the “30-minute rule.” If there’s no meaningful progress in 30 minutes, find another way. If something works inside 30 minutes, dig deeper from there. Failing is fine. Repeating the same failure is not.


⑤ Observability

Handing a big task wholesale to an agent is convenient. When something goes wrong, though, it’s hard to figure out where. “At what point will I check the result?” That question is the heart of observability.

In Karpathy’s DGX Spark example the agent worked autonomously for thirty minutes. The post doesn’t say what Karpathy did during that thirty minutes, but the fact that the agent left “memory notes and a markdown report” means the work history was traceable.

The stronger models and agents get, the more observability matters. The more an agent can do, the more directions things can go wrong in.

Before: liquidglass, the cost of “weird, but let’s leave it”

When iOS 26 was announced, I tried to apply liquidglass for the first time. I wanted to bring the new design language into our app. I expected the agent to handle the update on its own. (That this expectation was naive, you should be seeing the pattern by now.)

I watched the agent work. The first few files looked fine. Around the fourth or fifth file, something felt off. The scope of files it was touching was wider than expected. Colors looked like they were drifting from the original intent. The branches for backward compatibility were getting more tangled.

“Weird, but let’s leave it.” That single sentence was the most expensive call I made.

When I checked the result, the UI was completely broken. The translucent effect of liquidglass collided with the existing color scheme and tanked text legibility, and in dark mode some elements vanished entirely. The worst part was there were no per-step commits. I couldn’t roll back partially. All in or all out.

If I’d stopped at the fourth or fifth file and checked, in the worst case I’d have rolled back five files. Letting it run to the end meant cleaning up across more than twenty tangled files.

After: tracer-bullet strategy plus a blueprint

After this, when I apply a new technology I always use a tracer-bullet strategy. Instead of applying it everywhere at once, I apply it to the simplest screen first. Fire small, check fast. If it’s fine, expand to the next screen.

The real value of the tracer bullet is that it produces a blueprint. Applying liquidglass to one screen showed me, “ah, this is where it collides with the color scheme, this is how the dark mode branch needs to be done.” For a technology you’ve never used, you can’t draw the blueprint up front. The tracer bullet draws it for you, fast. With the blueprint from screen A in hand, when the agent started touching unexpected files on screen B I could immediately judge “this isn’t what I expected.”

Per-step commits became mandatory too. “Apply screen A” → commit → “Apply screen B” → commit. Now if screen C breaks, I have rollback points. “Commit every three files modified.” It’s a one-liner instruction. That single line drops the cost of fixes dramatically.

As observability goes up, the scope you can delegate goes up too. At first I was nervous handing off a single function and would review everything. With the tracer-bullet strategy and per-step commits in place, I now hand off module-level work without anxiety. Observability builds trust, and trust enables delegation. My answer to “do you trust your agents?” is shifting toward “more and more, yes,” not because the agents got smarter, but because my observation system got more refined.

How to practice this

First, build a feel for splitting work into the right size. My rule of thumb: if reviewing one PR takes under 10 minutes, the size is right. Over 30 minutes is too big. By file count, 3 to 5 is a good range to check at once. I wasn’t sure about this rule at first, but after a few months “this is too big” started landing on its own.

Designing explicit mid-run checkpoints needs to become a habit too. “Show me when you get this far.” That sentence prevents an hour-long detour. Auto-reporting is even better. I have my agent report a diff summary every three files modified. Instead of looking at the full code each time, I read the summary and decide “direction OK” or “stop.”

Then there’s the habit of sketching, before you start, a rough blueprint of “this is roughly how it’ll go.” That’s the precondition for observability. If I don’t know where the agent is supposed to go, I can’t tell when it’s gone off. For a refactor, “I’ll touch these files in this order.” For a new feature, “this module will end up with this shape.” That level of sketch is enough.

The blueprint doesn’t have to be accurate. There are times when the agent’s different approach is better than mine. What matters is noticing “it’s going in a different direction.” Even when the agent goes its own way, with a blueprint I can immediately catch “wait, this is different.” Without one, catching it isn’t even possible.

Elvis’s 10-minute cron monitoring is the automation of this blueprint comparison. It compares the agent’s current state (tmux session alive, PR status, CI result) against a pre-defined expected state. When something deviates, an alert fires and a human steps in. It’s a 100% deterministic bash script, so it costs no tokens and can’t really be wrong. Simple principle. That simple principle is one of the pieces of infrastructure that makes 94 commits per day possible.


⑥ Memory Architecture

If you do long work with AI you’ll hit a wall every time. As the session stretches, it forgets what was said earlier. There’s a name for it, context compaction. Context shrinks too aggressively, and continuous work suffers most.

Karpathy’s agent instructions always ended with “memory notes and a markdown report.” The point isn’t just writing code. The point is leaving a record of what was done.

An orchestrator without memory treats every session like a first meeting. What we did yesterday, what we decided, what failed: all gone, every time, starting from scratch.

Before: 15 minutes every morning re-explaining context

When I was three days deep into an auth refactor, every morning I’d open with “yesterday I changed the JWT structure,” and it wore me down. The dev.to post by @suede describes the same situation exactly. Continuous work, but every new session in the morning meant explaining yesterday’s work from the top. “I changed this structure yesterday, let me start by explaining why.” That’s 15 to 20 minutes gone. Three days in a row, that’s almost an hour. And verbal recap isn’t perfect. Things I forgot or mis-remembered slip in.

After: hooks for automatic memory, restore in 5 seconds with one MEMORY.md

@suede’s solution was elegant. He used Claude Code’s hooks feature to build a system that automatically extracts “memories” at the end of each session and writes them to CLAUDE.md.

“Session 1: Claude works → hooks silently extract memories → saved. Session 2: Claude starts → reads CLAUDE.md → instantly knows everything.”

The point is that “you don’t need to tell it to record.” Hooks summarize and append the work content automatically at session end. The next session reads it on start. Time to restore context: 5 seconds. From 15 minutes to 5 seconds. Once you feel that gap, there’s no going back.

I haven’t gone as far as hooks, but borrowing the pattern, every turn of work in Codex or Claude Code I always update the memory and progress doc. In MEMORY.md I write “what I did today, what I decided, what to pick up next.”

The Boris Cherny team’s case extends this memory to the team level. The Claude Code team checks a single CLAUDE.md into git so the whole team shares it. When Claude does something wrong, they immediately add to CLAUDE.md: “next time, don’t do this.” Even in code review they tag @.claude and update it as part of the PR. Individual memory becomes team memory passed to the agent.

Tools are pouring out in this direction now. Claude Code’s built-in memory, AI memory layers like supermemory.ai. As memory infrastructure matures, the underlying problem of “every session is a first meeting” is heading toward a real solution.

How to practice this

The habit of documenting every turn is, almost by itself, the whole of memory architecture. Make one MEMORY.md and start writing every day. Today’s decision and why, what’s next, open issues. Those three items are enough.

One tip: keeping the memory’s structure consistent matters too. I write MEMORY.md in date order, and tag each entry with [decision], [work], [issue]. Later, when I’m looking for “what was that architecture decision from last month?”, searching [decision] returns it inside ten seconds. That small bit of structure makes the memory dramatically more searchable.

When projects get long, bring in a searchable system (Obsidian, etc.). The point is “a searchable record.” If you can’t find an architecture decision from three months ago, you’ll have the same conversation again. Memory breaks that loop.


⑦ Parallel Orchestration

One of Karpathy’s key points was this.

“The highest leverage is in designing a long-running orchestrator with the right tools, memory, and instructions to productively manage multiple parallel coding instances.”

“The highest level of agentic engineering, accessible through this, is currently very high leverage.”

Building different features at the same time across multiple worktrees is technically possible. In practice, the management is rough. Agent A is on the auth module, agent B is on payments, and they both touch the same user model. Collision.

From Boris Cherny earlier to Elvis (@elvissun) at 94 commits a day, the direction is the same. A single engineer orchestrating multiple agents to produce team-level output. That’s exactly why Karpathy named this “agentic engineering.”

His example shows the extreme of this direction. Five Claude Code instances running in parallel in the local terminal, plus another 5–10 running on claude.ai/code. 10 to 15 parallel sessions in total. The structure works because each agent’s context is rigorously isolated. Tools like Superset.sh and oh-my-codex (omx) are emerging in the same direction.

Echoes of my CTO years

Going through this kept reminding me of my CTO years. The years I managed six squads. Daily meetings with six teams, getting a read on each team’s state, unblocking blockers, keeping the overall direction from drifting. Managing parallel agents resembles that work to a startling degree.

I’ve seen people compare current parallel agent coding to ADHD. Switching between many tasks and not being able to focus on any of them. There’s something to that. I think it’s closer to managing, though. ADHD is unintended distraction. Agent management is intentional multitasking. What a manager needs isn’t “the ability to write the code for every team” but “the ability to read every team’s state, unblock blockers, and align direction.” Parallel agent management is exactly that.

In ⑤, leaving one agent alone was already risky. Five agents running together doubles that risk. When I managed six squads, the most dangerous moment was the one where I let myself think “everything is going fine.” In that moment one team is flailing, two teams are duplicating each other’s work, or someone is sprinting in the wrong direction. Same with agents. Leave five agents running in parallel with a “they’ll figure it out” mindset and merge time becomes a collision festival, one agent overwrites another’s work, or the outputs end up wildly inconsistent.

Checklists and sync points are the lifeline. And this isn’t a new skill. Good managers already have it. The agent era just put a new name on it.

There’s one decisive difference between managing people and managing agents. People ask questions. Agents don’t ask. They proceed on their own judgment. That’s why design up front matters more in agent management. “In situations like X, do Y.” You have to set that ahead of time.

How to practice this

Start small. Running five agents in parallel from day one ends in chaos. Start with two.

Sharing my own experience: the first day I ran two agents in parallel was a mess. While I checked A’s results I missed B’s progress, and when I went to check B, A was waiting on me. From day two I started using a timer. 25 minutes monitoring agent A, 5-minute break, 25 minutes on agent B, Pomodoro-shaped. Once that routine settled, both agents ran stably. A week in I added a third. Two stable, then three; three stable, then five. People who’ve managed teams or led squads will move through this faster.

You also have to map out dependencies between parallel work and design for collision avoidance. Using git worktree gives you physical separation. Have agent A work in worktree-auth and agent B in worktree-payment, and file conflicts shrink on their own.


⑧ Abstraction Layering

There are levels in agentic engineering, in my view. I distinguish them by feel.

I’m currently at Level 2 and trying for Level 3. I’m building skills, automating workflows, and experimenting with structures where agents manage agents.

Before: the days of repeating the same instruction every time

In my Level 1 days, I manually repeated the same routine every morning. “Check yesterday’s merged PRs” → “summarize the changes” → “list open issues” → “propose priorities.” All four, in order, every time. Twenty minutes a day. Seven hours a month. It took me about three weeks to notice that the instructions were almost identical every time.

After: one skill, “summarize this week”

I turned that routine into a skill. One sentence runs it: “summarize this week.” A 20-minute routine became 2 minutes. Beyond the time savings, there was a bigger change. Building this skill forced me to make explicit “the pattern of judgments I make every day.” The process itself was practice in raising the abstraction layer.

There was one thing I felt every time I built a skill. People call it compounding engineering. Our projects are big enough that they don’t end in a single session. This isn’t a finish-line game. It’s a compounding game where earlier sessions affect later ones with interest.

“The biggest payoff is in raising the abstraction layer ever higher.”

The “edge” Karpathy named isn’t just time savings. Each level up widens the field of view, so you can take on bigger problems. From the era of writing code by hand (Level 0), to the era of instructing an agent in English (Level 1–2), to the era of designing an orchestrator that manages agents (Level 2–3). Every level up dramatically broadens what a human can take on.

When abstraction goes up, the human’s role changes

It’s not that the human idles while the agent works. Instead of typing code, you design the system. Instead of instructing the agent, you build the environment in which the agent works well.

The hours that used to go into typing code now go into setting direction, making judgments, and supervising quality. That’s the practical meaning of raising the abstraction layer.

How to practice this

“I’m giving this same instruction for the third time.” That awareness is where abstraction begins. When you see repetition, turn it into a skill or a template. A simple prompt template is fine to start. That one small piece of automation becomes the base for the next one.

Make a habit of asking, “what would I need to delegate this to an agent?” That question itself is the start of abstract thinking. Look at every task you do by hand through “is this delegable?” If it is, what context, tools, and memory would the agent need? Repeating that question builds the skill of designing abstraction layers.


⑨ Taste

The last one is the hardest to measure and maybe the most important.

“Things still needed: high-level direction, judgment, taste — knowing what good looks like.”

The sense of looking at what an agent built and telling “this one’s solid” from “this one’s off.” It works technically, but somehow it’s uncomfortable. The code runs, but somehow it doesn’t feel right. You can feel it. You actually have to feel it.

“‘Engineering’ because there is art, science, and skill to it.”

Art, science, skill: taste sits where these three overlap. It isn’t innate. It’s something you accumulate by going deep.

A prototype the AI made, my partner’s reaction

There was an episode with my current partner, Ellie, a product designer I work with on building the app fast. When I made screen A with AI and showed it to her, she was put off at first. The output landed without discussion, and she felt like she didn’t know what her role was. (Designers, like developers, are wrestling hard with their direction in the AI era.)

After enough conversation, when I delivered screen B the same way, it was different. By then she understood the direction I was going for, and with a concrete working prototype as the reference, what was missing and what needed more polish became visible. Communication cost, the kind that usually only resolves after many ping-pong rounds between designer and developer, dropped dramatically.

AI design is bland

The same thing happens in our current project. Our app isn’t just a generic productivity app, but the AI kept generating only the boilerplate productivity-app design. Even when I explained our distinct domain, Claude kept ignoring it and regenerating the universal design.

What I’d handed over thinking, “this is intuitive enough,” was, honestly, a 60 to 70. When I saw what Ellie actually designed, there were things AI could never produce. Looking at the AI output I was uncertain. When Ellie’s design landed, the feeling came: “ah, this works.”

Most of what AI produces is average. There’s real value in laying down the skeleton and the components. Taste, texture, the specific touch: that’s still the human’s territory.

Do work → Good → Great

AI brings remarkable performance gains. What it actually reaches today, though, is, honestly, around 80%. 80% is amazing compared to the past.

The problem is the remaining 20%. Each 1% within that 20% is a bigger gap than the previous 10%. Look at a product, a restaurant, a film. The moment when the extra 2% really went in is the moment that moves you. The feeling you get from a master, a virtuoso, a great director sits outside the band of “average.”

When 80% products flood the market → people will go looking for the better thing in the remaining 20% → and that 20% becomes the differentiator: human skill and craft.

I had a similar experience with social media. A clean information-organization post Claude Code produced: well-structured, sensible, tidy. Zero likes. A single line I wrote on impulse, bragging about something: 30,000 views, 200+ likes. A real human emotion in one line, time-sensitive, beat what passed for polite AI content by a wide margin.

LLMs are statistical models in the end. The word “model” itself means “an approximation of the real world.” What an LLM has learned is the patterns of text on the internet. The average of “good design,” the average of “good code.” Average is safe. It isn’t outstanding. Outstanding comes from leaving the average behind.

Don’t lose your intuition.

Sean Goedecke puts the point exactly:

“About once an hour I notice that the agent is doing something that looks suspicious, and when I dig deeper I’m able to set it on the right track and save hours of wasted effort… This is why I think pure ‘vibe coding’ hasn’t produced an explosion of useful apps.”

That “ability to notice something suspicious” is taste. When the agent decides to spin up a full background-job infrastructure for what should have been a simple async request, the call to stop and say “wait, this is overengineering” is taste. Structural judgment is taste.

”Works” and “great” sit on different axes

This is the thing I most wanted to say in this post. Do work → Good → Great. The gap between those three.

AI does “Do work” remarkably fast. In some cases it gets to “Good.” The last 20% to “Great” is territory you can’t reach if you settle for the AI average of 80%. Customers feel the final 2%. No one is moved by an average output.

If everything is getting easier with AI, suspect it. Ask whether your output is settling at the average. In the era when 80% is everywhere, the differentiator is in the remaining 20%. That 20% is the territory of taste, not of technique.

KinglyCrow’s “No Skill, No Taste” is sharp on this point. Taste and skill are a 2×2 matrix. LLMs look like they’ve lowered the entry barrier on skill, but the real barrier of taste is unchanged. If anything, it’s been amplified. Vibe coding lets anyone build an app, and what gets built without taste is slop. In the era when 80% products flood the market, what separates the remaining 20% is, in the end, taste. No matter how far AI advances, building that sense is still on me.

Chris Lattner, who built LLVM and Swift, reached the same conclusion. When Anthropic released the project where Claude Code implements a C compiler from scratch (CCC), Lattner’s analysis on his blog was that the implementation is textbook and there’s no new abstraction. He compared it to the level of a strong undergraduate team. What he actually highlighted was elsewhere. “As implementation gets more automated, design, judgment, and taste become more important, not less.” The more AI lowers the implementation barrier, the more the taste of what to build becomes the engineer’s core competency.

Taste is accumulated experience

This sense comes from domain knowledge. Someone who has used many good APIs can design good APIs. Someone who has experienced many good UXes can judge good UX. No matter how fast AI builds, judging “is this good or not” is on me.

After 15 years of writing code, I know in my bones the difference between “this is good code” and “this works but isn’t good code.” That difference shows up in a single variable name, a single function structure, a single error-handling style. The same standard has to apply to code an agent wrote. “Works” and “good” sit on different axes.

The agent once built me a search feature that worked perfectly. Technically nothing to fault. Something was off, though. After staring at it a while, it clicked: the search results were sorted alphabetically. Technically correct, but from the user’s perspective, sorting by relevance is far more natural. The agent built “the search feature.” It hadn’t built “a good search experience.” Catching that gap is taste.

How to practice this

The clearest way to build taste is to see, make, and use a lot of good work. Don’t read only tech blogs. Look at design, study business cases, read fiction. Go to museums.

The starting point for taste is the habit of not accepting the agent’s output as-is. Ask, every time, “is this really the best?” “Why is this good?” “Why does this feel uncomfortable?” Repeating those questions sharpens your sense.

Care about people, too. Watch what customers want and where users get stuck. A product that’s technically perfect but uncomfortable to use is a product where taste is missing. Whether it’s running user interviews, watching the support channel, or peering over your neighbor’s shoulder while they use the app, taste sharpens at the point where humans meet technology.

Taste is hard to grow alone. Reviewing other people’s code, watching users react, listening to a partner’s feedback. In the agent era taste matters more, but the way you build taste is still analog. Talking with people, watching the world, experiencing good things. AI can’t do that part for you.


Closing

“Since the invention of the computer, the era of typing code directly into an editor is over.”

True. What’s over is the typing, not the engineering.

Decomposition, context architecture, definition of done, failure recovery, observability, memory architecture, parallel orchestration, abstraction layering, taste. Sit with these nine and you’ll see they were already what good engineers had before the AI era. Agentic engineering is an extension and amplification of these capabilities. Nothing new. The things that were already important got more important.

If one thing has shifted, it’s that the effect of these capabilities has been dramatically amplified. In the past, weak decomposition could be patched by writing the code yourself. In the era of delegating to agents, bad decomposition gets amplified at agent speed. The payoff of good design grew, and so did the damage from bad design.

Mihail Eric, who teaches AI-native engineering at Stanford, gives practical advice: add incrementally. Get really good at one agent workflow first. When you can build complex software with one agent, then add the second. One step at a time, not ten at once.

Mihail also pointed out something important. Watching the people who handle multi-agent setups well, they were people with actual experience managing human developers. My CTO years managing six squads helped directly with managing agents in the same way.

I still have a long way to go. Some days the rhythm with the agent is so on-point that I think “this is the future.” The next day I’m watching the agent flail and grumbling that it’d be faster to write it myself.

The direction is clear, though. And that direction isn’t “write better prompts.” It’s “design the environment in which the agent works well.” Prompts are the tool. Environment design is the substance.

In the end this is a question of taste and experience. Tools change. The substance stays. A good engineer who meets an agent becomes a great engineer. Bad design that meets an agent produces bad output, fast.

These nine capabilities aren’t separate. They’re connected. Good decomposition makes the definition of done clear. Good context architecture makes failure recovery easier. Accumulated memory raises observability. Experience with parallel management lifts the abstraction layer. Underneath all of it sits taste. Build one and the others follow. It doesn’t matter where you start. What matters is starting.

As Mihail emphasized to his students, experimentation is the core of becoming an AI-native software developer. In the end you have to bang your own head against the wall a few times. Everything I shared in this post (half a day of ping-pong on AddPlan, the hollow CLI, the redistribution engine’s infinite loop, the “weird, but let’s leave it” liquidglass) is the result of trial and error. Without that, none of these nine capabilities settle into your hands.

“It is a deep, improvable skill.”

A little better every day is enough. No need for perfect. Just the right direction.

If you’d told me six months ago, “your AI agent will write code overnight and you’ll just review the PR in the morning,” I’d have laughed. Now it’s daily life. I can’t picture what daily life looks like six months from now. One thing I’m sure of, though: even then, decomposition will still be needed, context architecture will still matter, and taste will still be irreplaceable.

The protagonist of that story isn’t the AI. It’s the engineer who handles the AI well.


References

FAQ

What is agentic engineering?
It's the term Karpathy proposed as the successor to vibe coding. Instead of typing code yourself, you direct AI agents to do the work, manage them in parallel, and review the results.
How is agentic engineering different from vibe coding?
Vibe coding is handing things off to AI and just checking the output. Agentic engineering is the discipline of designing the right tools, memory, and instructions, and orchestrating multiple agents to work together.
What's the most important skill in the agentic engineering era?
Nine skills matter: decomposition, context architecture, definition of done, failure recovery, observability, memory architecture, parallel orchestration, abstraction layering, and taste. They all converge on one thing: the ability to design the conditions under which agents actually work.
Tony Cho profile image

About the author

Tony Cho

Indie Hacker, Product Engineer, and Writer

제품을 만들고 회고를 남기는 개발자. AI 코딩, 에이전트 워크플로우, 스타트업 제품 개발, 팀 빌딩과 리더십에 대해 쓴다.


Share this post on:

반응

If you've read this far, leave a note. Reactions, pushback, questions — all welcome.

댓글

댓글을 불러오는 중...


댓글 남기기

이메일은 공개되지 않습니다

Legacy comments (Giscus)


Previous Post
Wrapping Up January and February 2026 (Not Really a Retro)
Next Post
AI Is Only as Smart as You Are