Added 2026-02-08
/tdd:go walks through your work step by step. If you want something more automated, see this follow-up:
Opening
Sorry for the bait. I’m not a 15-year CTO. (I have been writing code for 15 years, that part is true.) I’m also not a current CTO. (I left last year.)
The OpenCode (+oh-my-opencode) wave came and went. Lately OpenCode has been on fire over ToS issues and a few other things. As of today it’s blocked, and I do have a personal workaround, but the work I have left should be fine on Claude Code alone, so I’ve decided to stick with Claude Code for a while.
AI coding tools are popping up everywhere lately. Cursor, Windsurf, Copilot, Claude Code. There are countless posts comparing which one is better, and people are constantly sharing their own workflows. Everyone is shouting “vibe coding,” and YouTube and Twitter are full of videos where an app pops out of a few prompts. “I finished a month of work in three hours!” type stories show up every day.
But I kept feeling a strange unease in the middle of all this. Productivity was up, sure, but something felt missing. The code worked, but did I actually understand it? Could I really call it “developing” if I was just copy-pasting whatever the AI generated? Questions like that kept circling. And the worst part was that even when I tried to ignore the unease, my flow broke and my productivity dropped anyway. It just stopped being fun, so I stopped wanting to do it.
While I was sitting with this, I came across Kent Beck’s writing, and that gave me the spark for my own workflow. What I want to share here isn’t the usual technical “which tool to use” or “how to structure your sub-agents” piece. There are plenty of those already. What I want to talk about is tdd-go-loop, a workflow orchestrator. This goes beyond running a single command. It’s a system where multiple sub-agents collaborate to automate the TDD cycle and run code reviews along the way.
Kent Beck’s Augmented Coding
This workflow was inspired by Kent Beck’s Augmented Coding.
There are plenty of Korean translations out there, so search around if you want a read. A piece written six months ago feels almost ancient in AI-coding time, but the workflow I’m using daily right now started here.
The core of Kent Beck’s argument is simple. Even when the AI writes the code, the human has to stay in control. Not just throwing prompts and accepting outputs, but slicing the work small, verifying as you go, understanding what you’re building. TDD (Test-Driven Development) is the tool for that.
When most developers hear TDD, they think “ah, write the tests first, right?” That’s right. But TDD in the AI era is a bit different. In classic TDD, I wrote the tests and I wrote the implementation. In augmented coding, I design the tests, and the AI writes the implementation. That’s the key. Designing the test first means clearly defining what I want, and the AI generates code that matches that definition. Test passes, the implementation is correct. Test fails, it’s wrong. Simple but powerful.
The important part here is “small unit.” You don’t throw “build me a sign-up feature” at the AI. You break it down to “write the failure-case test for the email validation logic.” You watch the test fail, write the minimum code to pass it, and review it yourself. That’s the cycle. Like stacking Lego blocks, you stack small verified pieces.
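A slice that small can be sketched in Go. The `ValidateEmail` function and its rule below are hypothetical stand-ins, not from any real codebase; the point is the shape of the cycle — pin down the failure case first, then write only the minimum code that satisfies it:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrInvalidEmail is the failure case the test pins down first.
var ErrInvalidEmail = errors.New("invalid email")

// ValidateEmail is the minimum code needed to pass that test:
// reject anything without exactly one "@" with text on both sides.
// (Deliberately naive — later checklist items would tighten it.)
func ValidateEmail(s string) error {
	if strings.Count(s, "@") != 1 {
		return ErrInvalidEmail
	}
	parts := strings.SplitN(s, "@", 2)
	if parts[0] == "" || parts[1] == "" {
		return ErrInvalidEmail
	}
	return nil
}

func main() {
	// The expectation written first: this input must fail.
	fmt.Println(ValidateEmail("not-an-email") != nil) // true
	fmt.Println(ValidateEmail("a@b.com") == nil)      // true
}
```

One checklist item, one behavior, one verifiable outcome — that is the whole block.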
The concept itself is very simple. As usual, write a PRD first, then turn that PRD into a plan.md file structured as Phases and checklists. There's a tool called Spec-kit, but in my experience it tends to over-formalize and inflate the work, so I built my own planning skill. Because the working unit is a TDD cycle, you have to slice tasks as small as possible, and the key is keeping Phase-by-Phase progress visible.
Then I use a command called /tdd:go to run things one at a time, by Phase or by sub-checklist, and review each one myself.
It’s Not Particularly Vibe-y
If you’ve followed along this far, you can already tell this isn’t very vibe-y.
I run every checklist one by one, check the code, and give feedback if I see something off, something to improve, or something that should be extended. If there’s no issue, I move to the next checklist. I’m not the one typing the code, but it’s almost the same thing.
It’s a kind of pair programming. The AI is at the keyboard, I’m next to it saying “hey, that’s not right” and steering the direction. The AI is the driver, I’m the navigator. The one difference from regular pair programming: the AI doesn’t get tired, doesn’t complain, and fixes things the moment I give feedback. (Though when it makes the same mistake repeatedly, I do get frustrated.)
```mermaid
flowchart LR
    A[Write PRD] --> B[Generate plan.md]
    B --> C["Run tdd:go"]
    C --> D[Write Test]
    D --> E[Confirm Test Fails]
    E --> F[Write Implementation]
    F --> G[Test Passes]
    G --> H{Code Review}
    H -->|Needs Feedback| C
    H -->|Pass| I[Next Checklist]
    I --> C
```
When I actually work this way, one cycle takes about 5 to 10 minutes. Write a test, confirm the failure, write the minimum code, watch the test pass, review it. Repeat. Vibe? None. But every line of code is something I understand as it goes in.
Honestly, I know this approach isn’t “hip.” The current trend is to hand everything to the AI and just check the output. The point is finding what works for you.
If you’re a junior developer who wants to learn the code more deeply, I strongly recommend this approach. Especially when I have to keep the codebase in my head for a project, or when I’m starting from scratch, I go through it step by step. Copy-pasting AI-generated code and stacking up code I understood while reviewing are completely different experiences. The first leaves you with “the code exists.” The second gives you the feeling that “this is my code.” The gap is huge.
A Spec Alone Wasn’t Enough
At first, I thought a good spec was all I needed. Tidy up design patterns, DDD, clean architecture, and Claude Code (or Codex) would handle the rest. For simple CRUD apps, that actually worked. But reality wasn’t that simple.
If you’ve worked with juniors, you know what I mean. Unless the requirements are genuinely simple, there are always moments while writing the code where architectural judgment shifts. How to split a function. Which layer the adapter and protocol live on, and in what shape. How far to reuse an object or a service, and where to draw the scope boundary.
No matter how good the spec is, you hit “wait, this isn’t right” moments while writing the actual code. Can the AI make the right call alone in those moments? In my experience, no. The AI works hard at writing the code that the spec describes, but it doesn’t fully grasp the context. “This function is likely to extend later, so let’s pull it out as an interface now.” “This part is simple today, but the domain will get complex, so let’s split it into a separate service.” Calls like these still belong to humans.
Migrating an enterprise-scale SaaS off legacy and running it for a year and a half taught me one thing. The moment you take your hands off the code, no matter how much conceptual guidance you wrote, the actual code always hits moments where the writer’s judgment is needed. Especially when the initial design collides with messy real-world requirements (which keep changing) and starts deforming bit by bit. Handing that to the AI completely is still too early.
Of course, in the vibe coding era you don’t need to follow up on every line of code. Knowing the important parts can be enough. That’s a personal call. If you’re not a developer or you just want to ship a product without touching code, you don’t have to look at the code. I’m rooting for the people who vibe-code without knowing code. There’s a company that already raised a Series B with a 3,000-line API file of raw SQL crammed into a single controller. Customer value is the goal, code is the means. I look at code because I’m a developer. That’s the only reason.
The Anxiety of Vibe Coding
Anyway, putting that aside, the real reason I settled on this approach is something else.
I was running so far from the actual code that I stopped feeling like I was managing the project properly. Handing it to the AI and resting feels weird, but watching the spinner break my focus while I drift into Threads is probably not just my problem.
You’ve all had this. The awkward stretch of time where the AI is busily generating code and you don’t know what to do with yourself. Doing other work feels off because the result is right around the corner. Just waiting feels wasteful. So you end up on Threads or YouTube, and by the time the result lands, your focus is on the floor. You’re supposed to check the output and give feedback, but your head is already somewhere else. Repeat that, and your sense of the whole project slowly fades.
There’s a psychological story here too. Humans get anxious when they don’t feel in control. Especially in their own area of expertise. A developer not knowing the code is a bit like a driver who isn’t holding the wheel. The car might be going fine, but it’s unsettling. You can’t even be sure it’s heading the right place. No matter how good autonomous driving gets, fully letting go of the wheel feels off, and AI coding is the same way.
What an AI agent really does is automate the time-consuming parts of human work. The catch: handing everything over makes debugging harder. The growing sense that “I don’t know this code” actually dropped my productivity. The blank feeling when you can’t answer “what is this code doing, where?” That’s why I eventually settled on keeping a certain amount under my control.
And more than anything, this is about flow. The Flow state Mihaly Csikszentmihalyi wrote about (it’s where my handle Flowkater comes from). When something is moderately hard, moderately easy, and the feedback is immediate, humans drop into flow. Vibe coding doesn’t meet those conditions. Throw a prompt, wait, check the result, throw another prompt. Flow is hard to find inside that loop. The TDD approach of clearing checklists one by one hits the conditions exactly. The work is small enough that it isn’t intimidating, the test result is instant feedback. So it’s fun.
But there’s a big downside. It’s slow.
So I Built My Own Method
I wanted to ease the anxiety I described above without giving up the productivity of the AI era. So I found a middle point. I borrowed Kent Beck’s augmented coding concept and customized it to my situation. The result is tdd-go-loop.
This isn't a single command. It's a workflow orchestrator: rather than just running /tdd:go repeatedly, it coordinates multiple sub-agents like spec-review, codex-review, sql-review, and apply-feedback while managing the entire TDD cycle. Like an orchestra conductor, it decides when each instrument (agent) plays which part.
For the core engine code I need to understand, the API code that holds the actual logic, or when I’m laying out a project structure for the first time, I always go through /tdd:go. Once you do this enough, you notice something. The AI, just like a human, often misreads or guesses wrong about the initial guidelines and writes the wrong code. After you catch those issues and build the foundation yourself, what comes next is mostly developing similar APIs in repetition, or extending logic.
Because Kent Beck’s augmented coding enforces TDD as the development style, once you have a code structure in place (whatever architecture it is), the AI builds on that base, which makes whole-codebase review easier too. For example, when I first build out a Usecase in Go, I sketch the outline with mock tests, then in the Usecase layer I implement the logic by composing domain models, functions, and Repository interfaces. If I tell it to define mutations at the top of the function and keep private functions as pure as possible, the generated code becomes highly readable, which makes review comfortable.
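Under those constraints, a use case might look roughly like this in Go. The types and names below are illustrative stand-ins, not the post's actual code; what matters is the shape — helpers stay pure, and the single mutation sits in plain view at the top level:

```go
package main

import "fmt"

// Hypothetical input and domain types, for illustration only.
type CreatePlanInput struct {
	Name  string
	Limit int
}

type Plan struct {
	Name   string
	Limit  int
	Status string
}

// buildPlan is a pure helper: data in, data out, no side effects.
func buildPlan(in CreatePlanInput) Plan {
	return Plan{Name: in.Name, Limit: in.Limit, Status: "active"}
}

// Create keeps the only mutation (appending to the store) at the top
// level of the use case, so a reviewer sees every state change in one place.
func Create(store *[]Plan, in CreatePlanInput) Plan {
	p := buildPlan(in)         // pure
	*store = append(*store, p) // the one explicit mutation
	return p
}

func main() {
	var store []Plan
	p := Create(&store, CreatePlanInput{Name: "weekly", Limit: 3})
	fmt.Println(p.Status, len(store)) // active 1
}
```

Reviewing code written in this shape is mostly a matter of scanning the top-level method for mutations and spot-checking the pure helpers.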
APIs that match a pattern and structure I’ve already built once with tdd:go can move quickly. I only walk through unfamiliar patterns or complex business logic carefully. The rest, I trust.
My Actual Setup
In Claude Code, you define commands and skills as markdown files inside the .claude/ folder. My project structure looks like this:
```
.claude/
├── commands/                # Single-execution commands
│   ├── tdd/
│   │   ├── go.md            # /tdd:go - run one checklist
│   │   ├── batch.md         # /tdd:batch - batch by Tier
│   │   ├── fast.md          # /tdd:fast - full automation
│   │   └── status.md        # /tdd:status - check progress
│   ├── tdd-go-loop.md       # Workflow orchestrator (the core!)
│   ├── spec-review.md       # Spec review
│   ├── codex-review.md      # Code review
│   ├── sql-review.md        # SQL review
│   └── final-test.md        # Final test
│
├── skills/                  # Composite skills (with agents)
│   ├── go-gin-ddd-bun/      # Go project architecture guide
│   │   ├── SKILL.md
│   │   ├── ARCHITECTURE.md
│   │   └── TESTING.md
│   └── api-final-review/    # Final review skill (4 parallel agents)
│       ├── SKILL.md         # 6-stage review workflow definition
│       ├── AGENTS.md        # Detailed setup for the 4 parallel agents
│       └── templates/
│           └── test_script_template.sh
│
├── templates/               # Document templates
│   ├── plan-template-v2.md  # plan.md template
│   └── api-review-guide.md  # Code review guide
│
└── agents/                  # Sub-agent definitions
    ├── codex-review.md      # Codex review agent
    ├── sql-review.md        # SQL review agent
    └── apply-feedback.md    # Feedback-application agent
```
Two things matter here:
- `tdd-go-loop.md`: Not a plain command, but a workflow orchestrator. It coordinates multiple sub-agents (codex-review, sql-review, apply-feedback, etc.) and automates the entire TDD cycle.
- `api-final-review/`: A skill that runs the final review after API development is done. Four specialist agents run reviews in parallel.
/tdd:go Command Example
Here’s what the actual /tdd:go command looks like:
```markdown
# TDD: Go (run the next test)

Read plan.md and find the first test marked with `[ ]` (not yet implemented).

## Execution Steps

1. **Identify**: Find the first `[ ]` test in plan.md
2. **Announce**: Tell me which test you're about to implement
3. **Red Phase**:
   - Create or update `*_test.go` file
   - Write a failing test for that specific behavior
   - Run `go test -v ./...` to confirm the test fails
4. **Green Phase**:
   - Write the minimum code to make the test pass
   - Run `go test -v ./...` to confirm ALL tests pass
5. **Update**: Mark the test as `[x]` in plan.md
6. **Report**: Summarize what was done

## Critical Rules

- Write ONLY enough code to pass the current test
- Do NOT implement features for future tests
- Always run `go fmt` on new files
- If tests fail unexpectedly, STOP and report before proceeding
```
It's simple. Find one checklist marked `[ ]` in plan.md, run the Red-Green cycle, and check it off as `[x]` when done. That's all there is.
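Step 1 of the command is mechanical enough to sketch in Go. This is only an illustration of the lookup the model performs; the real command delegates it to Claude rather than to code, and `nextUnchecked` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// nextUnchecked returns the first checklist line still marked "[ ]",
// mirroring the Identify step of the command above.
func nextUnchecked(plan string) (string, bool) {
	for _, line := range strings.Split(plan, "\n") {
		if strings.Contains(line, "[ ]") {
			return strings.TrimSpace(line), true
		}
	}
	return "", false // everything is done
}

func main() {
	plan := `### [x] 1.1 Define CreatePlanInput
### [ ] 1.2 Input conversion methods`
	line, ok := nextUnchecked(plan)
	fmt.Println(ok, line)
}
```

When no `[ ]` remains, the loop is over and the orchestrator moves on to review.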
A Real plan.md Example
A snippet from the plan.md I used to implement the Plan creation API (POST /plans):
```markdown
# Test Plan: Plan Creation API (POST /plans)

**Dependency**: plan_00_common.md (Domain layer complete)

---

## Phase 1: Application Layer - Input Struct <!-- T1:auto -->

File: `api/internal/application/plan/create.go`

### [x] 1.1 Define CreatePlanInput
- CreatePlanInput struct contains all required fields
- ScheduleInput struct contains Type and Days fields
- CreateItemInput struct contains Name and Quantity fields

### [x] 1.2 Input conversion methods
- ScheduleInput.ToWeeklySchedule() converts valid input to WeeklySchedule
- ScheduleInput.ToWeeklySchedule() returns ErrNoActiveDays when no active days

---

## Phase 5: UseCase Layer - Create Service <!-- T2:deep -->

File: `api/internal/application/plan/create.go`

### [x] 5.1 Service struct
- createService struct depends on Logger, TransactionManager, PlanWriter
- NewCreateService() constructor injects dependencies

### [x] 5.2 Create - validation logic
- Create() creates a Plan from valid input
- Create() returns ErrInvalidTemplate for unsupported TemplateID
- Create() returns ErrInvalidDateRange when StartDate >= EndDate
```
Each Phase has a Tier marker like `<!-- T1:auto -->` or `<!-- T2:deep -->`. This matters, because review depth differs per Tier:
| Tier | Meaning | Execution | Review Depth |
|---|---|---|---|
| T1 | Scaffold (structure) | Auto | Light |
| T2 | Core (business logic) | Detailed | Deep |
| T3 | Integration (Repository) | Auto | Medium |
| T4 | Surface (Handler/E2E) | Auto | Light |
I only do Deep Review on T2 (business logic) and let the rest go through automatically. Reviewing every line at the same depth is inefficient. Focus on the core logic, trust the rest if the tests pass.
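The marker format is regular enough that the tier lookup can be sketched in Go. The `reviewDepth` mapping below mirrors the table above; the function itself is illustrative, not part of the actual tooling:

```go
package main

import (
	"fmt"
	"regexp"
)

// tierRe matches markers like <!-- T2:deep --> on a Phase heading.
var tierRe = regexp.MustCompile(`<!--\s*(T[1-4]):(\w+)\s*-->`)

// reviewDepth maps a Phase heading's tier token to the review depth
// in the table above. Defaulting to "light" when no marker is present
// is an assumption, not documented behavior.
func reviewDepth(heading string) string {
	m := tierRe.FindStringSubmatch(heading)
	if m == nil {
		return "light"
	}
	switch m[1] {
	case "T2":
		return "deep" // core business logic gets the human's attention
	case "T3":
		return "medium"
	default: // T1, T4
		return "light"
	}
}

func main() {
	fmt.Println(reviewDepth("## Phase 5: UseCase Layer <!-- T2:deep -->"))
	fmt.Println(reviewDepth("## Phase 1: Input Struct <!-- T1:auto -->"))
}
```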
tdd-go-loop Workflow Orchestrator
The full flow of tdd-go-loop. Notice that this is not just /tdd:go on repeat. It’s a workflow orchestrator coordinating multiple sub-agents:
```mermaid
flowchart TD
    start([Start])
    spec_review["spec-review agent"]
    user_confirm{User Confirm?}
    find_next[Find Next Test]
    has_next{Found Next?}
    tdd_execute["TDD Cycle"]
    update_plan[Mark Complete]
    codex_review["codex-review agent"]
    has_critical{Critical Issues?}
    apply_feedback["apply-feedback agent"]
    commit_ask{Commit Tier?}
    is_t3{T3 Integration?}
    sql_review["sql-review agent"]
    plan_done{All Done?}
    final_test["final-test"]
    finish([End])

    start --> spec_review
    spec_review --> user_confirm
    user_confirm -->|No| spec_review
    user_confirm -->|Yes| find_next
    find_next --> has_next
    has_next -->|Yes| tdd_execute
    has_next -->|No| codex_review
    tdd_execute --> update_plan
    update_plan --> find_next
    codex_review --> has_critical
    has_critical -->|Yes| apply_feedback
    has_critical -->|No| commit_ask
    apply_feedback --> codex_review
    commit_ask --> is_t3
    is_t3 -->|Yes| sql_review
    is_t3 -->|No| plan_done
    sql_review --> plan_done
    plan_done -->|No| find_next
    plan_done -->|Yes| final_test
    final_test --> finish
```
The key part: every time a Tier ends, the codex-review agent kicks in automatically. If Codex finds a Critical or Major issue, the apply-feedback agent applies the feedback automatically; Minor issues are just logged. If review repeats more than three times, it forces a move to the next step (no infinite loops).
At the T3 (Integration) stage, the sql-review agent runs additionally to catch performance issues like N+1 queries or missing indexes.
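The retry logic can be sketched as a small loop. `Review` and `runReviewLoop` here are hypothetical stand-ins for what the orchestrator prompt describes in markdown, not actual code from the setup:

```go
package main

import "fmt"

// Review summarizes what the (hypothetical) codex-review agent reports.
type Review struct{ Critical, Major int }

// runReviewLoop sketches the orchestrator's retry rule: re-review after
// applying feedback, but force progress after three passes so the loop
// can never spin forever.
func runReviewLoop(review func(pass int) Review) string {
	for pass := 1; pass <= 3; pass++ {
		r := review(pass)
		if r.Critical == 0 && r.Major == 0 {
			return fmt.Sprintf("clean after pass %d", pass)
		}
		// The apply-feedback agent would run here before the next pass.
	}
	return "forced forward after 3 passes"
}

func main() {
	// Simulated agent: one Major issue on pass 1, clean afterwards.
	fmt.Println(runReviewLoop(func(pass int) Review {
		if pass == 1 {
			return Review{Major: 1}
		}
		return Review{}
	}))
	// Simulated agent that never comes back clean.
	fmt.Println(runReviewLoop(func(pass int) Review { return Review{Critical: 1} }))
}
```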
api-final-review: Four Parallel Agents
When API development wraps up, the final review workflow runs. Four specialist agents perform reviews in parallel:
```
┌─────────────────────────────────────────────────────────────┐
│                      api-final-review                       │
└─────────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
        ▼                   ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ 1. Checklist    │ │ 2. Shell/Fixture│ │ 3. Codex Agent  │
│    Agent        │ │    Agent        │ │                 │
└─────────────────┘ └─────────────────┘ └─────────────────┘
        │
        ▼
┌─────────────────┐
│ 4. SQL Agent    │
└─────────────────┘
```
Actual setup from AGENTS.md:

```markdown
# Parallel Agent Setup

The four specialist agents used in the API final review.

## 1. Checklist Agent

Verifies the checkbox-completion state of the plan document.

Output format:
- Completion rate: X/Y (Z%)
- Incomplete items: (list, if any)
- Verdict: ✅ Complete / ❌ Incomplete items remain

## 2. Shell/Fixture Agent

Checks that the scripts and fixture files needed for integration tests exist.

Paths checked:

tests/
├── scripts/{feature}/test_{api_name}.sh
└── fixtures/{feature}/{api_name}/
    ├── valid_request.json
    └── invalid_request.json

## 3. Codex Agent

Tier-based review:

| Tier | Target | Review Depth |
|------|--------|--------------|
| T1 | Domain Entity | Strictest |
| T2 | Application Service | Strict |
| T3 | Infrastructure | Standard |
| T4 | Handler | Standard |

## 4. SQL Agent

Reviews database query performance and optimization.

| Item | Description |
| ----------------- | ---------------------------------------- |
| N+1 query | Per-row queries inside a loop |
| Missing index | Indexes needed for WHERE, JOIN clauses |
| Over-fetching | SELECT *, unnecessary columns |
| Transaction scope | Right boundaries set |
```
Because these four run in parallel, total time drops a lot. A sequential pass that takes 10 minutes finishes in about 3 minutes when run in parallel.
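The fan-out itself is ordinary concurrency. Here's a rough Go sketch of the shape, with stand-in agent bodies (the real agents are Claude sub-agents driven by markdown, not Go functions):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// runParallel fans the review agents out as goroutines and waits for
// all of them — the same shape as the diagram above. Total wall time
// is the slowest agent, not the sum of all four.
func runParallel(agents map[string]func() string) []string {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results []string
	)
	for name, run := range agents {
		wg.Add(1)
		go func(name string, run func() string) {
			defer wg.Done()
			out := run()
			mu.Lock()
			results = append(results, name+": "+out)
			mu.Unlock()
		}(name, run)
	}
	wg.Wait()
	sort.Strings(results) // deterministic ordering for display
	return results
}

func main() {
	// Agent names are from the post; the result strings are made up.
	agents := map[string]func() string{
		"checklist":     func() string { return "12/12 complete" },
		"shell-fixture": func() string { return "fixtures present" },
		"codex":         func() string { return "0 critical" },
		"sql":           func() string { return "no N+1 found" },
	}
	for _, r := range runParallel(agents) {
		fmt.Println(r)
	}
}
```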
Deployment-decision criteria
| Critical | Major | Minor | Verdict |
|---|---|---|---|
| 0 | 0 | 0-2 | ✅ OK to deploy |
| 0 | 0 | 3+ | ⚠️ Recommend Minor cleanup |
| 0 | 1+ | - | ❌ Major fixes required |
| 1+ | - | - | ❌ Critical fixes mandatory |
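The table translates directly into a small decision function. A sketch in Go mirroring the rows above (`verdict` is illustrative, not actual tooling):

```go
package main

import "fmt"

// verdict implements the deployment-decision table: Critical beats
// Major, Major beats Minor, and up to two Minor issues are tolerated.
func verdict(critical, major, minor int) string {
	switch {
	case critical > 0:
		return "Critical fixes mandatory"
	case major > 0:
		return "Major fixes required"
	case minor >= 3:
		return "Recommend Minor cleanup"
	default:
		return "OK to deploy"
	}
}

func main() {
	fmt.Println(verdict(0, 0, 1)) // OK to deploy
	fmt.Println(verdict(0, 0, 4)) // Recommend Minor cleanup
	fmt.Println(verdict(0, 2, 0)) // Major fixes required
	fmt.Println(verdict(1, 0, 0)) // Critical fixes mandatory
}
```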
Code Review Guide
Sometimes the code review itself is the hard part, so I wrote a guideline for the order to read code in and what to focus on.
The T2 (Core) review checklist I actually use:
## T2 (Core) Review Checklist
### UseCase Implementation
| Check | Question |
| ---------------------- | -------------------------------------------- |
| Pure Functions | Are helper functions pure (no side effects)? |
| Explicit Mutations | Are all mutations visible in main method? |
| Error Wrapping | Are errors wrapped with context? |
| No Business in Helpers | Is business logic in Execute, not helpers? |
### Pure Function Verification
```go
// GOOD: Pure function (data in, data out)
func buildListOptions(input *ListPlansInput) plan.ListOptions {
	return plan.ListOptions{
		Limit:  input.Limit,
		Status: input.Status,
	}
}

// BAD: Impure (modifies input)
func buildListOptions(input *ListPlansInput, opts *plan.ListOptions) {
	opts.Limit = input.Limit // mutates!
}
```
With a guide like this, Claude reviews against the same criteria, and I also know exactly where to focus when I look at the code.
How to Get Started
People keep talking about how they configure Claude Code, how they set up sub-agents, who follows whom, who asks what. Don't let the FOMO push you around. Just install Claude Code right now and ask it directly what it can do. Then ask it about the things you want to do, and build your own system from there. That process itself is what will give you survival skills in this fast-moving AI era.
Here’s how I’d start:
- Make a `.claude/` folder. Just one folder at the project root.
- Create a simple command in `commands/`. One markdown file is one command.
- Run it as `/command-name`. Claude Code runs it directly.
- Extend as needed. Agents, skills, templates — grow it from there.
You can also just ask Claude Code to do all of the above for you.
Most of the workflows, commands, and agents I’ve built were made by asking Claude Code while I used it. Don’t try to build a perfect system from day one. Start small and add as you need. Even this blog post went through a multi-agent workflow that takes the rough draft I dumped out and polishes it.
Closing
One thing I learned during my CTO years. Delegate but own quality. Hold the whole picture but trust the details. I couldn’t read every line of code, so I focused on the important parts and trusted teammates with the rest. The same thing applies when working with the AI now. Treat the AI like a capable junior developer. Give clear instructions, review the output, give feedback when needed, hand off the next task when it does well.
Looking back, this is a leadership skill. Organize complex requirements before passing them on, focus on the key decisions, delegate the execution. In the AI era, I think we’re all going to work this way. Whether I type the code myself or hand it to the AI, in the end I’m the one who defines what to build and owns the quality. Picking up that sense might be the most important skill right now.
If I’m being honest, I could only run experiments like this because I left the company last year and started a personal project. At work, between meetings, planning sync, and reviewing teammates’ code, I had no room to dig into this kind of thing. Working alone, with no one to second-guess, I could try things freely. OpenCode being blocked is a small bummer, but I’m more excited than worried about how AI evolves through 2026.
What matters in the end is finding your own approach. Vibe coding, augmented coding, whatever style. As long as you don’t lose the joy along the way. If chasing productivity costs you the fun of coding, that’s the real loss. As long as making things with code still feels good to me, I’ll adapt to whatever tool comes next.