Why Qwen 3.5 Flash instead of GPT or Claude?

This task is a single-request pattern: send one image, get one structured JSON response back. For that pattern, Qwen 3.5 Flash gave us structured-output quality on par with other lightweight peers, while costing dramatically less and responding faster. Multi-turn performance drops off sharply, but multi-turn wasn't required here, so the weakness didn't matter.

Isn't a sync API enough? Why did you need the async job system and SSE?

A sync API takes tens of seconds for three images, and on mobile, holding an HTTP connection open for 30+ seconds drops it depending on network conditions. The async job system returns instantly and pushes the work to the background. SSE then streams real-time progress (preview), turning dead wait time into an experience with momentum.

How did you handle the model's nondeterminism?

I built a test set from 30 real books and ran canonical-fixture-based batch integration tests over and over. Failure patterns surfaced: missing chapters, Roman numeral misreads, depth misjudgments, page-order drift. I patched the prompt and post-processing pipeline against each one. Quality came from the test set and the regression-prevention loop, not from swapping models.

Between a Working Feature and a Trustworthy Product: Building ToC Recognition

Opening

The biggest source of resistance in any mobile app eventually comes down to input. For utility apps especially (anything that isn’t pure content consumption), input friction is the number-one cause of churn, regardless of category. A budgeting app needs spending entries. A productivity app needs schedules and to-dos. It’s all input. Models have gotten dramatically smarter in the AI era, but the user-specific data still has to come from the user.

The thing is, a B2C mobile app has no way to know that user data on its own, and we’re not Google or Meta sitting on a personal data trove. So even in the AI era, mobile apps still have to win on UX, and the problem of collecting user-specific data with the lowest possible friction and the highest possible accuracy is still wide open.

The moment you make users type in a table of contents, this feature is dead

I first tried to solve this problem five years ago. It was the second version of the service I’m currently rebuilding. (We’re on version three now.) To give users a personalized study plan, we needed the table-of-contents data (ToC) of the book or workbook they were trying to read. Page count alone gave us a similar feature, but to take it further, the ToC mattered. Common sense tells you that asking someone to type in a book’s ToC by hand, on mobile no less, is an instant exit ramp.

Pre-collecting ToC data (feat. QWEN)

The first thing I tried was building a pre-collected ToC database. Long before the work I’m describing in this post, I’d already poured hours into it. Unlike Korean book sites, most overseas book sites (especially English-language ones, which are our current primary market) just don’t surface ToC data at all. (And Korean book APIs don’t expose ToC either.) On top of that, since we aren’t tied to any single publisher, there was no easy way to build a generic API-based scraper.

The second card I played was AI. That AI ended up being the protagonist of this post: Qwen 3.5 Flash. I built a broad ToC collection pipeline on top of it. (More on the model itself later.)

I poked around qwen.ai with a handful of reference books and saw that collection actually worked surprisingly well, so I went straight into building against the LLM API directly. I started with OpenRouter. The pricing was similar and it gave me model-portability, but the Qwen models on OpenRouter were just vanilla weights with none of the toolchain options. (Vanilla model results were brutal.) I migrated to Alibaba Cloud and re-wired the API through DashScope. DashScope had the full Qwen toolchain (web_search, web_extract), and on top of those tools I could build a pipeline that pulled ToC data with reasonable accuracy.

I built a pipeline where you’d input an ISBN13 and it would automatically collect both the metadata and the ToC. For something like a 7-volume bundle in the test set, pulling the entire ToC at once would blow past the token limit, so I split it: collect chapters at a coarse grain first, then run a second pass for sub-chapters. The test set wasn’t huge, but a pipeline that accurately collected large amounts of ToC data across many genres was finally working.

I’d ended up with a collector that took an ISBN13 and spat out a depth-aware ToC as JSON. I’m describing it briefly here, but the LLM’s inherent nondeterminism meant I needed multiple guardrails, and I think I pulled two or three all-nighters straight to get the pipeline standing.

But I never got to actually run real data collection through it. The biggest problem was cost. The testing alone burned more than $2,000. That’s separate from Qwen 3.5 Flash’s cheap token rate. The killer was the web_search tool-calling cost. Qwen’s web_search tool-calling has no built-in compact step, so every byte that flows in through web_search counts straight against your token bill. I’d picked the model based on token pricing alone and never thought about toolchain or side-effect costs, so the bill caught me off guard.

You can’t predict what books a user will request, so you have to collect as widely as possible, and to run this at production quality you also need a periodic verification pipeline. The cost could multiply many times over. The data set you need to cover is effectively infinite (yes, there’s a long tail, but if the niche data isn’t there, that user bounces immediately), and the collection cost is enormous. I kicked off a full verification test through Codex and went to bed. I woke up to a bill for the equivalent of three million won. I was wrecked.

I didn’t quit there. I tried building custom skills inside subscription-based Codex and Claude Code, but maybe because they aren’t API-mode models, the results were poor despite the much stronger underlying model performance. The client-side skills and plugins (Playwright and friends) couldn’t keep up with Qwen’s native toolchain on web_search. When I bolted a Playwright skill onto the GPT-5.3-Codex-Spark model, Chrome devoured all the memory and my M4 Max maxed-out MacBook locked up for the first time ever.

This wasn’t a pure technical failure. It was the first lesson that no technology becomes a product if you don’t think about operating cost and data coverage at the same time. Three days of flailing, $2,000 in cost (billed during a weak-won stretch, which made it 3 million won), plus the fact that even with a pre-built database, niche user-specific ToCs would still be a blind spot. All of that stacked up, and I shut the project down.

Letting users do the recognition themselves

In the end, the user inputs it

After paying a fairly steep tuition, I landed back at square one. “The user inputs it.” Second-best, but not a bad call. Obviously asking users to type in every line of the ToC is absurd. OCR has gotten a lot better, so the user can just take a photo. As a bonus, that data becomes their own.

The problem wasn’t text input. It was structure input. A ToC isn’t a flat text list, it’s a graph of nodes with depth levels. Snapping a photo doesn’t get iOS VisionKit to recognize the structure. Compared to older OCR models, raw text recognition was strong, and it could even handle moderately structured documents. But that “moderately” produced the worst possible experience for the user.

Why this didn’t work before LLMs

Like I said, this wasn’t my first attempt. Five years ago I tried OCR + normalization in a handful of ways. The OCR libraries and services back then could already pull a flat list. But what I actually needed was:

Part / Chapter / Section (and possibly more depth)
depth
page
hierarchy

What I needed wasn’t a slab of text but a hierarchy.

Back then I also tried using language models to build a classification system. Compared to today, we were in the very early days of language modeling. Transformers and attention had just started showing up in real products, so I tried building a language model on GCP’s ML platform by collecting as much ToC data as I could. The idea was to feed a flat-list ToC into the language model so it could learn each item’s distinctive pattern, then return a structure given a flat text list. But once the text was already converted to a flat list, the model had to infer the hierarchy from scratch, and between case diversity and lack of training data, it couldn’t solve the problem at all.

Line breaks, indentation, numbering schemes, mixed Roman/Arabic numerals (every structural signal you’d want) all got flattened the moment OCR touched them. The extra requirement of matching page numbers, I never even attempted.

The hard part of ToC recognition wasn’t OCR text accuracy. It was structuring. Not reading characters but reading the hierarchy between characters. And solving it solo back then was an extremely inefficient use of time. The ROI just wasn’t there.

Why it works now

Major models like GPT, Claude, and Gemini have gotten enormously better. The reason it’s still hard to ship a real AI-powered service is API cost. I subscribe to the $200 GPT Pro plan, but if I’d been paying for the same Codex usage at API metered rates, I’d be staring at a bill in the thousands of dollars.

Most people overlook this. Because we’re always running on SOTA models, people say things like “agentic engineering matters, prompt engineering doesn’t.” But if you’re wiring an LLM API into a real product and need the unit economics to work, you cannot use a SOTA model. You’re stuck with API models from one or two generations back. All of this comes down to cost.

On February 23, 2026, Alibaba Cloud announced Qwen 3.5. As part of that release they shipped Qwen 3.5 Flash. The Pro model is usually compared to Claude Sonnet, and Flash sits well below Pro. Multi-turn performance falls off a cliff, and like older-generation models, it answers single requests well. But on top of the web_search and web_extract tools I mentioned earlier, Vision was also baked in. For simple tasks, it’s blazingly fast and accurate, with API cost that’s overwhelmingly cheaper than the competition. (Chinese-origin models raise privacy concerns, but Qwen also ships local open-source weights separately.)

Why Qwen 3.5 Flash

Below is a comparison of each vendor’s lightweight (Flash/mini/nano) line. Same-tier comparison feels fair. (Official pricing as of February 2026.)

	Claude Haiku 4.5	GPT-5-nano	Gemini 3 Flash	Qwen 3.5 Flash
Tier	Lightweight (Haiku)	Lightweight (nano)	Lightweight (Flash)	Lightweight (Flash)
Input (per 1M tokens)	$1.00	$0.05	$0.50	$0.10
Output (per 1M tokens)	$5.00	$0.40	$3.00	$0.40
Vision (image recognition)	Excellent	Excellent	Excellent	Sufficient
Structured JSON output	Excellent	Good	Excellent	Sufficient (single-request)
Speed	Fast	Very fast	Fast	Very fast
Multi-turn performance	Good	Average	Good	Drops off sharply

Looking at the table alone, GPT-5-nano has a unit-price edge over Qwen on Input. But real operating cost doesn’t come down to Input pricing alone. If structured JSON output quality is poor, you accumulate retries, post-processing, and fallback calls to other models, and that piles up as hidden cost. For this task, GPT-5-nano’s structured output was only “Good,” while Qwen 3.5 Flash delivered structured output stable enough to pass actual production tests on a single-request basis. Since the core pattern wasn’t a complex multi-turn dialogue but rather “send one image, receive one structured JSON,” that gap was decisive.

The ToC recognition workflow doesn’t demand multi-turn either. The user takes a photo of a book or workbook page, and the system needs to receive that image once and produce a JSON ToC structure. In this scenario, the reliability of a single “vision + structured output” shot matters far more than multi-turn reasoning. Qwen 3.5 Flash gave us results that were more than satisfactory on this single-request + structured-output combo against same-tier models, and that became the core argument for choosing it. (This doesn’t mean Qwen Flash is the best at every task, just that it was a strong fit for this specific one.)

Speed matters too. When a user takes a camera shot of a page and waits for the result, response time is UX. With the same image, Haiku 4.5 took roughly 10 to 15 seconds, while Qwen 3.5 Flash returned in 5 to 8 seconds. Nearly twice as fast in feel. On cost, Qwen Flash also sits comfortably on the cheap end of the lightweight tier. Measured against this task’s requirements (lightweight, cheap, fast, single-request structured output), it was harder to find a reason not to use Flash.

The remaining worry was OCR/vision quality. Honestly, I was skeptical at first that a Flash-tier vision model could handle real book photos with uneven lighting, page curvature, and tiny print. The actual tests showed that text recognition itself was practical. The harder question wasn’t recognition rate but “how do you structure and emit it,” and that part was prompt design and post-processing territory. When the model gives you 80, you fill in the remaining 20 with engineering. (That sentence is the thesis of this whole post.)

Why DashScope

I started by wiring it through OpenRouter. Same model, brutal results. Turns out it was a completely different beast from running DashScope native. With the vanilla model, the same prompt produced unusable output, but on DashScope, web_search, web_extract, and Vision were all attached as native toolchain. The fact that the same model could feel that different across platforms was a shock. It had been the deciding factor for the collection pipeline, and it was the same story for the recognition pipeline.

It ran reliably, and the cost was predictable. DashScope has clear region-by-region pricing and a free quota. The scariest thing about wiring an LLM API into a commercial service is “I don’t know what this month’s bill will be.” DashScope had less of that uncertainty. The Singapore region has the latest models available immediately, so I knew what each call would cost.

Stability was the other reason. Alibaba Cloud is an infrastructure company and DashScope is a service layer on top of it, so at minimum I could worry less about the API just disappearing. Add an extra proxy layer and you add an extra failure point. I’d already lived through one OpenRouter to DashScope migration during the collection phase, so this time I went straight to DashScope.

If you’re considering an LLM API integration, it’s worth testing this. Beyond the US-origin SOTA models, China-origin models like Kimi, Qwen, and GLM are worth a look. The fact that the same model can produce completely different results depending on the platform you run it on is something you only learn by experiencing it firsthand.

From a working feature to a trustworthy product (build log)

The rest is the actual development flow. You don’t need to follow every technical detail. But I hope you walk away with a feel for why this had to be this complicated. What I really want to convey isn’t the technical detail itself, it’s how far the distance is from “it works” to “you can trust it.”

Do Work: wired up the sync API first

At least the parsing works.

I didn’t build the right structure from day one. I built a working version first.

I started with the simplest possible flow. iOS takes a photo and uploads the image to the server. The server sends the image to the DashScope API, gets the JSON back synchronously, and pipes it down to iOS. The prompt included bookTitle and totalPages as hints. Telling the model the book title gives it more context, and telling it the total page count makes page-number inference more accurate. A small hint like that turned out to make a surprisingly large difference in result quality.

I still remember how the first test felt. I sent in a single photo and looked at the JSON that came back, and the Chapter, Section, depth, and startPage were all picked up pretty cleanly. The “wait, this actually works?” moment. After burning $2,000 on the pre-collection pipeline, that moment was honestly emotional.

Multi-image mattered more than I expected

At first I assumed “one photo and we’re done.” But when you actually open a real book’s ToC, more often than not it doesn’t fit on one page. Technical books and workbooks especially can spread their ToC across 5 to 6 pages (one book in the test set was 10 pages). Multi-image parsing wasn’t optional. Without it, the feature was unusable.

The catch is that merging the parse results from multiple images isn’t a simple concat.

First, you have to preserve image order. There’s no guarantee the user shot page 1 first.

Second, you need chapter merging. The last chapter of image A and the first chapter of image B might be the same chapter. Chapter 3 might start at the end of the first photo and continue with sub-sections in the second. Treat that as a duplicate and you get two chapters. Ignore it and you lose the sections.

Third, dedupe and startPage-based sorting. The same chapter can appear in multiple images, and page numbers can overlap.

Fourth, warning handling. If the model judges that an uploaded image isn’t actually a ToC, it has to return not_toc. If chapter count is abnormally low, it has to surface too_few_chapters. If it had to force-adjust page order, it should send page_order_adjusted. Without those warnings, a quietly returned result means the user ends up using bad data.

The moment multi-image entered the picture, prompt design, merge logic, dedupe rules, sorting algorithms, and the warning system all jumped a level in complexity. It didn’t take long to realize how naive “one photo and we’re done” had been.

It worked, but it wasn’t a product

The feature ran. Send three images, get a merged ToC JSON back. Accuracy was decent. But there was a fatal problem. Three images meant tens of seconds before the model responded. During that time, the HTTP connection stayed open, a server worker stayed pinned, and the user stared at a blank screen.

Worse was the timeout. On mobile, holding an HTTP connection for over 30 seconds got you dropped depending on network conditions. Dropped meant starting from scratch. The model invocation cost was already spent and the result was gone.

This is a feature, not a product. At a demo people might say “oh wow,” but ship it to real users and they’ll use it once and never again.

Good: splitting it into an async parse-job, finally felt product-shaped

It becomes a feature you can wait on.

Sync requests were stressful for both server and user. I couldn’t shrink model latency. So I had to change how you wait.

Evolving into a job system

I switched to a parse-jobs async model. The flow:

Client uploads images and requests parsing
Server immediately creates a job and returns a jobId (under a second up to here)
Actual parsing runs on a background worker
Client polls status periodically using the jobId

That switch alone changed the user experience drastically. Sending the request now returns “accepted” instantly, and the app can render a “processing” UI. The feature evolved from “model invocation” into a “job system.”

Deduplication, not just async-ification

Going async wasn’t the only change. If a job was already in flight or completed for the same image set, I reused the existing job instead of creating a new one.

The user accidentally sends the same photo twice → 2x model cost
Network issues cause the app to retry → two duplicate jobs
The user wants to see “that result from earlier” → just look up the existing one and return it

Whether the result was new or reused, the client got 200 either way, with the reused flag spelled out. The client only needs to know “I got 200, so I can take the jobId and start querying status.” (Debugging needs the flag, so we keep it visible there.)

This is a decision tied directly to cost. LLM API calls aren’t free. Run this without dedup and your costs become unpredictable.

Failure contracts: being good only on the happy path isn’t enough

On the iOS side, I didn’t stop at “show it on success.” The failure contract had to be explicit.

If we don’t get HTTP 200, fall back to the local path. The app has to keep working even when the server is down. Showing the user a “server error” message is the worst option. Better to offer “automatic recognition failed, please input it manually” as an alternative. (Sure, manual input is the worst UX, but it beats showing a raw error message.)

We were lucky to have Apple VisionKit sitting on-device as a local MLKit, so even the fallback was a notch better than pure manual input. (It can’t structure things, of course.)

A backend isn’t a service that’s “fine when things are fine.” You have to agree on how the client reacts when things fail too. That isn’t an API spec. It’s a product contract.

Good, but still not enough

If Do Work was “a feature that runs,” Good was the stage where it became “a feature you can wait on.” But the user still saw nothing during that wait, with no idea when it would end. Polling that only said “still processing” wasn’t an experience worth shipping in 2026.

I could have stopped at “this is good enough.” Plenty of services do stop here. But what do you do with the time the user spends waiting? That, to me, is what separates Good from Great.

Great: real-time experience, load distribution, operational risk all baked in, and finally production

It becomes a product experience you can trust.

Adding SSE turned the feature into an experience

Polling had clear limits. The user has no idea “where we are right now.” Tight polling intervals raise server load. Loose intervals raise perceived delay.

So I introduced SSE (Server-Sent Events). The client opens a connection and the server pushes events in real time.

snapshot: full current state at connect time
status: job state changes (queued → processing → completed/failed)
preview: progressively recognized ToC structure as parsing runs
heartbeat: signal that the connection is alive
usage: token-usage and other meta info
completed: final result confirmed
failed: failure and error info

The biggest UX win was preview. It shows the model recognizing the ToC in real time. Chapter 1 appears first, Section 1.1 nests underneath, the next Chapter is added. Like ChatGPT streaming an answer character by character, the ToC progressively assembles itself in front of you. The moment that landed, the “feature” became an “experience.” The wait shifted from boring dead time into something closer to anticipation.

At first this looked simple enough that “just show it as it comes out” felt sufficient. But once I actually built it, the hard part started right after.

preview and truth had to be separated

Showing things in real time and storing things you can trust are two different problems.

I learned this when I tried using preview directly as the final result and things blew up. preview has no nodeId and no order. Enough to render in the UI, sure, but downstream (plan generation, checkItems conversion, user customization) needs metadata that preview doesn’t carry.

So I set the rules:

preview is for UI. Non-authoritative.
Only the final result is truth. Stored in the DB, accessed only via the GET parse-jobs/{jobId} read path.
preview and the worker queue stream are separated.

Treat preview as truth and you can generate plans from incomplete, mid-parse data. Don’t separate them, and what happens? You showed the user “Chapter 3 recognized,” but actually there were 5 chapters, and now their plan is incomplete. Display and storage cannot share the same channel. Obvious in hindsight, easy to miss in practice.

The hard part wasn’t SSE, it was the preview emit policy

Wiring up SSE itself isn’t hard. The hard part comes after. How often, and on what kind of change, do you emit a preview?

Every preview emission costs you twice: storing it in the latest preview cache, and appending it to the event stream. Emit a lot and the UX is flashy but server writes explode. Emit too little and there’s no point in having SSE at all. If the screen is empty for a while and then the whole result suddenly appears, how is that any different from polling?

Mobile environments make it worse. Connection drops on the subway, drops switching from Wi-Fi to LTE. Every reconnect makes the server query the latest preview and replay the prior events. The more reconnects pile up, the heavier the server-read load. Real-time delay might not even be the bigger problem.

So at the PreviewAssembler level, I built in an adaptive emit policy:

throttle: minimum emit interval. Too often and the server suffers.
chapterThrottle: emit only on chapter-level changes. A single new section gets held.
maxSilence: if nothing is sent for too long, send a preview as a heartbeat substitute. Keeps the user from worrying that things have stalled.
fingerprint comparison: hash against the previous preview and emit only when there’s an actual change. The model sometimes repeats the same content.
pending preview hold: if emit conditions aren’t met, hold and send only the latest at the next emit point.

The strategy is two-tier. Incremental strategy keeps things cheap by default, and when it breaks, whole replay fallback recovers safely. As long as incremental processing holds, we save cost; once state gets tangled, we resend the whole thing to guarantee consistency.

What you see in real time looked flashy, but the actual hard problem was making it look better while sending less. SSE quality wasn’t about whether the connection was up. It was about how carefully you let preview flow through.

Real-time UX forced server design to start over

SSE isn’t a pretty technology. It’s a different way of managing wait and load. Adding SSE meant redesigning a substantial chunk of the server architecture.

Separating worker queue from SSE event stream. At first I tried to reuse the existing worker queue’s Redis Stream as the SSE source. But that couples the worker’s processing unit to SSE’s emit unit. If the worker speeds up, SSE over-emits. If the worker slows down, SSE feels sluggish. These two need to be independent. I split the Redis event stream onto a separate key.

Replay, reconnect, Last-Event-ID. On mobile, reconnects aren’t an exception, they’re normal. When reconnecting, sending Last-Event-ID lets the server replay only events after that point. Without it, every drop sends the user back to the start. Same problem we hit in Do Work, repeating itself at the SSE layer.

Nginx config. proxy_buffering off, X-Accel-Buffering: no. Skip these and Nginx quietly buffers SSE events and ships them in one batch. You built a “real-time” thing and got a delayed batch. Miss this and SSE stops meaning anything.

Graceful shutdown and deploy. I set shutdown timeout to 30 seconds and used blue/green for cutovers. The deploy goal wasn’t “zero downtime,” it was a system that can reconnect. Perfect zero-downtime costs too much. A system that recovers when it drops is enough. Even on disconnect, the client auto-reconnects, and the snapshot-first + replay + live tail structure restores prior state.

Full SSE session flow. When a session opens: query the job → query the latest preview → assemble the snapshot event → replay → live tail. Those five steps repeat on every connection. UX got better, but as reconnects pile up, that initial handshake cost accumulates. Recent optimization focus has shifted from “add more SSE” to controlling preview emit frequency and reconnect cost. Real-time UX isn’t free. Behind the flashy display is a cost the server has to absorb.

Real quality gains came from the 30-book test set, not from swapping models

The system was complete. But to push it to production grade, “it works” wasn’t enough.

I ran 30 real books through it. Different fields (CS, math, literature, business), different publishers (O’Reilly, Pearson, Korean publishers), different layouts. Failure patterns started to surface.

Missing chapters: model skips entire chapters → prompt reinforcement + missing detection in merge-normalize
Roman numeral misreads: confuses Part Ⅳ with Part IV → numbering normalization rules patched
Depth misjudgment: promotes a Section to a Chapter or demotes one → depth correction logic added in parser/post-process
Page order drift: image order doesn’t match actual page order → startPage-based sorting reinforced
Line break / indentation confusion: OCR reads line breaks as structural separators → spelled out in prompt
Debugging difficulty: same image, nondeterministic results → live integration tests + detailed logging

Accuracy didn’t go up just because I swapped models. Quality went up while building the test set and blocking regressions. Build a canonical fixture, run the batch integration tests, reproduce failed cases as live integration tests, edit the prompt builder, run the whole thing again. I repeated that loop dozens of times.

There’s a commit message that just says “dev environment debugging support,” and that one was central. You can’t raise quality unless you can reproduce real-world failures with real data. A bug you can’t reproduce is a bug you can’t fix. That part isn’t the model’s job. It’s the engineer’s.

Tests were green, but production handed me a release blocker

We passed the 30-book test set and the SSE flow was stable. Tests were all green. I was ready to ship.

Then the final review surfaced release blockers.

Permanent wait on terminal event loss (critical): if completed or failed gets dropped, the client stays “processing” forever. The user has to force-quit the app.
Missing event timestamp/message: replay order can scramble.
Redis MAXLEN unset: events accumulate without bound. Memory creeps up until one day Redis dies.
No retry on worker failure: if the DashScope API fails, the job stays “processing” forever.
preview partial JSON repair vulnerability: incomplete JSON could be sent down as is.

Any one of these going off in production seriously breaks the user experience. Tests verify “does the happy path work.” Production demands “are the unhappy paths also safe.” After fixing every release blocker, writing the manual runbook, building the deploy checklist, and validating the actual SSE flow with a live integration transcript, only then did “okay, ready to ship” feel earned.

The last piece of the pipeline: hierarchy selection UX

You’d think we were done at this point. There was one more piece.

No matter how accurately the model extracts the ToC structure, you can’t just hand the raw result to the user. Different users need different depths. Some only want the Chapter level, others want it down to Section.

So we needed hierarchy selection. Default behavior:

Show leaf nodes (the deepest items) selected by default
Toggling a parent node selects/deselects everything underneath
The default state should already match “what most cases want”

Photo to structured JSON to hierarchy selection to Items generation

The whole thing is a single pipeline. The model is good at photo to JSON. Everything else is engineering.

What is product engineering in the AI era?

As you’ve probably gathered, this isn’t simply an OCR feature. It’s an engineered flow: photo to structured JSON to hierarchy selection to Items generation.

On the surface, sure, this work could look like nothing more than wiring up an LLM API and slapping SSE on top for streaming UX. You might wonder, “isn’t everyone doing this these days?” What I want to show through this process is the work of taking a slow, nondeterministic, expensive LLM call and turning it into a product experience the user can wait on, can understand, and can trust.

What got solved here:

The UX problem of reducing mobile input friction
The recognition problem (it’s structuring accuracy, not OCR)
The job-system problem of fitting slow LLM responses into a product flow
The real-time UX problem covering polling, SSE, and reconnect
The data trust problem of separating preview from truth
Operational problems like emit frequency, Redis writes, and reconnect cost
The product problem of fallback on failure and the user contract

A product engineer’s persistence isn’t proven by reaching for big-name technologies. It shows up in chasing one small-looking feature all the way down until the user feels no friction.

ToC recognition is a tiny part of the product I’m currently building. It’s not even a required feature. In an earlier era I might have cut it from the early roadmap. (And it would have lived in the backlog forever.) If someone had told me they’d build it to this depth back then, I’d have argued them out of it. The cost wouldn’t pencil out. But in today’s AI coding agent era, this is something you can build in a day, or even a few hours. At that price, it’s worth paying.

So what is product engineering in the AI era? The core is going beyond Do Work, through Good, all the way down into Great. Not stopping at “this is good enough” (Good), but pushing into the maximum of detail (Great). That, I think, is what AI-era product engineering is.

No matter how well you build the harness, no matter how strong the agent gets, the one thinking and deciding is still the engineer. I built this mostly with Codex, leaning hard on the Superpowers skill (Codex’s autonomous-execution mode). But the implementation flow itself, requirements analysis, implementation, optimization, was something I kept digging into and deciding on personally.

Just because I’m not writing the code by hand doesn’t mean Craftsmanship disappears. The willingness to step back when “I think we’re done” comes up, re-check everything from the start, re-test, and push the last 2%. That’s what’s needed.

Closing

When you first learn to code, it’s genuinely thrilling. The people getting in through vibe coding right now probably feel the same way. I’ll never forget the feeling of building the first service that did exactly what I’d intended. (Back in 2011 I built my own web Dropbox clone in Ruby on Rails. The pride.)

But there’s an enormous gap between a working feature and a product you can trust. AI model integration, the topic of this post, is especially seductive at the start. You test the model in the vendor’s playground, see real performance, and feel “I could build anything with this.” You’re ecstatic. Then you crash. Between LLM API costs and the realities of running things in production, the work I assumed would be “just plug in the API” turned out to need 80% more thought and trial and error.

I don’t think the product I built is Great. But when the user tells you “this is awkward” or “this isn’t intuitive,” whether you ignore that point or hold onto it until the end, that, I think, is the fork in the road to Great.

In the era when development cost was high, “over-engineering” became an excuse to dismiss obsession with detail. But if you can build that detail in a day instead of three weeks, why give it up? If you have ten possible features and only ship three, but those three give the user an outstanding product experience, I’d argue that’s the better direction.

Even in the era where AI writes the code, when you look at the engineers building good products, they’re still pulling all-nighters. iOS development is hard for me. After years of Flutter, I picked iOS, and no matter how hard I beat on Codex, UI transition code can spiral once it goes off the rails. Infinite loop time. After spending a full day failing to fix a trivial issue, that’s when I finally regret skipping fundamentals, start adding logs, and search for similar demos. Like the old days. When AI can’t solve it, I have to find it. Like staying up all night thinking through preview emit policy.

I’m not the only one. The OpenClaw developer built and threw away 43 projects in 60 days before one hit. Look at the commit history from those 60 days and you’ll see 3 a.m. and 4 a.m. commits everywhere. Even in the era when AI writes the code, the engineers building real products are the ones writing “broken at least once, and eventually fixed at 1 AM while questioning my life choices.” The Zed editor team wrote a piece last year called “The Case for Software Craftsmanship in the Era of Vibes”, declaring that craftsmanship still matters in the vibe coding era. There’s even an Artisanal Coding(職人コーディング) manifesto out now. Summoning Japanese-style craftsmanship in this era. A bit much, maybe, and I still feel it.

AI coding agents reportedly raise productivity by 30 to 60 percent. True. I feel it too. But where you spend the time saved is the fork in the road. Stamp out more features, or push one feature deeper. I picked the latter, and I still think that’s right. (Of course, revenue has to prove that out, and that part I don’t know yet.)

Other people ship 10 apps a month, and I’m three months into building one product, having cut a substantial portion of the originally planned features. Even with a coding agent sitting next to you, engineering is craftsmanship. When AI returns a passable 80-point average, the remaining 20 points are on you.

References

9 Survival Skills for the Agentic Engineering Era: the 9 abilities Karpathy says engineers need in the agentic engineering era
Give Claude Code Wings: Introducing Superpowers: how to install Codex’s autonomous-execution mode Superpowers and the 7-step workflow
How a 15-Year CTO Vibe-Codes: pair-programming with AI based on Kent Beck’s augmented coding philosophy
AI Is Only as Smart as You Are: why two people using the same AI get different gaps. AI’s output is decided by your input
AI Agent Jarvis: Becoming My Second Brain: real-world notes on building a 24/7 AI agent with the OpenClaw framework
The Case for Software Craftsmanship in the Era of Vibes: software craftsmanship declaration from the Zed editor team (2025.06)
Artisanal Coding(職人コーディング): A Manifesto: a manifesto for software craftsmanship in the AI era (2025.10)
OpenClaw: One Developer, 43 Failed Projects: Peter Steinberger’s story of building and throwing away 43 projects before OpenClaw
Introducing Qwen 3.5: Alibaba Cloud’s official Qwen 3.5 Flash model announcement (2026.02.23)
DashScope Model Studio: AlibabaCloud DashScope API documentation and pricing

If you've read this far, leave a note. Reactions, pushback, questions — all welcome.

Between a Working Feature and a Trustworthy Product: Building ToC Recognition

TL;DR

Between a Working Feature and a Trustworthy Product: Building ToC Recognition

Opening

The moment you make users type in a table of contents, this feature is dead

Pre-collecting ToC data (feat. QWEN)

Letting users do the recognition themselves

In the end, the user inputs it

Why this didn’t work before LLMs

Why it works now

Why Qwen 3.5 Flash

Why DashScope

From a working feature to a trustworthy product (build log)

Do Work: wired up the sync API first

Multi-image mattered more than I expected

It worked, but it wasn’t a product

Good: splitting it into an async parse-job, finally felt product-shaped

Evolving into a job system

Deduplication, not just async-ification

Failure contracts: being good only on the happy path isn’t enough

Good, but still not enough

Great: real-time experience, load distribution, operational risk all baked in, and finally production

Adding SSE turned the feature into an experience

preview and truth had to be separated

The hard part wasn’t SSE, it was the preview emit policy

Real-time UX forced server design to start over

Real quality gains came from the 30-book test set, not from swapping models

Tests were green, but production handed me a release blocker

The last piece of the pipeline: hierarchy selection UX

What is product engineering in the AI era?

Closing

References

FAQ

Tony Cho

댓글

댓글 남기기

Legacy comments (Giscus)

Between a Working Feature and a Trustworthy Product: Building ToC Recognition

TL;DR

Between a Working Feature and a Trustworthy Product: Building ToC Recognition

Opening

The moment you make users type in a table of contents, this feature is dead

Pre-collecting ToC data (feat. QWEN)

Letting users do the recognition themselves

In the end, the user inputs it

Why this didn’t work before LLMs

Why it works now

Why Qwen 3.5 Flash

Why DashScope

From a working feature to a trustworthy product (build log)

Do Work: wired up the sync API first

Multi-image mattered more than I expected

It worked, but it wasn’t a product

Good: splitting it into an async parse-job, finally felt product-shaped

Evolving into a job system

Deduplication, not just async-ification

Failure contracts: being good only on the happy path isn’t enough

Good, but still not enough

Great: real-time experience, load distribution, operational risk all baked in, and finally production

Adding SSE turned the feature into an experience

preview and truth had to be separated

The hard part wasn’t SSE, it was the preview emit policy

Real-time UX forced server design to start over

Real quality gains came from the 30-book test set, not from swapping models

Tests were green, but production handed me a release blocker

The last piece of the pipeline: hierarchy selection UX

What is product engineering in the AI era?

Closing

References

FAQ

Tony Cho

Related posts

댓글

댓글 남기기

Legacy comments (Giscus)