Scrumble Backend Tech Retrospective
Bring the human connection we're losing in the AI era back into the way we work. Scrumble is a daily-scrum-based team communication platform built around emotional bonds and mutual support between teammates.
That was the early direction of the project. For non-technical details, you can read Scrumble Project Retrospective (June–August 2025).
There were a lot of requirements, but the core features ended up looking like this:
- Workspace and member management (creation, invites, etc.)
- Check-in / check-out posts (the daily scrum)
- Feed list
- Real-time post comments and reaction emojis
- To-do list
- Notification system
Stats and reports, third-party integrations, and bot connections are still WIP, so the implemented list is what’s above. My current team uses this every day at the start and end of work, with active check-in scores, check-in posts, and comments flowing through it.
Looking inside the domain requirements, a few technical agendas stand out from the backend angle. The main three:
- Domain and schema relationships across workspace, member, post, comment, reaction, and to-do
- Feed list
- Real-time updates (seamless UX)
It’s a fairly typical SNS-plus-SaaS-platform setup.
Looking at my own list, it doesn’t look like much… but below I’ll share what I wrestled with while building each of these.
Tech Retro
Project Tech Stack
- Language: Go 1.23+
- Web framework: Fiber v2
- Database: PostgreSQL 15
- ORM: EntGo + Atlas migrations
- Cache: Redis 7 (real-time state management)
- Auth: JWT + Google OAuth (Goth)
- WebSocket: Centrifugo (real-time reactions/comments)
- Dependency injection: Wire (Google)
- Logging: Zap (structured logging)
- Dev tool: Air (hot reload)
```mermaid
graph TD
  subgraph Browser["Next.js"]
    UI["App Router / TanStack Query"]
    WS["WS Client (Centrifugo)"]
  end
  subgraph API["Scrumble API (Go Fiber)"]
    H["HTTP Handlers"]
    S["Application Services"]
    D["Domain"]
    R["Repositories"]
  end
  subgraph RT["Real-time"]
    CF["Centrifugo"]
  end
  subgraph Data["Data Stores"]
    PG[(PostgreSQL)]
    REDIS[(Redis)]
  end
  UI --> H
  H --> S --> D
  S --> R
  R --> PG
  S --> REDIS
  S --> CF
  WS --> CF
  CF --> REDIS
```
Fiber
The most common Go web framework is probably Gin, and I'd worked with Echo before this, but I'll be sticking with Fiber for a while. The biggest difference: Gin and Echo are built on net/http, so they follow Go's server standard, while Fiber is built on fasthttp and isn't. Fiber also has a smaller set of standardized core libraries than Gin. Still, the essentials are all there, and since it's Express-inspired, the initial setup is simple. People advertise it as faster, but once you attach a DB that depends on the situation and environment, and Gin and Echo are both plenty fast, so speed isn't my reason. (I haven't built anything serving the kind of traffic where it would actually matter.) Because it's fasthttp-based, the libraries for things like real-time differ a bit from the standard ones, but rolling your own isn't hard, and I'm using Centrifugo anyway, so it doesn't really come up.
If you’re picking up Go for the first time, I’d usually recommend Gin. But if you’ve enjoyed working with Express in Node.js/TS, Fiber is a fine choice too.
In Go, what the web framework does sits on the outer edge anyway, so as long as you’ve cleanly separated the handler layer, swapping frameworks isn’t a huge job. Honestly, going non-mainstream is just my taste, so take that with a pinch of salt.
Entgo + Atlas migrations
To be precise, the stack is Entgo + SQLC. About halfway through the project, I realized Entgo doesn’t issue JOIN queries, and that it leans more toward being a type-safe query builder than a real ORM. Given that my architecture already separated domain entities from Entgo schema entities, Entgo became more of a headache the deeper I got into the project. I’d been using gorm before that, but Entgo felt like it might be the new standard, so I introduced it without much thought. The result was a mess. The pain points piled up:
- No JOINs
  - The default behavior is the N+1 query pattern. If you write `client.User.Query().WithPosts()` in code, instead of joining, Ent SELECTs from users, then takes those IDs and runs a second SELECT against posts.
  - There are reasons for this. Turns out Entgo plays well with GraphQL (which I keep doing wrong, apparently), and it's designed so you build queries in a graph style. That design has its upsides: cleaner entity mapping, easier control over lazy/eager loading. But you end up firing N queries when one would have done the job, and that's a real performance hit. While trying to express it in Entgo's syntax, I kept thinking, why am I doing this when raw SQL would be one line? Eventually I went with CQRS and switched the Query repository to SQLC.
- Code-first schema migration
  - Code-first means you control DB tables from code. The problem is that PostgreSQL has a huge feature surface, and there are times when you have to wire those features up by hand in code. And while you're managing migration files, you'd like to keep a record of what changed. But because it's code-first, when Atlas reads my code to generate migrations and syncs the schema, the SQL migration files I'd written by hand get wiped.
  - For things like creating GIN indexes on JSON columns, I had to dig through the latest library code to figure out how Entgo supports it. Claude didn't have current info either, so it kept generating the wrong schema definitions, and I burned a lot of time on what I'll politely call grunt work.
- Mountains of generated boilerplate
  - Because it's type-safe and code-first, defining an Entgo schema and running generate produces a vast number of default files. I prefer not to have my searches turn up code I didn't write, so the constant hits on code that was, in a sense, someone else's, got under my skin.
Unless your ORM ships a lot of conveniences, like JPA, ActiveRecord, or Django ORM, or it’s the de facto standard, the whole point of using one is to make the object-mapped schema slightly easier to handle in code. For that level of value, Entgo had too many drawbacks for me. And in an early-stage project that doesn’t really need the complexity of CQRS, the basic query-performance issue alone forced me into it.
If I were running a tiny microservice with very simple entities, paired with GraphQL, no complex layering, sure, Entgo might be worth a look. But for that kind of simple server, do you even need it? Anyway, I picked it from day one and I’m seeing this project through with it, but I wouldn’t choose it again. Given my preferences, future Go projects will use schema-first SQL migrations and an ORM that runs JOIN queries by default. I don’t want to split things into Command Repository and Query Repository early on. (For reference, my current project uses Bun ORM and golang-migrate.)
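To make the round-trip arithmetic behind that complaint concrete, here is a toy simulation of the classic N+1 access pattern versus a single JOIN. This is not Ent's actual code, just in-memory maps standing in for the users and posts tables and a counter standing in for DB round-trips:

```go
package main

import "fmt"

// Toy "tables": user IDs and their posts.
var users = []int{1, 2, 3}
var postsByUser = map[int][]string{1: {"a"}, 2: {"b", "c"}, 3: {"d"}}

var queryCount int

// selectPostsFor simulates one SELECT ... WHERE user_id = ? round-trip.
func selectPostsFor(userID int) []string {
	queryCount++
	return postsByUser[userID]
}

// selectAllJoined simulates a single SELECT ... JOIN round-trip.
func selectAllJoined() map[int][]string {
	queryCount++
	return postsByUser
}

func main() {
	// N+1 shape: 1 query for users + one follow-up query per user.
	queryCount = 1 // the initial SELECT on users
	for _, id := range users {
		_ = selectPostsFor(id)
	}
	fmt.Println("N+1 pattern:", queryCount, "queries") // 4 queries for 3 users

	// JOIN shape: everything in one round-trip.
	queryCount = 0
	_ = selectAllJoined()
	fmt.Println("JOIN pattern:", queryCount, "query")
}
```

Each extra round-trip adds network latency, which is exactly what hurt later once the DB lived in a different region.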
Other pieces
I’ve stuck with PostgreSQL as the standard database. For real-time and caching, Redis (more on that below). Social auth is Goth. It’s not at Supabase’s level, but it’s pretty simple, just hook up the API and the keys. For DI I went with Wire, which works at compile time. That’s the standard pick in Go DI, nothing fancy. Logging is Zap, also a standard pick, and it does structured logging well. For the dev server I used Air. Considering Go is a compiled language, the experience of having it auto-compile and hot-reload on every code change feels almost like developing a server in a scripting language. When I worked in Spring, the slow startup was painful since I’d been spoiled by Ruby and Python server boot times, and it was hard to keep development flow going. (Maybe it’s faster now? I haven’t touched Spring in a while.) Go’s lightweight, fast feedback loop is genuinely great.
Architecture (DDD / Clean Architecture layering)
Scrumble is essentially a workspace-based SNS, so I needed an architecture that could grow. I aimed for one that pays off more in the middle of the project than at the very start, and I refactored the early implementation more than once.
The architecture follows Domain-Driven Design with the layering Interface (handler) <- Application <- Domain <- Infrastructure (repository+).
Interface handler functions and Application service functions are 1:1, and the business logic lives in Application, where I compose Domain entities and functions. I avoid building services as separate Domain structs (what would be classes in OOP). Leaning into Go’s package-oriented nature, I use package-level global functions instead. Repositories are also defined as interfaces in the Domain layer, so in practice Application does all the work via Domain entities + Domain package functions + Repository functions (interfaces).
Since the project leans into DDD, design always begins with Domain entities and value objects, then Repository interface definitions and use-case implementations (Application). The Infrastructure schema (Entgo schema structs) is built separately to mirror those, and the Repository implementation loads those schema structs and converts them into domain structs before handing them up to Application.
Early on I used to return domain objects from Application too, but partway through I redid the structure. On top of Handler request and response types, I introduced separate DTOs in Application as well, decoupling the layers as much as I could.
For a really simple microservice, I think it’s fine to write queries directly in handlers. The mapping code between layers is genuinely a pain (especially when handling slices in Go), but the samber/lo package cut a lot of that down.
```mermaid
graph LR
  H[Interface / Handlers] --> A[Application Services]
  A --> D[Domain - Entities, VOs, Policies]
  A --> IRepo[Repository Interfaces]
  subgraph Infrastructure
    DBRepo[DB Repo - Ent / SQLC]
    CacheRepo[Cache Repo - Redis]
    EventPub[Event Publisher - Centrifugo]
  end
  IRepo -.implemented by.-> DBRepo
  IRepo -.implemented by.-> CacheRepo
  D <--> DE[Domain Events]
```
Testing
Test code
I write tests in BDD style with ginkgo/gomega. Having learned TDD in RSpec, the default Go testing style isn’t very intuitive to me. My TDD philosophy is: for anything that touches the DB (repositories, application services), run integration tests against a test DB. Mocking is reserved for the truly external things like OAuth, or these days an API like GPT. In the era of Docker-based development, mocking the DB itself feels like a real waste of time to me.
Domain layers are pure functions, so unit tests are easy. For repositories I don’t test everything, but for complex queries or specific business rules I run integration tests against a real DB. Application business logic is the same, integration tests against a test DB.
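For readers unfamiliar with ginkgo/gomega, a domain unit test in that BDD style looks roughly like this. The validateScore function is a stand-in I defined here for illustration; the ginkgo/v2 and gomega APIs (Describe, It, Expect, Succeed, HaveOccurred) are real, and the suite is driven through a normal Go test entry point:

```go
package domain_test

import (
	"errors"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// validateScore stands in for a pure domain function under test.
func validateScore(score int) error {
	if score < 1 || score > 5 {
		return errors.New("score out of range")
	}
	return nil
}

// The standard Go test entry point hands control to Ginkgo.
func TestDomain(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "Domain Suite")
}

var _ = Describe("check-in score", func() {
	It("accepts scores in range", func() {
		Expect(validateScore(3)).To(Succeed())
	})

	It("rejects scores out of range", func() {
		Expect(validateScore(9)).To(HaveOccurred())
	})
})
```

The Describe/It nesting reads close to RSpec, which is exactly the appeal over table-driven stdlib tests.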
That said, there was one time I told Claude Code to fill in missing test coverage without thinking it through, and it ignored what I’d specified and dumped a pile of mocked-DB tests. So a chunk of the current test suite is mocked-DB tests, which makes adding new repositories really annoying. I need to clean that up but haven’t gotten to it.
Apidog
For API testing I use Apidog. I cycled through Postman, Insomnia, and Bruno, and Apidog covers most of Postman’s core features while doing automated documentation and scenario tests really well at the tool level. Even the free tier is generous enough that solo work is comfortable on it. You can have AI auto-generate Swagger, sync it straight into Apidog, and immediately fire test requests, with request/response schema validation included. Highly recommended.
(Why not Postman: too many features, too heavy. The UI also doesn’t feel clean to me, so I preferred Insomnia. But Insomnia’s updates went weird, so I tried the open-source Bruno, which was too feature-poor. I was looking around and found Apidog.)
Performance testing
I didn’t run a proper load test. I did do performance comparisons between the original repository and the optimized query-side repository. The catch was that adopting the query-side version would have required changing the entire client structure, so I didn’t roll it out. I ended up further optimizing the legacy repository queries instead, and that test is now deprecated.
```mermaid
graph TD
  U[Unit: Domain - Ginkgo/Gomega] --> I[Integration: Repo+App - Test DB]
  I --> E[E2E: API scenarios - Apidog]
  E --> P[Perf: optional load metrics]
```
Real-time
In Scrumble, after writing a check-in or check-out post, you can leave reactions and comments like on a regular SNS. But beyond just commenting, I wanted the room to feel alive, with emojis and comments landing in real time and a seamless UX. Real-time was a baseline requirement.
So at first I built a WebSocket server directly on top of Go Fiber. Then I found https://centrifugal.dev/, which is genuinely impressive. I threw out my code and migrated everything over.
If you build the WebSocket server yourself, you have to handle all of this:
- Message loss and reconnection
- Horizontal scaling later
- Online presence, permissions, namespace management
- Server-side communication protocol
- Operational visibility
- And more
For a quick prototype, doing it yourself is fine. But once you start thinking about production-grade real-time, the cost of building all of it adds up fast. Centrifugo provides the genuinely hard parts of real-time systems (lossless recovery, large-scale fanout, presence and permissions, observability, multi-node scaling) at the framework level.
You spin up a Centrifugo server, hook up Redis, implement handlers in your real server, focus on the logic, and that’s it. With how good the future scaling story and developer ergonomics are, there’s no reason not to use Centrifugo.
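The "implement handlers in your real server" part can be as thin as an HTTP call. A minimal sketch of publishing through Centrifugo's server HTTP API (the `POST /api/publish` endpoint and `X-API-Key` header follow Centrifugo v5's API; the channel naming scheme `post:<id>` and the payload shape are this sketch's assumptions, not Scrumble's actual protocol):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// publishRequest builds the HTTP call for Centrifugo's server publish API.
func publishRequest(apiURL, apiKey, channel string, data any) (*http.Request, error) {
	body, err := json.Marshal(map[string]any{
		"channel": channel,
		"data":    data,
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, apiURL+"/api/publish", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-API-Key", apiKey)
	return req, nil
}

func main() {
	req, err := publishRequest("http://localhost:8000", "secret",
		"post:42", map[string]string{"type": "reaction.added", "emoji": "🎉"})
	fmt.Println(req.URL, err)
	// http.DefaultClient.Do(req) would actually deliver it; subscribed
	// clients on channel "post:42" receive the event in real time.
}
```

Everything hard (fanout, recovery, presence) stays on Centrifugo's side; the API server only ever makes calls like this.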
```mermaid
sequenceDiagram
  participant C as Client (Next.js)
  participant API as Fiber API
  participant PG as PostgreSQL
  participant OB as Outbox (TX)
  participant PUB as Publisher
  participant CF as Centrifugo
  participant O as Other Clients
  C->>API: POST /posts/:id/reactions
  API->>PG: INSERT reaction (in TX)
  API->>PG: INSERT outbox_event (same TX)
  PG-->>API: COMMIT OK
  API->>PUB: notify new outbox_event
  PUB->>CF: publish reaction.added
  CF-->>O: push event
  C-->>C: optimistic UI (optional)
```
The catch
I figured Centrifugo would make my real-time worries disappear, but there was one big issue. Some of Centrifugo's recent features require Redis 7.x (7.0, 7.2, 7.4). In particular, presence, history, and TTL handling all depend on v7, and the channel-history feature (message persistence and recovery, the one I praised above as built-in) sits there too. Our deploy infra, Upstash Redis, only supports up to v6.2. I didn't know I needed to set those configs to 0 and disable recover, so I shipped with default settings, the real-time features didn't work in production at all, and I burned a fair bit of time digging through it. Upstash is very economical, so even with those features off it's not a real problem, but it is a shame.
Even so, since adopting Centrifugo, the real-time server itself hasn’t broken in production once. The only real-time issues we’ve seen have been frontend handler subscription and reconnection logic dropping connections.
Event-driven
With Centrifugo and Redis as the backend, I built domain-based event-driven patterns on top. Event messages are mostly defined in each domain layer, and event emission happens inside the business logic in the application layer. You could put it in the domain layer instead, but I went with application-layer code for the directness of the code flow, the fact that early-project events are almost all 1:1, and easier debugging. In the interface layer, alongside HTTP, I added an “events” interface and built handlers there too. These handlers are registered to handle domain events, and like HTTP handlers, they call into application-layer functions.
```mermaid
graph TD
  CMD[Command - API] --> TX[DB TX]
  TX --> W[Write Domain Data]
  TX --> OBOX[Insert Outbox Event]
  OBOX --> COMMIT[Commit]
  COMMIT --> WKR[Outbox Worker]
  WKR --> PUB[Publish to Centrifugo]
  PUB --> CLIENTS[Subscribed Clients]
```
Other
Caching
Everything else gets basic caching with Redis: feed summaries, my own writing state, the spaces I've joined. These are calculated queries that can hold a long TTL. The default is cache-aside, with invalidation implemented in event handlers: when a mutating event fires, the relevant handler deletes the affected keys. It's intuitive, but registering invalidation handlers one by one for every mutation event is a bit of a chore.
```mermaid
flowchart TD
  RQ[Client reads Feed] --> GET[Redis GET]
  GET -->|miss| DB[SQLC Query]
  DB --> DTO[Shape DTO]
  DTO --> SET[Redis SET - TTL]
  SET --> RESP[Response]
  GET -->|hit| RESP
  subgraph Invalidation
    EVT[reaction/comment created]
    EVT --> DEL[DEL ws:id:feed:*]
    EVT --> PUB[Publish feed.invalidate]
  end
```
CQRS (Query Repository / SQLC)
When I say CQRS I don’t mean splitting the Query DB out fully. Because of Entgo’s query-performance issues mentioned above, I introduced SQLC, which is closer to raw SQL, and split off a Query Repository. The Query Repository skips the domain interface and acts as read-only. Application calls it directly and gets back Application-layer structs.
Personally I think DDD fits Command well. For most of the values an endpoint or a screen needs, instead of routing through the structured shape of domain structs, it’s more natural and more performant to fire a DB query and return what the screen wants directly. That fits a Query interface better, in my view.
I introduced this because of Entgo, of course, and you might say there’s no reason for early-stage projects to reach for CQRS. But to honor the spirit of the project laid out in Scrumble Project Retrospective (June–August 2025), I went in with the attitude of “let’s just try everything I want to try, the dumb way.”
We use SNS daily, so we tend to assume it’s easy to build. But these services come with more performance concerns than you’d expect, so building API endpoints intuitively to fit each screen wasn’t a bad call. What used to be many round-trips through Entgo became a clean JOIN in raw SQL via SQLC, and queries got a lot faster.
You can write raw SQL with Entgo too, but joining several tables (posts, reactions, comments, mediafiles) and doing it in a type-safe way with proper object mapping was harder than I expected. With SQLC, you write a SQL file and generate a Go file that maps types safely, so I got the performance and type safety together.
```mermaid
graph TD
  CMD[Command] --> AC[App - Command]
  QRY[Query] --> AQ[App - Query]
  AC --> RC[Repo - Ent]
  AQ --> RQ[Repo - SQLC]
  RC --> PG[PostgreSQL]
  RQ --> PG
  AC --> EV[Domain Events] --> CF[Centrifugo]
  AQ --> C[Redis Cache] --> AQ
```
Deploy infrastructure
For final deployment I’m using fly.io for both DB and server, with Upstash Redis. Both the backend server and the Centrifugo server run on fly.io. It’s only used internally right now, so infra cost is basically zero.
If you’ve read this far you might be thinking the optimization (caching, queries) feels excessive for an early-stage service. The reason: my initial DB infra was Neon. I’d heard “serverless” and pulled it in, but Neon has a few quirks, and on top of that my fly.io server was in NRT (Tokyo) while Neon’s only Asia region is Singapore. The server-to-DB latency was substantial. (Average 200ms+, so feeds with lots of posts and comments could take 2 seconds to load.) That pushed me hard into Optimistic UI on React Query and aggressive query optimization and caching on the server side, just to make this infra work.
Eventually, after seeing the cost on Neon, I migrated to fly.io and unified everything in NRT. The optimizations stayed, so the service runs much smoother now.
The open question is whether I’ll stick with fly.io for an actual public release. Deploys are easy, the experience is intuitive, and the early cost is low. But the status page is yellow disturbingly often, and individual regions go down. The service stays up, but the web console server goes down and the page won’t load. Things that wouldn’t happen on AWS happen often here. At some point I’ll need to rebuild on more reliable server infra.
```mermaid
graph TD
  DEV[Local Dev - Air] --> CI[CI Build & Test]
  CI --> REG[Container Registry]
  REG --> API[Fly.io Scrumble API - NRT]
  REG --> CF[Fly.io Centrifugo - NRT]
  subgraph Data
    FPG[Fly.io Postgres]
    UREDIS[Upstash Redis]
  end
  API -- SQL --> FPG
  API -- cache --> UREDIS
  API -- publish --> CF
  USERS[Users] --> API
  USERS -- WS --> CF
```
Vibe coding on the backend?
User → Member migration
Early on, to move quickly, I designed most entities around the User schema. For a workspace-based SaaS like Slack (where each workspace has its own profile and you log in per workspace), I needed to migrate most of the domain/schema entities and the auth structure to a Member entity (a user-to-workspace relation schema) instead of a User entity. I knew partway through that I had to switch, but I kept pushing it back to keep adding features. By the end of the project, the migration was tangled with multiple domains and the dependency cost had grown.
Migrating top-level entity relationships isn’t actually hard. It’s mechanical and repetitive. So I documented the spec and plan in Kiro, which had just launched, and handed it to Claude Code. Claude Code went on a wild detour: ignoring the existing architecture, ignoring existing code, generating duplicate domains and code. I ended up redoing about three days of work by hand. So for the backend, there isn’t much to say about vibe coding. Later on I only used it to automate trivial scaffolding for simple API setups, and most of the work I did myself. The frontend, on the other hand, leaned heavily on Claude Code, so I’ll cover more there.
Wrap-up
One of the original goals of the project was “get the chops back”, and that goal was met. Now, with production release in sight, I’m building on a cleaner, clearer architecture in the current project. To summarize:
Keep
- DDD/Clean Architecture, plus CQRS and event-driven. The key is putting ports at every layer and enforcing real separation. Separation gives the domain implementation room to breathe. (And reference the domain everywhere.)
  - There's a tradeoff with duplicated code, of course
- Centrifugo
- Working with Redis caching
- A clearer test strategy
Problem
- Picking an ORM (Entgo) without enough evaluation. The resulting design got more complex (CQRS, SQLC)
- Code-first migrations via Entgo
- Region split at deploy time (Neon SG ↔ Fly NRT). Neon ate a lot of my time
- Didn’t clean up code generated badly through vibe coding
- Insufficient research at the dev stage when adopting Centrifugo and similar tools (the version issue surfaced at production-release time, after development was done, costing real time)
Try
- Schema-first migrations instead of code-first
- Long term, I need to swap the production deploy infra
- Get operational visibility in place
- Remove the mocked-DB tests and clean up the badly generated code
That’s about the size of it.
If you’ve read all of this, you’ll see the core of this design isn’t tied to Go. It applies in any language, any framework.
One thing I miss while working in Go is functional programming. The style I enjoy in Java lambdas, Kotlin, and JS/TS feels less fun in Go. samber/lo helps, but as a language that got generics late, the code doesn’t feel that intuitive. (lo does make it much cleaner, to be fair.)
If I do Spring next, it’ll probably be because of Kotlin. That said, Go’s directness pays off. Once you put in the small upfront cost, it’s easy to collaborate on and easy to operate. It’ll stay my primary choice for products I lead. I hope more developers in Korea pick up Go.
I’ll keep going in the frontend post…