Pattern Spotting: Hatchet
Hatchet is a durable task orchestration platform built on PostgreSQL. Competitors include Celery and Temporal. Hatchet scaled from 20k tasks/month to 1B+/month without abandoning Postgres.
Some patterns caught my attention.
PostgreSQL as the Universal Backend #
Conventional wisdom says separate your queue (Redis/RabbitMQ) from your state (PostgreSQL). Different access patterns, different tools. Hatchet bets against this, using Postgres for everything.
The queue implementation relies on FOR UPDATE SKIP LOCKED, a Postgres feature that lets workers claim rows without blocking each other.
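A minimal sketch of that claim pattern with pgx (the table and column names are illustrative, not Hatchet's actual schema): each worker marks a batch of queued rows as running, and SKIP LOCKED lets concurrent workers skip rows another worker has already locked instead of waiting on them.

```go
package queue

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

type task struct {
	ID      int64
	Payload []byte
}

// Claim a batch of queued tasks. FOR UPDATE SKIP LOCKED makes concurrent
// workers skip rows another worker has already locked instead of blocking.
const claimSQL = `
UPDATE tasks
   SET status = 'running', claimed_at = now()
 WHERE id IN (
       SELECT id
         FROM tasks
        WHERE status = 'queued'
        ORDER BY id
        LIMIT $1
          FOR UPDATE SKIP LOCKED)
RETURNING id, payload`

func claimBatch(ctx context.Context, pool *pgxpool.Pool, limit int) ([]task, error) {
	rows, err := pool.Query(ctx, claimSQL, limit)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var claimed []task
	for rows.Next() {
		var t task
		if err := rows.Scan(&t.ID, &t.Payload); err != nil {
			return nil, err
		}
		claimed = append(claimed, t)
	}
	return claimed, rows.Err()
}
```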
This worked until it didn’t. At around 25k transactions/second (~5k tasks/sec, each task triggering ~5 transactions), they hit a pathological case: many queued tasks combined with many workers polling simultaneously caused CPU spikes that cascaded into a thundering herd.
They considered the “obvious” fix—move hot data to Kafka + ClickHouse. Instead, they doubled down on Postgres and spent six months learning to operate it at scale.
What they learned:
- Buffer writes and flush every 10ms instead of per-task (see the sketch after this list)
- Use identity columns over UUIDs (UUIDs cause “index bloat”)
- Lean into triggers instead of application-side logic
- Read the Postgres manual: “amazing what you can do when you read the manual”
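As a rough illustration of the write-buffering point, a buffer can drain a channel and flush accumulated rows in one COPY on a 10ms ticker instead of issuing one INSERT per task. Everything below (table, columns, batch size) is an assumption for the sketch, not Hatchet's code:

```go
package queue

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

type taskEvent struct {
	TaskID int64
	Kind   string
}

// bufferWrites drains events from ch and flushes them in a single COPY
// every 10ms (or when the buffer fills), instead of one write per task.
func bufferWrites(ctx context.Context, pool *pgxpool.Pool, ch <-chan taskEvent) error {
	const maxBatch = 1000
	buf := make([]taskEvent, 0, maxBatch)
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	flush := func() error {
		if len(buf) == 0 {
			return nil
		}
		_, err := pool.CopyFrom(ctx,
			pgx.Identifier{"task_events"},
			[]string{"task_id", "kind"},
			pgx.CopyFromSlice(len(buf), func(i int) ([]any, error) {
				return []any{buf[i].TaskID, buf[i].Kind}, nil
			}))
		buf = buf[:0]
		return err
	}

	for {
		select {
		case <-ctx.Done():
			return flush()
		case e := <-ch:
			buf = append(buf, e)
			if len(buf) >= maxBatch {
				if err := flush(); err != nil {
					return err
				}
			}
		case <-ticker.C:
			if err := flush(); err != nil {
				return err
			}
		}
	}
}
```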
Trade-off: Requires Postgres tuning expertise at scale. In return: transactional enqueueing, durability by default, one system to operate. The last one, IMHO, is huge: you don’t need a ZooKeeper cluster, just Postgres.
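Transactional enqueueing is worth a concrete picture: because the queue lives in the same database as application state, a task can be enqueued in the same transaction as the business write that triggers it, so either both commit or neither does. A sketch with made-up table names:

```go
package queue

import (
	"context"

	"github.com/jackc/pgx/v5/pgxpool"
)

// createOrderAndEnqueue writes a business row and enqueues the task that
// should process it in one transaction. Table names are illustrative.
func createOrderAndEnqueue(ctx context.Context, pool *pgxpool.Pool, orderID int64, payload []byte) error {
	tx, err := pool.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op if Commit succeeds

	if _, err := tx.Exec(ctx,
		`INSERT INTO orders (id, status) VALUES ($1, 'pending')`, orderID); err != nil {
		return err
	}
	if _, err := tx.Exec(ctx,
		`INSERT INTO tasks (status, payload) VALUES ('queued', $1)`, payload); err != nil {
		return err
	}
	return tx.Commit(ctx)
}
```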
gRPC Push over HTTP Poll #
Conventional model: Workers poll the queue. Simple, but creates load even when idle.
Hatchet’s model:
- Workers open persistent gRPC streams (Listen())
- Dispatcher maintains an in-memory registry of connected workers
- Tasks pushed instantly
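Roughly, the push side could look like the sketch below. The proto here (the pb package, ListenRequest, AssignedTask) is illustrative rather than Hatchet's actual API; the point is the registry of open streams, and the fact that Send blocks when the stream's flow-control window fills, which is where the backpressure mentioned below comes from.

```go
package dispatcher

import (
	"fmt"
	"sync"

	pb "example.com/dispatcherpb" // hypothetical generated gRPC package
)

type dispatcher struct {
	pb.UnimplementedDispatcherServer

	mu      sync.Mutex
	workers map[string]pb.Dispatcher_ListenServer // workerID -> open stream
}

// Listen is the server side of the worker's long-lived stream: register the
// stream, then block until the worker disconnects and the context ends.
func (d *dispatcher) Listen(req *pb.ListenRequest, stream pb.Dispatcher_ListenServer) error {
	d.mu.Lock()
	d.workers[req.WorkerId] = stream
	d.mu.Unlock()

	<-stream.Context().Done()

	d.mu.Lock()
	delete(d.workers, req.WorkerId)
	d.mu.Unlock()
	return nil
}

// push sends a task to a connected worker. Send blocks once the stream's
// flow-control window is full, which gives natural backpressure.
func (d *dispatcher) push(workerID string, t *pb.AssignedTask) error {
	d.mu.Lock()
	stream, ok := d.workers[workerID]
	d.mu.Unlock()
	if !ok {
		return fmt.Errorf("worker %s not connected", workerID)
	}
	return stream.Send(t)
}
```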
Benefits:
- Lower latency (no polling interval)
- Reduced database load
- Natural backpressure via stream flow control
Trade-off: Connection management complexity.
Dual Schema: OLTP + OLAP Separation #
The problem: Dashboard queries (SELECT COUNT(*)) compete with queue operations. Can’t optimize one table for both patterns.
The solution:
- Two schemas: v1-core.sql (queue) and v1-olap.sql (monitoring)
- Range-based partitioning for time series → easy retention by dropping old partitions (see the sketch after this list)
- Hash-based partitioning for task events → spread write load
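To make the retention point concrete, here is a sketch against an assumed daily range-partitioned table (olap_task_runs, the names and daily granularity are assumptions): create tomorrow's partition ahead of time, and drop whichever partition aged out. Dropping a partition is a metadata-only operation, not a row-by-row DELETE.

```go
package olap

import (
	"context"
	"fmt"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// rotatePartitions creates tomorrow's partition and drops the one that
// fell out of the retention window.
func rotatePartitions(ctx context.Context, pool *pgxpool.Pool, retainDays int) error {
	today := time.Now().UTC().Truncate(24 * time.Hour)
	tomorrow := today.AddDate(0, 0, 1)
	dayAfter := today.AddDate(0, 0, 2)

	createSQL := fmt.Sprintf(
		`CREATE TABLE IF NOT EXISTS olap_task_runs_%s PARTITION OF olap_task_runs
		     FOR VALUES FROM ('%s') TO ('%s')`,
		tomorrow.Format("20060102"),
		tomorrow.Format("2006-01-02"),
		dayAfter.Format("2006-01-02"),
	)
	if _, err := pool.Exec(ctx, createSQL); err != nil {
		return err
	}

	// Retention: drop the expired partition instead of deleting rows.
	expired := today.AddDate(0, 0, -retainDays)
	dropSQL := fmt.Sprintf(`DROP TABLE IF EXISTS olap_task_runs_%s`,
		expired.Format("20060102"))
	_, err := pool.Exec(ctx, dropSQL)
	return err
}
```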
The escape hatch: Schemas are independent enough to run on separate databases later if needed.
Applicable elsewhere: Any system mixing operational and analytical queries. Audit logs vs live data.
Facade Repository #
Data access in Go often means either one massive interface or dozens of small ones scattered everywhere. Hatchet picks a middle path: a single Repository interface that aggregates 30+ sub-repositories.
type Repository interface {
APIToken() APITokenRepository
Dispatcher() DispatcherRepository
Tasks() TaskRepository
Scheduler() SchedulerRepository
OLAP() OLAPRepository
// ... 25+ more
}
Services depend on this one interface. Need tasks? Call repo.Tasks().Create(...). Need to mock the data layer in tests? Implement one interface, not thirty.
The implementation uses composition:
type sharedRepository struct {
pool *pgxpool.Pool
queries *sqlcv1.Queries // SQLC-generated
cache *cache.Cache
l *zerolog.Logger
}
Each sub-repository receives this shared struct at construction time. Common resources (connection pool, logger, cache) are wired once and reused everywhere.
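A sketch of what one sub-repository might look like on top of that shared struct (the constructor, method, and raw SQL are illustrative; the real code goes through the SQLC-generated queries):

```go
package repository

import "context"

// TaskRepository stands in for the interface returned by Repository.Tasks().
type TaskRepository interface {
	Create(ctx context.Context, payload []byte) (int64, error)
}

// taskRepository embeds sharedRepository, so the pool, SQLC queries, cache,
// and logger are wired once and reused by every sub-repository.
type taskRepository struct {
	*sharedRepository
}

func (s *sharedRepository) Tasks() TaskRepository {
	return &taskRepository{sharedRepository: s}
}

// Create uses raw SQL to keep the sketch self-contained; the shared pool
// and logger come from the embedded struct.
func (r *taskRepository) Create(ctx context.Context, payload []byte) (int64, error) {
	var id int64
	err := r.pool.QueryRow(ctx,
		`INSERT INTO tasks (status, payload) VALUES ('queued', $1) RETURNING id`,
		payload).Scan(&id)
	if err != nil {
		r.l.Error().Err(err).Msg("create task failed")
	}
	return id, err
}
```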
Why this works:
- One testing mock: implement Repository, get all data access
- Sub-repositories stay focused (single responsibility)
- Shared resources avoid duplication and connection sprawl
Trade-off: The interface is large. If you only need Tasks(), you still depend on the full Repository.
Takeaway #
Sometimes the “boring” choice wins if you learn it deeply. PostgreSQL handled 1B tasks/month—they just had to read the manual.
Links:
- Hatchet on GitHub
- How to think about durable execution — good read on when durable execution makes sense (and when it doesn’t)
- HN discussion on v1 launch