Pattern Spotting: Hatchet

Hatchet is a durable task orchestration platform built on PostgreSQL. Competitors include Celery and Temporal. Hatchet scaled from 20k tasks/month to 1B+/month without abandoning Postgres.

Some patterns caught my attention.

PostgreSQL as the Universal Backend #

Conventional wisdom says separate your queue (Redis/RabbitMQ) from your state (PostgreSQL). Different access patterns, different tools. Hatchet bets against this and uses Postgres for everything.

The queue implementation relies on FOR UPDATE SKIP LOCKED, a Postgres feature that lets workers claim rows without blocking each other.
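
A minimal sketch of such a claim query using pgx, assuming an illustrative tasks table with status and worker_id columns (not Hatchet's actual schema):

import (
    "context"

    "github.com/jackc/pgx/v5/pgxpool"
)

// claimTasks atomically claims up to `limit` queued tasks for one worker.
// SKIP LOCKED makes concurrent workers skip rows already locked by other
// transactions instead of blocking on them.
func claimTasks(ctx context.Context, pool *pgxpool.Pool, workerID string, limit int) ([]int64, error) {
    rows, err := pool.Query(ctx, `
        UPDATE tasks
        SET status = 'running', worker_id = $1
        WHERE id IN (
            SELECT id FROM tasks
            WHERE status = 'queued'
            ORDER BY id
            LIMIT $2
            FOR UPDATE SKIP LOCKED
        )
        RETURNING id`, workerID, limit)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    var ids []int64
    for rows.Next() {
        var id int64
        if err := rows.Scan(&id); err != nil {
            return nil, err
        }
        ids = append(ids, id)
    }
    return ids, rows.Err()
}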

This worked until it didn’t. At around 25k transactions/second (~5k tasks/sec, each task triggering ~5 transactions), they hit a pathological case: many queued tasks combined with many workers polling simultaneously caused CPU spikes that cascaded into a thundering herd.

They considered the “obvious” fix: move hot data to Kafka + ClickHouse. Instead, they doubled down on Postgres and spent six months learning to operate it at scale.

What they learned:

  • Buffer writes and flush every 10ms instead of per-task (see the sketch after this list)
  • Use identity columns over UUIDs (UUIDs cause “index bloat”)
  • Lean into triggers instead of application-side logic
  • Read the Postgres manual: “amazing what you can do when you read the manual”
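
A sketch of the first point, buffered writes with a periodic flush; the Task type and the flush callback are illustrative, not Hatchet's actual buffer:

import (
    "context"
    "sync"
    "time"
)

type Task struct {
    ID      int64
    Payload []byte
}

// taskBuffer batches writes in memory so the flusher can issue one
// multi-row INSERT every ~10ms instead of one INSERT per task.
type taskBuffer struct {
    mu      sync.Mutex
    pending []Task
}

func (b *taskBuffer) Add(t Task) {
    b.mu.Lock()
    b.pending = append(b.pending, t)
    b.mu.Unlock()
}

// Run flushes the buffer on every tick until ctx is cancelled.
func (b *taskBuffer) Run(ctx context.Context, interval time.Duration, flush func([]Task) error) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            b.mu.Lock()
            batch := b.pending
            b.pending = nil
            b.mu.Unlock()
            if len(batch) > 0 {
                _ = flush(batch) // e.g. one multi-row INSERT or COPY per batch
            }
        }
    }
}

Started as go buf.Run(ctx, 10*time.Millisecond, insertBatch), where insertBatch is whatever single-statement bulk write fits the schema.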

Trade-off: Requires Postgres tuning expertise at scale. In return: transactional enqueueing, durability by default, one system to operate. The last one, imho, is huge: you don’t need a cluster of zookeepers, just Postgres.

gRPC Push over HTTP Poll #

Conventional model: Workers poll the queue. Simple, but creates load even when idle.

Hatchet’s model: workers hold a long-lived gRPC stream to the engine, and the engine pushes task assignments down the stream as they become available.
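
A minimal sketch of the push side in Go, assuming a hypothetical proto-generated Dispatcher service with a server-streaming Listen RPC (pb, subscribe, and the message types are illustrative, not Hatchet's actual API):

// Listen holds one stream open per worker and pushes tasks as they are
// assigned. gRPC flow control gives the natural backpressure: Send blocks
// once the worker's receive window is full.
func (s *dispatcherServer) Listen(req *pb.WorkerRegistration, stream pb.Dispatcher_ListenServer) error {
    tasks := s.subscribe(req.WorkerId) // channel of tasks assigned to this worker
    defer s.unsubscribe(req.WorkerId)

    for {
        select {
        case <-stream.Context().Done():
            return stream.Context().Err()
        case t := <-tasks:
            if err := stream.Send(&pb.AssignedTask{Id: t.Id, Payload: t.Payload}); err != nil {
                return err
            }
        }
    }
}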

Benefits:

  • Lower latency (no polling interval)
  • Reduced database load
  • Natural backpressure via stream flow control

Trade-off: Connection management complexity.

Dual Schema: OLTP + OLAP Separation #

The problem: Dashboard queries (SELECT COUNT(*)) compete with queue operations, and you can’t optimize one table for both access patterns.

The solution:

  • Two schemas: v1-core.sql (queue) and v1-olap.sql (monitoring)
  • Range-based partitioning for time series → easy retention by dropping old partitions (see the sketch after this list)
  • Hash-based partitioning for task events → spread write load
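
A sketch of the range-partitioning side, assuming an illustrative task_events_olap parent table declared PARTITION BY RANGE (created_at); retention then means dropping a partition instead of deleting rows:

import (
    "context"
    "fmt"
    "time"

    "github.com/jackc/pgx/v5/pgxpool"
)

// createDailyPartition adds the partition covering one day of events.
// Identifiers cannot be bound as parameters, so the DDL is built with Sprintf.
func createDailyPartition(ctx context.Context, pool *pgxpool.Pool, day time.Time) error {
    name := "task_events_olap_" + day.Format("20060102")
    _, err := pool.Exec(ctx, fmt.Sprintf(
        `CREATE TABLE IF NOT EXISTS %s PARTITION OF task_events_olap
         FOR VALUES FROM ('%s') TO ('%s')`,
        name, day.Format("2006-01-02"), day.AddDate(0, 0, 1).Format("2006-01-02")))
    return err
}

// dropOldPartition enforces retention: dropping a partition is a quick
// metadata operation, unlike DELETEing millions of rows.
func dropOldPartition(ctx context.Context, pool *pgxpool.Pool, day time.Time) error {
    name := "task_events_olap_" + day.Format("20060102")
    _, err := pool.Exec(ctx, "DROP TABLE IF EXISTS "+name)
    return err
}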

The escape hatch: Schemas are independent enough to run on separate databases later if needed.

Applicable elsewhere: Any system mixing operational and analytical queries. Audit logs vs live data.

Facade Repository #

Data access in Go often means either one massive interface or dozens of small ones scattered everywhere. Hatchet picks a middle path: a single Repository interface that aggregates 30+ sub-repositories.

type Repository interface {
    APIToken() APITokenRepository
    Dispatcher() DispatcherRepository
    Tasks() TaskRepository
    Scheduler() SchedulerRepository
    OLAP() OLAPRepository
    // ... 25+ more
}

Services depend on this one interface. Need tasks? Call repo.Tasks().Create(...). Need to mock the data layer in tests? Implement one interface, not thirty.

The implementation uses composition:

type sharedRepository struct {
    pool    *pgxpool.Pool
    queries *sqlcv1.Queries  // SQLC-generated
    cache   *cache.Cache
    l       *zerolog.Logger
}

Each sub-repository receives this shared struct at construction time. Common resources (connection pool, logger, cache) are wired once and reused everywhere.
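
A hypothetical sketch of how a sub-repository could be wired from that shared struct (the constructor and facade names are illustrative, not Hatchet's actual code):

// taskRepository gets the pool, SQLC queries, cache, and logger by embedding
// the shared struct; nothing is wired per sub-repository.
type taskRepository struct {
    *sharedRepository
}

func newTaskRepository(shared *sharedRepository) TaskRepository {
    return &taskRepository{sharedRepository: shared}
}

// The facade just hands out the pre-built sub-repositories.
func (r *repositoryImpl) Tasks() TaskRepository { return r.tasks }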

Why this works:

  • Mock one Repository in tests and you cover all data access
  • Sub-repositories stay focused (single responsibility)
  • Shared resources avoid duplication and connection sprawl

Trade-off: The interface is large. If you only need Tasks(), you still depend on the full Repository.

Takeaway #

Sometimes the “boring” choice wins if you learn it deeply. PostgreSQL handled 1B tasks/month—they just had to read the manual.

Links: