Skip to content

Performance bottlenecks #147

@pinodeca

Description

@pinodeca

Summary

Stress testing on an 8-vCPU HorizonDB AE cluster (17,471 instances, 0 new failures) identified four bottlenecks limiting throughput to ~25 inst/s sustained and ~50 starts/sec ingest ceiling. This issue captures the analysis and improvement opportunities.

Full stress campaign report: zperf.md in the repo root (attached as a comment if available).

Key Numbers from Stress Campaign

Metric Value
Per-instance overhead (trivial body) ~60ms
df.start() round-trip (serial) ~130ms
Sustained completion ceiling ~25 inst/s (at concurrency 4)
Ingest ceiling ~50 starts/sec (at concurrency ≥16)
Effective worker parallelism ~6 (on 8 vCPU, max_user_connections=10)
Latency cliff c04 → c08: p50 jumps from 150ms to 14,410ms

Bottleneck Analysis

1. Fresh connection per SQL node (~10-20ms per node)

connect_as_user() in src/types.rs opens a new PgConnection (TCP connect + auth + SET df.in_workflow) for every SQL activity execution, then drops it. There is no connection pooling for user-execution connections — the semaphore in execute_sql.rs only limits concurrency, it does not cache connections.

For trivial bodies like SELECT 1, connection setup dominates execution time and is a significant chunk of the 60ms floor.

2. Write contention between df.start() and BGW (df.instances table)

Phase 1 shows linear scaling to c04, then latency explodes 80x at c08. The stress report notes: "df.start writers and workers are stepping on each other."

Both paths write to df.instances:

  • User backends INSERT rows (via df.start() → SPI)
  • BGW activities UPDATE status columns

At c08+ the ROW EXCLUSIVE locks and MVCC vacuum pressure create queueing on the shared table.

3. Single-threaded BGW tokio runtime

The background worker runs tokio::runtime::Builder::new_current_thread(). All async tasks — semaphore dispatch, duroxide state I/O, SQL execution, result encoding — share one OS thread. This caps effective parallelism at ~6 even though max_user_connections=10 permits are available.

4. Connection pool sizing (secondary)

Pool Default Observation
max_user_connections 10 Not fully saturated — runtime can't feed it fast enough
max_duroxide_connections 10 9 usable (1 for listener). Serialization point for orchestration state I/O
max_management_connections 6 Shared between graph loading, status updates, and heartbeats. Can stall under load

Connection limits are not the primary bottleneck. Even raising all limits would not help much because the single-threaded runtime cannot drive more concurrent work.

Improvement Opportunities (ranked by expected impact)

High Impact

  1. Connection pooling for user-execution
    Replace per-activity PgConnection::connect_with() in connect_as_user() with a per-user connection cache keyed by (submitted_by, database). Eliminates ~10-20ms TCP+auth per SQL node.
    Expected: 15-30% latency reduction on trivial bodies.

  2. Multi-threaded BGW runtime
    Switch from new_current_thread() to Runtime::new() with 2-4 worker threads. Lets semaphore dispatch, duroxide I/O, and SQL execution overlap on separate OS threads.
    Expected: effective parallelism closer to max_user_connections value.

  3. Batch node insertion in df.start()
    Replace recursive per-node INSERT INTO df.nodes with a single multi-row INSERT or COPY. A 5-node graph currently does 5 SPI round-trips while holding ROW EXCLUSIVE locks.
    Expected: faster df.start(), reduced lock hold time.

Medium Impact

  1. Separate ingest and progress write paths
    Decouple user INSERTs from BGW status UPDATEs on df.instances (e.g., separate status table, or append-only status log). Eliminates the row-level lock contention seen at c08+.
    Expected: higher sustained throughput at high concurrency.

  2. Raise default pool sizes for larger SKUs
    Bump max_duroxide_connections to 15-20 and max_management_connections to 10 for 8+ vCPU deployments. Current defaults are conservative.
    Expected: incremental improvement at high concurrency.

Stress Campaign Phases (summary)

Phase Goal Result
0 — Calibration Baseline per-instance latency ✅ 60ms/inst overhead, tight tails
1 — Throughput sweep Find sustained ceiling ✅ 25 inst/s sustained, 50 starts/sec ingest
2 — Concurrency saturation Effective parallelism ✅ ~6 workers effective on 8 vCPU
3 — Mixed workload Class starvation check ✅ No starvation (per-class latency not captured — observability gap)
4 — Durability under restart Replay after termination ⚠️ Inconclusive (HDB role lacks SUPERUSER for pg_terminate_backend)
5 — Soak (10 min) Sustained load stability ✅ 12,515 instances, 20.86 inst/s, zero failures, no drift

Campaign totals: 17,471 instances completed, 2 failed (pre-existing residue), 0 new failures.

Observability Gaps Found

  1. df.list_instances() has a hard ~10,000-row cap even with explicit limit=100000
  2. df.instance_executions(instance_id, 1) returns no row for some completed instances

These should be tracked separately.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions