Summary
Stress testing on an 8-vCPU HorizonDB AE cluster (17,471 instances, 0 new failures) identified four bottlenecks limiting throughput to ~25 inst/s sustained and ~50 starts/sec ingest ceiling. This issue captures the analysis and improvement opportunities.
Full stress campaign report: zperf.md in the repo root (attached as a comment if available).
Key Numbers from Stress Campaign
| Metric |
Value |
| Per-instance overhead (trivial body) |
~60ms |
df.start() round-trip (serial) |
~130ms |
| Sustained completion ceiling |
~25 inst/s (at concurrency 4) |
| Ingest ceiling |
~50 starts/sec (at concurrency ≥16) |
| Effective worker parallelism |
~6 (on 8 vCPU, max_user_connections=10) |
| Latency cliff |
c04 → c08: p50 jumps from 150ms to 14,410ms |
Bottleneck Analysis
1. Fresh connection per SQL node (~10-20ms per node)
connect_as_user() in src/types.rs opens a new PgConnection (TCP connect + auth + SET df.in_workflow) for every SQL activity execution, then drops it. There is no connection pooling for user-execution connections — the semaphore in execute_sql.rs only limits concurrency, it does not cache connections.
For trivial bodies like SELECT 1, connection setup dominates execution time and is a significant chunk of the 60ms floor.
2. Write contention between df.start() and BGW (df.instances table)
Phase 1 shows linear scaling to c04, then latency explodes 80x at c08. The stress report notes: "df.start writers and workers are stepping on each other."
Both paths write to df.instances:
- User backends INSERT rows (via
df.start() → SPI)
- BGW activities UPDATE status columns
At c08+ the ROW EXCLUSIVE locks and MVCC vacuum pressure create queueing on the shared table.
3. Single-threaded BGW tokio runtime
The background worker runs tokio::runtime::Builder::new_current_thread(). All async tasks — semaphore dispatch, duroxide state I/O, SQL execution, result encoding — share one OS thread. This caps effective parallelism at ~6 even though max_user_connections=10 permits are available.
4. Connection pool sizing (secondary)
| Pool |
Default |
Observation |
max_user_connections |
10 |
Not fully saturated — runtime can't feed it fast enough |
max_duroxide_connections |
10 |
9 usable (1 for listener). Serialization point for orchestration state I/O |
max_management_connections |
6 |
Shared between graph loading, status updates, and heartbeats. Can stall under load |
Connection limits are not the primary bottleneck. Even raising all limits would not help much because the single-threaded runtime cannot drive more concurrent work.
Improvement Opportunities (ranked by expected impact)
High Impact
-
Connection pooling for user-execution
Replace per-activity PgConnection::connect_with() in connect_as_user() with a per-user connection cache keyed by (submitted_by, database). Eliminates ~10-20ms TCP+auth per SQL node.
Expected: 15-30% latency reduction on trivial bodies.
-
Multi-threaded BGW runtime
Switch from new_current_thread() to Runtime::new() with 2-4 worker threads. Lets semaphore dispatch, duroxide I/O, and SQL execution overlap on separate OS threads.
Expected: effective parallelism closer to max_user_connections value.
-
Batch node insertion in df.start()
Replace recursive per-node INSERT INTO df.nodes with a single multi-row INSERT or COPY. A 5-node graph currently does 5 SPI round-trips while holding ROW EXCLUSIVE locks.
Expected: faster df.start(), reduced lock hold time.
Medium Impact
-
Separate ingest and progress write paths
Decouple user INSERTs from BGW status UPDATEs on df.instances (e.g., separate status table, or append-only status log). Eliminates the row-level lock contention seen at c08+.
Expected: higher sustained throughput at high concurrency.
-
Raise default pool sizes for larger SKUs
Bump max_duroxide_connections to 15-20 and max_management_connections to 10 for 8+ vCPU deployments. Current defaults are conservative.
Expected: incremental improvement at high concurrency.
Stress Campaign Phases (summary)
| Phase |
Goal |
Result |
| 0 — Calibration |
Baseline per-instance latency |
✅ 60ms/inst overhead, tight tails |
| 1 — Throughput sweep |
Find sustained ceiling |
✅ 25 inst/s sustained, 50 starts/sec ingest |
| 2 — Concurrency saturation |
Effective parallelism |
✅ ~6 workers effective on 8 vCPU |
| 3 — Mixed workload |
Class starvation check |
✅ No starvation (per-class latency not captured — observability gap) |
| 4 — Durability under restart |
Replay after termination |
⚠️ Inconclusive (HDB role lacks SUPERUSER for pg_terminate_backend) |
| 5 — Soak (10 min) |
Sustained load stability |
✅ 12,515 instances, 20.86 inst/s, zero failures, no drift |
Campaign totals: 17,471 instances completed, 2 failed (pre-existing residue), 0 new failures.
Observability Gaps Found
df.list_instances() has a hard ~10,000-row cap even with explicit limit=100000
df.instance_executions(instance_id, 1) returns no row for some completed instances
These should be tracked separately.
Summary
Stress testing on an 8-vCPU HorizonDB AE cluster (17,471 instances, 0 new failures) identified four bottlenecks limiting throughput to ~25 inst/s sustained and ~50 starts/sec ingest ceiling. This issue captures the analysis and improvement opportunities.
Full stress campaign report:
zperf.mdin the repo root (attached as a comment if available).Key Numbers from Stress Campaign
df.start()round-trip (serial)max_user_connections=10)Bottleneck Analysis
1. Fresh connection per SQL node (~10-20ms per node)
connect_as_user()insrc/types.rsopens a newPgConnection(TCP connect + auth +SET df.in_workflow) for every SQL activity execution, then drops it. There is no connection pooling for user-execution connections — the semaphore inexecute_sql.rsonly limits concurrency, it does not cache connections.For trivial bodies like
SELECT 1, connection setup dominates execution time and is a significant chunk of the 60ms floor.2. Write contention between
df.start()and BGW (df.instancestable)Phase 1 shows linear scaling to c04, then latency explodes 80x at c08. The stress report notes: "df.start writers and workers are stepping on each other."
Both paths write to
df.instances:df.start()→ SPI)At c08+ the ROW EXCLUSIVE locks and MVCC vacuum pressure create queueing on the shared table.
3. Single-threaded BGW tokio runtime
The background worker runs
tokio::runtime::Builder::new_current_thread(). All async tasks — semaphore dispatch, duroxide state I/O, SQL execution, result encoding — share one OS thread. This caps effective parallelism at ~6 even thoughmax_user_connections=10permits are available.4. Connection pool sizing (secondary)
max_user_connectionsmax_duroxide_connectionsmax_management_connectionsConnection limits are not the primary bottleneck. Even raising all limits would not help much because the single-threaded runtime cannot drive more concurrent work.
Improvement Opportunities (ranked by expected impact)
High Impact
Connection pooling for user-execution
Replace per-activity
PgConnection::connect_with()inconnect_as_user()with a per-user connection cache keyed by(submitted_by, database). Eliminates ~10-20ms TCP+auth per SQL node.Expected: 15-30% latency reduction on trivial bodies.
Multi-threaded BGW runtime
Switch from
new_current_thread()toRuntime::new()with 2-4 worker threads. Lets semaphore dispatch, duroxide I/O, and SQL execution overlap on separate OS threads.Expected: effective parallelism closer to
max_user_connectionsvalue.Batch node insertion in
df.start()Replace recursive per-node
INSERT INTO df.nodeswith a single multi-row INSERT or COPY. A 5-node graph currently does 5 SPI round-trips while holding ROW EXCLUSIVE locks.Expected: faster
df.start(), reduced lock hold time.Medium Impact
Separate ingest and progress write paths
Decouple user INSERTs from BGW status UPDATEs on
df.instances(e.g., separate status table, or append-only status log). Eliminates the row-level lock contention seen at c08+.Expected: higher sustained throughput at high concurrency.
Raise default pool sizes for larger SKUs
Bump
max_duroxide_connectionsto 15-20 andmax_management_connectionsto 10 for 8+ vCPU deployments. Current defaults are conservative.Expected: incremental improvement at high concurrency.
Stress Campaign Phases (summary)
pg_terminate_backend)Campaign totals: 17,471 instances completed, 2 failed (pre-existing residue), 0 new failures.
Observability Gaps Found
df.list_instances()has a hard ~10,000-row cap even with explicitlimit=100000df.instance_executions(instance_id, 1)returns no row for some completed instancesThese should be tracked separately.