Performance bottlenecks

## Summary

Stress testing on an 8-vCPU HorizonDB AE cluster (17,471 instances, 0 new failures) identified four bottlenecks limiting throughput to ~25 inst/s sustained and ~50 starts/sec ingest ceiling. This issue captures the analysis and improvement opportunities.

Full stress campaign report: `zperf.md` in the repo root (attached as a comment if available).

## Key Numbers from Stress Campaign

| Metric | Value |
|--------|-------|
| Per-instance overhead (trivial body) | **~60ms** |
| `df.start()` round-trip (serial) | **~130ms** |
| Sustained completion ceiling | **~25 inst/s** (at concurrency 4) |
| Ingest ceiling | **~50 starts/sec** (at concurrency ≥16) |
| Effective worker parallelism | **~6** (on 8 vCPU, `max_user_connections=10`) |
| Latency cliff | **c04 → c08**: p50 jumps from 150ms to 14,410ms |

## Bottleneck Analysis

### 1. Fresh connection per SQL node (~10-20ms per node)

`connect_as_user()` in `src/types.rs` opens a **new `PgConnection`** (TCP connect + auth + `SET df.in_workflow`) for every SQL activity execution, then drops it. There is no connection pooling for user-execution connections — the semaphore in `execute_sql.rs` only limits concurrency, it does not cache connections.

For trivial bodies like `SELECT 1`, connection setup dominates execution time and is a significant chunk of the 60ms floor.

### 2. Write contention between `df.start()` and BGW (`df.instances` table)

Phase 1 shows linear scaling to c04, then latency explodes 80x at c08. The stress report notes: *"df.start writers and workers are stepping on each other."*

Both paths write to `df.instances`:
- **User backends** INSERT rows (via `df.start()` → SPI)
- **BGW activities** UPDATE status columns

At c08+ the ROW EXCLUSIVE locks and MVCC vacuum pressure create queueing on the shared table.

### 3. Single-threaded BGW tokio runtime

The background worker runs `tokio::runtime::Builder::new_current_thread()`. All async tasks — semaphore dispatch, duroxide state I/O, SQL execution, result encoding — share **one OS thread**. This caps effective parallelism at ~6 even though `max_user_connections=10` permits are available.

### 4. Connection pool sizing (secondary)

| Pool | Default | Observation |
|------|---------|-------------|
| `max_user_connections` | 10 | Not fully saturated — runtime can't feed it fast enough |
| `max_duroxide_connections` | 10 | 9 usable (1 for listener). Serialization point for orchestration state I/O |
| `max_management_connections` | 6 | Shared between graph loading, status updates, and heartbeats. Can stall under load |

**Connection limits are not the primary bottleneck.** Even raising all limits would not help much because the single-threaded runtime cannot drive more concurrent work.

## Improvement Opportunities (ranked by expected impact)

### High Impact

1. **Connection pooling for user-execution**
   Replace per-activity `PgConnection::connect_with()` in `connect_as_user()` with a per-user connection cache keyed by `(submitted_by, database)`. Eliminates ~10-20ms TCP+auth per SQL node.
   *Expected: 15-30% latency reduction on trivial bodies.*

2. **Multi-threaded BGW runtime**
   Switch from `new_current_thread()` to `Runtime::new()` with 2-4 worker threads. Lets semaphore dispatch, duroxide I/O, and SQL execution overlap on separate OS threads.
   *Expected: effective parallelism closer to `max_user_connections` value.*

3. **Batch node insertion in `df.start()`**
   Replace recursive per-node `INSERT INTO df.nodes` with a single multi-row INSERT or COPY. A 5-node graph currently does 5 SPI round-trips while holding ROW EXCLUSIVE locks.
   *Expected: faster `df.start()`, reduced lock hold time.*

### Medium Impact

4. **Separate ingest and progress write paths**
   Decouple user INSERTs from BGW status UPDATEs on `df.instances` (e.g., separate status table, or append-only status log). Eliminates the row-level lock contention seen at c08+.
   *Expected: higher sustained throughput at high concurrency.*

5. **Raise default pool sizes for larger SKUs**
   Bump `max_duroxide_connections` to 15-20 and `max_management_connections` to 10 for 8+ vCPU deployments. Current defaults are conservative.
   *Expected: incremental improvement at high concurrency.*

## Stress Campaign Phases (summary)

| Phase | Goal | Result |
|-------|------|--------|
| 0 — Calibration | Baseline per-instance latency | ✅ 60ms/inst overhead, tight tails |
| 1 — Throughput sweep | Find sustained ceiling | ✅ 25 inst/s sustained, 50 starts/sec ingest |
| 2 — Concurrency saturation | Effective parallelism | ✅ ~6 workers effective on 8 vCPU |
| 3 — Mixed workload | Class starvation check | ✅ No starvation (per-class latency not captured — observability gap) |
| 4 — Durability under restart | Replay after termination | ⚠️ Inconclusive (HDB role lacks SUPERUSER for `pg_terminate_backend`) |
| 5 — Soak (10 min) | Sustained load stability | ✅ 12,515 instances, 20.86 inst/s, zero failures, no drift |

**Campaign totals**: 17,471 instances completed, 2 failed (pre-existing residue), 0 new failures.

## Observability Gaps Found

1. `df.list_instances()` has a hard ~10,000-row cap even with explicit `limit=100000`
2. `df.instance_executions(instance_id, 1)` returns no row for some completed instances

These should be tracked separately.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance bottlenecks #147

Summary

Key Numbers from Stress Campaign

Bottleneck Analysis

1. Fresh connection per SQL node (~10-20ms per node)

2. Write contention between `df.start()` and BGW (`df.instances` table)

3. Single-threaded BGW tokio runtime

4. Connection pool sizing (secondary)

Improvement Opportunities (ranked by expected impact)

High Impact

Medium Impact

Stress Campaign Phases (summary)

Observability Gaps Found

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Metric	Value
Per-instance overhead (trivial body)	~60ms
`df.start()` round-trip (serial)	~130ms
Sustained completion ceiling	~25 inst/s (at concurrency 4)
Ingest ceiling	~50 starts/sec (at concurrency ≥16)
Effective worker parallelism	~6 (on 8 vCPU, `max_user_connections=10`)
Latency cliff	c04 → c08: p50 jumps from 150ms to 14,410ms

Pool	Default	Observation
`max_user_connections`	10	Not fully saturated — runtime can't feed it fast enough
`max_duroxide_connections`	10	9 usable (1 for listener). Serialization point for orchestration state I/O
`max_management_connections`	6	Shared between graph loading, status updates, and heartbeats. Can stall under load

Phase	Goal	Result
0 — Calibration	Baseline per-instance latency	✅ 60ms/inst overhead, tight tails
1 — Throughput sweep	Find sustained ceiling	✅ 25 inst/s sustained, 50 starts/sec ingest
2 — Concurrency saturation	Effective parallelism	✅ ~6 workers effective on 8 vCPU
3 — Mixed workload	Class starvation check	✅ No starvation (per-class latency not captured — observability gap)
4 — Durability under restart	Replay after termination	⚠️ Inconclusive (HDB role lacks SUPERUSER for `pg_terminate_backend`)
5 — Soak (10 min)	Sustained load stability	✅ 12,515 instances, 20.86 inst/s, zero failures, no drift

Performance bottlenecks #147

Description

Summary

Key Numbers from Stress Campaign

Bottleneck Analysis

1. Fresh connection per SQL node (~10-20ms per node)

2. Write contention between df.start() and BGW (df.instances table)

3. Single-threaded BGW tokio runtime

4. Connection pool sizing (secondary)

Improvement Opportunities (ranked by expected impact)

High Impact

Medium Impact

Stress Campaign Phases (summary)

Observability Gaps Found

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

2. Write contention between `df.start()` and BGW (`df.instances` table)