Problem
When PgDog opens a cohort of backend connections in a tight window — e.g. during a traffic ramp at the start of a sale — those connections share a near-identical birth time. With a finite server_lifetime, the entire cohort hits the cap at near-identical times, producing one large reconnect wave at every server_lifetime interval. Clients that need a slot during the wave queue up while PgDog reopens connections, which surfaces as a periodic latency stripe in application metrics.
This is the same failure mode HikariCP addresses with built-in maxLifetime jitter and that AWS RDS Proxy avoids by spreading recycle events across a window.
Proposal
Add a configuration option to randomize server_lifetime per connection.
Option A — explicit jitter knob (preferred, matches HikariCP):
```toml
[general]
server_lifetime = 3600000        # 1h base
server_lifetime_jitter = 600000  # +/- up to 10 min, sampled per connection at creation time
```
Each backend connection's effective lifetime would be server_lifetime + uniform(-server_lifetime_jitter, +server_lifetime_jitter). Default server_lifetime_jitter = 0 preserves current behavior.
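A minimal sketch of the sampling Option A describes, under stated assumptions: the function name `effective_lifetime_ms` is illustrative, not a PgDog internal, and a tiny xorshift PRNG stands in for whatever process-global randomness source the pool already has.

```rust
/// Tiny xorshift64 PRNG as a stand-in randomness source; PgDog would use
/// whatever process-global RNG it already has. Seed must be nonzero.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical sketch: server_lifetime + uniform(-jitter, +jitter),
/// computed once at connection creation, floored at zero.
fn effective_lifetime_ms(base_ms: u64, jitter_ms: u64, rng: &mut Rng) -> u64 {
    if jitter_ms == 0 {
        return base_ms; // default preserves current behavior
    }
    // Sample an offset in [-jitter_ms, +jitter_ms].
    let span = 2 * jitter_ms + 1;
    let offset = (rng.next() % span) as i64 - jitter_ms as i64;
    (base_ms as i64 + offset).max(0) as u64
}
```

With the example config above, each connection's lifetime lands somewhere in [50 min, 70 min], spreading the reconnect wave across a 20-minute window.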
Option B — implicit fractional jitter:
```toml
[general]
server_lifetime = 3600000
server_lifetime_jitter_fraction = 0.1  # +/- 10% of server_lifetime
```
Either works; (A) is more explicit and easier to reason about for operators.
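For completeness: Option B is just a reparameterization of Option A, with the absolute jitter derived from the fraction at config-load time. A sketch, with an illustrative function name:

```rust
// Hypothetical: convert Option B's fraction into Option A's absolute
// jitter once, when the config is loaded; sampling is then identical.
fn jitter_from_fraction(server_lifetime_ms: u64, fraction: f64) -> u64 {
    // e.g. 3_600_000 ms * 0.1 = 360_000 ms, i.e. +/- 6 min
    (server_lifetime_ms as f64 * fraction).round() as u64
}
```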
Behavioral details that matter
- Jitter must be sampled once at connection creation, not on every check, so a connection's lifetime is deterministic from its birth time. Otherwise a connection could keep "rolling" past its cap.
- Jitter should be symmetric (+/-), not subtractive-only: operators want the average lifetime to remain server_lifetime, not skew shorter.
- The randomness source can be process-global; cryptographic randomness is not required.
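The first point above can be sketched as follows; struct and field names are hypothetical, not PgDog internals. The jittered deadline is fixed at creation, so every subsequent expiry check compares against the same value and the lifetime can never "roll".

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch of the "sample once at creation" rule.
struct BackendConn {
    created_at: Instant,
    /// Sampled exactly once, at creation; never re-rolled afterwards.
    effective_lifetime: Duration,
}

impl BackendConn {
    /// `sampled_offset_ms` is the uniform(-jitter, +jitter) draw made
    /// at creation time.
    fn new(base: Duration, sampled_offset_ms: i64) -> Self {
        let base_ms = base.as_millis() as i64;
        Self {
            created_at: Instant::now(),
            effective_lifetime: Duration::from_millis(
                (base_ms + sampled_offset_ms).max(0) as u64,
            ),
        }
    }

    /// Deterministic given birth time: repeated checks always compare
    /// against the same stored deadline.
    fn expired(&self, now: Instant) -> bool {
        now.duration_since(self.created_at) >= self.effective_lifetime
    }
}
```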
Workarounds today
The only mitigations available are increasing server_lifetime (reduces wave frequency but doesn't eliminate the cohort effect) or removing it entirely (gives up backend-memory recycling). Neither is a substitute for jitter.
Prior art
- HikariCP: maxLifetime is jittered internally by up to 2.5% by default; documented for exactly this reason.
- AWS RDS Proxy spreads idle connection recycling across a window to avoid synchronized retirement.
Happy to send a PR if there's interest in (A). Would also be open to a min_lifetime/max_lifetime shape if maintainers prefer that.