Jitter for server_lifetime to break synchronized backend reconnect waves #962

@frangz

Problem

When PgDog opens a cohort of backend connections in a tight window — e.g. during a traffic ramp at the start of a sale — those connections share a near-identical birth time. With a finite server_lifetime, the entire cohort hits the cap at near-identical times, producing one large reconnect wave at every server_lifetime interval. Clients that need a slot during the wave queue up while PgDog reopens connections, which surfaces as a periodic latency stripe in application metrics.
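To make the cohort effect concrete, here is a minimal sketch (not PgDog code) of expiry times without jitter. The cohort size and the 20 ms spacing are made-up illustration values; only the 1h server_lifetime comes from the examples in this issue.

```python
SERVER_LIFETIME_MS = 3_600_000  # 1h base, same value as the config examples in this issue


def expiry_times(birth_times_ms, lifetime_ms=SERVER_LIFETIME_MS):
    """Without jitter, each connection expires exactly lifetime_ms after its birth."""
    return [t + lifetime_ms for t in birth_times_ms]


# Hypothetical cohort: 100 connections opened 20 ms apart during a ramp.
births = [i * 20 for i in range(100)]
expiries = expiry_times(births)

birth_window_ms = max(births) - min(births)
expiry_window_ms = max(expiries) - min(expiries)

# The expiry window is exactly as narrow as the birth window, so the whole
# cohort is recycled together, and again one lifetime later.
print(birth_window_ms, expiry_window_ms)  # prints "1980 1980"
```

A fixed lifetime only shifts the cohort in time; it never widens it, which is why the wave repeats at every server_lifetime interval.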

This is the same failure mode HikariCP addresses with built-in maxLifetime jitter and that AWS RDS Proxy avoids by spreading recycle events across a window.

Proposal

Add a configuration option to randomize server_lifetime per connection.

Option A — explicit jitter knob (preferred, matches HikariCP):

[general]
server_lifetime = 3600000          # 1h base
server_lifetime_jitter = 600000    # +/- up to 10 min, sampled per connection at creation time

Each backend connection's effective lifetime would be server_lifetime + uniform(-server_lifetime_jitter, +server_lifetime_jitter). Default server_lifetime_jitter = 0 preserves current behavior.
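As a rough sketch of the Option A formula, assuming integer milliseconds: the function name effective_lifetime_ms is hypothetical, and rejecting a jitter larger than the base lifetime is an assumption of this sketch, not something the proposal specifies.

```python
import random


def effective_lifetime_ms(server_lifetime, server_lifetime_jitter, rng=random):
    """Return server_lifetime + uniform(-jitter, +jitter), in integer milliseconds.

    Assumption (not part of the proposal): a jitter outside
    [0, server_lifetime] is rejected so the result can never be negative.
    """
    if not 0 <= server_lifetime_jitter <= server_lifetime:
        raise ValueError("server_lifetime_jitter must be in [0, server_lifetime]")
    if server_lifetime_jitter == 0:
        return server_lifetime  # default: current behavior, no randomness involved
    # randint is inclusive on both ends, so the offset is symmetric around 0.
    offset = rng.randint(-server_lifetime_jitter, server_lifetime_jitter)
    return server_lifetime + offset


# With the values from the config above, every sample falls between 50 and 70 minutes.
sample = effective_lifetime_ms(3_600_000, 600_000)
print(3_000_000 <= sample <= 4_200_000)  # prints "True"
```

Because the offset is symmetric, the mean effective lifetime stays at server_lifetime.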

Option B — implicit fractional jitter:

[general]
server_lifetime = 3600000
server_lifetime_jitter_fraction = 0.1   # +/- 10% of server_lifetime

Either works; (A) is more explicit and easier to reason about for operators.
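The two options differ only in how the jitter width is expressed. A small sketch of the conversion, assuming Option B is resolved to an absolute width once at config load time (jitter_from_fraction is a hypothetical helper name):

```python
def jitter_from_fraction(server_lifetime, server_lifetime_jitter_fraction):
    """Convert an Option B fraction into the absolute Option A jitter, in ms."""
    if not 0.0 <= server_lifetime_jitter_fraction <= 1.0:
        raise ValueError("fraction must be in [0.0, 1.0]")
    # round() absorbs floating-point noise and yields an integer number of ms.
    return round(server_lifetime * server_lifetime_jitter_fraction)


# Option B example from above: 10% of 1h is a 6 minute jitter on either side.
print(jitter_from_fraction(3_600_000, 0.1))  # prints "360000"
```

Note that the Option A example (600000 ms on a 3600000 ms base) corresponds to a fraction of about 0.167, so the two config snippets in this issue are not equivalent to each other.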

Behavioral details that matter

  • Jitter must be sampled once at connection creation, not on every check, so a connection's lifetime is deterministic from its birth time. Otherwise a connection could keep "rolling" past its cap.
  • Jitter should be symmetric around server_lifetime (it can lengthen as well as shorten a connection's lifetime), not subtractive-only: operators want the average lifetime to remain server_lifetime, not skew shorter.
  • The randomness source can be process-global; cryptographic randomness is not required.
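The three points above can be sketched as follows. BackendConn, its fields, and the open constructor are hypothetical names for illustration and do not reflect PgDog's actual types; timestamps are plain integer milliseconds supplied by the caller.

```python
import random
from dataclasses import dataclass

# One process-global, non-cryptographic generator is enough for this purpose.
_RNG = random.Random()


@dataclass(frozen=True)
class BackendConn:
    created_at_ms: int
    lifetime_ms: int  # sampled once in open(); frozen so it cannot change later

    @classmethod
    def open(cls, now_ms, server_lifetime, server_lifetime_jitter, rng=_RNG):
        # Symmetric offset: lifetimes average out to server_lifetime.
        offset = rng.randint(-server_lifetime_jitter, server_lifetime_jitter)
        return cls(created_at_ms=now_ms, lifetime_ms=server_lifetime + offset)

    @property
    def expires_at_ms(self):
        return self.created_at_ms + self.lifetime_ms

    def is_expired(self, now_ms):
        # No sampling here: repeated checks always compare against the same deadline.
        return now_ms >= self.expires_at_ms


conn = BackendConn.open(now_ms=0, server_lifetime=3_600_000, server_lifetime_jitter=600_000)
deadline = conn.expires_at_ms
print(conn.is_expired(deadline - 1), conn.is_expired(deadline))  # prints "False True"
```

Resampling inside is_expired would instead give each check a fresh deadline, which is the "rolling" behavior the first point warns about.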

Workarounds today

The only mitigations available are increasing server_lifetime (reduces wave frequency but doesn't eliminate the cohort effect) or removing it entirely (gives up backend-memory recycling). Neither is a substitute for jitter.

Prior art

  • HikariCP: maxLifetime is shortened internally by a random amount of up to 2.5% per connection; the README documents this as a measure against mass extinction in the pool.
  • AWS RDS Proxy spreads idle connection recycling across a window to avoid synchronized retirement.

Happy to send a PR if there's interest in (A). Would also be open to a min_lifetime/max_lifetime shape if maintainers prefer that.
