[WIP] feat: planner reconciler #732
# ADR-001: Pressure Reconciler Design for Planner Integration

**Author:** Christopher Bartz (christopher.bartz@canonical.com)

**Date:** 2026/02/20

**Domain:** Architecture

## Overview

This ADR documents the design of the `PressureReconciler`, which enables the
github-runner-manager to scale runners reactively in response to pressure signals
from the planner service, while preserving the existing timer-based reconciliation
for deployments without a planner.
## Context

The runner manager historically determined the desired runner count in two ways:
a static count from configuration (number of VMs per combination), or reactively
by consuming job messages from a MongoDB queue. With the introduction of the planner
charm, the desired runner count becomes a dynamic value driven by observed queue
depth, referred to as *pressure*. The runner manager must respond to pressure changes
quickly (to prevent queued jobs from waiting) while also periodically cleaning up
stale runners.

Two competing concerns shape the design:

- **Low-latency scale-up**: runners should be created as soon as the planner signals
  increased demand, not on a fixed reconcile tick.
- **Periodic cleanup**: stale runners (e.g. completed jobs whose VM was not yet
  reclaimed) must be removed on a regular schedule regardless of inbound pressure.
## Decision

The `PressureReconciler` runs two independent, long-lived loops that share a mutex
with the existing reconcile path:

1. **Create loop** – opens a long-lived streaming HTTP connection to the planner's
   `GET /api/v1/flavors/{name}/pressure?stream=true` endpoint and creates runners
   whenever the desired total exceeds the current total. Each pressure event updates
   a shared `_last_pressure` field consumed by the delete loop.

2. **Delete loop** – wakes on a configurable timer and calls `cleanup_runners` to
   remove stale VMs, then converges the runner count toward `_last_pressure` (the
   most recently observed pressure from the create loop). It does not fetch fresh
   pressure from the planner.
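The two loops and their shared state can be sketched as follows. This is a minimal illustration, not the actual implementation: `PressureReconciler`, `_last_pressure`, and `cleanup_runners` come from this ADR, while the cloud backend and its methods are hypothetical stand-ins.

```python
# Minimal sketch of the two-loop design (hypothetical names except
# PressureReconciler, _last_pressure, and cleanup_runners).
import threading
from typing import Iterable


class FakeCloud:
    """Hypothetical stand-in for the runner/VM backend."""

    def __init__(self) -> None:
        self.runners = 0

    def runner_count(self) -> int:
        return self.runners

    def create_runners(self, n: int) -> None:
        self.runners += n

    def delete_runners(self, n: int) -> None:
        self.runners -= n

    def cleanup_runners(self) -> None:
        pass  # a real backend would reclaim stale VMs here


class PressureReconciler:
    def __init__(self, cloud: FakeCloud, pressure_events: Iterable[int],
                 cleanup_interval: float = 60.0) -> None:
        self._cloud = cloud
        self._pressure_events = pressure_events  # stand-in for the planner stream
        self._cleanup_interval = cleanup_interval
        self._lock = threading.Lock()  # mutex shared with the reconcile path
        self._last_pressure = 0        # last pressure seen by the create loop
        self._stop = threading.Event()

    def _create_loop(self) -> None:
        # React to each streamed pressure event; the create loop only scales up.
        for pressure in self._pressure_events:
            with self._lock:
                self._last_pressure = pressure
                missing = pressure - self._cloud.runner_count()
                if missing > 0:
                    self._cloud.create_runners(missing)

    def _delete_tick(self) -> None:
        # Clean up stale runners, then converge toward the last observed
        # pressure; no fresh fetch from the planner happens here.
        with self._lock:
            self._cloud.cleanup_runners()
            excess = self._cloud.runner_count() - self._last_pressure
            if excess > 0:
                self._cloud.delete_runners(excess)

    def _delete_loop(self) -> None:
        while not self._stop.wait(self._cleanup_interval):
            self._delete_tick()
```

Driving the create loop with events `[3, 5, 2]` leaves five runners, since the create loop never scales down; a subsequent delete tick then converges the count to two.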
Planner mode is activated only when `planner_url` and `planner_token` are present
in configuration, allowing staged rollout before the legacy reconcile path is
removed.

When the streaming connection fails, the create loop falls back to
`fallback_runners` (configurable, default 0) and retries after a short backoff,
preventing a hot loop on transient planner outages.
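The gating and fallback behaviour can be sketched as below. `planner_url`, `planner_token`, and `fallback_runners` are the configuration fields named in this ADR; the function names and the attempt-counting retry are hypothetical simplifications (the real loop would sleep for a backoff between attempts rather than cap them).

```python
# Hypothetical sketch of planner-mode gating and stream fallback. The config
# field names come from the ADR; everything else is illustrative.
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class PlannerConfig:
    planner_url: Optional[str] = None
    planner_token: Optional[str] = None
    fallback_runners: int = 0  # desired count while the stream is down


def planner_mode_enabled(config: PlannerConfig) -> bool:
    """Planner mode is active only when both URL and token are configured."""
    return bool(config.planner_url and config.planner_token)


def pressure_with_fallback(
    config: PlannerConfig,
    open_stream: Callable[[], Iterator[int]],
    max_attempts: int = 3,
) -> Iterator[int]:
    """Yield pressure values, falling back and retrying on stream failure.

    A real implementation would sleep for a short backoff between attempts
    instead of capping them, preventing a hot loop on planner outages.
    """
    attempts = 0
    while attempts < max_attempts:
        try:
            yield from open_stream()
            return  # stream ended cleanly
        except ConnectionError:
            attempts += 1
            yield config.fallback_runners  # scale toward the configured fallback
```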
## Alternatives Explored

**Polling instead of streaming for creates.** A polling approach (e.g. fetching
pressure on each reconcile tick) is simpler, but it introduces a latency of up to
the polling interval between a demand spike and new runners appearing. Streaming
allows the manager to react within seconds of a pressure change.
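The latency argument can be quantified with a back-of-envelope calculation; the ten-minute tick below is a hypothetical value, not the project's actual interval.

```python
# With polling at interval T, a demand spike lands uniformly at random within
# a tick, so the mean detection latency is T/2 and the worst case is T.
poll_interval_s = 600  # hypothetical 10-minute reconcile tick
mean_latency_s = poll_interval_s / 2
worst_latency_s = poll_interval_s
print(mean_latency_s, worst_latency_s)  # 300.0 600
```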
**A single unified reconcile loop.** Combining create and delete into one loop
simplifies concurrency but forces a trade-off: either the loop runs frequently
(introducing excessive GitHub and OpenStack API calls) or it runs infrequently
(losing the low-latency create behaviour). Cleanup involves listing runners via
the GitHub API and querying OpenStack for VM state, calls that are expensive both
in latency and in quota. GitHub rate limiting has caused operational problems for
this project in the past, and OpenStack also degrades under high call rates.
Separate loops let creates react in near-real-time while keeping API call volume
proportional to the configured cleanup interval.

**Fetching fresh pressure in the delete loop.** Having the delete loop call
`GET /api/v1/flavors/{name}/pressure` itself would give it an up-to-date reading.
However, this adds an extra network round-trip on every timer tick, couples the
delete loop to planner availability, and is unnecessary because any over-deletion
caused by a stale reading is self-correcting: the create loop will scale back up
on the next streaming event.
## Tradeoffs

The delete loop operates on a stale pressure value: it sees the last pressure
reported to the create loop rather than a live reading. The staleness window is
bounded by the planner's stream update frequency. Any over-deletion in that window
is self-correcting because the create loop re-scales up on the next event. This is
an acceptable trade-off given that scale-down correctness is less time-critical
than scale-up.
**Review comment:**

> The documentation for `reconcile_interval` is inconsistent. In
> `ApplicationConfiguration` (base.py:55), it's documented as "Seconds to wait
> between reconciliation", but the implementation in `PressureReconcilerConfig`
> (pressure_reconciler.py:47, 143) treats it as minutes (multiplies by 60 to
> get seconds). The integration test comment (test_planner_runner.py:56-57)
> also confirms the field is intended to be in minutes. Update the
> documentation in base.py line 55 to say "Minutes to wait between
> reconciliation" for consistency.