[WIP] feat: planner reconciler #732
# ADR-001: Pressure Reconciler Design for Planner Integration

**Author:** Christopher Bartz (christopher.bartz@canonical.com)

**Date:** 2026/02/20

**Domain:** Architecture

## Overview

This ADR documents the design of the `PressureReconciler`, which enables the
github-runner-manager to scale runners reactively in response to pressure signals
from the planner service, while preserving the existing timer-based reconciliation
for deployments without a planner.
## Context

The runner manager historically determined the desired runner count in two ways:
a static count from configuration (number of VMs per combination), or reactively
by consuming job messages from a MongoDB queue. With the introduction of the planner
charm, the desired runner count becomes a dynamic value driven by observed queue
depth, referred to as *pressure*. The runner manager must respond to pressure changes
quickly (to prevent queued jobs from waiting) while also periodically cleaning up
stale runners.

Two competing concerns shape the design:

- **Low-latency scale-up**: runners should be created as soon as the planner signals
  increased demand, not on a fixed reconcile tick.
- **Periodic cleanup**: stale runners (e.g. completed jobs whose VM was not yet
  reclaimed) must be removed on a regular schedule regardless of inbound pressure.
## Decision

The `PressureReconciler` runs two independent, long-lived loops that share a mutex
with the existing reconcile path:

1. **Create loop** – opens a long-lived streaming HTTP connection to the planner's
   `GET /api/v1/flavors/{name}/pressure?stream=true` endpoint and creates runners
   whenever the desired total exceeds the current total. Each pressure event updates
   a shared `_last_pressure` field consumed by the delete loop.

2. **Delete loop** – wakes on a configurable timer and calls `cleanup_runners` to
   remove stale VMs, then converges the runner count toward `_last_pressure` (the
   most recently observed pressure from the create loop). It does not fetch fresh
   pressure from the planner.
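The two loops and their shared state can be sketched as follows. This is a minimal illustration, not the actual implementation: `PressureReconciler`, `_last_pressure`, and `cleanup_runners` come from this ADR, while the cloud backend and its methods are hypothetical stand-ins.

```python
# Minimal sketch of the two-loop design (hypothetical names except
# PressureReconciler, _last_pressure, and cleanup_runners).
import threading
from typing import Iterable


class FakeCloud:
    """Hypothetical stand-in for the runner/VM backend."""

    def __init__(self) -> None:
        self.runners = 0

    def runner_count(self) -> int:
        return self.runners

    def create_runners(self, n: int) -> None:
        self.runners += n

    def delete_runners(self, n: int) -> None:
        self.runners -= n

    def cleanup_runners(self) -> None:
        pass  # a real backend would reclaim stale VMs here


class PressureReconciler:
    def __init__(self, cloud: FakeCloud, pressure_events: Iterable[int],
                 cleanup_interval: float = 60.0) -> None:
        self._cloud = cloud
        self._pressure_events = pressure_events  # stand-in for the planner stream
        self._cleanup_interval = cleanup_interval
        self._lock = threading.Lock()  # mutex shared with the reconcile path
        self._last_pressure = 0        # last pressure seen by the create loop
        self._stop = threading.Event()

    def _create_loop(self) -> None:
        # React to each streamed pressure event; the create loop only scales up.
        for pressure in self._pressure_events:
            with self._lock:
                self._last_pressure = pressure
                missing = pressure - self._cloud.runner_count()
                if missing > 0:
                    self._cloud.create_runners(missing)

    def _delete_tick(self) -> None:
        # Clean up stale runners, then converge toward the last observed
        # pressure; no fresh fetch from the planner happens here.
        with self._lock:
            self._cloud.cleanup_runners()
            excess = self._cloud.runner_count() - self._last_pressure
            if excess > 0:
                self._cloud.delete_runners(excess)

    def _delete_loop(self) -> None:
        while not self._stop.wait(self._cleanup_interval):
            self._delete_tick()
```

Driving the create loop with events `[3, 5, 2]` leaves five runners, since the create loop never scales down; a subsequent delete tick then converges the count to two.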
Planner mode is activated only when `planner_url` and `planner_token` are present
in configuration, allowing staged rollout before the legacy reconcile path is
removed.

When the streaming connection fails, the create loop falls back to
`fallback_runners` (configurable, default 0) and retries after a short backoff,
preventing a hot loop on transient planner outages.
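The gating and fallback behaviour can be sketched as below. `planner_url`, `planner_token`, and `fallback_runners` are the configuration fields named in this ADR; the function names and the attempt-counting retry are hypothetical simplifications (the real loop would sleep for a backoff between attempts rather than cap them).

```python
# Hypothetical sketch of planner-mode gating and stream fallback. The config
# field names come from the ADR; everything else is illustrative.
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class PlannerConfig:
    planner_url: Optional[str] = None
    planner_token: Optional[str] = None
    fallback_runners: int = 0  # desired count while the stream is down


def planner_mode_enabled(config: PlannerConfig) -> bool:
    """Planner mode is active only when both URL and token are configured."""
    return bool(config.planner_url and config.planner_token)


def pressure_with_fallback(
    config: PlannerConfig,
    open_stream: Callable[[], Iterator[int]],
    max_attempts: int = 3,
) -> Iterator[int]:
    """Yield pressure values, falling back and retrying on stream failure.

    A real implementation would sleep for a short backoff between attempts
    instead of capping them, preventing a hot loop on planner outages.
    """
    attempts = 0
    while attempts < max_attempts:
        try:
            yield from open_stream()
            return  # stream ended cleanly
        except ConnectionError:
            attempts += 1
            yield config.fallback_runners  # scale toward the configured fallback
```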
## Alternatives Explored

**Polling instead of streaming for creates.** A polling approach (e.g. fetching
pressure on each reconcile tick) is simpler, but it introduces a latency of up to
the polling interval between a demand spike and new runners appearing. Streaming
allows the manager to react within seconds of a pressure change.
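The latency argument can be quantified with a back-of-envelope calculation; the ten-minute tick below is a hypothetical value, not the project's actual interval.

```python
# With polling at interval T, a demand spike lands uniformly at random within
# a tick, so the mean detection latency is T/2 and the worst case is T.
poll_interval_s = 600  # hypothetical 10-minute reconcile tick
mean_latency_s = poll_interval_s / 2
worst_latency_s = poll_interval_s
print(mean_latency_s, worst_latency_s)  # 300.0 600
```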
**A single unified reconcile loop.** Combining create and delete into one loop
simplifies concurrency but forces a trade-off: either the loop runs frequently
(introducing excessive GitHub and OpenStack API calls) or it runs infrequently
(losing the low-latency create behaviour). Cleanup involves listing runners via
the GitHub API and querying OpenStack for VM state, calls that are expensive both
in latency and in quota. GitHub rate limiting has caused operational problems for
this project in the past, and OpenStack also degrades under high call rates.
Separate loops let creates react in near-real-time while keeping API call volume
proportional to the configured cleanup interval.

**Fetching fresh pressure in the delete loop.** Having the delete loop call
`GET /api/v1/flavors/{name}/pressure` itself would give it an up-to-date reading.
However, this adds an extra network round-trip on every timer tick, couples the
delete loop to planner availability, and is unnecessary because any over-deletion
caused by a stale reading is self-correcting: the create loop will scale back up
on the next streaming event.
## Tradeoffs

The delete loop operates on a stale pressure value: it sees the last pressure
reported to the create loop rather than a live reading. The staleness window is
bounded by the planner's stream update frequency. Any over-deletion in that window
is self-correcting because the create loop re-scales up on the next event. This is
an acceptable trade-off given that scale-down correctness is less time-critical
than scale-up.
**Review comment:**

> The documentation for `reconcile_interval` is inconsistent. In
> `ApplicationConfiguration` (base.py:55), it's documented as "Seconds to wait
> between reconciliation", but the implementation in `PressureReconcilerConfig`
> (pressure_reconciler.py:47, 143) treats it as minutes (multiplies by 60 to
> get seconds). The integration test comment (test_planner_runner.py:56-57)
> also confirms the field is intended to be in minutes. Update the
> documentation in base.py line 55 to say "Minutes to wait between
> reconciliation" for consistency.