Conversation
# Conflicts: # docs/changelog.md
- Make planner settings optional in manager config (planner_url, planner_token) so service can start without planner relation.
- Wire planner endpoint/token from charm relation into runtime state:
- add PlannerConfig in CharmState
- resolve planner token from Juju secret
- pass planner config through factories.py into ApplicationConfiguration.
- Update CLI startup to support dual modes with a shared lock:
- planner mode: run PressureReconciler
- legacy mode: fall back to start_reconcile_service when planner config is absent
- add explicit legacy-support comment in CLI.
- Refactor PressureReconciler behavior:
- inline create stream loop (remove _streaming_loop)
- add reconcile_interval (minutes) and fallback_runners
- cache last pressure in create loop
- run delete loop on timer using cached pressure (no streaming in delete path)
- on planner stream errors, reconcile to fallback_runners
- fix delete comparison to use desired_total vs current_total
- accept external shared lock.
- Update tests:
- config unit test for optional planner fields
- reconciler unit tests for fallback, cached delete behavior, and delete comparison fix
- planner state extraction/clearing tests moved to test_charm_state.py
- factory test for planner config wiring
- fixture updates for new planner_config field.
- Replaced stream_sequence/stream_interval_seconds with initial_pressure: float = 1.0
- Added _pressure_file_path(port), _read_pressure(path, default), and _pressure_stream_gen(path, default) as module-level helpers (this also brought complexity below the C901 limit)
- _make_app now writes initial_pressure to /tmp/planner_stub_{port}_pressure.json at startup; all endpoints read/write that file dynamically
- Added POST /control/pressure route
- Added PlannerStub.set_pressure(value) method
github-runner-manager/tests/integration/conftest.py
- Added from openstack.compute.v2.server import Server as OpenstackServer
- Added wait_for_runner(...) and wait_for_no_runners(...) as module-level functions (moved from test_debug_ssh.py / new)
github-runner-manager/tests/integration/test_debug_ssh.py
- Removed local wait_for_runner definition
- Added from .conftest import wait_for_runner
github-runner-manager/tests/integration/test_planner_runner.py (new)
- planner_app fixture: wires PlannerStub(initial_pressure=1.0) + RunningApplication
- test_planner_pressure_spawns_and_cleans_runner: waits for runner at pressure=1, sets pressure=0, waits for cleanup
tests/integration/test_charm_runner.py
- Deleted test_planner_pressure_spawns_and_cleans_single_runner and its exclusive imports (requests, ActiveStatus, Model)
tests/integration/conftest.py
- Removed mock_planner_http_app and mock_planner_http_unit_ip fixtures (verified no remaining references)
TICS Quality Gate✔️ Passedgithub-runner-operatorSee the results in the TICS Viewer The following files have been checked for this project
|
There was a problem hiding this comment.
Pull request overview
This pull request introduces a planner-driven pressure reconciliation mode for the GitHub runner manager, enabling dynamic scaling of runners based on pressure signals from a planner service. The implementation follows a dual-loop architecture (create and delete) designed to minimize API calls while maintaining low-latency scale-up.
Changes:
- Adds
PressureReconcilerandPlannerClientto enable planner-driven runner reconciliation - Introduces
planner_configextraction from Juju relations with endpoint and token handling via secrets - Implements a dual-loop architecture: streaming create loop for low-latency scale-up, timer-based delete loop for cleanup
- Extends
RunnerManagerwithcleanup_runners()method that returns cleaned VM IDs - Updates CLI to conditionally start planner mode or legacy mode based on configuration
- Adds comprehensive unit tests for planner client and pressure reconciler
- Includes integration test infrastructure with a lightweight planner stub server
- Documents design decisions in ADR-001
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| github-runner-manager/src/github_runner_manager/planner_client.py | New HTTP client for planner service with flavor/pressure APIs |
| github-runner-manager/src/github_runner_manager/manager/pressure_reconciler.py | Core pressure reconciler with dual-loop architecture |
| github-runner-manager/src/github_runner_manager/manager/runner_manager.py | Added cleanup_runners method returning cleaned VM IDs |
| github-runner-manager/src/github_runner_manager/configuration/base.py | Added optional planner_url and planner_token fields |
| github-runner-manager/src/github_runner_manager/cli.py | Conditional startup logic for planner vs legacy mode |
| github-runner-manager/src/github_runner_manager/openstack_cloud/openstack_runner_manager.py | Fixed null reference for runner_proxy_config in cloud-init |
| src/charm_state.py | Added PlannerConfig extraction from relations with secret handling |
| src/factories.py | Pass planner_url and planner_token to ApplicationConfiguration |
| github-runner-manager/tests/unit/test_pressure_reconciler.py | Unit tests for pressure reconciler behavior |
| github-runner-manager/tests/unit/test_planner_client.py | Unit tests for planner client API interactions |
| github-runner-manager/tests/integration/planner_stub.py | Lightweight Flask-based planner stub for integration testing |
| github-runner-manager/tests/integration/test_planner_runner.py | Integration test for full planner-driven lifecycle |
| tests/integration/helpers/common.py | Refactored wait_for helper to support both sync and async functions |
| docs/adr/001_pressure_reconciler_design.md | Architecture decision record documenting design rationale |
| docs/changelog.md | Added entry for planner reconciler feature |
| class _FakeManager: | ||
| """Lightweight runner manager stub for testing the reconciler.""" | ||
|
|
||
| def __init__(self, runners_count: int = 0): | ||
| """Initialize with an optional number of pre-existing runners.""" | ||
| self._runners = [object() for _ in range(runners_count)] | ||
| self.created_args: list[int] = [] | ||
| self.cleanup_called = 0 | ||
|
|
||
| def get_runners(self) -> list[object]: | ||
| """Return the current list of runners.""" | ||
| return list(self._runners) | ||
|
|
||
| def create_runners(self, num: int, metadata: object): # noqa: ARG002 | ||
| """Record the creation request and extend the internal runner list.""" | ||
| self.created_args.append(num) | ||
| if num > 0: | ||
| self._runners.extend(object() for _ in range(num)) | ||
|
|
||
| def cleanup_runners(self): | ||
| """Increment the cleanup counter.""" | ||
| self.cleanup_called += 1 |
There was a problem hiding this comment.
The _FakeManager test stub is missing the delete_runners method that is called by PressureReconciler._handle_timer_reconcile when scaling down. This will cause test_handle_timer_reconcile_uses_desired_total_not_raw_pressure to fail with an AttributeError when the timer reconcile path attempts to delete runners. Add a delete_runners method to the fake manager that tracks deletion calls.
| def _pressure_file_path(port: int) -> Path: | ||
| """Return the path to the pressure state file for the given port. | ||
|
|
||
| Port-namespaced to allow parallel test execution without conflicts. | ||
| """ | ||
| return Path(f"/tmp/planner_stub_{port}_pressure.json") |
There was a problem hiding this comment.
The pressure file path uses /tmp which can be subject to race conditions and security issues in multi-user environments. Consider using tempfile.gettempdir() with a secure subdirectory or tempfile.NamedTemporaryFile to ensure the file is created with proper permissions and in a secure location. Additionally, the port-based naming doesn't guarantee uniqueness if multiple test runs with the same port happen concurrently.
| planner_client=PlannerClient( | ||
| PlannerConfiguration(base_url=config.planner_url, token=config.planner_token) |
There was a problem hiding this comment.
The build_pressure_reconciler function is called only when config.planner_url and config.planner_token are truthy (line 129 in cli.py), so the Optional fields will not be None at runtime. However, the type system doesn't know this. Consider adding an explicit assertion or type guard at the beginning of build_pressure_reconciler to document this precondition and satisfy type checkers: assert config.planner_url is not None and config.planner_token is not None.
| is_disabled=data.get("is_disabled"), | ||
| minimum_pressure=data.get("minimum_pressure"), | ||
| ) | ||
| except (requests.RequestException, ValueError, requests.HTTPError) as exc: |
There was a problem hiding this comment.
The exception handling catches requests.HTTPError separately from requests.RequestException, but HTTPError is a subclass of RequestException. This makes the explicit requests.HTTPError clause redundant since it will never be reached. Remove requests.HTTPError from the except clause on line 112 (and similarly on line 138 if present).
| except (requests.RequestException, ValueError, requests.HTTPError) as exc: | |
| except (requests.RequestException, ValueError) as exc: |
| openstack_configuration: Configuration for authorization to a OpenStack host. | ||
| planner_url: Base URL of the planner service. | ||
| planner_token: Bearer token to authenticate against the planner service. | ||
| reconcile_interval: Seconds to wait between reconciliation. |
There was a problem hiding this comment.
The documentation for reconcile_interval is inconsistent. In ApplicationConfiguration (base.py:55), it's documented as "Seconds to wait between reconciliation", but the implementation in PressureReconcilerConfig (pressure_reconciler.py:47, 143) treats it as minutes (multiplies by 60 to get seconds). The integration test comment (test_planner_runner.py:56-57) also confirms the field is intended to be in minutes. Update the documentation in base.py line 55 to say "Minutes to wait between reconciliation" for consistency.
| reconcile_interval: Seconds to wait between reconciliation. | |
| reconcile_interval: Minutes to wait between reconciliation. |
|
Closing, as I will split this PR into two |
Applicable spec: ISD-4104
Overview
This PR adds the reconciliation mode based on the pressure coming from the planner.
It will be activated upon integration with the planner.
After successful validation in production, there will be a follow-up PR that removes the legacy reconciliation mode.
The PR is based on the work in #705 .
Rationale
With the new design, we will combine the reactive/non-reactive mode into one reconcilation based on a pressure information from th planner.
Juju Events Changes
n/a
Module Changes
Library Changes
n/a
Checklist
urgent,trivial,complex).github-runner-manager/pyproject.toml.