Skip to content

[WIP] feat: planner reconciler#732

Closed
cbartz wants to merge 42 commits intomainfrom
feat/planner-reconciler-ISD4104
Closed

[WIP] feat: planner reconciler#732
cbartz wants to merge 42 commits intomainfrom
feat/planner-reconciler-ISD4104

Conversation

@cbartz
Copy link
Collaborator

@cbartz cbartz commented Feb 20, 2026

Applicable spec: ISD-4104

Overview

This PR adds the reconciliation mode based on the pressure coming from the planner.

It will be activated upon integration with the planner.

After successful validation in production, there will be a follow-up PR that removes the legacy reconciliation mode.

The PR is based on the work in #705 .

Rationale

With the new design, we will combine the reactive/non-reactive mode into one reconcilation based on a pressure information from th planner.

Juju Events Changes

n/a

Module Changes

Library Changes

n/a

Checklist

  • The charm style guide was applied.
  • The contributing guide was applied.
  • The changes are compliant with ISD054 - Managing Charm Complexity
  • The documentation for charmhub is updated.
  • The PR is tagged with appropriate label (urgent, trivial, complex).
  • The changelog is updated with changes that affects the users of the charm.
  • The application version number is updated in github-runner-manager/pyproject.toml.

yanksyoon and others added 30 commits January 22, 2026 01:41
  - Make planner settings optional in manager config (planner_url, planner_token) so service can start without planner relation.
  - Wire planner endpoint/token from charm relation into runtime state:
      - add PlannerConfig in CharmState
      - resolve planner token from Juju secret
      - pass planner config through factories.py into ApplicationConfiguration.
  - Update CLI startup to support dual modes with a shared lock:
      - planner mode: run PressureReconciler
      - legacy mode: fall back to start_reconcile_service when planner config is absent
      - add explicit legacy-support comment in CLI.
  - Refactor PressureReconciler behavior:
      - inline create stream loop (remove _streaming_loop)
      - add reconcile_interval (minutes) and fallback_runners
      - cache last pressure in create loop
      - run delete loop on timer using cached pressure (no streaming in delete path)
      - on planner stream errors, reconcile to fallback_runners
      - fix delete comparison to use desired_total vs current_total
      - accept external shared lock.
  - Update tests:
      - config unit test for optional planner fields
      - reconciler unit tests for fallback, cached delete behavior, and delete comparison fix
      - planner state extraction/clearing tests moved to test_charm_state.py
      - factory test for planner config wiring
      - fixture updates for new planner_config field.
  - Replaced stream_sequence/stream_interval_seconds with initial_pressure: float = 1.0
  - Added _pressure_file_path(port), _read_pressure(path, default), and _pressure_stream_gen(path, default) as module-level helpers (this also brought complexity below the C901 limit)
  - _make_app now writes initial_pressure to /tmp/planner_stub_{port}_pressure.json at startup; all endpoints read/write that file dynamically
  - Added POST /control/pressure route
  - Added PlannerStub.set_pressure(value) method

  github-runner-manager/tests/integration/conftest.py
  - Added from openstack.compute.v2.server import Server as OpenstackServer
  - Added wait_for_runner(...) and wait_for_no_runners(...) as module-level functions (moved from test_debug_ssh.py / new)

  github-runner-manager/tests/integration/test_debug_ssh.py
  - Removed local wait_for_runner definition
  - Added from .conftest import wait_for_runner

  github-runner-manager/tests/integration/test_planner_runner.py (new)
  - planner_app fixture: wires PlannerStub(initial_pressure=1.0) + RunningApplication
  - test_planner_pressure_spawns_and_cleans_runner: waits for runner at pressure=1, sets pressure=0, waits for cleanup

  tests/integration/test_charm_runner.py
  - Deleted test_planner_pressure_spawns_and_cleans_single_runner and its exclusive imports (requests, ActiveStatus, Model)

  tests/integration/conftest.py
  - Removed mock_planner_http_app and mock_planner_http_unit_ip fixtures (verified no remaining references)
@github-actions
Copy link
Contributor

TICS Quality Gate

✔️ Passed

github-runner-operator

See the results in the TICS Viewer

The following files have been checked for this project
  • github-runner-manager/src/github_runner_manager/cli.py
  • github-runner-manager/src/github_runner_manager/configuration/base.py
  • github-runner-manager/src/github_runner_manager/manager/pressure_reconciler.py
  • github-runner-manager/src/github_runner_manager/manager/runner_manager.py
  • github-runner-manager/src/github_runner_manager/openstack_cloud/openstack_runner_manager.py
  • github-runner-manager/src/github_runner_manager/planner_client.py
  • github-runner-manager/tests/integration/application.py
  • github-runner-manager/tests/integration/conftest.py
  • github-runner-manager/tests/integration/factories.py
  • github-runner-manager/tests/integration/planner_stub.py
  • github-runner-manager/tests/integration/test_debug_ssh.py
  • github-runner-manager/tests/integration/test_planner_runner.py
  • github-runner-manager/tests/unit/test_config.py
  • github-runner-manager/tests/unit/test_planner_client.py
  • github-runner-manager/tests/unit/test_pressure_reconciler.py
  • github-runner-manager/tests/unit/test_runner_scaler.py
  • lib/charms/data_platform_libs/v0/data_interfaces.py
  • lib/charms/grafana_agent/v0/cos_agent.py
  • lib/charms/operator_libs_linux/v1/systemd.py
  • src/charm_state.py
  • src/factories.py
  • tests/integration/helpers/common.py
  • tests/unit/conftest.py
  • tests/unit/test_charm_state.py
  • tests/unit/test_factories.py

.github/workflows/tics.yaml / TICS / TICS GitHub Action

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request introduces a planner-driven pressure reconciliation mode for the GitHub runner manager, enabling dynamic scaling of runners based on pressure signals from a planner service. The implementation follows a dual-loop architecture (create and delete) designed to minimize API calls while maintaining low-latency scale-up.

Changes:

  • Adds PressureReconciler and PlannerClient to enable planner-driven runner reconciliation
  • Introduces planner_config extraction from Juju relations with endpoint and token handling via secrets
  • Implements a dual-loop architecture: streaming create loop for low-latency scale-up, timer-based delete loop for cleanup
  • Extends RunnerManager with cleanup_runners() method that returns cleaned VM IDs
  • Updates CLI to conditionally start planner mode or legacy mode based on configuration
  • Adds comprehensive unit tests for planner client and pressure reconciler
  • Includes integration test infrastructure with a lightweight planner stub server
  • Documents design decisions in ADR-001

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
github-runner-manager/src/github_runner_manager/planner_client.py New HTTP client for planner service with flavor/pressure APIs
github-runner-manager/src/github_runner_manager/manager/pressure_reconciler.py Core pressure reconciler with dual-loop architecture
github-runner-manager/src/github_runner_manager/manager/runner_manager.py Added cleanup_runners method returning cleaned VM IDs
github-runner-manager/src/github_runner_manager/configuration/base.py Added optional planner_url and planner_token fields
github-runner-manager/src/github_runner_manager/cli.py Conditional startup logic for planner vs legacy mode
github-runner-manager/src/github_runner_manager/openstack_cloud/openstack_runner_manager.py Fixed null reference for runner_proxy_config in cloud-init
src/charm_state.py Added PlannerConfig extraction from relations with secret handling
src/factories.py Pass planner_url and planner_token to ApplicationConfiguration
github-runner-manager/tests/unit/test_pressure_reconciler.py Unit tests for pressure reconciler behavior
github-runner-manager/tests/unit/test_planner_client.py Unit tests for planner client API interactions
github-runner-manager/tests/integration/planner_stub.py Lightweight Flask-based planner stub for integration testing
github-runner-manager/tests/integration/test_planner_runner.py Integration test for full planner-driven lifecycle
tests/integration/helpers/common.py Refactored wait_for helper to support both sync and async functions
docs/adr/001_pressure_reconciler_design.md Architecture decision record documenting design rationale
docs/changelog.md Added entry for planner reconciler feature

Comment on lines +19 to +40
class _FakeManager:
"""Lightweight runner manager stub for testing the reconciler."""

def __init__(self, runners_count: int = 0):
"""Initialize with an optional number of pre-existing runners."""
self._runners = [object() for _ in range(runners_count)]
self.created_args: list[int] = []
self.cleanup_called = 0

def get_runners(self) -> list[object]:
"""Return the current list of runners."""
return list(self._runners)

def create_runners(self, num: int, metadata: object): # noqa: ARG002
"""Record the creation request and extend the internal runner list."""
self.created_args.append(num)
if num > 0:
self._runners.extend(object() for _ in range(num))

def cleanup_runners(self):
"""Increment the cleanup counter."""
self.cleanup_called += 1
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _FakeManager test stub is missing the delete_runners method that is called by PressureReconciler._handle_timer_reconcile when scaling down. This will cause test_handle_timer_reconcile_uses_desired_total_not_raw_pressure to fail with an AttributeError when the timer reconcile path attempts to delete runners. Add a delete_runners method to the fake manager that tracks deletion calls.

Copilot uses AI. Check for mistakes.
Comment on lines +41 to +46
def _pressure_file_path(port: int) -> Path:
"""Return the path to the pressure state file for the given port.

Port-namespaced to allow parallel test execution without conflicts.
"""
return Path(f"/tmp/planner_stub_{port}_pressure.json")
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pressure file path uses /tmp which can be subject to race conditions and security issues in multi-user environments. Consider using tempfile.gettempdir() with a secure subdirectory or tempfile.NamedTemporaryFile to ensure the file is created with proper permissions and in a secure location. Additionally, the port-based naming doesn't guarantee uniqueness if multiple test runs with the same port happen concurrently.

Copilot uses AI. Check for mistakes.
Comment on lines +295 to +296
planner_client=PlannerClient(
PlannerConfiguration(base_url=config.planner_url, token=config.planner_token)
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build_pressure_reconciler function is called only when config.planner_url and config.planner_token are truthy (line 129 in cli.py), so the Optional fields will not be None at runtime. However, the type system doesn't know this. Consider adding an explicit assertion or type guard at the beginning of build_pressure_reconciler to document this precondition and satisfy type checkers: assert config.planner_url is not None and config.planner_token is not None.

Copilot uses AI. Check for mistakes.
is_disabled=data.get("is_disabled"),
minimum_pressure=data.get("minimum_pressure"),
)
except (requests.RequestException, ValueError, requests.HTTPError) as exc:
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exception handling catches requests.HTTPError separately from requests.RequestException, but HTTPError is a subclass of RequestException. This makes the explicit requests.HTTPError clause redundant since it will never be reached. Remove requests.HTTPError from the except clause on line 112 (and similarly on line 138 if present).

Suggested change
except (requests.RequestException, ValueError, requests.HTTPError) as exc:
except (requests.RequestException, ValueError) as exc:

Copilot uses AI. Check for mistakes.
openstack_configuration: Configuration for authorization to a OpenStack host.
planner_url: Base URL of the planner service.
planner_token: Bearer token to authenticate against the planner service.
reconcile_interval: Seconds to wait between reconciliation.
Copy link

Copilot AI Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for reconcile_interval is inconsistent. In ApplicationConfiguration (base.py:55), it's documented as "Seconds to wait between reconciliation", but the implementation in PressureReconcilerConfig (pressure_reconciler.py:47, 143) treats it as minutes (multiplies by 60 to get seconds). The integration test comment (test_planner_runner.py:56-57) also confirms the field is intended to be in minutes. Update the documentation in base.py line 55 to say "Minutes to wait between reconciliation" for consistency.

Suggested change
reconcile_interval: Seconds to wait between reconciliation.
reconcile_interval: Minutes to wait between reconciliation.

Copilot uses AI. Check for mistakes.
@cbartz
Copy link
Collaborator Author

cbartz commented Feb 23, 2026

Closing, as I will split this PR into two

@cbartz cbartz closed this Feb 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants