Skip to content

fix: planner exception logging#750

Open
cbartz wants to merge 16 commits intomainfrom
fix/planner-exception-logging
Open

fix: planner exception logging#750
cbartz wants to merge 16 commits intomainfrom
fix/planner-exception-logging

Conversation

@cbartz
Copy link
Collaborator

@cbartz cbartz commented Mar 13, 2026

Overview

Avoid repeating logging the same exception stack trace in the planner client and pressure reconciler and avoid logging an exception stack trace for connection issues.

Also add small optimisation for graceful shutdown (use _stop.wait(5) instead of time.sleep(5) )

Rationale

Currently, the logs are quite noisy due to redundant information. Furthermore, long-lived http connections in unstable environments can be flaky, having exception stack traces doesnt bring a lot of information there.

time.sleep could delay shutdown for some seconds unnecessarily

Checklist

  • The charm style guide was applied.
  • The contributing guide was applied.
  • The changes are compliant with ISD054 - Managing Charm Complexity
  • The documentation for charmhub is updated.
  • The PR is tagged with appropriate label (urgent, trivial, complex).
  • The changelog is updated with changes that affects the users of the charm.
  • The application version number is updated in github-runner-manager/pyproject.toml.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reduces log noise around planner pressure streaming by avoiding duplicate stack traces across layers and by treating transient connection failures differently from other planner API errors.

Changes:

  • Introduces PlannerConnectionError to represent transient streaming connection failures (e.g., disconnects/timeouts).
  • Updates PlannerClient.stream_pressure() to map requests connection/timeouts to PlannerConnectionError and stop logging exceptions at the client layer.
  • Updates PressureReconciler to handle PlannerConnectionError with a warning + fallback behavior (no stack trace), and adds unit tests for the new exception mapping.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
github-runner-manager/src/github_runner_manager/planner_client.py Adds PlannerConnectionError and adjusts exception handling in stream_pressure() to avoid redundant exception logging.
github-runner-manager/src/github_runner_manager/manager/pressure_reconciler.py Handles PlannerConnectionError separately from PlannerApiError to reduce noisy stack traces while applying fallback behavior.
github-runner-manager/tests/unit/test_planner_client.py Adds unit tests validating error mapping from transient requests failures to PlannerConnectionError and other failures to PlannerApiError.

cbartz and others added 2 commits March 13, 2026 08:21
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces noisy/duplicated exception logging around planner pressure streaming by moving detailed logging responsibility to the caller, and introduces a distinct transient connection error path so connection drop/timeout scenarios don’t emit stack traces. It also improves shutdown responsiveness by replacing time.sleep(5) backoff with Event.wait(5).

Changes:

  • Add PlannerConnectionError and raise it for transient requests.ConnectionError / requests.Timeout failures while streaming pressure.
  • Update PressureReconciler to handle transient planner connection issues with a warning + fallback behavior, and use _stop.wait(5) for backoff.
  • Add unit tests validating PlannerClient.stream_pressure exception mapping for transient vs non-transient request failures.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
github-runner-manager/tests/unit/test_planner_client.py Adds tests for new exception behavior (PlannerConnectionError vs PlannerApiError).
github-runner-manager/src/github_runner_manager/planner_client.py Introduces PlannerConnectionError and refines exception handling to avoid duplicate stack trace logging.
github-runner-manager/src/github_runner_manager/manager/pressure_reconciler.py Adds specific handling for transient planner disconnects and improves graceful shutdown/backoff behavior.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces planner pressure streaming log noise by separating transient connection failures from other planner API errors, avoiding duplicate stack traces, and improving shutdown responsiveness by using an interruptible wait instead of sleep.

Changes:

  • Introduced PlannerConnectionError to distinguish transient streaming disconnects/timeouts from other planner client failures.
  • Updated PressureReconciler to log transient stream failures as warnings (no stack trace) while preserving stack traces for non-transient planner errors, and replaced time.sleep(5) with _stop.wait(5) for faster shutdown.
  • Added/updated unit tests for new exception behavior and bumped the project version + changelog.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
github-runner-manager/src/github_runner_manager/planner_client.py Adds PlannerConnectionError and reclassifies requests exceptions to reduce stack-trace noise for transient failures.
github-runner-manager/src/github_runner_manager/manager/pressure_reconciler.py Handles PlannerConnectionError separately with warning logs and uses _stop.wait(5) to avoid delaying shutdown.
github-runner-manager/tests/unit/test_planner_client.py Adds unit coverage for the new exception mapping behavior in the planner client.
github-runner-manager/tests/unit/manager/test_pressure_reconciler.py Updates reconciler tests to cover both transient and non-transient planner stream failures and to match the new wait-based backoff.
github-runner-manager/pyproject.toml / docs/changelog.md Version bump and user-facing changelog entry describing the logging change.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reduces log noise around planner pressure streaming by centralizing stack-trace logging, distinguishing transient connection issues from other planner failures, and improving shutdown responsiveness in the pressure reconciler.

Changes:

  • Introduced PlannerConnectionError to classify transient stream disconnects/timeouts and avoid stack traces for those cases.
  • Updated PressureReconciler to log transient stream interruptions as warnings (no stack trace), keep stack traces for non-transient planner errors, and use _stop.wait(5) instead of time.sleep(5).
  • Added/updated unit tests for the new error classification paths; bumped project version and documented the change.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
github-runner-manager/src/github_runner_manager/planner_client.py Adds PlannerConnectionError and adjusts exception wrapping in stream_pressure.
github-runner-manager/src/github_runner_manager/manager/pressure_reconciler.py Changes logging behavior for transient vs non-transient errors; replaces sleep with stop-aware wait.
github-runner-manager/tests/unit/test_planner_client.py Adds coverage for transient/request/mid-stream failures and expected wrapped exceptions.
github-runner-manager/tests/unit/manager/test_pressure_reconciler.py Updates tests to cover both PlannerApiError and PlannerConnectionError fallback behavior; adapts shutdown/backoff mocking to Event.wait.
github-runner-manager/pyproject.toml Bumps version from 0.15.0 to 0.15.1.
docs/changelog.md Adds a user-facing changelog entry describing the reduced log noise.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces log noise around planner pressure streaming by separating transient connection failures from other planner errors, ensuring stack traces are not duplicated across layers, and improving shutdown responsiveness in the pressure reconciler.

Changes:

  • Introduced PlannerConnectionError to classify transient planner streaming failures (disconnects/timeouts) separately from other planner API errors.
  • Removed planner-client-side exception logging to avoid duplicate stack traces; shifted behavior to structured handling/logging in PressureReconciler.
  • Replaced fixed sleep backoff with _stop.wait() to allow graceful shutdown without unnecessary delays.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
github-runner-manager/src/github_runner_manager/planner_client.py Adds PlannerConnectionError and wraps transient vs non-transient requests failures without logging stack traces in the client.
github-runner-manager/src/github_runner_manager/manager/pressure_reconciler.py Handles PlannerConnectionError with warning-only logs and uses _stop.wait(5) for interruptible backoff.
github-runner-manager/tests/unit/test_planner_client.py Adds unit coverage for transient/non-transient request failures, including mid-stream failures.
github-runner-manager/tests/unit/manager/test_pressure_reconciler.py Expands coverage for fallback behavior and verifies shutdown prevents fallback scaling during error handling.
github-runner-manager/pyproject.toml Bumps application version to 0.15.1.
docs/changelog.md Documents the user-visible logging behavior change.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces planner pressure-stream log noise by classifying transient connection/stream interruptions separately from other planner API failures, and improves shutdown responsiveness by replacing fixed sleeps with interruptible waits.

Changes:

  • Introduced PlannerConnectionError for transient requests/stream failures and updated PlannerClient.stream_pressure to wrap errors without logging stack traces.
  • Updated PressureReconciler to log transient planner disconnects as warnings (no stack trace), keep error-level stack traces for non-transient failures, and use _stop.wait(5) for backoff.
  • Added/updated unit tests for the new error classification and shutdown behavior; bumped patch version and added a changelog entry.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
github-runner-manager/src/github_runner_manager/planner_client.py Adds PlannerConnectionError and refines exception wrapping in stream_pressure to avoid duplicate stack traces.
github-runner-manager/src/github_runner_manager/manager/pressure_reconciler.py Differentiates logging/backoff behavior for transient vs non-transient planner failures; uses interruptible wait for graceful shutdown.
github-runner-manager/tests/unit/test_planner_client.py Adds coverage for transient request failures and mid-stream failures mapping to the correct wrapper exception type.
github-runner-manager/tests/unit/manager/test_pressure_reconciler.py Extends fallback/shutdown tests to cover both PlannerApiError and PlannerConnectionError paths and verifies no scale-up after stop is requested.
github-runner-manager/pyproject.toml Bumps version to 0.15.1.
docs/changelog.md Documents the user-facing logging behavior change.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces log noise around planner pressure streaming by introducing a distinct connection-related exception type and adjusting the pressure reconciler’s logging/backoff behavior, while also improving shutdown responsiveness.

Changes:

  • Introduced PlannerConnectionError to distinguish transient stream/connection failures from other planner API errors.
  • Updated PressureReconciler to log transient planner disconnects as warnings (no stack traces), keep stack traces for non-transient errors, and replace time.sleep(5) with _stop.wait(5) for faster shutdown.
  • Added/expanded unit tests for transient vs non-transient failures (including mid-stream failures), and bumped version + changelog.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
github-runner-manager/src/github_runner_manager/planner_client.py Adds PlannerConnectionError and classifies transient request/stream failures separately from other RequestExceptions.
github-runner-manager/src/github_runner_manager/manager/pressure_reconciler.py Uses warning-level logging (no stack trace) for transient stream interruptions, retains exception logging for non-transient errors, and uses _stop.wait(5) for backoff/shutdown.
github-runner-manager/tests/unit/test_planner_client.py Adds coverage for transient request failures and mid-stream errors mapping to the correct wrapper exception type.
github-runner-manager/tests/unit/manager/test_pressure_reconciler.py Extends tests to cover both planner error types and validates no scaling after stop is requested during error handling.
github-runner-manager/pyproject.toml Bumps project version to 0.15.1.
docs/changelog.md Documents the user-visible logging behavior change.

Comment on lines +142 to +145
if self._stop.is_set():
return
self._handle_create_runners(fallback)
self._stop.wait(5)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small improvement to teardown - use _stop.wait instead of time.sleep and don't create runners if stop is set

@cbartz cbartz marked this pull request as ready for review March 13, 2026 15:07
@github-actions
Copy link
Contributor

TICS Quality Gate

✔️ Passed

github-runner-operator

See the results in the TICS Viewer

The following files have been checked for this project
  • github-runner-manager/tests/unit/test_planner_client.py

.github/workflows/tics.yaml / TICS / TICS GitHub Action

Copy link
Member

@yanksyoon yanksyoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants