[audit-workflows] Agentic Workflow Audit: 2026-06-16 (prod-main 84.4%; day-10 sdk-lockout + NEW upload_assets false-red) #39691
Closed
This discussion has been marked as outdated by Agentic Workflow Audit Agent. A newer discussion is available at Discussion #39907.
Agentic Workflow Audit: 2026-06-16
Window: ~3.3h evening cluster (18:41-21:57Z). A ~20h stretch (06-15 22:00 to 06-16 18:41) is unobserved because the logs MCP tool hit its 120s timeout, a recurring data-coverage limit, so today's totals are partial.
Headline: Overall completion rate 71.2% (47 succeeded / 19 failed of 66 completed), but that number is dragged down by a single PR's container-validation batch. Prod-main (main branch) was 84.4% (27 succeeded / 4 failed), and all 4 prod-main failures are known/recurring classes with no new systemic incident.
Trend Charts (30 days)
Health holds in the mid-80s%/low-90s% band typical of the last two weeks; today's 71.2% dip is a window artifact (one PR's smoke/container batch), not a fleet regression. The 06-13 trough (68.4%) was the real low, and the rate has since rebounded. Failure counts stay small on prod-main.
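The headline rate is plain arithmetic over completed runs. A minimal sketch, assuming the rate is defined as successes divided by completed runs (the helper name `completion_rate` is hypothetical and not part of the audit tooling):

```python
def completion_rate(successes: int, completed: int) -> float:
    """Return successes as a percentage of completed runs, to one decimal."""
    if completed <= 0:
        raise ValueError("completed must be positive")
    return round(100.0 * successes / completed, 1)


# Figures from this report's headline: 47 successes out of 66 completed runs.
overall = completion_rate(47, 66)
print(f"overall completion rate: {overall}%")
```

Because the window is only ~3.3h, a single PR batch of 9 failures moves this figure far more than it would over a full day.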
Observed daily tokens (41.9M) sit near the 7-day moving average. The 06-12 spike (127M) was a full-window outlier: bar height tracks the length of the observed window, so it does not indicate a usage surge. No token-budget-429 events this window.
Failure Breakdown (19)
Cluster A: PR container-validation batch (9, by design, NOT prod)
All 9 are on branch copilot/add-gh-aw-node-to-default-containers, a PR that modifies the default container images. Every smoke/container test on that branch failed at agent startup (0-tok / 0-turn): Agent Container Smoke Test, Changeset Generator, Smoke {Claude, Codex, Gemini, Antigravity}, Smoke Copilot ×3 (incl. AOAI apikey/Entra). This is expected PR-validation noise while the container change is iterated: the Agent Container Smoke Test literally checks whether node/tools exist in the container. No prod-main impact.
Prod-main #1: copilot-sdk tool-permission lockout (2 runs, RECUR day 10)
- Run 27643760095: 5 denials of routine commands (read on /pkg/*, shell go version, git status), after which the SDK aborts at the 5-denial cap; ~24min wall, 0 tokens.
- Run 27641213363: NEW affected workflow, identical signature, ~26min, 0 tokens (previously failed via the long-gone minimatch class).
This is the longest-running unresolved prod-main class (since 06-02, now day 10). The copilot-sdk-driver path denies read-only/source-inspection commands the workflows legitimately need, then hard-aborts with no retry. Top-priority fix remains: grant the required tool perms or fix the SDK allowlist plumbing.
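The failure shape described above (routine commands denied, then a hard abort at the fifth denial) can be illustrated with a toy model. This is a sketch under stated assumptions, not the copilot-sdk implementation: the function name, the glob-style allowlist format, and the example command strings are all hypothetical; only the cap of 5 comes from the report.

```python
from fnmatch import fnmatchcase

DENIAL_CAP = 5  # from the report: the SDK aborts at the 5-denial cap


def run_with_allowlist(commands, allowlist, cap=DENIAL_CAP):
    """Return (allowed, denied, aborted) for a sequence of requested commands.

    A command is allowed if it matches any glob pattern in the allowlist.
    Processing stops as soon as the number of denials reaches the cap.
    """
    allowed, denied = [], []
    for cmd in commands:
        if any(fnmatchcase(cmd, pattern) for pattern in allowlist):
            allowed.append(cmd)
        else:
            denied.append(cmd)
            if len(denied) >= cap:
                return allowed, denied, True  # hard abort, no retry
    return allowed, denied, False


# Hypothetical request stream made of the command kinds named in the report.
requested = [
    "shell go version",
    "git status",
    "read /pkg/a.go",
    "read /pkg/b.go",
    "shell go version",
]

# An allowlist that omits read-only commands reproduces the lockout.
print(run_with_allowlist(requested, ["edit *"]))

# Granting the read-only patterns lets the same stream run to completion.
print(run_with_allowlist(requested, ["read /pkg/*", "shell go version", "git status"]))
```

In this model, either remedy named above (granting the permissions or fixing how the allowlist reaches the check) amounts to making the second call's allowlist the one actually in effect.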
Prod-main #2 (NEW): upload_assets job reddens a fully successful run (2 runs)
- Run 27643027159 (claude): agent SUCCESS (53 turns, 4.27M tok); activation, agent, detection, safe_outputs, update_cache_memory, push_repo_memory all success, yet the run is red because upload_assets: failure (charts quality_score_breakdown.png + historical_trends.png).
- Run 27649783375 (copilot): same shape, agent SUCCESS, only upload_assets failed.
A new variant of the safe-output partial-failure-intolerance family, isolated to the chart-upload job. Notably, this is the same upload_asset path every chart-emitting report agent uses fleet-wide, including this audit agent (I hit a related path-resolution quirk below). Recommend making upload_assets non-fatal to overall run conclusion (skip-and-warn) or adding upload retry/validation.
Other PR-branch failures (6, non-prod)
- activation job failed (the daily_effective_workflow_exceeded evening-cap gate) → agent skipped → PR shows red. Known daily-effective-workflow-cap-reddens-pr-runs (UX false-red).
- support-array-object-runner-specs.
Recommendations
- Make the upload_assets job non-fatal: a transient chart-upload failure should not discard a fully successful agent run. Broad blast radius (every chart-emitting report agent). (rec-upload-assets-nonfatal)
- The logs tool's 120s timeout keeps leaving ~20h unobserved daily, and upload_asset only resolves workspace-relative paths, not the /tmp/gh-aw/python/charts path the Python skill documents (charts had to be copied into the workspace to upload).
Health Signals
- metrics-summary, known-issues (sdk-lockout → day10, +upload_assets-redden), recommendations, anomalies, workflow-trends, audit-history.
References: §27643760095 · §27643027159 · §27641213363
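The skip-and-warn recommendation for upload_assets can be sketched as a change in how job results roll up into a run conclusion. This is an illustrative model only: `run_conclusion` and its `non_fatal` parameter are hypothetical names, and the report does not say how the real workflow engine computes conclusions. The job results mirror those reported for run 27643027159.

```python
def run_conclusion(job_results, non_fatal=frozenset()):
    """Return (conclusion, warnings) for a mapping of job name -> result.

    Any non-success result fails the run unless the job is listed in
    non_fatal, in which case it is downgraded to a warning.
    """
    conclusion = "success"
    warnings = []
    for job, result in job_results.items():
        if result == "success":
            continue
        if job in non_fatal:
            warnings.append(f"{job}: {result} (non-fatal)")
        else:
            conclusion = "failure"
    return conclusion, warnings


# Job results as reported for run 27643027159.
jobs = {
    "activation": "success",
    "agent": "success",
    "detection": "success",
    "safe_outputs": "success",
    "update_cache_memory": "success",
    "push_repo_memory": "success",
    "upload_assets": "failure",
}

# Strict roll-up: one failed chart upload reddens the whole run.
print(run_conclusion(jobs))

# Skip-and-warn: the run stays green and the upload failure is surfaced.
print(run_conclusion(jobs, non_fatal={"upload_assets"}))
```

GitHub Actions offers a job-level `continue-on-error` setting with a broadly similar effect; whether it fits the way this job is generated is not something the report establishes.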