Skip to content

feat(review): retry transient acpx failures + pass --prompt-retries#88

Merged
steipete merged 5 commits into
openclaw:mainfrom
coletebou:pr/retry-transient
May 20, 2026
Merged

feat(review): retry transient acpx failures + pass --prompt-retries#88
steipete merged 5 commits into
openclaw:mainfrom
coletebou:pr/retry-transient

Conversation

@coletebou
Copy link
Copy Markdown
Contributor

Summary

reviewFeature has a single attempt — any provider.review() failure marks the feature errored, even when the cause is a transient stream cut. acpx itself supports --prompt-retries <n> but clawpatch never sets it.

This PR adds two cheap retry layers, both opt-out via env var:

  1. acpx-level: runAcpxJson passes --prompt-retries <n> (default 1; env CLAWPATCH_ACPX_PROMPT_RETRIES; 0 omits the flag).
  2. clawpatch-level: runProviderReviewWithRetry wraps provider.review and retries ONCE on ClawpatchError && code === "malformed-output". Default 1; env CLAWPATCH_REVIEW_RETRIES.

Excluded codes (deterministic, won't retry): provider-auth, unsupported-provider, agent-refused, agent-cancelled.

The new upstream validateReviewOutput() step runs AFTER the retry succeeds — validation failures are NOT retried (they're per-finding bugs, not transient).

Why

Run 20260517T190759-3c9e9e had 37 malformed-output + 15 timeout errors. Most are recoverable on a single retry. With these layers landed, conservative estimate is ~50% of the transient errors clear without operator intervention.

Files

  • src/provider.tsacpxPromptRetries() helper + --prompt-retries arg in runAcpxJson; both exported via __testing
  • src/app.tsrunProviderReviewWithRetry, reviewRetries() knob, feature-retry progress event between attempts
  • src/provider.test.ts + src/app.test.ts — 26 new cases (5 + 4 + 7 + 4 + 6)

Validation

  • pnpm format:check — clean
  • pnpm typecheck — clean
  • pnpm lint — clean
  • pnpm build — clean
  • pnpm test — 567 passed, 1 skipped

Notes

Worst-case latency on a stuck feature with both retry layers maxed: 2 acpx tries × 2 clawpatch tries × 3 min default timeout = 12 min. CLAWPATCH_ACPX_TIMEOUT_MS is the safety valve. With default 10-jobs concurrency the bound is acceptable.

coletebou and others added 5 commits May 18, 2026 14:10
Two-layer retry for transient acpx review failures.

Layer 1 (acpx): runAcpxJson now passes --prompt-retries <n> (default 1) so
the underlying agent can recover stream cuts and partial JSON internally
without the whole feature failing. Override via
CLAWPATCH_ACPX_PROMPT_RETRIES; 0 disables.

Layer 2 (clawpatch): reviewFeature wraps provider.review() in a single
retry that fires only on ClawpatchError with code === "malformed-output".
Override via CLAWPATCH_REVIEW_RETRIES; 0 disables. Deterministic failures
(provider-auth, unsupported-provider, agent-refused, agent-cancelled) and
provider-failure (already covered by layer 1) are not retried at this
layer.

The retry is inside reviewFeature so the work-stealing loop in runReview
does not double-claim the feature lock. Emits a "feature-retry" progress
event between attempts.
@steipete steipete requested a review from a team as a code owner May 20, 2026 02:26
@steipete steipete merged commit b6d51dd into openclaw:main May 20, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants