fix(kiloclaw): recover from 409 already_exists in createNewMachine#1135
Open
evanjacobson wants to merge 10 commits intomainfrom
Open
fix(kiloclaw): recover from 409 already_exists in createNewMachine#1135evanjacobson wants to merge 10 commits intomainfrom
evanjacobson wants to merge 10 commits intomainfrom
Conversation
When createMachine hits a 409 "already_exists" (e.g. DO state lost flyMachineId and metadata recovery didn't run), adopt the existing machine instead of crashing. Adds logging when the regex fails to extract the machine ID from the 409 body, preventing silent failures if Fly changes their error format.
…runcation Add missing test for partial-success state where flyMachineId is persisted but startExistingMachine throws. Add status='running' assertion to happy-path test. Assert status unchanged on error paths. Use realistic hex machine ID fixture. Remove err.body.slice(0,500) so the full body is logged.
Contributor
Code Review SummaryStatus: No Issues Found | Recommendation: Merge Files Reviewed (2 files)
Reviewed by gpt-5.4-20260305 · 578,593 tokens |
4 tasks
After replaceStrandedVolume swaps to a new volume, Fly can still return 409 already_exists for the old machine name. Blindly adopting that machine would reattach to the old host/volume, undoing recovery. The 409 handler now calls getMachine to verify the existing machine's volume matches the expected one. On mismatch, destroys the stale machine and re-throws so the alarm retries with a clean slate.
Inline the machine start logic after 409 adoption instead of calling startExistingMachine, whose 404 fallback would recurse back into createNewMachine and loop indefinitely if the adopted machine disappears while Fly's name registry still returns 409.
The inline start after adopting a 409 machine was missing the legacy machineSize backfill that startExistingMachine performs, which could silently resize the machine to DEFAULT_MACHINE_GUEST on start.
evanjacobson
added a commit
that referenced
this pull request
Mar 17, 2026
) ## Summary Adds a controller-side pairing cache and HTTP API so the Durable Object can resolve pairing requests via the controller's in-memory cache instead of shelling out via `fly exec` every time. - **Pairing cache** on the controller periodically runs `kilo pairing list` / `kilo device-pairing list`, parses stdout, and serves cached results. Refreshes reactively when the gateway logs a new pairing event (via a new `onStdoutLine` supervisor hook). - **HTTP routes** at `/_kilo/pairing/{channels,devices}` and `/_kilo/pairing/{channels,devices}/approve`. - **DO pairing methods** now try the controller HTTP API first and fall back to `fly exec` on unknown-route errors, preserving backward compatibility with older controller versions. - **Zod schemas** for typed controller pairing responses. - ~~**409 machine recovery**: if `createMachine` hits a 409 "already_exists", adopts the existing machine instead of crashing.~~ #1135 - **Volume existence check** after machine ID recovery to prevent startups against a missing volume. ## Verification - [x] Pairing cache tests (749 lines) — `pairing-cache.test.ts` - [x] Pairing route tests (410 lines) — `routes/pairing.test.ts` - [x] Supervisor `onStdoutLine` tests (135 lines) — `supervisor.test.ts` - [x] Manual testing — paired Telegram channel ## Visual Changes N/A ## Reviewer Notes - The controller-first / fly-exec-fallback pattern uses `isErrorUnknownRoute` to detect older controllers that don't have the pairing routes, so rollout is safe without version gating. - The pairing cache debounces rapid refresh triggers and caps concurrent CLI invocations to avoid overwhelming the gateway process. - Cache warmup is fire-and-forget — `start()` kicks off the initial refresh but does not block `server.listen()`. This avoids a slow or wedged OpenClaw CLI (45s timeout per path) holding the health endpoint unreachable past the DO's 60s startup probe. The brief empty-cache window is acceptable: the DO falls back to KV / fly exec, and the cache self-heals via periodic refresh and log-triggered debounce.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When a Durable Object loses its
flyMachineId(e.g. after a DO wipe + Postgres restore) and metadata recovery doesn't run or fails, callingcreateMachineon Fly returns a 409already_existserror. Previously this would crash the reconciliation loop.createNewMachinenow handles this case: on a 409 withalready_existsin the body, it parses the machine ID from the error body, persists it to durable storage (following the existing "machine ID persisted before waiting" invariant), and callsstartExistingMachineto continue normally. If the ID can't be parsed or the 409 is for a different reason, it re-throws.Test coverage includes: successful adoption, partial-success where
startExistingMachinethrows after the ID is already persisted (verifying the ID survives for the next reconciliation retry), unparseable 409 body, unrelated 409, and non-409 errors.Verification
pnpm test— 669 tests across 36 files, all passVisual Changes
N/A
Reviewer Notes
state.flyMachineIdandctx.storage.put(flyMachineId)both happen beforestartExistingMachineis awaited — matching the ordering in the normalcreateNewMachinepath. A failure instartExistingMachineleaves the ID durably written so the reconciliation alarm can retry cleanly.The regex
/machine ID (\w+)/is coupled to Fly's error message format. If the format ever changes, the code degrades gracefully: it logs the full body and re-throws.