fix(kiloclaw): recover from 409 already_exists in createNewMachine by evanjacobson · Pull Request #1135 · Kilo-Org/cloud

evanjacobson · 2026-03-16T18:33:08Z

Summary

When a Durable Object loses its flyMachineId (e.g. after a DO wipe + Postgres restore) and metadata recovery doesn't run or fails, calling createMachine on Fly returns a 409 already_exists error. Previously this would crash the reconciliation loop.

createNewMachine now handles this case: on a 409 with already_exists in the body, it parses the machine ID from the error body, persists it to durable storage (following the existing "machine ID persisted before waiting" invariant), and calls startExistingMachine to continue normally. If the ID can't be parsed or the 409 is for a different reason, it re-throws.

Test coverage includes: successful adoption, partial-success where startExistingMachine throws after the ID is already persisted (verifying the ID survives for the next reconciliation retry), unparseable 409 body, unrelated 409, and non-409 errors.

Verification

pnpm test — 669 tests across 36 files, all pass

Visual Changes

N/A

Reviewer Notes

state.flyMachineId and ctx.storage.put(flyMachineId) both happen before startExistingMachine is awaited — matching the ordering in the normal createNewMachine path. A failure in startExistingMachine leaves the ID durably written so the reconciliation alarm can retry cleanly.

The regex /machine ID (\w+)/ is coupled to Fly's error message format. If the format ever changes, the code degrades gracefully: it logs the full body and re-throws.

When createMachine hits a 409 "already_exists" (e.g. DO state lost flyMachineId and metadata recovery didn't run), adopt the existing machine instead of crashing. Adds logging when the regex fails to extract the machine ID from the 409 body, preventing silent failures if Fly changes their error format.

…runcation Add missing test for partial-success state where flyMachineId is persisted but startExistingMachine throws. Add status='running' assertion to happy-path test. Assert status unchanged on error paths. Use realistic hex machine ID fixture. Remove err.body.slice(0,500) so the full body is logged.

kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts

kilo-code-bot · 2026-03-16T18:37:39Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (2 files)

kiloclaw/src/durable-objects/kiloclaw-instance.test.ts
kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts

_{Reviewed by gpt-5.4-20260305 · 578,593 tokens}

After replaceStrandedVolume swaps to a new volume, Fly can still return 409 already_exists for the old machine name. Blindly adopting that machine would reattach to the old host/volume, undoing recovery. The 409 handler now calls getMachine to verify the existing machine's volume matches the expected one. On mismatch, destroys the stale machine and re-throws so the alarm retries with a clean slate.

kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts

Inline the machine start logic after 409 adoption instead of calling startExistingMachine, whose 404 fallback would recurse back into createNewMachine and loop indefinitely if the adopted machine disappears while Fly's name registry still returns 409.

kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts

The inline start after adopting a 409 machine was missing the legacy machineSize backfill that startExistingMachine performs, which could silently resize the machine to DEFAULT_MACHINE_GUEST on start.

kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts

) ## Summary Adds a controller-side pairing cache and HTTP API so the Durable Object can resolve pairing requests via the controller's in-memory cache instead of shelling out via `fly exec` every time. - **Pairing cache** on the controller periodically runs `kilo pairing list` / `kilo device-pairing list`, parses stdout, and serves cached results. Refreshes reactively when the gateway logs a new pairing event (via a new `onStdoutLine` supervisor hook). - **HTTP routes** at `/_kilo/pairing/{channels,devices}` and `/_kilo/pairing/{channels,devices}/approve`. - **DO pairing methods** now try the controller HTTP API first and fall back to `fly exec` on unknown-route errors, preserving backward compatibility with older controller versions. - **Zod schemas** for typed controller pairing responses. - ~~**409 machine recovery**: if `createMachine` hits a 409 "already_exists", adopts the existing machine instead of crashing.~~ #1135 - **Volume existence check** after machine ID recovery to prevent startups against a missing volume. ## Verification - [x] Pairing cache tests (749 lines) — `pairing-cache.test.ts` - [x] Pairing route tests (410 lines) — `routes/pairing.test.ts` - [x] Supervisor `onStdoutLine` tests (135 lines) — `supervisor.test.ts` - [x] Manual testing — paired Telegram channel ## Visual Changes N/A ## Reviewer Notes - The controller-first / fly-exec-fallback pattern uses `isErrorUnknownRoute` to detect older controllers that don't have the pairing routes, so rollout is safe without version gating. - The pairing cache debounces rapid refresh triggers and caps concurrent CLI invocations to avoid overwhelming the gateway process. - Cache warmup is fire-and-forget — `start()` kicks off the initial refresh but does not block `server.listen()`. This avoids a slow or wedged OpenClaw CLI (45s timeout per path) holding the health endpoint unreachable past the DO's 60s startup probe. The brief empty-cache window is acceptable: the DO falls back to KV / fly exec, and the cache self-heals via periodic refresh and log-triggered debounce.

evanjacobson added 3 commits March 16, 2026 11:52

Merge branch 'main' into improvement/409-machine-recovery

f7bef11

kilo-code-bot bot reviewed Mar 16, 2026

View reviewed changes

kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts Show resolved Hide resolved

evanjacobson mentioned this pull request Mar 16, 2026

feat(controller): pair with controller for channel/device pairing #1095

Merged

4 tasks

evanjacobson added 2 commits March 16, 2026 13:04

style: format test file

1163fbd

kilo-code-bot bot reviewed Mar 16, 2026

View reviewed changes

kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts Outdated Show resolved Hide resolved

evanjacobson added 2 commits March 16, 2026 13:22

style: format test file

655f32b

kilo-code-bot bot reviewed Mar 16, 2026

View reviewed changes

kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts Outdated Show resolved Hide resolved

evanjacobson added 2 commits March 16, 2026 13:33

fix(kiloclaw): backfill machineSize in 409 adoption path

b7a5d4e

The inline start after adopting a 409 machine was missing the legacy machineSize backfill that startExistingMachine performs, which could silently resize the machine to DEFAULT_MACHINE_GUEST on start.

style: format fly-machines

5c90cb8

kilo-code-bot bot reviewed Mar 16, 2026

View reviewed changes

kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts Show resolved Hide resolved

Merge branch 'main' into improvement/409-machine-recovery

a52d304

evanjacobson enabled auto-merge March 16, 2026 20:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kiloclaw): recover from 409 already_exists in createNewMachine#1135

fix(kiloclaw): recover from 409 already_exists in createNewMachine#1135
evanjacobson wants to merge 10 commits intomainfrom
improvement/409-machine-recovery

evanjacobson commented Mar 16, 2026

Uh oh!

Uh oh!

kilo-code-bot bot commented Mar 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

evanjacobson commented Mar 16, 2026

Summary

Verification

Visual Changes

Reviewer Notes

Uh oh!

Uh oh!

kilo-code-bot bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kilo-code-bot bot commented Mar 16, 2026 •

edited

Loading