Skip to content

fix(kiloclaw): recover from 409 already_exists in createNewMachine#1135

Open
evanjacobson wants to merge 10 commits intomainfrom
improvement/409-machine-recovery
Open

fix(kiloclaw): recover from 409 already_exists in createNewMachine#1135
evanjacobson wants to merge 10 commits intomainfrom
improvement/409-machine-recovery

Conversation

@evanjacobson
Copy link
Contributor

Summary

When a Durable Object loses its flyMachineId (e.g. after a DO wipe + Postgres restore) and metadata recovery doesn't run or fails, calling createMachine on Fly returns a 409 already_exists error. Previously this would crash the reconciliation loop.

createNewMachine now handles this case: on a 409 with already_exists in the body, it parses the machine ID from the error body, persists it to durable storage (following the existing "machine ID persisted before waiting" invariant), and calls startExistingMachine to continue normally. If the ID can't be parsed or the 409 is for a different reason, it re-throws.

Test coverage includes: successful adoption, partial-success where startExistingMachine throws after the ID is already persisted (verifying the ID survives for the next reconciliation retry), unparseable 409 body, unrelated 409, and non-409 errors.

Verification

  • pnpm test — 669 tests across 36 files, all pass

Visual Changes

N/A

Reviewer Notes

state.flyMachineId and ctx.storage.put(flyMachineId) both happen before startExistingMachine is awaited — matching the ordering in the normal createNewMachine path. A failure in startExistingMachine leaves the ID durably written so the reconciliation alarm can retry cleanly.

The regex /machine ID (\w+)/ is coupled to Fly's error message format. If the format ever changes, the code degrades gracefully: it logs the full body and re-throws.

When createMachine hits a 409 "already_exists" (e.g. DO state lost
flyMachineId and metadata recovery didn't run), adopt the existing
machine instead of crashing.

Adds logging when the regex fails to extract the machine ID from the
409 body, preventing silent failures if Fly changes their error format.
…runcation

Add missing test for partial-success state where flyMachineId is persisted
but startExistingMachine throws. Add status='running' assertion to happy-path
test. Assert status unchanged on error paths. Use realistic hex machine ID
fixture. Remove err.body.slice(0,500) so the full body is logged.
@kilo-code-bot
Copy link
Contributor

kilo-code-bot bot commented Mar 16, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (2 files)
  • kiloclaw/src/durable-objects/kiloclaw-instance.test.ts
  • kiloclaw/src/durable-objects/kiloclaw-instance/fly-machines.ts

Reviewed by gpt-5.4-20260305 · 578,593 tokens

After replaceStrandedVolume swaps to a new volume, Fly can still
return 409 already_exists for the old machine name. Blindly adopting
that machine would reattach to the old host/volume, undoing recovery.

The 409 handler now calls getMachine to verify the existing machine's
volume matches the expected one. On mismatch, destroys the stale
machine and re-throws so the alarm retries with a clean slate.
Inline the machine start logic after 409 adoption instead of calling
startExistingMachine, whose 404 fallback would recurse back into
createNewMachine and loop indefinitely if the adopted machine
disappears while Fly's name registry still returns 409.
The inline start after adopting a 409 machine was missing the legacy
machineSize backfill that startExistingMachine performs, which could
silently resize the machine to DEFAULT_MACHINE_GUEST on start.
@evanjacobson evanjacobson enabled auto-merge March 16, 2026 20:18
evanjacobson added a commit that referenced this pull request Mar 17, 2026
)

## Summary

Adds a controller-side pairing cache and HTTP API so the Durable Object
can resolve pairing requests via the controller's in-memory cache
instead of shelling out via `fly exec` every time.

- **Pairing cache** on the controller periodically runs `kilo pairing
list` / `kilo device-pairing list`, parses stdout, and serves cached
results. Refreshes reactively when the gateway logs a new pairing event
(via a new `onStdoutLine` supervisor hook).
- **HTTP routes** at `/_kilo/pairing/{channels,devices}` and
`/_kilo/pairing/{channels,devices}/approve`.
- **DO pairing methods** now try the controller HTTP API first and fall
back to `fly exec` on unknown-route errors, preserving backward
compatibility with older controller versions.
- **Zod schemas** for typed controller pairing responses.
- ~~**409 machine recovery**: if `createMachine` hits a 409
"already_exists", adopts the existing machine instead of crashing.~~
#1135
- **Volume existence check** after machine ID recovery to prevent
startups against a missing volume.

## Verification

- [x] Pairing cache tests (749 lines) — `pairing-cache.test.ts`
- [x] Pairing route tests (410 lines) — `routes/pairing.test.ts`
- [x] Supervisor `onStdoutLine` tests (135 lines) — `supervisor.test.ts`
- [x] Manual testing — paired Telegram channel

## Visual Changes

N/A

## Reviewer Notes

- The controller-first / fly-exec-fallback pattern uses
`isErrorUnknownRoute` to detect older controllers that don't have the
pairing routes, so rollout is safe without version gating.
- The pairing cache debounces rapid refresh triggers and caps concurrent
CLI invocations to avoid overwhelming the gateway process.
- Cache warmup is fire-and-forget — `start()` kicks off the initial
refresh but does not block `server.listen()`. This avoids a slow or
wedged OpenClaw CLI (45s timeout per path) holding the health endpoint
unreachable past the DO's 60s startup probe. The brief empty-cache
window is acceptable: the DO falls back to KV / fly exec, and the cache
self-heals via periodic refresh and log-triggered debounce.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant