fix(wake): defer Ready on CH+Windows dropNIC until DHCP lease lands#22
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
This PR changes the CH+Windows dropNIC wake path so lifecycle Ready is deferred until a fresh VM IP is observed, aiming to prevent consumers from reading an empty/stale pod IP.
Changes:
- Adds dropNIC wake IP wait defaults and Provider tunables.
- Introduces
finalizeDropNICWake/waitForFreshIPand uses them asynchronously fromwake. - Adds tests for IP arrival, timeout, and forgotten-VM behavior plus a new Prometheus counter.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
provider/cocoon/update.go |
Defers dropNIC wake Ready until IP wait completes. |
provider/cocoon/update_test.go |
Adds unit coverage for the new finalizer outcomes. |
provider/cocoon/provider.go |
Adds wake IP wait tunable fields. |
metrics/metrics.go |
Registers wake IP wait outcome counter. |
Comments suppressed due to low confidence (1)
provider/cocoon/update.go:228
- The timeout branch only treats an untracked VM as a race, but hibernate keeps the VM tracked until Remove completes. If this waiter times out while a hibernate is in progress,
failOpcan publishFailedoverHibernatingfor the current generation. This should be gated on the same wake operation/current lifecycle (or VM ID) rather than justvmForPod != nil.
if p.vmForPod(pod.Namespace, pod.Name) == nil {
// Racing hibernate forgot the VM; do not overwrite its lifecycle.
return
}
metrics.WakeIPWaitTotal.WithLabelValues("timeout").Inc()
budget := cmp.Or(p.wakeFreshIPBudget, defaultWakeFreshIPBudget)
err := fmt.Errorf("wake %s: dhcp lease for fresh NIC not observed within %s", v.Name, budget)
p.failOp(ctx, pod, "WakeIPWaitTimeout", "update", err)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
93e22c4 to
520f4d6
Compare
520f4d6 to
a96c6ee
Compare
a96c6ee to
5c4354d
Compare
The dropNIC wake path (7dd2c7e) marked lifecycle=Ready as soon as Clone returned. With --nics 1 hot-add the freshly attached NIC has no lease yet, so resolveVMIP returns "" and pod.status.podIP is published empty. Consumers gated on lifecycle=Ready (e.g. vm-service.post_wake_secrets) latch the pre-DHCP empty string, fall through to None, skip the secret refresh, and keep dialing the pre-hibernate address. Mirror the non-dropNIC branch's shape: dispatch finalizeDropNICWake via goBackground, poll resolveVMIP (LeaseParser path) on a 200ms tick up to 15s, then refreshStatus + notify + markReady. On timeout failOp Failed with WakeIPWaitTimeout; on racing-hibernate VM untracking, return silent so we do not trample the hibernate lifecycle. Tunables hang off Provider so tests can shrink them. Closes #21.
5c4354d to
554aa1d
Compare
Comment on lines
+214
to
+216
| p.refreshStatus(ctx, pod) | ||
| p.notify(pod) | ||
| if p.markLifecycleStateForWake(ctx, pod, v.ID, meta.LifecycleStateReady, "") { |
Comment on lines
+260
to
+265
| for { | ||
| v := p.vmForPod(namespace, name) | ||
| if v == nil { | ||
| return false | ||
| } | ||
| if ip := p.resolveVMIP(namespace, name, v); ip != "" { |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
wake's dropNIC branch was markinglifecycle=readysynchronously afterClone; the freshly hot-added NIC had no DHCP lease yet, soresolveVMIPreturned""andpod.status.podIPwas published empty. Consumers gated onlifecycle=readylatched that empty value and kept dialing the pre-hibernate address (full repro and root-cause in Wake races virtual-kubelet podIP publish on CH+Windows dropNIC path #21).finalizeDropNICWakeruns viagoBackground(matches the non-dropNICrunPostCloneSetupshape), pollsresolveVMIPevery 200ms for up to 15s, thenrefreshStatus+notify+markReady. On timeout,failOpwithWakeIPWaitTimeout; on a racing-hibernate VM untracking, silently returns so the hibernate lifecycle is not overwritten.Provider.wakeFreshIPBudget/wakeFreshIPIntervalso tests can shrink them.vk_cocoon_wake_ip_wait_total{result=ok|timeout}for observability.Closes #21.
Test plan
go test ./... -race -count=1make lint(darwin + linux, 0 issues)simular/win10-hot-testingpod, verifypod.status.podIPis the post-clone IP at the momentlifecycle=readyis patched