Skip to content

fix(wake): defer Ready on CH+Windows dropNIC until DHCP lease lands#22

Merged
CMGS merged 1 commit into
mainfrom
fix/wake-dropnic-ip-race
May 18, 2026
Merged

fix(wake): defer Ready on CH+Windows dropNIC until DHCP lease lands#22
CMGS merged 1 commit into
mainfrom
fix/wake-dropnic-ip-race

Conversation

@CMGS
Copy link
Copy Markdown
Contributor

@CMGS CMGS commented May 18, 2026

Summary

  • wake's dropNIC branch was marking lifecycle=ready synchronously after Clone; the freshly hot-added NIC had no DHCP lease yet, so resolveVMIP returned "" and pod.status.podIP was published empty. Consumers gated on lifecycle=ready latched that empty value and kept dialing the pre-hibernate address (full repro and root-cause in Wake races virtual-kubelet podIP publish on CH+Windows dropNIC path #21).
  • New finalizeDropNICWake runs via goBackground (matches the non-dropNIC runPostCloneSetup shape), polls resolveVMIP every 200ms for up to 15s, then refreshStatus + notify + markReady. On timeout, failOp with WakeIPWaitTimeout; on a racing-hibernate VM untracking, silently returns so the hibernate lifecycle is not overwritten.
  • Budgets tunable via Provider.wakeFreshIPBudget / wakeFreshIPInterval so tests can shrink them.
  • New counter vk_cocoon_wake_ip_wait_total{result=ok|timeout} for observability.

Closes #21.

Test plan

  • go test ./... -race -count=1
  • make lint (darwin + linux, 0 issues)
  • New unit tests: ready-on-IP-arrival, timeout→Failed, VM-forgotten→no-op
  • Manual e2e on testing: hibernate → wake a simular/win10-hot-testing pod, verify pod.status.podIP is the post-clone IP at the moment lifecycle=ready is patched

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes the CH+Windows dropNIC wake path so lifecycle Ready is deferred until a fresh VM IP is observed, aiming to prevent consumers from reading an empty/stale pod IP.

Changes:

  • Adds dropNIC wake IP wait defaults and Provider tunables.
  • Introduces finalizeDropNICWake / waitForFreshIP and uses them asynchronously from wake.
  • Adds tests for IP arrival, timeout, and forgotten-VM behavior plus a new Prometheus counter.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
provider/cocoon/update.go Defers dropNIC wake Ready until IP wait completes.
provider/cocoon/update_test.go Adds unit coverage for the new finalizer outcomes.
provider/cocoon/provider.go Adds wake IP wait tunable fields.
metrics/metrics.go Registers wake IP wait outcome counter.
Comments suppressed due to low confidence (1)

provider/cocoon/update.go:228

  • The timeout branch only treats an untracked VM as a race, but hibernate keeps the VM tracked until Remove completes. If this waiter times out while a hibernate is in progress, failOp can publish Failed over Hibernating for the current generation. This should be gated on the same wake operation/current lifecycle (or VM ID) rather than just vmForPod != nil.
	if p.vmForPod(pod.Namespace, pod.Name) == nil {
		// Racing hibernate forgot the VM; do not overwrite its lifecycle.
		return
	}
	metrics.WakeIPWaitTotal.WithLabelValues("timeout").Inc()
	budget := cmp.Or(p.wakeFreshIPBudget, defaultWakeFreshIPBudget)
	err := fmt.Errorf("wake %s: dhcp lease for fresh NIC not observed within %s", v.Name, budget)
	p.failOp(ctx, pod, "WakeIPWaitTimeout", "update", err)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread provider/cocoon/update.go Outdated
Comment thread provider/cocoon/update_test.go
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread provider/cocoon/update.go Outdated
Comment thread provider/cocoon/update.go Outdated
Comment thread provider/cocoon/update_test.go
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comment thread metrics/metrics.go
Comment thread provider/cocoon/update.go
Comment thread provider/cocoon/update.go Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

The dropNIC wake path (7dd2c7e) marked lifecycle=Ready as soon as Clone
returned. With --nics 1 hot-add the freshly attached NIC has no lease
yet, so resolveVMIP returns "" and pod.status.podIP is published empty.
Consumers gated on lifecycle=Ready (e.g. vm-service.post_wake_secrets)
latch the pre-DHCP empty string, fall through to None, skip the secret
refresh, and keep dialing the pre-hibernate address.

Mirror the non-dropNIC branch's shape: dispatch finalizeDropNICWake via
goBackground, poll resolveVMIP (LeaseParser path) on a 200ms tick up to
15s, then refreshStatus + notify + markReady. On timeout failOp Failed
with WakeIPWaitTimeout; on racing-hibernate VM untracking, return silent
so we do not trample the hibernate lifecycle.

Tunables hang off Provider so tests can shrink them. Closes #21.
@CMGS CMGS force-pushed the fix/wake-dropnic-ip-race branch from 5c4354d to 554aa1d Compare May 18, 2026 08:38
@CMGS CMGS requested a review from Copilot May 18, 2026 08:38
@CMGS CMGS merged commit c5400a0 into main May 18, 2026
4 checks passed
@CMGS CMGS deleted the fix/wake-dropnic-ip-race branch May 18, 2026 08:44
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread provider/cocoon/update.go
Comment on lines +214 to +216
p.refreshStatus(ctx, pod)
p.notify(pod)
if p.markLifecycleStateForWake(ctx, pod, v.ID, meta.LifecycleStateReady, "") {
Comment thread provider/cocoon/update.go
Comment on lines +260 to +265
for {
v := p.vmForPod(namespace, name)
if v == nil {
return false
}
if ip := p.resolveVMIP(namespace, name, v); ip != "" {
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wake races virtual-kubelet podIP publish on CH+Windows dropNIC path

2 participants