fix crio deadlock in getting crio sandbox containers#3838

Merged
dims merged 3 commits into google:master from olyazavr:fix-crio-deadlock
Mar 2, 2026
@olyazavr
Contributor

In cri-o/cri-o#8748, I found that cadvisor v0.48.1 had a bug that caused cri-o to essentially deadlock when a long-terminating container coincided with a kubelet restart. This was fixed in cadvisor v0.49.0, but the bug is back in v0.52.1.

I found that #3457 was the fix, which was then later reverted: #3565

1.27 has cadvisor v0.47.2 (working)
1.29 has cadvisor v0.48.1 (broken)
1.30 has cadvisor v0.49.0 (working)
1.31 has cadvisor v0.49.0 (working)
1.33 has cadvisor v0.52.1 (broken)

What happens here is that cadvisor finds sandbox containers in cgroups (they exist as cgroup directories) and calls cri-o. cri-o returns 404 because sandbox containers aren't returned by the inspect endpoint. I suspect something then ends up holding a mutex that prevents anything else from going through, which causes kubelet to fail to start because cri-o and cadvisor are stuck.

This redoes the original PR (#3457) but addresses the systemd/cgroupfs issue.
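
To make the failure mode above concrete, here is a minimal sketch of treating "not found" from the runtime as "skip this cgroup" rather than as an error to keep working on. It is not the code in this PR: `inspector`, `fakeRuntime`, `errNotFound`, and `canHandle` are hypothetical names used only for illustration, and the real logic in `container/crio/factory.go` may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for a 404 from the runtime.
var errNotFound = errors.New("container not found")

// inspector is a hypothetical stand-in for a client of the runtime's inspect endpoint.
type inspector interface {
	Inspect(id string) error
}

// fakeRuntime knows only the IDs listed in known; everything else is "not found".
type fakeRuntime struct {
	known map[string]bool
}

func (f fakeRuntime) Inspect(id string) error {
	if f.known[id] {
		return nil
	}
	return errNotFound
}

// canHandle reports whether the cgroup with this ID should be tracked.
// A "not found" answer means "skip it", not "fail".
func canHandle(rt inspector, id string) (bool, error) {
	err := rt.Inspect(id)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errNotFound):
		return false, nil
	default:
		return false, fmt.Errorf("inspecting %q: %w", id, err)
	}
}

func main() {
	rt := fakeRuntime{known: map[string]bool{"app": true}}
	for _, id := range []string{"app", "sandbox"} {
		ok, err := canHandle(rt, id)
		fmt.Println(id, ok, err)
	}
}
```

The point of the sketch is only that a cgroup directory the runtime does not report can be rejected cheaply and without an error path.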

@google-cla

google-cla bot commented Feb 11, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>
@dims
Collaborator

dims commented Feb 19, 2026

@haircommander can you please review?

Comment thread container/crio/factory.go
return false, false, nil
}

// When using systemd as the cgroup driver, sandbox containers don't have
Contributor

this isn't the right way to check: users can still use cgroupfs when systemd is present. luckily, cri-o has for a while exposed the CgroupDriver in its info endpoint, which cadvisor is not currently using. can you update this to ask cri-o which cgroup driver it's using, and branch on that?

Contributor Author

oh yeah that's much nicer
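
A rough sketch of the suggestion in this thread: read the cgroup driver from the runtime's info response and branch on it, instead of inferring it from whether systemd is present. The JSON key `cgroup_driver`, the `runtimeInfo` struct, and `parseCgroupDriver` are assumptions for illustration; the sketch does not show how the info endpoint is reached or what each branch checks in the merged change.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type cgroupDriver string

const (
	driverSystemd  cgroupDriver = "systemd"
	driverCgroupfs cgroupDriver = "cgroupfs"
)

// runtimeInfo mirrors only the field of interest.
// The JSON key is an assumption about the info payload.
type runtimeInfo struct {
	CgroupDriver string `json:"cgroup_driver"`
}

// parseCgroupDriver extracts the driver from an info response body and
// rejects anything other than the two drivers discussed in this thread.
func parseCgroupDriver(body []byte) (cgroupDriver, error) {
	var info runtimeInfo
	if err := json.Unmarshal(body, &info); err != nil {
		return "", fmt.Errorf("decoding runtime info: %w", err)
	}
	switch d := cgroupDriver(info.CgroupDriver); d {
	case driverSystemd, driverCgroupfs:
		return d, nil
	default:
		return "", fmt.Errorf("unknown cgroup driver %q", info.CgroupDriver)
	}
}

func main() {
	// Example payload; a real response would come from the runtime.
	body := []byte(`{"cgroup_driver": "cgroupfs"}`)
	driver, err := parseCgroupDriver(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	switch driver {
	case driverSystemd:
		fmt.Println("apply the systemd-specific sandbox check")
	case driverCgroupfs:
		fmt.Println("apply the cgroupfs-specific sandbox check")
	}
}
```

Returning an error for an unrecognized driver keeps the caller from silently picking the wrong branch.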

@haircommander
Contributor

thanks for picking this up @olyazavr !

Signed-off-by: Olga Shestopalova <oshestopalova1@gmail.com>
@haircommander
Contributor

LGTM

@olyazavr
Contributor Author

@haircommander is there anything else I need to do here? Merging is blocked for me.

@haircommander
Contributor

cc @dims 👀

@dims dims merged commit 74dda38 into google:master Mar 2, 2026
10 checks passed
3 participants