feat(orchestrator): add concurrent sandbox creation benchmark #2280
Conversation
The current maxStartingInstancesPerNode limit of 3 was established from non-rigorous historical measurements on n1 machines. To find the optimal limit for n2 machines we need proper data. This adds BenchmarkConcurrentResume which launches N sandboxes simultaneously via barrier-synchronized goroutines and measures per-sandbox creation latency (avg, P50, P95, P99, min, max), wall-clock time, and success/failure counts. Concurrency levels default to 1,2,3,4,5,6,7,8,9,10 and can be overridden via the CONCURRENCY_LEVELS env var. Network and NBD pool sizes are scaled to the max concurrency level to avoid measuring pool starvation instead of actual creation overhead.
Replace cgroup.NewNoopManager() with the real cgroup.NewManager() so sandboxes get proper CPU and memory isolation via cgroupv2, matching production behavior.
Accumulate per-sandbox latencies and wall-clock times across all benchmark iterations instead of reporting only the last iteration. This makes percentiles (P50/P95/P99) statistically meaningful at high iteration counts (e.g. -benchtime=100x at concurrency-5 gives 500 latency samples instead of 5).
PR Summary (Low Risk). Reviewed by Cursor Bugbot for commit d86b43c.
- Use max/min builtins instead of if-chains in percentile()
- Use WaitGroup.Go instead of the manual Add/Done goroutine pattern
- Replace sort.Slice/sort.Ints with slices.Sort
- Add a blank line before bare return (nlreturn)
- Rename min/max variables to fastest/slowest to avoid shadowing the builtins
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d98b658f8
LGTM — benchmark-only addition with no production code impact; the two nit-level issues are captured in inline comments.
Extended reasoning
Overview
This PR adds BenchmarkConcurrentResume (a new _test.go file) and bench2md.awk (an auxiliary reporting script). No production code paths are modified.
Security risks
None. The benchmark requires root and is not part of any deployed artifact. The AWK script is a local post-processing helper.
Level of scrutiny
Low. This is developer tooling for tuning maxStartingInstancesPerNode. Correctness matters for drawing valid conclusions from the benchmark, but bugs here cannot affect production behaviour.
Other factors
My two prior inline comments (tracerProvider shutdown, misleading ns/op) were both reasonably dismissed by the author with valid explanations and marked resolved. The two new nits — a misleading docstring about pool-size scaling and throughput overcounting on failures — are flagged as inline comments. At the default concurrency levels (1–10) neither issue affects the benchmark results in practice. The overall design (barrier-based concurrent launch, aggregated percentiles, LIFO cleanup) is sound.
sitole left a comment
Nits, lgtm. Please resolve the comments from AI.
Use a more descriptive filename for the awk script that converts Go benchmark output to markdown tables.
Keep benchmark files in a dedicated subdirectory to separate them from the main orchestrator source. Update relative paths to account for the extra directory level.
This PR adds a benchmark to measure how many sandboxes can be effectively
resumed in parallel on a single node, to help tune
maxStartingInstancesPerNode for different machine types.
- BenchmarkConcurrentResume launches N sandboxes simultaneously and reports latency percentiles (P50/P95/P99) and wall-clock time
- Latency samples are accumulated across iterations so percentiles stay meaningful at high iteration counts (e.g. -benchtime=100x)
- bench2md.awk helper to convert raw benchmark output to markdown tables