De-flake builtin_tools E2E tests with a longer send timeout#1538
Merged
Conversation
The builtin_tools E2E test `should_capture_exit_code_in_output` flaked once on the Windows Rust CI job: `send_and_wait` hit its 60s default and the test panicked. The re-run passed. The failure clustered in a ~90s window where multiple unrelated tests were simultaneously slow, which points to transient Windows-runner contention (e2e runs 4-wide on a 4-vCPU runner, each test spawning a CLI + replay proxy + a real shell) rather than a deadlock or a genuinely slow command. Raise the per-send wait timeout for the builtin_tools E2E suites from 60s to 120s across all five SDKs (Rust, Python, Go, Node, .NET) so transient slowness no longer turns into a hard failure, while still failing fast on a real hang. Assertions, snapshots, Windows skips, and CI concurrency are unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR reduces flakiness in the cross-SDK builtin_tools E2E suites by increasing the per-send_and_wait timeout from 60s to 120s, addressing intermittent Windows CI contention when running multiple CLI + proxy subprocesses concurrently.
Changes:
- Increased
send_and_wait/SendAndWaittimeouts to 120s for builtin tool E2E tests across Rust, Python, Go, Node.js, and .NET. - Added small per-suite constants/helpers (and brief rationale comments) to keep the timeout change consistent and easy to maintain.
- In Node.js, also increased the per-test Vitest timeout to exceed the longer send timeout (and the global 30s test timeout).
Show a summary per file
| File | Description |
|---|---|
| rust/tests/e2e/builtin_tools.rs | Adds a 120s SEND_TIMEOUT and uses MessageOptions::with_wait_timeout via a helper for all builtin tool sends. |
| python/e2e/test_builtin_tools_e2e.py | Introduces SEND_TIMEOUT = 120.0 and passes it to session.send_and_wait(..., timeout=...) for all tests. |
| nodejs/test/e2e/builtin_tools.e2e.test.ts | Adds SEND_TIMEOUT_MS and raises per-test Vitest timeout to avoid the global 30s timeout binding first. |
| go/internal/e2e/builtin_tools_e2e_test.go | Wraps each SendAndWait call with a 120s context.WithTimeout so Go’s default 60s deadline doesn’t apply. |
| dotnet/test/E2E/BuiltinToolsE2ETests.cs | Adds a 120s SendTimeout and passes it to each SendAndWaitAsync call. |
Copilot's findings
- Files reviewed: 5/5 changed files
- Comments generated: 0
Contributor
Cross-SDK Consistency Review ✅This PR maintains excellent cross-SDK consistency. The timeout increase is applied uniformly across all five applicable SDK implementations:
Java is correctly omitted as it has no Each implementation uses the appropriate language idioms while achieving the same semantic result. No consistency issues found.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
The builtin_tools E2E test
should_capture_exit_code_in_output(prompt: "Run 'echo hello && echo world'. Tell me the exact output.") flaked once on the Windows Rust CI job and passed on re-run, so the failure looked like flakiness rather than a real regression.Root cause
I pulled the failed job log.
send_and_waithit its 60s default and the Rust test's.expect("send")panicked withSession(Timeout(60s)). The deeper question was whether 60s indicated a hang/deadlock. It does not:IdleWaiteris race-free (registered before send, RAII-cleared, timeout wraps the whole send+wait), and replay matching is deterministic (verified over repeated local runs).ask_user::should_receive_choices~50s,should_edit_a_file_successfully~62s but passed, our test 60s). A real deadlock would not cluster temporally; shared-resource contention does.The actual cause is transient Windows CI contention: the Rust suite runs e2e with
RUST_E2E_CONCURRENCY: 4and--test-threads=4on a 4-vCPU runner, and each test spawns its owncopilot.exeCLI plus a Node replay proxy (and the builtin_tools tests additionally spawn a real shell/PowerShell). Under that load a normally-few-second agent loop got starved past the 60s default. Only Rust flaked because only Rust runs its e2e tests concurrently; Node/Python/Go/.NET run them serially.Approach
Raise the per-send wait timeout for the builtin_tools E2E suites from 60s to 120s (double the default; still well under Rust's 300s Windows test budget, so a genuine hang still fails fast). Applied suite-wide because every builtin_tools test shares the same "spawn CLI + run a real builtin tool" exposure, so a sibling would otherwise flake next. Assertions, snapshots, Windows skips, and CI concurrency are left untouched, so coverage is unchanged.
Done consistently across all five SDKs that carry this test:
rust/tests/e2e/builtin_tools.rs):SEND_TIMEOUTconst + amessage()helper usingMessageOptions::with_wait_timeout.python/e2e/test_builtin_tools_e2e.py):SEND_TIMEOUT = 120.0passed astimeout=.go/internal/e2e/builtin_tools_e2e_test.go):const sendTimeout = 120 * time.Second; each subtest wraps the call ctx withcontext.WithTimeout(Go honors a ctx deadline, else 60s).nodejs/test/e2e/builtin_tools.e2e.test.ts): pass120_000as the send timeout and raise each test's vitest timeout to180_000, because the global 30stestTimeoutwould otherwise bind first.dotnet/test/E2E/BuiltinToolsE2ETests.cs):SendTimeout = TimeSpan.FromSeconds(120)passed to eachSendAndWaitAsync.Each language carries a short comment explaining the slow/concurrent-Windows-CI rationale so the larger value is not mysterious. Java has no builtin_tools e2e test, so it needs no change.
Validation
cargo build+cargo fmt; Gogofmt+go vet; Pythonruff; Nodetsc+eslint+prettier; .NETdotnet build(0 warnings) +dotnet format --verify-no-changes.Generated by Copilot