Skip to content

De-flake builtin_tools E2E tests with a longer send timeout#1538

Merged
stephentoub merged 1 commit into
mainfrom
stephentoub/deflake-builtin-exit-code-test
Jun 1, 2026
Merged

De-flake builtin_tools E2E tests with a longer send timeout#1538
stephentoub merged 1 commit into
mainfrom
stephentoub/deflake-builtin-exit-code-test

Conversation

@stephentoub
Copy link
Copy Markdown
Collaborator

Why

The builtin_tools E2E test should_capture_exit_code_in_output (prompt: "Run 'echo hello && echo world'. Tell me the exact output.") flaked once on the Windows Rust CI job and passed on re-run, so the failure looked like flakiness rather than a real regression.

Root cause

I pulled the failed job log. send_and_wait hit its 60s default and the Rust test's .expect("send") panicked with Session(Timeout(60s)). The deeper question was whether 60s indicated a hang/deadlock. It does not:

  • The Rust IdleWaiter is race-free (registered before send, RAII-cleared, timeout wraps the whole send+wait), and replay matching is deterministic (verified over repeated local runs).
  • The failure sat inside a ~90s window where multiple unrelated tests were simultaneously slow (ask_user::should_receive_choices ~50s, should_edit_a_file_successfully ~62s but passed, our test 60s). A real deadlock would not cluster temporally; shared-resource contention does.

The actual cause is transient Windows CI contention: the Rust suite runs e2e with RUST_E2E_CONCURRENCY: 4 and --test-threads=4 on a 4-vCPU runner, and each test spawns its own copilot.exe CLI plus a Node replay proxy (and the builtin_tools tests additionally spawn a real shell/PowerShell). Under that load a normally-few-second agent loop got starved past the 60s default. Only Rust flaked because only Rust runs its e2e tests concurrently; Node/Python/Go/.NET run them serially.

Approach

Raise the per-send wait timeout for the builtin_tools E2E suites from 60s to 120s (double the default; still well under Rust's 300s Windows test budget, so a genuine hang still fails fast). Applied suite-wide because every builtin_tools test shares the same "spawn CLI + run a real builtin tool" exposure, so a sibling would otherwise flake next. Assertions, snapshots, Windows skips, and CI concurrency are left untouched, so coverage is unchanged.

Done consistently across all five SDKs that carry this test:

  • Rust (rust/tests/e2e/builtin_tools.rs): SEND_TIMEOUT const + a message() helper using MessageOptions::with_wait_timeout.
  • Python (python/e2e/test_builtin_tools_e2e.py): SEND_TIMEOUT = 120.0 passed as timeout=.
  • Go (go/internal/e2e/builtin_tools_e2e_test.go): const sendTimeout = 120 * time.Second; each subtest wraps the call ctx with context.WithTimeout (Go honors a ctx deadline, else 60s).
  • Node (nodejs/test/e2e/builtin_tools.e2e.test.ts): pass 120_000 as the send timeout and raise each test's vitest timeout to 180_000, because the global 30s testTimeout would otherwise bind first.
  • .NET (dotnet/test/E2E/BuiltinToolsE2ETests.cs): SendTimeout = TimeSpan.FromSeconds(120) passed to each SendAndWaitAsync.

Each language carries a short comment explaining the slow/concurrent-Windows-CI rationale so the larger value is not mysterious. Java has no builtin_tools e2e test, so it needs no change.

Validation

  • Static checks clean in every language: Rust cargo build + cargo fmt; Go gofmt + go vet; Python ruff; Node tsc + eslint + prettier; .NET dotnet build (0 warnings) + dotnet format --verify-no-changes.
  • Ran the Node builtin_tools E2E suite end-to-end (same replay harness): 7 passed, 1 skipped on Windows, including the target test.

Generated by Copilot

The builtin_tools E2E test `should_capture_exit_code_in_output` flaked once
on the Windows Rust CI job: `send_and_wait` hit its 60s default and the test
panicked. The re-run passed. The failure clustered in a ~90s window where
multiple unrelated tests were simultaneously slow, which points to transient
Windows-runner contention (e2e runs 4-wide on a 4-vCPU runner, each test
spawning a CLI + replay proxy + a real shell) rather than a deadlock or a
genuinely slow command.

Raise the per-send wait timeout for the builtin_tools E2E suites from 60s to
120s across all five SDKs (Rust, Python, Go, Node, .NET) so transient slowness
no longer turns into a hard failure, while still failing fast on a real hang.
Assertions, snapshots, Windows skips, and CI concurrency are unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 1, 2026 20:46
@stephentoub stephentoub requested a review from a team as a code owner June 1, 2026 20:46
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces flakiness in the cross-SDK builtin_tools E2E suites by increasing the per-send_and_wait timeout from 60s to 120s, addressing intermittent Windows CI contention when running multiple CLI + proxy subprocesses concurrently.

Changes:

  • Increased send_and_wait / SendAndWait timeouts to 120s for builtin tool E2E tests across Rust, Python, Go, Node.js, and .NET.
  • Added small per-suite constants/helpers (and brief rationale comments) to keep the timeout change consistent and easy to maintain.
  • In Node.js, also increased the per-test Vitest timeout to exceed the longer send timeout (and the global 30s test timeout).
Show a summary per file
File Description
rust/tests/e2e/builtin_tools.rs Adds a 120s SEND_TIMEOUT and uses MessageOptions::with_wait_timeout via a helper for all builtin tool sends.
python/e2e/test_builtin_tools_e2e.py Introduces SEND_TIMEOUT = 120.0 and passes it to session.send_and_wait(..., timeout=...) for all tests.
nodejs/test/e2e/builtin_tools.e2e.test.ts Adds SEND_TIMEOUT_MS and raises per-test Vitest timeout to avoid the global 30s timeout binding first.
go/internal/e2e/builtin_tools_e2e_test.go Wraps each SendAndWait call with a 120s context.WithTimeout so Go’s default 60s deadline doesn’t apply.
dotnet/test/E2E/BuiltinToolsE2ETests.cs Adds a 120s SendTimeout and passes it to each SendAndWaitAsync call.

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 0

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

Cross-SDK Consistency Review ✅

This PR maintains excellent cross-SDK consistency. The timeout increase is applied uniformly across all five applicable SDK implementations:

SDK Change
Rust SEND_TIMEOUT const + MessageOptions::with_wait_timeout
Python SEND_TIMEOUT = 120.0 passed as timeout=
Go const sendTimeout = 120 * time.Second + context.WithTimeout
Node.js 120_000 ms send timeout + 180_000 ms vitest test timeout
.NET SendTimeout = TimeSpan.FromSeconds(120)

Java is correctly omitted as it has no builtin_tools E2E test suite.

Each implementation uses the appropriate language idioms while achieving the same semantic result. No consistency issues found.

Generated by SDK Consistency Review Agent for issue #1538 · ● 1.2M ·

@stephentoub stephentoub enabled auto-merge June 1, 2026 22:23
@stephentoub stephentoub disabled auto-merge June 1, 2026 22:23
@stephentoub stephentoub merged commit 64c606f into main Jun 1, 2026
44 checks passed
@stephentoub stephentoub deleted the stephentoub/deflake-builtin-exit-code-test branch June 1, 2026 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants