Fix python expansion service startup timeout #38611

Draft
shunping wants to merge 6 commits into apache:master from shunping:fix-expansion-service-timeout

@shunping shunping commented May 23, 2026

This PR reverts the changes from #38572 and addresses the root cause of the expansion service timeouts by enabling pre-built wheel downloads and improving subprocess cleanup.

Further investigation into the expansion service test hangs revealed that the issue was not caused by network instability or port collisions. Instead, the service was timing out while downloading and building the numpy source tarball. This build process is heavily influenced by GitHub Runner load, leading to frequent timeouts during environment staging.

In this PR, we update Stager to support a new manylinux tag, which allows the service to download pre-built wheels rather than compiling from source. This significantly accelerates the startup of the expansion service. In local environments, these changes reduced the runtime of MLTest::test_ml_preprocessing_yaml from 110 seconds to 8 seconds!
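To illustrate why a missing platform tag forces a source build, here is a minimal sketch of wheel selection by platform tag. It is not the Stager code: `pick_artifact`, `wheel_platform_tags`, the `examplepkg` filenames, and the specific tags are all hypothetical, and the Python and ABI tags are ignored for brevity.

```python
def wheel_platform_tags(filename):
    """Return the platform tags encoded in a wheel filename.

    Wheel names end in ...-{python tag}-{abi tag}-{platform tag}.whl, and a
    compressed tag set joins several platform tags with dots.
    """
    stem = filename[:-len(".whl")]
    return stem.split("-")[-1].split(".")


def pick_artifact(candidates, supported_platforms):
    """Prefer a wheel with a supported platform tag, else fall back to the sdist."""
    for name in candidates:
        if name.endswith(".whl") and any(
                tag in supported_platforms
                for tag in wheel_platform_tags(name)):
            return name
    for name in candidates:
        if name.endswith(".tar.gz"):
            return name  # source tarball: must be built locally, which is slow
    return None


# Hypothetical release that only ships a wheel for a newer manylinux tag.
candidates = [
    "examplepkg-1.0-cp311-cp311-manylinux_2_28_x86_64.whl",
    "examplepkg-1.0.tar.gz",
]
old_tags = {"manylinux2014_x86_64"}
new_tags = old_tags | {"manylinux_2_28_x86_64"}

before = pick_artifact(candidates, old_tags)  # falls back to the tarball
after = pick_artifact(candidates, new_tags)   # selects the pre-built wheel
```

In this sketch, adding the newer tag to the accepted set is the only change needed to switch from the tarball to the wheel.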

Furthermore, we introduce force_remove() and stop_force() to ensure that if a SubprocessServer fails to start, the process is immediately terminated and removed from the cache, preventing leaks even when shared by multiple owners.
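The difference between owner-based release and forced removal can be sketched as follows. This is a simplified illustration, not the actual `_SharedCache` code: the class name `SharedCacheSketch`, its `get`/`release` methods, and the owner bookkeeping are assumptions. Only the idea of `force_remove()` running the destructor regardless of active owners comes from this PR.

```python
import threading


class SharedCacheSketch:
    """Toy owner-tracked cache (hypothetical, for illustration only)."""

    def __init__(self, constructor, destructor):
        self._constructor = constructor
        self._destructor = destructor
        self._lock = threading.Lock()
        self._entries = {}  # key -> (cached object, set of owners)

    def get(self, key, owner):
        """Return the cached object for key, creating it on first use."""
        with self._lock:
            if key not in self._entries:
                self._entries[key] = (self._constructor(key), set())
            obj, owners = self._entries[key]
            owners.add(owner)
            return obj

    def release(self, key, owner):
        """Normal path: destroy the object only when the last owner leaves."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return
            obj, owners = entry
            owners.discard(owner)
            if owners:
                return  # other owners still hold it, so the object survives
            del self._entries[key]
        self._destructor(obj)

    def force_remove(self, key):
        """Failure path: drop the key and destroy the object immediately."""
        with self._lock:
            entry = self._entries.pop(key, None)
        if entry is not None:
            self._destructor(entry[0])


destroyed = []
cache = SharedCacheSketch(
    constructor=lambda key: f"process-for-{key}",
    destructor=destroyed.append)

cache.get("svc", owner="a")
cache.get("svc", owner="b")
cache.release("svc", owner="a")  # "b" is still an owner: nothing destroyed
leaked_without_force = list(destroyed)
cache.force_remove("svc")        # destroyed even though "b" is an owner
cache.release("svc", owner="b")  # no-op: the key is already gone
```

In the sketch, a failed startup that relied only on `release` would leave the process alive for as long as another owner held the key. `force_remove` closes that gap.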

shunping added 3 commits May 22, 2026 19:24
When a SubprocessServer fails to start (e.g., due to a process exit or
startup error), the server process could leak if standard purging
is blocked by other active owners sharing the cached subprocess.

To fix this:
- Implement `_SharedCache.force_remove()` to immediately remove a key
  from the cache and run its destructor regardless of active owners.
- Add `SubprocessServer.stop_force()` which calls `force_remove()` to
  completely terminate the server's process.
- Call `stop_force()` in the `except` block of `SubprocessServer.start()`.

This ensures we can download pre-built wheels for environment staging
rather than relying on tarball building, which is sometimes slow.
@shunping shunping changed the title Fix expansion service timeout Fix python expansion service startup timeout May 23, 2026