Fix python expansion service startup timeout#38611
Draft
shunping wants to merge 6 commits into
Draft
Conversation
…ilure (apache#38572)" This reverts commit 930b94c.
When a SubprocessServer fails to start (e.g., due to a process exit or startup error), the server process could leak if standard purging is blocked by other active owners sharing the cached subprocess. To fix this: - Implement `_SharedCache.force_remove()` to immediately remove a key from the cache and run its destructor regardless of active owners. - Add `SubprocessServer.stop_force()` which calls `force_remove()` to completely terminate the server's process. - Call `stop_force()` in the `except` block of `SubprocessServer.start()`
This ensures we can download pre-built wheels for environment staging rather than relying on tarball building, which is sometimes slow.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR reverts the changes from #38572 and addresses the root cause of the expansion service timeouts by enabling pre-built wheel downloads and improving subprocess cleanup.
Further investigation into the expansion service test hangs revealed that the issue was not caused by network instability or port collisions. Instead, the service was timing out while downloading and building the numpy source tarball. This build process is heavily influenced by GitHub Runner load, leading to frequent timeouts during environment staging.
In this PR, we update Stager to support new manylinux tag, which allows the service to download pre-built wheels rather than compiling from source. This significanly accelerate the startup time of the expansion service. In local environments, these changes reduced the runtime of
MLTest::test_ml_preprocessing_yamlfrom 110 seconds to 8 seconds!Furthermore, we introduce
force_remove()andstop_force()to ensure that if a SubprocessServer fails to start, the process is immediately terminated and removed from the cache, preventing leaks even when shared by multiple owners.