fix: optimize sandbox cleaning performance#796
Open
chrisburr wants to merge 13 commits into DIRACGrid:main from
Conversation
Force-pushed from da97dce to b99d290
Split the single OR-based deletion query into two targeted queries:

1. Assigned but orphaned sandboxes — fast with the Location key + NOT EXISTS (~4.7s on prod vs 140s before)
2. Unassigned sandboxes older than 15 days — benefits from the new idx_assigned_lastaccesstime composite index

Additional changes:

- Add a composite index on (Assigned, LastAccessTime) for the unassigned query
- Fetch distinct SEName values upfront to enable the Location unique key
- Replace the non-sargable days_since(LastAccessTime) with a pre-computed threshold (LastAccessTime < NOW() - 15 days)
- Increase the default batch_size from 500 to 5000
- Chunk S3 bulk deletes into groups of 1000 (S3 API limit)
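Roughly, the two candidate queries could look like the sketch below. Table and column names (sb_SandBoxes, sb_EntityMapping, SBId, SEName, SEPFN, Assigned, LastAccessTime) are assumed from DIRAC's sandbox metadata schema and are not taken from this PR:

```python
from datetime import datetime, timedelta, timezone

from sqlalchemy import text

# 1. Assigned but orphaned: filter on SEName (Location key) and use NOT EXISTS
#    against the entity-mapping table instead of an OR across both conditions.
orphaned_assigned = text(
    """
    SELECT SBId, SEName, SEPFN FROM sb_SandBoxes sb
    WHERE sb.SEName = :se_name
      AND sb.Assigned = 1
      AND NOT EXISTS (
          SELECT 1 FROM sb_EntityMapping em WHERE em.SBId = sb.SBId
      )
    LIMIT :batch_size
    """
)

# 2. Unassigned and idle for 15 days: a pre-computed threshold keeps the
#    predicate sargable, so the (Assigned, LastAccessTime) index can be used.
threshold = datetime.now(timezone.utc) - timedelta(days=15)
unassigned_stale = text(
    """
    SELECT SBId, SEName, SEPFN FROM sb_SandBoxes
    WHERE Assigned = 0
      AND LastAccessTime < :threshold
    LIMIT :batch_size
    """
)
```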
Half the workers prefer orphaned sandboxes first, half prefer unassigned first. This ensures both query types make progress concurrently instead of one starving the other.
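As an illustration only (this multi-worker scheme is replaced later in the PR), the preference can simply key off the worker index; the names here are hypothetical:

```python
# Hypothetical: even-numbered workers try the orphaned query first,
# odd-numbered workers try the unassigned query first.
def query_order(worker_index: int) -> list[str]:
    if worker_index % 2 == 0:
        return ["orphaned_assigned", "unassigned_stale"]
    return ["unassigned_stale", "orphaned_assigned"]
```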
Track the last SBId seen per worker and add SBId > cursor to each query. This avoids re-scanning millions of already-checked rows on every batch, turning O(total_rows) scans into O(batch_size) seeks.
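A minimal sketch of the keyset-pagination idea; the cursor handling and query shape are assumptions, not the PR's exact code:

```python
from sqlalchemy import text

# Seek past the last primary key already examined instead of re-scanning it.
batch_query = text(
    """
    SELECT SBId FROM sb_SandBoxes
    WHERE Assigned = 0
      AND LastAccessTime < :threshold
      AND SBId > :cursor
    ORDER BY SBId
    LIMIT :batch_size
    """
)

cursor = 0  # after each batch, advance to the max SBId that was returned
```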
Replace multi-worker approach with a single sequential query loop and larger batches (50k). S3 delete chunks are now sent concurrently via TaskGroup. This eliminates lock contention between workers while keeping S3 throughput high.
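A sketch of the concurrent S3 chunk deletion with asyncio.TaskGroup (Python 3.11+); the helper name and client wiring are assumptions:

```python
import asyncio

CHUNK = 1000  # S3 DeleteObjects accepts at most 1000 keys per request


async def delete_keys_concurrently(s3_client, bucket: str, keys: list[str]) -> None:
    # One task per 1000-key chunk; the TaskGroup waits for all of them and
    # propagates the first failure.
    async with asyncio.TaskGroup() as tg:
        for i in range(0, len(keys), CHUNK):
            chunk = keys[i : i + CHUNK]
            tg.create_task(
                s3_client.delete_objects(
                    Bucket=bucket,
                    Delete={"Objects": [{"Key": k} for k in chunk]},
                )
            )
```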
Split the SELECT FOR UPDATE into two steps:

1. SELECT without locking to find candidate SBIds (scans many rows)
2. SELECT FOR UPDATE SKIP LOCKED WHERE SBId IN (...) to lock only the exact matching rows by primary key

Previously FOR UPDATE locked every row examined during the scan (82M rows for a 50k result), causing 143MB of lock memory and 12+ minute DELETE times. Now only the returned rows are locked.
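In SQL terms the two steps look roughly like this (a sketch, not the PR's literal statements; table and column names assumed as above):

```python
from sqlalchemy import bindparam, text

# Step 1: find candidates without taking any locks.
find_candidates = text(
    "SELECT SBId FROM sb_SandBoxes "
    "WHERE Assigned = 0 AND LastAccessTime < :threshold "
    "LIMIT :batch_size"
)

# Step 2: lock only the exact rows by primary key; rows already locked by
# another worker are skipped instead of waited on.
lock_candidates = text(
    "SELECT SBId, SEName, SEPFN FROM sb_SandBoxes "
    "WHERE SBId IN :sb_ids FOR UPDATE SKIP LOCKED"
).bindparams(bindparam("sb_ids", expanding=True))
```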
Split the single long transaction into three phases: SELECT candidates (short transaction, no locks), S3 delete (no transaction), and DB DELETE (short transaction). This eliminates the SELECT FOR UPDATE that locked every row examined during the primary key scan.
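The batch loop then has roughly this shape; select_candidates, delete_s3_objects and delete_db_rows are placeholder names, not functions from this PR:

```python
async def clean_one_batch(engine, s3_client, bucket: str, batch_size: int) -> int:
    # Phase 1: short transaction just to find candidates; no locks held afterwards.
    async with engine.begin() as conn:
        candidates = await select_candidates(conn, batch_size)
    if not candidates:
        return 0

    # Phase 2: S3 deletion happens with no DB transaction open at all.
    await delete_s3_objects(s3_client, bucket, [key for _, key in candidates])

    # Phase 3: another short transaction that deletes only the selected rows.
    async with engine.begin() as conn:
        await delete_db_rows(conn, [sb_id for sb_id, _ in candidates])
    return len(candidates)
```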
A single DELETE of 50k rows caused MySQL to lock 30M rows during the index scan. Split into 1000-row chunks running up to 10 concurrently via asyncio.Semaphore, keeping each transaction short.
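A sketch of the chunked DELETE with a concurrency cap; the chunk size and limit come from the description above, everything else is assumed:

```python
import asyncio

from sqlalchemy import bindparam, text

DELETE_CHUNK = 1000
MAX_CONCURRENT_DELETES = 10

delete_stmt = text("DELETE FROM sb_SandBoxes WHERE SBId IN :sb_ids").bindparams(
    bindparam("sb_ids", expanding=True)
)


async def delete_rows_chunked(engine, sb_ids: list[int]) -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT_DELETES)

    async def delete_chunk(chunk: list[int]) -> None:
        # Each chunk is its own short transaction, so MySQL only has to lock
        # the ~1000 rows actually being removed.
        async with sem:
            async with engine.begin() as conn:
                await conn.execute(delete_stmt, {"sb_ids": chunk})

    async with asyncio.TaskGroup() as tg:
        for i in range(0, len(sb_ids), DELETE_CHUNK):
            tg.create_task(delete_chunk(sb_ids[i : i + DELETE_CHUNK]))
```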
Log duration of each phase (SELECT, S3 delete, DB delete) per batch so bottlenecks are immediately visible. Demote per-chunk DB delete log to DEBUG to reduce noise.
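Per-phase timing can be as simple as wrapping each phase with a monotonic clock (a sketch, not the PR's code):

```python
import logging
import time

logger = logging.getLogger(__name__)


async def timed(name: str, coro):
    # Log the wall-clock duration of one phase of the batch, e.g.
    # candidates = await timed("select", select_candidates(conn, batch_size)).
    # Per-chunk messages inside a phase would use logger.debug instead.
    start = time.monotonic()
    result = await coro
    logger.info("%s took %.2fs", name, time.monotonic() - start)
    return result
```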
The default max_pool_connections=10 serialized 50 concurrent delete requests into 5 rounds, adding ~4s latency per round against Ceph.
Add s3_max_pool_connections setting (default 50) to SandboxStoreSettings, configurable via DIRACX_SANDBOX_STORE_S3_MAX_POOL_CONNECTIONS.
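With botocore the pool size is raised via Config(max_pool_connections=...); the settings plumbing shown here is an assumption:

```python
from botocore.config import Config

# The default pool holds 10 connections; 50 lets all concurrent DeleteObjects
# calls of a batch go out in one round instead of queueing behind the pool.
s3_max_pool_connections = 50  # e.g. from DIRACX_SANDBOX_STORE_S3_MAX_POOL_CONNECTIONS

s3_config = Config(max_pool_connections=s3_max_pool_connections)
# The config object is then passed when creating the (aio)botocore S3 client,
# e.g. session.create_client("s3", config=s3_config, ...).
```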
Avoids redundant SELECT queries when the previous batch already exhausted all matching rows.
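The check can be a simple comparison against the batch size (a trivial sketch with placeholder helpers):

```python
async def clean_until_exhausted(engine, batch_size: int) -> None:
    while True:
        candidates = await select_candidates(engine, batch_size)
        if not candidates:
            break
        await process_batch(candidates)
        if len(candidates) < batch_size:
            # A short batch means the matching rows are exhausted; stop here
            # instead of issuing one more SELECT that would return nothing.
            break
```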
s3_bulk_delete_with_retry now returns failed keys instead of raising, and _s3_delete_chunk_with_retry retries only the failed keys. The caller skips DB deletion for sandboxes whose S3 objects were not removed, preventing dark data. Also removes the per-batch SELECT DISTINCT SEName query by passing settings.se_name directly, and updates the stale docstring.
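S3's DeleteObjects response lists per-key errors, so the retry helper can report failed keys instead of raising. The function name below mirrors the description, but the body is an assumed sketch:

```python
async def _s3_delete_chunk_with_retry(
    s3_client, bucket: str, keys: list[str], attempts: int = 3
) -> list[str]:
    """Delete one chunk, retrying only the keys that failed.

    Returns the keys still not deleted after all attempts, so the caller can
    keep the corresponding DB rows and avoid dark data.
    """
    remaining = keys
    for _ in range(attempts):
        if not remaining:
            break
        response = await s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in remaining], "Quiet": True},
        )
        # With Quiet=True the response only lists failures, under "Errors".
        remaining = [err["Key"] for err in response.get("Errors", [])]
    return remaining
```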
Force-pushed from b99d290 to 8a8e4b2
This index is not needed for the current cleanup queries.
Summary
Optimise sandbox cleanup to sustain ~8k deletes/s, clearing ~30M sandboxes in ~1 hour on LHCb prod.
Key changes:
- SELECT FOR UPDATE SKIP LOCKED is gone — three short phases are used instead: SELECT (no locks), S3 delete, DB delete
- Larger S3 connection pool (new s3_max_pool_connections setting) — the default was serialising parallel S3 requests

Before
The previous implementation did not scale to large sandbox stores. Query and locking behaviour became quadratic with table size, putting massive load on the database.
After
~480,000 sandboxes/minute (~8k/s), ~30M deleted in 1 hour.