fix: optimize sandbox cleaning performance#796

Open
chrisburr wants to merge 13 commits into DIRACGrid:main from chrisburr:fix/sandbox-cleaning-performance

Conversation

@chrisburr
Member

@chrisburr chrisburr commented Feb 17, 2026

Summary

Optimise sandbox cleanup to sustain ~8k deletes/s, clearing ~30M sandboxes in ~1 hour on LHCb prod.

Key changes:

  • Remove SELECT FOR UPDATE SKIP LOCKED — use three short phases instead: SELECT (no locks), S3 delete, DB delete (sketched below this list)
  • Chunk DB deletes into 1000-row batches (up to 10 concurrent) to avoid locking millions of rows in a single DELETE
  • Increase S3 connection pool from default 10 to 50 (s3_max_pool_connections setting) — was serialising parallel S3 requests
  • Handle S3 partial delete failures gracefully — only remove DB rows for successfully deleted S3 objects, failed keys are retried next run
  • Add per-phase timing to logs for easy bottleneck identification
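
The overall shape of the new cleanup loop, as a minimal sketch: the three helper callables (candidate selection, S3 bulk delete, DB delete) are hypothetical stand-ins for the diracx internals, and the constants mirror the values quoted in this PR.

```python
# Sketch only: the helper callables are stand-ins, not the diracx API.
import asyncio
from collections.abc import Awaitable, Callable, Sequence

BATCH_SIZE = 50_000          # rows selected per iteration
DB_CHUNK = 1_000             # rows deleted per DB transaction
MAX_DB_CONCURRENCY = 10      # chunk deletes in flight at once

async def clean_batch(
    select_candidates: Callable[[int], Awaitable[list[tuple[int, str]]]],
    s3_bulk_delete: Callable[[Sequence[str]], Awaitable[set[str]]],
    delete_db_rows: Callable[[Sequence[int]], Awaitable[None]],
) -> int:
    # Phase 1: plain SELECT in a short transaction, no FOR UPDATE, no row locks.
    rows = await select_candidates(BATCH_SIZE)
    if not rows:
        return 0

    # Phase 2: S3 delete outside any DB transaction; keep only sandboxes
    # whose objects were actually removed (failed keys are retried next run).
    failed_keys = await s3_bulk_delete([key for _, key in rows])
    deletable = [sb_id for sb_id, key in rows if key not in failed_keys]

    # Phase 3: chunked DB deletes, each in its own short transaction,
    # with at most MAX_DB_CONCURRENCY chunks running concurrently.
    sem = asyncio.Semaphore(MAX_DB_CONCURRENCY)

    async def delete_chunk(chunk: Sequence[int]) -> None:
        async with sem:
            await delete_db_rows(chunk)

    async with asyncio.TaskGroup() as tg:  # Python 3.11+
        for i in range(0, len(deletable), DB_CHUNK):
            tg.create_task(delete_chunk(deletable[i : i + DB_CHUNK]))

    return len(deletable)
```
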

Before

The previous implementation did not scale to large sandbox stores. Query and locking costs grew quadratically with table size, putting massive load on the database.

After

~480,000 sandboxes/minute (~8k/s), ~30M deleted in 1 hour.

@read-the-docs-community

read-the-docs-community bot commented Feb 18, 2026

Documentation build overview

📚 diracx | 🛠️ Build #31490362 | 📁 Comparing f25667b against latest (b2ad8c6)


🔍 Preview build

Files changed (1 file in total): 📝 1 modified | ➕ 0 added | ➖ 0 deleted
admin/reference/env-variables/index.html: 📝 modified

@chrisburr chrisburr force-pushed the fix/sandbox-cleaning-performance branch 2 times, most recently from da97dce to b99d290 on February 20, 2026 at 15:33

Split the single OR-based deletion query into two targeted queries:

1. Assigned but orphaned sandboxes — fast with Location key + NOT EXISTS
   (~4.7s on prod vs 140s before)
2. Unassigned sandboxes older than 15 days — benefits from new
   idx_assigned_lastaccesstime composite index

Additional changes:
- Add composite index on (Assigned, LastAccessTime) for the unassigned
  query
- Fetch distinct SEName values upfront to enable the Location unique key
- Replace non-sargable days_since(LastAccessTime) with a pre-computed
  threshold (LastAccessTime < NOW() - 15 days)
- Increase default batch_size from 500 to 5000
- Chunk S3 bulk deletes into groups of 1000 (S3 API limit)
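
Illustrative SQL for the two targeted queries, as a sketch: the table and mapping-table names are assumptions (they are not spelled out in this PR text), while SBId, SEName, Assigned, LastAccessTime and the composite index come from the commit message.

```python
from sqlalchemy import text

# Query 1: assigned sandboxes whose owner mapping no longer exists.
# Anchoring on a concrete SEName lets the planner use the Location unique key;
# sb_SandBoxes / sb_EntityMapping are assumed table names.
ORPHANED_ASSIGNED = text("""
    SELECT sb.SBId
    FROM sb_SandBoxes sb
    WHERE sb.SEName = :se_name
      AND sb.Assigned = 1
      AND NOT EXISTS (SELECT 1 FROM sb_EntityMapping m WHERE m.SBId = sb.SBId)
    LIMIT :batch_size
""")

# Query 2: unassigned sandboxes idle for 15+ days. Comparing against a
# pre-computed :threshold (instead of days_since(LastAccessTime)) keeps the
# predicate sargable, so the (Assigned, LastAccessTime) index can be used.
OLD_UNASSIGNED = text("""
    SELECT SBId
    FROM sb_SandBoxes
    WHERE Assigned = 0
      AND LastAccessTime < :threshold
    LIMIT :batch_size
""")
```
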
Half the workers prefer orphaned sandboxes first, half prefer
unassigned first. This ensures both query types make progress
concurrently instead of one starving the other.

Track the last SBId seen per worker and add SBId > cursor to each
query. This avoids re-scanning millions of already-checked rows on
every batch, turning O(total_rows) scans into O(batch_size) seeks.
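
A sketch of the keyset (cursor) pattern this commit describes, with the same assumed table name as above; later commits replace the multi-worker setup, but the SBId > cursor seek is the point here.

```python
from sqlalchemy import text

# Resume each scan just past the last SBId already examined, so a batch
# costs O(batch_size) index seeks instead of re-scanning earlier rows.
NEXT_UNASSIGNED_BATCH = text("""
    SELECT SBId
    FROM sb_SandBoxes
    WHERE Assigned = 0
      AND LastAccessTime < :threshold
      AND SBId > :cursor
    ORDER BY SBId
    LIMIT :batch_size
""")
# After each batch: cursor = max(SBId returned), tracked per worker.
```
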
Replace multi-worker approach with a single sequential query loop
and larger batches (50k). S3 delete chunks are now sent concurrently
via TaskGroup. This eliminates lock contention between workers while
keeping S3 throughput high.
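
A sketch of the concurrent S3 fan-out, assuming a hypothetical per-chunk delete coroutine; asyncio.TaskGroup (Python 3.11+) is the mechanism named in the commit.

```python
import asyncio

S3_CHUNK_SIZE = 1_000  # the S3 DeleteObjects API accepts at most 1000 keys per call

async def delete_s3_keys(delete_chunk, keys: list[str]) -> None:
    # Send all chunks concurrently; the TaskGroup waits for every task and
    # surfaces the first exception if any chunk fails.
    async with asyncio.TaskGroup() as tg:
        for i in range(0, len(keys), S3_CHUNK_SIZE):
            tg.create_task(delete_chunk(keys[i : i + S3_CHUNK_SIZE]))
```
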
Split the SELECT FOR UPDATE into two steps:
1. SELECT without locking to find candidate SBIds (scans many rows)
2. SELECT FOR UPDATE SKIP LOCKED WHERE SBId IN (...) to lock only
   the exact matching rows by primary key

Previously FOR UPDATE locked every row examined during the scan
(82M rows for a 50k result), causing 143MB of lock memory and
12+ minute DELETE times. Now only the returned rows are locked.
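
Sketched as SQL (assumed table name as above); this intermediate two-step locking is itself removed by the next change in favour of the lock-free three-phase design.

```python
from sqlalchemy import text

# Step 1: unlocked scan that may examine many rows but takes no row locks.
CANDIDATES = text("""
    SELECT SBId FROM sb_SandBoxes
    WHERE Assigned = 0 AND LastAccessTime < :threshold
    LIMIT :batch_size
""")

# Step 2: lock only the matched primary keys; skips rows another worker
# already holds. The IN list needs an expanding bindparam when executed
# through SQLAlchemy.
LOCK_MATCHED = text("""
    SELECT SBId FROM sb_SandBoxes
    WHERE SBId IN :sb_ids
    FOR UPDATE SKIP LOCKED
""")
```
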
Split the single long transaction into three phases: SELECT candidates
(short transaction, no locks), S3 delete (no transaction), and DB DELETE
(short transaction). This eliminates the SELECT FOR UPDATE that locked
every row examined during the primary key scan.

A single DELETE of 50k rows caused MySQL to lock 30M rows during the
index scan. Split into 1000-row chunks running up to 10 concurrently
via asyncio.Semaphore, keeping each transaction short.
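
The per-chunk statement itself is trivial; the point is that each chunk is deleted by primary key in its own short transaction (the Semaphore/TaskGroup orchestration is sketched in the summary above). Table name assumed as before.

```python
from sqlalchemy import text

# At most 1000 ids per execution, so MySQL locks exactly those rows
# and each transaction stays short.
DELETE_CHUNK = text("""
    DELETE FROM sb_SandBoxes
    WHERE SBId IN :sb_ids
""")
```
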
Log duration of each phase (SELECT, S3 delete, DB delete) per batch
so bottlenecks are immediately visible. Demote per-chunk DB delete
log to DEBUG to reduce noise.
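
A sketch of the per-phase timing, with an illustrative logger; the message format is not taken from the actual code.

```python
import logging
import time

logger = logging.getLogger(__name__)

async def timed(phase: str, awaitable):
    # Await the phase and log its wall-clock duration so the slowest of
    # SELECT / S3 delete / DB delete stands out in each batch.
    start = time.monotonic()
    result = await awaitable
    logger.info("Sandbox cleanup phase %s took %.2fs", phase, time.monotonic() - start)
    return result

# usage: rows = await timed("select", select_candidates(batch_size))
```
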
The default max_pool_connections=10 serialized 50 concurrent delete
requests into 5 rounds, adding ~4s latency per round against Ceph.
Add s3_max_pool_connections setting (default 50) to SandboxStoreSettings,
configurable via DIRACX_SANDBOX_STORE_S3_MAX_POOL_CONNECTIONS.
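
max_pool_connections is a standard botocore Config option that caps the client's HTTP connection pool; how the new SandboxStoreSettings field is wired into the diracx S3 client is not shown here, so this is only a sketch.

```python
import os
from botocore.config import Config

# diracx reads this through SandboxStoreSettings; reading the environment
# variable directly here is just for illustration.
pool_size = int(os.environ.get("DIRACX_SANDBOX_STORE_S3_MAX_POOL_CONNECTIONS", "50"))

# With the default of 10, fifty concurrent DeleteObjects calls queue up in
# five rounds; raising the pool lets them actually run in parallel.
s3_config = Config(max_pool_connections=pool_size)
# s3_config is then passed when the (aio)botocore S3 client is created.
```
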
Avoids redundant SELECT queries when the previous batch already
exhausted all matching rows.

s3_bulk_delete_with_retry now returns failed keys instead of raising,
and _s3_delete_chunk_with_retry retries only the failed keys. The
caller skips DB deletion for sandboxes whose S3 objects were not
removed, preventing dark data.

Also removes the per-batch SELECT DISTINCT SEName query by passing
settings.se_name directly, and updates the stale docstring.
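
A sketch of the failure-aware chunk delete: DeleteObjects reports per-key failures in its "Errors" list (and, in Quiet mode, reports only failures), so returning those keys lets the caller keep the corresponding DB rows. The retry wrapping and client setup are simplified stand-ins for the functions named above.

```python
async def delete_chunk_return_failures(s3_client, bucket: str, keys: list[str]) -> set[str]:
    """Delete up to 1000 keys from S3; return the keys that failed to delete."""
    response = await s3_client.delete_objects(
        Bucket=bucket,
        Delete={"Objects": [{"Key": k} for k in keys], "Quiet": True},
    )
    failed = {err["Key"] for err in response.get("Errors", [])}
    # The caller removes DB rows only for keys not in `failed`, so any
    # sandbox whose object survived is retried on the next run rather
    # than becoming dark data.
    return failed
```
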
@chrisburr chrisburr force-pushed the fix/sandbox-cleaning-performance branch from b99d290 to 8a8e4b2 on February 20, 2026 at 15:48

This index is not needed for the current cleanup queries.

@chrisburr chrisburr changed the title fix: optimize sandbox cleaning query performance fix: optimize sandbox cleaning performance Feb 20, 2026
@chrisburr chrisburr marked this pull request as ready for review February 20, 2026 15:54