fix: optimize sandbox cleaning performance#796

Open
chrisburr wants to merge 13 commits into DIRACGrid:main from chrisburr:fix/sandbox-cleaning-performance

Conversation

@chrisburr
Member

@chrisburr chrisburr commented Feb 17, 2026

Summary

Optimise sandbox cleanup to sustain ~8k deletes/s, clearing ~30M sandboxes in ~1 hour on LHCb prod.

Key changes:

  • Remove SELECT FOR UPDATE SKIP LOCKED — use three short phases instead: SELECT (no locks), S3 delete, DB delete (sketched below this list)
  • Chunk DB deletes into 1000-row batches (up to 10 concurrent) to avoid locking millions of rows in a single DELETE
  • Increase S3 connection pool from default 10 to 50 (s3_max_pool_connections setting) — was serialising parallel S3 requests
  • Handle S3 partial delete failures gracefully — only remove DB rows for successfully deleted S3 objects, failed keys are retried next run
  • Add per-phase timing to logs for easy bottleneck identification
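
The overall shape of the new cleanup loop, as a minimal sketch: the three helper callables (candidate selection, S3 bulk delete, DB delete) are hypothetical stand-ins for the diracx internals, and the constants mirror the values quoted in this PR.

```python
# Sketch only: the helper callables are stand-ins, not the diracx API.
import asyncio
from collections.abc import Awaitable, Callable, Sequence

BATCH_SIZE = 50_000          # rows selected per iteration
DB_CHUNK = 1_000             # rows deleted per DB transaction
MAX_DB_CONCURRENCY = 10      # chunk deletes in flight at once

async def clean_batch(
    select_candidates: Callable[[int], Awaitable[list[tuple[int, str]]]],
    s3_bulk_delete: Callable[[Sequence[str]], Awaitable[set[str]]],
    delete_db_rows: Callable[[Sequence[int]], Awaitable[None]],
) -> int:
    # Phase 1: plain SELECT in a short transaction, no FOR UPDATE, no row locks.
    rows = await select_candidates(BATCH_SIZE)
    if not rows:
        return 0

    # Phase 2: S3 delete outside any DB transaction; keep only sandboxes
    # whose objects were actually removed (failed keys are retried next run).
    failed_keys = await s3_bulk_delete([key for _, key in rows])
    deletable = [sb_id for sb_id, key in rows if key not in failed_keys]

    # Phase 3: chunked DB deletes, each in its own short transaction,
    # with at most MAX_DB_CONCURRENCY chunks running concurrently.
    sem = asyncio.Semaphore(MAX_DB_CONCURRENCY)

    async def delete_chunk(chunk: Sequence[int]) -> None:
        async with sem:
            await delete_db_rows(chunk)

    async with asyncio.TaskGroup() as tg:  # Python 3.11+
        for i in range(0, len(deletable), DB_CHUNK):
            tg.create_task(delete_chunk(deletable[i : i + DB_CHUNK]))

    return len(deletable)
```
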

Before

The previous implementation did not scale to large sandbox stores. Query and locking costs grew quadratically with table size, putting massive load on the database.

After

~480,000 sandboxes/minute (~8k/s), ~30M deleted in 1 hour.

@read-the-docs-community

read-the-docs-community bot commented Feb 18, 2026

Documentation build overview

📚 diracx | 🛠️ Build #31490362 | 📁 Comparing f25667b against latest (b2ad8c6)


🔍 Preview build

Files changed (1 file in total): 📝 1 modified | ➕ 0 added | ➖ 0 deleted
admin/reference/env-variables/index.html: 📝 modified

@chrisburr chrisburr force-pushed the fix/sandbox-cleaning-performance branch 2 times, most recently from da97dce to b99d290 on February 20, 2026 at 15:33

Split the single OR-based deletion query into two targeted queries:

1. Assigned but orphaned sandboxes — fast with Location key + NOT EXISTS
   (~4.7s on prod vs 140s before)
2. Unassigned sandboxes older than 15 days — benefits from new
   idx_assigned_lastaccesstime composite index

Additional changes:
- Add composite index on (Assigned, LastAccessTime) for the unassigned
  query
- Fetch distinct SEName values upfront to enable the Location unique key
- Replace non-sargable days_since(LastAccessTime) with a pre-computed
  threshold (LastAccessTime < NOW() - 15 days)
- Increase default batch_size from 500 to 5000
- Chunk S3 bulk deletes into groups of 1000 (S3 API limit)
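
Illustrative SQL for the two targeted queries, as a sketch: the table and mapping-table names are assumptions (they are not spelled out in this PR text), while SBId, SEName, Assigned, LastAccessTime and the composite index come from the commit message.

```python
from sqlalchemy import text

# Query 1: assigned sandboxes whose owner mapping no longer exists.
# Anchoring on a concrete SEName lets the planner use the Location unique key;
# sb_SandBoxes / sb_EntityMapping are assumed table names.
ORPHANED_ASSIGNED = text("""
    SELECT sb.SBId
    FROM sb_SandBoxes sb
    WHERE sb.SEName = :se_name
      AND sb.Assigned = 1
      AND NOT EXISTS (SELECT 1 FROM sb_EntityMapping m WHERE m.SBId = sb.SBId)
    LIMIT :batch_size
""")

# Query 2: unassigned sandboxes idle for 15+ days. Comparing against a
# pre-computed :threshold (instead of days_since(LastAccessTime)) keeps the
# predicate sargable, so the (Assigned, LastAccessTime) index can be used.
OLD_UNASSIGNED = text("""
    SELECT SBId
    FROM sb_SandBoxes
    WHERE Assigned = 0
      AND LastAccessTime < :threshold
    LIMIT :batch_size
""")
```
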
Half the workers prefer orphaned sandboxes first, half prefer
unassigned first. This ensures both query types make progress
concurrently instead of one starving the other.

Track the last SBId seen per worker and add SBId > cursor to each
query. This avoids re-scanning millions of already-checked rows on
every batch, turning O(total_rows) scans into O(batch_size) seeks.
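
A sketch of the keyset (cursor) pattern this commit describes, with the same assumed table name as above; later commits replace the multi-worker setup, but the SBId > cursor seek is the point here.

```python
from sqlalchemy import text

# Resume each scan just past the last SBId already examined, so a batch
# costs O(batch_size) index seeks instead of re-scanning earlier rows.
NEXT_UNASSIGNED_BATCH = text("""
    SELECT SBId
    FROM sb_SandBoxes
    WHERE Assigned = 0
      AND LastAccessTime < :threshold
      AND SBId > :cursor
    ORDER BY SBId
    LIMIT :batch_size
""")
# After each batch: cursor = max(SBId returned), tracked per worker.
```
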
Replace multi-worker approach with a single sequential query loop
and larger batches (50k). S3 delete chunks are now sent concurrently
via TaskGroup. This eliminates lock contention between workers while
keeping S3 throughput high.
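
A sketch of the concurrent S3 fan-out, assuming a hypothetical per-chunk delete coroutine; asyncio.TaskGroup (Python 3.11+) is the mechanism named in the commit.

```python
import asyncio

S3_CHUNK_SIZE = 1_000  # the S3 DeleteObjects API accepts at most 1000 keys per call

async def delete_s3_keys(delete_chunk, keys: list[str]) -> None:
    # Send all chunks concurrently; the TaskGroup waits for every task and
    # surfaces the first exception if any chunk fails.
    async with asyncio.TaskGroup() as tg:
        for i in range(0, len(keys), S3_CHUNK_SIZE):
            tg.create_task(delete_chunk(keys[i : i + S3_CHUNK_SIZE]))
```
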
Split the SELECT FOR UPDATE into two steps:
1. SELECT without locking to find candidate SBIds (scans many rows)
2. SELECT FOR UPDATE SKIP LOCKED WHERE SBId IN (...) to lock only
   the exact matching rows by primary key

Previously FOR UPDATE locked every row examined during the scan
(82M rows for a 50k result), causing 143MB of lock memory and
12+ minute DELETE times. Now only the returned rows are locked.
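
Sketched as SQL (assumed table name as above); this intermediate two-step locking is itself removed by the next change in favour of the lock-free three-phase design.

```python
from sqlalchemy import text

# Step 1: unlocked scan that may examine many rows but takes no row locks.
CANDIDATES = text("""
    SELECT SBId FROM sb_SandBoxes
    WHERE Assigned = 0 AND LastAccessTime < :threshold
    LIMIT :batch_size
""")

# Step 2: lock only the matched primary keys; skips rows another worker
# already holds. The IN list needs an expanding bindparam when executed
# through SQLAlchemy.
LOCK_MATCHED = text("""
    SELECT SBId FROM sb_SandBoxes
    WHERE SBId IN :sb_ids
    FOR UPDATE SKIP LOCKED
""")
```
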
Split the single long transaction into three phases: SELECT candidates
(short transaction, no locks), S3 delete (no transaction), and DB DELETE
(short transaction). This eliminates the SELECT FOR UPDATE that locked
every row examined during the primary key scan.

A single DELETE of 50k rows caused MySQL to lock 30M rows during the
index scan. Split into 1000-row chunks running up to 10 concurrently
via asyncio.Semaphore, keeping each transaction short.
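
The per-chunk statement itself is trivial; the point is that each chunk is deleted by primary key in its own short transaction (the Semaphore/TaskGroup orchestration is sketched in the summary above). Table name assumed as before.

```python
from sqlalchemy import text

# At most 1000 ids per execution, so MySQL locks exactly those rows
# and each transaction stays short.
DELETE_CHUNK = text("""
    DELETE FROM sb_SandBoxes
    WHERE SBId IN :sb_ids
""")
```
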
Log duration of each phase (SELECT, S3 delete, DB delete) per batch
so bottlenecks are immediately visible. Demote per-chunk DB delete
log to DEBUG to reduce noise.
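
A sketch of the per-phase timing, with an illustrative logger; the message format is not taken from the actual code.

```python
import logging
import time

logger = logging.getLogger(__name__)

async def timed(phase: str, awaitable):
    # Await the phase and log its wall-clock duration so the slowest of
    # SELECT / S3 delete / DB delete stands out in each batch.
    start = time.monotonic()
    result = await awaitable
    logger.info("Sandbox cleanup phase %s took %.2fs", phase, time.monotonic() - start)
    return result

# usage: rows = await timed("select", select_candidates(batch_size))
```
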
The default max_pool_connections=10 serialized 50 concurrent delete
requests into 5 rounds, adding ~4s latency per round against Ceph.
Add s3_max_pool_connections setting (default 50) to SandboxStoreSettings,
configurable via DIRACX_SANDBOX_STORE_S3_MAX_POOL_CONNECTIONS.
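
max_pool_connections is a standard botocore Config option that caps the client's HTTP connection pool; how the new SandboxStoreSettings field is wired into the diracx S3 client is not shown here, so this is only a sketch.

```python
import os
from botocore.config import Config

# diracx reads this through SandboxStoreSettings; reading the environment
# variable directly here is just for illustration.
pool_size = int(os.environ.get("DIRACX_SANDBOX_STORE_S3_MAX_POOL_CONNECTIONS", "50"))

# With the default of 10, fifty concurrent DeleteObjects calls queue up in
# five rounds; raising the pool lets them actually run in parallel.
s3_config = Config(max_pool_connections=pool_size)
# s3_config is then passed when the (aio)botocore S3 client is created.
```
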
Avoids redundant SELECT queries when the previous batch already
exhausted all matching rows.

s3_bulk_delete_with_retry now returns failed keys instead of raising,
and _s3_delete_chunk_with_retry retries only the failed keys. The
caller skips DB deletion for sandboxes whose S3 objects were not
removed, preventing dark data.

Also removes the per-batch SELECT DISTINCT SEName query by passing
settings.se_name directly, and updates the stale docstring.
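
A sketch of the failure-aware chunk delete: DeleteObjects reports per-key failures in its "Errors" list (and, in Quiet mode, reports only failures), so returning those keys lets the caller keep the corresponding DB rows. The retry wrapping and client setup are simplified stand-ins for the functions named above.

```python
async def delete_chunk_return_failures(s3_client, bucket: str, keys: list[str]) -> set[str]:
    """Delete up to 1000 keys from S3; return the keys that failed to delete."""
    response = await s3_client.delete_objects(
        Bucket=bucket,
        Delete={"Objects": [{"Key": k} for k in keys], "Quiet": True},
    )
    failed = {err["Key"] for err in response.get("Errors", [])}
    # The caller removes DB rows only for keys not in `failed`, so any
    # sandbox whose object survived is retried on the next run rather
    # than becoming dark data.
    return failed
```
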
@chrisburr chrisburr force-pushed the fix/sandbox-cleaning-performance branch from b99d290 to 8a8e4b2 on February 20, 2026 at 15:48

This index is not needed for the current cleanup queries.

@chrisburr chrisburr changed the title fix: optimize sandbox cleaning query performance fix: optimize sandbox cleaning performance Feb 20, 2026
@chrisburr chrisburr marked this pull request as ready for review February 20, 2026 15:54