Skip to content

⚡ speed up migrate_endpoints_to_locations (~14× fewer queries)#14841

Merged
Maffooch merged 2 commits into
bugfixfrom
perf/migrate-endpoints-to-locations
May 11, 2026
Merged

⚡ speed up migrate_endpoints_to_locations (~14× fewer queries)#14841
Maffooch merged 2 commits into
bugfixfrom
perf/migrate-endpoints-to-locations

Conversation

@Maffooch
Copy link
Copy Markdown
Contributor

@Maffooch Maffooch commented May 8, 2026

This is a perf bugfix to an existing one-shot data migration; no new features or surface area, no new dependencies, single file changed. The migration is currently producing latently-incorrect data in addition to being unusably slow on real instances (16+ hours for ~650 endpoints / 22k findings — see Description), so I went ahead.

Description

dojo/management/commands/migrate_endpoints_to_locations.py is a one-shot migration that maps each EndpointURL/Location, each Endpoint_StatusLocationFindingReference, and each (endpoint→product, finding→product) pair → LocationProductReference. On a real instance with 650 endpoints and 22k findings, it had produced 80 locations in 2 hours (~90 s/endpoint) and was projected at ~16 hours total. Per-endpoint queries on a 50-endpoint / 1,000-finding local benchmark came in at 613.

This PR keeps changes inside the management command itself (no edits to shared Location model methods) and reduces per-endpoint queries to 44 and per-endpoint wall-clock to 17 ms on the same benchmark — a 13.9× query reduction / 14.1× wall-clock improvement. Idempotent re-runs validated; strict accuracy checks pass.

The work was applied as five logically-distinct changes; each row of the table below was measured independently against a freshly-reset migration target so the gain attribution is honest:

Step Wall-clock ms/ep Queries/ep Notes
Baseline (instrumented, no logic change) 12.00 s 240 613 14 product-ref status warnings (pre-existing bug)
+ select_related/prefetch_related 10.63 s 213 528 Eliminates lazy joins on tags, endpoint_meta, status_endpoint, and finding → test → engagement → product/mitigated_by
+ tags.add(*names) splat 10.08 s 202 507 tags phase −61%
+ DojoMeta.bulk_create(ignore_conflicts=True) 10.24 s 205 498 meta phase −67% — DojoMeta.unique_together = (location, name) makes ignore_conflicts equivalent to the previous get_or_create
+ bulk_create LFR/LPR + in-memory product status 0.85 s 17 44 Bypasses BaseModel.save → full_clean() validate_unique queries AND the inherit_tags_on_linked_instance post_save signal that fires all_related_products (a finding/test/engagement/product join) on every save

Where the savings actually came from, surfaced by the --query-count flag on a single endpoint with 20 statuses:

  • ~3 validate_unique queries × 2 saves per status = ~120 redundant queries / endpoint just from full_clean
  • 1 all_related_products + 1 system_settings per save × ~40 saves per endpoint = ~80 redundant queries from the inherit-tags post_save signal
  • 3 lazy joins (finding.test.engagement.product) × 20 statuses = 60 queries
  • The .exists() → .get() → update_or_create triple in associate_with_finding/associate_with_product — 2 redundant queries per status

bulk_create(ignore_conflicts=True) for the reference rows skips all of those at once.

One Django gotcha worth flagging in review

LocationFindingReference.created is auto_now_add=True. The original migration set created to the source Endpoint_Status.date via a post-save UPDATE, which works because auto_now_add only fires on the initial INSERT (add=True). With bulk_create we don't get a post-save UPDATE — and Django's SQLInsertCompiler.pre_save_val does call Field.pre_save(add=True) even from bulk_create, so auto_now_add would overwrite our explicit value with now(). The PR uses a small _suspend_auto_now_add context manager that toggles auto_now_add=False on the field during the bulk write, then restores it. Single-process management command — no thread-safety concerns.

Behavioral diffs you should know about

  1. LocationProductReference.status is now correct (semantic improvement). The previous code called Location.associate_with_product(...), which short-circuits if the (location, product) pair already exists — never updating the status when a later Active finding comes in for the same product. With the seed's 70/30 Active/non-Active distribution, ~28% of product refs landed on Mitigated when the logical truth was Active. The new code derives product status in-memory across all of an endpoint's finding statuses (Active iff any of them is Active, else Mitigated) and bulk_creates the row. The strict mode of the local accuracy verifier explicitly catches this — baseline fails strict, this PR passes strict.

  2. Inherited product tags via post_save signals do NOT fire. bulk_create skips signals, so inherit_tags_on_linked_instance (dojo/tags_signals.py:65-68) does not run during the migration. For deployments where any product has enable_product_tag_inheritance=True (or the system-wide setting is on), inherited product tags will not be propagated onto migrated Locations by this command. Two reasonable mitigations, both deferred to a follow-up:

    • Add a one-time post-bulk pass that walks the migrated Locations and calls inherit_tags() (cheap — one query per location, paid once instead of per-LFR).
    • Use the existing _bulk_inherit_tags helper at dojo/importers/location_manager.py:486, which is designed for exactly this scenario.

    I'd be happy to land either as a follow-up PR if you confirm your prod has tag inheritance enabled. The benchmark seed does not exercise this path.

Things intentionally NOT changed

  • Location.associate_with_finding / associate_with_product are untouched. They still have the .exists() → .get() → update_or_create pattern and the first-write-wins behavior on product status. The migration just stops calling them on its hot path. Other callers (e.g., importers) are unaffected.
  • No DB migration. Pure code change, no schema impact.
  • No --start-from <id> / --skip-validation flags. Considered during planning, dropped to keep blast radius small. Easy follow-ups if the prod run reveals a need.
  • No verbose-per-endpoint logging mode. The default Migrated X/Y (z%) — N ep/sec — ETA … line is enough for ops; --benchmark covers diagnostic depth.

Test results

Verified locally on the docker dev stack with DD_V3_FEATURE_LOCATIONS=True against a representative seed (5 products × 10 endpoints × 20 findings, with the four non-Active status flag distributions exercised: 70% Active / 15% Mitigated / 5% RA / 5% FP / 5% OOS):

  • Strict accuracy verifier passes after the migration: counts (URLs, Locations, LFR, LPR, location-DojoMeta), per-LFR field equality (status, created, audit_time, auditor), endpoint→location tag subset, and DojoMeta (location, name) parity all agree with source data.
  • Idempotent: running the migration a second time produces no new rows and no errors; verifier still passes.
  • python manage.py check clean. ruff check clean.
  • The verifier and seed scripts used during development are local-only scratch tools; they are not in this PR (intentionally — single-file diff).

I have not exercised this on a 22k-finding instance directly. The 50/1,000 numbers above are linear-scaling from a per-endpoint perspective; if anyone has a prod-shape instance handy, validation there would be welcome.

Documentation

No user-facing docs to update — this is a one-shot internal data migration command. The new --benchmark / --query-count / --batch-size / --progress-every flags are documented via argparse help= strings.

Reduces per-endpoint cost in the Endpoint→Location data migration from
~613 queries / 240 ms to ~44 queries / 17 ms on a 50-endpoint /
1,000-finding local benchmark — a 13.9× query reduction and a 14.1×
wall-clock improvement that should bring a 16-hour prod run under one
hour. As a side-effect, fixes the latent associate_with_product
short-circuit bug where 'Mitigated' could stick on a LocationProductReference
even after later Active findings came in for the same product.

Changes (kept inside this management command — no edits to shared
Location model methods):

- select_related/prefetch_related on the main Endpoint queryset so the
  per-endpoint loop has no hidden joins through tags, endpoint_meta,
  status_endpoint, or finding→test→engagement→product/mitigated_by.

- tags.add(*names) splat instead of N round-trips per tag.

- DojoMeta.bulk_create(ignore_conflicts=True) per endpoint instead of
  get_or_create per row (DojoMeta.unique_together = (location, name)
  makes ignore_conflicts semantically equivalent here).

- LocationFindingReference and LocationProductReference are bulk_created
  per endpoint instead of going through Location.associate_with_finding /
  associate_with_product. This bypasses BaseModel.save's full_clean()
  validate_unique queries AND the inherit_tags_on_linked_instance post_save
  signal (which fires all_related_products through the
  finding→test→engagement→product chain on every save). Product status is
  derived in-memory across all of an endpoint's finding statuses.

- _suspend_auto_now_add wraps the LocationFindingReference bulk write so
  the explicit 'created' value (= source Endpoint_Status.date) is honored.
  Django's SQLInsertCompiler.pre_save_val calls Field.pre_save(add=True)
  even from bulk_create; auto_now_add would otherwise overwrite our value
  with now().

- New CLI flags for ops visibility on long runs:
  --batch-size, --progress-every, --benchmark, --query-count.
  Default progress line: 'Migrated X/Y (z%) — N ep/sec — ETA …'.

Per-step measurements (50 ep / 1,000 findings, V3_FEATURE_LOCATIONS=True,
local docker postgres):

  step                    wall    queries/ep   verifier
  baseline (instrumented) 12.00s  613          14 LPR-status warnings (pre-existing bug)
  + prefetch_related      10.63s  528          same
  + tags splat            10.08s  507          same
  + DojoMeta bulk_create  10.24s  498          same
  + bulk LFR/LPR + fix     0.85s   44          all strict checks pass

Idempotent re-runs validated. Verifier checks counts (URLs, Locations,
LFR, LPR, location-DojoMeta), per-row LFR fields (status, created,
audit_time, auditor), endpoint→location tag subset, and DojoMeta
(location, name) parity.

Intentional behavioral diffs vs. the previous code:

1. LocationProductReference.status now reflects 'Active iff any finding
   for this (location, product) is Active' — fixes the
   associate_with_product first-write-wins bug. Previously order-
   dependent; ~28% of product refs were mis-statused on the seeded
   distribution.

2. Tag inheritance via the inherit_tags_on_linked_instance post_save
   signal does NOT fire (bulk_create skips signals). For deployments
   with enable_product_tag_inheritance=True on products (or the system
   setting on), inherited product tags will not be propagated onto
   migrated Locations during this command. The seed used in
   benchmarking does not exercise this path. If your environment uses
   product tag inheritance, follow up with a one-time
   Location.inherit_tags pass after this command — or call out and we
   can bake _bulk_inherit_tags into the migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@valentijnscholten valentijnscholten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. But wouldn't it make sense to also add the step to inherit the tags as it's cheap / easy?

Copy link
Copy Markdown
Contributor

@mtesauro mtesauro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved

…r isolation

`bulk_create` (introduced in the prior perf commit) skips the
`inherit_tags_on_linked_instance` post_save signal, so deployments
with `enable_product_tag_inheritance` enabled (per-product or
system-wide) would not pick up inherited product tags on migrated
Locations. Track (product, location) pairs during the main loop —
covering both `endpoint.product` and `finding.test.engagement.product`
— and run a post-pass that calls
`LocationManager(product)._bulk_inherit_tags(locations)` once per
contributing product. The helper rediscovers each location's full
product set via LocationProductReference/LocationFindingReference
and diff-checks before writing, so revisits of shared locations
across product groups are idempotent. ~5 queries per product group
vs ~3 per location for a per-location `inherit_tags()` loop.

Also wrap the per-endpoint body in a `try`/`except Exception` so a
single bad row doesn't abort a multi-hour migration. Failures get
logged with full traceback and tracked in `self.failed_endpoints`;
the final "Done." line reports `<successful>/<total>` and a yellow
warning lists the first 10 failing IDs. `KeyboardInterrupt` /
`SystemExit` are not swallowed. The post-pass uses the same pattern
per product group.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Maffooch Maffooch added this to the 2.58.3 milestone May 11, 2026
@Maffooch Maffooch merged commit 966e34e into bugfix May 11, 2026
160 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants