fix: make async row groups lazy#729
Conversation
Review: PR #729 — fix: make async row groups lazySummaryReplaces eager
Threads the new types through The benchmark numbers in the PR description (253 MiB → 0.018 MiB peak for 1M FindingsCorrectness — looks solid, with one ambiguous protocol contract
Backward compatibility
Validation hardening
Style / project conventions
Test coverage
Performance
Nits (non-blocking)
VerdictApprove with minor suggestions. The refactor is well-scoped, preserves the |
Greptile SummaryThis PR replaces the eager O(N) pre-allocation of row-group metadata (lists, dicts of size/offset/tracker entries) with a compact, lazy
|
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/row_group_plan.py | New file: correct Protocol definition, clean branching logic for dense/sparse sets, arithmetic offsets are validated by tests including edge cases around the >½ frontier heuristic. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | Removes pre-built size/offset dicts and row_group_start_offsets parameter; delegates all lookups to RowGroupPlanLike. No behavioral change on production code paths. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Replaces eager list/dict construction in build_row_group_resume_plan with CompactRowGroupPlan.resume; adds early DatasetGenerationError for corrupt original_target_num_records > target_num_records metadata. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/completion.py | Replaces _row_group_sizes dict with _row_group_plan; adds _row_group_size_or_default helper for the is_column_complete path that previously used .get() with a 0 default. Semantics preserved. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py | Adds tracemalloc-based regression for large fresh-run preparation staying under 5 MiB peak. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py | Adds memory-bounded scheduler construction test for 1M row groups; migrates offset test from raw list + explicit offsets to CompactRowGroupPlan.resume. |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Adds corrupt-metadata rejection test and sparse-resume memory test; updates existing assertions to use list(plan) rather than comparing against materialized lists. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[DatasetBuilder._prepare_async_run] --> B{precomputed_row_groups?}
B -- Fresh run --> C[CompactRowGroupPlan.fresh\nnum_records / buffer_size\narithmetic, zero allocation]
B -- Resume --> D[build_row_group_resume_plan\nCompactRowGroupPlan.resume]
D --> E{valid_completed_count\n> total // 2?}
E -- Dense: store scheduled IDs --> F[_id_filter = frozenset of\nremaining IDs\n_filter_includes_scheduled=True]
E -- Sparse: store completed IDs --> G[_id_filter = frozenset of\ncompleted IDs\n_filter_includes_scheduled=False]
C --> H[normalize_row_group_plan]
F --> H
G --> H
H --> I[RowGroupPlanLike]
I --> J[CompletionTracker.with_graph]
I --> K[AsyncTaskScheduler.__init__\n_scheduled_records = plan.scheduled_total_rows\nno size/offset dicts]
K --> L[_get_rg_size → plan.row_group_size]
K --> M[_get_rg_start_offset → plan.row_group_start_offset\narithmetic: rg_id × buffer_size]
Reviews (2): Last reviewed commit: "fix: make async row groups lazy" | Re-trigger Greptile
Avoid preallocating per-row-group list and dictionary metadata for huge async runs. The async builder now passes a compact row-group plan through the completion tracker and scheduler while preserving resume offsets and explicit small-list compatibility. Fixes NVIDIA-NeMo#726 Signed-off-by: Eric W. Tramel <1223539+eric-tramel@users.noreply.github.com>
90a2323 to
ff99564
Compare
📋 Summary
Fixes #726 by replacing eager async row-group metadata construction with a compact row-group plan. Scheduler/tracker preparation now stays proportional to the active/sparse row groups instead of materializing list and dictionary metadata for every logical row group, while preserving resume offsets for ordered seed datasets.
🔗 Related Issue
Fixes #726
🔄 Changes
original_target_num_recordsexceeds the requested target.🧪 Testing
make lint-enginemake test-engine— 2220 passed in 36.28sPerformance demonstration, measured locally with
tracemallocfor 2,000,000 records,buffer_size=2, and 1,000,000 logical row groups:[(999998, 2), (999999, 2)]; avoids retaining the near-full completed-ID set in the plan.✅ Checklist