fix: processor artifacts type, discovery, and loading by andreatgretel · Pull Request #366 · NVIDIA-NeMo/DataDesigner

andreatgretel · 2026-03-03T16:27:48Z

Summary

Fix PreviewResults.processor_artifacts type annotation from dict[str, list[str] | str] to dict[str, list[dict]], matching the actual output of df.to_dict(orient="records")
Add list_processor_names() and load_processor_dataset() as standalone functions in data_designer.config.utils.io_helpers - they take a Path and handle both directory-based (batched) and single-file (preview) storage layouts
Fix preview bug: discovery now uses disk contents instead of iterating get_processor_configs(), which would crash for processors that don't write artifacts (e.g. DropColumnsProcessor)
Simplify DatasetCreationResults.load_processor_dataset() to delegate to ArtifactStorage
Align get_processor_file_paths() with list_processor_names() so both methods discover directories and single parquet files consistently
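
To make the two storage layouts concrete, here is a minimal sketch of disk-based discovery. Only the signature (a Path in, a list[str] out) comes from this PR. The function name list_processor_names_sketch, the ".parquet" suffix check, the sorted ordering, and the empty-list result for a missing directory are assumptions for illustration, not the merged implementation.

```python
from pathlib import Path


def list_processor_names_sketch(processors_outputs_path: Path) -> list[str]:
    """Discover processor names from what is actually on disk.

    Assumed layouts (per the PR description):
      - batched runs:  <root>/<processor_name>/   (a directory)
      - preview runs:  <root>/<processor_name>.parquet   (a single file)
    """
    # Assumption: a missing root simply means "no processors wrote anything".
    if not processors_outputs_path.is_dir():
        return []

    # A dict keeps insertion order and drops duplicates, so a processor that
    # has both a directory and a single file is reported once.
    names: dict[str, None] = {}
    for entry in sorted(processors_outputs_path.iterdir()):
        if entry.is_dir():
            names[entry.name] = None
        elif entry.is_file() and entry.suffix == ".parquet":
            names[entry.stem] = None
    return list(names)
```

Because the names come from the filesystem, a processor that never writes artifacts is simply absent from the result rather than being looked up and failing.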

Changes

Fixed

PreviewResults.processor_artifacts type annotation (dict[str, list[dict]])
Preview processor discovery no longer crashes on processors without artifacts
get_processor_file_paths() now discovers single-file processors, not just directories
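
The corrected annotation can be illustrated with a stdlib-only sketch. pandas' df.to_dict(orient="records") returns one dict per row; the helper to_records below is a hypothetical stand-in that reproduces that shape without importing pandas, and the ProcessorArtifacts alias and the processor name are assumptions made up for the example.

```python
from typing import Any

# Assumed alias for illustration; the PR only states the annotation itself.
ProcessorArtifacts = dict[str, list[dict[str, Any]]]


def to_records(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Stand-in for the row-per-dict shape of df.to_dict(orient="records")."""
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Each processor name maps to a list of row dicts, not to a list of strings.
artifacts: ProcessorArtifacts = {
    "my_processor": to_records({"id": [1, 2], "text": ["a", "b"]}),
}
```

Under the old annotation (dict[str, list[str] | str]) a type checker would reject exactly this value, even though it is what the preview produces.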

Changed

io_helpers.py - New list_processor_names() and load_processor_dataset() standalone functions
artifact_storage.py - Wrapper methods that convert FileNotFoundError to ArtifactStorageError
data_designer.py - Preview uses list_processor_names() instead of get_processor_configs()
results.py - Delegates to ArtifactStorage.load_processor_dataset()
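
The split between the standalone helpers and the ArtifactStorage wrappers might look roughly like the sketch below. Everything here is hypothetical apart from the two error types named in the PR: resolve_processor_output stands in for the real loader and returns a Path instead of a DataFrame so the example needs no parquet reader, ArtifactStorageError is redefined locally, and the directory-before-file precedence is an assumption.

```python
from pathlib import Path


class ArtifactStorageError(Exception):
    """Local stand-in for the engine's error type."""


def resolve_processor_output(processors_outputs_path: Path, processor_name: str) -> Path:
    """Hypothetical standalone helper: locate a processor's output on disk."""
    directory = processors_outputs_path / processor_name
    single_file = processors_outputs_path / f"{processor_name}.parquet"
    if directory.is_dir():  # batched layout (assumed to take precedence)
        return directory
    if single_file.is_file():  # preview layout
        return single_file
    raise FileNotFoundError(f"No artifacts found for processor {processor_name!r}")


class ArtifactStorageSketch:
    """Hypothetical wrapper that translates the low-level error."""

    def __init__(self, processors_outputs_path: Path) -> None:
        self.processors_outputs_path = processors_outputs_path

    def load_processor_dataset(self, processor_name: str) -> Path:
        try:
            return resolve_processor_output(self.processors_outputs_path, processor_name)
        except FileNotFoundError as exc:
            # Keep the original exception as __cause__ for debugging.
            raise ArtifactStorageError(str(exc)) from exc
```

Keeping the plain FileNotFoundError in the lightweight helper means callers that depend only on the config package never need to import an engine-specific exception.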

Test plan

Unit tests for list_processor_names (empty, directory, single file, mixed, dedup)
Parametrized test for load_processor_dataset (directory and single file layouts)
Error type tests (standalone raises FileNotFoundError, ArtifactStorage wraps as ArtifactStorageError)
Test for get_processor_file_paths with single-file processors
All engine tests pass
Pre-commit hooks pass

Description updated with AI

- Fix PreviewResults.processor_artifacts type from dict[str, list[str] | str] to dict[str, list[dict]], matching the actual data produced by df.to_dict(orient="records")
- Add ArtifactStorage.load_processor_dataset() to centralize loading logic, handling both directory-based (batched) and single-file (preview) layouts
- Add ArtifactStorage.list_processor_names() to discover processor outputs from disk rather than iterating config, fixing a bug where processors that don't write artifacts (e.g. DropColumnsProcessor) would crash the library's preview
- Simplify DatasetCreationResults.load_processor_dataset() to delegate to ArtifactStorage

greptile-apps · 2026-03-03T16:32:44Z

Greptile Summary

This PR fixes three critical bugs around processor artifact discovery, type annotation, and loading:

Type annotation fix in PreviewResults.processor_artifacts: corrected from dict[str, list[str] | str] to dict[str, list[dict]], accurately reflecting the output of df.to_dict(orient="records").
Preview bug fix in data_designer.py: processor discovery now uses disk-based list_processor_names() instead of iterating get_processor_configs(), which would crash for processors that don't write artifacts (e.g., DropColumnsProcessor).
Dual-layout support: new standalone helpers list_processor_names() and load_processor_dataset() in io_helpers.py handle both directory-based (batched) and single-file (preview) storage layouts, with dict-based deduplication to prevent duplicate names when both layouts coexist.
Simplification: DatasetCreationResults.load_processor_dataset() now delegates to ArtifactStorage, gaining single-file layout support and removing duplicated logic.
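
The preview bug described above can be reproduced in miniature. This is a sketch under assumptions: the processor names, the preview_artifacts helper, and the use of FileNotFoundError are invented for the example, and the PR does not show the exact exception the original code raised.

```python
import tempfile
from pathlib import Path


def preview_artifacts(root: Path, names: list[str]) -> dict[str, Path]:
    """Map each name to its single-file preview output; raise if one is missing."""
    found: dict[str, Path] = {}
    for name in names:
        path = root / f"{name}.parquet"
        if not path.is_file():
            raise FileNotFoundError(path)
        found[name] = path
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tokenize.parquet").touch()  # only this processor wrote output
    configured = ["drop_columns", "tokenize"]  # hypothetical config-derived names

    # Config-driven iteration: fails on the processor that wrote nothing.
    try:
        preview_artifacts(root, configured)
        config_driven_ok = True
    except FileNotFoundError:
        config_driven_ok = False

    # Disk-driven iteration: only names that exist on disk are ever requested.
    discovered = sorted(p.stem for p in root.glob("*.parquet"))
    disk_driven = sorted(preview_artifacts(root, discovered))
```

Here config_driven_ok ends up False while disk_driven contains only "tokenize", which mirrors why the PR switches preview discovery to list_processor_names().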

The changes are well-tested (1269 engine tests pass, new parametrized and unit tests added), cleanly integrated into the existing ArtifactStorage/io_helpers layering, and fix real bugs that prevented preview generation for certain processor configurations.

Confidence Score: 5/5

Safe to merge — fixes real bugs with comprehensive test coverage and clean implementation.
All three stated bugs are fixed correctly with good test coverage. The type annotation matches the actual data produced, the preview discovery uses disk-based detection (fixing the crash), and the new helpers properly support both storage layouts with deduplication. The code is clean and well-integrated into existing patterns. No functional issues identified.
No files require special attention

Last reviewed commit: f74a8e5

greptile-apps

6 files reviewed, 3 comments

packages/data-designer-engine/src/data_designer/engine/storage/artifact_storage.py

…ading Extract list_processor_names() and load_processor_dataset() as module-level functions that accept a Path, so consumers can use them without constructing an ArtifactStorage. The ArtifactStorage methods now delegate to these.

Move list_processor_names() and load_processor_dataset() to data_designer.config.utils.io_helpers so they're usable by any consumer that depends on the lightweight config package, without needing the engine's ArtifactStorage. The standalone functions raise FileNotFoundError; the ArtifactStorage methods wrap this as ArtifactStorageError.

packages/data-designer-config/src/data_designer/config/utils/io_helpers.py

packages/data-designer/src/data_designer/interface/data_designer.py

Reuse list_processor_names so both methods discover directories and single parquet files consistently.

andreatgretel · 2026-03-03T17:49:37Z

packages/data-designer-config/src/data_designer/config/utils/io_helpers.py

        )


+def list_processor_names(processors_outputs_path: Path) -> list[str]:


These processor helpers are standalone (not on ArtifactStorage) because the service layer uses them directly without an ArtifactStorage instance.

mikeknep

One question, otherwise LGTM, thanks!

packages/data-designer-config/src/data_designer/config/preview_results.py

andreatgretel requested a review from a team as a code owner March 3, 2026 16:27

greptile-apps bot reviewed Mar 3, 2026

packages/data-designer-engine/src/data_designer/engine/storage/artifact_storage.py (outdated, resolved)

andreatgretel added 3 commits March 3, 2026 13:35

fix: deduplicate list_processor_names when both dir and file exist

531c761

greptile-apps bot reviewed Mar 3, 2026

packages/data-designer-config/src/data_designer/config/utils/io_helpers.py (resolved)

chore: trim docstrings and deduplicate tests

0a740f3

greptile-apps bot reviewed Mar 3, 2026

packages/data-designer/src/data_designer/interface/data_designer.py (resolved)

fix: align get_processor_file_paths with list_processor_names

f74a8e5

Reuse list_processor_names so both methods discover directories and single parquet files consistently.

andreatgretel commented Mar 3, 2026

mikeknep approved these changes Mar 3, 2026

packages/data-designer-config/src/data_designer/config/preview_results.py (resolved)

andreatgretel wants to merge 6 commits into main from andreatgretel/fix/processor-artifacts-cleanup

2 participants
