Skip to content

fix: processor artifacts type, discovery, and loading#366

Open
andreatgretel wants to merge 6 commits intomainfrom
andreatgretel/fix/processor-artifacts-cleanup
Open

fix: processor artifacts type, discovery, and loading#366
andreatgretel wants to merge 6 commits intomainfrom
andreatgretel/fix/processor-artifacts-cleanup

Conversation

@andreatgretel
Copy link
Contributor

@andreatgretel andreatgretel commented Mar 3, 2026

Summary

  • Fix PreviewResults.processor_artifacts type annotation from dict[str, list[str] | str] to dict[str, list[dict]], matching the actual output of df.to_dict(orient="records")
  • Add list_processor_names() and load_processor_dataset() as standalone functions in data_designer.config.utils.io_helpers - they take a Path and handle both directory-based (batched) and single-file (preview) storage layouts
  • Fix preview bug: discovery now uses disk contents instead of iterating get_processor_configs(), which would crash for processors that don't write artifacts (e.g. DropColumnsProcessor)
  • Simplify DatasetCreationResults.load_processor_dataset() to delegate to ArtifactStorage
  • Align get_processor_file_paths() with list_processor_names() so both methods discover directories and single parquet files consistently

Changes

Fixed

  • PreviewResults.processor_artifacts type annotation (dict[str, list[dict]])
  • Preview processor discovery no longer crashes on processors without artifacts
  • get_processor_file_paths() now discovers single-file processors, not just directories

Changed

  • io_helpers.py - New list_processor_names() and load_processor_dataset() standalone functions
  • artifact_storage.py - Wrapper methods that convert FileNotFoundError to ArtifactStorageError
  • data_designer.py - Preview uses list_processor_names() instead of get_processor_configs()
  • results.py - Delegates to ArtifactStorage.load_processor_dataset()

Test plan

  • Unit tests for list_processor_names (empty, directory, single file, mixed, dedup)
  • Parametrized test for load_processor_dataset (directory and single file layouts)
  • Error type tests (standalone raises FileNotFoundError, ArtifactStorage wraps as ArtifactStorageError)
  • Test for get_processor_file_paths with single-file processors
  • All engine tests pass
  • Pre-commit hooks pass

Description updated with AI

- Fix PreviewResults.processor_artifacts type from dict[str, list[str] | str]
  to dict[str, list[dict]], matching the actual data produced by
  df.to_dict(orient="records")
- Add ArtifactStorage.load_processor_dataset() to centralize loading logic,
  handling both directory-based (batched) and single-file (preview) layouts
- Add ArtifactStorage.list_processor_names() to discover processor outputs
  from disk rather than iterating config, fixing a bug where processors
  that don't write artifacts (e.g. DropColumnsProcessor) would crash the
  library's preview
- Simplify DatasetCreationResults.load_processor_dataset() to delegate to
  ArtifactStorage
@andreatgretel andreatgretel requested a review from a team as a code owner March 3, 2026 16:27
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR fixes three critical bugs around processor artifact discovery, type annotation, and loading:

  1. Type annotation fix in PreviewResults.processor_artifacts: corrected from dict[str, list[str] | str] to dict[str, list[dict]], accurately reflecting the output of df.to_dict(orient="records").

  2. Preview bug fix in data_designer.py: processor discovery now uses disk-based list_processor_names() instead of iterating get_processor_configs(), which would crash for processors that don't write artifacts (e.g., DropColumnsProcessor).

  3. Dual-layout support: new standalone helpers list_processor_names() and load_processor_dataset() in io_helpers.py handle both directory-based (batched) and single-file (preview) storage layouts, with dict-based deduplication to prevent duplicate names when both layouts coexist.

  4. Simplification: DatasetCreationResults.load_processor_dataset() now delegates to ArtifactStorage, gaining single-file layout support and removing duplicated logic.

The changes are well-tested (1269 engine tests pass, new parametrized and unit tests added), cleanly integrated into the existing ArtifactStorage/io_helpers layering, and fix real bugs that prevented preview generation for certain processor configurations.

Confidence Score: 5/5

  • Safe to merge — fixes real bugs with comprehensive test coverage and clean implementation.
  • All three stated bugs are fixed correctly with good test coverage. The type annotation matches the actual data produced, the preview discovery uses disk-based detection (fixing the crash), and the new helpers properly support both storage layouts with deduplication. The code is clean and well-integrated into existing patterns. No functional issues identified.
  • No files require special attention

Last reviewed commit: f74a8e5

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

…ading

Extract list_processor_names() and load_processor_dataset() as module-level
functions that accept a Path, so consumers can use them without constructing
an ArtifactStorage. The ArtifactStorage methods now delegate to these.
Move list_processor_names() and load_processor_dataset() to
data_designer.config.utils.io_helpers so they're usable by any consumer
that depends on the lightweight config package, without needing the
engine's ArtifactStorage. The standalone functions raise FileNotFoundError;
the ArtifactStorage methods wrap this as ArtifactStorageError.
Reuse list_processor_names so both methods discover directories and
single parquet files consistently.
)


def list_processor_names(processors_outputs_path: Path) -> list[str]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These processor helpers are standalone (not on ArtifactStorage) because the service layer uses them directly without an ArtifactStorage instance.

Copy link
Contributor

@mikeknep mikeknep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, otherwise LGTM, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants