fix: processor artifacts type, discovery, and loading #366
andreatgretel wants to merge 6 commits into main
Conversation
- Fix `PreviewResults.processor_artifacts` type from `dict[str, list[str] | str]` to `dict[str, list[dict]]`, matching the actual data produced by `df.to_dict(orient="records")`
- Add `ArtifactStorage.load_processor_dataset()` to centralize loading logic, handling both directory-based (batched) and single-file (preview) layouts
- Add `ArtifactStorage.list_processor_names()` to discover processor outputs from disk rather than iterating config, fixing a bug where processors that don't write artifacts (e.g. `DropColumnsProcessor`) would crash the library's preview
- Simplify `DatasetCreationResults.load_processor_dataset()` to delegate to `ArtifactStorage`
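The type fix follows directly from pandas' `to_dict` behavior; a minimal sketch (the processor name and columns here are invented for illustration):

```python
import pandas as pd

# Hypothetical preview output for one processor (name and columns invented).
df = pd.DataFrame({"name": ["Ada", "Grace"], "score": [0.9, 0.8]})

# orient="records" returns one dict per row, so each artifact value is a
# list[dict] rather than list[str] | str.
records = df.to_dict(orient="records")
processor_artifacts: dict[str, list[dict]] = {"example_processor": records}
```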
Greptile Summary
This PR fixes three critical bugs around processor artifact discovery, type annotation, and loading:
The changes are well-tested (1269 engine tests pass, new parametrized and unit tests added), cleanly integrated into the existing
packages/data-designer-engine/src/data_designer/engine/storage/artifact_storage.py
…ading: Extract `list_processor_names()` and `load_processor_dataset()` as module-level functions that accept a `Path`, so consumers can use them without constructing an `ArtifactStorage`. The `ArtifactStorage` methods now delegate to these.
Move `list_processor_names()` and `load_processor_dataset()` to `data_designer.config.utils.io_helpers` so they're usable by any consumer that depends on the lightweight config package, without needing the engine's `ArtifactStorage`. The standalone functions raise `FileNotFoundError`; the `ArtifactStorage` methods wrap this as `ArtifactStorageError`.
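The error-wrapping split described above might look like this. This is a sketch under assumptions: names are taken from the PR, and the standalone helper returns discovered parquet paths rather than a loaded DataFrame so the example stays dependency-free.

```python
from pathlib import Path


class ArtifactStorageError(Exception):
    """Engine-level error type (name from the PR; body is a sketch)."""


def load_processor_dataset(processors_outputs_path: Path, processor_name: str) -> list[Path]:
    # Standalone helper (as in io_helpers): raises plain FileNotFoundError so
    # the lightweight config package carries no engine dependencies.
    target = processors_outputs_path / processor_name
    if target.is_dir():
        files = sorted(target.glob("*.parquet"))  # batched: directory of parts
        if files:
            return files
    single = processors_outputs_path / f"{processor_name}.parquet"
    if single.is_file():  # preview: single parquet file
        return [single]
    raise FileNotFoundError(f"No artifacts found for processor {processor_name!r}")


class ArtifactStorage:
    def __init__(self, processors_outputs_path: Path) -> None:
        self.processors_outputs_path = processors_outputs_path

    def load_processor_dataset(self, processor_name: str) -> list[Path]:
        # Engine method delegates, re-raising with the engine's error type.
        try:
            return load_processor_dataset(self.processors_outputs_path, processor_name)
        except FileNotFoundError as exc:
            raise ArtifactStorageError(str(exc)) from exc
```

The design choice here is that only the thin wrapper knows about the engine's exception hierarchy; everything below it is importable from the config package alone.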
packages/data-designer-config/src/data_designer/config/utils/io_helpers.py
Reuse `list_processor_names` so both methods discover directories and single parquet files consistently.
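One way the shared discovery could work; a sketch only, where the sorted order and the exact layout rules (one subdirectory per batched processor, one `<name>.parquet` file per preview processor) are assumptions:

```python
from pathlib import Path


def list_processor_names(processors_outputs_path: Path) -> list[str]:
    # Discover processor outputs from disk: a subdirectory per batched
    # processor, or a single <name>.parquet file for preview runs.
    # A set deduplicates processors that happen to have both layouts.
    if not processors_outputs_path.is_dir():
        return []
    names: set[str] = set()
    for entry in processors_outputs_path.iterdir():
        if entry.is_dir():
            names.add(entry.name)
        elif entry.suffix == ".parquet":
            names.add(entry.stem)
    return sorted(names)
```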
```python
def list_processor_names(processors_outputs_path: Path) -> list[str]:
```
These processor helpers are standalone (not on `ArtifactStorage`) because the service layer uses them directly without an `ArtifactStorage` instance.
mikeknep left a comment:
One question, otherwise LGTM, thanks!
Summary
- `PreviewResults.processor_artifacts` type annotation fixed from `dict[str, list[str] | str]` to `dict[str, list[dict]]`, matching the actual output of `df.to_dict(orient="records")`
- `list_processor_names()` and `load_processor_dataset()` added as standalone functions in `data_designer.config.utils.io_helpers` - they take a `Path` and handle both directory-based (batched) and single-file (preview) storage layouts
- Preview no longer iterates `get_processor_configs()`, which would crash for processors that don't write artifacts (e.g. `DropColumnsProcessor`)
- `DatasetCreationResults.load_processor_dataset()` simplified to delegate to `ArtifactStorage`
- `get_processor_file_paths()` now reuses `list_processor_names()` so both methods discover directories and single parquet files consistently

Changes
Fixed
- `PreviewResults.processor_artifacts` type annotation (`dict[str, list[dict]]`)
- `get_processor_file_paths()` now discovers single-file processors, not just directories

Changed
- `io_helpers.py` - New `list_processor_names()` and `load_processor_dataset()` standalone functions
- `artifact_storage.py` - Wrapper methods that convert `FileNotFoundError` to `ArtifactStorageError`
- `data_designer.py` - Preview uses `list_processor_names()` instead of `get_processor_configs()`
- `results.py` - Delegates to `ArtifactStorage.load_processor_dataset()`

Test plan
- `list_processor_names` (empty, directory, single file, mixed, dedup)
- `load_processor_dataset` (directory and single file layouts)
- Error handling (standalone functions raise `FileNotFoundError`, `ArtifactStorage` wraps as `ArtifactStorageError`)
- `get_processor_file_paths` with single-file processors
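The parametrized cases listed above could be sketched with pytest's `tmp_path` fixture. The helper body below is a self-contained stand-in for illustration, not the real `io_helpers` implementation:

```python
from pathlib import Path

import pytest


def list_processor_names(processors_outputs_path: Path) -> list[str]:
    # Stand-in for the io_helpers function so this sketch is self-contained:
    # subdirectories and *.parquet files both count, deduplicated by name.
    names = {
        entry.stem if entry.suffix == ".parquet" else entry.name
        for entry in processors_outputs_path.iterdir()
        if entry.is_dir() or entry.suffix == ".parquet"
    }
    return sorted(names)


@pytest.mark.parametrize(
    "layout, expected",
    [
        ([], []),                                         # empty
        (["alpha/"], ["alpha"]),                          # directory layout
        (["beta.parquet"], ["beta"]),                     # single-file layout
        (["alpha/", "beta.parquet"], ["alpha", "beta"]),  # mixed
        (["gamma/", "gamma.parquet"], ["gamma"]),         # dedup
    ],
)
def test_list_processor_names(tmp_path: Path, layout: list[str], expected: list[str]) -> None:
    # Build the on-disk layout: trailing slash means directory.
    for item in layout:
        if item.endswith("/"):
            (tmp_path / item.rstrip("/")).mkdir()
        else:
            (tmp_path / item).touch()
    assert list_processor_names(tmp_path) == expected
```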