feat(asr): Support Parquet/Arrow datasets with embedded audio bytes via Lhotse#15303
Conversation
Force-pushed ec24400 to cc0d47c
Force-pushed 2caccc3 to b92f5d1
pzelasko
left a comment
Looks good to me. Can you add unit test coverage for this?
We need a pytest fixture that creates temporary test data and a check that it is loadable end to end.
See test_lhotse_dataloading.py file for examples with other formats.
BTW, rather than extending that file, can you create test_lhotse_dataloading_parquet.py for these tests?
Thanks.
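A minimal sketch of such a fixture, assuming the test data is a Parquet file with embedded WAV bytes (all names and the schema here are illustrative, not the PR's actual test code). The helper builds a valid in-memory WAV file with the stdlib wave module; the commented-out fixture shows how pytest's tmp_path could write it into a throwaway Parquet file with pyarrow.

```python
import io
import struct
import wave

def make_wav_bytes(num_samples: int = 16000, rate: int = 16000) -> bytes:
    """Return a valid mono 16-bit PCM WAV file as in-memory bytes (all zeros)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(rate)
        w.writeframes(struct.pack("<%dh" % num_samples, *([0] * num_samples)))
    return buf.getvalue()

# Hypothetical pytest fixture (assumes pyarrow is installed; column names are
# assumptions, not the PR's actual schema):
# @pytest.fixture
# def parquet_dataset(tmp_path):
#     import pyarrow as pa, pyarrow.parquet as pq
#     table = pa.table({
#         "audio": [make_wav_bytes()],
#         "text": ["hello world"],
#         "duration": [1.0],
#     })
#     path = tmp_path / "data.parquet"
#     pq.write_table(table, str(path))
#     return str(path)
```

An end-to-end check would then build a dataloader config pointing at the fixture path and assert that iterating it yields cuts with decodable audio.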
    cuts = cutsets[0]
else:
    # Simple concatenation for now (or mux if weights provided, but let's keep it simple)
    cuts = CutSet.mux(*cutsets)
Hmmmm, if multiple files here are shards of one dataset, then it would be better to use LazyIteratorChain from Lhotse: see https://github.com/lhotse-speech/lhotse/blob/53b759fe9f7a7ac89ccd3b6e581d2a4723670cba/lhotse/lazy.py#L322
Make sure to handle the shuffle and shard_seed options similarly to the other registered data_type_parsers, so that the shards are shuffled as expected across training jobs and multiple DP ranks, which is crucial for achieving decent randomness.
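The seeding requirement can be illustrated with a small standalone sketch (function and argument names are hypothetical, not the Lhotse/NeMo API): every DP rank must derive the same shard permutation from a shared shard_seed and epoch, otherwise ranks would read overlapping or duplicate shards.

```python
import random

def shuffled_shards(paths, shuffle: bool, shard_seed: int, epoch: int = 0):
    """Return shard paths in a deterministic per-epoch order shared by all ranks."""
    paths = list(paths)
    if shuffle:
        # Same (shard_seed, epoch) on every rank -> identical permutation,
        # so each rank can then take its own slice of the shard list.
        random.Random(shard_seed + epoch).shuffle(paths)
    return paths

# Two "ranks" seeded identically agree on the shard order:
order_rank0 = shuffled_shards(["s0.parquet", "s1.parquet", "s2.parquet"], True, 42)
order_rank1 = shuffled_shards(["s0.parquet", "s1.parquet", "s2.parquet"], True, 42)
assert order_rank0 == order_rank1
```

Bumping epoch changes the permutation between epochs while keeping ranks in agreement, which is the behavior the reviewer asks the parquet parser to share with the other registered parsers.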
Force-pushed 6d907fb to 6bf5836
Hi @pzelasko, Addressed feedback!
Verified all tests pass locally: tests/collections/common/test_lhotse_dataloading_parquet.py::test_read_parquet_manifest PASSED [ 50%]
pzelasko
left a comment
Thanks, a few more comments
if config.get("input_cfg") is not None:
    cuts, is_tarred = read_dataset_config(config)
# --- NEW: Check for explicit 'type' (like 'parquet') in the config ---
elif config.get("type") in get_known_config_data_types():
This branch is impossible to enter; remove this change.
config = OmegaConf.create(
    {
        "manifest_filepath": [parquet_dataset, parquet_dataset],
        "type": "parquet",
Use input_cfg with a one-item list instead of defining a new top-level type key in the config.
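For illustration, the suggested shape looks roughly like this, written as a plain dict standing in for the OmegaConf/YAML config (every key other than input_cfg and type is an assumption, and the paths are made up):

```python
# Hedged sketch: "parquet" lives inside an input_cfg item (a one-item list),
# not as a new top-level "type" key. Field names besides input_cfg/type are
# illustrative assumptions.
config = {
    "input_cfg": [
        {
            "type": "parquet",
            "manifest_filepath": ["/data/shard-0.parquet", "/data/shard-1.parquet"],
        }
    ],
    "shuffle": True,
    "shard_seed": 0,  # assumed option name, mirroring the shard-seed discussion above
}
```

This keeps parquet on the same code path as the other input_cfg-registered data types instead of adding a parallel top-level mechanism.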
Hi @pzelasko, Addressed the latest feedback!
Force-pushed 35a7871 to d21e500
Note on the failing reformat_with_isort_and_black check: the CI is currently failing on that step even though I have already run isort and black locally.
Can you fix the failed formatting/copyright checks? Ignore black/isort for now; maybe it will resolve itself with the next commit.
Signed-off-by: v4xsh <vanshdobhal11@gmail.com>
Force-pushed d21e500 to b48eace
Hi @pzelasko, I've pushed an update that addresses the formatting and logic blockers. The Parquet support implementation is fully verified with the new test suite. Ready for your review!
…ia Lhotse (NVIDIA-NeMo#15303)

* fix(asr): align parquet loading with input_cfg and fix formatting

Signed-off-by: v4xsh <vanshdobhal11@gmail.com>

* feat(asr): add Lhotse Parquet dataloading support and resolve linting

Signed-off-by: v4xsh <vanshdobhal11@gmail.com>

---------

Signed-off-by: v4xsh <vanshdobhal11@gmail.com>

Description
This PR addresses Issue #15059 by introducing a LazyParquetIterator adapter for Lhotse-based data loading. This allows NeMo to stream ASR datasets directly from Parquet files (local or remote) where audio is stored as embedded bytes, eliminating the need to materialize millions of audio files to disk.

Changes
nemo/collections/common/data/lhotse/nemo_adapters.py:
Added LazyParquetIterator. It uses pyarrow to stream batches and yields Lhotse Cut objects with in-memory Recording sources. pyarrow is imported conditionally: if the user does not have it installed, the module will load fine, but trying to initialize LazyParquetIterator will raise a clear ImportError instructing them to install it.
nemo/collections/common/data/lhotse/cutset.py:
Registered the parquet data type parser via @data_type_parser(["parquet"]).

Usage
Users can now specify type: parquet in their data config.

Verification
Integration Test: Verified locally with a synthetic Parquet file containing valid WAV bytes.
Audio Integrity: Confirmed that LazyParquetIterator correctly yields cuts with source_type="memory" and that the audio data is decodable by Lhotse/SoundFile without corruption.
Registry: Verified that cutset.read_cutset_from_config correctly routes type="parquet" to the new adapter using a mock OmegaConf configuration.
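The conditional pyarrow import described in the Changes section can be sketched as follows; this is a simplified illustration of the pattern, not the PR's actual class body (the record batch streaming and the wrapping of rows into Lhotse Cuts are condensed here):

```python
try:
    import pyarrow.parquet as pq  # optional dependency
except ImportError:
    pq = None  # module import still succeeds without pyarrow

class LazyParquetIterator:
    """Illustrative sketch: stream rows from a Parquet file lazily."""

    def __init__(self, path: str):
        if pq is None:
            # Fail only at construction time, with an actionable message.
            raise ImportError(
                "LazyParquetIterator requires pyarrow; install it with "
                "`pip install pyarrow`."
            )
        self.path = path

    def __iter__(self):
        # Stream record batches instead of loading the whole file into RAM.
        for batch in pq.ParquetFile(self.path).iter_batches():
            for row in batch.to_pylist():
                yield row  # downstream code would wrap this into a Lhotse Cut
```

The point of deferring the error from import time to construction time is that the rest of nemo_adapters.py stays importable for users who never touch the Parquet path.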