Skip to content

feat(asr): Support Parquet/Arrow datasets with embedded audio bytes via Lhotse#15303

Merged
pzelasko merged 3 commits intoNVIDIA-NeMo:mainfrom
v4xsh:feature/parquet-lhotse-adaptor
Feb 3, 2026
Merged

feat(asr): Support Parquet/Arrow datasets with embedded audio bytes via Lhotse#15303
pzelasko merged 3 commits intoNVIDIA-NeMo:mainfrom
v4xsh:feature/parquet-lhotse-adaptor

Conversation

@v4xsh
Copy link
Copy Markdown
Contributor

@v4xsh v4xsh commented Jan 17, 2026

Description

This PR addresses Issue #15059 by introducing a LazyParquetIterator adapter for Lhotse-based data loading. This allows NeMo to stream ASR datasets directly from Parquet files (local or remote) where audio is stored as embedded bytes, eliminating the need to materialize millions of audio files to disk.

Changes

  1. nemo/collections/common/data/lhotse/nemo_adapters.py:

    • Added LazyParquetIterator. It uses pyarrow to stream batches and yields Lhotse Cut objects with in-memory Recording sources.
    • Safety: pyarrow is imported conditionally. If the user does not have it installed, the module will load fine, but trying to initialize LazyParquetIterator will raise a clear ImportError instructing them to install it.
  2. nemo/collections/common/data/lhotse/cutset.py:

    • Registered the parquet data type parser via @data_type_parser(["parquet"]).

Usage

Users can now specify type: parquet in their data config:

train_ds:
  manifest_filepath: "/path/to/dataset.parquet"
  type: "parquet"
  audio_field: "audio"  # Column containing bytes
  text_field: "text"
  batch_size: 16

Verification

  • Integration Test: Verified locally with a synthetic Parquet file containing valid WAV bytes.

  • Audio Integrity: Confirmed that LazyParquetIterator correctly yields cuts with source_type="memory" and that the audio data is decodable by Lhotse/SoundFile without corruption.

  • Registry: Verified that cutset.read_cutset_from_config correctly routes type="parquet" to the new adapter using a mock OmegaConf configuration.

@v4xsh v4xsh force-pushed the feature/parquet-lhotse-adaptor branch 2 times, most recently from ec24400 to cc0d47c Compare January 17, 2026 13:09
@github-actions github-actions Bot removed the ASR label Jan 17, 2026
@v4xsh v4xsh force-pushed the feature/parquet-lhotse-adaptor branch from 2caccc3 to b92f5d1 Compare January 17, 2026 13:14
Copy link
Copy Markdown
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Can you add unit test coverage for this?
We need a pytest fixture that creates temporary test data and a check that it is loadable end to end.
See test_lhotse_dataloading.py file for examples with other formats.
BTW Rather than extend that file, can you create test_lhotse_dataloading_parquet.py for these tests?
Thanks.

cuts = cutsets[0]
else:
# Simple concatenation for now (or mux if weights provided, but let's keep it simple)
cuts = CutSet.mux(*cutsets)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmmm if multiple files here are shards of one dataset, then it would be better to use LazyIteratorChain from Lhotse: see https://github.com/lhotse-speech/lhotse/blob/53b759fe9f7a7ac89ccd3b6e581d2a4723670cba/lhotse/lazy.py#L322

Make sure to handle shuffle and shard_seed option similarly to other registered data_type_parser to ensure the shards will be shuffled as expected across training jobs and multiple DP ranks - crucial for achieving a decent randomness.

@v4xsh v4xsh force-pushed the feature/parquet-lhotse-adaptor branch from 6d907fb to 6bf5836 Compare January 28, 2026 06:41
@v4xsh
Copy link
Copy Markdown
Contributor Author

v4xsh commented Jan 28, 2026

Hi @pzelasko,

Addressed feedback!

  1. Sharding Logic: Switched to LazyIteratorChain in read_parquet_manifest to properly handle sharding and shuffling for distributed training, as suggested.
  2. Unit Tests: Added tests/collections/common/test_lhotse_dataloading_parquet.py with full coverage for loading and sharding.
  3. Bug Fix: I also updated read_cutset_from_config in cutset.py to correctly respect config.type. Previously, it was defaulting to the NeMo manifest parser even when type='parquet' was set, which caused the new tests to fail.

Verified all tests pass locally:

tests/collections/common/test_lhotse_dataloading_parquet.py::test_read_parquet_manifest PASSED [ 50%]
tests/collections/common/test_lhotse_dataloading_parquet.py::test_read_parquet_sharding PASSED [100%]
image

Copy link
Copy Markdown
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, a few more comments

if config.get("input_cfg") is not None:
cuts, is_tarred = read_dataset_config(config)
# --- NEW: Check for explicit 'type' (like 'parquet') in the config ---
elif config.get("type") in get_known_config_data_types():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch is impossible to enter, remove this change.

config = OmegaConf.create(
{
"manifest_filepath": [parquet_dataset, parquet_dataset],
"type": "parquet",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use input_cfg with one item list instead of defining a new key type in the top level config

@v4xsh
Copy link
Copy Markdown
Contributor Author

v4xsh commented Jan 28, 2026

Hi @pzelasko,

Addressed the latest feedback!

  1. Registry & Routing: Reverted the manual routing logic in read_cutset_from_config. Parquet datasets now correctly route through the standard read_dataset_config flow using the input_cfg pattern.
  2. Unit Tests: Refactored tests/collections/common/test_lhotse_dataloading_parquet.py to use the input_cfg list format instead of top-level keys.
  3. Sharding Logic: Switched to LazyIteratorChain in read_parquet_manifest to properly handle shard-level shuffling and distributed ranks.

@v4xsh v4xsh force-pushed the feature/parquet-lhotse-adaptor branch from 35a7871 to d21e500 Compare January 28, 2026 17:16
@v4xsh
Copy link
Copy Markdown
Contributor Author

v4xsh commented Jan 28, 2026

Note on the failing reformat_with_isort_and_black check:

The CI is currently failing with a fatal: bad object / Unable to locate commit sha error. This appears to be a GitHub Actions infrastructure issue related to the shallow clone depth and the recently rewritten history (force push) on this branch.

I have already run isort and black locally on all modified files, so they are fully compliant. You may need to manually trigger a re-sync or a deeper fetch of the branch history to clear this infra error.

@v4xsh v4xsh requested a review from pzelasko January 30, 2026 15:04
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 30, 2026
@pzelasko
Copy link
Copy Markdown
Collaborator

pzelasko commented Feb 2, 2026

Can you fix the failed formatting/copyright checks? Ignore black/isort for now, maybe it will resolve itself with the next commit.

Comment thread nemo/collections/common/data/lhotse/nemo_adapters.py Fixed
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Feb 2, 2026
v4xsh added 2 commits February 2, 2026 22:01
Signed-off-by: v4xsh <vanshdobhal11@gmail.com>
Signed-off-by: v4xsh <vanshdobhal11@gmail.com>
@v4xsh v4xsh force-pushed the feature/parquet-lhotse-adaptor branch from d21e500 to b48eace Compare February 2, 2026 17:07
@v4xsh
Copy link
Copy Markdown
Contributor Author

v4xsh commented Feb 2, 2026

Hi @pzelasko,

I've pushed an update that addresses the formatting and logic blockers:

  • Fixed Redefined Function: Resolved the E0102 error in cutset.py where _resolve_shar_inputs was duplicated.
  • Linting Improvement: Fixed several line-length violations and broad exceptions. The cutset.py score is now 9.11/10 and nemo_adapters.py is at 8.89/10.
  • Formatting: I've run black and isort locally on the modified files to ensure the formatting check passes on this commit.
  • Copyrights: I’ve kept the 2025 year for consistency with the existing codebase as requested.

The implementation for Parquet support is fully verified with the new suite of tests. Ready for your review!


@pzelasko pzelasko merged commit 636aef4 into NVIDIA-NeMo:main Feb 3, 2026
190 checks passed
nune-tadevosyan pushed a commit to nune-tadevosyan/NeMo that referenced this pull request Mar 13, 2026
…ia Lhotse (NVIDIA-NeMo#15303)

* fix(asr): align parquet loading with input_cfg and fix formatting

Signed-off-by: v4xsh <vanshdobhal11@gmail.com>

* feat(asr): add Lhotse Parquet dataloading support and resolve linting

Signed-off-by: v4xsh <vanshdobhal11@gmail.com>

---------

Signed-off-by: v4xsh <vanshdobhal11@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants