Skip to content

feat(flink): Backport Flink 2.1 Dremel nested Parquet reader rewrite to hudi-flink1.19.x (FLINK-35702)#18809

Open
skywalker0618 wants to merge 2 commits into
apache:masterfrom
skywalker0618:shihuanl/nested_parquet_schema_support_flink_1_19
Open

feat(flink): Backport Flink 2.1 Dremel nested Parquet reader rewrite to hudi-flink1.19.x (FLINK-35702)#18809
skywalker0618 wants to merge 2 commits into
apache:masterfrom
skywalker0618:shihuanl/nested_parquet_schema_support_flink_1_19

Conversation

@skywalker0618
Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

Five PRs (#18552, #18567, #18636, #18700, #18701) already landed the Flink 2.1 Dremel-style nested Parquet reader rewrite for hudi-flink1.18.x. This PR ports the same set of changes to hudi-flink1.19.x so both Flink-version modules share the same read path. Tracking JIRA: FLINK-35702.

Summary and Changelog

Almost the entire change is a verbatim copy of the corresponding files from hudi-flink1.18.x into hudi-flink1.19.x. No semantic adjustments were required because the affected classes only use stable Flink core APIs that are identical across 1.18 and 1.19.

Detailed file mapping (mirrors the five upstream PRs as a single squashed commit):

After the port, the only residual differences between hudi-flink1.18.x and hudi-flink1.19.x are the pre-existing Flink-version adapter shims (MaskingOutputAdapter, SupportsPreWriteTopologyAdapter, Utils, test CollectOutputAdapter, test MockTaskInfo), which are unrelated to this change and reflect legitimate Flink 1.18-vs-1.19 API differences.

Impact

No public API changes and no user-facing behavior changes for users of hudi-flink1.19.x. The internal Parquet read path now matches hudi-flink1.18.x, picking up the same correctness fix for small-precision decimals (previously a ClassCastException for INT32/INT64-encoded decimals) and the same nested-schema read support via Flink 2.1's Dremel-style column readers.

Risk Level

Low. The change is a verbatim port of code that has already landed and stabilized in hudi-flink1.18.x. Validation performed in this branch:

  • mvn -pl hudi-flink-datasource/hudi-flink1.19.x -am -Pflink1.19 test-compile succeeds.
  • All 36 unit tests in the ported test classes pass under the flink1.19 profile (TestParquetDecimalVector 12/12, TestHeapColumnVectorAccessors 4/4, TestParquetDataColumnReaderFactory 13/13, TestParquetGroupField 7/7).
  • Static checks: no remaining references to the deleted legacy readers anywhere in hudi-flink1.19.x/ or the shared hudi-flink/ module; no imports introduced that depend on Flink-1.18-only APIs.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR backports the Flink 2.1 Dremel-style nested Parquet reader rewrite (FLINK-35702) from hudi-flink1.18.x to hudi-flink1.19.x, keeping both modules on a shared read path. The diff is a verbatim port of the 1.18.x sources, including the Hudi-specific divergences from upstream Flink 2.1 (schema-evolution null ParquetField children in ROW types and the "all children null → null row" collapse in NestedColumnReader.readRow, both pinned by ITTestHoodieDataSource#testParquetNullChildColumnsRowTypes). No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small naming inconsistency in the new int64ToTimestamp helper; the rest of the code is clean and well-documented. one Javadoc param description swap in NestedPositionUtil worth fixing; the rest of the port is clean.

cc @yihua

return ChronoUnit.MICROS;
}
return null;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: could you rename utcTimestamp to isUtcTimestamp here? Every other use of this flag in the class — the field declaration, both constructor parameters, and the getDataColumnReaderByTypeHelper parameter — uses the is prefix, so dropping it in this private helper is a small inconsistency that may trip up a future reader.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed

* @param field field that contains array/map column message include max repetition level and
* definition level.
* @param definitionLevels int array with each value's repetition level.
* @param repetitionLevels int array with each value's definition level.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the descriptions for definitionLevels and repetitionLevels look swapped here — definitionLevels is described as "repetition level" and repetitionLevels as "definition level". Could you flip the descriptions so they match their param names?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.92%. Comparing base (facb517) to head (e476d4a).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18809   +/-   ##
=========================================
  Coverage     68.91%   68.92%           
+ Complexity    29089    29079   -10     
=========================================
  Files          2509     2509           
  Lines        139470   139470           
  Branches      17114    17117    +3     
=========================================
+ Hits          96115    96127   +12     
+ Misses        35601    35591   -10     
+ Partials       7754     7752    -2     
Flag Coverage Δ
common-and-other-modules 44.42% <ø> (+<0.01%) ⬆️
hadoop-mr-java-client 44.91% <ø> (+0.06%) ⬆️
spark-client-hadoop-common 48.23% <ø> (+<0.01%) ⬆️
spark-java-tests 49.37% <ø> (+0.01%) ⬆️
spark-scala-tests 45.27% <ø> (-0.01%) ⬇️
utilities 37.46% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 10 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! This PR backports the Flink 2.1 Dremel nested Parquet reader rewrite (FLINK-35702) from hudi-flink1.18.x to hudi-flink1.19.x, introducing the new ParquetField/ParquetGroupField/ParquetPrimitiveField data holders, accompanying unit tests, and removing the legacy RowColumnReader. The cow/ directory and tests are byte-identical to the 1.18.x source, and prior-round nits (isUtcTimestamp rename, NestedPositionUtil javadoc fix) are already carried over. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants