feat(flink): Backport Flink 2.1 Dremel nested Parquet reader rewrite to hudi-flink1.19.x (FLINK-35702)#18809
Conversation
…to hudi-flink1.19.x (FLINK-35702)
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR backports the Flink 2.1 Dremel-style nested Parquet reader rewrite (FLINK-35702) from hudi-flink1.18.x to hudi-flink1.19.x, keeping both modules on a shared read path. The diff is a verbatim port of the 1.18.x sources, including the Hudi-specific divergences from upstream Flink 2.1 (schema-evolution null ParquetField children in ROW types and the "all children null → null row" collapse in NestedColumnReader.readRow, both pinned by ITTestHoodieDataSource#testParquetNullChildColumnsRowTypes). No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small naming inconsistency in the new int64ToTimestamp helper; the rest of the code is clean and well-documented. one Javadoc param description swap in NestedPositionUtil worth fixing; the rest of the port is clean.
cc @yihua
| return ChronoUnit.MICROS; | ||
| } | ||
| return null; | ||
| } |
There was a problem hiding this comment.
🤖 nit: could you rename utcTimestamp to isUtcTimestamp here? Every other use of this flag in the class — the field declaration, both constructor parameters, and the getDataColumnReaderByTypeHelper parameter — uses the is prefix, so dropping it in this private helper is a small inconsistency that may trip up a future reader.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| * @param field field that contains array/map column message include max repetition level and | ||
| * definition level. | ||
| * @param definitionLevels int array with each value's repetition level. | ||
| * @param repetitionLevels int array with each value's definition level. |
There was a problem hiding this comment.
🤖 nit: the descriptions for definitionLevels and repetitionLevels look swapped here — definitionLevels is described as "repetition level" and repetitionLevels as "definition level". Could you flip the descriptions so they match their param names?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #18809 +/- ##
=========================================
Coverage 68.91% 68.92%
+ Complexity 29089 29079 -10
=========================================
Files 2509 2509
Lines 139470 139470
Branches 17114 17117 +3
=========================================
+ Hits 96115 96127 +12
+ Misses 35601 35591 -10
+ Partials 7754 7752 -2
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! This PR backports the Flink 2.1 Dremel nested Parquet reader rewrite (FLINK-35702) from hudi-flink1.18.x to hudi-flink1.19.x, introducing the new ParquetField/ParquetGroupField/ParquetPrimitiveField data holders, accompanying unit tests, and removing the legacy RowColumnReader. The cow/ directory and tests are byte-identical to the 1.18.x source, and prior-round nits (isUtcTimestamp rename, NestedPositionUtil javadoc fix) are already carried over. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
Describe the issue this Pull Request addresses
Five PRs (#18552, #18567, #18636, #18700, #18701) already landed the Flink 2.1 Dremel-style nested Parquet reader rewrite for
hudi-flink1.18.x. This PR ports the same set of changes tohudi-flink1.19.xso both Flink-version modules share the same read path. Tracking JIRA: FLINK-35702.Summary and Changelog
Almost the entire change is a verbatim copy of the corresponding files from
hudi-flink1.18.xintohudi-flink1.19.x. No semantic adjustments were required because the affected classes only use stable Flink core APIs that are identical across 1.18 and 1.19.Detailed file mapping (mirrors the five upstream PRs as a single squashed commit):
hudi-flink1.18.x(Dremel reader support code introduced by feat(flink): Vendor Flink 2.1 Dremel nested-reader support classes #18567 / feat(flink): Backport Flink 2.1 nested Parquet column readers and INT64 timestamp dispatch (FLINK-35702) #18636 / feat(flink): Wire Flink 2.1 nested Parquet readers into the Hudi read path (FLINK-35702) #18700):cow/utils/{BooleanArrayList,IntArrayList,LongArrayList,NestedPositionUtil}.javacow/vector/position/{CollectionPosition,LevelDelegation,RowPosition}.javacow/vector/type/{ParquetField,ParquetGroupField,ParquetPrimitiveField}.javacow/vector/reader/{NestedColumnReader,NestedPrimitiveColumnReader}.javahudi-flink1.18.xversion (decimal fix from fix: Parquet small-precision decimals decode ClassCastException #18552 plus the wire-up from feat(flink): Wire Flink 2.1 nested Parquet readers into the Hudi read path (FLINK-35702) #18700):cow/ParquetSplitReaderUtil.javacow/vector/{HeapArrayVector,HeapMapColumnVector,HeapRowColumnVector,ParquetDecimalVector}.javacow/vector/reader/{ParquetColumnarRowSplitReader,ParquetDataColumnReaderFactory}.javacow/vector/{ColumnarGroupArrayData,ColumnarGroupMapData,ColumnarGroupRowData,HeapArrayGroupColumnVector}.javacow/vector/reader/{ArrayColumnReader,ArrayGroupReader,MapColumnReader,RowColumnReader}.javahudi-flink1.18.x:TestParquetDecimalVector(12 tests)TestHeapColumnVectorAccessors(4 tests)TestParquetDataColumnReaderFactory(13 tests)TestParquetGroupField(7 tests)After the port, the only residual differences between
hudi-flink1.18.xandhudi-flink1.19.xare the pre-existing Flink-version adapter shims (MaskingOutputAdapter,SupportsPreWriteTopologyAdapter,Utils, testCollectOutputAdapter, testMockTaskInfo), which are unrelated to this change and reflect legitimate Flink 1.18-vs-1.19 API differences.Impact
No public API changes and no user-facing behavior changes for users of
hudi-flink1.19.x. The internal Parquet read path now matcheshudi-flink1.18.x, picking up the same correctness fix for small-precision decimals (previously aClassCastExceptionforINT32/INT64-encoded decimals) and the same nested-schema read support via Flink 2.1's Dremel-style column readers.Risk Level
Low. The change is a verbatim port of code that has already landed and stabilized in
hudi-flink1.18.x. Validation performed in this branch:mvn -pl hudi-flink-datasource/hudi-flink1.19.x -am -Pflink1.19 test-compilesucceeds.flink1.19profile (TestParquetDecimalVector 12/12, TestHeapColumnVectorAccessors 4/4, TestParquetDataColumnReaderFactory 13/13, TestParquetGroupField 7/7).hudi-flink1.19.x/or the sharedhudi-flink/module; no imports introduced that depend on Flink-1.18-only APIs.Documentation Update
none
Contributor's checklist