[fix](hive) Fix Hive DATE timezone shift in external readers#61330
[fix](hive) Fix Hive DATE timezone shift in external readers#61330zhangstar333 merged 6 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 26989 ms |
TPC-DS: Total hot run time: 168095 ms |
|
run buildall |
TPC-H: Total hot run time: 26614 ms |
TPC-DS: Total hot run time: 167570 ms |
|
run buildall |
|
run buildall |
TPC-H: Total hot run time: 26929 ms |
TPC-DS: Total hot run time: 168705 ms |
|
/review |
There was a problem hiding this comment.
Code Review Summary
Goal
Fix Hive external table DATE columns being shifted by one day in west time zones (e.g., -06:00) when reading ORC/Parquet files. The root cause was that _offset_days was computed from VecDateTimeValue::from_unixtime(0, ctz) -- in west timezones this maps epoch 0 to Dec 31, 1969, yielding an offset of -1 that was then applied to every DATE value. Since ORC/Parquet DATE is a logical day count (days since epoch) with no timezone semantics, this offset was incorrect.
Critical Checkpoint Conclusions
1. Does the code accomplish the goal? Is there a test that proves it?
Yes. The fix correctly removes the timezone-based day offset from both ORC and Parquet DATE decode paths. Both the ORC spec and the Parquet spec define DATE as a logical day count without timezone semantics. BE unit tests and regression tests are added to prove the fix.
2. Is this modification as small, clear, and focused as possible?
Yes. The change is minimal and surgical: removes 6 lines of offset computation in vorc_reader.cpp (two constructors), removes the _offset_days member from vorc_reader.h, removes offset_days from ConvertParams in parquet_column_convert.h, and simplifies the usage sites. No unrelated changes.
3. Concurrency?
Not applicable. The removed _offset_days was a per-reader instance member computed in the constructor. No concurrency concerns.
4. Lifecycle management?
Not applicable. The removed member was a simple int32_t. No lifecycle issues.
5. Configuration items?
No new configuration items added. Not applicable.
6. Incompatible changes / rolling upgrades?
No incompatible changes. This is a bug fix in the read path -- the stored data (in ORC/Parquet files) is unchanged. No compatibility code needed.
7. Parallel code paths?
Thoroughly checked. The _offset_days/offset_days pattern has been completely removed from both ORC and Parquet readers. The ORC predicate pushdown path for DATE (vorc_reader.cpp ~line 633) correctly uses cctz::utc_time_zone() (not session timezone) and was never affected by _offset_days. No other file format readers (Iceberg, Paimon, Hudi) use the same offset pattern. All other DATE conversions use date_day_offset_dict which is a pure arithmetic lookup with no timezone logic.
8. Special conditional checks?
Not applicable. The fix removes conditional logic rather than adding it.
9. Test coverage?
- BE unit test for ORC:
date_should_not_shift_in_west_timezonetests reading withAmerica/Mexico_Citytimezone and verifies the date1900-01-01is preserved. Good. - BE unit test for Parquet:
date_should_not_shift_in_west_timezonetests reading with-06:00timezone, verifying2020-01-01and2020-01-06are present and2019-12-31is absent. Good. - Regression test:
test_hive_date_timezone.groovytests both ORC and Parquet tables with UTC andAmerica/Mexico_City, verifying identical results. Usesorder by idfor deterministic output. Uses pre-existingschema_changetest tables. Good. - The
.outfile matches the existingorc_date_to_date/parquet_date_to_dateexpected outputs from the schema change tests, confirming correctness. - Note: PR author states tests were not run locally. The
.outfile was likely constructed manually but the values are consistent with existing verified test data.
10. Observability?
Not applicable. This is a bug fix removing incorrect logic, not adding new paths.
11. Transaction/persistence?
Not applicable. This is a read-path-only change.
12. Data writes?
Not applicable.
13. FE-BE variable passing?
Not applicable. No new variables added.
14. Performance?
Slightly positive. The fix removes computation (VecDateTimeValue::from_unixtime(0, ctz) and t.day() == 31 check) from both ORC reader constructors, and removes an addition (+ offset_days) from the hot decode loop in Parquet's Int32ToDate. Marginal but net positive.
15. Any other issues?
No other issues found. The test improvements in orc_read_lines.cpp (replacing static_cast<void> with proper EXPECT_TRUE(st.ok()) << st for Status checks on set_fill_columns and get_next_block) are a welcome improvement to error reporting in tests.
Verdict
Clean, correct, and well-tested fix. No issues found.
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
run buildall |
TPC-H: Total hot run time: 26595 ms |
TPC-DS: Total hot run time: 169199 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
PR approved by at least one committer and no changes requested. |
What problem does this PR solve?
Hive external tables currently apply session time zone day offsets when decoding ORC/Parquet DATE columns. In west time zones such as -06:00, this shifts DATE values by one day earlier, while Spark keeps the original logical date.
This PR removes the incorrect time zone day adjustment from Hive DATE decoding paths in ORC and Parquet readers. TIMESTAMP-related time zone handling is unchanged.
It also adds:
Local BE build / regression execution was not run on this machine because the current environment does not support BE compilation or running those tests; pipeline validation is expected to cover execution.
Issue Number: N/A
Related PR: N/A
Problem Summary:
Release note
Fix Hive external table DATE columns being shifted by one day in west time zones when reading ORC/Parquet files.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)