feat(compaction): support multi-level lookup in LSM tree #179

Merged
lxy-9602 merged 9 commits into alibaba:main from lszskye:create_lookup_levels on Mar 16, 2026

@lszskye (Collaborator) commented Mar 13, 2026

Purpose

This PR introduces a comprehensive lookup mechanism for the MergeTree data structure, enabling efficient key-based lookups across multiple levels of data files. The implementation provides various value processing strategies.

Linked issue: #93

Tests

core/mergetree/levels_test.cpp
core/mergetree/lookup_file_test.cpp
core/mergetree/lookup_levels_test.cpp

API and Format

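In class FileBatchReader, the return type of GetPreviousBatchFirstRowNumber() changes from uint64_t to Result<uint64_t>:
Before: virtual uint64_t GetPreviousBatchFirstRowNumber() const = 0;
After: virtual Result<uint64_t> GetPreviousBatchFirstRowNumber() const = 0;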
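
A call site that used the plain uint64_t now has to check for failure before using the value. The following is a minimal sketch of that pattern, not the project's code: SketchResult is a hypothetical stand-in for the real Result<T> (whose interface may differ), and SketchFileBatchReader, SketchReader, and FilePositionOf are illustrative names. It assumes, as the reader descriptions in this PR suggest, that the value can be undefined when a bitmap is pushed down.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical stand-in for the project's Result<T>; the real type may differ.
template <typename T>
class SketchResult {
 public:
  static SketchResult Ok(T value) {
    SketchResult r;
    r.ok_ = true;
    r.value_ = std::move(value);
    return r;
  }
  static SketchResult Invalid(std::string message) {
    SketchResult r;
    r.error_ = std::move(message);
    return r;
  }
  bool ok() const { return ok_; }
  const T& value() const { return value_; }
  const std::string& error() const { return error_; }

 private:
  SketchResult() = default;
  bool ok_ = false;
  T value_{};
  std::string error_;
};

// Illustrative interface carrying only the method whose signature changed.
class SketchFileBatchReader {
 public:
  virtual ~SketchFileBatchReader() = default;
  virtual SketchResult<uint64_t> GetPreviousBatchFirstRowNumber() const = 0;
};

// Assumption: with bitmap pushdown the first row number is undefined,
// so the reader reports an error instead of a meaningless number.
class SketchReader : public SketchFileBatchReader {
 public:
  SketchReader(uint64_t first_row, bool bitmap_pushdown)
      : first_row_(first_row), bitmap_pushdown_(bitmap_pushdown) {}

  SketchResult<uint64_t> GetPreviousBatchFirstRowNumber() const override {
    if (bitmap_pushdown_) {
      return SketchResult<uint64_t>::Invalid(
          "previous batch first row number is undefined with bitmap pushdown");
    }
    return SketchResult<uint64_t>::Ok(first_row_);
  }

 private:
  uint64_t first_row_;
  bool bitmap_pushdown_;
};

// Caller side: propagate the error, otherwise derive a row's file position
// from the batch's first row number plus the row's index inside the batch.
SketchResult<uint64_t> FilePositionOf(const SketchFileBatchReader& reader,
                                      uint64_t index_in_batch) {
  SketchResult<uint64_t> first = reader.GetPreviousBatchFirstRowNumber();
  if (!first.ok()) {
    return first;
  }
  return SketchResult<uint64_t>::Ok(first.value() + index_in_batch);
}
```

With a reader whose previous batch starts at row 100, FilePositionOf(reader, 7) yields 107. With bitmap pushdown enabled, the same call returns the error unchanged.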

@lszskye force-pushed the create_lookup_levels branch from 2926a75 to 28f8fde on March 14, 2026 04:22

Copilot AI left a comment

Pull request overview

This PR adds a multi-level key lookup facility for MergeTree/LSM levels (including Level-0 handling), introduces supporting data structures (Levels, SortedRun utilities, local lookup-file caching), and updates the file-reading API to make GetPreviousBatchFirstRowNumber() fallible (Result<uint64_t>), with cascading updates across formats/readers and tests.
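
To make the lookup order concrete, here is a minimal in-memory sketch. It assumes that Level-0 files may overlap in key range and are probed newest first (by a hypothetical max_sequence field), and that each higher level is a single sorted run of non-overlapping files, so the candidate file can be found by binary search. SketchFile, SketchLevels, and Lookup are illustrative names, not the PR's Levels or LookupLevels API. Deletes, local lookup-file caching, and value processors are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a data file: its key range, a sequence number
// (larger means newer), and its rows held in memory.
struct SketchFile {
  std::string min_key;
  std::string max_key;
  uint64_t max_sequence;
  std::map<std::string, std::string> rows;
};

struct SketchLevels {
  std::vector<SketchFile> level0;             // key ranges may overlap
  std::vector<std::vector<SketchFile>> runs;  // runs[i] is level i + 1,
                                              // sorted by key, no overlap
};

// Returns a pointer to the value, or nullptr if the file does not hold the key.
const std::string* LookupInFile(const SketchFile& file, const std::string& key) {
  if (key < file.min_key || key > file.max_key) {
    return nullptr;
  }
  auto it = file.rows.find(key);
  return it == file.rows.end() ? nullptr : &it->second;
}

// Binary-search the run for the first file whose max_key is >= key;
// because files do not overlap, it is the only file that can hold the key.
const std::string* LookupInRun(const std::vector<SketchFile>& run,
                               const std::string& key) {
  auto it = std::lower_bound(
      run.begin(), run.end(), key,
      [](const SketchFile& f, const std::string& k) { return f.max_key < k; });
  return it == run.end() ? nullptr : LookupInFile(*it, key);
}

const std::string* Lookup(const SketchLevels& levels, const std::string& key) {
  // Level 0: probe every file, newest first, so the latest write wins.
  std::vector<const SketchFile*> newest_first;
  for (const SketchFile& f : levels.level0) {
    newest_first.push_back(&f);
  }
  std::sort(newest_first.begin(), newest_first.end(),
            [](const SketchFile* a, const SketchFile* b) {
              return a->max_sequence > b->max_sequence;
            });
  for (const SketchFile* f : newest_first) {
    if (const std::string* v = LookupInFile(*f, key)) {
      return v;
    }
  }
  // Higher levels: lower level numbers hold newer data, so stop at first hit.
  for (const std::vector<SketchFile>& run : levels.runs) {
    if (const std::string* v = LookupInRun(run, key)) {
      return v;
    }
  }
  return nullptr;
}
```

The first hit wins because data closer to Level-0 is newer in this model. A key that appears in both a Level-0 file and a deeper level therefore resolves to the Level-0 value.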

Changes:

  • Introduce MergeTree level modeling + lookup implementation (Levels, LookupLevels, LookupFile, LookupUtils) with new unit tests.
  • Update FileBatchReader::GetPreviousBatchFirstRowNumber() to return Result<uint64_t> and propagate the change through format readers, wrappers, and tests.
  • Extend KeyValue data-file reading to expose file positions (NextWithFilePos) and add small utility improvements (IOManager, EqualsIgnoreNullable, MoveVector, operator!=).

Reviewed changes

Copilot reviewed 72 out of 72 changed files in this pull request and generated 13 comments.

| File | Description |
| --- | --- |
| src/paimon/testing/mock/mock_key_value_data_file_record_reader.h | Adjust mock record reader to accept FileBatchReader. |
| src/paimon/testing/mock/mock_file_batch_reader.h | Update mock to return Result<uint64_t> for previous batch row. |
| src/paimon/format/parquet/parquet_file_batch_reader_test.cpp | Update tests for Result<uint64_t> return type. |
| src/paimon/format/parquet/parquet_file_batch_reader.h | Update override signature to Result<uint64_t>. |
| src/paimon/format/parquet/file_reader_wrapper_test.cpp | Update wrapper tests for Result<uint64_t>. |
| src/paimon/format/parquet/file_reader_wrapper.h | Update wrapper accessor to Result<uint64_t>. |
| src/paimon/format/orc/orc_file_batch_reader_test.cpp | Update ORC tests for Result<uint64_t>. |
| src/paimon/format/orc/orc_file_batch_reader.h | Update override signature to Result<uint64_t>. |
| src/paimon/format/lance/lance_format_reader_writer_test.cpp | Add coverage for previous-batch row tracking + bitmap pushdown invalid case. |
| src/paimon/format/lance/lance_file_batch_reader.h | Implement GetPreviousBatchFirstRowNumber() with bitmap-pushdown restriction. |
| src/paimon/format/lance/lance_file_batch_reader.cpp | Track previous/last batch row numbers and reset on schema change. |
| src/paimon/format/blob/blob_file_batch_reader_test.cpp | Update tests for Result<uint64_t> and adjust bitmap case coverage. |
| src/paimon/format/blob/blob_file_batch_reader.h | Return invalid status when previous-batch row number is undefined under bitmap pushdown. |
| src/paimon/format/avro/avro_file_batch_reader_test.cpp | Update tests for Result<uint64_t>. |
| src/paimon/format/avro/avro_file_batch_reader.h | Update override signature to Result<uint64_t>. |
| src/paimon/core/operation/raw_file_split_read.h | Adjust DV/index application helper to return FileBatchReader. |
| src/paimon/core/operation/raw_file_split_read.cpp | Convert vectors of FileBatchReader to BatchReader via ObjectUtils::MoveVector. |
| src/paimon/core/operation/merge_file_split_read.h | Adjust DV/index application helper to return FileBatchReader. |
| src/paimon/core/operation/merge_file_split_read.cpp | Push down bitmap DV for BitmapDeletionVector where supported; use MoveVector. |
| src/paimon/core/operation/data_evolution_split_read.h | Adjust DV/index application helper to return FileBatchReader. |
| src/paimon/core/operation/data_evolution_split_read.cpp | Use MoveVector to bridge FileBatchReader vectors into ConcatBatchReader. |
| src/paimon/core/operation/abstract_split_read.h | Make raw file reader creation return FileBatchReader and expose it for lookup use. |
| src/paimon/core/operation/abstract_split_read.cpp | Propagate FileBatchReader through field-mapping reader creation. |
| src/paimon/core/mergetree/sorted_run_test.cpp | Add tests for sorting/validating runs and equality. |
| src/paimon/core/mergetree/sorted_run.h | Add Empty/FromUnsorted helpers and equality operator. |
| src/paimon/core/mergetree/lookup_utils.h | New: shared lookup helpers for Level-0 and sorted levels. |
| src/paimon/core/mergetree/lookup_levels_test.cpp | New: multi-level lookup tests (Level-0 and multi-file scenarios). |
| src/paimon/core/mergetree/lookup_levels.h | New: lookup-by-key facade across Levels with caching and processor management. |
| src/paimon/core/mergetree/lookup_levels.cpp | New: implementation that builds local lookup files from data files and looks up keys. |
| src/paimon/core/mergetree/lookup_file_test.cpp | New: tests for LookupFile behavior and local file prefix generation. |
| src/paimon/core/mergetree/lookup_file.h | New: local lookup-file wrapper with auto-close and delete-on-close. |
| src/paimon/core/mergetree/lookup/default_lookup_serializer_factory.h | Use ArrowUtils::EqualsIgnoreNullable and improve schema mismatch error message. |
| src/paimon/core/mergetree/levels_test.cpp | New: tests for Levels construction, update, and ordering. |
| src/paimon/core/mergetree/levels.h | New: Levels container for Level-0 set + per-level sorted runs. |
| src/paimon/core/mergetree/levels.cpp | New: Levels implementation (create/update/query). |
| src/paimon/core/mergetree/level_sorted_run.h | Add equality operator for test comparisons. |
| src/paimon/core/mergetree/compact/merge_tree_compact_rewriter_test.cpp | Adjust rewriter creation to new signature (no executor arg). |
| src/paimon/core/mergetree/compact/merge_tree_compact_rewriter.h | Remove executor from factory signature. |
| src/paimon/core/mergetree/compact/merge_tree_compact_rewriter.cpp | Use default executor internally and simplify ReadContext builder. |
| src/paimon/core/key_value.h | Add default constructor for KeyValue. |
| src/paimon/core/io/key_value_data_file_record_reader_test.cpp | Add tests for selected-bitmap + file-position iteration. |
| src/paimon/core/io/key_value_data_file_record_reader.h | Switch to FileBatchReader, add iterator file-position support. |
| src/paimon/core/io/key_value_data_file_record_reader.cpp | Implement NextWithFilePos() using previous-batch row offset. |
| src/paimon/core/io/field_mapping_reader.h | Make FieldMappingReader a FileBatchReader and forward required APIs. |
| src/paimon/core/io/field_mapping_reader.cpp | Adjust ctor signature to accept FileBatchReader. |
| src/paimon/core/io/data_file_meta_test.cpp | Add equality/inequality assertions. |
| src/paimon/core/io/data_file_meta.h | Add operator!=. |
| src/paimon/core/io/data_file_meta.cpp | Implement operator!=. |
| src/paimon/core/io/complete_row_tracking_fields_reader.h | Update override signature to Result<uint64_t>. |
| src/paimon/core/io/complete_row_tracking_fields_reader.cpp | Handle Result<uint64_t> for previous batch row. |
| src/paimon/core/disk/io_manager.cpp | New: IOManager implementation for temp file path generation. |
| src/paimon/core/deletionvectors/apply_deletion_vector_batch_reader.h | Make wrapper a FileBatchReader and forward required APIs. |
| src/paimon/common/utils/object_utils_test.cpp | Add tests for MoveVector. |
| src/paimon/common/utils/object_utils.h | Add MoveVector helper for moving vectors of convertible pointer types. |
| src/paimon/common/utils/binary_row_partition_computer_test.cpp | Add tests for PartToSimpleString. |
| src/paimon/common/utils/binary_row_partition_computer.h | Add PartToSimpleString API. |
| src/paimon/common/utils/binary_row_partition_computer.cpp | Implement PartToSimpleString. |
| src/paimon/common/utils/arrow/arrow_utils_test.cpp | Add tests for EqualsIgnoreNullable. |
| src/paimon/common/utils/arrow/arrow_utils.h | Add EqualsIgnoreNullable declaration. |
| src/paimon/common/utils/arrow/arrow_utils.cpp | Implement EqualsIgnoreNullable. |
| src/paimon/common/sst/sst_file_reader.cpp | Silence unused variable with [[maybe_unused]]. |
| src/paimon/common/sst/block_reader.h | Make index shared_ptr param const-ref. |
| src/paimon/common/reader/prefetch_file_batch_reader_impl_test.cpp | Update tests for Result<uint64_t>. |
| src/paimon/common/reader/prefetch_file_batch_reader_impl.h | Update signature to Result<uint64_t>. |
| src/paimon/common/reader/prefetch_file_batch_reader_impl.cpp | Propagate Result<uint64_t> in internal handling. |
| src/paimon/common/reader/delegating_prefetch_reader.h | Update override signature to Result<uint64_t>. |
| src/paimon/common/file_index/bitmap/apply_bitmap_index_batch_reader.h | Convert wrapper to FileBatchReader and propagate previous-row result. |
| src/paimon/common/data/timestamp_test.cpp | Use ASSERT_NE for inequality. |
| src/paimon/CMakeLists.txt | Register new core sources and tests. |
| include/paimon/reader/file_batch_reader.h | API change: GetPreviousBatchFirstRowNumber() returns Result<uint64_t>. |
| include/paimon/disk/io_manager.h | New public IOManager interface. |
| include/paimon/data/timestamp.h | Add operator!=. |
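
The lookup_file.h entry describes a local lookup-file wrapper with auto-close and delete-on-close. As a rough illustration of that ownership pattern only, the sketch below ties a temporary file's lifetime to an object using plain C stdio. LocalLookupFileSketch and FileExists are hypothetical names, and the real LookupFile class almost certainly has a different interface and wraps a lookup store rather than a raw FILE handle.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical RAII wrapper: creates a local file, and on Close() or
// destruction closes the handle and deletes the file from disk.
class LocalLookupFileSketch {
 public:
  explicit LocalLookupFileSketch(std::string path)
      : path_(std::move(path)), file_(std::fopen(path_.c_str(), "wb+")) {}

  ~LocalLookupFileSketch() { Close(); }

  // The wrapper owns the file, so copying is disabled.
  LocalLookupFileSketch(const LocalLookupFileSketch&) = delete;
  LocalLookupFileSketch& operator=(const LocalLookupFileSketch&) = delete;

  bool IsOpen() const { return file_ != nullptr; }
  const std::string& Path() const { return path_; }

  // Idempotent: a second call does nothing.
  void Close() {
    if (file_ == nullptr) {
      return;
    }
    std::fclose(file_);
    file_ = nullptr;
    std::remove(path_.c_str());  // delete-on-close
  }

 private:
  std::string path_;  // declared before file_ so it is initialized first
  std::FILE* file_;
};

// Small helper for checking whether a path can currently be opened.
bool FileExists(const std::string& path) {
  std::FILE* f = std::fopen(path.c_str(), "rb");
  if (f == nullptr) {
    return false;
  }
  std::fclose(f);
  return true;
}
```

Tying deletion to the destructor means a cached lookup file cannot outlive the object that owns it, even on early-return error paths.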

@lszskye force-pushed the create_lookup_levels branch from ae0b687 to 073bcc1 on March 16, 2026 03:08
@lszskye force-pushed the create_lookup_levels branch from 478e82d to b9800ae on March 16, 2026 05:43

@lucasfang (Collaborator) left a comment

+1

@lxy-9602 merged commit 8924152 into alibaba:main on Mar 16, 2026
8 checks passed

4 participants