feat(btree): refactor GlobalIndexScan & add inte test for btree global index#262
Merged
lxy-9602 merged 2 commits into alibaba:main on May 7, 2026
lxy-9602
reviewed
May 6, 2026
Pull request overview
This PR refactors the global-index scanning flow to build GlobalIndexReader instances directly (including local-to-global row id rewriting and unioning across ranges) and updates downstream scan code to use RowRangeIndex for row-range filtering. It also adds new integration test coverage and test datasets for btree global index behavior (including partitioned tables / multi-meta cases).
Changes:
- Introduces `OffsetGlobalIndexReader` and `UnionGlobalIndexReader`, and refactors `GlobalIndexScanImpl` to build per-field readers by index type and row-range shards.
- Replaces various `std::vector<Range>` row-range plumbing with `RowRangeIndex`, and updates file-store filtering accordingly.
- Adds/updates integration tests and test datasets for btree global index (parquet/orc).
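To make the row-range filtering idea concrete, here is a minimal sketch of an index over row ranges that answers "does this file's row range overlap any selected range?". `RowRange` and `SimpleRowRangeIndex` are hypothetical names invented for illustration; the real `RowRangeIndex` in this PR may be structured quite differently. The sketch assumes inclusive ranges and non-negative row ids.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a row range; both ends are inclusive.
struct RowRange {
    int64_t from;
    int64_t to;
};

// Hypothetical sketch, not the real paimon RowRangeIndex: keeps ranges sorted
// and merged so an overlap test is a single binary search.
class SimpleRowRangeIndex {
 public:
    explicit SimpleRowRangeIndex(std::vector<RowRange> ranges) {
        std::sort(ranges.begin(), ranges.end(),
                  [](const RowRange& a, const RowRange& b) { return a.from < b.from; });
        for (const RowRange& r : ranges) {
            assert(r.from >= 0 && r.from <= r.to);  // assumption: non-negative row ids
            // Merge ranges that overlap or are directly adjacent.
            if (!ranges_.empty() && r.from - 1 <= ranges_.back().to) {
                ranges_.back().to = std::max(ranges_.back().to, r.to);
            } else {
                ranges_.push_back(r);
            }
        }
    }

    // Returns true if [from, to] overlaps at least one indexed range.
    bool Intersects(int64_t from, int64_t to) const {
        // Find the first merged range whose end is not before `from`.
        auto it = std::lower_bound(ranges_.begin(), ranges_.end(), from,
                                   [](const RowRange& r, int64_t v) { return r.to < v; });
        return it != ranges_.end() && it->from <= to;
    }

    std::size_t Size() const { return ranges_.size(); }

 private:
    std::vector<RowRange> ranges_;
};
```

Under these assumptions, a manifest entry whose row range does not intersect the index could be skipped before any data file is opened, which is the kind of pruning the overview describes.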
Reviewed changes
Copilot reviewed 60 out of 147 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `include/paimon/global_index/global_index_scan.h` | Replaces range-scan APIs with `CreateReaders(...)` APIs and adds `RowRangeIndex` support. |
| `src/paimon/core/global_index/global_index_scan_impl.{h,cpp}` | New implementation that builds reader sets using offset + union wrappers and supports evaluator-based predicate scanning. |
| `src/paimon/common/global_index/{offset,union}_global_index_reader.{h,cpp}` | Adds reader wrappers for global row-id rewriting and unioning results across shards/readers. |
| `src/paimon/core/table/source/data_evolution_batch_scan.{h,cpp}` | Switches from vector-search-aware evaluation to predicate-only + `RowRangeIndex` filtering and indexed split wrapping. |
| `src/paimon/core/operation/file_store_scan.{h,cpp}` | Moves manifest row-range pruning into a shared `RowRangeIndex`-based filter. |
| `include/paimon/scan_context.h` + `src/paimon/core/operation/scan_context.cpp` | Removes vector search from scan context/filter APIs. |
| `test/inte/global_index_test.cpp` | Updates tests for new reader/evaluator APIs and adds btree global index integration tests. |
| `test/test_data/**/append_with_btree_with_partition.db/**` | Adds new snapshot/schema/test-data fixtures for btree global index integration tests. |
| `src/paimon/common/global_index/CMakeLists.txt` + `src/paimon/CMakeLists.txt` | Registers new reader wrapper sources for builds/tests. |
Comments suppressed due to low confidence (1)
include/paimon/scan_context.h:156
- This PR removes the public ScanContextBuilder::SetVectorSearch() API and ScanFilter::GetVectorSearch(), which is a breaking change for vector-search consumers. The PR description focuses on btree/global index scan refactor, but doesn’t mention removing vector search support or the replacement API; please either keep backward compatibility (or provide an alternative) and update the public API docs / PR description accordingly.
```cpp
    std::shared_ptr<Executor> executor_;
    std::shared_ptr<FileSystem> specific_file_system_;
    std::map<std::string, std::string> options_;
};

/// Filter configuration for table scan operations
class PAIMON_EXPORT ScanFilter {
 public:
    ScanFilter(const std::shared_ptr<Predicate>& predicate,
               const std::vector<std::map<std::string, std::string>>& partition_filters,
               const std::optional<int32_t>& bucket_filter)
        : predicates_(predicate),
          bucket_filter_(bucket_filter),
          partition_filters_(partition_filters) {}

    std::shared_ptr<Predicate> GetPredicate() const {
        return predicates_;
    }

    std::optional<int32_t> GetBucketFilter() const {
        return bucket_filter_;
    }

    const std::vector<std::map<std::string, std::string>>& GetPartitionFilters() const {
        return partition_filters_;
    }

 private:
    std::shared_ptr<Predicate> predicates_;
    std::optional<int32_t> bucket_filter_;
    std::vector<std::map<std::string, std::string>> partition_filters_;
};

/// `ScanContextBuilder` used to build a `ScanContext`, has input validation.
class PAIMON_EXPORT ScanContextBuilder {
 public:
    /// Constructs a `ScanContextBuilder` with required parameters.
    /// @param path The root path of the table.
    explicit ScanContextBuilder(const std::string& path);

    ~ScanContextBuilder();

    /// If limit is not set, it defaults to unlimited.
    ScanContextBuilder& SetLimit(int32_t limit);

    /// Set a bucket filter to scan only specific bucket.
    ScanContextBuilder& SetBucketFilter(int32_t bucket_filter);

    /// partition_filters in vector is supposed to be OR, filter in map is supposed to be AND, e.g.,
    /// partition_filters is {{k1=1,k2=10}, {k1=2,k2=20}} => OR(AND(k1=1,k2=10), AND(k1=2,k2=20))
    ScanContextBuilder& SetPartitionFilter(
        const std::vector<std::map<std::string, std::string>>& partition_filters);

    /// Set a predicate for filtering data.
    ScanContextBuilder& SetPredicate(const std::shared_ptr<Predicate>& predicate);

    /// Sets the result of a global index search (e.g., row ids (may with scores) from a distributed
    /// index lookup). This is used to push down index-filtered row ids into the scan for efficient
    /// data retrieval.
    ScanContextBuilder& SetGlobalIndexResult(
        const std::shared_ptr<GlobalIndexResult>& global_index_result);

    /// The options added or set in `ScanContextBuilder` have high priority and will be merged with
    /// the options in table schema.
    ScanContextBuilder& AddOption(const std::string& key, const std::string& value);

    /// Set a configuration options map to set some option entries which are not defined in the
    /// table schema or whose values you want to overwrite.
    /// @note The options map will clear the options added by `AddOption()` before.
```
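The OR-of-AND semantics documented for `SetPartitionFilter` in the excerpt above can be illustrated with a small standalone check. `MatchesPartitionFilters` and `PartitionSpec` are hypothetical names, not part of the paimon API, and treating an empty filter list as "accept everything" is an assumption of this sketch.

```cpp
#include <map>
#include <string>
#include <vector>

using PartitionSpec = std::map<std::string, std::string>;

// Hypothetical helper: a partition matches if ANY filter map matches (OR),
// and a filter map matches only if ALL of its key/value pairs match (AND).
bool MatchesPartitionFilters(const PartitionSpec& partition,
                             const std::vector<PartitionSpec>& partition_filters) {
    if (partition_filters.empty()) {
        return true;  // assumption: no filters means no restriction
    }
    for (const PartitionSpec& filter : partition_filters) {
        bool all_match = true;
        for (const auto& kv : filter) {
            auto it = partition.find(kv.first);
            if (it == partition.end() || it->second != kv.second) {
                all_match = false;
                break;
            }
        }
        if (all_match) {
            return true;
        }
    }
    return false;
}
```

With the filters `{{k1=1,k2=10}, {k1=2,k2=20}}` from the doc comment, a partition `k1=1,k2=10` matches, while `k1=1,k2=20` does not because neither map is satisfied in full.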
Purpose
Linked issue: #38
Introduce:
- `OffsetGlobalIndexReader`: wraps any `GlobalIndexReader` and rewrites the row ids in its results by adding a fixed offset. This is how local row ids become global row ids.
- `UnionGlobalIndexReader`: wraps a list of `GlobalIndexReader`s and unions their results via `GlobalIndexResult::Or`. It can run sub-readers in parallel through an `Executor`, or sequentially when no executor is given.
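A rough sketch of how the two wrappers compose is shown below. It uses simplified, hypothetical stand-ins (`SketchReader`, `FixedReader`, `OffsetReader`, `UnionReader`, and a `std::set` of row ids in place of `GlobalIndexResult`); the real classes in this PR have different interfaces. Only the sequential path is shown, and the `Executor`-based parallel path is deliberately left out.

```cpp
#include <cstdint>
#include <memory>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a global index result: a set of matching row ids.
using RowIds = std::set<int64_t>;

// Hypothetical, heavily simplified reader interface.
class SketchReader {
 public:
    virtual ~SketchReader() = default;
    virtual RowIds Read() const = 0;
};

// Returns a fixed set of shard-local row ids; stands in for a real index reader.
class FixedReader : public SketchReader {
 public:
    explicit FixedReader(RowIds ids) : ids_(std::move(ids)) {}
    RowIds Read() const override { return ids_; }

 private:
    RowIds ids_;
};

// Rewrites local row ids to global ones by adding a fixed offset.
class OffsetReader : public SketchReader {
 public:
    OffsetReader(std::shared_ptr<SketchReader> inner, int64_t offset)
        : inner_(std::move(inner)), offset_(offset) {}

    RowIds Read() const override {
        RowIds out;
        for (int64_t id : inner_->Read()) {
            out.insert(id + offset_);
        }
        return out;
    }

 private:
    std::shared_ptr<SketchReader> inner_;
    int64_t offset_;
};

// Unions the results of all sub-readers, one after another (no executor).
class UnionReader : public SketchReader {
 public:
    explicit UnionReader(std::vector<std::shared_ptr<SketchReader>> readers)
        : readers_(std::move(readers)) {}

    RowIds Read() const override {
        RowIds out;
        for (const auto& reader : readers_) {
            RowIds part = reader->Read();
            out.insert(part.begin(), part.end());  // set union plays the role of Or
        }
        return out;
    }

 private:
    std::vector<std::shared_ptr<SketchReader>> readers_;
};
```

In this sketch, each shard reader is wrapped in an `OffsetReader` carrying the shard's starting global row id, and the wrapped readers are then handed to a single `UnionReader`, so callers see one reader that yields global row ids.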
Add integration tests for the btree global index.
Tests
GlobalIndexTest