feat(blob): Support blob-descriptor-field for inline blob descriptor storage #300
Open
lxy-9602 wants to merge 12 commits into
Pull request overview
This PR adds “blob-descriptor-field” support so selected BLOB columns can be stored inline as serialized BlobDescriptor bytes in the main data files, optionally writing the underlying blob payloads to an external storage path. It also aligns write-path descriptor detection with Java semantics (auto-detect via magic header) and adds LARGE_BINARY support across ORC/Parquet/Avro.
Changes:
- Introduces external-storage descriptor writing (`ExternalStorageBlobWriter`) and updates append-only write flow to validate inline descriptor fields before writing.
- Refactors blob writing to auto-detect descriptor vs raw bytes at write time; `blob-as-descriptor` now controls read behavior only.
- Adds `arrow::large_binary()` support in ORC adapter, Parquet/ORC/Avro readers, Avro encoding/schema/stats, plus related tests.
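The write-time auto-detection and the inline descriptor validation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the 4-byte magic value, the function names `LooksLikeDescriptor`, `Classify`, and `FirstInvalidInlineRow`, and the choice to let null values pass validation are all assumptions made for the example. The real check lives behind `BlobDescriptor::IsBlobDescriptor()`, whose header layout is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// ASSUMPTION: a hypothetical 4-byte magic ("BLOB"). The actual BlobDescriptor
// magic header used by paimon-cpp is not shown in this PR page.
constexpr std::uint8_t kAssumedMagic[4] = {0x42, 0x4C, 0x4F, 0x42};

// Returns true when `bytes` is long enough and starts with the assumed magic.
bool LooksLikeDescriptor(const std::string& bytes) {
    return bytes.size() >= sizeof(kAssumedMagic) &&
           std::memcmp(bytes.data(), kAssumedMagic, sizeof(kAssumedMagic)) == 0;
}

enum class BlobKind { kDescriptor, kRawBytes };

// Write-time auto-detection: no config flag is consulted, only the payload.
BlobKind Classify(const std::string& bytes) {
    return LooksLikeDescriptor(bytes) ? BlobKind::kDescriptor : BlobKind::kRawBytes;
}

// Validates one inline descriptor column. A nullptr entry models a null value
// and is accepted (ASSUMPTION); every non-null value must look like a
// descriptor. Returns the index of the first offending row, or -1 if none.
long FirstInvalidInlineRow(const std::vector<const std::string*>& column) {
    for (std::size_t i = 0; i < column.size(); ++i) {
        if (column[i] != nullptr && !LooksLikeDescriptor(*column[i])) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

A caller in the write path would run the validation once per batch before handing data to the format writer, and reject the batch when the returned index is not -1.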
Reviewed changes
Copilot reviewed 48 out of 48 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/inte/blob_table_inte_test.cpp | Expands integration tests for descriptor-inline + external storage behaviors and metadata expectations. |
| src/paimon/testing/utils/test_helper.h | Updates blob schema/array separation calls to new inline-aware APIs. |
| src/paimon/format/parquet/parquet_file_batch_reader_test.cpp | Adds coverage for reading BINARY/LARGE_BINARY written data as binary. |
| src/paimon/format/orc/orc_file_batch_reader_test.cpp | Adds coverage for reading BINARY/LARGE_BINARY written data as binary. |
| src/paimon/format/orc/orc_adapter.cpp | Adds ORC write/type mapping support for LARGE_BINARY. |
| src/paimon/format/orc/orc_adapter_test.cpp | Extends ORC adapter tests for large_binary support. |
| src/paimon/format/blob/blob_writer_builder.h | Adds WriteConsumer injection support; removes config-driven descriptor write behavior. |
| src/paimon/format/blob/blob_writer_builder_test.cpp | Tests WriteConsumer wiring from builder into writer. |
| src/paimon/format/blob/blob_format_writer.h | Adds WriteConsumer; removes blob_as_descriptor from writer API. |
| src/paimon/format/blob/blob_format_writer.cpp | Implements descriptor auto-detection and consumer callback descriptor emission. |
| src/paimon/format/blob/blob_format_writer_test.cpp | Updates tests for new API and validates consumer-received descriptors. |
| src/paimon/format/blob/blob_file_batch_reader_test.cpp | Updates writer creation for new BlobFormatWriter API. |
| src/paimon/format/avro/avro_stats_extractor.cpp | Treats LARGE_BINARY like BINARY in stats extraction. |
| src/paimon/format/avro/avro_schema_converter.cpp | Maps LARGE_BINARY to Avro bytes schema. |
| src/paimon/format/avro/avro_file_batch_reader_test.cpp | Adds BINARY/LARGE_BINARY cross-write/read test coverage. |
| src/paimon/format/avro/avro_direct_encoder.cpp | Encodes Avro BYTES from both binary and large_binary arrays. |
| src/paimon/core/utils/field_mapping.h | Extends field mapping creation with blob-inline field awareness. |
| src/paimon/core/utils/field_mapping.cpp | Converts configured inline blob fields to binary for file-schema mapping. |
| src/paimon/core/utils/field_mapping_test.cpp | Adds tests for inline blob field conversion behavior. |
| src/paimon/core/schema/schema_validation.cpp | Simplifies blob option presence check for validation entry. |
| src/paimon/core/operation/file_store_scan.cpp | Passes inline blob fields into field mapping creation. |
| src/paimon/core/operation/append_only_file_store_write.h | Removes blob-driven compaction disable flag from state. |
| src/paimon/core/operation/append_only_file_store_write.cpp | Removes blob-driven compaction disable path (see review comment). |
| src/paimon/core/operation/abstract_split_read.cpp | Builds field mapping with inline blob fields derived from per-file schema options. |
| src/paimon/core/io/rolling_blob_file_writer.h | Adds inline-field awareness to rolling blob writer. |
| src/paimon/core/io/rolling_blob_file_writer.cpp | Uses inline-aware blob separation when writing main vs blob arrays. |
| src/paimon/core/io/field_mapping_reader_test.cpp | Adds test ensuring inline blob columns can be read as binary from data files. |
| src/paimon/core/io/external_storage_blob_writer.h | Adds new writer to externalize blob payload and inline-serialize descriptors. |
| src/paimon/core/io/external_storage_blob_writer.cpp | Implements batch transform by writing external blobs and replacing columns with descriptors. |
| src/paimon/core/io/external_storage_blob_writer_test.cpp | Adds tests for transform/abort behavior of external storage writer. |
| src/paimon/core/io/data_file_path_factory.h | Adds helper to generate external-storage blob paths. |
| src/paimon/core/io/data_file_path_factory_test.cpp | Adds tests for uniqueness/format of external-storage blob paths. |
| src/paimon/core/casting/cast_executor_test.cpp | Adds tests for binary→blob array cast behavior and constraints. |
| src/paimon/core/casting/cast_executor_factory.cpp | Registers new binary→blob cast executor. |
| src/paimon/core/casting/cast_executor_factory_test.cpp | Adds factory registration test for binary→blob executor. |
| src/paimon/core/casting/binary_to_blob_cast_executor.h | Introduces cast executor interface for binary→(large)blob conversion. |
| src/paimon/core/casting/binary_to_blob_cast_executor.cpp | Implements zero-copy-ish buffer reuse with widened offset buffer. |
| src/paimon/core/append/append_only_writer.h | Adds external-storage writer and inline-descriptor validation state. |
| src/paimon/core/append/append_only_writer.cpp | Applies external storage transform + inline validation in write path; inline-aware blob writer creation. |
| src/paimon/common/data/blob.cpp | Extends Blob::ArrowField to support nullable fields. |
| src/paimon/common/data/blob_utils.h | Adds inline-aware blob separation + inline descriptor validation API. |
| src/paimon/common/data/blob_utils.cpp | Implements inline-aware schema/array separation and inline descriptor validation. |
| src/paimon/common/data/blob_utils_test.cpp | Adds tests for inline separation and descriptor validation. |
| src/paimon/common/data/blob_test.cpp | Updates tests for nullable Blob::ArrowField. |
| src/paimon/common/data/blob_descriptor.h | Exports BlobDescriptor symbol for external use. |
| src/paimon/CMakeLists.txt | Adds new core sources/tests for external storage writer and cast executor. |
| include/paimon/defs.h | Updates option doc comment to reflect read-only blob-as-descriptor semantics. |
| include/paimon/data/blob.h | Updates public API for nullable Blob::ArrowField. |
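The "buffer reuse with widened offset buffer" noted for `binary_to_blob_cast_executor.cpp` can be pictured with the sketch below. It assumes the cast turns 32-bit value offsets into 64-bit offsets while sharing the underlying data buffer. The structs `BinaryColumn` and `LargeBinaryColumn` and the functions `ToLargeBinary` and `ValueAt` are hypothetical stand-ins built on the standard library, not Arrow or paimon-cpp types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a binary array: value i spans
// data[offsets[i], offsets[i + 1]) with 32-bit offsets.
struct BinaryColumn {
    std::shared_ptr<const std::string> data;
    std::vector<std::int32_t> offsets;
};

// Same layout with 64-bit offsets, as a large_binary array would use.
struct LargeBinaryColumn {
    std::shared_ptr<const std::string> data;
    std::vector<std::int64_t> offsets;
};

// Shares the data buffer (no byte copy) and allocates only a new,
// widened offset buffer.
LargeBinaryColumn ToLargeBinary(const BinaryColumn& in) {
    LargeBinaryColumn out;
    out.data = in.data;
    out.offsets.assign(in.offsets.begin(), in.offsets.end());
    return out;
}

// Reads value i back out of the widened column.
std::string ValueAt(const LargeBinaryColumn& col, std::size_t i) {
    assert(i + 1 < col.offsets.size());
    const auto begin = static_cast<std::size_t>(col.offsets[i]);
    const auto end = static_cast<std::size_t>(col.offsets[i + 1]);
    return col.data->substr(begin, end - begin);
}
```

Only the offsets are rewritten, so the cost scales with the number of rows rather than the total payload size, which is presumably why the description calls it "zero-copy-ish".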
lszskye
reviewed
May 29, 2026
Purpose
Linked issue: #283
Description:
This PR implements the `blob-descriptor-field` feature in paimon-cpp, allowing blob fields to be stored as inline `BlobDescriptor` bytes in the main data file, with optional external storage support.

Key Changes
- External Storage Blob Writer: New `ExternalStorageBlobWriter` that writes configured blob fields to external storage and replaces column data with serialized `BlobDescriptor` bytes in the main file.
- `blob-as-descriptor` semantics aligned with Java: Write path auto-detects descriptor via magic header (no config needed). `BLOB_AS_DESCRIPTOR` only controls read behavior.
- Inline blob validation at framework layer: `AppendOnlyWriter::Write` validates that inline descriptor fields contain valid `BlobDescriptor` data before writing, replacing per-format validation.
- BlobFormatWriter simplification: Removed the `blob_as_descriptor` parameter; the writer now uses `BlobDescriptor::IsBlobDescriptor()` for dynamic detection.
- `LARGE_BINARY` support across all formats: Added `arrow::LARGE_BINARY` handling in ORC adapter, Parquet reader, Avro schema converter, Avro encoder, and Avro stats extractor.

Tests
- `BlobUtilsTest.ValidateInlineBlobDescriptorsEmptyFields`
- `BlobUtilsTest.ValidateInlineBlobDescriptorsNullArray`
- `BlobUtilsTest.ValidateInlineBlobDescriptorsFieldNotPresent`
- `BlobUtilsTest.ValidateInlineBlobDescriptorsWithValidDescriptor`
- `BlobUtilsTest.ValidateInlineBlobDescriptorsWithNullValue`
- `BlobUtilsTest.ValidateInlineBlobDescriptorsWithRawBytes`
- `BlobUtilsTest.ValidateInlineBlobDescriptorsMixedValidAndInvalid`
- `BlobUtilsTest.ValidateInlineBlobDescriptorsMultipleFields`
- `AvroFileBatchReaderTest.TestReadBinaryWrittenFromBinaryAndLargeBinary`
- `OrcFileBatchReaderTest.TestReadBinaryWrittenFromBinaryAndLargeBinary`
- `ParquetFileBatchReaderTest.TestReadBinaryWrittenFromBinaryAndLargeBinary`
- `ExternalStorageBlobWriterTest.*`
- `BlobTableInteTest.TestBlobDescriptorFieldWithoutExternalStorage`
- `BlobTableInteTest.TestBlobDescriptorFieldWithExternalStorage`
- `BlobTableInteTest.TestBlobDescriptorFieldPartialExternalStorage`
- `BlobTableInteTest.TestBlobDescriptorFieldPartialInline`
- `BlobTableInteTest.TestBlobDescriptorFieldPartialExternalStorageRepack`
- `BlobTableInteTest.TestBlobDescriptorFieldPartialExternalStorageSingleField`
- `BlobTableInteTest.TestBlobDescriptorFieldPartialExternalStorageNoAsDescriptor`
- `BlobTableInteTest.TestBlobDescriptorFieldWriteRawBytesDirectly`
- `BlobTableInteTest.TestBlobDescriptorMultiCommitAndShuffledReadSchema`
- `BlobTableInteTest.TestDataEvolutionWithBlobDescriptorField`

API and Format
Documentation
Generative AI tooling
Generated-by: Claude-4.7-Opus