feat(blob): Support blob-descriptor-field for inline blob descriptor storage #300

Open
lxy-9602 wants to merge 12 commits into alibaba:main from lxy-9602:add-blob-desc-field

lxy-9602 commented May 25, 2026

Purpose

Linked issue: #283

Description:

This PR implements the blob-descriptor-field feature in paimon-cpp, allowing blob fields to be stored as inline BlobDescriptor bytes in the main data file, with optional external storage support.

Key Changes

  • External Storage Blob Writer: New ExternalStorageBlobWriter that writes configured blob fields to external storage and replaces column data with serialized BlobDescriptor bytes in the main file.

  • blob-as-descriptor semantics aligned with Java: The write path auto-detects descriptors via the magic header (no config needed). BLOB_AS_DESCRIPTOR now controls read behavior only.

  • Inline blob validation at framework layer: AppendOnlyWriter::Write validates that inline descriptor fields contain valid BlobDescriptor data before writing, replacing per-format validation.

  • BlobFormatWriter simplification: Removed the blob_as_descriptor parameter; the writer now uses BlobDescriptor::IsBlobDescriptor() for dynamic detection.

  • LARGE_BINARY support across all formats: Added arrow::LARGE_BINARY handling in the ORC adapter, Parquet reader, Avro schema converter, Avro encoder, and Avro stats extractor.
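
To illustrate the magic-header detection and inline validation described above, here is a minimal standalone sketch. It is not the code in this PR: the 4-byte magic `BDSC`, the helper names `LooksLikeDescriptor` and `FirstInvalidRow`, and the choice to let null values pass validation are all assumptions made for illustration. The real check is `BlobDescriptor::IsBlobDescriptor()`, whose header layout is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Hypothetical 4-byte magic header; the real BlobDescriptor magic may differ.
constexpr unsigned char kSketchMagic[4] = {'B', 'D', 'S', 'C'};

// Returns true when `bytes` is long enough and starts with the sketch magic.
inline bool LooksLikeDescriptor(const std::string& bytes) {
    return bytes.size() >= sizeof(kSketchMagic) &&
           std::memcmp(bytes.data(), kSketchMagic, sizeof(kSketchMagic)) == 0;
}

// Validates one inline descriptor column before it is written.
// Assumption: null values are accepted; every non-null value must carry
// the magic header. Returns the index of the first offending row, or
// std::nullopt when the whole column is valid.
inline std::optional<std::size_t> FirstInvalidRow(
    const std::vector<std::optional<std::string>>& column) {
    for (std::size_t i = 0; i < column.size(); ++i) {
        if (column[i].has_value() && !LooksLikeDescriptor(*column[i])) {
            return i;
        }
    }
    return std::nullopt;
}
```

Because detection depends only on the bytes themselves, a writer built this way needs no configuration flag to decide whether a value is a descriptor or a raw payload, which matches the intent of dropping `blob_as_descriptor` from the write path.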

Tests

  • BlobUtilsTest.ValidateInlineBlobDescriptorsEmptyFields
  • BlobUtilsTest.ValidateInlineBlobDescriptorsNullArray
  • BlobUtilsTest.ValidateInlineBlobDescriptorsFieldNotPresent
  • BlobUtilsTest.ValidateInlineBlobDescriptorsWithValidDescriptor
  • BlobUtilsTest.ValidateInlineBlobDescriptorsWithNullValue
  • BlobUtilsTest.ValidateInlineBlobDescriptorsWithRawBytes
  • BlobUtilsTest.ValidateInlineBlobDescriptorsMixedValidAndInvalid
  • BlobUtilsTest.ValidateInlineBlobDescriptorsMultipleFields
  • AvroFileBatchReaderTest.TestReadBinaryWrittenFromBinaryAndLargeBinary
  • OrcFileBatchReaderTest.TestReadBinaryWrittenFromBinaryAndLargeBinary
  • ParquetFileBatchReaderTest.TestReadBinaryWrittenFromBinaryAndLargeBinary
  • ExternalStorageBlobWriterTest.*
  • BlobTableInteTest.TestBlobDescriptorFieldWithoutExternalStorage
  • BlobTableInteTest.TestBlobDescriptorFieldWithExternalStorage
  • BlobTableInteTest.TestBlobDescriptorFieldPartialExternalStorage
  • BlobTableInteTest.TestBlobDescriptorFieldPartialInline
  • BlobTableInteTest.TestBlobDescriptorFieldPartialExternalStorageRepack
  • BlobTableInteTest.TestBlobDescriptorFieldPartialExternalStorageSingleField
  • BlobTableInteTest.TestBlobDescriptorFieldPartialExternalStorageNoAsDescriptor
  • BlobTableInteTest.TestBlobDescriptorFieldWriteRawBytesDirectly
  • BlobTableInteTest.TestBlobDescriptorMultiCommitAndShuffledReadSchema
  • BlobTableInteTest.TestDataEvolutionWithBlobDescriptorField

API and Format

Documentation

Generative AI tooling

Generated-by: Claude-4.7-Opus

Copilot AI left a comment

Pull request overview

This PR adds “blob-descriptor-field” support so selected BLOB columns can be stored inline as serialized BlobDescriptor bytes in the main data files, optionally writing the underlying blob payloads to an external storage path. It also aligns write-path descriptor detection with Java semantics (auto-detect via magic header) and adds LARGE_BINARY support across ORC/Parquet/Avro.

Changes:

  • Introduces external-storage descriptor writing (ExternalStorageBlobWriter) and updates append-only write flow to validate inline descriptor fields before writing.
  • Refactors blob writing to auto-detect descriptor vs raw bytes at write time; blob-as-descriptor now controls read behavior only.
  • Adds arrow::large_binary() support in ORC adapter, Parquet/ORC/Avro readers, Avro encoding/schema/stats, plus related tests.
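
The external-storage transform summarized above can be pictured roughly as follows. This is a sketch under stated assumptions, not the `ExternalStorageBlobWriter` implementation: `SketchDescriptor`, its serialized layout (magic, URI length, URI, offset, length, all little-endian), and the in-memory string standing in for the external file are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical descriptor: where a payload lives in external storage.
struct SketchDescriptor {
    std::string uri;
    std::uint64_t offset = 0;
    std::uint64_t length = 0;
};

// Appends the low `num_bytes` bytes of `value` in little-endian order.
inline void AppendLittleEndian(std::string* out, std::uint64_t value, int num_bytes) {
    for (int i = 0; i < num_bytes; ++i) {
        out->push_back(static_cast<char>((value >> (8 * i)) & 0xFF));
    }
}

// Assumed layout: "BDSC" | uri size (4 bytes) | uri | offset (8) | length (8).
inline std::string SerializeSketch(const SketchDescriptor& d) {
    std::string out = "BDSC";
    AppendLittleEndian(&out, static_cast<std::uint64_t>(d.uri.size()), 4);
    out += d.uri;
    AppendLittleEndian(&out, d.offset, 8);
    AppendLittleEndian(&out, d.length, 8);
    return out;
}

// Appends each non-null payload to `external_file` (a stand-in for the
// external blob file) and overwrites the column value with the serialized
// descriptor pointing at it. Null values are left untouched.
inline void ExternalizeColumn(const std::string& uri, std::string* external_file,
                              std::vector<std::optional<std::string>>* column) {
    for (auto& value : *column) {
        if (!value.has_value()) continue;
        SketchDescriptor d;
        d.uri = uri;
        d.offset = static_cast<std::uint64_t>(external_file->size());
        d.length = static_cast<std::uint64_t>(value->size());
        external_file->append(*value);
        value = SerializeSketch(d);
    }
}
```

After the transform, the main data file only carries small descriptor values, while the payload bytes sit contiguously in the external file and can be located by offset and length.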

Reviewed changes

Copilot reviewed 48 out of 48 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/inte/blob_table_inte_test.cpp | Expands integration tests for descriptor-inline + external storage behaviors and metadata expectations. |
| src/paimon/testing/utils/test_helper.h | Updates blob schema/array separation calls to new inline-aware APIs. |
| src/paimon/format/parquet/parquet_file_batch_reader_test.cpp | Adds coverage for reading BINARY/LARGE_BINARY written data as binary. |
| src/paimon/format/orc/orc_file_batch_reader_test.cpp | Adds coverage for reading BINARY/LARGE_BINARY written data as binary. |
| src/paimon/format/orc/orc_adapter.cpp | Adds ORC write/type mapping support for LARGE_BINARY. |
| src/paimon/format/orc/orc_adapter_test.cpp | Extends ORC adapter tests for large_binary support. |
| src/paimon/format/blob/blob_writer_builder.h | Adds WriteConsumer injection support; removes config-driven descriptor write behavior. |
| src/paimon/format/blob/blob_writer_builder_test.cpp | Tests WriteConsumer wiring from builder into writer. |
| src/paimon/format/blob/blob_format_writer.h | Adds WriteConsumer; removes blob_as_descriptor from writer API. |
| src/paimon/format/blob/blob_format_writer.cpp | Implements descriptor auto-detection and consumer callback descriptor emission. |
| src/paimon/format/blob/blob_format_writer_test.cpp | Updates tests for new API and validates consumer-received descriptors. |
| src/paimon/format/blob/blob_file_batch_reader_test.cpp | Updates writer creation for new BlobFormatWriter API. |
| src/paimon/format/avro/avro_stats_extractor.cpp | Treats LARGE_BINARY like BINARY in stats extraction. |
| src/paimon/format/avro/avro_schema_converter.cpp | Maps LARGE_BINARY to Avro bytes schema. |
| src/paimon/format/avro/avro_file_batch_reader_test.cpp | Adds BINARY/LARGE_BINARY cross-write/read test coverage. |
| src/paimon/format/avro/avro_direct_encoder.cpp | Encodes Avro BYTES from both binary and large_binary arrays. |
| src/paimon/core/utils/field_mapping.h | Extends field mapping creation with blob-inline field awareness. |
| src/paimon/core/utils/field_mapping.cpp | Converts configured inline blob fields to binary for file-schema mapping. |
| src/paimon/core/utils/field_mapping_test.cpp | Adds tests for inline blob field conversion behavior. |
| src/paimon/core/schema/schema_validation.cpp | Simplifies blob option presence check for validation entry. |
| src/paimon/core/operation/file_store_scan.cpp | Passes inline blob fields into field mapping creation. |
| src/paimon/core/operation/append_only_file_store_write.h | Removes blob-driven compaction disable flag from state. |
| src/paimon/core/operation/append_only_file_store_write.cpp | Removes blob-driven compaction disable path (see review comment). |
| src/paimon/core/operation/abstract_split_read.cpp | Builds field mapping with inline blob fields derived from per-file schema options. |
| src/paimon/core/io/rolling_blob_file_writer.h | Adds inline-field awareness to rolling blob writer. |
| src/paimon/core/io/rolling_blob_file_writer.cpp | Uses inline-aware blob separation when writing main vs blob arrays. |
| src/paimon/core/io/field_mapping_reader_test.cpp | Adds test ensuring inline blob columns can be read as binary from data files. |
| src/paimon/core/io/external_storage_blob_writer.h | Adds new writer to externalize blob payload and inline-serialize descriptors. |
| src/paimon/core/io/external_storage_blob_writer.cpp | Implements batch transform by writing external blobs and replacing columns with descriptors. |
| src/paimon/core/io/external_storage_blob_writer_test.cpp | Adds tests for transform/abort behavior of external storage writer. |
| src/paimon/core/io/data_file_path_factory.h | Adds helper to generate external-storage blob paths. |
| src/paimon/core/io/data_file_path_factory_test.cpp | Adds tests for uniqueness/format of external-storage blob paths. |
| src/paimon/core/casting/cast_executor_test.cpp | Adds tests for binary→blob array cast behavior and constraints. |
| src/paimon/core/casting/cast_executor_factory.cpp | Registers new binary→blob cast executor. |
| src/paimon/core/casting/cast_executor_factory_test.cpp | Adds factory registration test for binary→blob executor. |
| src/paimon/core/casting/binary_to_blob_cast_executor.h | Introduces cast executor interface for binary→(large)blob conversion. |
| src/paimon/core/casting/binary_to_blob_cast_executor.cpp | Implements zero-copy-ish buffer reuse with widened offset buffer. |
| src/paimon/core/append/append_only_writer.h | Adds external-storage writer and inline-descriptor validation state. |
| src/paimon/core/append/append_only_writer.cpp | Applies external storage transform + inline validation in write path; inline-aware blob writer creation. |
| src/paimon/common/data/blob.cpp | Extends Blob::ArrowField to support nullable fields. |
| src/paimon/common/data/blob_utils.h | Adds inline-aware blob separation + inline descriptor validation API. |
| src/paimon/common/data/blob_utils.cpp | Implements inline-aware schema/array separation and inline descriptor validation. |
| src/paimon/common/data/blob_utils_test.cpp | Adds tests for inline separation and descriptor validation. |
| src/paimon/common/data/blob_test.cpp | Updates tests for nullable Blob::ArrowField. |
| src/paimon/common/data/blob_descriptor.h | Exports BlobDescriptor symbol for external use. |
| src/paimon/CMakeLists.txt | Adds new core sources/tests for external storage writer and cast executor. |
| include/paimon/defs.h | Updates option doc comment to reflect read-only blob-as-descriptor semantics. |
| include/paimon/data/blob.h | Updates public API for nullable Blob::ArrowField. |
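
For context on the "widened offset buffer" noted for binary_to_blob_cast_executor.cpp: Arrow's binary layout uses 32-bit value offsets, while large_binary uses 64-bit offsets over the same kind of data buffer. The sketch below shows the idea with plain standard-library types rather than Arrow buffers. `BinaryColumn`, `LargeBinaryColumn`, and `ToLargeBinary` are hypothetical names, and the actual executor may handle nulls, slices, and buffer ownership differently.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins for a variable-length binary column:
// value i occupies data[offsets[i], offsets[i + 1]).
struct BinaryColumn {
    std::shared_ptr<const std::string> data;
    std::vector<std::int32_t> offsets;  // 32-bit offsets
};

struct LargeBinaryColumn {
    std::shared_ptr<const std::string> data;
    std::vector<std::int64_t> offsets;  // 64-bit offsets
};

// Reuses the data buffer as-is and rebuilds only the offsets with a wider
// integer type, so payload bytes are never copied.
inline LargeBinaryColumn ToLargeBinary(const BinaryColumn& in) {
    LargeBinaryColumn out;
    out.data = in.data;  // shared, not copied
    out.offsets.assign(in.offsets.begin(), in.offsets.end());
    return out;
}
```

Only the offsets are rewritten, so the cost scales with the number of rows rather than the total payload size, which is presumably why the summary calls the cast "zero-copy-ish".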


3 participants