[Feature] Support multiple blob fields and BlobView/descriptor-inline mode in paimon-cpp #283

@lxy-9602

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Paimon Java already provides a rich set of Blob capabilities beyond basic single-field read/write:

  1. Multiple Blob Fields per Table — Java's MultipleBlobFileWriter creates an independent blob file writer for each blob column in a row. When a table schema contains several BLOB columns (e.g., image BLOB, video BLOB, embedding BLOB), each column is written to its own blob file, and the normal data file stores only the non-blob columns. During reads, DedicatedFormatRollingFileWriter coordinates multiple blob file readers to merge results back into a complete row. In paimon-cpp, both BlobFormatWriter and BlobFileBatchReader enforce num_fields == 1, meaning only a single blob column is allowed per table.

  2. BlobView Mode (cross-table blob reference) — Java supports BlobView, where a blob field does not store actual data but holds a BlobViewStruct that references a blob in another upstream table (identified by Identifier + fieldId + rowId). At read time, BlobViewLookup batch-preloads the BlobDescriptors from the upstream table using row-range parallelism, and BlobViewResolver lazily resolves each BlobView into a readable Blob. This avoids duplicating large blob data across derived tables. paimon-cpp has no equivalent of BlobView, BlobViewStruct, or BlobViewLookup.

  3. Blob Descriptor-Inline Mode (blob-descriptor-field) — Java allows certain blob fields to be designated as "descriptor fields" via the blob-descriptor-field table option. These blob fields are serialized as compact BlobDescriptor bytes (URI + offset + length) and stored inline within the normal data file (e.g., Parquet/ORC), rather than being split out to a separate .blob file. This is useful when the blob data is already externally managed (e.g., on OSS/S3) and only a lightweight reference needs to be persisted. BlobType.fieldsInBlobFile() / fieldsNotInBlobFile() partitions the schema accordingly. paimon-cpp does not implement this partitioning logic.
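To make the schema partitioning in items 1 and 3 concrete, here is a minimal sketch of how the split could look on the C++ side. All types and names below (`Field`, `FieldKind`, `BlobFieldSplit`, `SplitFields`) are hypothetical stand-ins, not existing paimon-cpp APIs, and the sketch assumes the `blob-descriptor-field` option has already been parsed into a set of field names.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; not paimon-cpp's real schema types.
enum class FieldKind { kNormal, kBlob };

struct Field {
    int id;
    std::string name;
    FieldKind kind;
};

struct BlobFieldSplit {
    std::vector<Field> in_blob_file;      // each field would get its own .blob file
    std::vector<Field> not_in_blob_file;  // stored in the normal data file
};

// Mirrors the idea behind fieldsInBlobFile() / fieldsNotInBlobFile():
// a BLOB field goes to a dedicated blob file unless it is listed as a
// descriptor field, in which case only its descriptor bytes stay inline.
// Assumption: descriptor_fields holds the already-parsed option value.
BlobFieldSplit SplitFields(const std::vector<Field>& schema,
                           const std::set<std::string>& descriptor_fields) {
    BlobFieldSplit split;
    for (const Field& field : schema) {
        const bool is_descriptor = descriptor_fields.count(field.name) > 0;
        if (field.kind == FieldKind::kBlob && !is_descriptor) {
            split.in_blob_file.push_back(field);
        } else {
            split.not_in_blob_file.push_back(field);
        }
    }
    return split;
}
```

With a schema of `id`, `image BLOB`, `video BLOB`, and `thumb BLOB`, where `thumb` is a descriptor field, this would route `image` and `video` to separate blob files and keep `id` and `thumb` in the normal data file. Relaxing the `num_fields == 1` check would then amount to creating one writer or reader per entry in `in_blob_file`.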

These gaps prevent paimon-cpp users from:

  • Defining tables with more than one BLOB column.
  • Leveraging cross-table blob references for data lineage / deduplication use cases.
  • Using lightweight descriptor-inline storage for externally managed blob data.

Solution

  • Phase 1: Multiple Blob Fields Support
  • Phase 2: Blob Descriptor-Inline Mode
  • Phase 3: BlobView Mode
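For Phase 3, a rough sketch of the data involved may help frame the discussion. It models a view reference (table identifier + field id + row id), a descriptor (URI + offset + length), and a cache that is preloaded in batch and resolved lazily. Every name here (`BlobViewRef`, `BlobDescriptor`, `BlobViewCache`) is hypothetical and chosen for illustration; the actual upstream-table scan is out of scope, and C++17 is assumed.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical: lightweight pointer to blob bytes stored elsewhere.
struct BlobDescriptor {
    std::string uri;
    int64_t offset;
    int64_t length;
};

// Hypothetical: identifies a blob cell in an upstream table.
struct BlobViewRef {
    std::string table;  // stands in for the Java Identifier
    int32_t field_id;
    int64_t row_id;

    bool operator<(const BlobViewRef& other) const {
        return std::tie(table, field_id, row_id) <
               std::tie(other.table, other.field_id, other.row_id);
    }
};

// Hypothetical: holds descriptors loaded in batch, resolved on demand.
class BlobViewCache {
 public:
    // Stores descriptors fetched from the upstream table (fetch not shown).
    void Preload(const std::vector<std::pair<BlobViewRef, BlobDescriptor>>& batch) {
        for (const auto& entry : batch) {
            cache_.insert_or_assign(entry.first, entry.second);
        }
    }

    // Returns the descriptor for a view, or nullopt if it was never preloaded.
    std::optional<BlobDescriptor> Resolve(const BlobViewRef& ref) const {
        auto it = cache_.find(ref);
        if (it == cache_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

 private:
    std::map<BlobViewRef, BlobDescriptor> cache_;
};
```

Returning `std::optional` keeps a missing upstream row distinguishable from an empty blob; whether paimon-cpp should instead surface that case as an error status is an open design question.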

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Labels: enhancement
