Search before asking
Motivation
Paimon Java already provides a rich set of Blob capabilities beyond basic single-field read/write:
- Multiple Blob Fields per Table: Java's `MultipleBlobFileWriter` creates an independent blob file writer for each blob column in a row. When a table schema contains several BLOB columns (e.g., `image BLOB, video BLOB, embedding BLOB`), each column is written to its own blob file, and the normal data file stores only the non-blob columns. During reads, `DedicatedFormatRollingFileWriter` coordinates multiple blob file readers to merge results back into a complete row. In paimon-cpp, both `BlobFormatWriter` and `BlobFileBatchReader` enforce `num_fields == 1`, meaning only a single blob column is allowed per table.
- BlobView Mode (cross-table blob reference): Java supports `BlobView`, where a blob field does not store actual data but holds a `BlobViewStruct` that references a blob in another upstream table (identified by `Identifier` + `fieldId` + `rowId`). At read time, `BlobViewLookup` batch-preloads the `BlobDescriptor`s from the upstream table using row-range parallelism, and `BlobViewResolver` lazily resolves each `BlobView` into a readable `Blob`. This avoids duplicating large blob data across derived tables. paimon-cpp has no equivalent of `BlobView`, `BlobViewStruct`, or `BlobViewLookup`.
- Blob Descriptor-Inline Mode (`blob-descriptor-field`): Java allows certain blob fields to be designated as "descriptor fields" via the `blob-descriptor-field` table option. These blob fields are serialized as compact `BlobDescriptor` bytes (URI + offset + length) and stored inline within the normal data file (e.g., Parquet/ORC), rather than being split out to a separate `.blob` file. This is useful when the blob data is already externally managed (e.g., on OSS/S3) and only a lightweight reference needs to be persisted. `BlobType.fieldsInBlobFile()` / `fieldsNotInBlobFile()` partitions the schema accordingly. paimon-cpp does not implement this partitioning logic.
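To make the schema split concrete, here is a minimal C++ sketch of what a `fieldsInBlobFile()` / `fieldsNotInBlobFile()` style partition might look like. Everything in it (`FieldKind`, `Field`, `BlobPartition`, `PartitionBlobFields`) is a hypothetical, simplified stand-in and not the existing paimon-cpp or Java API; it only assumes that descriptor fields are identified by column name.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified field model; not the real paimon-cpp schema types.
enum class FieldKind { kInt, kString, kBlob };

struct Field {
    std::string name;
    FieldKind kind;
};

struct BlobPartition {
    std::vector<Field> in_blob_file;      // each would get its own blob file writer
    std::vector<Field> not_in_blob_file;  // stored in the normal data file
};

// Splits a schema into two groups. A BLOB column goes to a dedicated blob file
// unless it is listed as a descriptor field, in which case it stays inline
// with the non-blob columns.
BlobPartition PartitionBlobFields(
    const std::vector<Field>& schema,
    const std::unordered_set<std::string>& descriptor_fields) {
    BlobPartition result;
    for (const Field& field : schema) {
        const bool is_blob = field.kind == FieldKind::kBlob;
        const bool is_descriptor = descriptor_fields.count(field.name) > 0;
        if (is_blob && !is_descriptor) {
            result.in_blob_file.push_back(field);
        } else {
            result.not_in_blob_file.push_back(field);
        }
    }
    return result;
}
```

With a partition like this, a writer could create one blob file writer per entry in `in_blob_file` instead of asserting that exactly one blob column exists.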
These gaps prevent paimon-cpp users from:
- Defining tables with more than one BLOB column.
- Leveraging cross-table blob references for data lineage / deduplication use cases.
- Using lightweight descriptor-inline storage for externally managed blob data.
Solution
- Phase 1: Multiple Blob Fields Support
- Phase 2: Blob Descriptor-Inline Mode
- Phase 3: BlobView Mode
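For Phase 3, the read-time flow described in the Motivation (preload descriptors from the upstream table, then resolve each view on demand) could be sketched roughly as follows. The struct layouts, member names, and the `Preload` / `Resolve` methods are assumptions made for illustration; they are not the Java classes' actual signatures, and the sketch leaves out row-range parallelism and file I/O entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical stand-in: where the blob bytes actually live.
struct BlobDescriptor {
    std::string uri;
    int64_t offset;
    int64_t length;
};

// Hypothetical stand-in: a reference to a blob in an upstream table.
struct BlobViewStruct {
    std::string table_identifier;  // upstream table identifier
    int32_t field_id;
    int64_t row_id;

    // Ordering so the struct can be used as a std::map key.
    bool operator<(const BlobViewStruct& other) const {
        return std::tie(table_identifier, field_id, row_id) <
               std::tie(other.table_identifier, other.field_id, other.row_id);
    }
};

// Holds descriptors preloaded from the upstream table and resolves views lazily.
class BlobViewLookup {
 public:
    // Records the descriptor for one upstream (table, field, row) reference.
    void Preload(const BlobViewStruct& view, BlobDescriptor descriptor) {
        descriptors_.insert_or_assign(view, std::move(descriptor));
    }

    // Returns the descriptor for a view, or nullopt if it was never preloaded.
    std::optional<BlobDescriptor> Resolve(const BlobViewStruct& view) const {
        auto it = descriptors_.find(view);
        if (it == descriptors_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

 private:
    std::map<BlobViewStruct, BlobDescriptor> descriptors_;
};
```

Returning `std::optional` keeps a dangling reference (for example, an upstream row that no longer exists) distinguishable from a successful lookup; a real implementation would more likely surface this through the project's own status or result type.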
Anything else?
No response
Are you willing to submit a PR?