Core: Fix data loss in partial variant shredding#15087
Core: Fix data loss in partial variant shredding#15087amogh-jahagirdar merged 6 commits intoapache:mainfrom
Conversation
|
@huaxingao @aihuaxu could you guys please take a look? |
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Outdated
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Outdated
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/variants/ShreddedObject.java
Outdated
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Show resolved
Hide resolved
parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
Show resolved
Hide resolved
| records.add(record); | ||
| } | ||
|
|
||
| // Shredding function that only shreds the "id" field |
There was a problem hiding this comment.
This is what i'm not sure about, this shouldn't be relevant to the fix correct? We would lose the unshredded fields even if no transform is applied. The Transform in the writer isn't relevant to the broken serialization method correct?
There was a problem hiding this comment.
@RussellSpitzer I may not have enough knowledge about iceberg code, and pardon me if I'm wrong because I just do some tests and observe results, and try to connect them and find reasonable explanation. Maybe my explanation is not the root cause.
I think it's relevant to the fix. The bug is not "we lose the unshredded fields", but is "we will lose shredded fields if we partially shred some fields". If I write the test case like following "shred all fields", then the test case works.
VariantShreddingFunction partialShredding =
(id, name) -> {
- VariantMetadata shreddedMetadata = Variants.metadata("id");
- ShreddedObject shreddedObject = Variants.object(shreddedMetadata);
- shreddedObject.put("id", Variants.of(1234L));
- return ParquetVariantUtil.toParquetSchema(shreddedObject);
+ if (name.equals("var")) {
+ ShreddedObject obj = Variants.object(metadata);
+ obj.put("id", Variants.of(1000L));
+ obj.put("name", Variants.of("user"));
+ obj.put("city", Variants.of("city"));
+ return ParquetVariantUtil.toParquetSchema(obj);
+ }
+ return null;
};But if I just do partially shredding, then exception happens.
Actually, the fix should be following, only to shred the id field in variant. But I guess it does not matter much because id field of primitive type does not support shredding.
--- a/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
+++ b/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java
@@ -302,13 +302,16 @@ public class TestVariantWriters {
records.add(record);
}
- // Shredding function that only shreds the "id" field
+ // Shredding function that only shreds the "id" field in the variant
VariantShreddingFunction partialShredding =
(id, name) -> {
- VariantMetadata shreddedMetadata = Variants.metadata("id");
- ShreddedObject shreddedObject = Variants.object(shreddedMetadata);
- shreddedObject.put("id", Variants.of(1234L));
- return ParquetVariantUtil.toParquetSchema(shreddedObject);
+ if (name.equals("var")) {
+ VariantMetadata shreddedMetadata = Variants.metadata("id");
+ ShreddedObject shreddedObject = Variants.object(shreddedMetadata);
+ shreddedObject.put("id", Variants.of(1234L));
+ return ParquetVariantUtil.toParquetSchema(shreddedObject);
+ }
+ return null;
};There was a problem hiding this comment.
My question was whether -
// Shredding function that only shreds the "id" field
VariantShreddingFunction partialShredding =
(id, name) -> null; // No ShreddingWould also be broken, but it looks like it isn't from my internal test. So It must be going down a different serialization path I guess
There was a problem hiding this comment.
OK I think I get it, the test code is essentially creating a fully shredded object at the top with
obj.put("id", Variants.of(1000L + i));
obj.put("name", Variants.of("user_" + i));
obj.put("city", Variants.of("city_" + i));
Then this code is transforming that object into a partially shredded one
There was a problem hiding this comment.
yes, tranformig that object into a partially shredded one.
creating a fully shredded object
I'm not sure if I want to create a fully shredded object at this moment, although the returned type here is indeed called ShreddedObject.
ShreddedObject obj = Variants.object(metadata);
obj.put("id", Variants.of(1000L + i));
obj.put("name", Variants.of("user_" + i));
obj.put("city", Variants.of("city_" + i));
I'm not sure this is the right way(or expected way) to construct a variant, or should I construct a SerializedObject like what you wrote in this issue (#15086)
|
@rdblue @RussellSpitzer @huaxingao do you have more comments or is it okay to merge this pr? |
|
Thanks @dirtysalt , I did another pass and the new tests look right to me. I'm going to go ahead and merge. Thank you @aihuaxu @huaxingao @rdblue @RussellSpitzer for reviewing. |
When using variantShreddingFunc to partially shred variant fields, unshredded fields were being lost during serialization. The bug was in ShreddedObject constructor where local variable
shreddedFieldsshadowed the instance fieldthis.shreddedFields, causing unshredded fields to be added to the local map instead of the instance field.This resulted in the binary value field containing only metadata headers without actual field data, causing IndexOutOfBoundsException on read and permanent data loss.
Fix: Changed all references in the problematic code block to use
this.shreddedFieldsexplicitly, ensuring unshredded fields are properly preserved in the instance field and serialized correctly.Added test case testPartialShreddingWithShreddedObject that reproduces the exact scenario from issue #15086.