Data: Add TCK tests for ReadBuilder in BaseFormatModelTests #15633
Guosmilesmile wants to merge 2 commits into apache:main
Conversation
      throws IOException {

    // Orc does not support batch reading.
    assumeThat(fileFormat != FileFormat.ORC).isTrue();
iceberg/orc/src/main/java/org/apache/iceberg/orc/ORCFormatModel.java
Lines 265 to 268 in a50c2d9
I found that ORC internally does not have reuseContainers, and there is no way to set or use it, so I skip it here.
      throws IOException {

    // Avro does not support batch reading.
    assumeThat(fileFormat != FileFormat.AVRO).isTrue();
iceberg/core/src/main/java/org/apache/iceberg/avro/AvroFormatModel.java
Lines 251 to 253 in a50c2d9
Avro does not support recordsPerBatch.
    // Avro does not support filter push down; caseSensitive has no effect on it.
    // Skip this test for Avro to avoid false failures.
    assumeThat(fileFormat != FileFormat.AVRO).isTrue();
iceberg/core/src/main/java/org/apache/iceberg/avro/AvroFormatModel.java
Lines 225 to 229 in a50c2d9
Avro does not support caseSensitive.
    // Avro does not support filter push down.
    // Skip this test for Avro to avoid false failures.
    assumeThat(fileFormat != FileFormat.AVRO).isTrue();
iceberg/core/src/main/java/org/apache/iceberg/avro/AvroFormatModel.java
Lines 232 to 236 in a50c2d9
Avro does not support filters.
I think this would be worth a letter to the dev list about how to handle the missing features. We should collect as many examples as we can, and ask the community what to do about them.
I don't like hiding them here.
1. Maybe the FileFormat enum can contain some methods like supportsFilter and supportsCaseSensitive?
2. Maybe we can have a matrix at the beginning of the test case, like MISSING_FEATURES = ImmutableMap.of(FileFormat.AVRO, new String[] {"filter", "caseSensitive"});
3. Maybe a separate test class for every format, instead of storing this directly in the FileFormat enum?

I prefer 1 or 2, but am open to other suggestions.
I chose option 2. For now, this part is only used by the test cases, so I’m opting to close the loop in tests first.
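A minimal sketch of what option 2 could look like. The names (MissingFeatures, supports) are illustrative only, not the real Iceberg API, and plain strings stand in for the FileFormat enum so the example is self-contained:

```java
import java.util.List;
import java.util.Map;

public class MissingFeatures {
  // Hypothetical feature matrix kept next to the tests (option 2).
  // Entries mirror the gaps discussed in this thread.
  static final Map<String, List<String>> MISSING_FEATURES =
      Map.of(
          "AVRO", List.of("filter", "caseSensitive", "recordsPerBatch"),
          "ORC", List.of("batch", "reuseContainers"));

  // True when the given format supports the given ReadBuilder feature.
  static boolean supports(String format, String feature) {
    return !MISSING_FEATURES.getOrDefault(format, List.of()).contains(feature);
  }

  public static void main(String[] args) {
    // A test would then start with something like:
    //   assumeThat(supports(fileFormat, "filter")).isTrue();
    System.out.println(supports("AVRO", "filter"));    // false
    System.out.println(supports("PARQUET", "filter")); // true
  }
}
```

This keeps the format/feature knowledge in one place instead of scattering assumeThat calls through the test methods.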
472bdce to 41836f9

data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java (outdated, resolved)
    Schema fullSchema = dataGenerator.schema();

    List<Types.NestedField> columns = fullSchema.columns();
    Schema projectedSchema = new Schema(columns.get(columns.size() - 1));
I have seen cases where the issue happened when the projection was at the beginning or in the middle of the column list.
Please create a projected schema where we skip some columns at the beginning and in the middle as well, like:
- Original: (a, b, c, d, e)
- Projected: (b, d)
This could be random with the same seed, or whatever deterministic technique you would like to use.
I changed the projection part to select the odd-numbered fields.
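A small sketch of that projection rule, assuming "odd-numbered" means odd zero-based indices, which reproduces the reviewer's (a, b, c, d, e) to (b, d) example. Plain strings stand in for Types.NestedField to keep the sketch self-contained:

```java
import java.util.ArrayList;
import java.util.List;

public class OddProjection {
  // Keep every other column, starting from index 1, so the projection
  // deterministically skips columns at the beginning and in the middle.
  static List<String> oddColumns(List<String> columns) {
    List<String> projected = new ArrayList<>();
    for (int i = 1; i < columns.size(); i += 2) {
      projected.add(columns.get(i));
    }
    return projected;
  }

  public static void main(String[] args) {
    // (a, b, c, d, e) -> (b, d)
    System.out.println(oddColumns(List.of("a", "b", "c", "d", "e")));
  }
}
```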
data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java (outdated, resolved)
    List<Types.NestedField> columns = fullSchema.columns();
    Schema projectedSchema = new Schema(columns.get(columns.size() - 1));

    List<Record> genericRecords = dataGenerator.generateRecords();
Do we want to run these tests against all of the generators?
I expect that we will have a default schema for most of the tests, and we only want to run this on a single reasonably wide and a single reasonably narrow schema.
    // Construct a filter condition that is smaller than the minimum value to achieve file-level
    // filtering.
    Types.NestedField firstField = schema.columns().get(0);
This heavily builds on the schema of the records generated by the generator.
Maybe run the tests for all of the FileFormats, but only with a single generator.
So the tests should look like:

  @ParameterizedTest
  @FieldSource("FORMATS")
  void testReaderBuilderFilter(FileFormat fileFormat) {
    DataGenerator dataGenerator = new StructOfPrimitive(); // or DataGenerator.structOfPrimitive()
    ...
  }
Add a default schema, and update the ReadBuilder-related test cases to test the default schema.
  /**
   * Write with Generic Record, then read using split to verify that the split range is respected.
   * Reading with a zero-length split at the end of the file should return no records, while reading
   * with the full file range should return all records.
If we have a DataFile instead of only the InputFile, then we have splitOffsets. Based on those, we can verify whether we have enough data for multiple splits; if so, we can check that reading a single split produces more than 0 and fewer than all records, and that reading all of the splits produces the expected records.
The way to trigger multiple groups within a single DataFile varies across file formats, such as row groups in Parquet, stripes in ORC, etc.
Would we trigger this by writing 1000 or more records? Though explicitly setting this size feels a bit awkward.
I'm not sure what the best method for this is.
We definitely need to check that there are multiple splits in the file before running the test.
Maybe we can start the generation, check the number of splits, and use a constant row count that is high enough to generate multiple splits? If there are not enough splits, the check could state clearly that the row count should be increased until there are enough.
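One shape the suggested probe could take, as a hedged sketch: countSplits below is a hypothetical stand-in for writing that many rows and inspecting the resulting DataFile#splitOffsets, not a real Iceberg call, and the numbers are arbitrary.

```java
import java.util.function.IntUnaryOperator;

public class SplitProbe {
  // Grow the row count until the (simulated) written file has at least
  // two splits, and fail with a clear message if the cap is reached.
  static int rowsForMultipleSplits(IntUnaryOperator countSplits, int start, int max) {
    for (int rows = start; rows <= max; rows *= 2) {
      if (countSplits.applyAsInt(rows) >= 2) {
        return rows;
      }
    }
    throw new IllegalStateException(
        "Not enough splits even at " + max + " rows; increase the row cap");
  }

  public static void main(String[] args) {
    // Simulated writer that produces one split per 1000 rows.
    int rows = rowsForMultipleSplits(r -> r / 1000, 100, 1 << 20);
    System.out.println(rows);
  }
}
```

The doubling keeps the probe cheap when splits appear early, while the explicit cap avoids the unbounded growth (and the out-of-memory failure) described below.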
I tried writing continuously, but before a new split was generated, the unit test failed due to running out of memory. I’m keeping the original approach here and haven’t made any changes for now.
   * Verifies the contract of recordsPerBatch: recordsPerBatch is a hint for vectorized readers. The
   * total number of records returned must be unaffected regardless of the batch size value.
   */
  void testReaderBuilderRecordsPerBatch(FileFormat fileFormat, DataGenerator dataGenerator)
Are we testing non-vectorized readers with recordsPerBatch?
This seems strange. We might want to restrict this to vectorized models and verify with a batch size bigger than 1.
Maybe we should throw an exception when recordsPerBatch is set and the reader is non-vectorized.
You’re right—I should adjust this. Non-vectorized readers should throw an exception when recordsPerBatch is set. I’ll update it accordingly. Thanks for pointing it out.
The fix and the test for it could go in another PR.
Right now Avro throws an exception. I think I should adjust my assertions to verify that it throws the exception as expected, instead of skipping the test.
Either that, or fix the other readers and add the test and the fix in another PR.
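The assertion discussed above could be sketched like this. MockReadBuilder is a hypothetical stand-in for a format's ReadBuilder, showing only the contract under discussion: a non-vectorized reader rejects recordsPerBatch instead of silently ignoring it.

```java
public class BatchSizeContract {
  static class MockReadBuilder {
    final boolean vectorized;

    MockReadBuilder(boolean vectorized) {
      this.vectorized = vectorized;
    }

    // Non-vectorized readers reject the batch-size hint outright.
    MockReadBuilder recordsPerBatch(int size) {
      if (!vectorized) {
        throw new UnsupportedOperationException("recordsPerBatch is not supported");
      }
      return this;
    }
  }

  public static void main(String[] args) {
    try {
      new MockReadBuilder(false).recordsPerBatch(1000);
      System.out.println("no exception");
    } catch (UnsupportedOperationException e) {
      // A test would use assertThatThrownBy(...) instead of try/catch.
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```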
This PR adds TCK test coverage for the ReadBuilder API in BaseFormatModelTests. The new tests verify the contract of each ReadBuilder configuration method across all supported file formats (Avro, Parquet, ORC) and data generators.
Part of #15415