Spark 4.1: [WIP] Geometry / Geography end-to-end support#16650
Draft
huan233usc wants to merge 2 commits into
Draft
Spark 4.1: [WIP] Geometry / Geography end-to-end support#16650huan233usc wants to merge 2 commits into
huan233usc wants to merge 2 commits into
Conversation
added 2 commits
May 28, 2026 23:37
Iceberg v3 stores geometry and geography lower/upper bounds as binary using the x:y:z:m encoding defined in the Bound Serialization section of the spec (Appendix D). The encoding is already implemented in GeospatialBound, but Conversions.toByteBuffer / fromByteBuffer have no GEOMETRY / GEOGRAPHY cases and throw UnsupportedOperationException, blocking ManifestEvaluator and any metric-based use of geo bounds. Wire GeospatialBound through Conversions for both geometry and geography, and add round-trip coverage in TestConversions for the 16-, 24-, and 32-byte shapes (including the x:y:NaN:m case for XYM bounds), CRS variants, and nulls.
Build on top of apache#16607 to enable end-to-end CREATE / INSERT / SELECT / DELETE on Iceberg geometry and geography columns from Spark 4.1. Parquet schema mapping - TypeToMessageType: emit Iceberg GEOMETRY / GEOGRAPHY as Parquet BINARY with LogicalTypeAnnotation.geometryType / geographyType, propagating CRS and (for geography) the EdgeAlgorithm. - MessageTypeToType: read those annotations back into Iceberg Types.GeometryType / Types.GeographyType. - ParquetMetrics: skip lex min/max bounds for geometry / geography (Comparators.forType has no ordering for spatial WKB) and fall back to value / null counts only. Spatial bounding boxes (X:Y:Z:M) will be plumbed through FieldMetrics in a follow-up. Generic Parquet readers / writers - BaseParquetReaders / BaseParquetWriter: dispatch the geometry and geography logical type annotations to ParquetValueReaders.byteBuffers and ParquetValueWriters.byteBuffers, surfacing WKB as ByteBuffer for the engine-agnostic data path. Spark 4.1 type bridge - TypeToSparkType / SparkTypeToType: bidirectional mapping between Iceberg GEOMETRY / GEOGRAPHY and Spark's GeometryType / GeographyType, preserving the CRS string. Geography defaults to the SPHERICAL edge algorithm on the Iceberg side. - PruneColumnsWithoutReordering: register GEOMETRY and GEOGRAPHY in the type-id table so column pruning does not reject geo columns. Spark 4.1 Parquet readers / writers - SparkParquetReaders: new GeometryReader / GeographyReader that read pure WKB from Parquet and prepend the 4-byte little-endian SRID header expected by Spark's internal GeometryVal / GeographyVal. The SRID is derived from the column's CRS via Spark's CartesianSpatialReference- SystemMapper / GeographicSpatialReferenceSystemMapper. - SparkParquetWriters: new GeometryWriter / GeographyWriter that strip Spark's SRID header and write only the WKB body to Parquet, matching the Iceberg / Parquet on-disk representation. Tests - TestSparkGeoTypes (Spark 4.1): 11 end-to-end SQL tests covering - flat geometry / geography round-trip, - ST_Srid predicate (validates SRID re-attachment from CRS), - DELETE on a v3 merge-on-read table producing a Puffin DV, - NULL geometry mixed with non-NULL, - mixed geometry + geography in the same table, - STRUCT<..., loc: GEOMETRY>, ARRAY<GEOMETRY>, MAP<STRING, GEOMETRY>, STRUCT<..., points: ARRAY<GEOMETRY>>, - DELETE + DV on a table whose geometry sits inside a struct. Vectorized reads are disabled per-table because the vectorized geo path is not yet implemented. Topological predicates such as ST_Intersects are not part of stock Spark 4.1, so predicate coverage is limited to ST_Srid for now.
Author
|
cc @szehon-ho |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Important
WIP / Draft. Stacked on top of #16607 (
API: Single-value binary serialization for geometry and geography). The first 1 commit of this PR is from #16607; please review only the most recent commit (Spark 4.1: WIP geometry / geography end-to-end support) here. This PR will be rebased / re-targeted once #16607 lands.Summary
Wires up Iceberg's geometry and geography types end-to-end on Spark 4.1, so
CREATE TABLE / INSERT / SELECT / DELETEwork over Parquet with the in-progress geospatial column types.The change has three layers:
GEOMETRY/GEOGRAPHYthroughLogicalTypeAnnotation.geometryType/geographyType(TypeToMessageType,MessageTypeToType).BaseParquetReaders/BaseParquetWritersurface WKB asByteBufferfor the engine-agnostic data path.TypeToSparkType/SparkTypeToTypebridge Iceberg ↔ SparkGeometryType/GeographyType, preserving CRS.PruneColumnsWithoutReorderingaccepts geo column types during schema pruning.SparkParquetReaders/SparkParquetWriterstranslate between Spark's internalGeometryVal/GeographyVal(4-byte little-endian SRID header + WKB) and the pure-WKB Parquet representation. SRID is derived per-column from the CRS via Spark'sCartesianSpatialReferenceSystemMapper/GeographicSpatialReferenceSystemMapper.ParquetMetricsskips lexbounds()for geo (noComparators.forTypeordering for spatial WKB) and falls back to value / null counts. Real spatial bounding-box stats (X:Y:Z:M) will be plumbed throughFieldMetricsin a follow-up.What's intentionally out of scope (follow-ups)
GeospatialBoundX:Y:Z:M) plumbed throughFieldMetricsand used for partition / file-skipping pruning.read.parquet.vectorization.enabled = falseper-table for now.ST_Intersects,ST_Within, ...) — not part of stock Spark 4.1.Test plan
TestSparkGeoTypes(Spark 4.1, 11 cases):testGeometryRoundTrip— flatGEOMETRYround-triptestGeographyRoundTrip— flatGEOGRAPHYround-triptestSridFilterRoundtrip—ST_Srid(geom) = Npredicate validates CRS → SRID re-attachment on readtestDeleteWithDeletionVector— v3 + MoR DELETE produces a Puffin DV (added-dvs=1,added-position-deletes=1)testNullGeometryValue— NULL geometry rows mixed with non-NULL in the same data filetestMultipleGeoColumnsInOneTable—GEOMETRY+GEOGRAPHYside by side, per-column CRS metadata honoredtestStructWithGeometry—STRUCT<eid, loc: GEOMETRY>testArrayOfGeometry—ARRAY<GEOMETRY>testMapOfGeometry—MAP<STRING, GEOMETRY>testStructOfArrayOfGeometry—STRUCT<tid, points: ARRAY<GEOMETRY>>testDeleteWithDeletionVectorOnNestedGeometry— DV path on a table where geometry sits inside a struct./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:spotlessCheck :iceberg-parquet:spotlessCheckis clean.