@@ -69,6 +69,32 @@ this can be overridden by setting `spark.comet.regexp.allowIncompatible=true`.
6969Comet's support for window functions is incomplete and known to be incorrect. It is disabled by default and
7070should not be used in production. The feature will be enabled in a future release. Tracking issue: [#2721](https://github.com/apache/datafusion-comet/issues/2721).
7171
72+ ## Round-Robin Partitioning
73+
74+ Comet's native shuffle implementation of round-robin partitioning (`df.repartition(n)`) is not compatible with
75+ Spark's implementation and is disabled by default. It can be enabled by setting
76+ `spark.comet.native.shuffle.partitioning.roundrobin.enabled=true`.
77+
78+ **Why the incompatibility exists:**
79+
80+ Spark's round-robin partitioning sorts rows by their binary `UnsafeRow` representation before assigning them to
81+ partitions. This ensures deterministic output for fault tolerance (task retries produce identical results).
82+ Comet uses Arrow format internally, which has a completely different binary layout than `UnsafeRow`, making it
83+ impossible to match Spark's exact partition assignments.
84+
85+ **Comet's approach:**
86+
87+ Instead of true round-robin assignment, Comet implements round-robin as hash partitioning on ALL columns. This
88+ achieves the same semantic goals:
89+
90+ - **Even distribution**: Rows are distributed evenly across partitions, provided the hash values vary
91+ sufficiently (in some cases there could be skew)
92+ - **Deterministic**: The same input always produces the same partition assignments (important for fault tolerance)
93+ - **No semantic grouping**: Unlike hash partitioning on specific columns, this doesn't group related rows together
94+
95+ The only difference is that Comet's partition assignments will differ from Spark's. When results are sorted,
96+ they will be identical to Spark's. Unsorted results may have different row ordering.
97+
7298## Cast
7399
74100Cast operations in Comet fall into three levels of support:
@@ -84,104 +110,17 @@ Cast operations in Comet fall into three levels of support:
84110
85111### Legacy Mode
86112
87- <!-- WARNING! DO NOT MANUALLY MODIFY CONTENT BETWEEN THE BEGIN AND END TAGS -->
88-
89113<!-- BEGIN:CAST_LEGACY_TABLE-->
90- <!-- prettier-ignore-start -->
91- | | binary | boolean | byte | date | decimal | double | float | integer | long | short | string | timestamp |
92- | ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---|
93- | binary | - | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | C | N/A |
94- | boolean | N/A | - | C | N/A | U | C | C | C | C | C | C | U |
95- | byte | U | C | - | N/A | C | C | C | C | C | C | C | U |
96- | date | N/A | U | U | - | U | U | U | U | U | U | C | U |
97- | decimal | N/A | C | C | N/A | - | C | C | C | C | C | C | U |
98- | double | N/A | C | C | N/A | I | - | C | C | C | C | C | U |
99- | float | N/A | C | C | N/A | I | C | - | C | C | C | C | U |
100- | integer | U | C | C | N/A | C | C | C | - | C | C | C | U |
101- | long | U | C | C | N/A | C | C | C | C | - | C | C | U |
102- | short | U | C | C | N/A | C | C | C | C | C | - | C | U |
103- | string | C | C | C | C | I | C | C | C | C | C | - | I |
104- | timestamp | N/A | U | U | C | U | U | U | U | C | U | C | - |
105- <!-- prettier-ignore-end -->
106-
107- **Notes:**
108- - **decimal -> string**: There can be formatting differences in some case due to Spark using scientific notation where Comet does not
109- - **double -> decimal**: There can be rounding differences
110- - **double -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
111- - **float -> decimal**: There can be rounding differences
112- - **float -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
113- - **string -> date**: Only supports years between 262143 BC and 262142 AD
114- - **string -> decimal**: Does not support fullwidth unicode digits (e.g \\uFF10)
115- or strings containing null bytes (e.g \\u0000)
116- - **string -> timestamp**: Not all valid formats are supported
117114<!-- END:CAST_LEGACY_TABLE-->
118115
119116### Try Mode
120117
121- <!-- WARNING! DO NOT MANUALLY MODIFY CONTENT BETWEEN THE BEGIN AND END TAGS -->
122-
123118<!-- BEGIN:CAST_TRY_TABLE-->
124- <!-- prettier-ignore-start -->
125- | | binary | boolean | byte | date | decimal | double | float | integer | long | short | string | timestamp |
126- | ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---|
127- | binary | - | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | C | N/A |
128- | boolean | N/A | - | C | N/A | U | C | C | C | C | C | C | U |
129- | byte | U | C | - | N/A | C | C | C | C | C | C | C | U |
130- | date | N/A | U | U | - | U | U | U | U | U | U | C | U |
131- | decimal | N/A | C | C | N/A | - | C | C | C | C | C | C | U |
132- | double | N/A | C | C | N/A | I | - | C | C | C | C | C | U |
133- | float | N/A | C | C | N/A | I | C | - | C | C | C | C | U |
134- | integer | U | C | C | N/A | C | C | C | - | C | C | C | U |
135- | long | U | C | C | N/A | C | C | C | C | - | C | C | U |
136- | short | U | C | C | N/A | C | C | C | C | C | - | C | U |
137- | string | C | C | C | C | I | C | C | C | C | C | - | I |
138- | timestamp | N/A | U | U | C | U | U | U | U | C | U | C | - |
139- <!-- prettier-ignore-end -->
140-
141- **Notes:**
142- - **decimal -> string**: There can be formatting differences in some case due to Spark using scientific notation where Comet does not
143- - **double -> decimal**: There can be rounding differences
144- - **double -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
145- - **float -> decimal**: There can be rounding differences
146- - **float -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
147- - **string -> date**: Only supports years between 262143 BC and 262142 AD
148- - **string -> decimal**: Does not support fullwidth unicode digits (e.g \\uFF10)
149- or strings containing null bytes (e.g \\u0000)
150- - **string -> timestamp**: Not all valid formats are supported
151119<!-- END:CAST_TRY_TABLE-->
152120
153121### ANSI Mode
154122
155- <!-- WARNING! DO NOT MANUALLY MODIFY CONTENT BETWEEN THE BEGIN AND END TAGS -->
156-
157123<!-- BEGIN:CAST_ANSI_TABLE-->
158- <!-- prettier-ignore-start -->
159- | | binary | boolean | byte | date | decimal | double | float | integer | long | short | string | timestamp |
160- | ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---| ---|
161- | binary | - | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | C | N/A |
162- | boolean | N/A | - | C | N/A | U | C | C | C | C | C | C | U |
163- | byte | U | C | - | N/A | C | C | C | C | C | C | C | U |
164- | date | N/A | U | U | - | U | U | U | U | U | U | C | U |
165- | decimal | N/A | C | C | N/A | - | C | C | C | C | C | C | U |
166- | double | N/A | C | C | N/A | I | - | C | C | C | C | C | U |
167- | float | N/A | C | C | N/A | I | C | - | C | C | C | C | U |
168- | integer | U | C | C | N/A | C | C | C | - | C | C | C | U |
169- | long | U | C | C | N/A | C | C | C | C | - | C | C | U |
170- | short | U | C | C | N/A | C | C | C | C | C | - | C | U |
171- | string | C | C | C | C | I | C | C | C | C | C | - | I |
172- | timestamp | N/A | U | U | C | U | U | U | U | C | U | C | - |
173- <!-- prettier-ignore-end -->
174-
175- **Notes:**
176- - **decimal -> string**: There can be formatting differences in some case due to Spark using scientific notation where Comet does not
177- - **double -> decimal**: There can be rounding differences
178- - **double -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
179- - **float -> decimal**: There can be rounding differences
180- - **float -> string**: There can be differences in precision. For example, the input "1.4E-45" will produce 1.0E-45 instead of 1.4E-45
181- - **string -> date**: Only supports years between 262143 BC and 262142 AD
182- - **string -> decimal**: Does not support fullwidth unicode digits (e.g \\uFF10)
183- or strings containing null bytes (e.g \\u0000)
184- - **string -> timestamp**: ANSI mode not supported
185124<!-- END:CAST_ANSI_TABLE-->
186125
187126See the [tracking issue](https://github.com/apache/datafusion-comet/issues/286) for more details.