GH-50223: [C++][Compute] Support string_view/binary_view keys in the hash-aggregate Grouper#50224
GH-50223: [C++][Compute] Support string_view/binary_view keys in the hash-aggregate Grouper#50224fangchenli wants to merge 2 commits into
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
|
|
There was a problem hiding this comment.
Pull request overview
This PR extends the C++ hash-aggregate Grouper to accept string_view / binary_view group keys (fixing Table.group_by failures on view-typed keys) by adding a dedicated key encoder and routing view types through the non-fast GrouperImpl path.
Changes:
- Add
BinaryViewKeyEncoderto encode/decode view keys using the existing variable-length row format. - Update
RowEncoderandGrouperImplto dispatch*_viewkey types to the new encoder, while ensuringGrouperFastImplrejects view keys. - Add unit tests (row encoder + grouper) and update benchmarks to generate view-typed inputs appropriately.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| cpp/src/arrow/compute/row/row_encoder_internal.h | Declares BinaryViewKeyEncoder for view key encoding/decoding. |
| cpp/src/arrow/compute/row/row_encoder_internal.cc | Implements BinaryViewKeyEncoder and dispatches view types in RowEncoder::Init. |
| cpp/src/arrow/compute/row/row_encoder_internal_test.cc | Adds round-trip tests for array/scalar view key encoding. |
| cpp/src/arrow/compute/row/grouper.cc | Routes view keys to BinaryViewKeyEncoder; makes fast path reject view keys. |
| cpp/src/arrow/compute/row/grouper_test.cc | Adds functional coverage for view keys (consume/lookup/uniques + randomized tests). |
| cpp/src/arrow/compute/row/grouper_benchmark.cc | Generates view inputs via cast to benchmark the fallback path. |
| g.ExpectConsume(R"([["be"], [null]])", "[1, 2]"); | ||
| g.ExpectConsume(R"([["a long out-of-line view"], ["a long out-of-line view"]])", | ||
| "[3, 3]"); | ||
| g.ExpectUniques(R"([["eh"], ["be"], [null], ["a long out-of-line view"]])"); |
There was a problem hiding this comment.
Test with several different out-of-line views as well?
|
|
||
| // Encodes BinaryView/StringView keys in the same variable-length row format as | ||
| // VarLengthKeyEncoder. The encoded row copies the key bytes, so Decode builds a | ||
| // fresh view array rather than aliasing the input's variadic buffers. |
There was a problem hiding this comment.
Since the encode path is the same as for VarLengthKeyEncoder (only the decoding differs), can we factor out the encoding methods in a common base class?
|
@zanmato1984 Do you also want to take a look at this? |
|
I can look into this. |
Rationale for this change
Grouper rejected
string_viewandbinary_viewgroup keys withNotImplemented: Keys of type string_view, soTable.group_byon a view-typed key failed.What changes are included in this PR?
BinaryViewKeyEncoderinrow_encoder_internal.{h,cc}. It encodes view keys using the existing variable-length row format and decodes them into a fresh view array of the original type.GrouperImpl::Makedispatches view keys to the new encoder.GrouperFastImpl::CanUserejects view keys, so they fall back to GrouperImpl.Are these changes tested?
Yes, unit tests added.
Are there any user-facing changes?
Yes.
Table.group_bynow acceptsstring_viewandbinary_viewgroup keys