GH-49740: [C++][Python] Fix casts to view types leaving null variadic buffers#50166
GH-49740: [C++][Python] Fix casts to view types leaving null variadic buffers#50166fenfeng9 wants to merge 5 commits into
Conversation
| util::ToInlineBinaryView("hello"), | ||
| util::ToInlineBinaryView("world"), | ||
| }), | ||
| Raises(StatusCode::Invalid)); |
There was a problem hiding this comment.
Can we test the error message somehow?
There was a problem hiding this comment.
Sure, I'll update the tests later.
| nullptr})); | ||
|
|
||
| struct ArrowArray c_export; | ||
| ASSERT_RAISES(Invalid, ExportArray(*arr, &c_export)); |
There was a problem hiding this comment.
Can we test the error message too?
|
Hmm, I think we should try to produce some additional columns without any variadic buffers in the integration tests. Are you comfortable doing that @fenfeng9 ? Otherwise I can do it. |
Yes, I'm willing to do that. Should I open a new issue for it? |
Or you can do it in this PR, because it would help validate that all implementations are in agreement. |
Sounds good, I'll update it in this PR later. Thanks! |
|
We may want to wait for the outcome of the discussion in #50172, actually |
It's OK, I'll update the code after the discussion is concluded. |
|
Ok, in #50172 the consensus is that null variadic buffers should be allowed, at least when importing from the C Data Interface. So we can either:
|
Agree. Then we just need to check for arrow/cpp/src/arrow/c/bridge.cc Lines 606 to 614 in 445851a |
|
I'll update this PR when that documentation PR is merged. |
Sorry, it seems we only agreed to allow nullptr variadic buffers for the C Data Interface #50172 , but not for Arrow C++ internally. |
|
Well, allowing null variadic buffers inside Arrow C++ might lead to hidden issues if we dereference them. Perhaps it's better to normalize null variadic buffers when importing them over the C Data Interface. |
It looks like the import path already handles null buffers — when the data pointer is null and size is 0, it gets normalized to
arrow/cpp/src/arrow/c/bridge.cc Lines 1767 to 1774 in 445851a
arrow/cpp/src/arrow/c/bridge.cc Lines 1912 to 1935 in 445851a arrow/cpp/src/arrow/c/bridge.cc Lines 1944 to 1947 in 445851a |
|
Ok, so this implies we probably don't want to produce null variadic buffers in other Arrow functionality (such as casting to binary-view). |
|
Also, we need to add tests for null variadic buffers in |
Okay, so we keep the other changes? The And do we also need to update other documentation to state that we do not allow generating null variadic buffers? |
I will update pr later. |
Yes, I think validation should still reject them.
I don't think we have a piece of documentation that covers this, do we? |
No, there isn't. Thanks for the patient reply. Now I understand: we don't generate null variadic buffers inside Arrow C++. |
Rationale for this change
Casting to
binary_vieworstring_viewcould leave a null variadic buffer slot when all values were inline. This could happen for casts frombinary,large_binary,string,large_string, andfixed_size_binary.The C Data Interface exporter reads every variadic buffer to get its size. Because of that, exporting such an array could crash, for example through PyArrow
_export_to_c.Validation also passed for these arrays. For all-inline view arrays, validation never needed to read an out-of-line data buffer.
What changes are included in this PR?
This PR fixes the cast kernels so all-inline view arrays do not keep a null variadic buffer slot.
It also makes validation reject null variadic buffer slots, and makes C Data export return an error instead of crashing.
C++ and Python regression tests cover the cast, validation, and export paths.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.
This PR contains a "Critical Fix" Exporting an all-inline view array through the C Data Interface could crash the process while using only public APIs.
_export_to_csegmentation fault forbinary_viewarray #49740