Skip to content

GH-49740: [C++][Python] Fix casts to view types leaving null variadic buffers#50166

Open
fenfeng9 wants to merge 5 commits into
apache:mainfrom
fenfeng9:fix-gh-49740-view-export-to-c
Open

GH-49740: [C++][Python] Fix casts to view types leaving null variadic buffers#50166
fenfeng9 wants to merge 5 commits into
apache:mainfrom
fenfeng9:fix-gh-49740-view-export-to-c

Conversation

@fenfeng9

@fenfeng9 fenfeng9 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

Casting to binary_view or string_view could leave a null variadic buffer slot when all values were inline. This could happen for casts from binary, large_binary, string, large_string, and fixed_size_binary.

The C Data Interface exporter reads every variadic buffer to get its size. Because of that, exporting such an array could crash, for example through PyArrow _export_to_c.

Validation also passed for these arrays. For all-inline view arrays, validation never needed to read an out-of-line data buffer.

What changes are included in this PR?

This PR fixes the cast kernels so all-inline view arrays do not keep a null variadic buffer slot.

It also makes validation reject null variadic buffer slots, and makes C Data export return an error instead of crashing.

C++ and Python regression tests cover the cast, validation, and export paths.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

This PR contains a "Critical Fix" Exporting an all-inline view array through the C Data Interface could crash the process while using only public APIs.

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @fenfeng9 . Two minor suggestions, but otherwise this PR looks good to me.

util::ToInlineBinaryView("hello"),
util::ToInlineBinaryView("world"),
}),
Raises(StatusCode::Invalid));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test the error message somehow?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll update the tests later.

nullptr}));

struct ArrowArray c_export;
ASSERT_RAISES(Invalid, ExportArray(*arr, &c_export));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test the error message too?

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 15, 2026
@pitrou

pitrou commented Jun 15, 2026

Copy link
Copy Markdown
Member

Hmm, I think we should try to produce some additional columns without any variadic buffers in the integration tests. Are you comfortable doing that @fenfeng9 ? Otherwise I can do it.

@fenfeng9

Copy link
Copy Markdown
Contributor Author

Hmm, I think we should try to produce some additional columns without any variadic buffers in the integration tests. Are you comfortable doing that @fenfeng9 ? Otherwise I can do it.

Yes, I'm willing to do that. Should I open a new issue for it?

@pitrou

pitrou commented Jun 15, 2026

Copy link
Copy Markdown
Member

Yes, I'm willing to do that. Should I open a new issue for it?

Or you can do it in this PR, because it would help validate that all implementations are in agreement.

@fenfeng9

Copy link
Copy Markdown
Contributor Author

Yes, I'm willing to do that. Should I open a new issue for it?

Or you can do it in this PR, because it would help validate that all implementations are in agreement.

Sounds good, I'll update it in this PR later. Thanks!

@pitrou

pitrou commented Jun 15, 2026

Copy link
Copy Markdown
Member

We may want to wait for the outcome of the discussion in #50172, actually

@fenfeng9

Copy link
Copy Markdown
Contributor Author

We may want to wait for the outcome of the discussion in #50172, actually

It's OK, I'll update the code after the discussion is concluded.

@pitrou

pitrou commented Jun 25, 2026

Copy link
Copy Markdown
Member

Ok, in #50172 the consensus is that null variadic buffers should be allowed, at least when importing from the C Data Interface.

So we can either:

  • detect null variadic buffers when importing from the C Data Interface, and normalize them as empty non-null buffers
  • or allow null variadic buffers everywhere in Arrow C++ (which may be more tedious and more fragile, as we would have to test for this condition explicitly)

@fenfeng9

Copy link
Copy Markdown
Contributor Author

Ok, in #50172 the consensus is that null variadic buffers should be allowed, at least when importing from the C Data Interface.

So we can either:

  • detect null variadic buffers when importing from the C Data Interface, and normalize them as empty non-null buffers
  • or allow null variadic buffers everywhere in Arrow C++ (which may be more tedious and more fragile, as we would have to test for this condition explicitly)

Agree. Then we just need to check for nullptr variadic buffers during Export and set their sizes to 0 instead of calling buf->size().

if (need_variadic_buffer_sizes) {
auto variadic_buffers = std::span(data->buffers).subspan(2);
export_.variadic_buffer_sizes_.resize(variadic_buffers.size());
size_t i = 0;
for (const auto& buf : variadic_buffers) {
export_.variadic_buffer_sizes_[i++] = buf->size();
}
export_.buffers_.back() = export_.variadic_buffer_sizes_.data();
}

@fenfeng9

Copy link
Copy Markdown
Contributor Author

I'll update this PR when that documentation PR is merged.

@fenfeng9

Copy link
Copy Markdown
Contributor Author

I'll update this PR when that documentation PR is merged.

Sorry, it seems we only agreed to allow nullptr variadic buffers for the C Data Interface #50172 , but not for Arrow C++ internally.

@pitrou

pitrou commented Jun 25, 2026

Copy link
Copy Markdown
Member

Well, allowing null variadic buffers inside Arrow C++ might lead to hidden issues if we dereference them. Perhaps it's better to normalize null variadic buffers when importing them over the C Data Interface.

@fenfeng9

Copy link
Copy Markdown
Contributor Author

Well, allowing null variadic buffers inside Arrow C++ might lead to hidden issues if we dereference them. Perhaps it's better to normalize null variadic buffers when importing them over the C Data Interface.

It looks like the import path already handles null buffers — when the data pointer is null and size is 0, it gets normalized to zero_size_buffer_.

Status ImportBinaryView(const BinaryViewType&):

// The last C data buffer stores buffer sizes, and shouldn't be imported
auto* buffer_sizes =
static_cast<const int64_t*>(c_struct_->buffers[c_struct_->n_buffers - 1]);
for (int32_t buffer_id = 2; buffer_id < c_struct_->n_buffers - 1; ++buffer_id) {
RETURN_NOT_OK(ImportBuffer(buffer_id, buffer_sizes[buffer_id - 2]));
}
data_->buffers.pop_back();

ImportBuffer:

Status ImportBuffer(int32_t buffer_id, int64_t buffer_size,
bool is_null_bitmap = false) {
std::shared_ptr<Buffer>* out = &data_->buffers[buffer_id];
auto data = reinterpret_cast<const uint8_t*>(c_struct_->buffers[buffer_id]);
if (data != nullptr) {
if (memory_mgr_) {
*out = std::make_shared<ImportedBuffer>(data, buffer_size, memory_mgr_,
device_type_, import_);
} else {
*out = std::make_shared<ImportedBuffer>(data, buffer_size, import_);
}
} else if (is_null_bitmap) {
out->reset();
} else {
// Ensure that imported buffers are never null (except for the null bitmap)
if (buffer_size != 0) {
return Status::Invalid(
"ArrowArrayStruct contains null data pointer "
"for a buffer with non-zero computed size");
}
*out = zero_size_buffer_;
}
return Status::OK();
}

// For imported null buffer pointers
std::shared_ptr<Buffer> zero_size_buffer_;

@pitrou

pitrou commented Jun 25, 2026

Copy link
Copy Markdown
Member

Ok, so this implies we probably don't want to produce null variadic buffers in other Arrow functionality (such as casting to binary-view).

@pitrou

pitrou commented Jun 25, 2026

Copy link
Copy Markdown
Member

Also, we need to add tests for null variadic buffers in arrow/c/bridge_test.cc.

@fenfeng9

Copy link
Copy Markdown
Contributor Author

Ok, so this implies we probably don't want to produce null variadic buffers in other Arrow functionality (such as casting to binary-view).

Okay, so we keep the other changes? The validate still reject null variadic buffers?

And do we also need to update other documentation to state that we do not allow generating null variadic buffers?

@fenfeng9

Copy link
Copy Markdown
Contributor Author

Also, we need to add tests for null variadic buffers in arrow/c/bridge_test.cc.

I will update pr later.

@pitrou

pitrou commented Jun 25, 2026

Copy link
Copy Markdown
Member

Okay, so we keep the other changes? The validate still reject null variadic buffers?

Yes, I think validation should still reject them.

And do we also need to update other documentation to state that we do not allow generating null variadic buffers?

I don't think we have a piece of documentation that covers this, do we?

@fenfeng9

Copy link
Copy Markdown
Contributor Author

I don't think we have a piece of documentation that covers this, do we?

No, there isn't.

Thanks for the patient reply. Now I understand: we don't generate null variadic buffers inside Arrow C++.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants