Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Jan 15, 2026

Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.

What changes are included in this PR?

Add a Security Considerations page in the Format section.

Doc preview: https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

Are these changes tested?

N/A

Are there any user-facing changes?

No.

@pitrou
Copy link
Member Author

pitrou commented Jan 15, 2026

@github-actions crossbow submit preview-docs

@github-actions
Copy link

Revision: 593babb

Submitted crossbow builds: ursacomputing/crossbow @ actions-4f7018459b

Task Status
preview-docs GitHub Actions

Copy link
Member

@raboof raboof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable (without any particular Arrow expertise)

(noticed two typo's)

------------------

A less obvious pitfall is when some parts of an Arrow array are left uninitialized.
For example, if a element of a primitive Arrow array is marked null through its
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For example, if a element of a primitive Arrow array is marked null through its
For example, if an element of a primitive Arrow array is marked null through its

purposes. It is therefore tempting, when creating an array with null values, to
not initialize the corresponding value slots.

However, this then introduces a serious security if the Arrow data is serialized
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
However, this then introduces a serious security if the Arrow data is serialized
However, this then introduces a serious security risk if the Arrow data is serialized

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 15, 2026
uninitialized in a buffer if the array might be sent to, or read by, a untrusted
third-party, even when the uninitialized data is logically irrelevant. The
easiest way to do this, though perhaps not the most efficient, is to zero-initialize
any buffer that will not be populated in full.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth pointing out something about query engines and dataframe libraries deciding to not do so for internal/intermediate values in computations but applying a canonicalization pass when data leaves the system.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can emphasize that all bytes in an Arrow array, regardless if they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (like uninitialized values).

And therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system

from an untrusted source (for example because you are writing a proxy to
an arbitrary third-party service), it is **recommended** that you validate
the data first, as the consumer may assume that the data is valid already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In addition to invalid pointers, some array types have offsets, sizes, and buffer indices that might be out-of-bounds. The library producing arrays through the
C data interface might be performing only very light validation of these values.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jan 15, 2026
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @pitrou -- this is much needed and very helpful

I had some suggestions on structure. Hopefully they are helpful

Comment on lines +37 to +38
process' address space. As such, in-memory Arrow data should not be accessed
without care.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoiding a double negative might make this read better

Suggested change
process' address space. As such, in-memory Arrow data should not be accessed
without care.
process' address space. As such, in-memory Arrow data should be accessed with care.

Comment on lines +43 to +50
Reading and interpreting Arrow data involves reading into several buffers,
sometimes in non-trivial ways. This may for instance involve data-dependent
indirect addressing: to read a value from a Binary array, you need to
1) read its offsets in buffer #2, and 2) read the range of bytes delimited by
these offsets in buffer #3. If the offsets are invalid (deliberately or not),
then step 2) can access invalid memory (potentially crashing the process) or
memory unrelated to Arrow (potentially allowing an attacker to exfiltrate
confidential data).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the core point of this paragraph may get a little lost in the specific details. I would suggest we start by explicitly stating the root cause of potential safety concerns. Something like

Suggested change
Reading and interpreting Arrow data involves reading into several buffers,
sometimes in non-trivial ways. This may for instance involve data-dependent
indirect addressing: to read a value from a Binary array, you need to
1) read its offsets in buffer #2, and 2) read the range of bytes delimited by
these offsets in buffer #3. If the offsets are invalid (deliberately or not),
then step 2) can access invalid memory (potentially crashing the process) or
memory unrelated to Arrow (potentially allowing an attacker to exfiltrate
confidential data).
Arrow is a low level memory format, and the contents of Arrow
buffers are often combined and treated as pointers into the process
memory space. Invalid Arrow data may cause invalid memory accesses
(potentially crashing the process) or permit access to non-Arrow data
(potentially allowing an attacker to exfiltrate confidential information).
For example, reading and interpreting Arrow data involves reading into several buffers,
sometimes in non-trivial ways. This may for instance involve data-dependent
indirect addressing: to read a value from a Binary array, you need to
1) read its offsets in buffer #2, and 2) read the range of bytes delimited by
these offsets in buffer #3. If the offsets are invalid (deliberately or not),
then step 2) can access invalid memory.

Advice for users
''''''''''''''''

If you receive Arrow in-memory data from an untrusted source, it is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we also make the point about performance here to give context about why
validation is not always performed

Perhaps something like this:

"Arrow implementations often assume Arrays follow the specification
to provide high speed processing. It is extremely important that
your application either trusts or validates the Arrays it receives from
other sources.

Many Arrow implementations provide APIs to do such validation.

In terms of APIs, the Rust implementation always validates data from external sources, unless the validation is explicitly turned off with APIs marked as unsafe (a special Rust keyword).

uninitialized in a buffer if the array might be sent to, or read by, a untrusted
third-party, even when the uninitialized data is logically irrelevant. The
easiest way to do this, though perhaps not the most efficient, is to zero-initialize
any buffer that will not be populated in full.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can emphasize that all bytes in an Arrow array, regardless if they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (like uninitialized values).

And therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system

''''''''''''''''

If you produce a C Data Interface structure for data that nevertheless comes
from an untrusted source (for example because you are writing a proxy to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is any different than the other APIs -- basically "if you don't trust the producer source, you should always explicitly validate the arrays before processing them"

This doesn't seem any different for the C Data Interface than for the other APIs (like IPC files. etc)

a trusted producer, for the reason explained above. However, it is still **recommended**
that you validate it for soundness, as a trusted producer can have bugs anyway.

IPC Format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I think we could combine this into the section about validating data from untrusted sources, and give C Data Interface and IPC Format as examples of potentially untrusted sources.

How to read this
================

Hereafter we try list potential security concerns when dealing with the various
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Hereafter we try list potential security concerns when dealing with the various
Hereafter we try to list potential security concerns when dealing with the various

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants