GH-48868: [Doc] Document security model for the Arrow formats #48870

pitrou · 2026-01-15T15:43:43Z

Rationale for this change

Accessing Arrow data or any of the formats can have non-trivial security implications; this is an attempt at documenting those.

What changes are included in this PR?

Add a Security Considerations page in the Format section.

Doc preview: https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html

Are these changes tested?

N/A

Are there any user-facing changes?

No.

GitHub Issue: [Doc] Document security model for the Arrow formats #48868

pitrou · 2026-01-15T15:45:55Z

@github-actions crossbow submit preview-docs

github-actions · 2026-01-15T15:48:40Z

Revision: 593babb

Submitted crossbow builds: ursacomputing/crossbow @ actions-4f7018459b

Task	Status
preview-docs

raboof

Looks reasonable (without any particular Arrow expertise)

(noticed two typos)

raboof · 2026-01-15T16:31:14Z

docs/source/format/Security.rst

+------------------
+
+A less obvious pitfall is when some parts of an Arrow array are left uninitialized.
+For example, if a element of a primitive Arrow array is marked null through its


Suggested change

-For example, if a element of a primitive Arrow array is marked null through its
+For example, if an element of a primitive Arrow array is marked null through its

raboof · 2026-01-15T16:33:18Z

docs/source/format/Security.rst

+purposes. It is therefore tempting, when creating an array with null values, to
+not initialize the corresponding value slots.
+
+However, this then introduces a serious security if the Arrow data is serialized


Suggested change

-However, this then introduces a serious security if the Arrow data is serialized
+However, this then introduces a serious security risk if the Arrow data is serialized

felipecrv · 2026-01-15T19:43:50Z

docs/source/format/Security.rst

+uninitialized in a buffer if the array might be sent to, or read by, a untrusted
+third-party, even when the uninitialized data is logically irrelevant. The
+easiest way to do this, though perhaps not the most efficient, is to zero-initialize
+any buffer that will not be populated in full.


Worth pointing out something about query engines and dataframe libraries deciding to not do so for internal/intermediate values in computations but applying a canonicalization pass when data leaves the system.

Perhaps we can emphasize that all bytes in an Arrow array, regardless of whether they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (such as uninitialized values).

Therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system.
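As an illustration of the canonicalization pass described above, here is a minimal Python sketch. It is not taken from the PR or from any Arrow implementation: the function name `clear_null_slots` is hypothetical, and the sketch assumes a fixed-width primitive array whose validity bitmap uses least-significant-bit numbering, as in the Arrow columnar format.

```python
import struct


def clear_null_slots(validity: bytes, values: bytearray, length: int, width: int) -> None:
    """Zero every value slot whose validity bit is unset.

    Hypothetical helper: `validity` is an LSB-first bitmap, `values` holds
    `length` fixed-width slots of `width` bytes each.
    """
    for i in range(length):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if not is_valid:
            # Overwrite the logically irrelevant slot so that no stale bytes
            # leave the process when the buffer is serialized or shared.
            values[i * width:(i + 1) * width] = bytes(width)


# Four int32 slots; slots 1 and 3 are null but still hold leftover bytes.
values = bytearray(struct.pack("<4i", 10, 0x5EC2E7, 30, 0x5EC2E7))
validity = bytes([0b0101])  # bits 0 and 2 set: slots 0 and 2 are valid
clear_null_slots(validity, values, length=4, width=4)
cleared = struct.unpack("<4i", values)
```

A query engine could run a pass of this kind only at the boundary where data leaves the system, keeping the cheaper uninitialized allocation for intermediate results. Padding bytes beyond the last slot would need the same treatment, which this sketch does not cover.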

felipecrv · 2026-01-15T19:50:20Z

docs/source/format/Security.rst

+from an untrusted source (for example because you are writing a proxy to
+an arbitrary third-party service), it is **recommended** that you validate
+the data first, as the consumer may assume that the data is valid already.
+


Suggested change

+In addition to invalid pointers, some array types have offsets, sizes, and buffer indices that might be out-of-bounds. The library producing arrays through the
+C data interface might be performing only very light validation of these values.

alamb

Thank you @pitrou -- this is much needed and very helpful

I had some suggestions on structure. Hopefully they are helpful

alamb · 2026-01-15T18:19:23Z

docs/source/format/Security.rst

+process' address space. As such, in-memory Arrow data should not be accessed
+without care.


Avoiding a double negative might make this read better

Suggested change

-process' address space. As such, in-memory Arrow data should not be accessed
-without care.
+process' address space. As such, in-memory Arrow data should be accessed with care.

alamb · 2026-01-16T11:28:44Z

docs/source/format/Security.rst

+Reading and interpreting Arrow data involves reading into several buffers,
+sometimes in non-trivial ways. This may for instance involve data-dependent
+indirect addressing: to read a value from a Binary array, you need to
+1) read its offsets in buffer #2, and 2) read the range of bytes delimited by
+these offsets in buffer #3. If the offsets are invalid (deliberately or not),
+then step 2) can access invalid memory (potentially crashing the process) or
+memory unrelated to Arrow (potentially allowing an attacker to exfiltrate
+confidential data).


I think the core point of this paragraph may get a little lost in the specific details. I would suggest we start by explicitly stating the root cause of potential safety concerns. Something like

Suggested change

-Reading and interpreting Arrow data involves reading into several buffers,
-sometimes in non-trivial ways. This may for instance involve data-dependent
-indirect addressing: to read a value from a Binary array, you need to
-1) read its offsets in buffer #2, and 2) read the range of bytes delimited by
-these offsets in buffer #3. If the offsets are invalid (deliberately or not),
-then step 2) can access invalid memory (potentially crashing the process) or
-memory unrelated to Arrow (potentially allowing an attacker to exfiltrate
-confidential data).
+Arrow is a low level memory format, and the contents of Arrow
+buffers are often combined and treated as pointers into the process
+memory space. Invalid Arrow data may cause invalid memory accesses
+(potentially crashing the process) or permit access to non-Arrow data
+(potentially allowing an attacker to exfiltrate confidential information).
+
+For example, reading and interpreting Arrow data involves reading into several buffers,
+sometimes in non-trivial ways. This may for instance involve data-dependent
+indirect addressing: to read a value from a Binary array, you need to
+1) read its offsets in buffer #2, and 2) read the range of bytes delimited by
+these offsets in buffer #3. If the offsets are invalid (deliberately or not),
+then step 2) can access invalid memory.
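The two-step read described in this paragraph, and the check that makes it safe, can be sketched in a few lines of Python. This is an illustration only, not code from the PR or from an Arrow implementation: the names `validate_binary_offsets` and `read_binary_value` are hypothetical, and the sketch assumes little-endian 32-bit offsets with `length + 1` entries, as in the Binary layout.

```python
import struct


def validate_binary_offsets(offsets_buf: bytes, data_buf: bytes, length: int) -> None:
    """Raise ValueError if the int32 offsets cannot safely index data_buf."""
    if len(offsets_buf) < (length + 1) * 4:
        raise ValueError("offsets buffer too small for the declared length")
    offsets = struct.unpack_from(f"<{length + 1}i", offsets_buf)
    if offsets[0] < 0:
        raise ValueError("first offset is negative")
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError("offsets are not monotonically non-decreasing")
    if offsets[-1] > len(data_buf):
        raise ValueError("last offset points past the end of the data buffer")


def read_binary_value(offsets_buf: bytes, data_buf: bytes, i: int) -> bytes:
    """Step 1: read the two offsets; step 2: read the bytes they delimit."""
    start, end = struct.unpack_from("<2i", offsets_buf, i * 4)
    return data_buf[start:end]


offsets = struct.pack("<4i", 0, 3, 3, 5)
data = b"abcde"
validate_binary_offsets(offsets, data, length=3)
values = [read_binary_value(offsets, data, i) for i in range(3)]
```

Python slicing clamps out-of-range indices, so a bad offset here only yields wrong bytes. In a language with raw pointer arithmetic, the same unvalidated read in step 2 is what reaches memory outside the data buffer.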

alamb · 2026-01-16T11:34:01Z

docs/source/format/Security.rst

+Advice for users
+''''''''''''''''
+
+If you receive Arrow in-memory data from an untrusted source, it is


I suggest we also make the point about performance here, to give context about why
validation is not always performed.

Perhaps something like this:

"Arrow implementations often assume Arrays follow the specification
in order to provide high-speed processing. It is extremely important that
your application either trusts or validates the Arrays it receives from
other sources.

Many Arrow implementations provide APIs to do such validation."

In terms of APIs, the Rust implementation always validates data from external sources, unless the validation is explicitly turned off with APIs marked as unsafe (a special Rust keyword).

alamb · 2026-01-16T11:40:56Z

docs/source/format/Security.rst

+uninitialized in a buffer if the array might be sent to, or read by, a untrusted
+third-party, even when the uninitialized data is logically irrelevant. The
+easiest way to do this, though perhaps not the most efficient, is to zero-initialize
+any buffer that will not be populated in full.


Perhaps we can emphasize that all bytes in an Arrow array, regardless of whether they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (such as uninitialized values).

Therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system.

alamb · 2026-01-16T11:42:10Z

docs/source/format/Security.rst

+''''''''''''''''
+
+If you produce a C Data Interface structure for data that nevertheless comes
+from an untrusted source (for example because you are writing a proxy to


I don't think this is any different from the other APIs: basically, "if you don't trust the producer source, you should always explicitly validate the arrays before processing them".

This doesn't seem any different for the C Data Interface than for the other APIs (like IPC files, etc.).

alamb · 2026-01-16T11:43:07Z

docs/source/format/Security.rst

+a trusted producer, for the reason explained above. However, it is still **recommended**
+that you validate it for soundness, as a trusted producer can have bugs anyway.
+
+IPC Format


As above, I think we could combine this into the section about validating data from untrusted sources, and give C Data Interface and IPC Format as examples of potentially untrusted sources.

raulcd · 2026-01-16T12:49:52Z

docs/source/format/Security.rst

+How to read this
+================
+
+Hereafter we try list potential security concerns when dealing with the various


Suggested change

-Hereafter we try list potential security concerns when dealing with the various
+Hereafter we try to list potential security concerns when dealing with the various

GH-48868: [Doc] Document security model for the Arrow formats

593babb

github-actions bot added Component: Documentation awaiting review Awaiting review labels Jan 15, 2026

raboof reviewed Jan 15, 2026

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 15, 2026

felipecrv reviewed Jan 15, 2026

github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jan 15, 2026

alamb reviewed Jan 16, 2026

raulcd reviewed Jan 16, 2026
