Merged
28 changes: 24 additions & 4 deletions docs/source/developers/cpp/fuzzing.rst
@@ -26,10 +26,10 @@ Fuzzing Arrow C++
To make the handling of invalid input more robust, we have enabled
fuzz testing on several parts of the Arrow C++ feature set, currently:

-* the IPC stream format
-* the IPC file format
-* the Parquet file format
-* the CSV file format
+* the IPC stream reader
+* the IPC file reader
+* the Parquet file reader
+* the CSV file reader

We welcome any contribution to expand the scope of fuzz testing and cover
areas ingesting potentially invalid or malicious data.
@@ -110,3 +110,23 @@ dependencies, you may need to install these before building the fuzz targets:
$ conda install clang clangxx compiler-rt
$ cmake .. --preset=fuzzing
.. _fuzz-regression-files:

Regression files
================

When a fuzzer-detected bug is found and fixed, the corresponding reproducer
must be stored in the `arrow-testing <https://github.com/apache/arrow-testing/>`__
repository to ensure that the code doesn't regress.

The locations for these files are as follows:

* IPC streams: in the ``data/arrow-ipc-stream`` `directory <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream>`__
* IPC files: in the ``data/arrow-ipc-file`` `directory <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-file>`__
* Parquet files: in the ``data/parquet/fuzzing`` `directory <https://github.com/apache/arrow-testing/tree/master/data/parquet/fuzzing>`__
* CSV files: in the ``data/csv/fuzzing`` `directory <https://github.com/apache/arrow-testing/tree/master/data/csv/fuzzing>`__

Most of these files are invalid for their respective formats and stress
proper error detection and reporting in the implementation code.
2 changes: 2 additions & 0 deletions docs/source/format/CanonicalExtensions.rst
@@ -285,6 +285,8 @@ UUID
A specific UUID version is not required or guaranteed. This extension represents
UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.

.. _opaque_extension:

Opaque
======

2 changes: 2 additions & 0 deletions docs/source/format/Integration.rst
@@ -561,6 +561,8 @@ in ``datagen.py``):
* Extension Types


.. _format-gold-integration-files:

Gold File Integration Tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~

277 changes: 277 additions & 0 deletions docs/source/format/Security.rst
@@ -0,0 +1,277 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _format_security:

***********************
Security Considerations
***********************

This document describes security considerations when reading Arrow
data from untrusted sources. It focuses specifically on data passed in a
standardized serialized form (such as an IPC stream), as opposed to an
implementation-specific native representation (such as ``arrow::Array`` in C++).

.. note::
   Implementation-specific concerns, such as bad API usage, are out of scope
   for this document. Please refer to the implementation's own documentation.


Who should read this
====================

You should read this document if you belong to either of these two categories:

1. *users* of Arrow: that is, developers of third-party libraries or applications
   that don't directly implement the Arrow formats or protocols, but instead
   call language-specific APIs provided by an Arrow library (as defined below);

2. *implementors* of Arrow libraries: that is, developers of libraries that
   provide APIs abstracting away the details of the Arrow formats and protocols;
   such libraries include, but are not limited to, the official Arrow
   implementations documented on https://arrow.apache.org.


Columnar Format
===============

Invalid data
------------

The Arrow :ref:`columnar format <format_columnar>` is a binary
representation designed for performance and efficiency. While the format
does not store raw pointers, the contents of Arrow buffers are often
combined and converted to pointers into the process' address space.
Invalid Arrow data may therefore cause invalid memory accesses
(potentially crashing the process) or access to non-Arrow data
(potentially allowing an attacker to exfiltrate confidential information).

For instance, to read a value from a Binary array, one needs to 1) read the
value's offsets from the array's offsets buffer, and 2) read the range of bytes
delimited by these offsets in the array's data buffer. If the offsets are
invalid (deliberately or not), then step 2) can access memory outside of the
data buffer's range.
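
As an illustration, here is a minimal sketch (not actual Arrow library code;
the struct and helper names are made up) of how such an out-of-bounds access
arises, and how an up-front check over the offsets prevents it:

.. code-block:: cpp

   #include <cstdint>
   #include <string_view>

   // Simplified view of a Binary array's buffers (illustrative only).
   struct BinaryArrayView {
     const int32_t* offsets;  // `length + 1` entries
     const uint8_t* data;     // data buffer
     int64_t data_length;     // size of the data buffer, in bytes
   };

   // Step 1) read the value's offsets; step 2) slice the data buffer.
   // If any offset lies outside [0, data_length], or offsets decrease,
   // this reads memory outside the data buffer.
   std::string_view GetValueUnchecked(const BinaryArrayView& a, int64_t i) {
     const int32_t begin = a.offsets[i];
     const int32_t end = a.offsets[i + 1];
     return {reinterpret_cast<const char*>(a.data) + begin,
             static_cast<size_t>(end - begin)};
   }

   // Checking every offset once up front makes later accesses safe.
   bool ValidateOffsets(const BinaryArrayView& a, int64_t length) {
     for (int64_t i = 0; i <= length; ++i) {
       if (a.offsets[i] < 0 || a.offsets[i] > a.data_length) return false;
       if (i > 0 && a.offsets[i] < a.offsets[i - 1]) return false;
     }
     return true;
   }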

Another instance of invalid data lies in the values themselves. For example,
a String array is only allowed to contain valid UTF-8 data, but an untrusted
source might have emitted invalid UTF-8 under the guise of a String array.
An unsuspecting algorithm that is only specified for valid UTF-8 inputs might
then exhibit dangerous behavior (for example, reading memory out of bounds
when looking for a UTF-8 character boundary).

Fortunately, given its schema, Arrow data can be validated up front,
so that reading this data will not pose any danger later on.

.. TODO:
   For each layout, we should list the associated security risks and the
   recommended steps to validate (perhaps in Columnar.rst)
Comment on lines +79 to +81

Member: I agree this would be nice, or perhaps a corpus of IPC streams that
should fail validation for specific reasons.

Member Author: Hmm, actually, a corpus of IPC streams already exists in
https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream .
Perhaps we should move them to a dedicated subdirectory for readability, and
then mention them in this document?

Member: I think those are in a good place (I found them when I was writing the
IPC reader and tested them). Those files don't necessarily exercise specific
types of failures, and many of the basic ones (e.g., a string array with an
out-of-range offset) were covered before there were any fuzz failures.

Member Author: I agree, we'd need a more deliberately generated corpus of
invalid files. Unfortunately I don't think we have any tools to help us do
that currently.

Contributor: It seems like links to the invalid fuzz data have been added
above. I personally think this PR is already hugely valuable, even without
such a list. Thus I suggest we merge this PR and get it published. We can file
an issue to track adding type-specific security risks as a follow-on.

Advice for users
''''''''''''''''

Arrow implementations often assume inputs follow the specification in order
to provide high-speed processing. It is **extremely recommended** that your
application explicitly validates any Arrow data it receives in serialized form
from untrusted sources. Many Arrow implementations provide explicit APIs to
perform such validation.

.. TODO: link to some validation APIs for the main implementations here?
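
As one concrete illustration, here is a sketch of reading an IPC stream with
Arrow C++ and validating each batch before use (``ValidateFull`` is the
relevant Arrow C++ API; other implementations expose similar entry points
under different names):

.. code-block:: cpp

   #include <arrow/api.h>
   #include <arrow/io/api.h>
   #include <arrow/ipc/api.h>

   arrow::Status ReadValidated(std::shared_ptr<arrow::io::InputStream> input) {
     ARROW_ASSIGN_OR_RAISE(auto reader,
                           arrow::ipc::RecordBatchStreamReader::Open(input));
     while (true) {
       std::shared_ptr<arrow::RecordBatch> batch;
       ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
       if (batch == nullptr) break;  // end of stream
       // Thorough O(n) validation of offsets, UTF-8 data, etc.
       ARROW_RETURN_NOT_OK(batch->ValidateFull());
       // ... the batch can now be processed safely ...
     }
     return arrow::Status::OK();
   }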

Advice for implementors
'''''''''''''''''''''''

It is **recommended** that you provide dedicated APIs to validate Arrow arrays
and/or record batches. Users will be able to use those APIs to determine whether
data coming from untrusted sources can be safely accessed.

A typical validation API must return a well-defined error, not crash, if the
given Arrow data is invalid; it must always be safe to execute, regardless of
whether the data is valid.

Member (suggested change):

   For APIs that read serialized data (e.g., an IPC reader or function that
   deserializes IPC-formatted bytes into an object), it is **(extremely?)
   recommended** that validation is performed. If the ability to turn off
   validation is exposed as an option for performance reasons when reading
   trusted data, it is **recommended** that validation is still performed by
   default and the security implications of skipping validation are
   documented.

Member Author (@pitrou, Jan 28, 2026): Well, we would be recommending
something that we don't do currently :-)

(and the reason we don't do it is that it can be costly, e.g. ValidateFull is
O(n) for some datatypes)

Member: Ah, Rust and nanoarrow both do this by default. It seems odd to me
that the default behaviour of reading a file would be a mode that could cause
a crash, but perhaps this is normal.

Member Author: Can you point me to the relevant piece of code in nanoarrow?

Member Author: It seems that Arrow Java doesn't validate automatically either,
though I can't say for sure. @lidavidm @jbonofre

Member: arrow-java barely has a validation API in the first place :/
Unfortunately, its extensive use of unsafe also means that the JVM doesn't
protect us.

Member Author:

   Unfortunately, its extensive use of unsafe also means that the JVM doesn't
   protect us.

Wasn't there talk of switching to the newer FFM facility? Perhaps that hasn't
happened.

Member: There's been talk, but no action; the bigger issue is that it would
mean dropping Java <24 or so. Given we still have people pop up occasionally
trying to use Java 8, that may be a hard sell; I think Spark has only just
moved to Java 17.

Contributor: BTW, the way the Rust API works is that by default, data read
from IPC is explicitly validated (which is indeed quite expensive).

It is possible to turn this validation off via an unsafe API (the unsafe bit
means the calling application has to explicitly disable the validation to
acknowledge it is trusting the source):

https://docs.rs/arrow-ipc/57.2.0/arrow_ipc/reader/struct.FileDecoder.html#method.with_skip_validation

Uninitialized data
------------------

A less obvious pitfall is when some parts of an Arrow array are left uninitialized.
For example, if an element of a primitive Arrow array is marked null through its
validity bitmap, the corresponding value slot in the values buffer can be ignored
for all purposes. It is therefore tempting, when creating an array with null
values, to not initialize the corresponding value slots.

However, this then introduces a serious security risk if the Arrow data is
serialized and published (e.g. using IPC or Flight) such that it can be
accessed by untrusted users. Indeed, the uninitialized value slot can
reveal data left by a previous memory allocation made in the same process.
Depending on the application, this data could contain confidential information.

Advice for users and implementors
'''''''''''''''''''''''''''''''''

When creating an Arrow array, it is **recommended** that you never leave any
data uninitialized in a buffer if the array might be sent to, or read by, an
untrusted third party, even when the uninitialized data is logically
irrelevant. The easiest way to do this is to zero-initialize any buffer that
will not be populated in full.

If it is determined, through benchmarking, that zero-initialization imposes
an excessive performance cost, a library or application may instead decide
to use uninitialized memory internally as an optimization; but it should then
ensure all such uninitialized values are cleared before passing the Arrow data
to another system.
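
For example, here is a sketch of the zero-initialization approach with
Arrow C++ (``AllocateBuffer`` returns uninitialized memory; the helper name
and payload layout are illustrative):

.. code-block:: cpp

   #include <arrow/api.h>
   #include <cstring>

   arrow::Result<std::shared_ptr<arrow::Buffer>> MakeValuesBuffer(int64_t length) {
     // AllocateBuffer returns *uninitialized* memory...
     ARROW_ASSIGN_OR_RAISE(auto buffer,
                           arrow::AllocateBuffer(length * sizeof(int32_t)));
     // ... so clear it up front: slots for null entries are never written
     // below, but must not leak a previous allocation's contents.
     std::memset(buffer->mutable_data(), 0, buffer->size());
     // ... write the non-null value slots here ...
     return std::shared_ptr<arrow::Buffer>(std::move(buffer));
   }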

.. note::
   Sending Arrow data out of the current process can happen *indirectly*,
   for example if you produce it over the C Data Interface and the consumer
   persists it using the IPC format on some public storage.


C Data Interface
================

The C Data Interface contains raw pointers into the process' address space.
It is generally not possible to validate that those pointers are legitimate;
reading from such a pointer may crash or access unrelated or bogus data.

Advice for users
----------------

You should **never** consume a C Data Interface structure from an untrusted
producer, as it is by construction impossible to guard against dangerous
behavior in this case.

Contributor: This is good and clear 👍

Advice for implementors
-----------------------

When consuming a C Data Interface structure, you can assume that it comes from
a trusted producer, for the reason explained above. However, it is still
**recommended** that you validate it for soundness (for example, that the right
number of buffers is passed for a given datatype), as even a trusted producer
can have bugs.
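
A soundness check of this kind might look as follows (a sketch against the
standardized ``ArrowArray`` struct; note that buffer *lengths* cannot be
verified, as the interface does not carry them):

.. code-block:: cpp

   #include <cstdint>

   // Standardized Arrow C Data Interface declaration (abridged).
   struct ArrowArray {
     int64_t length;
     int64_t null_count;
     int64_t offset;
     int64_t n_buffers;
     int64_t n_children;
     const void** buffers;
     struct ArrowArray** children;
     struct ArrowArray* dictionary;
     void (*release)(struct ArrowArray*);
     void* private_data;
   };

   // Soundness check for an int32 array (ArrowSchema format "i").
   bool SoundInt32Array(const ArrowArray* array) {
     if (array == nullptr || array->release == nullptr) return false;
     if (array->length < 0 || array->offset < 0) return false;
     // Primitive layout: buffers = {validity bitmap, values}.
     if (array->n_buffers != 2 || array->n_children != 0) return false;
     // The values buffer must be present for a non-empty array; the
     // validity bitmap may be NULL if there are no nulls.
     if (array->buffers[1] == nullptr && array->length > 0) return false;
     // Note: whether the pointers are *legitimate* cannot be checked.
     return true;
   }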


IPC Format
==========

Contributor: As above, I think we could combine this into the section about
validating data from untrusted sources, and give C Data Interface and IPC
Format as examples of potentially untrusted sources.

Member Author: "This" means the IPC format section?

Contributor: Yes -- I was thinking that if the guidance is the same for IPC
and C Data Interface, calling them out separately makes this documentation
more verbose than it could be.

The high-level overview in my mind is:

  1. If you read Arrow data (in any format) from an untrusted source, it can
     be a memory and security concern. Thus you should always verify such data.
  2. Implementers should provide a way to let users opt in / out of the
     verification depending on the source and their security threat model.

If there are specific things to check for specific formats, such as the
Flatbuffers offsets for IPC-based formats, then we could call them out in
those sections.

I am happy to propose some new wording if you like

Member Author: Well, the guidance is not the same. For the C Data Interface,
it is simply impossible (to my knowledge) to protect against rogue raw
pointers. For IPC, validation is possible to ensure safe operation.

Member: Yes, there's no way to meaningfully validate an ArrowArray other than
perhaps to check pointers for unexpected NULL. This is because the ArrowArray
does not provide buffer lengths (the consumer has to infer those based on the
ArrowSchema and/or the buffer data of previous buffers). This is roughly the
same as receiving an Arrow C++ or arrow-rs array from another library: the
consumer has to assume it was correctly produced to avoid a crash, and a
malicious consumer can always attempt to read byte ranges it isn't supposed
to, based on the buffer pointers.

The :ref:`IPC format <ipc-message-format>` is a serialization format for the
columnar format with associated metadata. Reading an IPC stream or file from
an untrusted source comes with similar caveats as reading the Arrow columnar
format.

The additional framing and metadata in the IPC format come with
their own risks. For example, buffer offsets and sizes encoded in IPC messages
may be out of bounds for the IPC stream; Flatbuffers-encoded metadata payloads
may carry incorrect offsets pointing outside of the designated metadata area.
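
For instance, the framing of an IPC encapsulated message can be checked before
its Flatbuffers metadata is handed to a verifier. A minimal sketch (not actual
Arrow library code):

.. code-block:: cpp

   #include <cstdint>
   #include <cstring>

   // An encapsulated IPC message starts with the continuation marker
   // 0xFFFFFFFF, followed by a little-endian int32 metadata length.
   bool CheckMessageFraming(const uint8_t* buf, int64_t buf_size) {
     if (buf_size < 8) return false;
     uint32_t continuation;
     std::memcpy(&continuation, buf, 4);
     if (continuation != 0xFFFFFFFFu) return false;
     int32_t metadata_size;
     std::memcpy(&metadata_size, buf + 4, 4);
     // Reject negative lengths and lengths pointing past the end of the
     // received bytes; otherwise the metadata verifier would read OOB.
     return metadata_size >= 0 && metadata_size <= buf_size - 8;
   }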

Advice for users
----------------

Arrow libraries will typically ensure IPC streams are structurally valid,
but may not also validate the underlying array data. It is **extremely recommended**
that you use the appropriate APIs to validate any Arrow data read from an untrusted IPC stream.

Advice for implementors
-----------------------

It is **extremely recommended** to run dedicated validation checks when decoding
the IPC format, to make sure that the decoding cannot induce unwanted behavior.
Failing those checks should return a well-defined error to the caller, not crash.


Extension Types
===============

Extension types typically register a custom deserialization hook so that they
can be automatically recreated when reading from an external source (for example
using IPC). The deserialization hook has to decode the extension type's parameters
from a string or binary payload specific to the extension type.
:ref:`Typical examples <opaque_extension>` use a bespoke JSON representation
with object fields representing the various parameters.

When reading data from an untrusted source, any registered deserialization hook
could be called with an arbitrary payload. It is therefore of primary importance
that the hook be safe to call on invalid, potentially malicious, data. This mandates
the use of a robust metadata serialization schema (such as JSON, but not Python's
`pickle <https://docs.python.org/3/library/pickle.html>`__ or R's
`serialize() <https://stat.ethz.ch/R-manual/R-devel/library/base/html/serialize.html>`__,
for example).

Advice for users and implementors
---------------------------------

When designing an extension type, it is **extremely recommended** to choose a
metadata serialization format that is robust against potentially malicious
data.

When implementing an extension type, it is **recommended** to ensure that the
deserialization hook detects invalid serialized metadata payloads and errors
out gracefully.
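
As an illustration, here is a sketch of a defensive parameter parser for a
hypothetical extension type whose metadata payload is expected to look like
``precision=<digits>`` (the payload format and names are made up):

.. code-block:: cpp

   #include <optional>
   #include <string_view>

   // Parse a metadata payload expected to look like "precision=<digits>",
   // rejecting anything malformed instead of crashing or misbehaving.
   std::optional<int> ParseMyExtMetadata(std::string_view payload) {
     constexpr std::string_view kKey = "precision=";
     if (payload.substr(0, kKey.size()) != kKey) return std::nullopt;
     payload.remove_prefix(kKey.size());
     if (payload.empty()) return std::nullopt;  // missing value
     int value = 0;
     for (char c : payload) {
       if (c < '0' || c > '9') return std::nullopt;  // reject junk bytes
       if (value > 1000000) return std::nullopt;     // reject absurd values
       value = value * 10 + (c - '0');
     }
     return value;
   }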


Testing for robustness
======================

Advice for implementors
-----------------------

For APIs that may process untrusted inputs, it is **extremely recommended**
that your unit tests exercise your APIs against typical kinds of invalid data.
For example, your validation APIs will have to be tested against invalid Binary
or List offsets, invalid UTF-8 data in a String array, etc.

Member: I think this applies to both serialized and unserialized inputs

Member Author: A more general guideline that isn't put forth in this document
(but perhaps we should?) is that it's preferable to validate untrusted data at
the Arrow deserialization/conversion boundary (for example deserializing IPC
or decoding Parquet). As a consequence, all already-deserialized data should
already be trusted.

Can you think of counter-examples?

Member: I can't think of counter-examples (other than functions like
from_buffers() that assemble untrusted buffers into an array). I think this is
the distinction I was trying to get at, however we want to phrase it.
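
Along those lines, a test sketch (GTest with Arrow C++; the corruption step is
illustrative and assumes the builder-produced offsets buffer is mutable)
checking that validation rejects an out-of-range Binary offset:

.. code-block:: cpp

   #include <arrow/api.h>
   #include <gtest/gtest.h>

   TEST(Validation, RejectsOutOfRangeBinaryOffset) {
     // Build a small, valid Binary array...
     arrow::BinaryBuilder builder;
     ASSERT_TRUE(builder.Append("abc").ok());
     ASSERT_TRUE(builder.Append("de").ok());
     std::shared_ptr<arrow::Array> array;
     ASSERT_TRUE(builder.Finish(&array).ok());
     ASSERT_TRUE(array->ValidateFull().ok());

     // ... then corrupt its last offset to point past the data buffer.
     auto* offsets = reinterpret_cast<int32_t*>(
         array->data()->buffers[1]->mutable_data());
     offsets[array->length()] = 1 << 20;
     ASSERT_FALSE(array->ValidateFull().ok());
   }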

Testing against known regression files
''''''''''''''''''''''''''''''''''''''

The `arrow-testing <https://github.com/apache/arrow-testing/>`__ repository
contains regression files for various formats, such as the IPC format.

Two categories of files are especially noteworthy and can serve to exercise
an Arrow implementation's robustness:

1. :ref:`gold integration files <format-gold-integration-files>`, which are valid
   files exercising compliance with Arrow IPC features;
2. :ref:`fuzz regression files <fuzz-regression-files>`, which have been automatically
   generated each time a fuzzer found a bug triggered by a specific (usually invalid)
   input for a given format.

Fuzzing
'''''''

It is **recommended** that you go one step further and set up some kind of
automated robustness testing against unforeseen inputs. One typical approach
is through fuzzing, possibly coupled with a runtime instrumentation framework
that detects dangerous behavior (such as AddressSanitizer in C++ or Rust).

A reasonable way of setting up fuzzing for Arrow is to use the IPC format as
the binary payload: the fuzz target should not only attempt to decode the IPC
stream as Arrow data, but should then validate the decoded data.
This strengthens both the IPC decoder and the validation routines
against invalid, potentially malicious data. Finally, if validation succeeds,
the fuzz target may exercise some important core functionality,
such as printing the data for human display; this helps ensure that the
validation routine did not let through invalid data that may lead to dangerous
behavior.
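
A sketch of such a fuzz target, using libFuzzer's entry point and Arrow C++
(adapt the reader and validation calls to your implementation):

.. code-block:: cpp

   #include <arrow/api.h>
   #include <arrow/io/memory.h>
   #include <arrow/ipc/api.h>

   extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
     auto buffer = std::make_shared<arrow::Buffer>(data, size);
     auto input = std::make_shared<arrow::io::BufferReader>(buffer);
     // 1) Decoding must never crash on arbitrary bytes.
     auto maybe_reader = arrow::ipc::RecordBatchStreamReader::Open(input);
     if (!maybe_reader.ok()) return 0;
     std::shared_ptr<arrow::RecordBatch> batch;
     while ((*maybe_reader)->ReadNext(&batch).ok() && batch != nullptr) {
       // 2) Validation must reject any invalid decoded data.
       if (!batch->ValidateFull().ok()) return 0;
       // 3) Exercise core functionality on data that passed validation.
       batch->ToString();
     }
     return 0;
   }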


Non-Arrow formats and protocols
===============================

Arrow data can also be sent or stored using third-party formats such as Apache
Parquet. Those formats may or may not present the same security risks as listed
above (for example, the precautions around uninitialized data may not apply
in a format like Parquet that does not create any value slots for null elements).
We suggest you refer to these projects' own documentation for more concrete
guidelines.
1 change: 1 addition & 0 deletions docs/source/format/index.rst
@@ -37,5 +37,6 @@ Specifications
Flight
FlightSql
ADBC
Security
Integration
Glossary