diff --git a/docs/source/developers/cpp/fuzzing.rst b/docs/source/developers/cpp/fuzzing.rst index 7c8b346074a..4df5455de22 100644 --- a/docs/source/developers/cpp/fuzzing.rst +++ b/docs/source/developers/cpp/fuzzing.rst @@ -26,10 +26,10 @@ Fuzzing Arrow C++ To make the handling of invalid input more robust, we have enabled fuzz testing on several parts of the Arrow C++ feature set, currently: -* the IPC stream format -* the IPC file format -* the Parquet file format -* the CSV file format +* the IPC stream reader +* the IPC file reader +* the Parquet file reader +* the CSV file reader We welcome any contribution to expand the scope of fuzz testing and cover areas ingesting potentially invalid or malicious data. @@ -110,3 +110,23 @@ dependencies, you may need to install these before building the fuzz targets: $ conda install clang clangxx compiler-rt $ cmake .. --preset=fuzzing + + +.. _fuzz-regression-files: + +Regression files +================ + +When a fuzzer-detected bug is found and fixed, the corresponding reproducer +must be stored in the `arrow-testing `__ +repository to ensure that the code doesn't regress. + +The locations for these files are as follows: + +* IPC streams: in the ``data/arrow-ipc-stream`` `directory `__ +* IPC files: in the ``data/arrow-ipc-file`` `directory `__ +* Parquet files: in the ``data/parquet/fuzzing`` `directory `__ +* CSV files: in the ``data/csv/fuzzing`` `directory `__ + +Most of those files are invalid files for their respective formats and stress +proper error detection and reporting in the implementation code. diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst index 697e7627d89..41b94aa0a83 100644 --- a/docs/source/format/CanonicalExtensions.rst +++ b/docs/source/format/CanonicalExtensions.rst @@ -285,6 +285,8 @@ UUID A specific UUID version is not required or guaranteed. This extension represents UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way. +.. _opaque_extension: + Opaque ====== diff --git a/docs/source/format/Integration.rst b/docs/source/format/Integration.rst index ca88a825dc8..5038132241c 100644 --- a/docs/source/format/Integration.rst +++ b/docs/source/format/Integration.rst @@ -561,6 +561,8 @@ in ``datagen.py``): * Extension Types +.. _format-gold-integration-files: + Gold File Integration Tests ~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/source/format/Security.rst b/docs/source/format/Security.rst new file mode 100644 index 00000000000..0c117fe1e21 --- /dev/null +++ b/docs/source/format/Security.rst @@ -0,0 +1,277 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +.. 
_format_security:

***********************
Security Considerations
***********************

This document describes security considerations when reading Arrow
data from untrusted sources. It focuses specifically on data passed in a
standardized serialized form (such as an IPC stream), as opposed to an
implementation-specific native representation (such as ``arrow::Array`` in C++).

.. note::
   Implementation-specific concerns, such as bad API usage, are out of scope
   for this document. Please refer to the implementation's own documentation.


Who should read this
====================

You should read this document if you belong to either of these two categories:

1. *users* of Arrow: that is, developers of third-party libraries or applications
   that don't directly implement the Arrow formats or protocols, but instead
   call language-specific APIs provided by an Arrow library (as defined below);

2. *implementors* of Arrow libraries: that is, developers of libraries that provide
   APIs abstracting away the details of the Arrow formats and protocols; such
   libraries include, but are not limited to, the official Arrow implementations
   documented on https://arrow.apache.org.


Columnar Format
===============

Invalid data
------------

The Arrow :ref:`columnar format <format_columnar>` is a binary representation
designed for performance and efficiency. While the format does not store raw
pointers, the contents of Arrow buffers are often combined and converted to
pointers into the process' address space. Invalid Arrow data may therefore
cause invalid memory accesses (potentially crashing the process) or access to
non-Arrow data (potentially allowing an attacker to exfiltrate confidential
information).

For instance, to read a value from a Binary array, one needs to 1) read the
value's offsets from the array's offsets buffer, and 2) read the range of bytes
delimited by these offsets in the array's data buffer. If the offsets are
invalid (deliberately or not), then step 2) can access memory outside of the
data buffer's range.

Another instance of invalid data lies in the values themselves. For example,
a String array is only allowed to contain valid UTF-8 data, but an untrusted
source might have emitted invalid UTF-8 under the guise of a String array.
An unsuspecting algorithm that is only specified for valid UTF-8 inputs might
exhibit dangerous behavior (for example reading memory out of bounds when
looking for a UTF-8 character boundary).

Fortunately, knowing its schema, it is possible to validate Arrow data up front,
so that reading this data will not pose any danger later on.

.. TODO:
   For each layout, we should list the associated security risks and the recommended
   steps to validate (perhaps in Columnar.rst)

Advice for users
''''''''''''''''

Arrow implementations often assume that inputs follow the specification in order
to provide high-speed processing. It is **strongly recommended** that your
application explicitly validates any Arrow data it receives in serialized form
from untrusted sources. Many Arrow implementations provide explicit APIs to
perform such validation.

.. TODO: link to some validation APIs for the main implementations here?

Advice for implementors
'''''''''''''''''''''''

It is **recommended** that you provide dedicated APIs to validate Arrow arrays
and/or record batches. Users will be able to use those APIs to check whether
data coming from untrusted sources can be safely accessed.
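
For illustration, a consumer-side check using such an API might look like the
following sketch in C++ (``ConsumeUntrustedBatch`` is a hypothetical helper;
``ValidateFull`` is the name of the data-level validation pass in the C++
implementation, and other implementations expose similar entry points under
different names):

.. code-block:: cpp

   #include <memory>

   #include "arrow/api.h"

   // Validate a record batch received from an untrusted source before handing
   // it to code that assumes specification-compliant data.
   arrow::Status ConsumeUntrustedBatch(
       const std::shared_ptr<arrow::RecordBatch>& batch) {
     // ValidateFull() performs structural checks as well as expensive
     // data-level checks (e.g. offsets within bounds, UTF-8 well-formedness).
     ARROW_RETURN_NOT_OK(batch->ValidateFull());
     // The batch can now be processed under the usual assumptions.
     return arrow::Status::OK();
   }
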
A typical validation API must return a well-defined error, not crash, if the
given Arrow data is invalid; it must always be safe to execute regardless of
whether the data is valid or not.

Uninitialized data
------------------

A less obvious pitfall is when some parts of an Arrow array are left uninitialized.
For example, if an element of a primitive Arrow array is marked null through its
validity bitmap, the corresponding value slot in the values buffer can be ignored
for all intents and purposes. It is therefore tempting, when creating an array
with null values, not to initialize the corresponding value slots.

However, this introduces a serious security risk if the Arrow data is
serialized and published (e.g. using IPC or Flight) such that it can be
accessed by untrusted users. Indeed, the uninitialized value slot can
reveal data left by a previous memory allocation made in the same process.
Depending on the application, this data could contain confidential information.

Advice for users and implementors
'''''''''''''''''''''''''''''''''

When creating an Arrow array, it is **recommended** that you never leave any
data uninitialized in a buffer if the array might be sent to, or read by, an
untrusted third party, even when the uninitialized data is logically
irrelevant. The easiest way to do this is to zero-initialize any buffer that
will not be populated in full.

If it is determined, through benchmarking, that zero-initialization imposes
an excessive performance cost, a library or application may instead decide
to use uninitialized memory internally as an optimization; but it should then
ensure that all such uninitialized values are cleared before passing the Arrow
data to another system.

.. note::
   Sending Arrow data out of the current process can happen *indirectly*,
   for example if you produce it over the C Data Interface and the consumer
   persists it using the IPC format on some public storage.


C Data Interface
================

The C Data Interface contains raw pointers into the process' address space.
It is generally not possible to validate that those pointers are legitimate;
reading from such a pointer may crash or access unrelated or bogus data.

Advice for users
----------------

You should **never** consume a C Data Interface structure from an untrusted
producer, as it is by construction impossible to guard against dangerous
behavior in this case.

Advice for implementors
-----------------------

When consuming a C Data Interface structure, you can assume that it comes from
a trusted producer, for the reason explained above. However, it is still
**recommended** that you validate it for soundness (for example that the right
number of buffers is passed for a given data type), as a trusted producer can
have bugs anyway.


IPC Format
==========

The :ref:`IPC format <ipc-message-format>` is a serialization format for the
columnar format with associated metadata. Reading an IPC stream or file from
an untrusted source comes with similar caveats as reading the Arrow columnar
format.

The additional framing and metadata in the IPC format come with their own
risks. For example, buffer offsets and sizes encoded in IPC messages may be
out of bounds for the IPC stream; Flatbuffers-encoded metadata payloads may
carry incorrect offsets pointing outside of the designated metadata area.
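
The recommended defense, detailed in the advice below, is to validate the
decoded data before acting on it. A minimal consumer-side sketch, assuming the
C++ implementation (``ReadUntrustedStream`` is a hypothetical helper and
``input`` can be any input stream, such as a file or an in-memory buffer):

.. code-block:: cpp

   #include <memory>
   #include <utility>

   #include "arrow/api.h"
   #include "arrow/io/api.h"
   #include "arrow/ipc/api.h"

   // Read record batches from an untrusted IPC stream, validating each one
   // before it is used.
   arrow::Status ReadUntrustedStream(std::shared_ptr<arrow::io::InputStream> input) {
     ARROW_ASSIGN_OR_RAISE(
         auto reader, arrow::ipc::RecordBatchStreamReader::Open(std::move(input)));
     while (true) {
       std::shared_ptr<arrow::RecordBatch> batch;
       ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
       if (batch == nullptr) break;  // End of stream reached.
       // Structural decoding succeeded; now validate the array data itself.
       ARROW_RETURN_NOT_OK(batch->ValidateFull());
       // ... process the validated batch ...
     }
     return arrow::Status::OK();
   }

The same pattern applies to the IPC file format, using the corresponding file
reader instead of the stream reader.
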
Advice for users
----------------

Arrow libraries will typically ensure that IPC streams are structurally valid,
but may not also validate the underlying array data. It is **strongly recommended**
that you use the appropriate APIs to validate the Arrow data read from an
untrusted IPC stream.

Advice for implementors
-----------------------

It is **strongly recommended** to run dedicated validation checks when decoding
the IPC format, to make sure that the decoding cannot induce unwanted behavior.
When such a check fails, a well-defined error should be returned to the caller
instead of crashing.


Extension Types
===============

Extension types typically register a custom deserialization hook so that they
can be automatically recreated when reading from an external source (for example
using IPC). The deserialization hook has to decode the extension type's parameters
from a string or binary payload specific to the extension type.
:ref:`Typical examples <opaque_extension>` use a bespoke JSON representation
with object fields representing the various parameters.

When reading data from an untrusted source, any registered deserialization hook
could be called with an arbitrary payload. It is therefore of primary importance
that the hook be safe to call on invalid, potentially malicious, data. This mandates
the use of a robust metadata serialization scheme (such as JSON, but not Python's
`pickle `__ or R's
`serialize() `__,
for example).

Advice for users and implementors
---------------------------------

When designing an extension type, it is **strongly recommended** to choose a
metadata serialization format that is robust against potentially malicious
data.

When implementing an extension type, it is **recommended** to ensure that the
deserialization hook is able to detect an invalid serialized metadata payload
and error out gracefully.


Testing for robustness
======================

Advice for implementors
-----------------------

For APIs that may process untrusted inputs, it is **strongly recommended**
that your unit tests exercise those APIs against typical kinds of invalid data.
For example, your validation APIs will have to be tested against invalid Binary
or List offsets, invalid UTF-8 data in a String array, etc.

Testing against known regression files
''''''''''''''''''''''''''''''''''''''

The `arrow-testing `__ repository
contains regression files for various formats, such as the IPC format.

Two categories of files are especially noteworthy and can serve to exercise
an Arrow implementation's robustness:

1. :ref:`gold integration files <format-gold-integration-files>` that are valid
   files exercising compliance with Arrow IPC features;
2. :ref:`fuzz regression files <fuzz-regression-files>` that have been automatically
   generated each time a fuzzer found a bug triggered by a specific (usually invalid)
   input for a given format.

Fuzzing
'''''''

It is **recommended** that you go one step further and set up some kind of
automated robustness testing against unforeseen inputs. One typical approach
is through fuzzing, possibly coupled with a runtime instrumentation framework
that detects dangerous behavior (such as AddressSanitizer in C++ or Rust).

A reasonable way of setting up fuzzing for Arrow is to use the IPC format as
the binary payload: the fuzz target should not only attempt to decode the IPC
stream as Arrow data, but should then also validate the decoded Arrow data.
This will strengthen both the IPC decoder and the validation routines
against invalid, potentially malicious data.
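
As an illustration, a minimal libFuzzer-style harness along those lines could
look as follows (a sketch assuming the C++ implementation; it is not one of
the project's actual fuzz targets):

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>
   #include <memory>

   #include "arrow/api.h"
   #include "arrow/io/api.h"
   #include "arrow/ipc/api.h"

   extern "C" int LLVMFuzzerTestOneInput(const uint8_t* data, size_t size) {
     // Wrap the fuzzer-provided bytes as an in-memory IPC stream.
     auto buffer = std::make_shared<arrow::Buffer>(data, static_cast<int64_t>(size));
     auto input = std::make_shared<arrow::io::BufferReader>(buffer);
     auto maybe_reader = arrow::ipc::RecordBatchStreamReader::Open(input);
     if (!maybe_reader.ok()) return 0;  // Cleanly rejected invalid input: not a bug.
     auto reader = *maybe_reader;
     while (true) {
       std::shared_ptr<arrow::RecordBatch> batch;
       if (!reader->ReadNext(&batch).ok()) return 0;
       if (batch == nullptr) break;  // End of stream.
       // Decoding succeeded; validation must also be safe on arbitrary input.
       if (!batch->ValidateFull().ok()) return 0;
     }
     return 0;
   }
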
Finally, if validation succeeds, the fuzz target may exercise some important
core functionality, such as printing the data for human display; this will
help ensure that the validation routine did not let through invalid data that
may lead to dangerous behavior.


Non-Arrow formats and protocols
===============================

Arrow data can also be sent or stored using third-party formats such as Apache
Parquet. Those formats may or may not present the same security risks as listed
above (for example, the precautions around uninitialized data may not apply
to a format like Parquet that does not create any value slots for null elements).
We suggest you refer to these projects' own documentation for more concrete
guidelines.
diff --git a/docs/source/format/index.rst b/docs/source/format/index.rst
index 91912a5325d..bbcc3ec6211 100644
--- a/docs/source/format/index.rst
+++ b/docs/source/format/index.rst
@@ -37,5 +37,6 @@ Specifications
    Flight
    FlightSql
    ADBC
+   Security
    Integration
    Glossary