-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-48868: [Doc] Document security model for the Arrow formats #48870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@github-actions crossbow submit preview-docs |
|
Revision: 593babb Submitted crossbow builds: ursacomputing/crossbow @ actions-4f7018459b
|
raboof
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable (without any particular Arrow expertise)
(noticed two typo's)
| ------------------ | ||
|
|
||
| A less obvious pitfall is when some parts of an Arrow array are left uninitialized. | ||
| For example, if a element of a primitive Arrow array is marked null through its |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| For example, if a element of a primitive Arrow array is marked null through its | |
| For example, if an element of a primitive Arrow array is marked null through its |
| purposes. It is therefore tempting, when creating an array with null values, to | ||
| not initialize the corresponding value slots. | ||
|
|
||
| However, this then introduces a serious security if the Arrow data is serialized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| However, this then introduces a serious security if the Arrow data is serialized | |
| However, this then introduces a serious security risk if the Arrow data is serialized |
| uninitialized in a buffer if the array might be sent to, or read by, a untrusted | ||
| third-party, even when the uninitialized data is logically irrelevant. The | ||
| easiest way to do this, though perhaps not the most efficient, is to zero-initialize | ||
| any buffer that will not be populated in full. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth pointing out something about query engines and dataframe libraries deciding to not do so for internal/intermediate values in computations but applying a canonicalization pass when data leaves the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can emphasize that all bytes in an Arrow array, regardless if they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (like uninitialized values).
And therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system
| from an untrusted source (for example because you are writing a proxy to | ||
| an arbitrary third-party service), it is **recommended** that you validate | ||
| the data first, as the consumer may assume that the data is valid already. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| In addition to invalid pointers, some array types have offsets, sizes, and buffer indices that might be out-of-bounds. The library producing arrays through the | |
| C data interface might be performing only very light validation of these values. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @pitrou -- this is much needed and very helpful
I had some suggestions on structure. Hopefully they are helpful
| process' address space. As such, in-memory Arrow data should not be accessed | ||
| without care. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoiding a double negative might make this read better
| process' address space. As such, in-memory Arrow data should not be accessed | |
| without care. | |
| process' address space. As such, in-memory Arrow data should be accessed with care. |
| Reading and interpreting Arrow data involves reading into several buffers, | ||
| sometimes in non-trivial ways. This may for instance involve data-dependent | ||
| indirect addressing: to read a value from a Binary array, you need to | ||
| 1) read its offsets in buffer #2, and 2) read the range of bytes delimited by | ||
| these offsets in buffer #3. If the offsets are invalid (deliberately or not), | ||
| then step 2) can access invalid memory (potentially crashing the process) or | ||
| memory unrelated to Arrow (potentially allowing an attacker to exfiltrate | ||
| confidential data). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the core point of this paragraph may get a little lost in the specific details. I would suggest we start by explicitly stating the root cause of potential safety concerns. Something like
| Reading and interpreting Arrow data involves reading into several buffers, | |
| sometimes in non-trivial ways. This may for instance involve data-dependent | |
| indirect addressing: to read a value from a Binary array, you need to | |
| 1) read its offsets in buffer #2, and 2) read the range of bytes delimited by | |
| these offsets in buffer #3. If the offsets are invalid (deliberately or not), | |
| then step 2) can access invalid memory (potentially crashing the process) or | |
| memory unrelated to Arrow (potentially allowing an attacker to exfiltrate | |
| confidential data). | |
| Arrow is a low level memory format, and the contents of Arrow | |
| buffers are often combined and treated as pointers into the process | |
| memory space. Invalid Arrow data may cause invalid memory accesses | |
| (potentially crashing the process) or permit access to non-Arrow data | |
| (potentially allowing an attacker to exfiltrate confidential information). | |
| For example, reading and interpreting Arrow data involves reading into several buffers, | |
| sometimes in non-trivial ways. This may for instance involve data-dependent | |
| indirect addressing: to read a value from a Binary array, you need to | |
| 1) read its offsets in buffer #2, and 2) read the range of bytes delimited by | |
| these offsets in buffer #3. If the offsets are invalid (deliberately or not), | |
| then step 2) can access invalid memory. |
| Advice for users | ||
| '''''''''''''''' | ||
|
|
||
| If you receive Arrow in-memory data from an untrusted source, it is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we also make the point about performance here to give context about why
validation is not always performed
Perhaps something like this:
"Arrow implementations often assume Arrays follow the specification
to provide high speed processing. It is extremely important that
your application either trusts or validates the Arrays it receives from
other sources.
Many Arrow implementations provide APIs to do such validation.
In terms of APIs, the Rust implementation always validates data from external sources, unless the validation is explicitly turned off with APIs marked as unsafe (a special Rust keyword).
| uninitialized in a buffer if the array might be sent to, or read by, a untrusted | ||
| third-party, even when the uninitialized data is logically irrelevant. The | ||
| easiest way to do this, though perhaps not the most efficient, is to zero-initialize | ||
| any buffer that will not be populated in full. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can emphasize that all bytes in an Arrow array, regardless if they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (like uninitialized values).
And therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system
| '''''''''''''''' | ||
|
|
||
| If you produce a C Data Interface structure for data that nevertheless comes | ||
| from an untrusted source (for example because you are writing a proxy to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is any different than the other APIs -- basically "if you don't trust the producer source, you should always explicitly validate the arrays before processing them"
This doesn't seem any different for the C Data Interface than for the other APIs (like IPC files. etc)
| a trusted producer, for the reason explained above. However, it is still **recommended** | ||
| that you validate it for soundness, as a trusted producer can have bugs anyway. | ||
|
|
||
| IPC Format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, I think we could combine this into the section about validating data from untrusted sources, and give C Data Interface and IPC Format as examples of potentially untrusted sources.
| How to read this | ||
| ================ | ||
|
|
||
| Hereafter we try list potential security concerns when dealing with the various |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Hereafter we try list potential security concerns when dealing with the various | |
| Hereafter we try to list potential security concerns when dealing with the various |
Rationale for this change
Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.
What changes are included in this PR?
Add a Security Considerations page in the Format section.
Doc preview: https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html
Are these changes tested?
N/A
Are there any user-facing changes?
No.