Summary

We would like to add a Tensor type to Vortex as an extension over FixedSizeList. This RFC proposes the design of a fixed-shape tensor with contiguous backing memory.

Motivation

Tensors in the wild

Tensors are multi-dimensional (n-dimensional) arrays that generalize vectors (1D) and matrices (2D) to arbitrary dimensions. They are quite common in ML/AI and scientific computing applications. To name just a few examples:

  • Image or video data stored as height x width x channels
  • Multi-dimensional sensor or time-series data
  • Embedding vectors from language models and recommendation systems

Tensors in Vortex

In the current version of Vortex, there are two ways to represent fixed-shape tensors using the FixedSizeList DType, and neither seems satisfactory.

The simplest approach is to flatten the tensor into a single FixedSizeList<n> whose size is the product of all dimensions (this is what Apache Arrow does). However, this discards shape information entirely: a 2x3 matrix and a 3x2 matrix would both become FixedSizeList<6>. Shape metadata must be stored separately, and any dimension-aware operation (slicing along an axis, transposing, etc.) reduces to manual index arithmetic with no type-level guarantees.

The alternative is to nest FixedSizeList types, e.g., FixedSizeList<FixedSizeList<n>, m> for a matrix. This preserves some structure, but becomes unwieldy for higher-dimensional tensors. Axis-specific slicing or indexing on individual tensors (tensor scalars, not tensor arrays) would require custom expressions aware of the specific nesting depth, rather than operating on a single, uniform tensor type.

Additionally, reshaping requires restructuring the entire nested type, and operations like transpose would be difficult to implement correctly.

Beyond these structural issues, neither approach stores shape and stride metadata explicitly, making interoperability with external tensor libraries (NumPy, PyTorch, etc.) that expect contiguous memory with this metadata awkward.

Thus, we propose a dedicated extension type that encapsulates tensor semantics (shape, strides, dimension-aware operations) on top of contiguous, row-major (C-style) backing memory.

Design

Since the design of extension types has not been fully solved yet (see RFC #0005), we cannot yet specify the tensor design completely. However, we know enough to present the general idea.

Storage Type

Extension types in Vortex require defining a canonical storage type that represents what the extension array looks like when it is canonicalized. For tensors, we will want this storage type to be a FixedSizeList<p, s>, where p is a numeric type (like u8, f64, or a decimal type), and where s is the product of all dimensions of the tensor.

For example, if we want to represent a tensor of i32 with dimensions [2, 3, 4], the storage type for this tensor would be FixedSizeList<i32, 24> since 2 x 3 x 4 = 24.

This is equivalent to the design of Arrow's canonical Tensor extension type. For discussion on why we choose not to represent tensors as nested FSLs (for example FixedSizeList<FixedSizeList<FixedSizeList<i32, 2>, 3>, 4>), see the alternatives section.

Element Type

We restrict tensor element types to Primitive and Decimal. Tensors are fundamentally about dense numeric computation, and operations like transpose, reshape, and slicing rely on uniform, fixed-size elements whose offsets are computable from strides.

Variable-size types (like strings) would break this model entirely. Bool is excluded as well because Vortex bit-packs boolean arrays, which conflicts with byte-level stride arithmetic. This matches PyTorch, which also restricts tensors to numeric types.

Theoretically, we could allow more element types in the future, but it should remain a very low priority.

Validity

We define two layers of nullability for tensors: the tensor itself may be null (within a tensor array), and individual elements within a tensor may be null. However, we do not support nulling out entire sub-dimensions of a tensor (e.g., marking a whole row or slice as null).

The validity bitmap is flat (one bit per element) and follows the same contiguous layout as the backing data (just like FixedSizeList). This keeps stride-based access straightforward while still allowing sparse values within an otherwise dense tensor.
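Concretely, with a flat bitmap the validity bit for any element is found by plain index arithmetic, just as in FixedSizeList. A minimal sketch (the helper name is hypothetical):

```rust
/// Hypothetical sketch: index of the validity bit for flat element
/// `flat_elem_idx` of the `tensor_idx`-th tensor in a tensor array,
/// given a flat one-bit-per-element bitmap shared across the array.
fn validity_bit_index(tensor_idx: usize, shape: &[usize], flat_elem_idx: usize) -> usize {
    // Elements per tensor is the product of the shape dimensions.
    let elems_per_tensor: usize = shape.iter().product();
    assert!(flat_elem_idx < elems_per_tensor);
    tensor_idx * elems_per_tensor + flat_elem_idx
}
```

For an array of `[2, 3, 4]` tensors, element 5 of tensor 2 maps to bit `2 * 24 + 5 = 53`.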

Note that this design is specifically for a dense tensor. A sparse tensor would likely need to have a different representation (or different storage type) in order to compress better (likely List or ListView since it can compress runs of nulls very well).

Metadata

Theoretically, we only need the dimensions of the tensor to have a useful Tensor type. However, we likely also want two other pieces of information: the dimension names and the permutation order. This mimics Arrow's Fixed Shape Tensor type (a canonical extension type).

Here is what the metadata of an extension Tensor type in Vortex will look like (in Rust):

```rust
/// Metadata for a [`Tensor`] extension type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct TensorMetadata {
    /// The shape of the tensor.
    ///
    /// The shape is always defined over row-major storage. May be empty (0D scalar tensor) or
    /// contain dimensions of size 0 (degenerate tensor).
    shape: Vec<usize>,

    /// Optional names for each dimension. Each name corresponds to a dimension in the `shape`.
    ///
    /// If names exist, there must be an equal number of names to dimensions.
    dim_names: Option<Vec<String>>,

    /// The permutation of the tensor's dimensions, mapping each logical dimension to its
    /// corresponding physical dimension: `permutation[logical] = physical`.
    ///
    /// If this is `None`, then the logical and physical layout are equal, and the permutation is
    /// in-order `[0, 1, ..., N-1]`.
    permutation: Option<Vec<usize>>,
}
```
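As a sketch of the invariants these fields imply (the function name and signature are hypothetical, not the Vortex API): dimension names, if present, must match the number of dimensions, and the permutation, if present, must be a bijection over `[0, N)`.

```rust
/// Hypothetical sketch of the metadata invariants described above.
fn validate_metadata(
    shape: &[usize],
    dim_names: Option<&[String]>,
    permutation: Option<&[usize]>,
) -> bool {
    // If names exist, there must be an equal number of names to dimensions.
    let names_ok = dim_names.map_or(true, |names| names.len() == shape.len());
    // `permutation[logical] = physical` must visit every physical dimension exactly once.
    let perm_ok = permutation.map_or(true, |perm| {
        let mut seen = vec![false; shape.len()];
        perm.len() == shape.len()
            && perm
                .iter()
                .all(|&p| p < shape.len() && !std::mem::replace(&mut seen[p], true))
    });
    names_ok && perm_ok
}
```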

Stride

The stride of a tensor defines the number of elements to skip in memory to move one step along each dimension. Rather than storing strides explicitly as metadata, we can efficiently derive them from the shape and permutation. This is possible because the backing memory is always contiguous.

For a row-major tensor with shape d = [d_0, d_1, ..., d_{n-1}] and no permutation, the strides are:

```
stride[n-1] = 1                        (innermost dimension always has stride 1)
stride[i]   = d[i+1] * stride[i+1]
            = d[i+1] * d[i+2] * ... * d[n-1]
```

For example, a tensor with shape [2, 3, 4] and no permutation has strides [12, 4, 1]: moving one step along dimension 0 skips 12 elements, along dimension 1 skips 4, and along dimension 2 skips 1. The element at index [i, j, k] is located at memory offset 12*i + 4*j + k.

When a permutation is present, the logical strides are simply the row-major strides permuted accordingly. Continuing the [2, 3, 4] example with row-major strides [12, 4, 1], applying the permutation [2, 0, 1] yields logical strides [1, 12, 4]. This reorders which dimensions are contiguous in memory without copying any data.
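The derivation above can be sketched directly (function names are hypothetical, not the Vortex API), assuming contiguous row-major backing memory:

```rust
/// Row-major strides for a contiguous tensor: stride[i] = d[i+1] * stride[i+1].
fn row_major_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1usize; shape.len()];
    // Walk from the innermost dimension outward.
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = shape[i + 1] * strides[i + 1];
    }
    strides
}

/// Logical strides: row-major strides permuted by `permutation[logical] = physical`.
fn logical_strides(shape: &[usize], permutation: Option<&[usize]>) -> Vec<usize> {
    let physical = row_major_strides(shape);
    match permutation {
        None => physical,
        Some(perm) => perm.iter().map(|&p| physical[p]).collect(),
    }
}
```

With shape `[2, 3, 4]` this yields strides `[12, 4, 1]`, so element `[i, j, k]` sits at offset `12*i + 4*j + k`; applying the permutation `[2, 0, 1]` yields `[1, 12, 4]` without touching the data.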

Conversions

Arrow

Since our storage type and metadata are designed to match Arrow's Fixed Shape Tensor canonical extension type, conversion to and from Arrow is zero-copy (for tensors with at least one dimension). The FixedSizeList backing memory, shape, dimension names, and permutation all map directly between the two representations.

NumPy and PyTorch

Libraries like NumPy and PyTorch store strides as an independent, first-class field on their tensor objects. This allows them to represent non-contiguous views of memory.

For example, slicing every other row of a matrix produces a view with a doubled row stride, sharing memory with the original without copying. However, this means that non-contiguous tensors can appear anywhere, and kernels must handle arbitrary stride patterns; many PyTorch operations reportedly require calling .contiguous() before proceeding.

Since Vortex tensors are always contiguous, conversion to NumPy and PyTorch is always zero-copy: both libraries can construct a view from a pointer, shape, and strides. In the other direction, we can only zero-copy from NumPy/PyTorch tensors that are already contiguous.

Our proposed design for Vortex Tensors will handle non-contiguous operations differently than the Python libraries. Rather than mutating strides to create non-contiguous views, operations like slicing, indexing, and .contiguous() (after a permutation) would be expressed as lazy Expressions over the tensor.

These expressions describe the operation without materializing it, and when evaluated, they produce a new contiguous tensor. This fits naturally into Vortex's existing lazy compute system, where compute is deferred and composed rather than eagerly applied.

The exact mechanism for defining expressions over extension types is still being designed (see RFC #0005), but the intent is that tensor-specific operations like axis slicing, indexing, and reshaping would be custom expressions registered for the tensor extension type.

Edge Cases: 0D and Size-0 Dimensions

We will support two edge cases that arise naturally from the tensor model. Recall that the number of elements in a tensor is the product of its shape dimensions, and that the empty product is 1 (the multiplicative identity).

0-dimensional tensors

0D tensors have an empty shape [] and contain exactly one element (since the product of no dimensions is 1). These represent scalar values wrapped in the tensor type. The storage type is FixedSizeList<p, 1> (effectively a flat PrimitiveArray or DecimalArray).

Size-0 dimensions

Shapes may contain dimensions of size 0 (e.g., [3, 0, 4]), which produce tensors with zero elements (since the product includes a 0 factor). The storage type is a degenerate FixedSizeList<p, 0>, which Vortex already handles well.
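Both edge cases fall out of the same element-count rule (a sketch; `element_count` is a hypothetical helper, not the Vortex API):

```rust
/// Number of elements in a tensor: the product of its shape dimensions.
fn element_count(shape: &[usize]) -> usize {
    // `product` over an empty iterator returns 1 (the multiplicative identity),
    // so a 0D tensor (shape []) holds exactly one element, while any size-0
    // dimension zeroes out the whole product.
    shape.iter().product()
}
```

So `element_count(&[])` is 1 (0D scalar tensor), `element_count(&[3, 0, 4])` is 0 (degenerate tensor), and `element_count(&[2, 3, 4])` is 24.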

Compatibility

Both NumPy and PyTorch support these cases. NumPy fully supports 0D arrays with shape (), and dimensions of size 0 are valid (e.g., np.zeros((3, 0, 4))). PyTorch supports 0D tensors since v0.4.0 and also allows size-0 dimensions.

Arrow's Fixed Shape Tensor spec, however, requires at least one dimension (ndim >= 1), so 0D tensors would need special handling during Arrow conversion (we would likely just panic).

Compression

Since the storage type is FixedSizeList over numeric types, Vortex's existing encodings (like ALP, FastLanes, etc.) will be applied to the flattened primitive buffer transparently.

However, there may be tensor-specific compression opportunities we could take advantage of. We will leave this as an open question.

Scalar Representation

Once we add the ScalarValue::Array variant (see tracking issue vortex#6771), we can easily pass around tensors as ArrayRef scalars as well as lazily computed slices.

The ExtVTable also requires specifying an associated NativeValue<'a> Rust type that an extension scalar can be unpacked into. We will want a NativeTensor<'a> type that references the backing memory of the Tensor, and we can add useful operations to that type.

Compatibility

Since this is a new type built on an existing canonical type (FixedSizeList), there should be no compatibility concerns.

Drawbacks

  • Fixed shape only: This design only supports tensors where every element in the array has the same shape. Variable-shape tensors (ragged arrays) are out of scope and would require a different type entirely.
  • Yet another crate: We will likely implement this in a vortex-tensor crate, which means even more surface area than we already have.

Alternatives

Nested FixedSizeList

Rather than a flat FixedSizeList with metadata, we could represent tensors as nested FixedSizeList types (e.g., FixedSizeList<FixedSizeList<FixedSizeList<i32, 4>, 3>, 2> for a [2, 3, 4] tensor). This has several disadvantages:

  • Each nesting level introduces its own validity bitmap, even though sub-dimensional nullability is not meaningful for tensors. This wastes space and complicates null-handling logic.
  • This does not match Arrow's canonical Fixed Shape Tensor type, making zero-copy conversion impossible.
  • Expressions would need to be aware of the nesting depth, and operations like transpose or reshape would require restructuring the type itself rather than updating metadata.

Do nothing

Users could continue to use FixedSizeList directly with out-of-band shape metadata. This works for simple storage, but as discussed in the motivation, it provides no type-level support for tensor operations and makes interoperability with tensor libraries awkward.

Prior Art

Note: This section was Claude-researched.

Columnar formats

  • Apache Arrow defines a Fixed Shape Tensor canonical extension type. Our design closely follows Arrow's approach: a flat FixedSizeList storage type with shape, dimension names, and permutation metadata. Arrow also defines a Variable Shape Tensor extension type for ragged tensors, which could inform future work.
  • Lance delegates entirely to Arrow's type system, including extension types. Arrow extension metadata (and therefore tensor metadata) is preserved end-to-end through Lance's storage layer, which validates the approach of building tensor semantics as an extension on top of FixedSizeList storage.
  • Parquet has no native FixedSizeList logical type. Arrow's FixedSizeList is stored as a regular LIST in Parquet, which adds conversion overhead via repetition levels. There is active discussion about introducing FixedSizeList as a Parquet logical type, partly motivated by tensor and embedding workloads.

Database systems

  • DuckDB has a native ARRAY type (fixed-size list) but no dedicated tensor type. Community discussions have proposed adding one, noting that nested ARRAY types can simulate multi-dimensional arrays but lack tensor-specific operations.
  • DataFusion uses Arrow's type system directly and has no dedicated tensor type. There is open discussion about a logical type layer that could support extension types as first-class citizens.

Tensor libraries

  • NumPy and PyTorch both represent tensors as contiguous (or non-contiguous) memory with shape and stride metadata. Our design is a subset of this model — we always require contiguous memory and derive strides from shape and permutation, as discussed in the conversions section.
  • ndindex is a Python library that provides a unified interface for representing and manipulating NumPy array indices (slices, integers, ellipses, boolean arrays, etc.). It supports operations like canonicalization, shape inference, and re-indexing onto array chunks. We will want to implement tensor compute expressions in Vortex that are similar to the operations ndindex provides — for example, computing the result shape of a slice or translating a logical index into a physical offset.

Academic work

  • TACO (Tensor Algebra Compiler) separates the tensor storage format from the tensor program. Each dimension can independently be specified as dense or sparse, and dimensions can be reordered. The Vortex approach of storing tensors as flat contiguous memory with a permutation is one specific point in TACO's format space (all dimensions dense, with a specific dimension ordering).

Unresolved Questions

  • Are two tensors with different permutations but the same logical values considered equal? This affects deduplication and comparisons. The type metadata differs, yet the logical tensor values are identical, so it would seem strange to treat them as unequal.
  • Are there potential tensor-specific compression schemes we can take advantage of?

Future Possibilities

Variable-shape tensors

Arrow defines a Variable Shape Tensor extension type for arrays where each tensor can have a different shape. This would enable workloads like batched sequences of different lengths.

Sparse tensors

A similar Sparse Tensor type could use List or ListView as its storage type to efficiently represent tensors with many null or zero elements, as noted in the validity section.

Tensor-specific encodings

Beyond general-purpose compression, encodings tailored to tensor data (e.g., exploiting spatial locality across dimensions) could improve compression ratios for specific workloads.

ndindex-style compute expressions

As the extension type expression system matures, we can implement a rich set of tensor indexing and slicing operations inspired by ndindex, including slice canonicalization, shape inference, and chunk-level re-indexing.