Overview
There are multiple open PRs that collectively aim to significantly improve VectorType (custom type) deserialization performance for vector search workloads. This issue tracks them, describes their contributions, and proposes a merge priority/order.
Vector search workloads (e.g., 768/1536-dimension float vectors for embeddings) are a key use case where deserialization overhead is substantial. These PRs attack the problem from multiple angles: pure Python fast-paths, Cython C-level operations, numpy integration, type-resolution caching, and general read-path optimization.
Directly Vector-Related PRs
1. #689 — (Improvement) Improve performance of Vector type parsing
- What: Original umbrella PR containing multiple commits across Python, Cython, and numpy layers for vector deserialization optimization.
- Status: Open since Feb 5, 2026. This appears to be the original monolithic PR that was later broken out into #730–#733.
- Note: May be superseded by the individual PRs below. Needs clarification on whether this should be closed in favor of #730–#733.
2. #730 — Optimize VectorType deserialization with struct.unpack and numpy
- What: Pure Python path optimization. Replaces element-by-element deserialization with a bulk `struct.unpack` for known numeric types, and adds a `numpy.frombuffer().tolist()` fast-path for vectors with 32 or more elements.
- Impact: 3.5x speedup for small vectors (`Vector<float, 3>`), 4.0x for large vectors (`Vector<float, 1536>`).
- Scope: `cassandra/cqltypes.py` only. No Cython dependency.
- Priority: HIGH. This is the foundational optimization, and it benefits all users because the pure Python path is always available.
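To illustrate the two strategies described for #730, the sketch below decodes a `Vector<float, N>` payload, assuming the value is N big-endian IEEE-754 floats packed back to back. The function name `deserialize_float_vector` and the `NUMPY_THRESHOLD` constant are hypothetical; this is a minimal sketch, not the code from the PR.

```python
import struct

try:
    import numpy as np  # optional: only used for the large-vector fast path
except ImportError:
    np = None

# Hypothetical constant mirroring the ">= 32 elements" cut-over described in #730.
NUMPY_THRESHOLD = 32
FLOAT_SIZE = 4  # a CQL float is a 4-byte big-endian IEEE-754 value


def deserialize_float_vector(data: bytes, dimension: int) -> list:
    """Decode `dimension` big-endian floats from `data` into a Python list."""
    expected = dimension * FLOAT_SIZE
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    if np is not None and dimension >= NUMPY_THRESHOLD:
        # One C-level pass over the buffer; ">f4" means big-endian float32.
        return np.frombuffer(data, dtype=">f4").tolist()
    # Bulk unpack: a single format string such as ">3f" decodes every element
    # in one call instead of invoking a per-element deserializer N times.
    return list(struct.unpack(f">{dimension}f", data))
```

Both branches return plain Python floats, so callers see the same result type whichever path is taken; the threshold only decides which one is cheaper to run.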
3. #731 — Add VectorType support to numpy_parser for 2D array parsing
- What: Extends the Cython NumPy row parser to produce 2D masked arrays of shape `(num_rows, vector_dimension)` for vector columns, using `memcpy` to copy the raw wire bytes directly into pre-allocated numpy buffers.
- Impact: Near zero-copy path for ML/AI workloads that consume results as numpy arrays: each vector value is copied once into the result buffer, with no per-element Python objects. This is the fastest possible path when the consumer is numpy.
- Scope: `cassandra/numpy_parser.pyx` plus new tests.
- Priority: MEDIUM. Benefits users who opt into the numpy result path. Independent of the other PRs.
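The memory layout behind #731 can be modeled in pure Python (this is not the PR's Cython code): one preallocated buffer of `num_rows * dimension * 4` bytes receives each row's raw big-endian float bytes through a single slice copy, and null rows are recorded in a mask. The helper names `pack_vector_column`, `row_values`, and `as_masked_2d` are hypothetical, float elements are assumed, and numpy is only needed for the final 2D masked-array view.

```python
import struct

try:
    import numpy as np
except ImportError:
    np = None


def pack_vector_column(rows, dimension, elem_size=4):
    """Copy each row's raw vector bytes into one preallocated contiguous buffer.

    `rows` holds the raw wire bytes of each vector value, or None for a null cell.
    Returns (buffer, mask) where mask[i] is True when row i is null.
    """
    row_bytes = dimension * elem_size
    buf = bytearray(len(rows) * row_bytes)  # preallocated, zero-filled
    view = memoryview(buf)
    mask = []
    for i, raw in enumerate(rows):
        if raw is None:
            mask.append(True)  # leave the zero-filled slot and mark it masked
            continue
        if len(raw) != row_bytes:
            raise ValueError(f"row {i}: expected {row_bytes} bytes, got {len(raw)}")
        # Slice assignment into a memoryview is a plain byte copy (memcpy-like);
        # no per-element Python objects are created.
        view[i * row_bytes:(i + 1) * row_bytes] = raw
        mask.append(False)
    return buf, mask


def row_values(buf, dimension, row, elem_size=4):
    """Decode one row back to Python floats (useful when numpy is unavailable)."""
    return list(struct.unpack_from(f">{dimension}f", buf, row * dimension * elem_size))


def as_masked_2d(buf, mask, dimension):
    """View the buffer as a (num_rows, dimension) big-endian float32 masked array."""
    arr = np.frombuffer(buf, dtype=">f4").reshape(len(mask), dimension)
    row_mask = np.repeat(np.array(mask, dtype=bool)[:, None], dimension, axis=1)
    return np.ma.masked_array(arr, mask=row_mask)
```

Keeping the buffer in wire (big-endian) order and declaring the dtype as `>f4` avoids a byte-swap pass at copy time; numpy handles the byte order when values are read.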
4. #732 — Optimize Cython deserialization primitives and add VectorType Cython deserializer
- What: Six commits: (1) `ntohs`/`ntohl` intrinsics for Cython byte unpacking, (2) a float-specific `ntohl` byte-swap, (3) a `from_ptr_and_size()` refactor, (4) a new `DesVectorType` Cython class with type-specialized C-level deserialization, (5) Windows portability, (6) buffer bounds validation.
- Impact: 4.4–4.7x faster than pure Python for small vectors. For large vectors both paths use numpy, so the per-vector gain is marginal, but the per-row dispatch overhead is eliminated. The byte-swap intrinsics also give a ~4–5% general row throughput improvement.
- Scope: `cassandra/deserializers.pyx`, `cassandra/cython_marshal.pyx`, `cassandra/ioutils.pyx`, `cassandra/buffer.pxd`.
- Priority: HIGH. This is the Cython counterpart to #730 and also improves general Cython deserialization performance.
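Commits (2) and (6) of #732 work at the C level. The pure-Python model below only shows the idea: bounds-check the read, load 4 bytes as an unsigned 32-bit integer in host order, byte-swap them with an `ntohl` equivalent, and reinterpret the bits as a float. `ntohl` and `read_be_float` are hypothetical Python stand-ins. In Python this is slower than `struct.unpack(">f", ...)`; it exists only to make the C technique concrete.

```python
import struct
import sys


def ntohl(value: int) -> int:
    """Python stand-in for C ntohl(): convert a 32-bit value from network
    (big-endian) order to host order. On a big-endian host it is a no-op."""
    if sys.byteorder == "big":
        return value
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def read_be_float(buf: bytes, offset: int) -> float:
    """Decode one big-endian float the way a C deserializer might."""
    if offset < 0 or offset + 4 > len(buf):
        # Validate bounds before touching the buffer.
        raise IndexError("float read out of bounds")
    # Host-order load, like `*(uint32_t *)(buf + offset)` in C.
    (raw,) = struct.unpack_from("=I", buf, offset)
    host_bits = ntohl(raw)
    # Reinterpret the 32 bits as an IEEE-754 float (a type pun in C).
    (value,) = struct.unpack("=f", struct.pack("=I", host_bits))
    return value
```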
5. #733 — Add VectorType deserialization benchmarks and expand test coverage
- What: Benchmark harness testing four strategies (`VectorType.deserialize`, `struct.unpack`, `numpy.frombuffer`, Cython `DesVectorType`) across multiple vector sizes and subtypes. Enables vector integration tests on Scylla 2025.4+. Adds unit tests for the Cython fallback and the numpy large-vector paths.
- Impact: No production code changes. Provides the measurement infrastructure to validate all the other PRs.
- Priority: HIGH — Should be merged early or alongside the optimization PRs to provide regression tracking.
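A minimal version of the kind of comparison #733 automates, using only the standard library, times a per-element decode against a bulk `struct.unpack` for a 768-dimension float vector. It is not the PR's harness (which also covers `numpy.frombuffer` and the Cython `DesVectorType`); the function names are hypothetical and no timings are claimed here.

```python
import struct
import timeit

DIM = 768
payload = struct.pack(f">{DIM}f", *(float(i) for i in range(DIM)))


def per_element(data: bytes, dim: int) -> list:
    # Baseline: one struct call per element, similar in spirit to
    # dispatching a subtype deserializer for every value.
    return [struct.unpack_from(">f", data, 4 * i)[0] for i in range(dim)]


def bulk_unpack(data: bytes, dim: int) -> list:
    # A single call decodes the whole vector.
    return list(struct.unpack(f">{dim}f", data))


def run_benchmark(number: int = 200) -> dict:
    """Return the mean microseconds per call for each strategy."""
    strategies = {"per_element": per_element, "bulk_unpack": bulk_unpack}
    results = {}
    for name, fn in strategies.items():
        total = timeit.timeit(lambda: fn(payload, DIM), number=number)
        results[name] = total / number * 1e6
    return results


if __name__ == "__main__":
    for name, usec in run_benchmark().items():
        print(f"{name:>12}: {usec:8.1f} us/call")
```

Checking that both strategies return identical lists before timing them is what turns a benchmark into a regression guard.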
Supporting/Infrastructure PRs (Benefit Vector Workloads Indirectly)
These PRs optimize the general deserialization pipeline that vector data flows through. They benefit all types but have particular impact on vector workloads due to the high volume of data.
6. #690 — Optimize custom type parsing with LRU caching
- What: Caches `lookup_casstype()` results to avoid repeated string manipulation and regex scanning. VectorType is a custom, parameterized type, so this directly reduces per-query type resolution overhead.
- Priority: MEDIUM. Prerequisite-like optimization that specifically benefits vector type resolution.
7. #729 — Fast-path lookup_casstype() for simple type names
- What: Skips the regex scanner and stack-based parser for non-parameterized types (direct dict lookup). While VectorType itself is parameterized, its subtypes (FloatType, etc.) are simple and benefit from this.
- Priority: LOW-MEDIUM — Incremental improvement to type resolution.
8. #734 — Remove copies on the read path
- What: Reduces memory copies in the read path. The largest gain (up to 5.3x) is on large payloads, and vector results with 768- or 1536-dimension float columns are large payloads.
- Impact: 1.2x–5.3x depending on payload size.
- Priority: HIGH — Significant general improvement that directly benefits vector query result processing.
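The general effect of removing copies can be shown with a pure-Python sketch; whether #734 uses this exact mechanism is an assumption. When walking length-prefixed values (a 4-byte big-endian signed length, negative meaning null), slicing a `bytes` object copies each value, while slicing a `memoryview` only creates a view onto the same buffer. Both function names are hypothetical.

```python
import struct


def read_values_copying(body: bytes, count: int) -> list:
    """Each slice of a bytes object allocates a new bytes object (a copy)."""
    out, pos = [], 0
    for _ in range(count):
        (length,) = struct.unpack_from(">i", body, pos)
        pos += 4
        if length < 0:
            out.append(None)  # a negative length encodes a null value
            continue
        out.append(body[pos:pos + length])  # copies `length` bytes
        pos += length
    return out


def read_values_zero_copy(body: bytes, count: int) -> list:
    """Slicing a memoryview returns a view onto the same buffer; no bytes are copied."""
    view = memoryview(body)
    out, pos = [], 0
    for _ in range(count):
        (length,) = struct.unpack_from(">i", view, pos)
        pos += 4
        if length < 0:
            out.append(None)
            continue
        out.append(view[pos:pos + length])  # O(1), shares memory with `body`
        pos += length
    return out
```

The trade-off is lifetime: each view keeps the whole response body alive, so views suit short-lived decoding rather than long-term storage of individual values.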
9. #741 — Cache deserializer instances in find_deserializer (Cython only)
- What: Caches `find_deserializer()` and `make_deserializers()` results, avoiding repeated class lookups and Deserializer object creation. The `DesVectorType` from #732 would be cached here.
- Impact: 6x speedup for `find_deserializer`, 29x for `make_deserializers` (10 types).
- Priority: MEDIUM. Depends on #732 (or at least benefits from it). Important for prepared statement hot paths.
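The instance caching in #741 can be sketched in pure Python, assuming deserializers are stateless so one instance per type can be shared safely. The classes, the string type keys, and both `*_cached` functions are hypothetical stand-ins for the driver's Cython objects, which are keyed by type classes rather than strings.

```python
class Deserializer:
    """Hypothetical base class; assumed stateless so instances can be shared."""

    def __init__(self, cqltype: str):
        self.cqltype = cqltype


class DesFloatType(Deserializer):
    pass


class DesVectorType(Deserializer):
    pass


_CLASSES = {"FloatType": DesFloatType, "VectorType": DesVectorType}
_instance_cache = {}
_tuple_cache = {}


def find_deserializer_cached(cqltype: str) -> Deserializer:
    """Build a Deserializer once per type key and return the same instance afterwards."""
    des = _instance_cache.get(cqltype)
    if des is None:
        base_name = cqltype.split("(", 1)[0]
        des = _CLASSES.get(base_name, Deserializer)(cqltype)
        _instance_cache[cqltype] = des
    return des


def make_deserializers_cached(cqltypes) -> tuple:
    """Cache the per-column result too, keyed by the (hashable) tuple of column types."""
    key = tuple(cqltypes)
    result = _tuple_cache.get(key)
    if result is None:
        result = tuple(find_deserializer_cached(t) for t in key)
        _tuple_cache[key] = result
    return result
```

Returning a tuple rather than a list keeps the shared cached value immutable, so one caller cannot corrupt it for another.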
10. #742 — Cache ParseDesc for prepared statements (Cython only)
- What: Caches the ParseDesc object so that repeated prepared statement executions skip the list comprehensions, ColDesc construction, and `make_deserializers()`.
- Impact: 4.7x speedup for single-row prepared results (ParseDesc construction: 5,730 ns → 175 ns for 10 columns).
- Priority: MEDIUM — Most vector search uses prepared statements, so this directly helps.
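The memoization in #742 can be sketched as a per-statement cached value that is built on first use and dropped when the result metadata changes. `PreparedStatementSketch`, the `(keyspace, table, column_name, column_type)` metadata layout, and these simplified `ColDesc`/`ParseDesc` dataclasses are hypothetical; the driver's real ParseDesc carries more fields (for example, the deserializers).

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class ColDesc:
    keyspace: str
    table: str
    name: str


@dataclass(frozen=True)
class ParseDesc:
    colnames: Tuple[str, ...]
    coltypes: Tuple[str, ...]
    coldescs: Tuple[ColDesc, ...]


class PreparedStatementSketch:
    """Hypothetical prepared statement that memoizes its ParseDesc."""

    def __init__(self, result_metadata: Sequence[tuple]):
        # Each entry is assumed to be (keyspace, table, column_name, column_type).
        self._result_metadata = tuple(result_metadata)
        self._parse_desc: Optional[ParseDesc] = None

    def update_result_metadata(self, result_metadata: Sequence[tuple]) -> None:
        # Metadata can change (e.g. after a schema change), so drop the cached value.
        self._result_metadata = tuple(result_metadata)
        self._parse_desc = None

    def parse_desc(self) -> ParseDesc:
        # Built on the first execution, then reused by every later one.
        if self._parse_desc is None:
            md = self._result_metadata
            self._parse_desc = ParseDesc(
                colnames=tuple(c[2] for c in md),
                coltypes=tuple(c[3] for c in md),
                coldescs=tuple(ColDesc(c[0], c[1], c[2]) for c in md),
            )
        return self._parse_desc
```

The invalidation hook is the important design point: a cached ParseDesc is only correct while the statement's result metadata is unchanged.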
11. #743 — Direct PyUnicode_DecodeUTF8/ASCII from C buffer (Cython only)
- What: Eliminates intermediate bytes object allocation for UTF8/ASCII deserialization using direct CPython C API calls.
- Impact: 1.2x–2.1x for text columns. While not vector-specific, this improves the same Cython deserialization layer.
- Priority: LOW — Tangential to vector performance; improves the same code layer.
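#743 operates at the CPython C API level, calling `PyUnicode_DecodeUTF8` on a pointer into the receive buffer. A rough pure-Python analogue, assuming CPython, decodes from a `memoryview` slice instead of first slicing out an intermediate `bytes` object. Both function names are hypothetical.

```python
def decode_utf8_via_bytes(buf: bytes, offset: int, length: int) -> str:
    # Slicing a bytes object allocates an intermediate bytes copy, which is then decoded.
    return buf[offset:offset + length].decode("utf-8")


def decode_utf8_direct(buf: bytes, offset: int, length: int) -> str:
    # str() accepts any buffer-protocol object. In CPython, decoding a memoryview
    # slice reads straight from the underlying buffer, with no intermediate bytes object.
    return str(memoryview(buf)[offset:offset + length], "utf-8")
```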
Proposed Merge Order
The ordering considers dependencies, risk, and impact:
| Order | PR | Rationale |
|---|---|---|
| 1 | #733 (benchmarks/tests) | Merge first to establish measurement baseline. No production code changes, low risk. |
| 2 | #730 (struct.unpack + numpy, pure Python) | Foundational vector optimization. Benefits all users. No Cython dependency. |
| 3 | #734 (remove copies on read path) | General read-path improvement. Benefits vector workloads significantly. Independent. |
| 4 | #732 (Cython VectorType deserializer) | Cython vector optimization. May depend on #730 for shared numpy threshold logic. |
| 5 | #731 (numpy_parser 2D arrays) | Independent, targets numpy consumers specifically. Can go anywhere after #730. |
| 6 | #690 (custom type LRU cache) | Type resolution caching. Independent. |
| 7 | #729 (fast-path lookup_casstype) | Incremental type resolution improvement. Independent. |
| 8 | #741 (cache deserializer instances) | Benefits from #732 being merged first (caches DesVectorType). |
| 9 | #742 (cache ParseDesc) | Builds on same Cython layer as #741. |
| 10 | #743 (direct UTF8 decode) | Same Cython layer, least vector-specific. |
| 11 | #689 (original umbrella PR) | Likely close if #730–#733 cover all its content. Needs author confirmation. |
Notes
- All PRs are authored by @mykaul.
- PRs #730–#733 appear to be a decomposition of #689 into focused, reviewable units.
- PRs #740 (namedtuple class cache in `named_tuple_factory`), #745 (metadata schema parsing), and #630 (`column_encryption_policy` checks in `recv_results_rows`) are also open performance PRs, but they are not vector-specific and are excluded from this tracking issue.
- The Cython-only PRs (#731, #732, #741, #742, #743) only benefit users with the Cython extensions compiled. The pure Python PRs (#730, #690, #729) benefit all users.