(improvement) serializers: add Cython-optimized serialization for VectorType by mykaul · Pull Request #748 · scylladb/python-driver

mykaul · 2026-03-14T11:08:22Z

Summary

Adds cassandra/serializers.pyx and cassandra/serializers.pxd implementing Cython-optimized serialization that mirrors the deserializers.pyx architecture.

What's included

Scalar serializers: SerFloatType (4-byte IEEE 754), SerDoubleType (8-byte), SerInt32Type (4-byte signed) — the three subtypes commonly used in vector columns
SerVectorType: Pre-allocates a contiguous char * buffer and uses C-level byte swapping for float/double/int32 vectors, with a generic fallback for other subtypes
GenericSerializer: Delegates to the Python-level cqltype.serialize() classmethod for all other types
Factory functions: find_serializer(cqltype) and make_serializers(cqltypes_list) for easy lookup and batch creation
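As a rough pure-Python reference for the wire formats listed above (not the Cython code itself), the standard `struct` module produces the same big-endian layouts. The `ref_*` helper names are hypothetical, and the vector helper assumes a fixed-size subtype whose elements are packed back to back.

```python
import struct

def ref_serialize_float(value):
    # 4-byte big-endian IEEE 754 single precision
    return struct.pack(">f", value)

def ref_serialize_double(value):
    # 8-byte big-endian IEEE 754 double precision
    return struct.pack(">d", value)

def ref_serialize_int32(value):
    # 4-byte big-endian signed integer
    return struct.pack(">i", value)

def ref_serialize_float_vector(values):
    # Assumption: the vector body is just the elements concatenated,
    # so a single batch format string covers the whole vector.
    return struct.pack(">%df" % len(values), *values)

vec = [1.0, -2.5, 0.0]
# The batch form matches element-by-element packing byte for byte.
assert ref_serialize_float_vector(vec) == b"".join(ref_serialize_float(v) for v in vec)
print(ref_serialize_int32(1).hex())
```

A reference like this is mainly useful as an oracle in equivalence tests for the optimized path.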

Architecture

Mirrors deserializers.pyx exactly:

Deserializer side	Serializer side
`Deserializer` base class	`Serializer` base class
`DesFloatType`, `DesDoubleType`, `DesInt32Type`	`SerFloatType`, `SerDoubleType`, `SerInt32Type`
`DesVectorType` (type-specialized)	`SerVectorType` (type-specialized)
`GenericDeserializer`	`GenericSerializer`
`find_deserializer()`	`find_serializer()`
`make_deserializers()`	`make_serializers()`
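The "typed serializer object + factory lookup" shape in the table can be sketched in plain Python. This is only an illustration: the stub cqltypes and the dictionary-based lookup are assumptions, and the real module may resolve specialized classes differently.

```python
import struct

class Serializer:
    """Base class: remembers the cqltype and exposes serialize()."""
    def __init__(self, cqltype):
        self.cqltype = cqltype

    def serialize(self, value, protocol_version):
        raise NotImplementedError

class GenericSerializer(Serializer):
    # Fallback: delegate to the type's own Python-level classmethod.
    def serialize(self, value, protocol_version):
        return self.cqltype.serialize(value, protocol_version)

class SerInt32Type(Serializer):
    # Specialized path for one concrete type.
    def serialize(self, value, protocol_version):
        return struct.pack(">i", value)

# Hypothetical stand-ins for cqltypes, so the sketch runs without the driver.
class Int32TypeStub:
    @classmethod
    def serialize(cls, value, protocol_version):
        return struct.pack(">i", value)

class UTF8TypeStub:
    @classmethod
    def serialize(cls, value, protocol_version):
        return value.encode("utf8")

# Assumed lookup table; unknown types fall back to GenericSerializer.
_SPECIALIZED = {Int32TypeStub: SerInt32Type}

def find_serializer(cqltype):
    return _SPECIALIZED.get(cqltype, GenericSerializer)(cqltype)

def make_serializers(cqltypes_list):
    return [find_serializer(t) for t in cqltypes_list]

sers = make_serializers([Int32TypeStub, UTF8TypeStub])
print([type(s).__name__ for s in sers])
```

Building the serializer list once per statement is what lets the per-value work skip type dispatch.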

Performance

Benchmarked on Vector<float, 1536> (typical embedding dimension):

Method	us/op	Speedup
Current `VectorType.serialize()` (io.BytesIO loop)	~823 us	1x (baseline)
Python `struct.pack` batch format string	~74 us	~11x
Cython `SerVectorType`	~27 us	~30x

No setup.py changes needed — the existing cassandra/*.pyx glob already picks up new .pyx files.
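The two pure-Python rows of the benchmark table can be approximated with a small `timeit` harness. This is a sketch under assumptions: `serialize_loop` only resembles the baseline in spirit (a per-element write into `io.BytesIO`), and absolute timings will differ by machine.

```python
import io
import struct
import timeit

DIM = 1536  # embedding dimension used in the table above
values = [float(i) / DIM for i in range(DIM)]

def serialize_loop(vals):
    # Per-element pack written into a BytesIO buffer.
    buf = io.BytesIO()
    for v in vals:
        buf.write(struct.pack(">f", v))
    return buf.getvalue()

def serialize_batch(vals):
    # One struct.pack call with a batch format string such as ">1536f".
    return struct.pack(">%df" % len(vals), *vals)

def us_per_op(fn, number=50):
    # Average microseconds per call over `number` runs.
    return timeit.timeit(lambda: fn(values), number=number) / number * 1e6

# Both approaches must agree byte for byte before timing means anything.
assert serialize_loop(values) == serialize_batch(values)
print("loop : %.1f us/op" % us_per_op(serialize_loop))
print("batch: %.1f us/op" % us_per_op(serialize_batch))
```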

Related PRs

Part of the vector serialization optimization effort tracked in #746 (Tracking: Vector search (VectorType) performance improvement PRs)
Serialization benchmarks added to PR #733 (tests/benchmarks: Add VectorType deserialization benchmarks and expand test coverage)
Follow-up PR will integrate this into BoundStatement.bind() (depends on this PR and #630, Optimize column_encryption_policy checks in recv_results_rows)

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

…torType

Add cassandra/serializers.pyx and cassandra/serializers.pxd implementing Cython-optimized serialization that mirrors the deserializers.pyx architecture.

Implements type-specialized serializers for the three subtypes commonly used in vector columns:
- SerFloatType: 4-byte big-endian IEEE 754 float
- SerDoubleType: 8-byte big-endian double
- SerInt32Type: 4-byte big-endian signed int32

SerVectorType pre-allocates a contiguous buffer and uses C-level byte swapping for float/double/int32 vectors, with a generic fallback for other subtypes. GenericSerializer delegates to the Python-level cqltype.serialize() classmethod.

Factory functions find_serializer() and make_serializers() allow easy lookup and batch creation of serializers for column types.

Benchmarks show ~30x speedup over the current io.BytesIO baseline and ~3x speedup over Python struct.pack for Vector<float, 1536> serialization.

No setup.py changes needed - the existing cassandra/*.pyx glob already picks up new .pyx files.

…nt.bind()

When Cython serializers (from cassandra.serializers) are available and no column encryption policy is active, BoundStatement.bind() now uses pre-built Serializer objects cached on the PreparedStatement instead of calling cqltype classmethods. This avoids per-value Python method dispatch overhead and enables the ~30x vector serialization speedup from the Cython serializers module.

The bind loop is split into three paths:
1. Column encryption policy path (unchanged behavior)
2. Cython serializers path (new fast path)
3. Plain Python path (no CE, no Cython -- removes per-value ColDesc/CE check)

Depends on PR scylladb#748 (Cython serializers module) and PR scylladb#630 (CE-policy bind split).
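The three-way split described in this commit message might look roughly like the following. Everything here is hypothetical: `bind_values`, the `encrypt` hook, and the stub classes are illustration only and do not reflect the driver's actual BoundStatement or column encryption policy API.

```python
import struct

class Int32TypeStub:
    # Stand-in for a cqltype with a Python-level serialize() classmethod.
    @classmethod
    def serialize(cls, value, protocol_version):
        return struct.pack(">i", value)

class SerInt32Stub:
    # Stand-in for a pre-built serializer object cached per column.
    def serialize(self, value, protocol_version):
        return struct.pack(">i", value)

def bind_values(values, col_types, protocol_version, serializers=None, encrypt=None):
    if encrypt is not None:
        # Path 1: encryption active; serialize, then pass each value to the hook.
        return [encrypt(i, t.serialize(v, protocol_version))
                for i, (t, v) in enumerate(zip(col_types, values))]
    if serializers is not None:
        # Path 2: fast path using one pre-built serializer per column.
        return [s.serialize(v, protocol_version) for s, v in zip(serializers, values)]
    # Path 3: plain Python, with no per-value encryption check.
    return [t.serialize(v, protocol_version) for t, v in zip(col_types, values)]

types = [Int32TypeStub, Int32TypeStub]
plain = bind_values([1, 2], types, 4)
fast = bind_values([1, 2], types, 4, serializers=[SerInt32Stub(), SerInt32Stub()])
assert plain == fast
```

Choosing the path once per bind call, rather than per value, is the point of the split.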

Copilot

Pull request overview

This PR introduces a new Cython extension module to accelerate CQL value serialization—especially VectorType—using the same general “typed Serializer object + factory lookup” approach as the existing Cython deserialization stack.

Changes:

Add cassandra/serializers.pyx implementing Cython serializers for FloatType, DoubleType, Int32Type, and an optimized VectorType serializer with generic fallback.
Add find_serializer() / make_serializers() factory helpers for serializer creation.
Add cassandra/serializers.pxd to expose the Serializer interface to other Cython modules.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

File	Description
cassandra/serializers.pyx	New Cython-optimized serialization implementations and factory lookup.
cassandra/serializers.pxd	Cython declarations for the `Serializer` interface.


…lizers

Address all 8 Copilot review comments on PR scylladb#748:
- Add _check_float_range() for float overflow detection matching struct.pack
- Add _check_int32_range() for int32 bounds checking before C cast
- Wire bounds checks into SerFloatType, SerInt32Type, and VectorType fast-paths
- Replace malloc/free with PyBytes_FromStringAndSize(NULL,n)+PyBytes_AS_STRING
- Add empty vector early return (b'') before allocation
- Remove unused uint32_t cimport and libc.stdlib import
- Add comprehensive test suite (67 tests) covering equivalence, overflow, special values, vectors, round-trips, and factory functions
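For context on the bounds checks, this is how `struct.pack` itself reacts to out-of-range input in CPython, followed by a pure-Python analogue of an int32 range check. `pack_outcome` and `check_int32_range` are hypothetical names; the Cython helpers in the PR are not reproduced here.

```python
import struct

def pack_outcome(fmt, value):
    # Return the packed bytes as hex, or the name of the exception raised.
    try:
        return struct.pack(fmt, value).hex()
    except (OverflowError, struct.error) as exc:
        return type(exc).__name__

print(pack_outcome(">f", 3.4028234663852886e38))  # largest finite float32 packs
print(pack_outcome(">f", 1e39))                   # finite but too large for float32
print(pack_outcome(">f", float("inf")))           # infinities are representable
print(pack_outcome(">i", 2**31))                  # one past the int32 maximum

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def check_int32_range(value):
    # Reject values that a 4-byte signed integer cannot hold,
    # before any narrowing conversion would silently wrap them.
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError("value out of int32 range: %r" % (value,))
```

Note that the float case raises `OverflowError` while the integer case raises `struct.error`, which is why the exception type of a replacement check is worth documenting.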

Copilot

Pull request overview

Adds a new Cython serialization module to speed up VectorType (and a few common scalar subtypes) while keeping wire-format output identical to existing cqltypes.*.serialize() implementations.

Changes:

Introduce cassandra/serializers.pyx + .pxd implementing Serializer classes, including a specialized SerVectorType with float/double/int32 fast-paths.
Add serializer lookup/batch factories (find_serializer, make_serializers).
Add unit tests validating byte-for-byte equivalence and basic round-trips for the new serializers.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File	Description
`cassandra/serializers.pyx`	New Cython serializers, including optimized `VectorType` serialization and factory lookup functions.
`cassandra/serializers.pxd`	Cython declarations for the `Serializer` base class.
`tests/unit/test_serializers.py`	New unit tests covering scalar/vector equivalence, round-trips, and factory behavior.


- Fix _check_float_range() docstring: clarify it raises OverflowError, not struct.error
- Fix _check_int32_range() docstring: same clarification
- Document __getitem__ requirement in vector fast-paths (_serialize_float, _serialize_double, _serialize_int32) as intentional for performance
- Expand test import guard to cover VERIFY_CYTHON
- Remove unused imports: math, parse_casstype_args

Copilot

Pull request overview

Adds a new Cython extension module (cassandra.serializers) that provides optimized serialization implementations for common scalar types and VectorType, intended to mirror the existing deserializers.pyx architecture and improve vector-heavy workloads.

Changes:

Introduces cassandra/serializers.pyx + .pxd implementing Serializer classes, SerVectorType, and factory helpers (find_serializer, make_serializers).
Adds unit tests validating byte-for-byte equivalence vs. Python cqltype.serialize() and basic factory behavior.
Implements float/double/int32 vector fast-paths via preallocated contiguous buffers and endian swapping.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File	Description
`cassandra/serializers.pyx`	New Cython-optimized serializers, including `VectorType` specialized paths and serializer lookup/factory functions.
`cassandra/serializers.pxd`	Cython declarations for the `Serializer` base class API.
`tests/unit/test_serializers.py`	New tests for serializer equivalence, edge cases, and factory helpers under Cython.


Copilot · 2026-03-24T22:18:49Z

cassandra/serializers.pyx

+    # For VectorType, always use SerVectorType (it handles generic subtypes internally)
+    if issubclass(cqltype, cqltypes.VectorType):
+        return SerVectorType(cqltype)


find_serializer() unconditionally routes any VectorType subclass (including the un-parameterized base cqltypes.VectorType) to SerVectorType. The base VectorType has subtype=None, which makes SerVectorType.__init__ call issubclass(None, ...) and raise TypeError during serializer creation. Please guard this path (e.g., only use SerVectorType when cqltype.subtype is set / cqltype is not cqltypes.VectorType, otherwise fall back to GenericSerializer or raise a clearer error).

Suggested change

-    # For VectorType, always use SerVectorType (it handles generic subtypes internally)
-    if issubclass(cqltype, cqltypes.VectorType):
-        return SerVectorType(cqltype)
+    # For VectorType, use SerVectorType only when the cqltype is parameterized
+    # (i.e., has a non-None subtype). The un-parameterized base VectorType has
+    # subtype=None, which SerVectorType cannot handle safely.
+    if issubclass(cqltype, cqltypes.VectorType):
+        subtype = getattr(cqltype, 'subtype', None)
+        if subtype is not None:
+            return SerVectorType(cqltype)
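The failure mode behind this comment is easy to reproduce in plain Python: `issubclass()` raises `TypeError` when its first argument is `None`. The stub classes and `choose_serializer_kind` below are hypothetical stand-ins, and falling back to a generic path is only one of the two options the comment mentions.

```python
class FloatTypeStub:
    """Stand-in for a float cqltype."""

class VectorTypeStub:
    """Stand-in for the un-parameterized base vector type."""
    subtype = None

class FloatVectorStub(VectorTypeStub):
    """Stand-in for a parameterized vector type."""
    subtype = FloatTypeStub

def choose_serializer_kind(cqltype):
    # Guard first: only inspect the subtype when it is actually set.
    subtype = getattr(cqltype, "subtype", None)
    if subtype is None:
        return "generic"
    if issubclass(subtype, FloatTypeStub):
        return "vector-float"
    return "vector-generic"

# Without the guard, the issubclass() call itself fails.
try:
    issubclass(VectorTypeStub.subtype, FloatTypeStub)
    unguarded = "ok"
except TypeError:
    unguarded = "TypeError"

print(unguarded, choose_serializer_kind(VectorTypeStub), choose_serializer_kind(FloatVectorStub))
```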

Copilot · 2026-03-24T22:18:50Z

cassandra/serializers.pyx

+        cdef char *src
+        cdef char *dst
+
+        for i in range(self.vector_size):
+            _check_float_range(<double>values[i])
+            val = <float>values[i]


In the vector fast-path loops, each element is indexed multiple times (values[i] is evaluated for the range-check and again for the cast). This adds overhead and can also produce inconsistent behavior for custom sequences with side effects in __getitem__. Consider storing values[i] in a local variable once per iteration and reusing it for range-check + cast.

Suggested change

-        cdef char *src
-        cdef char *dst
-
-        for i in range(self.vector_size):
-            _check_float_range(<double>values[i])
-            val = <float>values[i]
+        cdef double dv
+        cdef char *src
+        cdef char *dst
+
+        for i in range(self.vector_size):
+            dv = <double>values[i]
+            _check_float_range(dv)
+            val = <float>dv
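The cost of indexing twice can be made visible with a sequence that counts its own reads. This is a pure-Python sketch with hypothetical names (`CountingSeq`, `pack_reading_twice`, `pack_reading_once`); the `float()` call merely stands in for the range check.

```python
import struct

class CountingSeq:
    """Hypothetical sequence that counts how often items are read."""
    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        self.reads += 1
        return self._data[index]

def pack_reading_twice(values):
    out = bytearray()
    for i in range(len(values)):
        float(values[i])                     # first read: stands in for the range check
        out += struct.pack(">f", values[i])  # second read: the actual conversion
    return bytes(out)

def pack_reading_once(values):
    out = bytearray()
    for i in range(len(values)):
        item = values[i]                     # single read, reused for both steps
        float(item)
        out += struct.pack(">f", item)
    return bytes(out)

twice = CountingSeq([1.0, 2.0, 3.0])
once = CountingSeq([1.0, 2.0, 3.0])
assert pack_reading_twice(twice) == pack_reading_once(once)
print(twice.reads, once.reads)
```

Reading once also guarantees that the value that was range-checked is the value that gets written.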

Copilot · 2026-03-24T22:18:50Z

tests/unit/test_serializers.py

+from cassandra.cython_deps import HAVE_CYTHON
+
+try:
+    from tests import VERIFY_CYTHON
+except ImportError:
+    VERIFY_CYTHON = False
+
+from cassandra.cqltypes import (
+    FloatType,
+    DoubleType,
+    Int32Type,
+    VectorType,
+    UTF8Type,
+    LongType,
+    BooleanType,
+)
+
+# Import serializers only if Cython is available
+if HAVE_CYTHON or VERIFY_CYTHON:
+    from cassandra.serializers import (
+        Serializer,
+        SerFloatType,
+        SerDoubleType,
+        SerInt32Type,
+        SerVectorType,
+        GenericSerializer,
+        find_serializer,
+        make_serializers,
+    )
+
+cythontest = unittest.skipUnless(
+    HAVE_CYTHON or VERIFY_CYTHON, "Cython is not available"
+)


When VERIFY_CYTHON is true (common in this repo’s Cython unit tests), importing cassandra.serializers directly can fail if the extension wasn’t built ahead of time. Other Cython tests use tests.unit.cython.utils.cyimport() to compile/import .pyx modules under VERIFY_CYTHON. Consider switching to that helper (and reusing tests.unit.cython.utils.cythontest) so these tests work in both compiled-extension and pyximport modes.

Copilot · 2026-03-24T22:18:50Z

tests/unit/test_serializers.py

+        for val in [0.0, 1.0, -1.0, 3.14, float("inf"), float("-inf")]:
+            serialized = ser.serialize(val, PROTO)
+            deserialized = FloatType.deserialize(serialized, PROTO)
+            self.assertAlmostEqual(val, deserialized, places=5)


assertAlmostEqual() handles inf/-inf values only through its exact-equality shortcut; the difference-based comparison cannot apply to them (inf - inf becomes nan). Since this test includes infinities, compare those with assertEqual() (or handle them as a special case) and reserve assertAlmostEqual() for finite floats.

Suggested change

-            self.assertAlmostEqual(val, deserialized, places=5)
+            if val in (float("inf"), float("-inf")):
+                self.assertEqual(val, deserialized)
+            else:
+                self.assertAlmostEqual(val, deserialized, places=5)
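For reference, a small check of how `assertAlmostEqual()` treats infinities in CPython's `unittest`: equal infinities are accepted by its `first == second` shortcut, and mismatched ones are rejected. Handling infinities explicitly with `assertEqual()`, as suggested, still makes the intent of the test clearer. The `InfChecks` class is a throwaway helper for this sketch.

```python
import unittest

class InfChecks(unittest.TestCase):
    def runTest(self):
        # Placeholder so the case can be instantiated directly.
        pass

tc = InfChecks()
inf = float("inf")

# Equal infinities: accepted by the exact-equality shortcut.
tc.assertAlmostEqual(inf, inf, places=5)
tc.assertAlmostEqual(-inf, -inf, places=5)

# Mismatched infinities: the assertion fails as expected.
try:
    tc.assertAlmostEqual(inf, -inf, places=5)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

# Finite floats still get the usual places-based tolerance.
tc.assertAlmostEqual(3.14, 3.1400001, places=5)
print(mismatch_detected)
```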

Copilot · 2026-03-24T22:18:50Z

tests/unit/test_serializers.py

+        import ctypes
+
+        flt_max = 3.4028234663852886e38
+        self._assert_equiv(flt_max)


Unused import: ctypes is imported but never used in this test. Please remove it to keep the test minimal.

Copilot · 2026-03-24T22:18:50Z

cassandra/serializers.pyx

+    cdef int vector_size
+    cdef object subtype
+    # 0 = generic, 1 = float, 2 = double, 3 = int32
+    cdef int type_code
+
+    def __init__(self, cqltype):
+        super().__init__(cqltype)
+        self.vector_size = cqltype.vector_size
+        self.subtype = cqltype.subtype


SerVectorType stores vector_size as a C int (self.vector_size = cqltype.vector_size). If a parameterized VectorType is constructed with a very large dimension, this cast can truncate/overflow, and the subsequent buffer-size computation (buf_size = self.vector_size * 4 or * 8) and write loop can then write past the allocated bytes buffer (memory corruption/segfault). Consider storing vector_size as Py_ssize_t (or a Python int) and validating it (>= 0, and the multiplication fits in Py_ssize_t) before allocating/writing.
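The validation suggested here can be sketched in Python. `checked_buffer_size` is a hypothetical helper, and the `ctypes` line only illustrates generic 32-bit wraparound of a size product; it is not a statement about what the Cython code in this PR actually does.

```python
import ctypes
import sys

def checked_buffer_size(vector_size, elem_size):
    # Validate the dimension before computing an allocation size.
    if not isinstance(vector_size, int) or vector_size < 0:
        raise ValueError("vector_size must be a non-negative integer")
    # sys.maxsize is the largest value a Py_ssize_t can hold.
    if vector_size > sys.maxsize // elem_size:
        raise OverflowError("vector_size * elem_size does not fit in Py_ssize_t")
    return vector_size * elem_size

# Typical case: 1536 floats of 4 bytes each.
print(checked_buffer_size(1536, 4))

# Illustration only: a size product of 2**32 wraps to 0 in a 32-bit signed int,
# which is how an undersized buffer could come about.
wrapped = ctypes.c_int32(2**29 * 8).value
print(wrapped)
```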

mykaul mentioned this pull request Mar 14, 2026: (improvement) query: add Cython-aware serializer path in BoundStatement.bind() #749 (Draft)

mykaul marked this pull request as draft March 14, 2026 11:23

mykaul requested a review from Copilot March 14, 2026 19:25

Copilot started reviewing on behalf of mykaul March 14, 2026 19:25

Copilot AI reviewed Mar 14, 2026

mykaul requested a review from Copilot March 19, 2026 14:19

Copilot AI reviewed Mar 19, 2026

mykaul requested a review from Copilot March 24, 2026 22:14

Copilot started reviewing on behalf of mykaul March 24, 2026 22:14

Copilot AI reviewed Mar 24, 2026

2 participants