
(improvement) LWT prepared statement performance: analysis and improvement plan #751

@mykaul

Description

Summary

Analysis of LWT (Lightweight Transaction) performance with prepared statements, identification of bottlenecks, and a plan for improvements across the full execution pipeline.

Background

LWT queries in this driver are always prepared statements — the is_lwt flag is set during the PREPARE response via the ScyllaDB protocol extension SCYLLA_LWT_ADD_METADATA_MARK (protocol.py:789-791). The prepared statement execution pipeline for LWT shares the same stages as any prepared statement but with one critical difference:

LWT prepared statements have result_metadata = None (query.py:508, test_prepared_statements.py:623). This is because LWT results have variable column sets:

  • applied=True → returns ([applied]) (1 column)
  • applied=False → returns ([applied], col1, col2, ...) (N columns with existing values)

This means the skip_meta optimization (cluster.py:2956) is disabled for LWT — the server must send full result metadata with every response.
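A minimal standalone sketch of the `skip_meta` decision described above (class and attribute names mirror the driver's, but this is illustrative, not the driver's code):

```python
# Illustrative sketch: why skip_meta is disabled for LWT prepared statements.
class PreparedStatement:
    def __init__(self, result_metadata):
        # LWT statements carry result_metadata = None because their result
        # schema varies per response (applied vs. not-applied).
        self.result_metadata = result_metadata

regular = PreparedStatement(result_metadata=[("col1", "int")])
lwt = PreparedStatement(result_metadata=None)

# Mirrors cluster.py: skip_meta=bool(prepared_statement.result_metadata)
assert bool(regular.result_metadata) is True   # server may omit metadata
assert bool(lwt.result_metadata) is False      # server must resend metadata
```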

How LWT Queries Flow Through the Driver

Detection (Server-Side, ScyllaDB Extension)

  1. During connection establishment, the server advertises SCYLLA_LWT_ADD_METADATA_MARK with a LWT_OPTIMIZATION_META_BIT_MASK=<bitmask> value
  2. Parsed into _LwtInfo(lwt_meta_bit_mask) stored in ProtocolFeatures.lwt_info (protocol_features.py:84-96)
  3. On PREPARE response, lwt_info.get_lwt_flag(flags) checks if the LWT bit is set (protocol.py:791)
  4. Stored as PreparedStatement._is_lwt (query.py:514,539)
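The bitmask check in step 3 can be sketched as follows (the mask value and function name here are illustrative, modeled on `protocol_features.py` / `protocol.py` rather than copied from them):

```python
# Hypothetical mask value; the real one is advertised by the server via
# LWT_OPTIMIZATION_META_BIT_MASK=<bitmask> during connection establishment.
LWT_META_BIT_MASK = 0x80000000

def get_lwt_flag(flags: int, mask: int = LWT_META_BIT_MASK) -> bool:
    # The PREPARE response sets this bit when the statement is an LWT.
    return bool(flags & mask)

assert get_lwt_flag(0x80000001) is True
assert get_lwt_flag(0x00000001) is False
```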

Flag Propagation

| Class | Location | Behavior |
|---|---|---|
| Statement | query.py:378-379 | Base: always False |
| PreparedStatement | query.py:514,539,650-651 | Stores `_is_lwt` from server |
| BoundStatement | query.py:912-913 | Delegates to `prepared_statement.is_lwt()` |
| BatchStatement | query.py:989,1085-1108,1144-1145 | True if any added statement is LWT |

Impact on Routing and Retry

  • TokenAwarePolicy (policies.py:517): LWT queries disable replica shuffling for Paxos consensus ordering
  • RetryPolicy (policies.py:960-965): CAS write timeouts are rethrown (not retried)
  • DowngradingConsistencyRetryPolicy (policies.py:1108-1110): Serial consistency reads are not downgraded

Result Handling

  • ResultSet.was_applied (cluster.py:5323-5351): Checks [applied] column from LWT results
  • LWTException / check_applied() (cqlengine/query.py:45-78): CQLEngine ORM integration
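A standalone sketch of what `was_applied` conceptually checks for a single-statement LWT (illustrative; the real implementation in cluster.py also handles batch results and named access to the `[applied]` column):

```python
# Illustrative: a single-statement LWT returns one row whose first column
# is "[applied]"; additional columns appear only when the LWT was rejected.
def was_applied(rows):
    return bool(rows[0][0])

assert was_applied([(True,)]) is True                      # applied: 1 column
assert was_applied([(False, "existing_a", "b")]) is False  # rejected: N columns
```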

Identified Performance Bottlenecks

1. LWT Disables Result Metadata Caching (Fundamental Protocol Limitation — NOT FIXABLE)

LWT prepared statements have result_metadata = None, so skip_meta=bool(prepared_statement.result_metadata) evaluates to False (cluster.py:2956). Every LWT execution includes full metadata from the server, and the driver re-parses it every time.

Investigation conclusion: This is a correct, intentional protocol-level design decision, not a driver-side oversight. The server sets NO_METADATA_FLAG (0x0004) in the PREPARE response because LWT result schemas are non-deterministic at prepare time. See B1 analysis below for the full explanation.

2. namedtuple Class Creation Per Response (High Impact)

named_tuple_factory (the default row factory) calls Python's namedtuple(), which internally uses exec(), on every response. For single-row LWT results, this class creation is the dominant per-response cost.
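The caching idea behind PR #740 can be sketched like this (illustrative, not the PR's code): create the namedtuple class once per distinct column tuple instead of once per response, so the exec() cost is paid only on the first result with a given schema.

```python
from collections import namedtuple

# Cache of row classes keyed by the tuple of column names.
_row_class_cache = {}

def cached_named_tuple_factory(colnames, rows):
    key = tuple(colnames)
    cls = _row_class_cache.get(key)
    if cls is None:
        # Only this first call pays the namedtuple()/exec() cost.
        cls = namedtuple("Row", colnames, rename=True)
        _row_class_cache[key] = cls
    return [cls(*row) for row in rows]

rows = cached_named_tuple_factory(["applied"], [[True]])
assert rows[0].applied is True
# A second result with the same columns reuses the cached class (no exec()).
more = cached_named_tuple_factory(["applied"], [[False]])
assert type(more[0]) is type(rows[0])
```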

3. ParseDesc Reconstruction Per Response (High Impact)

ParseDesc construction (column name extraction, ColDesc namedtuple creation, deserializer lookup) runs on every result set. For LWT's small 1-row results, this is the dominant decode cost.

4. Per-Value Column Encryption Policy Check (Medium Impact)

column_encryption_policy is checked per-value in both bind and decode paths, even when no encryption policy is set (99%+ of deployments).
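The fix direction (as in PR #630) is to hoist the policy check out of the per-value loop. A minimal standalone sketch, with hypothetical function names:

```python
# Illustrative: check the encryption policy once per result set, not per value.
def decode_rows(rows, deserializers, column_encryption_policy=None):
    if column_encryption_policy is None:
        # Fast path for the common case: no per-value policy lookup at all.
        return [[d(v) for d, v in zip(deserializers, row)] for row in rows]
    # Slow path would consult the policy per column here.
    raise NotImplementedError("CE decode path omitted in this sketch")

decoded = decode_rows([[b"\x01"]], [lambda b: b == b"\x01"])
assert decoded == [[True]]
```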

5. Limited Cython Serializer Coverage (Medium Impact)

Only FloatType, DoubleType, Int32Type, and VectorType have Cython serializers. All other types fall through to GenericSerializer.

6. Hardcoded Timeouts (Low Impact)

  • Re-preparation timeout: 5.0s (cluster.py:2296-2297)
  • Connection borrow timeout: 2.0s (cluster.py:4520)

7. WeakValueDictionary for Prepared Statement Cache (Low Impact)

If the user discards all references to a PreparedStatement, it gets GC'd and needs re-preparation — a hidden extra round-trip (cluster.py:1448).
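The caveat can be demonstrated directly with a stand-in class (illustrative; `Stmt` stands in for PreparedStatement and `"query_id"` is a hypothetical cache key):

```python
import gc
import weakref

class Stmt:
    """Stand-in for PreparedStatement."""

cache = weakref.WeakValueDictionary()
stmt = Stmt()
cache["query_id"] = stmt

del stmt      # the application drops its only strong reference...
gc.collect()
# ...and the cache entry silently vanishes, forcing a re-prepare round-trip.
assert "query_id" not in cache
```

The practical mitigation is for applications to hold their own strong references to prepared statements for as long as they intend to reuse them.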

Existing PRs Addressing the Pipeline

| Stage | PR | Impact on LWT | Speedup |
|---|---|---|---|
| Bind (serialization) | #749 Cython serializers, #630 CE-policy split | Direct — every LWT execution calls bind() | 1.3x scalar |
| Send (memory) | #659 `__slots__` | Indirect — reduces GC pressure per message | ~300 bytes/msg |
| Receive (I/O) | #734 zero-copy read path | Modest for LWT (small 1-row results) | 1.17x small |
| Decode (ParseDesc) | #742 cache ParseDesc | Critical for LWT — ParseDesc construction is the dominant cost for 1-row results | 13.7x 1-row |
| Decode (deserializers) | #741 cache deserializer lookup | Direct — prerequisite for #742 | 6-38x lookup |
| Decode (CE elimination) | #630 CE-policy split | Direct — removes per-value CE check | Constant factor |
| Row factory | #740 cache namedtuple class | Critical — exec() called per response is 135x overhead | 135x class creation |
| Re-prepare (correctness) | #725 UNPREPARED fix | Prevents spurious failures during coordinator restarts | Reliability |

Recommended PR Landing Order

Phase 1 — Land existing PRs (priority order for LWT impact):

  1. PR #741 (deserializer cache) — no dependencies
  2. PR #740 (namedtuple cache) — no dependencies
  3. PR #630 (CE-policy split) — no dependencies
  4. PR #725 (UNPREPARED fix) — no dependencies
  5. PR #742 (ParseDesc cache) — after #741
  6. PR #749 (Cython bind) — after #630
  7. PR #659 (`__slots__`) — no dependencies
  8. PR #734 (zero-copy read) — no dependencies

New Work: LWT-Specific Optimizations

B1. LWT Result Metadata Caching (NOT FEASIBLE)

Cache both result metadata variants (applied vs. not-applied) keyed by column count.

Status: Investigated and ruled out. The result_metadata = None / NO_METADATA_FLAG behavior is a correct protocol-level design decision by the Cassandra/ScyllaDB server, not a driver-side oversight or performance gap that can be closed.

Why LWT result schemas cannot be cached:

A single prepared LWT statement can produce three or more distinct result schemas across its lifetime:

| Scenario | Columns Returned |
|---|---|
| Applied (row inserted) | `[applied]` — 1 column |
| Not applied (row exists) | `[applied], a, b, d` — 4 columns |
| Not applied (after ALTER TABLE ADD c) | `[applied], a, b, c, d` — 5 columns |

This is verified by the integration test _test_updated_conditional (tests/integration/standard/test_prepared_statements.py:595-634, PYTHON-847).

A dual-schema caching approach fails because:

  1. ALTER TABLE invalidates the not-applied schema. Adding/dropping columns changes the column count and types. A cached schema would silently produce wrong deserialization after DDL changes.
  2. The schema space is unbounded. Over the lifetime of a prepared statement, there are N+1 possible schemas (1 applied + N historical table schema versions after DDL changes).
  3. Column count is ambiguous. Different schema versions could produce the same column count after enough ALTER TABLE operations.
  4. Different LWT types produce different not-applied shapes. INSERT IF NOT EXISTS returns all columns; UPDATE IF col = val returns only the IF-clause columns; UPDATE IF EXISTS never has additional columns.
  5. The server has already made this decision. The NO_METADATA_FLAG is set server-side in the PREPARE response. The driver's skip_meta=bool(result_metadata) simply mirrors this. Changing this would require a new CQL protocol extension in ScyllaDB itself.

History: ScyllaDB issue scylladb/scylladb#6259 (2020, kostja) addressed what was feasible — adding the is_lwt flag to PREPARE responses for routing purposes. Metadata caching was deliberately out of scope because the fundamental variability problem cannot be solved driver-side.

Driver git history:

  • 2016 (PYTHON-71): result_metadata and skip_meta optimization introduced. LWT returned [] for result_metadata.
  • 2017 (PYTHON-847): Integration tests added proving result_metadata stays empty and result_metadata_id does NOT change even after ALTER TABLE.
  • 2019: Refactored from [] to None (early return in recv_results_metadata()). Test updated to use assertIsNone(prepared_statement.result_metadata).
  • 2025: ScyllaDB LWT protocol extension (SCYLLA_LWT_ADD_METADATA_MARK) added for is_lwt flag detection.

B2. LWT-Aware Retry Policy (Medium Impact, Low Complexity)

Create configurable CAS retry behavior — allow retrying CAS timeouts on the same coordinator with backoff.

Status: Not yet started.
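A standalone sketch of the proposed behavior (written without importing the driver; the constants and function below are stand-ins for `RetryPolicy.on_write_timeout` and `WriteType.CAS` in cassandra/policies.py):

```python
# Hedged sketch of B2: allow a bounded number of retries for CAS write
# timeouts, where the default driver policy simply rethrows.
RETRY, RETHROW = "retry", "rethrow"
WRITE_TYPE_CAS = "CAS"

def on_write_timeout(write_type, retry_num, max_cas_retries=1):
    if write_type == WRITE_TYPE_CAS and retry_num < max_cas_retries:
        # Retry on the same coordinator; a real implementation would
        # also apply backoff before resubmitting.
        return RETRY
    return RETHROW

assert on_write_timeout(WRITE_TYPE_CAS, retry_num=0) == RETRY
assert on_write_timeout(WRITE_TYPE_CAS, retry_num=1) == RETHROW
```

In the actual driver this would likely take the form of a RetryPolicy subclass with a configurable retry budget, preserving the default rethrow behavior unless the user opts in.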

B3. LWT Performance Benchmark Suite (DONE)

Create benchmarks for: LWT bind, LWT decode (applied/not-applied), LWT end-to-end throughput, ParseDesc cache hit rate for LWT, and prepared SELECT vs. prepared LWT comparison.

Status: Completed. Branch perf/lwt-benchmarks on mykaul remote. 67 pytest-benchmark tests covering bind, decode, was_applied, and comparison benchmarks.

B4. Optimize was_applied Fast Path (Low Impact, Low Complexity)

Use is_lwt() to skip batch detection regex in ResultSet.was_applied.

Status: Completed. PR mykaul/python-driver#13. Unit tests added.

Benchmark results (inconclusive): Microbenchmarks on a noisy system (5 runs, pinned to single CPU, 5000+ rounds each) showed no statistically significant difference. The optimization saves an isinstance() + regex match() call (~tens of nanoseconds), which is lost in the noise of property access and row factory overhead (~3400ns median). The value is in code clarity and avoiding unnecessary regex for the common non-batch LWT path.

B5. Pre-allocate values List in BoundStatement.bind() (Low Impact, Low Complexity)

Use [None] * len(values) with index assignment instead of repeated append().

Status: Completed. PR mykaul/python-driver#12. No new tests needed (existing 37 bind tests cover all paths).

Benchmark results (inconclusive): Microbenchmarks showed no measurable improvement. The bind() cost is dominated by serialize() calls per value, not list operations. Results across multiple runs were within noise (±3-4%). The optimization is essentially neutral — the enumerate(zip(...)) wrapper may offset the savings from avoiding append().
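The two bind strategies compared in B5 can be sketched side by side (illustrative stand-ins for the BoundStatement.bind() inner loop; `serialize` here is any per-value serializer):

```python
# Before: grow the list with repeated append().
def bind_append(values, serialize):
    out = []
    for v in values:
        out.append(serialize(v))
    return out

# After (B5): pre-allocate and assign by index.
def bind_prealloc(values, serialize):
    out = [None] * len(values)
    for i, v in enumerate(values):
        out[i] = serialize(v)
    return out

vals = [1, 2, 3]
assert bind_append(vals, str) == bind_prealloc(vals, str) == ["1", "2", "3"]
```

As the benchmark results note, both variants are dominated by the serialize() calls themselves, so the list-construction strategy is essentially performance-neutral.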

Expected Combined Impact

For a typical LWT prepared statement execution (single-row result, default named_tuple_factory):

The decode phase improvements are the most impactful for LWT because LWT results are small (1 row), making fixed per-response costs (ParseDesc construction, namedtuple class creation) the dominant overhead rather than per-row processing.

Note: Bottleneck #1 (LWT disables result metadata caching) is a fundamental CQL protocol limitation that cannot be addressed driver-side. The per-response metadata overhead is the inherent cost of LWT's variable result schemas. The existing PRs (#740, #741, #742) minimize the cost of processing that metadata once received.
