Summary
Analysis of LWT (Lightweight Transaction) performance with prepared statements, identification of bottlenecks, and a plan for improvements across the full execution pipeline.
Background
LWT queries in this driver are always prepared statements — the is_lwt flag is set during the PREPARE response via the ScyllaDB protocol extension SCYLLA_LWT_ADD_METADATA_MARK (protocol.py:789-791). The prepared statement execution pipeline for LWT shares the same stages as any prepared statement but with one critical difference:
LWT prepared statements have result_metadata = None (query.py:508, test_prepared_statements.py:623). This is because LWT results have variable column sets:
applied=True→ returns([applied])(1 column)applied=False→ returns([applied], col1, col2, ...)(N columns with existing values)
This means the skip_meta optimization (cluster.py:2956) is disabled for LWT — the server must send full result metadata with every response.
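The effect of that missing metadata on the skip_meta flag can be sketched as follows (simplified stand-ins for the driver's objects; `skip_metadata` is a hypothetical helper mirroring the one-line check at cluster.py:2956):

```python
class PreparedStatement:
    """Simplified stand-in: only the attribute the check needs."""
    def __init__(self, result_metadata):
        self.result_metadata = result_metadata

def skip_metadata(prepared):
    # Mirrors cluster.py:2956: skip_meta=bool(prepared_statement.result_metadata)
    return bool(prepared.result_metadata)

regular = PreparedStatement(result_metadata=[("ks", "tbl", "col", "int")])
lwt = PreparedStatement(result_metadata=None)  # LWT: schema unknown at prepare time

print(skip_metadata(regular))  # True  -> server may omit per-response metadata
print(skip_metadata(lwt))      # False -> server resends full metadata every time
```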
How LWT Queries Flow Through the Driver
Detection (Server-Side, ScyllaDB Extension)
- During connection establishment, the server advertises SCYLLA_LWT_ADD_METADATA_MARK with a LWT_OPTIMIZATION_META_BIT_MASK=<bitmask> value
- Parsed into _LwtInfo(lwt_meta_bit_mask) stored in ProtocolFeatures.lwt_info (protocol_features.py:84-96)
- On PREPARE response, lwt_info.get_lwt_flag(flags) checks if the LWT bit is set (protocol.py:791)
- Stored as PreparedStatement._is_lwt (query.py:514,539)
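A minimal sketch of that handshake logic, assuming the option value and flag shapes described above (`LwtInfo` and `parse_lwt_info` are illustrative stand-ins, not the driver's actual API):

```python
class LwtInfo:
    """Illustrative stand-in for _LwtInfo (protocol_features.py)."""
    def __init__(self, lwt_meta_bit_mask):
        self.lwt_meta_bit_mask = lwt_meta_bit_mask

    def get_lwt_flag(self, flags):
        # The server sets this bit in the PREPARE response flags for LWT statements.
        return bool(flags & self.lwt_meta_bit_mask)

def parse_lwt_info(option_value):
    # e.g. "LWT_OPTIMIZATION_META_BIT_MASK=2147483648" from the SUPPORTED options
    key, _, value = option_value.partition("=")
    if key != "LWT_OPTIMIZATION_META_BIT_MASK":
        raise ValueError(option_value)
    return LwtInfo(int(value))

info = parse_lwt_info("LWT_OPTIMIZATION_META_BIT_MASK=2147483648")
print(info.get_lwt_flag(0x80000001))  # True: the LWT bit is set
print(info.get_lwt_flag(0x00000001))  # False: a regular prepared statement
```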
Flag Propagation
| Class | Location | Behavior |
|---|---|---|
| Statement | query.py:378-379 | Base: always False |
| PreparedStatement | query.py:514,539,650-651 | Stores _is_lwt from server |
| BoundStatement | query.py:912-913 | Delegates to prepared_statement.is_lwt() |
| BatchStatement | query.py:989,1085-1108,1144-1145 | True if any added statement is LWT |
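The delegation rules in the table can be sketched with simplified stand-in classes (not the driver's real signatures):

```python
class Statement:
    def is_lwt(self):
        return False  # base class: always False

class PreparedStatement(Statement):
    def __init__(self, is_lwt_from_server):
        self._is_lwt = is_lwt_from_server  # set from the PREPARE response

    def is_lwt(self):
        return self._is_lwt

class BoundStatement(Statement):
    def __init__(self, prepared_statement):
        self.prepared_statement = prepared_statement

    def is_lwt(self):
        return self.prepared_statement.is_lwt()  # delegate to the prepared statement

class BatchStatement(Statement):
    def __init__(self):
        self._statements = []

    def add(self, statement):
        self._statements.append(statement)

    def is_lwt(self):
        return any(s.is_lwt() for s in self._statements)  # True if any member is LWT

batch = BatchStatement()
batch.add(Statement())
batch.add(BoundStatement(PreparedStatement(is_lwt_from_server=True)))
print(batch.is_lwt())  # True: one member is an LWT statement
```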
Impact on Routing and Retry
- TokenAwarePolicy (policies.py:517): LWT queries disable replica shuffling for Paxos consensus ordering
- RetryPolicy (policies.py:960-965): CAS write timeouts are rethrown (not retried)
- DowngradingConsistencyRetryPolicy (policies.py:1108-1110): Serial consistency reads are not downgraded
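The TokenAwarePolicy behavior, for example, amounts to skipping the replica shuffle when the statement is LWT. A hedged sketch (`order_replicas` is an illustrative helper, not the policy's real method):

```python
import random

def order_replicas(replicas, is_lwt, shuffle_replicas=True):
    # For LWT, keep the deterministic token-ring order so concurrent Paxos
    # rounds from different clients tend to target the same coordinator.
    if shuffle_replicas and not is_lwt:
        replicas = list(replicas)
        random.shuffle(replicas)
    return list(replicas)

ring_order = ["replica-1", "replica-2", "replica-3"]
print(order_replicas(ring_order, is_lwt=True))  # always the ring order
```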
Result Handling
- ResultSet.was_applied (cluster.py:5323-5351): checks the [applied] column from LWT results
- LWTException / check_applied() (cqlengine/query.py:45-78): CQLEngine ORM integration
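The core of the was_applied check can be sketched as follows, assuming dict-shaped rows for clarity (the real driver reads the first column of whatever row type the configured row factory produced):

```python
APPLIED_COLUMN = "[applied]"

def was_applied(rows):
    # An LWT result carries the CAS outcome in the "[applied]" column of its
    # first (and, for non-batch LWT, only) row.
    if not rows:
        raise RuntimeError("not an LWT result set")
    return bool(rows[0][APPLIED_COLUMN])

print(was_applied([{"[applied]": True}]))                           # True: CAS won
print(was_applied([{"[applied]": False, "a": 1, "b": 2, "d": 4}]))  # False: row existed
```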
Identified Performance Bottlenecks
1. LWT Disables Result Metadata Caching (Fundamental Protocol Limitation — NOT FIXABLE)
LWT prepared statements have result_metadata = None, so skip_meta=bool(prepared_statement.result_metadata) evaluates to False (cluster.py:2956). Every LWT execution includes full metadata from the server, and the driver re-parses it every time.
Investigation conclusion: This is a correct, intentional protocol-level design decision, not a driver-side oversight. The server sets NO_METADATA_FLAG (0x0004) in the PREPARE response because LWT result schemas are non-deterministic at prepare time. See B1 analysis below for the full explanation.
2. namedtuple Class Creation Per Response (High Impact)
named_tuple_factory (the default row factory) calls Python's namedtuple() which internally uses exec() — on every response. For single-row LWT results, this is the dominant per-response cost.
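The fix in PR #740 is a cache keyed by the column-name tuple; a minimal sketch of that idea:

```python
from collections import namedtuple

_row_class_cache = {}

def cached_named_tuple_factory(colnames, rows):
    # namedtuple() compiles a fresh class via exec() on every call; reusing the
    # class for a repeated column-name tuple removes that per-response cost.
    key = tuple(colnames)
    row_cls = _row_class_cache.get(key)
    if row_cls is None:
        # rename=True keeps invalid identifiers such as "[applied]" legal.
        row_cls = _row_class_cache[key] = namedtuple("Row", key, rename=True)
    return [row_cls(*row) for row in rows]

first = cached_named_tuple_factory(["a", "b"], [(1, 2)])
second = cached_named_tuple_factory(["a", "b"], [(3, 4)])
print(type(first[0]) is type(second[0]))  # True: the class was built only once
```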
3. ParseDesc Reconstruction Per Response (High Impact)
ParseDesc construction (column name extraction, ColDesc namedtuple creation, deserializer lookup) runs on every result set. For LWT's small 1-row results, this is the dominant decode cost.
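PR #742 caches the constructed ParseDesc; a generic memoization sketch of the idea (the cache key here is hypothetical — the actual PR ties the cached object to the prepared statement):

```python
class ParseDesc:
    """Illustrative stand-in: column names plus per-column deserializers."""
    def __init__(self, colnames, deserializers):
        self.colnames = colnames
        self.deserializers = deserializers

_parse_desc_cache = {}

def get_parse_desc(statement_id, metadata_version, build_desc):
    # Memoize by (statement id, metadata version) so a DDL change that bumps
    # the version naturally invalidates the stale entry.
    key = (statement_id, metadata_version)
    desc = _parse_desc_cache.get(key)
    if desc is None:
        desc = _parse_desc_cache[key] = build_desc()
    return desc
```

The build callable runs only on a cache miss, so steady-state decoding skips the column-name extraction and deserializer lookup entirely.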
4. Per-Value Column Encryption Policy Check (Medium Impact)
column_encryption_policy is checked per-value in both bind and decode paths, even when no encryption policy is set (99%+ of deployments).
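PR #630's split hoists that check out of the per-value loop; sketched with illustrative names (not the driver's internals):

```python
def decode_row(raw_values, deserializers, ce_policy=None):
    # Decide once whether column encryption applies instead of checking inside
    # the per-value loop; the common no-policy path pays nothing extra.
    if ce_policy is None:
        return [deser(raw) for deser, raw in zip(deserializers, raw_values)]
    return [ce_policy.decode(deser, raw) for deser, raw in zip(deserializers, raw_values)]

print(decode_row([b"1", b"42"], [int, int]))  # [1, 42] via the no-policy fast path
```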
5. Limited Cython Serializer Coverage (Medium Impact)
Only FloatType, DoubleType, Int32Type, and VectorType have Cython serializers. All other types fall through to GenericSerializer.
6. Hardcoded Timeouts (Low Impact)
- Re-preparation timeout: 5.0s (cluster.py:2296-2297)
- Connection borrow timeout: 2.0s (cluster.py:4520)
7. WeakValueDictionary for Prepared Statement Cache (Low Impact)
If the user discards all references to a PreparedStatement, it gets GC'd and needs re-preparation — a hidden extra round-trip (cluster.py:1448).
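The failure mode is plain weak-reference semantics and can be demonstrated in isolation (CPython; `gc.collect()` forces the point):

```python
import gc
import weakref

class PreparedStatement:
    pass

cache = weakref.WeakValueDictionary()  # mirrors the driver's statement cache
stmt = PreparedStatement()
cache["some-lwt-query"] = stmt
print("some-lwt-query" in cache)  # True: a strong reference still exists

del stmt      # user drops their last reference
gc.collect()  # the entry vanishes from the weak-value cache
print("some-lwt-query" in cache)  # False: the next execute must re-prepare
```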
Existing PRs Addressing the Pipeline
| Stage | PR | Impact on LWT | Speedup |
|---|---|---|---|
| Bind (serialization) | #749 Cython serializers, #630 CE-policy split | Direct — every LWT execution calls bind() | 1.3x scalar |
| Send (memory) | #659 __slots__ | Indirect — reduces GC pressure per message | ~300 bytes/msg |
| Receive (I/O) | #734 zero-copy read path | Modest for LWT (small 1-row results) | 1.17x small |
| Decode (ParseDesc) | #742 cache ParseDesc | Critical for LWT — ParseDesc construction is dominant cost for 1-row results | 13.7x 1-row |
| Decode (deserializers) | #741 cache deserializer lookup | Direct — prerequisite for #742 | 6-38x lookup |
| Decode (CE elimination) | #630 CE-policy split | Direct — removes per-value CE check | Constant factor |
| Row factory | #740 cache namedtuple class | Critical — exec() called per response is 135x overhead | 135x class creation |
| Re-prepare (correctness) | #725 UNPREPARED fix | Prevents spurious failures during coordinator restarts | Reliability |
Recommended PR Landing Order
Phase 1 — Land existing PRs (priority order for LWT impact):
- PR #741 (deserializer cache) — no dependencies
- PR #740 (namedtuple cache) — no dependencies
- PR #630 (CE-policy split) — no dependencies
- PR #725 (UNPREPARED fix) — no dependencies
- PR #742 (ParseDesc cache) — after #741
- PR #749 (Cython bind) — after #630
- PR #659 (__slots__) — no dependencies
- PR #734 (zero-copy read) — no dependencies
New Work: LWT-Specific Optimizations
B1. LWT Result Metadata Caching (NOT FEASIBLE)
Cache both result metadata variants (applied vs. not-applied) keyed by column count.
Status: Investigated and ruled out. The result_metadata = None / NO_METADATA_FLAG behavior is a correct protocol-level design decision by the Cassandra/ScyllaDB server, not a driver-side oversight or performance gap that can be closed.
Why LWT result schemas cannot be cached:
A single prepared LWT statement can produce three or more distinct result schemas across its lifetime:
| Scenario | Columns Returned |
|---|---|
| Applied (row inserted) | [applied] — 1 column |
| Not applied (row exists) | [applied], a, b, d — 4 columns |
| Not applied (after ALTER TABLE ADD c) | [applied], a, b, c, d — 5 columns |
This is verified by the integration test _test_updated_conditional (tests/integration/standard/test_prepared_statements.py:595-634, PYTHON-847).
A dual-schema caching approach fails because:
- ALTER TABLE invalidates the not-applied schema. Adding/dropping columns changes the column count and types. A cached schema would silently produce wrong deserialization after DDL changes.
- The schema space is unbounded. Over the lifetime of a prepared statement, there are N+1 possible schemas (1 applied + N historical table schema versions after DDL changes).
- Column count is ambiguous. Different schema versions could produce the same column count after enough ALTER TABLE operations.
- Different LWT types produce different not-applied shapes. INSERT IF NOT EXISTS returns all columns; UPDATE IF col = val returns only the IF-clause columns; UPDATE IF EXISTS never has additional columns.
- The server has already made this decision. The NO_METADATA_FLAG is set server-side in the PREPARE response. The driver's skip_meta=bool(result_metadata) simply mirrors this. Changing this would require a new CQL protocol extension in ScyllaDB itself.
History: ScyllaDB issue scylladb/scylladb#6259 (2020, kostja) addressed what was feasible — adding the is_lwt flag to PREPARE responses for routing purposes. Metadata caching was deliberately out of scope because the fundamental variability problem cannot be solved driver-side.
Driver git history:
- 2016 (PYTHON-71): result_metadata and the skip_meta optimization introduced. LWT returned [] for result_metadata.
- 2017 (PYTHON-847): Integration tests added proving result_metadata stays empty and result_metadata_id does NOT change even after ALTER TABLE.
- 2019: Refactored from [] to None (early return in recv_results_metadata()). Test updated to assert assertIsNone(prepared_statement.result_metadata).
- 2025: ScyllaDB LWT protocol extension (SCYLLA_LWT_ADD_METADATA_MARK) added for is_lwt flag detection.
B2. LWT-Aware Retry Policy (Medium Impact, Low Complexity)
Create configurable CAS retry behavior — allow retrying CAS timeouts on the same coordinator with backoff.
Status: Not yet started.
B3. LWT Performance Benchmark Suite (DONE)
Create benchmarks for: LWT bind, LWT decode (applied/not-applied), LWT end-to-end throughput, ParseDesc cache hit rate for LWT, and prepared SELECT vs. prepared LWT comparison.
Status: Completed. Branch perf/lwt-benchmarks on mykaul remote. 67 pytest-benchmark tests covering bind, decode, was_applied, and comparison benchmarks.
B4. Optimize was_applied Fast Path (Low Impact, Low Complexity)
Use is_lwt() to skip batch detection regex in ResultSet.was_applied.
Status: Completed. PR mykaul/python-driver#13. Unit tests added.
Benchmark results (inconclusive): Microbenchmarks on a noisy system (5 runs, pinned to single CPU, 5000+ rounds each) showed no statistically significant difference. The optimization saves an isinstance() + regex match() call (~tens of nanoseconds), which is lost in the noise of property access and row factory overhead (~3400ns median). The value is in code clarity and avoiding unnecessary regex for the common non-batch LWT path.
B5. Pre-allocate values List in BoundStatement.bind() (Low Impact, Low Complexity)
Use [None] * len(values) with index assignment instead of repeated append().
Status: Completed. PR mykaul/python-driver#12. No new tests needed (existing 37 bind tests cover all paths).
Benchmark results (inconclusive): Microbenchmarks showed no measurable improvement. The bind() cost is dominated by serialize() calls per value, not list operations. Results across multiple runs were within noise (±3-4%). The optimization is essentially neutral — the enumerate(zip(...)) wrapper may offset the savings from avoiding append().
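For reference, the B5 change amounts to the following pattern (`bind_preallocated` is an illustrative stand-in for the loop inside BoundStatement.bind(), with serialization reduced to plain callables):

```python
def bind_preallocated(values, serializers):
    # Pre-size the list and assign by index instead of growing with append().
    bound = [None] * len(values)
    for i, (value, serialize) in enumerate(zip(values, serializers)):
        bound[i] = serialize(value)
    return bound

print(bind_preallocated([1, "x"], [str, str.upper]))  # ['1', 'X']
```

As the benchmarks above indicate, the per-value serialize() calls dominate, so the list-allocation strategy is essentially cost-neutral.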
Expected Combined Impact
For a typical LWT prepared statement execution (single-row result, default named_tuple_factory):
- Bind phase: ~1.3x faster (PRs #749 + #630)
- Decode phase: ~14x faster ParseDesc handling (PRs #742 + #741), ~135x faster row factory (PR #740), and CE overhead eliminated (PR #630)
- Memory: ~1.2 KB saved per round-trip (PR #659)
- Reliability: transparent re-preparation on coordinator restart (PR #725)
The decode phase improvements are the most impactful for LWT because LWT results are small (1 row), making fixed per-response costs (ParseDesc construction, namedtuple class creation) the dominant overhead rather than per-row processing.
Note: Bottleneck #1 (LWT disables result metadata caching) is a fundamental CQL protocol limitation that cannot be addressed driver-side. The per-response metadata overhead is the inherent cost of LWT's variable result schemas. The existing PRs (#740, #741, #742) minimize the cost of processing that metadata once received.