Summary
Analysis of LWT (Lightweight Transaction) performance with prepared statements, identification of bottlenecks, and a plan for improvements across the full execution pipeline.
Background
LWT queries in this driver are always prepared statements — the is_lwt flag is set during the PREPARE response via the ScyllaDB protocol extension SCYLLA_LWT_ADD_METADATA_MARK (protocol.py:789-791). The prepared statement execution pipeline for LWT shares the same stages as any prepared statement but with one critical difference:
LWT prepared statements have result_metadata = None (query.py:508, test_prepared_statements.py:623). This is because LWT results have variable column sets:
applied=True→ returns([applied])(1 column)applied=False→ returns([applied], col1, col2, ...)(N columns with existing values)
This means the skip_meta optimization (cluster.py:2956) is disabled for LWT — the server must send full result metadata with every response.
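The effect of that missing metadata on the skip_meta flag can be sketched as follows (simplified stand-ins for the driver's objects; `skip_metadata` is a hypothetical helper mirroring the one-line check at cluster.py:2956):

```python
class PreparedStatement:
    """Simplified stand-in: only the attribute the check needs."""
    def __init__(self, result_metadata):
        self.result_metadata = result_metadata

def skip_metadata(prepared):
    # Mirrors cluster.py:2956: skip_meta=bool(prepared_statement.result_metadata)
    return bool(prepared.result_metadata)

regular = PreparedStatement(result_metadata=[("ks", "tbl", "col", "int")])
lwt = PreparedStatement(result_metadata=None)  # LWT: schema unknown at prepare time

print(skip_metadata(regular))  # True  -> server may omit per-response metadata
print(skip_metadata(lwt))      # False -> server resends full metadata every time
```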
How LWT Queries Flow Through the Driver
Detection (Server-Side, ScyllaDB Extension)
- During connection establishment, the server advertises SCYLLA_LWT_ADD_METADATA_MARK with a LWT_OPTIMIZATION_META_BIT_MASK=<bitmask> value
- Parsed into _LwtInfo(lwt_meta_bit_mask) stored in ProtocolFeatures.lwt_info (protocol_features.py:84-96)
- On PREPARE response, lwt_info.get_lwt_flag(flags) checks if the LWT bit is set (protocol.py:791)
- Stored as PreparedStatement._is_lwt (query.py:514,539)
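A minimal sketch of that handshake logic, assuming the option value and flag shapes described above (`LwtInfo` and `parse_lwt_info` are illustrative stand-ins, not the driver's actual API):

```python
class LwtInfo:
    """Illustrative stand-in for _LwtInfo (protocol_features.py)."""
    def __init__(self, lwt_meta_bit_mask):
        self.lwt_meta_bit_mask = lwt_meta_bit_mask

    def get_lwt_flag(self, flags):
        # The server sets this bit in the PREPARE response flags for LWT statements.
        return bool(flags & self.lwt_meta_bit_mask)

def parse_lwt_info(option_value):
    # e.g. "LWT_OPTIMIZATION_META_BIT_MASK=2147483648" from the SUPPORTED options
    key, _, value = option_value.partition("=")
    if key != "LWT_OPTIMIZATION_META_BIT_MASK":
        raise ValueError(option_value)
    return LwtInfo(int(value))

info = parse_lwt_info("LWT_OPTIMIZATION_META_BIT_MASK=2147483648")
print(info.get_lwt_flag(0x80000001))  # True: the LWT bit is set
print(info.get_lwt_flag(0x00000001))  # False: a regular prepared statement
```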
Flag Propagation
| Class | Location | Behavior |
|---|---|---|
| Statement | query.py:378-379 | Base: always False |
| PreparedStatement | query.py:514,539,650-651 | Stores _is_lwt from server |
| BoundStatement | query.py:912-913 | Delegates to prepared_statement.is_lwt() |
| BatchStatement | query.py:989,1085-1108,1144-1145 | True if any added statement is LWT |
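The delegation rules in the table can be sketched with simplified stand-in classes (not the driver's real signatures):

```python
class Statement:
    def is_lwt(self):
        return False  # base class: always False

class PreparedStatement(Statement):
    def __init__(self, is_lwt_from_server):
        self._is_lwt = is_lwt_from_server  # set from the PREPARE response

    def is_lwt(self):
        return self._is_lwt

class BoundStatement(Statement):
    def __init__(self, prepared_statement):
        self.prepared_statement = prepared_statement

    def is_lwt(self):
        return self.prepared_statement.is_lwt()  # delegate to the prepared statement

class BatchStatement(Statement):
    def __init__(self):
        self._statements = []

    def add(self, statement):
        self._statements.append(statement)

    def is_lwt(self):
        return any(s.is_lwt() for s in self._statements)  # True if any member is LWT

batch = BatchStatement()
batch.add(Statement())
batch.add(BoundStatement(PreparedStatement(is_lwt_from_server=True)))
print(batch.is_lwt())  # True: one member is an LWT statement
```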
Impact on Routing and Retry
- TokenAwarePolicy (policies.py:517): LWT queries disable replica shuffling for Paxos consensus ordering
- RetryPolicy (policies.py:960-965): CAS write timeouts are rethrown (not retried)
- DowngradingConsistencyRetryPolicy (policies.py:1108-1110): Serial consistency reads are not downgraded
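The TokenAwarePolicy behavior, for example, amounts to skipping the replica shuffle when the statement is LWT. A hedged sketch (`order_replicas` is an illustrative helper, not the policy's real method):

```python
import random

def order_replicas(replicas, is_lwt, shuffle_replicas=True):
    # For LWT, keep the deterministic token-ring order so concurrent Paxos
    # rounds from different clients tend to target the same coordinator.
    if shuffle_replicas and not is_lwt:
        replicas = list(replicas)
        random.shuffle(replicas)
    return list(replicas)

ring_order = ["replica-1", "replica-2", "replica-3"]
print(order_replicas(ring_order, is_lwt=True))  # always the ring order
```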
Result Handling
- ResultSet.was_applied (cluster.py:5323-5351): checks the [applied] column from LWT results
- LWTException / check_applied() (cqlengine/query.py:45-78): CQLEngine ORM integration
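The core of the was_applied check can be sketched as follows, assuming dict-shaped rows for clarity (the real driver reads the first column of whatever row type the configured row factory produced):

```python
APPLIED_COLUMN = "[applied]"

def was_applied(rows):
    # An LWT result carries the CAS outcome in the "[applied]" column of its
    # first (and, for non-batch LWT, only) row.
    if not rows:
        raise RuntimeError("not an LWT result set")
    return bool(rows[0][APPLIED_COLUMN])

print(was_applied([{"[applied]": True}]))                           # True: CAS won
print(was_applied([{"[applied]": False, "a": 1, "b": 2, "d": 4}]))  # False: row existed
```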
Identified Performance Bottlenecks
1. LWT Disables Result Metadata Caching (Fundamental Protocol Limitation — NOT FIXABLE)
LWT prepared statements have result_metadata = None, so skip_meta=bool(prepared_statement.result_metadata) evaluates to False (cluster.py:2956). Every LWT execution includes full metadata from the server, and the driver re-parses it every time.
Investigation conclusion: This is a correct, intentional protocol-level design decision, not a driver-side oversight. The server sets NO_METADATA_FLAG (0x0004) in the PREPARE response because LWT result schemas are non-deterministic at prepare time. See B1 analysis below for the full explanation.
2. namedtuple Class Creation Per Response (High Impact)
named_tuple_factory (the default row factory) calls Python's namedtuple() which internally uses exec() — on every response. For single-row LWT results, this is the dominant per-response cost.
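The fix in PR #740 is a cache keyed by the column-name tuple; a minimal sketch of that idea:

```python
from collections import namedtuple

_row_class_cache = {}

def cached_named_tuple_factory(colnames, rows):
    # namedtuple() compiles a fresh class via exec() on every call; reusing the
    # class for a repeated column-name tuple removes that per-response cost.
    key = tuple(colnames)
    row_cls = _row_class_cache.get(key)
    if row_cls is None:
        # rename=True keeps invalid identifiers such as "[applied]" legal.
        row_cls = _row_class_cache[key] = namedtuple("Row", key, rename=True)
    return [row_cls(*row) for row in rows]

first = cached_named_tuple_factory(["a", "b"], [(1, 2)])
second = cached_named_tuple_factory(["a", "b"], [(3, 4)])
print(type(first[0]) is type(second[0]))  # True: the class was built only once
```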
3. ParseDesc Reconstruction Per Response (High Impact)
ParseDesc construction (column name extraction, ColDesc namedtuple creation, deserializer lookup) runs on every result set. For LWT's small 1-row results, this is the dominant decode cost.
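PR #742 caches the constructed ParseDesc; a generic memoization sketch of the idea (the cache key here is hypothetical — the actual PR ties the cached object to the prepared statement):

```python
class ParseDesc:
    """Illustrative stand-in: column names plus per-column deserializers."""
    def __init__(self, colnames, deserializers):
        self.colnames = colnames
        self.deserializers = deserializers

_parse_desc_cache = {}

def get_parse_desc(statement_id, metadata_version, build_desc):
    # Memoize by (statement id, metadata version) so a DDL change that bumps
    # the version naturally invalidates the stale entry.
    key = (statement_id, metadata_version)
    desc = _parse_desc_cache.get(key)
    if desc is None:
        desc = _parse_desc_cache[key] = build_desc()
    return desc
```

The build callable runs only on a cache miss, so steady-state decoding skips the column-name extraction and deserializer lookup entirely.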
4. Per-Value Column Encryption Policy Check (Medium Impact)
column_encryption_policy is checked per-value in both bind and decode paths, even when no encryption policy is set (99%+ of deployments).
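PR #630's split hoists that check out of the per-value loop; sketched with illustrative names (not the driver's internals):

```python
def decode_row(raw_values, deserializers, ce_policy=None):
    # Decide once whether column encryption applies instead of checking inside
    # the per-value loop; the common no-policy path pays nothing extra.
    if ce_policy is None:
        return [deser(raw) for deser, raw in zip(deserializers, raw_values)]
    return [ce_policy.decode(deser, raw) for deser, raw in zip(deserializers, raw_values)]

print(decode_row([b"1", b"42"], [int, int]))  # [1, 42] via the no-policy fast path
```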
5. Limited Cython Serializer Coverage (Medium Impact)
Only FloatType, DoubleType, Int32Type, and VectorType have Cython serializers. All other types fall through to GenericSerializer.
6. Hardcoded Timeouts (Low Impact)
- Re-preparation timeout: 5.0s (cluster.py:2296-2297)
- Connection borrow timeout: 2.0s (cluster.py:4520)
7. WeakValueDictionary for Prepared Statement Cache (Low Impact)
If the user discards all references to a PreparedStatement, it gets GC'd and needs re-preparation — a hidden extra round-trip (cluster.py:1448).
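The failure mode is plain weak-reference semantics and can be demonstrated in isolation (CPython; `gc.collect()` forces the point):

```python
import gc
import weakref

class PreparedStatement:
    pass

cache = weakref.WeakValueDictionary()  # mirrors the driver's statement cache
stmt = PreparedStatement()
cache["some-lwt-query"] = stmt
print("some-lwt-query" in cache)  # True: a strong reference still exists

del stmt      # user drops their last reference
gc.collect()  # the entry vanishes from the weak-value cache
print("some-lwt-query" in cache)  # False: the next execute must re-prepare
```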
Existing PRs Addressing the Pipeline
| Stage | PR | Impact on LWT | Speedup |
|---|---|---|---|
| Bind (serialization) | #749 Cython serializers, #630 CE-policy split | Direct — every LWT execution calls bind() | 1.3x scalar |
| Send (memory) | #659 __slots__ | Indirect — reduces GC pressure per message | ~300 bytes/msg |
| Receive (I/O) | #734 zero-copy read path | Modest for LWT (small 1-row results) | 1.17x small |
| Decode (ParseDesc) | #742 cache ParseDesc | Critical for LWT — ParseDesc construction is dominant cost for 1-row results | 13.7x 1-row |
| Decode (deserializers) | #741 cache deserializer lookup | Direct — prerequisite for #742 | 6-38x lookup |
| Decode (CE elimination) | #630 CE-policy split | Direct — removes per-value CE check | Constant factor |
| Row factory | #740 cache namedtuple class | Critical — exec() called per response is 135x overhead | 135x class creation |
| Re-prepare (correctness) | #725 UNPREPARED fix | Prevents spurious failures during coordinator restarts | Reliability |
Recommended PR Landing Order
Phase 1 — Land existing PRs (priority order for LWT impact):
- PR #741 (deserializer cache) — no dependencies
- PR #740 (namedtuple cache) — no dependencies
- PR #630 (CE-policy split) — no dependencies
- PR #725 (UNPREPARED fix) — no dependencies
- PR #742 (ParseDesc cache) — after #741
- PR #749 (Cython bind) — after #630
- PR #659 (__slots__) — no dependencies
- PR #734 (zero-copy read) — no dependencies
New Work: LWT-Specific Optimizations
B1. LWT Result Metadata Caching (NOT FEASIBLE)
Cache both result metadata variants (applied vs. not-applied) keyed by column count.
Status: Investigated and ruled out. The result_metadata = None / NO_METADATA_FLAG behavior is a correct protocol-level design decision by the Cassandra/ScyllaDB server, not a driver-side oversight or performance gap that can be closed.
Why LWT result schemas cannot be cached:
A single prepared LWT statement can produce three or more distinct result schemas across its lifetime:
| Scenario | Columns Returned |
|---|---|
| Applied (row inserted) | [applied] — 1 column |
| Not applied (row exists) | [applied], a, b, d — 4 columns |
| Not applied (after ALTER TABLE ADD c) | [applied], a, b, c, d — 5 columns |
This is verified by the integration test _test_updated_conditional (tests/integration/standard/test_prepared_statements.py:595-634, PYTHON-847).
A dual-schema caching approach fails because:
- ALTER TABLE invalidates the not-applied schema. Adding/dropping columns changes the column count and types. A cached schema would silently produce wrong deserialization after DDL changes.
- The schema space is unbounded. Over the lifetime of a prepared statement, there are N+1 possible schemas (1 applied + N historical table schema versions after DDL changes).
- Column count is ambiguous. Different schema versions could produce the same column count after enough ALTER TABLE operations.
- Different LWT types produce different not-applied shapes. INSERT IF NOT EXISTS returns all columns; UPDATE IF col = val returns only the IF-clause columns; UPDATE IF EXISTS never has additional columns.
- The server has already made this decision. The NO_METADATA_FLAG is set server-side in the PREPARE response. The driver's skip_meta=bool(result_metadata) simply mirrors this. Changing this would require a new CQL protocol extension in ScyllaDB itself.
History: ScyllaDB issue scylladb/scylladb#6259 (2020, kostja) addressed what was feasible — adding the is_lwt flag to PREPARE responses for routing purposes. Metadata caching was deliberately out of scope because the fundamental variability problem cannot be solved driver-side.
Driver git history:
- 2016 (PYTHON-71): result_metadata and the skip_meta optimization introduced. LWT returned [] for result_metadata.
- 2017 (PYTHON-847): Integration tests added proving result_metadata stays empty and result_metadata_id does NOT change even after ALTER TABLE.
- 2019: Refactored from [] to None (early return in recv_results_metadata()). Test updated to assert assertIsNone(prepared_statement.result_metadata).
- 2025: ScyllaDB LWT protocol extension (SCYLLA_LWT_ADD_METADATA_MARK) added for is_lwt flag detection.
B2. LWT-Aware Retry Policy (Medium Impact, Low Complexity)
Create configurable CAS retry behavior — allow retrying CAS timeouts on the same coordinator with backoff.
Status: Not yet started.
B3. LWT Performance Benchmark Suite (DONE)
Create benchmarks for: LWT bind, LWT decode (applied/not-applied), LWT end-to-end throughput, ParseDesc cache hit rate for LWT, and prepared SELECT vs. prepared LWT comparison.
Status: Completed. Branch perf/lwt-benchmarks on mykaul remote. 67 pytest-benchmark tests covering bind, decode, was_applied, and comparison benchmarks.
B4. Optimize was_applied Fast Path (Low Impact, Low Complexity)
Use is_lwt() to skip batch detection regex in ResultSet.was_applied.
Status: Completed. PR mykaul/python-driver#13. Unit tests added.
Benchmark results (inconclusive): Microbenchmarks on a noisy system (5 runs, pinned to single CPU, 5000+ rounds each) showed no statistically significant difference. The optimization saves an isinstance() + regex match() call (~tens of nanoseconds), which is lost in the noise of property access and row factory overhead (~3400ns median). The value is in code clarity and avoiding unnecessary regex for the common non-batch LWT path.
B5. Pre-allocate values List in BoundStatement.bind() (Low Impact, Low Complexity)
Use [None] * len(values) with index assignment instead of repeated append().
Status: Completed. PR mykaul/python-driver#12. No new tests needed (existing 37 bind tests cover all paths).
Benchmark results (inconclusive): Microbenchmarks showed no measurable improvement. The bind() cost is dominated by serialize() calls per value, not list operations. Results across multiple runs were within noise (±3-4%). The optimization is essentially neutral — the enumerate(zip(...)) wrapper may offset the savings from avoiding append().
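For reference, the B5 change amounts to the following pattern (`bind_preallocated` is an illustrative stand-in for the loop inside BoundStatement.bind(), with serialization reduced to plain callables):

```python
def bind_preallocated(values, serializers):
    # Pre-size the list and assign by index instead of growing with append().
    bound = [None] * len(values)
    for i, (value, serialize) in enumerate(zip(values, serializers)):
        bound[i] = serialize(value)
    return bound

print(bind_preallocated([1, "x"], [str, str.upper]))  # ['1', 'X']
```

As the benchmarks above indicate, the per-value serialize() calls dominate, so the list-allocation strategy is essentially cost-neutral.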
Expected Combined Impact
For a typical LWT prepared statement execution (single-row result, default named_tuple_factory):
- Bind phase: ~1.3x faster (PRs #749 + #630)
- Decode phase: ~14x faster ParseDesc handling (PRs #742 + #741), ~135x faster row factory (PR #740), and CE overhead eliminated (PR #630)
- Memory: ~1.2 KB saved per round-trip (PR #659)
- Reliability: transparent re-preparation on coordinator restart (PR #725)
The decode phase improvements are the most impactful for LWT because LWT results are small (1 row), making fixed per-response costs (ParseDesc construction, namedtuple class creation) the dominant overhead rather than per-row processing.
Note: Bottleneck #1 (LWT disables result metadata caching) is a fundamental CQL protocol limitation that cannot be addressed driver-side. The per-response metadata overhead is the inherent cost of LWT's variable result schemas. The existing PRs (#740, #741, #742) minimize the cost of processing that metadata once received.