
cluster: add control-connection query fallback #878

Open

dkropachev wants to merge 1 commit into master from dk/add-control-connection-fallback

@dkropachev
Collaborator

@dkropachev commented May 7, 2026

Summary

Add an opt-in fallback that sends application queries over the control connection when the driver cannot populate its normal node pools. This happens in deployments that expose the cluster through an address other than the nodes' broadcast addresses, such as a TCP proxy or a node's public IP: the initial control connection succeeds, but the node addresses discovered from the cluster are unreachable. In that mode the driver can still execute queries over the single control connection, but throughput is poor, and connection churn increases the chance of request errors. The option is intentionally disabled by default and should not be used in production.
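
To make the routing rule concrete, here is a minimal, self-contained Python model of it. It is not the driver's code: `pick_connection`, `NoHostAvailableSketch`, and the enum values are hypothetical, and only the three mode names come from this PR. It assumes a request always prefers a usable node-pool connection and only falls back to the control connection when no pool is usable and the mode is not `Disabled`.

```python
import enum


class ControlConnectionQueryFallback(enum.Enum):
    # Member names mirror the modes used in this PR; the values are illustrative.
    Disabled = "disabled"
    Fallback = "fallback"
    NoNodePoolFallback = "no_node_pool_fallback"


class NoHostAvailableSketch(Exception):
    """Hypothetical stand-in for the driver's NoHostAvailable error."""


def pick_connection(mode, pools, control_connection):
    """Choose the connection for one request (hypothetical helper).

    pools maps host -> connection, with None for a host whose pool is unusable.
    In NoNodePoolFallback mode no pools are created, so `pools` is empty.
    """
    usable = [conn for conn in pools.values() if conn is not None]
    if usable:
        # Normal path: prefer a node-pool connection whenever one exists.
        return usable[0]
    if mode is ControlConnectionQueryFallback.Disabled or control_connection is None:
        # Default behavior: without usable pools the request fails.
        raise NoHostAvailableSketch("no usable node pools")
    # Opt-in path: reuse the single control connection (slow, not for production).
    return control_connection
```

For example, `pick_connection(ControlConnectionQueryFallback.Fallback, {"10.255.255.1": None}, control)` returns `control`, while the same call in `Disabled` mode raises.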

Also propagate keyspace updates on the fallback path so that a USE statement keeps the control connection's keyspace in sync.
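
The keyspace fix matters because the active keyspace is connection state: after `USE ks`, unqualified table names resolve against `ks` only on connections that have switched to it. A minimal sketch of the idea follows; `FakeConnection`, `apply_use_result`, and the `used_fallback` flag are hypothetical and do not correspond to driver internals.

```python
class FakeConnection:
    """Hypothetical connection that only tracks its current keyspace."""

    def __init__(self, name):
        self.name = name
        self.keyspace = None


def apply_use_result(new_keyspace, pools, control_connection, used_fallback):
    """Hypothetical hook run after a USE statement succeeds.

    Every connection that may carry later unqualified queries has to switch
    to the new keyspace.
    """
    for conn in pools.values():
        # Normal path: all node-pool connections follow the USE.
        conn.keyspace = new_keyspace
    if used_fallback and control_connection is not None:
        # Fallback path: the control connection carries application queries,
        # so it must follow the USE as well.
        control_connection.keyspace = new_keyspace
```

Without the second branch, a session in fallback mode would accept the USE but keep running later queries against the control connection's previous keyspace.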

Fixes: #720

Tests

  • uv run python - <<"PY" ... pytest tests/unit/test_cluster.py -q -k set_keyspace_for_all_pools_reports_all_errors ... PY
  • uv run python - <<"PY" ... pytest tests/unit/test_response_future.py -q -k control_connection_fallback_updates_connection_keyspace ... PY

@dkropachev self-assigned this May 7, 2026
@dkropachev marked this pull request as ready for review May 7, 2026 05:58
@dkropachev force-pushed the dk/add-control-connection-fallback branch from 88b0f45 to 82a5b3c May 7, 2026 05:59
@dkropachev marked this pull request as draft May 7, 2026 05:59
@dkropachev force-pushed the dk/add-control-connection-fallback branch from 82a5b3c to 4eb62ce May 7, 2026 06:10
@dkropachev marked this pull request as ready for review May 7, 2026 11:08
@dkropachev
Collaborator Author

cqlsh manual test report

I tested cqlsh with the local driver from /extra/scylladb/python-driver-4, installed in editable mode into the cqlsh venv as scylla-driver 3.29.9.

Test setup:

  • Scylla container: cqlsh-nodepool-test
  • Docker network: bridge
  • Reachable CQL endpoint: 127.0.0.1:19042
  • Advertised node endpoint: 10.255.255.1:19042

This setup makes the initial control connection reachable but leaves the discovered node-pool connections unusable.

Results:

| Mode | Session start | Node-pool behavior | Query result |
|---|---|---|---|
| Disabled | Fails | No usable pool | NoHostAvailable |
| Fallback | Succeeds | Tries pools, tolerates none being usable | Query succeeds via control connection |
| NoNodePoolFallback | Succeeds | Skips pool creation | Query succeeds via control connection |
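
The three rows of the table can be modeled as a startup decision. This is a hypothetical sketch, not the driver's connection logic: `start_session`, `Mode`, and the use of `ConnectionError` as a stand-in for NoHostAvailable are assumptions, and "reachable" simply means the client could open a socket to that address. The empty `attempted` set returned in `NoNodePoolFallback` mode plays a role analogous to the empty `initial_connect_futures` in the report.

```python
import enum


class Mode(enum.Enum):
    # Names mirror ControlConnectionQueryFallback's modes from the report.
    Disabled = "Disabled"
    Fallback = "Fallback"
    NoNodePoolFallback = "NoNodePoolFallback"


def start_session(mode, advertised_hosts, reachable):
    """Model session startup (hypothetical; not the driver's code).

    advertised_hosts: addresses the cluster reports for its nodes.
    reachable: addresses the client can actually connect to.
    Returns (pools, attempted_hosts); raises ConnectionError on failure.
    """
    if mode is Mode.NoNodePoolFallback:
        # Pool creation is skipped entirely: no hosts are even attempted.
        return {}, set()
    attempted = set(advertised_hosts)
    pools = {h: "pool:" + h for h in advertised_hosts if h in reachable}
    if not pools and mode is Mode.Disabled:
        # Default behavior: no usable pool means the session cannot start.
        raise ConnectionError("Unable to connect to any servers", sorted(attempted))
    # Fallback tolerates an empty result; queries then use the control connection.
    return pools, attempted
```

With the report's setup, `start_session(Mode.Fallback, ["10.255.255.1"], {"127.0.0.1"})` returns `({}, {"10.255.255.1"})`, matching the empty `pools_at_session_start` in the Fallback run.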

The default Disabled mode failed as expected:

Connection error: ('Unable to connect to any servers', ['10.255.255.1'])

With ControlConnectionQueryFallback.Fallback, cqlsh succeeded:

No usable node pools; falling back to control connection for host 127.0.0.1:19042

 system.now()
--------------------------------------
 11708f30-4a04-11f1-9b89-4372a228dacb

(1 rows)

I also verified directly through the driver that the session starts with no node pools and still executes the query:

pools_at_session_start {}
query_result OrderedDict([('system.now()', UUID(...))])
pools_after_query {}

With ControlConnectionQueryFallback.NoNodePoolFallback, cqlsh also succeeded, and the driver skipped pool creation entirely:

pools_at_session_start {}
initial_connect_futures set()
query_result OrderedDict([('system.now()', UUID(...))])
pools_after_query {}

Conclusion: Fallback now handles the intended case. cqlsh can start even when no node pools are reachable, and queries run through the control connection. NoNodePoolFallback also works and skips node-pool creation explicitly.

@dkropachev force-pushed the dk/add-control-connection-fallback branch 3 times, most recently from d14ab6c to a94d092 May 7, 2026 11:26
@dkropachev force-pushed the dk/add-control-connection-fallback branch 2 times, most recently from b50342a to 5f16f20 May 7, 2026 18:04
@dkropachev closed this May 7, 2026
@dkropachev reopened this May 8, 2026
@dkropachev force-pushed the dk/add-control-connection-fallback branch from 5f16f20 to a37ecce May 8, 2026 01:06

@Lorak-mmk left a comment


I have to admit I don't understand the logic and flow around _control_connection_query_attempted. I'll have to just assume it works.

I have 2 more questions regarding SkipPoolCreation:

  • WDYT about changing ProfileManager to return IGNORED distance for all hosts in this mode? It would add hardening against unexpected code paths doing something weird. One possible problem is that it may affect the control connection, in which case we should not do it.
  • I think Cluster.on_up and Cluster.on_down should also be guarded in SkipPoolCreation mode. We don't want reconnectors to start etc.
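
Both review suggestions can be sketched as guards keyed on the skip-pool-creation mode. This is a hypothetical wrapper, not ProfileManager or Cluster: `SkipPoolCreationGuard`, `handled_events`, and the local `HostDistance` enum are assumptions for illustration. It deliberately does not model the control connection, which, as the reviewer notes, may also consult host distance and is the reason to be careful with the first idea.

```python
import enum


class HostDistance(enum.Enum):
    # Local stand-in; the real driver has its own HostDistance in cassandra.policies.
    LOCAL = "local"
    REMOTE = "remote"
    IGNORED = "ignored"


class SkipPoolCreationGuard:
    """Hypothetical wrapper illustrating the two hardening ideas from the review."""

    def __init__(self, skip_pool_creation, base_distance):
        self.skip_pool_creation = skip_pool_creation
        self.base_distance = base_distance  # callable: host -> HostDistance
        self.handled_events = []  # records events that would touch pools

    def distance(self, host):
        # Idea 1: report IGNORED for every host so nothing tries to build a pool.
        if self.skip_pool_creation:
            return HostDistance.IGNORED
        return self.base_distance(host)

    def on_up(self, host):
        # Idea 2: drop up/down events so no pools or reconnectors get started.
        if self.skip_pool_creation:
            return
        self.handled_events.append(("up", host))

    def on_down(self, host):
        if self.skip_pool_creation:
            return
        self.handled_events.append(("down", host))
```

In skip mode every host reports `IGNORED` and up/down events are no-ops; outside it, the wrapper defers to the normal distance function and handles events as usual.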

@dkropachev force-pushed the dk/add-control-connection-fallback branch from a37ecce to 16424dc May 8, 2026 10:32
@dkropachev
Collaborator Author

Great idea; I can easily see how this stops the driver from spinning on pool creation. Added.

@dkropachev requested a review from Lorak-mmk May 8, 2026 10:33
Collaborator

@sylwiaszunejko left a comment


Looks good now

Add an opt-in control-connection fallback for application queries when the driver cannot populate normal node pools, which happens in deployments that expose the cluster through a non-broadcast IP address such as a TCP proxy or a node public IP. In that mode the driver can still execute queries over the single control connection, but throughput is poor and connection churn increases the chance of request errors. This option is intentionally disabled by default and should not be used in production.

Also propagate keyspace updates on the fallback path so USE keeps the control connection in sync.

Tests:
- tests/unit/test_cluster.py::ClusterTest::test_set_keyspace_for_all_pools_reports_all_errors
- tests/unit/test_response_future.py::ResponseFutureTests::test_control_connection_fallback_updates_connection_keyspace
@dkropachev force-pushed the dk/add-control-connection-fallback branch from 16424dc to b0e8df1 May 8, 2026 12:35

Development

Successfully merging this pull request may close these issues.

SCT failing to connect via public address with scylla-driver==3.29.8

3 participants