FIX: Release GIL around blocking SQLSetConnectAttr calls (#565)#568
FIX: Release GIL around blocking SQLSetConnectAttr calls (#565)#568saurabh500 wants to merge 6 commits into
Conversation
PR #541 released the GIL around SQLDriverConnect / SQLDisconnect / SQLEndTran, but several SQLSetConnectAttr sites in connection.cpp were left holding the GIL: - Connection::setAutocommit (SQL_ATTR_AUTOCOMMIT) - Connection::setAttribute (user-facing set_attr) - Connection::reset (SQL_ATTR_RESET_CONNECTION, SQL_ATTR_TXN_ISOLATION) — called by the pool on acquire. setAutocommit is the call that hangs in the issue #565 repro: it runs by default immediately after connect, and when the network path goes through another Python thread (e.g. an in-process SSH tunnel via paramiko + sshtunnel), that thread cannot make progress while the GIL is held — deadlock. Wrap each of these calls with py::gil_scoped_release, matching the pattern PR #541 introduced for the other ODBC entry points. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
74814ca to
204849c
Compare
827eea5 to
92a568b
Compare
Routes mssql_python.connect()+SELECT through a pure-Python TCP forwarder running in a subprocess, with a hard external watchdog (Popen.communicate(timeout=...)). The forwarder's _pipe loop calls dst.sendall(...) on every chunk; that bound-method dispatch needs the GIL, which is exactly the deadlock condition from issue #565: with the GIL held across SQLSetConnectAttr(SQL_ATTR_AUTOCOMMIT), the forwarder thread cannot run, the driver waits forever for a reply, and the subprocess hangs. Verified that the test fails (in 30s, with the issue-#565 diagnostic message) against the unfixed binary on main, and passes (~0.2s) once the GIL release on setAutocommit/setAttribute/reset is in place. Linux-only and runs in the default functional suite (not stress). A subprocess + external watchdog is required because once the worker thread deadlocks while holding the GIL, the entire Python interpreter is starved — an in-process watchdog thread cannot make progress either. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
92a568b to
0f31307
Compare
The PR's GIL-release fix in Connection::reset() creates a mutex/GIL lock-ordering deadlock when multiple threads acquire pooled connections concurrently: - Thread A holds pool mutex, releases GIL inside reset() - Thread B gets GIL, blocks on pool mutex - Thread A's ODBC call returns, blocks reacquiring GIL → deadlock Fix: Pop one candidate at a time from the pool under the mutex, then validate (isAlive + reset) outside the mutex. This follows the same pattern already used for connect() and disconnect() in the pool. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CI environment does not have mssql_python installed system-wide; pytest's path manipulation makes it importable in the test process but the spawned subprocess does not inherit that. Pass sys.path via PYTHONPATH so the subprocess can locate mssql_python. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
mssql_python/pybind/connection/connection.cppLines 377-389 377 this->strBytesBuffer = std::move(binary_data);
378 SQLPOINTER ptr = const_cast<char*>(this->strBytesBuffer.c_str());
379 SQLINTEGER length = static_cast<SQLINTEGER>(this->strBytesBuffer.size());
380
! 381 SQLRETURN ret;
! 382 {
! 383 py::gil_scoped_release release;
! 384 ret = SQLSetConnectAttr_ptr(_dbcHandle->get(), attribute, ptr, length);
! 385 }
386 if (!SQL_SUCCEEDED(ret)) {
387 LOG("Failed to set binary attribute=%d, ret=%d", attribute, ret);
388 } else {
389 LOG("Set binary attribute=%d successfully (length=%d)", attribute, length);mssql_python/pybind/connection/connection_pool.cppLines 66-74 66 // slot will open up — but we can't wait for it here without
67 // adding a condition-variable retry loop. This is an
68 // acceptable trade-off: transient "pool full" errors under
69 // heavy contention are rare and callers can retry.
! 70 throw std::runtime_error("ConnectionPool::acquire: pool size limit reached");
71 }
72 break;
73 }
74 candidate = _pool.front();Lines 81-97 81 valid_conn = candidate;
82 break;
83 }
84 } catch (const std::exception& ex) {
! 85 LOG("Candidate connection validation failed: %s", ex.what());
86 }
87
88 // Candidate is dead or reset failed — mark for disconnect and
89 // decrement the pool size.
! 90 to_disconnect.push_back(candidate);
! 91 {
! 92 std::lock_guard<std::mutex> lock(_mutex);
! 93 if (_current_size > 0) --_current_size;
94 }
95 }
96
97 // Phase 3: Connect the new connection outside the mutex.📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.9%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.6%
mssql_python.pybind.connection.connection.cpp: 76.5%
mssql_python.__init__.py: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.connection.py: 85.3%
mssql_python.logging.py: 85.5%🔗 Quick Links
|
There was a problem hiding this comment.
Pull request overview
This PR fixes a GIL-related deadlock where mssql_python.connect() could hang indefinitely when the connection’s network path is mediated by another in-process Python thread (e.g., paramiko/sshtunnel). It does so by releasing the GIL around blocking SQLSetConnectAttr calls, and adjusts connection-pool acquisition to avoid a mutex/GIL lock-order inversion when reset() now releases the GIL.
Changes:
- Release the Python GIL around potentially blocking
SQLSetConnectAttrcall sites (autocommit, userset_attr()paths, and pooled reset paths). - Restructure
ConnectionPool::acquire()to validate candidate connections outside the pool mutex to prevent mutex/GIL deadlocks. - Add a Linux-only functional regression test that reproduces the deadlock via a pure-Python in-process TCP forwarder executed in a subprocess with a hard watchdog.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
mssql_python/pybind/connection/connection.cpp |
Releases GIL around blocking SQLSetConnectAttr calls to prevent tunnel-thread starvation deadlocks. |
mssql_python/pybind/connection/connection_pool.cpp |
Refactors acquire path to avoid holding the pool mutex across ODBC calls that may release/reacquire the GIL. |
tests/test_023_ssh_tunnel_gil_release.py |
Adds a subprocess-based watchdog regression test reproducing the hang scenario. |
Comments suppressed due to low confidence (1)
mssql_python/pybind/connection/connection_pool.cpp:62
- In acquire(), a thread can pop a candidate for validation (outside the mutex) while another thread sees the pool empty with
_current_size == _max_sizeand immediately throwspool size limit reached. If the in-flight candidate later fails validation and decrements_current_size, the second thread has already failed even though capacity would have become available. Consider tracking “in validation” candidates (or temporarily reserving/releasing capacity) so that exhaustion is decided after all popped candidates are conclusively validated, not while a candidate is being checked outside the lock.
while (true) {
std::shared_ptr<Connection> candidate;
{
std::lock_guard<std::mutex> lock(_mutex);
if (_pool.empty()) {
// No more candidates — try to reserve a slot for a new connection.
if (_current_size < _max_size) {
valid_conn = std::make_shared<Connection>(connStr, true);
++_current_size;
needs_connect = true;
} else {
throw std::runtime_error("ConnectionPool::acquire: pool size limit reached");
}
break;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -39,39 +40,51 @@ std::shared_ptr<Connection> ConnectionPool::acquire(const std::wstring& connStr, | |||
|
|
|||
| size_t pruned = before - _pool.size(); | |||
| _current_size = (_current_size >= pruned) ? (_current_size - pruned) : 0; | |||
| } | |||
| env["PYTHONPATH"] = os.pathsep.join(sys.path) | ||
|
|
||
| proc = subprocess.Popen( | ||
| [sys.executable, __file__], |
- Use os.path.abspath(__file__) for subprocess re-exec robustness - Add comments clarifying _current_size tracks reserved capacity - Document the pool-full race window as an acceptable trade-off Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Work Item / Issue Reference
Summary
Fixes the indefinite hang of
mssql_python.connect()when the target server is reached through an in-process SSH tunnel created withparamiko+sshtunnel(issue #565, originally reported as #491).Root cause
PR #541 released the GIL around
SQLDriverConnect,SQLDisconnect, andSQLEndTran, but a handful ofSQLSetConnectAttrcall sites inmssql_python/pybind/connection/connection.cppwere left holding the GIL:Connection::setAutocommit—SQL_ATTR_AUTOCOMMIT. This is the call that hangs in the issue's repro: it runs by default immediately after everyconnect().Connection::setAttribute(int / wide-string / bytes paths) — user-facingset_attr().Connection::reset—SQL_ATTR_RESET_CONNECTIONandSQL_ATTR_TXN_ISOLATION, called by the connection pool on acquire.When the GIL is held during a call that ends up doing blocking work, paramiko's transport thread cannot run and so cannot forward bytes through the SSH channel — the driver waits forever for a response that never comes.
pyodbcdoes not exhibit this because it releases the GIL around its blocking ODBC calls.Fix
Wrap each of these call sites in
py::gil_scoped_release, matching the pattern PR #541 already established forSQLDriverConnect/SQLEndTran. Local-only attribute reads (SQLGetConnectAttrforSQL_ATTR_CONNECTION_DEAD/SQL_ATTR_AUTOCOMMIT,SQLGetInfo,SQLAllocHandle) are intentionally left alone.Additional fix: Connection pool mutex/GIL deadlock
The initial GIL-release change in
Connection::reset()introduced a secondary deadlock inConnectionPool::acquire(). The pool'sacquire()method calledconn->isAlive()andconn->reset()while holding the pool_mutex. Withreset()now releasing the GIL internally, this created a classic ABBA lock-ordering inversion:_mutex, callsreset()_mutexreset()releases GIL, ODBC call_mutexThe fix restructures
ConnectionPool::acquire()to pop one candidate connection from the pool under the mutex, then validate it (isAlive()+reset()) outside the mutex. This follows the same pattern already established in the pool forconnect()(Phase 2.5) anddisconnect()(Phase 3/4), where blocking ODBC calls that release the GIL are performed without holding the pool lock.Why this is safe:
_pool: The candidate is popped from the deque atomically under the mutex before any validation. No other thread can access the same connection object._current_sizestays accurate: Stale connections are decremented during pruning (Phase 1). Failed candidates are decremented individually after validation. New connections increment the count under the mutex beforeconnect()is called._current_size < _max_size) still happens under the mutex, after confirming no viable candidates remain.isAlive()orreset()throws, the connection is caught and treated as dead — it goes to the disconnect list and_current_sizeis decremented.Verification
mssql_python.connect()hung indefinitely; after the fix it completes in ~0.3s.connect()concurrently — this hung indefinitely before the pool fix and completes instantly after.Diff: 2 files changed —
connection.cpp(+42/-12 GIL release) andconnection_pool.cpp(+39/-26 mutex restructure).