Thrift: result-set heartbeat to prevent operation-handle eviction #785
Open
sreekanth-db wants to merge 1 commit into
Force-pushed from 30dc347 to 17876f9. Commit message:
The Databricks SQL warehouse reaps a query's operation handle after roughly 20-30 minutes of driver idleness. Once that happens, any subsequent TFetchResults against the handle returns HTTP 404 / RESOURCE_DOES_NOT_EXIST and the result set is permanently broken: the driver's retry policy classifies the error as non-retryable, so the user sees their query fail even though the data has already been computed. This affects any workflow where a consumer iterates slowly over a large result set: notebook sessions, BI tools paginating user-facing tables, anywhere a fetch can sit idle for tens of minutes between calls.

Adds a per-ThriftResultSet daemon thread that periodically calls GetOperationStatus while rows are still pending on the server, mirroring the C# ADBC driver's DatabricksOperationStatusPoller and the JDBC work in databricks-jdbc PR #1415.

- New: ResultHeartbeatManager (backend/thrift_result_heartbeat_manager.py). Daemon thread, 60 s default interval, stop after 10 consecutive failures, terminal-state self-stop. Bypasses the make_request retry budget so a transient failure can't stall inside a single poll.
- New: ThriftDatabricksClient._heartbeat_poll helper (acquires the existing _request_lock; bypasses make_request).
- New: Connection kwargs enable_heartbeat (default True) and heartbeat_interval_seconds (default 60).
- Wired into ThriftResultSet: started in __init__ when the server still has rows to deliver; stopped (a) the moment _fill_results_buffer sees has_more_rows flip to False, (b) in ResultSet.close(), (c) transitively from Cursor.close()/Connection.close().
- 16 new unit tests (8 manager + 8 wiring), 3 new env-var-gated e2e tests.

Verified end-to-end on a real warehouse against all three Thrift result dispositions (Arrow inline, cloud-fetch, pre-Arrow Column inline): heartbeat=False plus a 30-minute idle reliably reproduces the 404; heartbeat=True (default) succeeds with ~29 GetOperationStatus polls during the idle window.

Co-authored-by: Isaac
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
Summary
When a client reads query results slowly, with long pauses between fetchone/fetchmany/fetchall calls, the Databricks SQL warehouse considers the query idle and times out the operation server-side, releasing the operation handle even though the data has already been computed. Any subsequent TFetchResults against the now-released handle returns HTTP 404 RESOURCE_DOES_NOT_EXIST, and the result set is permanently broken: the driver's retry policy classifies this error as non-retryable, so the user sees their query fail and has to re-execute from scratch.

Observed eviction window: roughly 20–30 minutes of driver idleness on the warehouse used for testing.
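For context, this is the shape of a consumer that hits the problem. It is a minimal sketch, not a test from this PR; the table name, row counts, and the 35-minute pause are illustrative, and credentials are taken from environment variables.

```python
# Minimal sketch of the failure mode (illustrative query and pause length).
import os
import time

from databricks import sql

with sql.connect(
    server_hostname=os.environ["DATABRICKS_SERVER_HOSTNAME"],
    http_path=os.environ["DATABRICKS_HTTP_PATH"],
    access_token=os.environ["DATABRICKS_TOKEN"],
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM samples.tpch.lineitem LIMIT 1000000")
        first_page = cursor.fetchmany(10_000)  # leaves rows pending on the server

        time.sleep(35 * 60)  # slow consumer: idle past the warehouse's eviction window

        # Without a keepalive the operation handle has been reaped by now, so the
        # TFetchResults behind this call fails with 404 RESOURCE_DOES_NOT_EXIST.
        second_page = cursor.fetchmany(10_000)
```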
This PR adds a per-ThriftResultSet daemon thread that periodically issues TGetOperationStatus against the operation handle while rows are still pending on the server. The poll resets the server's inactivity timer so the operation stays alive across long idle windows. It mirrors the C# ADBC driver's DatabricksOperationStatusPoller and the equivalent JDBC work in databricks-jdbc PR #1415.
Motivation

Affects any workflow where a consumer iterates slowly over a large result set: notebook sessions, BI tools paginating user-facing tables, dashboards that re-poll on a slow cadence, anywhere a fetch can sit idle for tens of minutes between calls. Without a keepalive, the next fetch silently 404s with no recovery path; the query must be re-executed.
Changes
New file

src/databricks/sql/backend/thrift_result_heartbeat_manager.py adds ResultHeartbeatManager, a daemon polling thread (sketched below):
- Polls on a fixed cadence, 60 s by default (configurable via heartbeat_interval_seconds).
- Stops itself when the operation reaches a terminal TOperationState (CANCELED_STATE, CLOSED_STATE, ERROR_STATE, TIMEDOUT_STATE, UKNOWN_STATE).
- FINISHED_STATE is intentionally not terminal: the handle remains alive for result streaming after execution finishes.
- stop(timeout=5.0) joins with a bounded timeout and logs at DEBUG if the thread is still alive after the timeout. It is a daemon thread, so it dies with the interpreter regardless.
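For readers who want the shape without opening the diff, a condensed sketch follows. It is not the merged file: string state names stand in for the TOperationState enum, and everything beyond the behaviour listed above (interval, terminal-state self-stop, failure cap, bounded join) is illustrative.

```python
# Condensed sketch of a result-heartbeat manager; not the merged implementation.
import logging
import threading

logger = logging.getLogger(__name__)

TERMINAL_STATES = {
    "CANCELED_STATE", "CLOSED_STATE", "ERROR_STATE", "TIMEDOUT_STATE", "UKNOWN_STATE",
}
MAX_CONSECUTIVE_FAILURES = 10


class ResultHeartbeatManager:
    def __init__(self, poll_fn, interval_seconds: float = 60.0):
        self._poll_fn = poll_fn          # e.g. a bound call to the backend's heartbeat poll
        self._interval = interval_seconds
        self._stop_event = threading.Event()
        self._thread = None

    def start(self) -> None:
        if self._thread is not None:     # double-start is a no-op
            return
        self._thread = threading.Thread(
            target=self._run, daemon=True, name="result-heartbeat"
        )
        self._thread.start()

    def stop(self, timeout: float = 5.0) -> None:
        self._stop_event.set()
        if self._thread is not None and self._thread.is_alive():
            self._thread.join(timeout)   # bounded join; daemon thread dies with the interpreter
            if self._thread.is_alive():
                logger.debug("Heartbeat thread still alive %.1fs after stop()", timeout)

    def _run(self) -> None:
        failures = 0
        while not self._stop_event.wait(self._interval):
            try:
                state = self._poll_fn()  # one GetOperationStatus round-trip
                failures = 0             # any success resets the failure counter
                if state in TERMINAL_STATES:
                    return               # FINISHED_STATE is deliberately not in this set
            except Exception:
                failures += 1
                if failures >= MAX_CONSECUTIVE_FAILURES:
                    return
```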
Backend helper

ThriftDatabricksClient._heartbeat_poll(op_handle) (thrift_backend.py): a single-shot GetOperationStatus under the existing _request_lock. It bypasses make_request so a transient failure can't stall inside the driver's 15-minute retry budget; failure counting moves to the manager (10-consecutive threshold).
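Roughly, the helper looks like the excerpt below. This is a sketch of the described behaviour, not the diff: _request_lock is named above, while the _client attribute and the type hints are assumptions about the backend's internals.

```python
# Sketch of the single-shot status poll; attribute names other than _request_lock are assumed.
from databricks.sql.thrift_api.TCLIService import ttypes


class ThriftDatabricksClient:  # excerpt: only the new helper is shown
    def _heartbeat_poll(self, op_handle: ttypes.TOperationHandle) -> int:
        """One GetOperationStatus round-trip, outside the make_request retry machinery."""
        request = ttypes.TGetOperationStatusReq(operationHandle=op_handle)
        with self._request_lock:  # serialize with other RPCs on the shared transport
            response = self._client.GetOperationStatus(request)
        return response.operationState
```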
Connection kwargs

- enable_heartbeat: bool = True, i.e. default ON (matches ADBC). Opt out with enable_heartbeat=False (usage sketched below).
- heartbeat_interval_seconds: int = 60.
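Intended usage of the new kwargs; hostname, path, and token are placeholders.

```python
# Opting in explicitly (both values shown are the defaults).
from databricks import sql

connection = sql.connect(
    server_hostname="....cloud.databricks.com",
    http_path="/sql/1.0/warehouses/....",
    access_token="dapi....",
    enable_heartbeat=True,           # default; pass False to opt out
    heartbeat_interval_seconds=60,   # default poll cadence while rows are pending
)
```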
ResultSet wiring (result_set.py)

- ResultSet._stop_heartbeat() no-op hook, called as the FIRST step in close() (before close_command) to avoid racing the close RPC on the same Thrift transport.
- ThriftResultSet.__init__ constructs and starts the manager when all of the following hold:
  - connection.enable_heartbeat is True
  - has_more_rows is True (server still has rows to deliver)
  - command_id.to_thrift_handle() is non-None
- _fill_results_buffer stops the manager the instant has_more_rows flips to False, mirroring C# ADBC's end-of-results stop hook inside ReadNextRecordBatchAsync. (See the sketch after this list.)
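A condensed, illustrative view of that wiring: method bodies are trimmed to the heartbeat-relevant parts, the ResultHeartbeatManager sketch above is reused, and helper names such as backend.fetch_results are assumptions rather than the merged code.

```python
# Illustrative wiring only; the real result_set.py classes carry much more state.
class ResultSet:
    def _stop_heartbeat(self) -> None:
        pass                                   # no-op in the base class

    def close(self) -> None:
        self._stop_heartbeat()                 # FIRST, before the close RPC on the same transport
        self.backend.close_command(self.command_id)


class ThriftResultSet(ResultSet):
    def __init__(self, connection, backend, command_id, has_more_rows, interval_seconds=60):
        self.backend = backend
        self.command_id = command_id
        self._heartbeat = None
        thrift_handle = command_id.to_thrift_handle()
        if connection.enable_heartbeat and has_more_rows and thrift_handle is not None:
            self._heartbeat = ResultHeartbeatManager(
                poll_fn=lambda: backend._heartbeat_poll(thrift_handle),
                interval_seconds=interval_seconds,
            )
            self._heartbeat.start()

    def _stop_heartbeat(self) -> None:
        if self._heartbeat is not None:
            self._heartbeat.stop()
            self._heartbeat = None

    def _fill_results_buffer(self) -> None:
        batch, has_more_rows = self.backend.fetch_results(self.command_id)
        if not has_more_rows:
            self._stop_heartbeat()             # server has delivered its last batch
        self._current_batch = batch
```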
Telemetry: intentionally not added. ADBC writes heartbeat polls into the same n_operation_status_calls / operation_status_latency_millis fields used by the execute-time poll loop, conflating two distinct mechanisms. We don't follow that, to keep the existing "poll count" metric semantically clean. If/when heartbeat-specific observability is needed, it should land as distinct fields. The in-memory consecutive-failure counter stays; it's local control flow, not telemetry.
Eligibility / where the heartbeat does NOT run

- has_more_rows=False at construction time (e.g. a small inline result with all rows delivered in direct results; the server has nothing to keep alive).
- command_id is None or not a Thrift handle (defensive guard for tests and SEA).
- Once _fill_results_buffer observes the server's last batch (no point pinging a handle the server has already finished with).
How it's tested

Unit tests (16 new, all passing locally and via pytest tests/unit)

- tests/unit/test_result_heartbeat_manager.py (8 tests): lifecycle (start, stop, idempotency, stop-before-start, double-stop, double-start no-op), terminal-state self-stop across all five terminal states, FINISHED_STATE continues polling, max-consecutive-failure stop, failure counter resets on success. (One lifecycle case is sketched below.)
- tests/unit/test_thrift_result_set_heartbeat.py (8 tests): wiring. Manager created only when eligible, close() stops once and is idempotent, _fill_results_buffer stops the manager when the server signals end-of-data, manager keeps running while more rows are pending.
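As an illustration of the lifecycle coverage, here is one such case written against the ResultHeartbeatManager sketch above; the real tests drive the merged class, so details differ.

```python
# Terminal-state self-stop, exercised against the sketch above (not the merged tests).
import time


def test_terminal_state_stops_polling():
    calls = []

    def poll():
        calls.append(1)
        return "CLOSED_STATE"              # terminal: the manager should exit after this poll

    manager = ResultHeartbeatManager(poll_fn=poll, interval_seconds=0.05)
    manager.start()
    time.sleep(0.3)                        # several intervals' worth of wall-clock time
    manager.stop()

    assert len(calls) == 1                 # polled once, saw a terminal state, stopped itself
```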
E2E tests (3 new, gated on DATABRICKS_SERVER_HOSTNAME / DATABRICKS_HTTP_PATH / DATABRICKS_TOKEN)

tests/e2e/test_heartbeat.py: heartbeat polls during idle (≥2 polls in 5 s at interval=2 s), enable_heartbeat=False skips manager construction, ResultSet.close() joins the thread within the timeout. Runs in ~9 s when the env vars are set; auto-skips otherwise.
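The env-var gate is the usual pytest pattern; the snippet below shows the assumed shape rather than the exact code in tests/e2e/test_heartbeat.py.

```python
# Assumed shape of the env-var gate for the e2e module (illustrative).
import os

import pytest

REQUIRED = ("DATABRICKS_SERVER_HOSTNAME", "DATABRICKS_HTTP_PATH", "DATABRICKS_TOKEN")

pytestmark = pytest.mark.skipif(
    not all(os.environ.get(name) for name in REQUIRED),
    reason="e2e heartbeat tests need a real warehouse; set the DATABRICKS_* env vars",
)
```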
End-to-end on a real Databricks SQL warehouse

A 30-minute idle test was run against all three Thrift result-set dispositions, exercising the exact failure mode: pause longer than the warehouse's inactivity timeout, then attempt to fetch more rows. Without the heartbeat the server has timed out the operation; with the heartbeat the server keeps the operation alive.
| Disposition | enable_heartbeat=False (without fix) | enable_heartbeat=True (with fix) |
| --- | --- | --- |
| Arrow inline (ArrowQueue) | RESOURCE_DOES_NOT_EXIST on next TFetchResults | fetchmany(20 000) succeeds; ~29 polls fire during the idle window and keep the operation alive |
| Cloud fetch (ThriftCloudFetchQueue) | RESOURCE_DOES_NOT_EXIST on next TFetchResults | fetchmany(80 000) succeeds across multiple TFetchResults round-trips; ~29 polls |
| Column inline (ColumnQueue, no pyarrow) | RESOURCE_DOES_NOT_EXIST on next TFetchResults | fetchmany(10) succeeds; ~29 polls |

Two consecutive 6/6 reproductions confirm the fix is stable across all three result-set paths.
CI checks (matches .github/workflows/code-quality-checks.yml)

- black --check src: clean.
- mypy --install-types --non-interactive src: clean (no new errors).
- pytest tests/unit: all heartbeat tests pass (616 total passing); 1 unrelated pre-existing failure on main (test_useragent_header, agent/claude-code User-Agent suffix from agent detection) is unaffected by this PR.
Test plan

- E2E heartbeat tests (tests/e2e/test_heartbeat.py, runs in seconds with env vars set).
- Manual long-idle check on a real warehouse: succeeds with enable_heartbeat=True, fails with enable_heartbeat=False.

This pull request and its description were written by Isaac.