
Thrift: result-set heartbeat to prevent operation-handle eviction #785

Open: sreekanth-db wants to merge 1 commit into main from thrift-result-set-heartbeat

Conversation

sreekanth-db (Contributor) commented May 12, 2026

Summary

When a client reads query results slowly — long pauses between fetchone / fetchmany / fetchall calls — the Databricks SQL warehouse considers the query idle and times out the operation server-side, releasing the operation handle even though the data has already been computed. Any subsequent TFetchResults against the now-released handle returns HTTP 404 RESOURCE_DOES_NOT_EXIST and the result set is permanently broken — the driver's retry policy classifies this error as non-retryable, so the user sees their query fail and has to re-execute from scratch.

Observed eviction window: roughly 20–30 minutes of driver idleness on the warehouse used for testing.

This PR adds a per-ThriftResultSet daemon thread that periodically issues TGetOperationStatus against the operation handle while rows are still pending on the server. The poll resets the server's inactivity timer so the operation stays alive across long idle windows. Mirrors the C# ADBC driver's DatabricksOperationStatusPoller and the equivalent JDBC work in databricks-jdbc PR #1415.

Motivation

Affects any workflow where a consumer iterates slowly over a large result set: Notebook sessions, BI tools paginating user-facing tables, dashboards that re-poll on a slow cadence — anywhere a fetch can sit idle for tens of minutes between calls. Without a keepalive, the next fetch silently 404s with no recovery path; the query must be re-executed.

Changes

New file

  • src/databricks/sql/backend/thrift_result_heartbeat_manager.py: ResultHeartbeatManager daemon thread (sketched below).
    • Default 60 s interval (configurable via heartbeat_interval_seconds).
    • Stops on terminal TOperationState (CANCELED_STATE, CLOSED_STATE, ERROR_STATE, TIMEDOUT_STATE, UKNOWN_STATE — sic, that is the Thrift IDL's own spelling). FINISHED_STATE intentionally not terminal — the handle remains alive for result streaming after execution finishes.
    • 10 consecutive failures → self-stop (matches ADBC).
    • stop(timeout=5.0) joins with a bounded timeout; logs at DEBUG if the thread is still alive after the timeout. Daemon thread, so it dies with the interpreter regardless.
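
For orientation, a minimal sketch of the manager's shape, using only stdlib threading. Everything beyond the file path and the behaviors listed above (interval, terminal states, failure threshold, bounded join) is an assumption, not the PR's literal code; real code would compare ttypes.TOperationState values rather than strings:

```python
import logging
import threading

logger = logging.getLogger(__name__)

TERMINAL_STATES = frozenset(
    {"CANCELED_STATE", "CLOSED_STATE", "ERROR_STATE", "TIMEDOUT_STATE",
     "UKNOWN_STATE"}  # (sic: spelled this way in the Thrift IDL)
)
MAX_CONSECUTIVE_FAILURES = 10  # matches the ADBC driver's threshold


class ResultHeartbeatManager:
    """Daemon thread that pings GetOperationStatus to keep an operation alive."""

    def __init__(self, poll_fn, interval_seconds=60):
        self._poll_fn = poll_fn  # e.g. backend._heartbeat_poll bound to the op handle
        self._interval = interval_seconds
        self._stop_event = threading.Event()
        self._thread = None
        self._failures = 0

    def start(self):
        if self._thread is not None:  # double-start is a no-op
            return
        self._thread = threading.Thread(
            target=self._run, daemon=True, name="result-heartbeat"
        )
        self._thread.start()

    def _run(self):
        # Wait first, so a result set that is consumed immediately never polls.
        while not self._stop_event.wait(self._interval):
            try:
                state = self._poll_fn()
                self._failures = 0  # any success resets the counter
                if state in TERMINAL_STATES:  # FINISHED_STATE keeps polling
                    return
            except Exception:
                self._failures += 1
                if self._failures >= MAX_CONSECUTIVE_FAILURES:
                    logger.debug("heartbeat: giving up after %d failures", self._failures)
                    return

    def stop(self, timeout=5.0):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join(timeout)  # bounded join; daemon dies with interpreter anyway
            if self._thread.is_alive():
                logger.debug("heartbeat thread still alive after %.1fs join", timeout)
```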

Backend helper

  • ThriftDatabricksClient._heartbeat_poll(op_handle) (thrift_backend.py): single-shot GetOperationStatus under the existing _request_lock. Bypasses make_request so a transient failure can't stall inside the driver's 15-minute retry budget — failure counting moves to the manager (10-consecutive threshold).
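
Roughly, the helper's shape — a sketch assuming the connector's usual TCLIService ttypes naming; only the method name, the lock, and the make_request bypass are stated by this PR:

```python
from databricks.sql.thrift_api.TCLIService import ttypes

def _heartbeat_poll(self, op_handle):
    """Single-shot GetOperationStatus; deliberately outside the make_request retry loop."""
    req = ttypes.TGetOperationStatusReq(
        operationHandle=op_handle,
        getProgressUpdate=False,
    )
    with self._request_lock:  # serialize with other in-flight Thrift calls
        resp = self._client.GetOperationStatus(req)
    return resp.operationState  # the manager checks this for terminal states
```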

Connection kwargs

  • enable_heartbeat: bool = True — default ON (matches ADBC). Opt out with enable_heartbeat=False.
  • heartbeat_interval_seconds: int = 60.
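
Usage looks like this (hostname, path, token, and the query are placeholders; the two heartbeat kwargs are the ones introduced by this PR):

```python
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>",
    http_path="<warehouse-http-path>",
    access_token="<token>",
    enable_heartbeat=True,          # default; shown for illustration
    heartbeat_interval_seconds=30,  # poll more often than the 60 s default
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM some_large_table")
        first_page = cursor.fetchmany(10_000)
        # ... a long pause here no longer evicts the operation handle ...
        next_page = cursor.fetchmany(10_000)
```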

ResultSet wiring (result_set.py)

  • Base ResultSet._stop_heartbeat() is a no-op; it is called as the FIRST step in close() (before close_command) to avoid racing the close RPC on the same Thrift transport (see the condensed sketch after this list).
  • ThriftResultSet.__init__ constructs and starts the manager when all of the following hold:
    • connection.enable_heartbeat is True
    • has_more_rows is True (server still has rows to deliver)
    • not already closed server-side
    • command_id.to_thrift_handle() is non-None
  • _fill_results_buffer stops the manager the instant has_more_rows flips to False — mirrors C# ADBC's end-of-results stop hook inside ReadNextRecordBatchAsync.
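
Condensed, the stop hooks look roughly like this. Method bodies are heavily abbreviated; only _stop_heartbeat, _fill_results_buffer, and the close() ordering come from this PR — the attribute names and the fetch_results signature are assumptions:

```python
class ThriftResultSetSketch:
    """Abbreviated wiring sketch; not the PR's literal class."""

    def _stop_heartbeat(self):
        if self._heartbeat is not None:
            self._heartbeat.stop()  # idempotent: safe from close() and the fetch path

    def _fill_results_buffer(self):
        results, has_more_rows = self.backend.fetch_results(self.command_id)
        if not has_more_rows:
            self._stop_heartbeat()  # server just delivered its last batch
        self.results = results

    def close(self):
        self._stop_heartbeat()  # first, so a poll can't race the close RPC
        self.backend.close_command(self.command_id)  # then close as before
```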

Telemetry — intentionally not added. ADBC writes heartbeat polls into the same n_operation_status_calls / operation_status_latency_millis fields used by the execute-time poll loop, conflating two distinct mechanisms. We don't follow that, to keep the existing "poll count" metric semantically clean. If/when heartbeat-specific observability is needed, it should land as distinct fields. The in-memory consecutive-failure counter stays — it's local control flow, not telemetry.

Eligibility / where the heartbeat does NOT run

  • Not started when has_more_rows=False at construction time (e.g. small inline result with all rows delivered in direct results — server has nothing to keep alive).
  • Not started when the server already marked the operation closed.
  • Not started when command_id is None or not a Thrift handle (defensive guard for tests and SEA).
  • Stopped immediately when _fill_results_buffer observes the server's last batch — no point pinging a handle the server has already finished with.
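
The construction-time guard, condensed into one hypothetical predicate — the PR inlines these checks in ThriftResultSet.__init__, and the function name here is invented for illustration:

```python
def _should_start_heartbeat(connection, has_more_rows, is_closed, command_id):
    if not connection.enable_heartbeat:
        return False  # user opted out
    if not has_more_rows:
        return False  # all rows already delivered (e.g. small direct results)
    if is_closed:
        return False  # server already closed the operation
    if command_id is None or command_id.to_thrift_handle() is None:
        return False  # defensive: unit tests / SEA backend
    return True
```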

How it's tested

Unit tests (16 new, all passing locally and via pytest tests/unit)

  • tests/unit/test_result_heartbeat_manager.py (8 tests): lifecycle (start, stop, idempotency, stop-before-start, double-stop, double-start no-op), terminal-state self-stop across all five terminal states, FINISHED_STATE continues polling, max-consecutive-failure stop, failure counter resets on success.
  • tests/unit/test_thrift_result_set_heartbeat.py (8 tests): wiring — manager created only when eligible, close() stops once and is idempotent, _fill_results_buffer stops manager when server signals end-of-data, manager keeps running while more rows pending.

E2E tests (3 new, gated on DATABRICKS_SERVER_HOSTNAME / DATABRICKS_HTTP_PATH / DATABRICKS_TOKEN)

  • tests/e2e/test_heartbeat.py: heartbeat polls during idle (≥2 polls in 5 s at interval=2 s), enable_heartbeat=False skips manager construction, ResultSet.close() joins the thread within timeout. Runs in ~9 s when env vars are set; auto-skips otherwise.

End-to-end on a real Databricks SQL warehouse

A 30-minute idle test was run against all three Thrift result-set dispositions, exercising the exact failure mode: pause longer than the warehouse's inactivity timeout, then attempt to fetch more rows. Without the heartbeat the server has timed out the operation; with the heartbeat the server keeps the operation alive.
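
The manual repro, condensed (table name, row counts, and the idle constant are illustrative; the shape matches the test described here):

```python
import time
from databricks import sql

IDLE_SECONDS = 30 * 60  # longer than the observed 20-30 min eviction window

with sql.connect(
    server_hostname="<host>", http_path="<path>", access_token="<token>",
    enable_heartbeat=True,  # flip to False to reproduce the 404
) as connection:
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM samples.tpch.lineitem LIMIT 1000000")
    cursor.fetchmany(20_000)         # partially consume the result set
    time.sleep(IDLE_SECONDS)         # idle past the warehouse inactivity timeout
    rows = cursor.fetchmany(20_000)  # heartbeat on: succeeds; off: 404
```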

| Disposition | enable_heartbeat=False (without fix) | enable_heartbeat=True (with fix) |
| --- | --- | --- |
| Arrow inline (ArrowQueue) | server times out the operation → 404 RESOURCE_DOES_NOT_EXIST on next TFetchResults | post-sleep fetchmany(20 000) succeeds; ~29 polls fire during the idle window and keep the operation alive |
| Cloud-fetch (ThriftCloudFetchQueue) | server times out the operation → 404 RESOURCE_DOES_NOT_EXIST on next TFetchResults | post-sleep fetchmany(80 000) succeeds across multiple TFetchResults round-trips; ~29 polls |
| Column inline (ColumnQueue, no pyarrow) | server times out the operation → 404 RESOURCE_DOES_NOT_EXIST on next TFetchResults | post-sleep fetchmany(10) succeeds; ~29 polls |

Two consecutive 6/6 reproductions confirm the fix is stable across all three result-set paths.

CI checks (matches .github/workflows/code-quality-checks.yml)

  • black --check src — clean.
  • mypy --install-types --non-interactive src — clean (no new errors).
  • pytest tests/unit — all heartbeat tests pass (616 total passing); 1 unrelated pre-existing failure on main (test_useragent_header, agent/claude-code User-Agent suffix from agent detection) is unaffected by this PR.

Test plan

  • CI matrix passes on Python 3.9–3.14 with and without pyarrow extras (unit tests).
  • E2E heartbeat tests pass against a real warehouse (tests/e2e/test_heartbeat.py, runs in seconds with env vars set).
  • Long-duration (30 min idle) repro: succeeds with enable_heartbeat=True, fails with enable_heartbeat=False.

This pull request and its description were written by Isaac.

The Databricks SQL warehouse reaps a query's operation handle after
roughly 20-30 minutes of driver idleness. Once that happens, any
subsequent TFetchResults against the handle returns HTTP 404 /
RESOURCE_DOES_NOT_EXIST and the result set is permanently broken --
the driver's retry policy classifies the error as non-retryable, so
the user sees their query fail even though the data has already been
computed.

This affects any workflow where a consumer iterates slowly over a
large result set: Notebook sessions, BI tools paginating user-facing
tables, anywhere a fetch can sit idle for tens of minutes between
calls.

Adds a per-ThriftResultSet daemon thread that periodically calls
GetOperationStatus while rows are still pending on the server,
mirroring the C# ADBC driver's DatabricksOperationStatusPoller and
the JDBC work in databricks-jdbc PR #1415.

- New: ResultHeartbeatManager (backend/thrift_result_heartbeat_manager.py).
  Daemon thread, 60s default interval, 10-consecutive-failure stop,
  terminal-state self-stop. Bypasses make_request retry budget so a
  transient failure can't stall inside a single poll.
- New: ThriftDatabricksClient._heartbeat_poll helper (acquires the
  existing _request_lock; bypasses make_request).
- New: Connection kwargs enable_heartbeat (default True) and
  heartbeat_interval_seconds (default 60).
- Wired into ThriftResultSet: started in __init__ when the server
  still has rows to deliver; stopped (a) the moment
  _fill_results_buffer sees has_more_rows flip to False,
  (b) ResultSet.close(), (c) transitively from
  Cursor.close()/Connection.close().
- 16 new unit tests (8 manager + 8 wiring), 3 new env-var-gated e2e
  tests.

Verified end-to-end on a real warehouse against all three Thrift
result dispositions (Arrow inline, cloud-fetch, pre-Arrow Column
inline): heartbeat=False + 30-min idle reliably reproduces the 404;
heartbeat=True (default) succeeds with ~29 GetOperationStatus polls
during the idle window.

Co-authored-by: Isaac
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>