
Thrift: result-set heartbeat to prevent operation-handle eviction #785

Open: sreekanth-db wants to merge 1 commit into main from thrift-result-set-heartbeat

Conversation

sreekanth-db (Contributor) commented May 12, 2026

Summary

When a client reads query results slowly — long pauses between fetchone / fetchmany / fetchall calls — the Databricks SQL warehouse considers the query idle and times out the operation server-side, releasing the operation handle even though the data has already been computed. Any subsequent TFetchResults against the now-released handle returns HTTP 404 RESOURCE_DOES_NOT_EXIST and the result set is permanently broken — the driver's retry policy classifies this error as non-retryable, so the user sees their query fail and has to re-execute from scratch.

Observed eviction window: roughly 20–30 minutes of driver idleness on the warehouse used for testing.

This PR adds a per-ThriftResultSet daemon thread that periodically issues TGetOperationStatus against the operation handle while rows are still pending on the server. The poll resets the server's inactivity timer so the operation stays alive across long idle windows. Mirrors the C# ADBC driver's DatabricksOperationStatusPoller and the equivalent JDBC work in databricks-jdbc PR #1415.

Motivation

Affects any workflow where a consumer iterates slowly over a large result set: Notebook sessions, BI tools paginating user-facing tables, dashboards that re-poll on a slow cadence — anywhere a fetch can sit idle for tens of minutes between calls. Without a keepalive, the next fetch silently 404s with no recovery path; the query must be re-executed.

Changes

New file

  • src/databricks/sql/backend/thrift_result_heartbeat_manager.py: ResultHeartbeatManager daemon thread (sketched below).
    • Default 60 s interval (configurable via heartbeat_interval_seconds).
    • Stops on terminal TOperationState (CANCELED_STATE, CLOSED_STATE, ERROR_STATE, TIMEDOUT_STATE, UKNOWN_STATE — sic, that is the Thrift IDL's own spelling). FINISHED_STATE intentionally not terminal — the handle remains alive for result streaming after execution finishes.
    • 10 consecutive failures → self-stop (matches ADBC).
    • stop(timeout=5.0) joins with a bounded timeout; logs at DEBUG if the thread is still alive after the timeout. Daemon thread, so it dies with the interpreter regardless.
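
For orientation, a minimal sketch of the manager's shape, using only stdlib threading. Everything beyond the file path and the behaviors listed above (interval, terminal states, failure threshold, bounded join) is an assumption, not the PR's literal code; real code would compare ttypes.TOperationState values rather than strings:

```python
import logging
import threading

logger = logging.getLogger(__name__)

TERMINAL_STATES = frozenset(
    {"CANCELED_STATE", "CLOSED_STATE", "ERROR_STATE", "TIMEDOUT_STATE",
     "UKNOWN_STATE"}  # (sic: spelled this way in the Thrift IDL)
)
MAX_CONSECUTIVE_FAILURES = 10  # matches the ADBC driver's threshold


class ResultHeartbeatManager:
    """Daemon thread that pings GetOperationStatus to keep an operation alive."""

    def __init__(self, poll_fn, interval_seconds=60):
        self._poll_fn = poll_fn  # e.g. backend._heartbeat_poll bound to the op handle
        self._interval = interval_seconds
        self._stop_event = threading.Event()
        self._thread = None
        self._failures = 0

    def start(self):
        if self._thread is not None:  # double-start is a no-op
            return
        self._thread = threading.Thread(
            target=self._run, daemon=True, name="result-heartbeat"
        )
        self._thread.start()

    def _run(self):
        # Wait first, so a result set that is consumed immediately never polls.
        while not self._stop_event.wait(self._interval):
            try:
                state = self._poll_fn()
                self._failures = 0  # any success resets the counter
                if state in TERMINAL_STATES:  # FINISHED_STATE keeps polling
                    return
            except Exception:
                self._failures += 1
                if self._failures >= MAX_CONSECUTIVE_FAILURES:
                    logger.debug("heartbeat: giving up after %d failures", self._failures)
                    return

    def stop(self, timeout=5.0):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join(timeout)  # bounded join; daemon dies with interpreter anyway
            if self._thread.is_alive():
                logger.debug("heartbeat thread still alive after %.1fs join", timeout)
```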

Backend helper

  • ThriftDatabricksClient._heartbeat_poll(op_handle) (thrift_backend.py): single-shot GetOperationStatus under the existing _request_lock. Bypasses make_request so a transient failure can't stall inside the driver's 15-minute retry budget — failure counting moves to the manager (10-consecutive threshold).
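
Roughly, the helper's shape — a sketch assuming the connector's usual TCLIService ttypes naming; only the method name, the lock, and the make_request bypass are stated by this PR:

```python
from databricks.sql.thrift_api.TCLIService import ttypes

def _heartbeat_poll(self, op_handle):
    """Single-shot GetOperationStatus; deliberately outside the make_request retry loop."""
    req = ttypes.TGetOperationStatusReq(
        operationHandle=op_handle,
        getProgressUpdate=False,
    )
    with self._request_lock:  # serialize with other in-flight Thrift calls
        resp = self._client.GetOperationStatus(req)
    return resp.operationState  # the manager checks this for terminal states
```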

Connection kwargs

  • enable_heartbeat: bool = True — default ON (matches ADBC). Opt out with enable_heartbeat=False.
  • heartbeat_interval_seconds: int = 60.
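
Usage looks like this (hostname, path, token, and the query are placeholders; the two heartbeat kwargs are the ones introduced by this PR):

```python
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>",
    http_path="<warehouse-http-path>",
    access_token="<token>",
    enable_heartbeat=True,          # default; shown for illustration
    heartbeat_interval_seconds=30,  # poll more often than the 60 s default
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM some_large_table")
        first_page = cursor.fetchmany(10_000)
        # ... a long pause here no longer evicts the operation handle ...
        next_page = cursor.fetchmany(10_000)
```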

ResultSet wiring (result_set.py)

  • Base ResultSet._stop_heartbeat() is a no-op; it is called as the FIRST step in close() (before close_command) to avoid racing the close RPC on the same Thrift transport (see the condensed sketch after this list).
  • ThriftResultSet.__init__ constructs and starts the manager when all of the following hold:
    • connection.enable_heartbeat is True
    • has_more_rows is True (server still has rows to deliver)
    • not already closed server-side
    • command_id.to_thrift_handle() is non-None
  • _fill_results_buffer stops the manager the instant has_more_rows flips to False — mirrors C# ADBC's end-of-results stop hook inside ReadNextRecordBatchAsync.
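
Condensed, the stop hooks look roughly like this. Method bodies are heavily abbreviated; only _stop_heartbeat, _fill_results_buffer, and the close() ordering come from this PR — the attribute names and the fetch_results signature are assumptions:

```python
class ThriftResultSetSketch:
    """Abbreviated wiring sketch; not the PR's literal class."""

    def _stop_heartbeat(self):
        if self._heartbeat is not None:
            self._heartbeat.stop()  # idempotent: safe from close() and the fetch path

    def _fill_results_buffer(self):
        results, has_more_rows = self.backend.fetch_results(self.command_id)
        if not has_more_rows:
            self._stop_heartbeat()  # server just delivered its last batch
        self.results = results

    def close(self):
        self._stop_heartbeat()  # first, so a poll can't race the close RPC
        self.backend.close_command(self.command_id)  # then close as before
```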

Telemetry — intentionally not added. ADBC writes heartbeat polls into the same n_operation_status_calls / operation_status_latency_millis fields used by the execute-time poll loop, conflating two distinct mechanisms. We don't follow that, to keep the existing "poll count" metric semantically clean. If/when heartbeat-specific observability is needed, it should land as distinct fields. The in-memory consecutive-failure counter stays — it's local control flow, not telemetry.

Eligibility / where the heartbeat does NOT run

  • Not started when has_more_rows=False at construction time (e.g. small inline result with all rows delivered in direct results — server has nothing to keep alive).
  • Not started when the server already marked the operation closed.
  • Not started when command_id is None or not a Thrift handle (defensive guard for tests and SEA).
  • Stopped immediately when _fill_results_buffer observes the server's last batch — no point pinging a handle the server has already finished with.
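
The construction-time guard, condensed into one hypothetical predicate — the PR inlines these checks in ThriftResultSet.__init__, and the function name here is invented for illustration:

```python
def _should_start_heartbeat(connection, has_more_rows, is_closed, command_id):
    if not connection.enable_heartbeat:
        return False  # user opted out
    if not has_more_rows:
        return False  # all rows already delivered (e.g. small direct results)
    if is_closed:
        return False  # server already closed the operation
    if command_id is None or command_id.to_thrift_handle() is None:
        return False  # defensive: unit tests / SEA backend
    return True
```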

How it's tested

Unit tests (16 new, all passing locally and via pytest tests/unit)

  • tests/unit/test_result_heartbeat_manager.py (8 tests): lifecycle (start, stop, idempotency, stop-before-start, double-stop, double-start no-op), terminal-state self-stop across all five terminal states, FINISHED_STATE continues polling, max-consecutive-failure stop, failure counter resets on success.
  • tests/unit/test_thrift_result_set_heartbeat.py (8 tests): wiring — manager created only when eligible, close() stops once and is idempotent, _fill_results_buffer stops manager when server signals end-of-data, manager keeps running while more rows pending.

E2E tests (3 new, gated on DATABRICKS_SERVER_HOSTNAME / DATABRICKS_HTTP_PATH / DATABRICKS_TOKEN)

  • tests/e2e/test_heartbeat.py: heartbeat polls during idle (≥2 polls in 5 s at interval=2 s), enable_heartbeat=False skips manager construction, ResultSet.close() joins the thread within timeout. Runs in ~9 s when env vars are set; auto-skips otherwise.

End-to-end on a real Databricks SQL warehouse

A 30-minute idle test was run against all three Thrift result-set dispositions, exercising the exact failure mode: pause longer than the warehouse's inactivity timeout, then attempt to fetch more rows. Without the heartbeat the server has timed out the operation; with the heartbeat the server keeps the operation alive.
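
The manual repro, condensed (table name, row counts, and the idle constant are illustrative; the shape matches the test described here):

```python
import time
from databricks import sql

IDLE_SECONDS = 30 * 60  # longer than the observed 20-30 min eviction window

with sql.connect(
    server_hostname="<host>", http_path="<path>", access_token="<token>",
    enable_heartbeat=True,  # flip to False to reproduce the 404
) as connection:
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM samples.tpch.lineitem LIMIT 1000000")
    cursor.fetchmany(20_000)         # partially consume the result set
    time.sleep(IDLE_SECONDS)         # idle past the warehouse inactivity timeout
    rows = cursor.fetchmany(20_000)  # heartbeat on: succeeds; off: 404
```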

| Disposition | enable_heartbeat=False (without fix) | enable_heartbeat=True (with fix) |
| --- | --- | --- |
| Arrow inline (ArrowQueue) | server times out the operation → 404 RESOURCE_DOES_NOT_EXIST on next TFetchResults | post-sleep fetchmany(20 000) succeeds; ~29 polls fire during the idle window and keep the operation alive |
| Cloud-fetch (ThriftCloudFetchQueue) | server times out the operation → 404 RESOURCE_DOES_NOT_EXIST on next TFetchResults | post-sleep fetchmany(80 000) succeeds across multiple TFetchResults round-trips; ~29 polls |
| Column inline (ColumnQueue, no pyarrow) | server times out the operation → 404 RESOURCE_DOES_NOT_EXIST on next TFetchResults | post-sleep fetchmany(10) succeeds; ~29 polls |

Two consecutive 6/6 reproductions confirm the fix is stable across all three result-set paths.

CI checks (matches .github/workflows/code-quality-checks.yml)

  • black --check src — clean.
  • mypy --install-types --non-interactive src — clean (no new errors).
  • pytest tests/unit — all heartbeat tests pass (616 total passing); 1 unrelated pre-existing failure on main (test_useragent_header, agent/claude-code User-Agent suffix from agent detection) is unaffected by this PR.

Test plan

  • CI matrix passes on Python 3.9–3.14 with and without pyarrow extras (unit tests).
  • E2E heartbeat tests pass against a real warehouse (tests/e2e/test_heartbeat.py, runs in seconds with env vars set).
  • Long-duration (30 min idle) repro: succeeds with enable_heartbeat=True, fails with enable_heartbeat=False.

This pull request and its description were written by Isaac.

The Databricks SQL warehouse reaps a query's operation handle after
roughly 20-30 minutes of driver idleness. Once that happens, any
subsequent TFetchResults against the handle returns HTTP 404 /
RESOURCE_DOES_NOT_EXIST and the result set is permanently broken --
the driver's retry policy classifies the error as non-retryable, so
the user sees their query fail even though the data has already been
computed.

This affects any workflow where a consumer iterates slowly over a
large result set: Notebook sessions, BI tools paginating user-facing
tables, anywhere a fetch can sit idle for tens of minutes between
calls.

Adds a per-ThriftResultSet daemon thread that periodically calls
GetOperationStatus while rows are still pending on the server,
mirroring the C# ADBC driver's DatabricksOperationStatusPoller and
the JDBC work in databricks-jdbc PR #1415.

- New: ResultHeartbeatManager (backend/thrift_result_heartbeat_manager.py).
  Daemon thread, 60s default interval, 10-consecutive-failure stop,
  terminal-state self-stop. Bypasses make_request retry budget so a
  transient failure can't stall inside a single poll.
- New: ThriftDatabricksClient._heartbeat_poll helper (acquires the
  existing _request_lock; bypasses make_request).
- New: Connection kwargs enable_heartbeat (default True) and
  heartbeat_interval_seconds (default 60).
- Wired into ThriftResultSet: started in __init__ when the server
  still has rows to deliver; stopped (a) the moment
  _fill_results_buffer sees has_more_rows flip to False,
  (b) ResultSet.close(), (c) transitively from
  Cursor.close()/Connection.close().
- 16 new unit tests (8 manager + 8 wiring), 3 new env-var-gated e2e
  tests.

Verified end-to-end on a real warehouse against all three Thrift
result dispositions (Arrow inline, cloud-fetch, pre-Arrow Column
inline): heartbeat=False + 30-min idle reliably reproduces the 404;
heartbeat=True (default) succeeds with ~29 GetOperationStatus polls
during the idle window.

Co-authored-by: Isaac
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>