Cursor.description collapses TIMESTAMP_NTZ to 'timestamp' on the SELECT path #786

@rshura

Description

Cursor.description reports TIMESTAMP_NTZ columns as 'timestamp' — the _NTZ distinction is lost on the SELECT path

Summary

For any SELECT-style query, cursor.description returns the same type_code ('timestamp') for columns/expressions of Spark type TIMESTAMP and Spark type TIMESTAMP_NTZ. The _NTZ distinction is erased before it reaches the DB-API caller, so downstream code that relies on cursor.description to decide whether a column is timezone-aware cannot tell the two apart.

This is the cursor-description analog of the SQLAlchemy reflection bug fixed in #295 / #296. That fix landed only on the DESCRIBE-driven reflection path; the SELECT path was not touched.

It is also structurally identical to the still-open #336 ("Cursor().description reports NULL (VOID) and INTERVAL fields as 'string'") — same root cause, different victim type.

Reproduction

from databricks import sql

with sql.connect(
    server_hostname=...,
    http_path=...,
    access_token=...,
) as conn:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT "
            "  CAST('2024-10-07 12:00:00' AS TIMESTAMP)     AS tz_aware, "
            "  CAST('2024-10-07 12:00:00' AS TIMESTAMP_NTZ) AS tz_naive"
        )
        for col in cur.description:
            print(col[0], "->", col[1])

Observed output (against a Databricks SQL Warehouse, connector 4.2.6):

tz_aware -> timestamp
tz_naive -> timestamp

The same collapse happens for SELECT <ntz_column> FROM <table> where the underlying column is genuinely TIMESTAMP_NTZ (verified via DESCRIBE), and for explicit CAST(... AS TIMESTAMP_NTZ) over any expression.
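
For the table-backed case, a minimal sketch (the table and column names here are placeholders, not taken from the connector's test suite):

cur.execute(
    "CREATE TABLE IF NOT EXISTS tmp_ntz_repro (ts_col TIMESTAMP, ntz_col TIMESTAMP_NTZ)"
)
cur.execute("SELECT ts_col, ntz_col FROM tmp_ntz_repro LIMIT 0")
print([(c[0], c[1]) for c in cur.description])
# observed: [('ts_col', 'timestamp'), ('ntz_col', 'timestamp')]
# DESCRIBE tmp_ntz_repro, by contrast, reports ntz_col as 'timestamp_ntz'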

The repo's own unit-test fixture pins this behavior:
tests/unit/test_util.py

("timestamp_column",     "timestamp", None, None, None, None, None),
("timestamp_ntz_column", "timestamp", None, None, None, None, None),

Expected

cursor.description[i][1] should be 'timestamp_ntz' for genuine TIMESTAMP_NTZ columns/expressions, and 'timestamp' for TIMESTAMP. This matches the string returned by DESCRIBE on the same column and the type names already used by the SQLAlchemy dialect's reflection map (where "timestamp_ntz" is a valid key — see databricks-sqlalchemy/src/databricks/sqlalchemy/_parse.py).
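
With that behavior, the reproduction above would print:

tz_aware -> timestamp
tz_naive -> timestamp_ntz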

Root cause

In src/databricks/sql/backend/thrift_backend.py, _col_to_description derives the type_code from the Thrift TTypeId enum name:

# thrift_backend.py, _col_to_description
name = ttypes.TTypeId._VALUES_TO_NAMES[type_entry.primitiveEntry.type]
# Drop _TYPE suffix
cleaned_type = (name[:-5] if name.endswith("_TYPE") else name).lower()

The Thrift TTypeId enum has a single TIMESTAMP_TYPE; there is no TIMESTAMP_NTZ_TYPE. Both Spark TIMESTAMP and Spark TIMESTAMP_NTZ columns arrive over the wire with primitiveEntry.type = TIMESTAMP_TYPE, so after the suffix-strip + lowercase both end up as 'timestamp'.
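
A standalone two-liner that mimics the suffix-strip (just the same string manipulation, not the connector's actual code path) makes the collapse obvious:

name = "TIMESTAMP_TYPE"  # TTypeId name on the wire for both TIMESTAMP and TIMESTAMP_NTZ columns
print((name[:-5] if name.endswith("_TYPE") else name).lower())  # prints 'timestamp'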

The same function already contains the mechanism needed to recover the lost distinction. Further down in _col_to_description:

# Extract variant type from field if available
if field is not None:
    try:
        if field.metadata and b"Spark:DataType:SqlName" in field.metadata:
            sql_type = field.metadata.get(b"Spark:DataType:SqlName")
            if sql_type == b"VARIANT":
                cleaned_type = "variant"
    except Exception as e:
        logger.debug(f"Could not extract variant type from field: {e}")

The Arrow field metadata key Spark:DataType:SqlName carries the true Spark SQL type name, including the TIMESTAMP_NTZ distinction. PR #560 used this hook to recover the VARIANT type that was being collapsed to 'string'. The same hook can recover TIMESTAMP_NTZ collapsed to 'timestamp'.

Proposed fix

Extend the existing Spark:DataType:SqlName override in _col_to_description to also handle TIMESTAMP_NTZ:

if field is not None:
    try:
        if field.metadata and b"Spark:DataType:SqlName" in field.metadata:
            sql_type = field.metadata.get(b"Spark:DataType:SqlName")
            if sql_type == b"VARIANT":
                cleaned_type = "variant"
            elif sql_type == b"TIMESTAMP_NTZ":
                cleaned_type = "timestamp_ntz"
    except Exception as e:
        logger.debug(f"Could not extract type from field metadata: {e}")

The same shape would cleanly extend to the cases in #336 (VOID, INTERVAL) if their Spark:DataType:SqlName metadata is populated by DBR — worth verifying as part of this fix.
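
If more overrides accumulate, one option is to replace the if/elif chain with a lookup table. This is only a sketch; the VOID/INTERVAL entries are commented out because their metadata values on the wire are unverified:

# Hypothetical mapping; only VARIANT and TIMESTAMP_NTZ are confirmed by observation.
_SQL_NAME_OVERRIDES = {
    b"VARIANT": "variant",
    b"TIMESTAMP_NTZ": "timestamp_ntz",
    # b"VOID": "void",          # candidate for #336, if DBR populates the metadata
    # b"INTERVAL": "interval",  # likewise; exact metadata value unverified
}

if field is not None and field.metadata:
    sql_type = field.metadata.get(b"Spark:DataType:SqlName")
    if sql_type in _SQL_NAME_OVERRIDES:
        cleaned_type = _SQL_NAME_OVERRIDES[sql_type]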

The corresponding unit-test fixture in tests/unit/test_util.py would need to be updated to assert "timestamp_ntz" (rather than "timestamp") for the timestamp_ntz_column row.
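
i.e. the fixture rows would become:

("timestamp_column",     "timestamp",     None, None, None, None, None),
("timestamp_ntz_column", "timestamp_ntz", None, None, None, None, None),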

Impact

Any consumer that uses cursor.description to classify columns as timezone-aware vs timezone-naive on Databricks must currently treat every timestamp column as tz-aware, or fall back to a separate DESCRIBE round trip to recover the _NTZ suffix. This is a real-world blocker for tools building on the connector — for example, any validation that requires "this time axis is tz-naive" cannot be expressed against a Databricks TIMESTAMP_NTZ column based on connector output alone.
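
For reference, the DESCRIBE workaround looks roughly like the sketch below (the helper name is made up, the row parsing is naive, and it ignores the partition-information section DESCRIBE can append):

def spark_type_of(cur, table: str, column: str) -> str:
    """Look up the real Spark SQL type name for one column via a DESCRIBE round trip."""
    cur.execute(f"DESCRIBE TABLE {table}")
    for row in cur.fetchall():
        col_name, data_type = row[0], row[1]
        if col_name == column:
            return data_type  # 'timestamp_ntz' for NTZ columns
    raise KeyError(column)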

Bumping the connector pin does not help: I verified that 4.2.6 (the latest release at the time of writing) still collapses the type. The TIMESTAMP_NTZ-related CHANGELOG entries since the initial fix concern parameter binding (TimestampNTZParameter) and SQLAlchemy reflection; neither touches the cursor.description path.

Related

  • #295 / #296: SQLAlchemy reflection fix for the same TIMESTAMP_NTZ collapse (DESCRIBE path only)
  • #336: Cursor().description reports NULL (VOID) and INTERVAL fields as 'string' (same root cause)
  • PR #560: recovered VARIANT via the Spark:DataType:SqlName field metadata

Environment

  • databricks-sql-connector 4.2.6
  • Backend: Databricks SQL Warehouse
  • Behavior is independent of the Python or OS version.
