Skip to content

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#495

Open
subrata-ms wants to merge 13 commits intomainfrom
subrata-ms/CP1252Encoding
Open

FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468#495
subrata-ms wants to merge 13 commits intomainfrom
subrata-ms/CP1252Encoding

Conversation

@subrata-ms
Copy link
Copy Markdown
Contributor

@subrata-ms subrata-ms commented Apr 3, 2026

Work Item / Issue Reference

AB#43177

GitHub Issue: #468


Summary

This pull request updates the default handling of SQL CHAR/VARCHAR columns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.

Default Encoding and Decoding Changes:

  • The default decoding for SQL CHAR columns is now set to use "utf-16le" encoding and the SQL_WCHAR ctype, replacing the previous "utf-8"/SQL_CHAR defaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (mssql_python/connection.py, mssql_python/connection.pyR264-R271)

  • All cursor fetch methods (fetchone, fetchmany, fetchall) are updated to request UTF-16 decoding and pass the correct ctype when fetching CHAR data, ensuring consistent behavior across platforms. (mssql_python/cursor.py, [1] [2] [3]

C++ Binding and Processing Updates:

  • The ColumnInfoExt struct now tracks whether wide character (UTF-16) fetching is used for a column, and the ProcessChar function is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (mssql_python/pybind/ddbc_bindings.h, [1] [2] [3]

Test Adjustments:

  • Tests are updated to expect "utf-16le" and SQL_WCHAR as the default decoding settings for SQL_CHAR columns, and to validate the new default behavior. (tests/test_013_encoding_decoding.py, [1] [2] [3]

@github-actions github-actions Bot added the pr-size: large Substantial code update label Apr 3, 2026
Comment thread mssql_python/pybind/ddbc_bindings.cpp Dismissed
Comment thread mssql_python/pybind/ddbc_bindings.cpp Dismissed
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 3, 2026

📊 Code Coverage Report

🔥 Diff Coverage

79%


🎯 Overall Coverage

79%


📈 Total Lines Covered: 6902 out of 8720
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/connection.py (100%)
  • mssql_python/pybind/ddbc_bindings.cpp (82.8%): Missing lines 2066,3416-3421,3432-3443,3449-3454,3521-3522,3524-3525,3538-3541,3606-3607,3609-3610,3623-3626,4999-5001,5259-5260,5310-5312,5613-5614
  • mssql_python/pybind/ddbc_bindings.h (50.0%): Missing lines 844-853,859-863

Summary

  • Total: 328 lines
  • Missing: 66 lines
  • Coverage: 79%

mssql_python/pybind/ddbc_bindings.cpp

Lines 2062-2070

  2062                         size_t chunkBytes = DAE_CHUNK_SIZE;
  2063                         while (offset < totalBytes) {
  2064                             size_t len = std::min(chunkBytes, totalBytes - offset);
  2065 
! 2066                             rc = putData((SQLPOINTER)(dataPtr + offset), static_cast<SQLLEN>(len));
  2067                             if (!SQL_SUCCEEDED(rc)) {
  2068                                 LOG("SQLExecute: SQLPutData failed for "
  2069                                     "SQL_C_CHAR chunk - offset=%zu",
  2070                                     offset, totalBytes, len, rc);

Lines 3412-3425

  3412                                     "length=%lu",
  3413                                     i, (unsigned long)numCharsInData);
  3414                             } else {
  3415                                 // Buffer too small, fallback to streaming
! 3416                                 LOG("SQLGetData: CHAR column %d (WCHAR path) data "
! 3417                                     "truncated, using streaming LOB",
! 3418                                     i);
! 3419                                 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3420                                                               "utf-16le"));
! 3421                             }
  3422                         } else if (dataLen == SQL_NULL_DATA) {
  3423                             LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
  3424                             row.append(py::none());
  3425                         } else if (dataLen == 0) {

Lines 3428-3447

  3428                             // Driver cannot report total length up front; this is
  3429                             // NOT a NULL value. Fall back to streaming via
  3430                             // FetchLobColumnData (repeated SQLGetData chunks) so
  3431                             // we don't silently lose data.
! 3432                             LOG("SQLGetData: SQL_NO_TOTAL for column %d (CHAR via WCHAR), "
! 3433                                 "streaming via FetchLobColumnData",
! 3434                                 i);
! 3435                             row.append(
! 3436                                 FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, "utf-16le"));
! 3437                         } else if (dataLen < 0) {
! 3438                             LOG("SQLGetData: Unexpected negative data length "
! 3439                                 "for column %d - dataType=%d, dataLen=%ld",
! 3440                                 i, dataType, (long)dataLen);
! 3441                             ThrowStdException("SQLGetData returned an unexpected negative "
! 3442                                               "data length");
! 3443                         }
  3444                     } else {
  3445                         // Surface driver errors instead of silently returning NULL.
  3446                         // Returning py::none() here would be indistinguishable from
  3447                         // a genuine SQL NULL value to the Python caller and is a

Lines 3445-3458

  3445                         // Surface driver errors instead of silently returning NULL.
  3446                         // Returning py::none() here would be indistinguishable from
  3447                         // a genuine SQL NULL value to the Python caller and is a
  3448                         // data-integrity risk.
! 3449                         LOG_ERROR("SQLGetData: Error retrieving data for column %d "
! 3450                                   "(CHAR via WCHAR) - SQLRETURN=%d",
! 3451                                   i, ret);
! 3452                         ThrowStdException("SQLGetData failed for CHAR/VARCHAR column "
! 3453                                           "fetched as SQL_C_WCHAR");
! 3454                     }
  3455                 } else {
  3456                     // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
  3457                     //
  3458                     // Why this differs from SQLBindColums / FetchBatchData:

Lines 3517-3529

  3517                             // Driver cannot report total length up front; this is
  3518                             // NOT a NULL value. Fall back to streaming via
  3519                             // FetchLobColumnData (repeated SQLGetData chunks) so
  3520                             // we don't silently lose data.
! 3521                             LOG("SQLGetData: SQL_NO_TOTAL for column %d (SQL_CHAR), "
! 3522                                 "streaming via FetchLobColumnData",
  3523                                 i);
! 3524                             row.append(FetchLobColumnData(hStmt, i, SQL_C_CHAR, false, false,
! 3525                                                           effectiveCharEnc));
  3526                         } else if (dataLen < 0) {
  3527                             LOG("SQLGetData: Unexpected negative data length "
  3528                                 "for column %d - dataType=%d, dataLen=%ld",
  3529                                 i, dataType, (long)dataLen);

Lines 3534-3545

  3534                         // Surface driver errors instead of silently returning NULL.
  3535                         // Returning py::none() here would be indistinguishable from
  3536                         // a genuine SQL NULL value to the Python caller and is a
  3537                         // data-integrity risk.
! 3538                         LOG_ERROR("SQLGetData: Error retrieving data for column %d "
! 3539                                   "(SQL_CHAR) - SQLRETURN=%d",
! 3540                                   i, ret);
! 3541                         ThrowStdException("SQLGetData failed for SQL_CHAR/VARCHAR column");
  3542                     }
  3543                 }
  3544                 break;
  3545             }

Lines 3602-3614

  3602                             // Driver cannot report total length up front; this is
  3603                             // NOT a NULL value. Fall back to streaming via
  3604                             // FetchLobColumnData (repeated SQLGetData chunks) so
  3605                             // we don't silently lose data.
! 3606                             LOG("SQLGetData: SQL_NO_TOTAL for column %d (NVARCHAR), "
! 3607                                 "streaming via FetchLobColumnData",
  3608                                 i);
! 3609                             row.append(
! 3610                                 FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false, "utf-16le"));
  3611                         } else if (dataLen < 0) {
  3612                             LOG("SQLGetData: Unexpected negative data length "
  3613                                 "for column %d (NVARCHAR) - dataLen=%ld",
  3614                                 i, (long)dataLen);

Lines 3619-3630

  3619                         // Surface driver errors instead of silently returning NULL.
  3620                         // Returning py::none() here would be indistinguishable from
  3621                         // a genuine SQL NULL value to the Python caller and is a
  3622                         // data-integrity risk.
! 3623                         LOG_ERROR("SQLGetData: Error retrieving data for column %d "
! 3624                                   "(NVARCHAR) - SQLRETURN=%d",
! 3625                                   i, ret);
! 3626                         ThrowStdException("SQLGetData failed for NVARCHAR column");
  3627                     }
  3628                 }
  3629                 break;
  3630             }

Lines 4995-5005

  4995                 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
  4996                 break;
  4997             default:
  4998                 std::ostringstream errorString;
! 4999                 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 5000                             << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 5001                             << (i + 1);
  5002                 LOG(errorString.str().c_str());
  5003                 ThrowStdException(errorString.str());
  5004                 break;
  5005         }

Lines 5255-5264

  5255                                                  buffers.datetimeoffsetBuffers[idxCol].data(),
  5256                                                  sizeof(DateTimeOffset),
  5257                                                  buffers.indicators[idxCol].data());
  5258                             if (!SQL_SUCCEEDED(ret)) {
! 5259                                 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5260                                     idxCol + 1);
  5261                                 return ret;
  5262                             }
  5263                             break;
  5264                         }

Lines 5306-5316

  5306                     nullCounts[idxCol] += 1;
  5307                     continue;
  5308                 } else if (indicator < 0) {
  5309                     // Negative value is unexpected, log column index, SQL type & raise exception
! 5310                     LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5311                         "Length - %lld",
! 5312                         idxCol + 1, dataType, (long long)indicator);
  5313                     ThrowStdException("Unexpected negative data length.");
  5314                 }
  5315                 auto dataLen = static_cast<uint64_t>(indicator);

Lines 5609-5618

  5609         arrowSchemaBatchCapsule =
  5610             py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
  5611                 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
  5612                 if (arrowSchema->release) {
! 5613                     arrowSchema->release(arrowSchema);
! 5614                 }
  5615                 delete arrowSchema;
  5616             });
  5617     } catch (...) {
  5618         arrowSchemaBatch->release(arrowSchemaBatch.get());

mssql_python/pybind/ddbc_bindings.h

  840                 // Decode failed — fall back to returning the raw UTF-16LE bytes
  841                 // (consistent with the narrow-char path below and with
  842                 // FetchLobColumnData / SQLGetData_wrap which return raw bytes on
  843                 // decode failure instead of silently dropping data as None).
! 844                 PyErr_Clear();
! 845                 PyObject* pyBytes = PyBytes_FromStringAndSize(
! 846                     reinterpret_cast<const char*>(wcharData), numCharsInData * sizeof(SQLWCHAR));
! 847                 if (pyBytes) {
! 848                     PyList_SET_ITEM(row, col - 1, pyBytes);
! 849                 } else {
! 850                     PyErr_Clear();
! 851                     Py_INCREF(Py_None);
! 852                     PyList_SET_ITEM(row, col - 1, Py_None);
! 853                 }
  854             } else {
  855                 PyList_SET_ITEM(row, col - 1, pyStr);
  856             }
  857         } else {

  855                 PyList_SET_ITEM(row, col - 1, pyStr);
  856             }
  857         } else {
  858             // LOB / truncated: stream with SQL_C_WCHAR
! 859             PyList_SET_ITEM(row, col - 1,
! 860                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 861                                 .release()
! 862                                 .ptr());
! 863         }
  864         return;
  865     }
  866 
  867     // Original narrow-char path (charCtype == SQL_C_CHAR)


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 66.5%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.3%
mssql_python.pybind.connection.connection.cpp: 76.2%
mssql_python.__init__.py: 77.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

@subrata-ms subrata-ms marked this pull request as ready for review April 3, 2026 08:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes the default decoding path for SQL CHAR/VARCHAR (ODBC SQL_CHAR) columns to fetch as wide characters (SQL_C_WCHAR) and decode as UTF-16LE, eliminating Windows-vs-Linux inconsistencies when server code pages contain bytes that are invalid UTF-8.

Changes:

  • Update connection defaults so SQL_CHAR decoding uses encoding="utf-16le" with ctype=SQL_WCHAR.
  • Plumb charCtype through cursor fetch APIs into the C++ bindings, enabling wide-char fetch for SQL_CHAR columns.
  • Expand/adjust encoding tests to validate new defaults and reproduce the CP1252 byte behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
mssql_python/connection.py Changes default decoding settings for SQL_CHAR to UTF-16LE/SQL_WCHAR.
mssql_python/cursor.py Passes updated decoding settings (encoding + ctype) into the native fetch functions.
mssql_python/pybind/ddbc_bindings.cpp Adds charCtype plumb-through and implements SQL_C_WCHAR paths for SQLGetData and bound-column fetching.
mssql_python/pybind/ddbc_bindings.h Extends per-column metadata and adds a wide-char branch in ProcessChar.
tests/test_013_encoding_decoding.py Updates default expectations and adds Windows-focused regression tests for the CP1252 byte case.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mssql_python/connection.py
Comment thread mssql_python/pybind/ddbc_bindings.cpp Outdated
Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
Comment thread tests/test_013_encoding_decoding.py Outdated
Comment thread tests/test_013_encoding_decoding.py
Comment thread mssql_python/pybind/ddbc_bindings.cpp
Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
Comment thread mssql_python/pybind/ddbc_bindings.cpp
Comment thread mssql_python/pybind/ddbc_bindings.cpp Outdated
Comment thread mssql_python/connection.py
Copy link
Copy Markdown
Contributor

@sumitmsft sumitmsft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments. Otherwise, I am ok with this change

Comment thread mssql_python/pybind/ddbc_bindings.cpp
Comment on lines 6071 to 6078
py::arg("StatementHandle"), py::arg("row"), py::arg("charEncoding") = "utf-8",
py::arg("wcharEncoding") = "utf-16le");
py::arg("wcharEncoding") = "utf-16le", py::arg("charCtype") = SQL_C_WCHAR);
m.def("DDBCSQLFetchMany", &FetchMany_wrap, py::arg("StatementHandle"), py::arg("rows"),
py::arg("fetchSize"), py::arg("charEncoding") = "utf-8",
py::arg("wcharEncoding") = "utf-16le", "Fetch many rows from the result set");
py::arg("wcharEncoding") = "utf-16le", py::arg("charCtype") = SQL_C_WCHAR,
"Fetch many rows from the result set");
m.def("DDBCSQLFetchAll", &FetchAll_wrap, "Fetch all rows from the result set",
py::arg("StatementHandle"), py::arg("rows"), py::arg("charEncoding") = "utf-8",
Copy link
Copy Markdown
Contributor

@gargsaumya gargsaumya May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pybind11 default for charEncoding is still "utf-8" here, but this PR changed the connection-level default to "utf-16le" everywhere else (Connection.init, setdecoding(), all three cursor fallbacks in cursor.py). I think these three defaults should be "utf-16le" to stay consistent with the rest of the PR., just veify this once?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: large Substantial code update

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants