FIX: VARCHAR fetch fails when data length equals column size with non-ASCII CP1252 characters #444

Open

subrata-ms wants to merge 5 commits into main from subrata-ms/CP1252_Conversion

Conversation

subrata-ms (Contributor) commented Feb 23, 2026

Work Item / Issue Reference

AB#42604

GitHub Issue: #435


Summary

This pull request improves the handling of character encoding and buffer sizing for SQL CHAR/VARCHAR data in the ODBC Python bindings, especially for cross-platform compatibility between Linux/macOS and Windows. The changes ensure that character data is decoded correctly and that buffer sizes are sufficient to prevent corruption or truncation when dealing with multi-byte UTF-8 data returned by the ODBC driver on non-Windows systems.

Character Encoding Handling:

  • Introduced the GetEffectiveCharDecoding function to determine the correct decoding to use for SQL CHAR data: always UTF-8 on Linux/macOS (since the ODBC driver returns UTF-8), and the user-specified encoding on Windows. This function is now used consistently throughout the codebase to select the decoding method.
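
A minimal sketch of the shape such a helper can take, based only on the behavior described above (the parameter name is illustrative; the actual code in ddbc_bindings.cpp may differ):

#include <string>

// Sketch only: choose the decoding actually applied to SQL_C_CHAR data.
// On Linux/macOS the ODBC driver always returns UTF-8 bytes, regardless of
// the user-configured encoding; on Windows the configured encoding applies.
static std::string GetEffectiveCharDecoding(const std::string& userEncoding) {
#if defined(__APPLE__) || defined(__linux__)
    (void)userEncoding;  // driver output is UTF-8 on these platforms
    return "utf-8";
#else
    return userEncoding;
#endif
}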

Buffer Sizing for UTF-8:

  • Updated buffer allocation logic to use columnSize * 4 + 1 for SQL CHAR/VARCHAR columns on Linux/macOS, accounting for the worst-case UTF-8 expansion (up to 4 bytes per character), preventing data truncation when multi-byte characters are present at the column boundary.
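
As a rough worked sketch of the sizing rule (the helper name is hypothetical; the formula is the one described above):

#include <sql.h>  // SQLULEN

// Illustrative helper, not the actual bindings code: worst-case byte size for
// a SQL_C_CHAR fetch buffer when the driver returns UTF-8. columnSize is in
// characters; UTF-8 needs up to 4 bytes per character, plus 1 byte for the
// null terminator appended by SQLBindCol/SQLGetData.
static SQLULEN CharFetchBufferSize(SQLULEN columnSize) {
    return columnSize * 4 + 1;
}

// Example: VARCHAR(10) -> 41 bytes. "café René!" is 10 characters but 12 UTF-8
// bytes (each é encodes as 0xC3 0xA9), so the previous columnSize + 1 = 11-byte
// buffer truncated it at the column boundary.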

Decoding and Data Fetching:

  • Modified all data fetching and decoding paths (FetchLobColumnData, SQLGetData_wrap, ProcessChar, and batch fetch functions) to use the effective character encoding and the correct buffer sizes, ensuring consistent and correct decoding regardless of platform.

API Changes:

  • Updated the FetchBatchData function and its callers to accept and propagate the character encoding parameter, ensuring the encoding context is preserved throughout the data-fetching stack.

Minor Fixes:

  • Minor formatting and logging improvements for error messages and function signatures.

These changes collectively improve correctness and reliability when handling string data from SQL databases, especially in multi-platform environments.

CP1252 VARCHAR Boundary Fix — Summary
Problem: VARCHAR columns with CP1252 non-ASCII characters (e.g., é, ñ, ö) returned corrupted data when the string length exactly equaled the column size. Inserting "café René!" into VARCHAR(10) returned "©!".

Root Cause:
Three bugs in ddbc_bindings.cpp:

1. Undersized buffer — SQLGetData / SQLBindCol allocated columnSize + 1 bytes, but on Linux/macOS the ODBC driver converts server data to UTF-8, where CP1252 é (1 byte) becomes 0xC3 0xA9 (2 bytes). A 10-char string with 2 accented characters needs 12 bytes, exceeding the 11-byte buffer → truncation → LOB fallback re-reads consumed data → corruption.

2. Wrong decode encoding — After fixing the buffer, data arrived intact but was decoded with the user's charEncoding (CP1252) instead of UTF-8. Since ODBC on Linux/macOS already converts to UTF-8, double-interpreting the UTF-8 bytes as CP1252 produced mojibake ("cafÃ© RenÃ©!" instead of "café René!").

3. ProcessChar assumed UTF-8 on all platforms — The batch/fetchall hot path used PyUnicode_FromStringAndSize, which assumes UTF-8 input. That is correct on Linux (ODBC returns UTF-8) but wrong on Windows (ODBC returns a native encoding such as CP1252).
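
A condensed sketch of the platform split this implies in the fast path (the function name here is illustrative; dataPtr, numCharsInData, and charEncoding follow the names visible in the ddbc_bindings.h excerpt quoted in the coverage report below):

#include <Python.h>
#include <string>

// Sketch only, not the verbatim ProcessChar code: decode SQL_C_CHAR bytes into
// a Python str using the platform-appropriate codec.
static PyObject* DecodeCharData(const char* dataPtr, Py_ssize_t numCharsInData,
                                const std::string& charEncoding) {
#if defined(__APPLE__) || defined(__linux__)
    // The driver already converted to UTF-8, so the UTF-8-only constructor is correct.
    (void)charEncoding;
    return PyUnicode_FromStringAndSize(dataPtr, numCharsInData);
#else
    // On Windows the bytes are in the user-configured encoding (e.g. cp1252),
    // so decode with that codec instead of assuming UTF-8.
    return PyUnicode_Decode(dataPtr, numCharsInData, charEncoding.c_str(), "strict");
#endif
}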

github-actions bot added the pr-size: medium (Moderate update size) label on Feb 23, 2026
github-actions bot commented Feb 23, 2026

📊 Code Coverage Report

🔥 Diff Coverage

78%


🎯 Overall Coverage

76%


📈 Total Lines Covered: 5530 out of 7211
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (88.9%): Missing lines 1169-1170,2903,2991
  • mssql_python/pybind/ddbc_bindings.h (45.5%): Missing lines 835,843-847

Summary

  • Total: 47 lines
  • Missing: 10 lines
  • Coverage: 78%

mssql_python/pybind/ddbc_bindings.cpp

Lines 1165-1174

  1165     if (_type != SQL_HANDLE_STMT) {
  1166         // Log error but don't throw - we're likely in cleanup/destructor path
  1167         LOG_ERROR("SAFETY VIOLATION: Attempted to mark non-STMT handle as implicitly freed. "
  1168                   "Handle type=%d. This will cause handle leak. Only STMT handles are "
! 1169                   "automatically freed by parent DBC handles.",
! 1170                   _type);
  1171         return;  // Refuse to mark - let normal free() handle it
  1172     }
  1173     _implicitly_freed = true;
  1174 }

Lines 2899-2907

  2899             effectiveCharEncoding.c_str(), buffer.size(), py::len(decoded), colIndex);
  2900         return decoded;
  2901     } catch (const py::error_already_set& e) {
  2902         LOG_ERROR("FetchLobColumnData: Failed to decode with '%s' for column %d: %s",
! 2903                   effectiveCharEncoding.c_str(), colIndex, e.what());
  2904         // Return raw bytes as fallback
  2905         return raw_bytes;
  2906     }
  2907 }

Lines 2987-2995

  2987                                         py::len(decoded));
  2988                                 } catch (const py::error_already_set& e) {
  2989                                     LOG_ERROR(
  2990                                         "SQLGetData: Failed to decode CHAR column %d with '%s': %s",
! 2991                                         i, decodeEncoding.c_str(), e.what());
  2992                                     // Return raw bytes as fallback
  2993                                     row.append(raw_bytes);
  2994                                 }
  2995                             } else {

mssql_python/pybind/ddbc_bindings.h

  831                 PyUnicode_Decode(dataPtr, numCharsInData, colInfo->charEncoding.c_str(), "strict");
  832         }
  833 #endif
  834         if (!pyStr) {
! 835             PyErr_Clear();
  836             Py_INCREF(Py_None);
  837             PyList_SET_ITEM(row, col - 1, Py_None);
  838         } else {
  839             PyList_SET_ITEM(row, col - 1, pyStr);
  840         }
  841     } else {
  842         // Slow path: LOB data requires separate fetch call
! 843         PyList_SET_ITEM(
! 844             row, col - 1,
! 845             FetchLobColumnData(hStmt, col, SQL_C_CHAR, false, false, colInfo->charEncoding)
! 846                 .release()
! 847                 .ptr());
  848     }
  849 }
  850 
  851 // Process SQL NCHAR/NVARCHAR (wide/Unicode string) column into Python str


📋 Files Needing Attention

📉 Files with overall lowest coverage
mssql_python.pybind.logger_bridge.hpp: 58.8%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.row.py: 66.2%
mssql_python.pybind.ddbc_bindings.cpp: 69.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.helpers.py: 77.4%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.cursor.py: 84.7%
mssql_python.__init__.py: 84.9%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

Copilot AI left a comment

Pull request overview

This PR fixes a cross-platform fetch bug for CHAR/VARCHAR values at the exact column-size boundary when non-ASCII characters are involved (notably with CP1252), by adjusting buffer sizing and ensuring consistent, platform-appropriate decoding in the ODBC C++ bindings. It also adds regression coverage to prevent the issue from recurring.

Changes:

  • Add platform-aware “effective” decoding selection for SQL_C_CHAR (UTF-8 on Linux/macOS, user-configured encoding on Windows) and propagate it through fetch paths.
  • Increase SQL_C_CHAR fetch/bind buffer sizes on Linux/macOS to handle worst-case UTF-8 expansion at column boundaries.
  • Add a dedicated regression test suite for CP1252 VARCHAR boundary cases and a small decoding config update in the encoding test suite.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File descriptions:

  • tests/test_017_varchar_cp1252_boundary.py: New regression tests for CP1252 non-ASCII VARCHAR values at exact-length boundaries across multiple fetch paths.
  • tests/test_013_encoding_decoding.py: Adds setdecoding(SQL_CHAR, "utf-8") to align decoding configuration in an existing parameter encoding test.
  • mssql_python/pybind/ddbc_bindings.h: Updates ProcessChar fast/slow paths to decode correctly on Windows and to pass encoding through LOB fetch.
  • mssql_python/pybind/ddbc_bindings.cpp: Introduces GetEffectiveCharDecoding, adjusts buffer sizing and decoding in SQLGetData_wrap, and propagates encoding into batch fetch metadata.


Comment on lines +3689 to +3694
columnInfosExt[col].dataType = columnInfos[col].dataType;
columnInfosExt[col].columnSize = columnInfos[col].columnSize;
columnInfosExt[col].processedColumnSize = columnInfos[col].processedColumnSize;
columnInfosExt[col].fetchBufferSize = columnInfos[col].fetchBufferSize;
columnInfosExt[col].isLob = columnInfos[col].isLob;
columnInfosExt[col].charEncoding = GetEffectiveCharDecoding(charEncoding);
Copilot AI commented Feb 23, 2026

GetEffectiveCharDecoding(charEncoding) is evaluated and assigned into every ColumnInfoExt entry, even though the effective encoding is identical for all columns within a batch. Consider computing effectiveCharEncoding once before the for (col...) loop and reusing it (or storing a pointer/flag) to avoid redundant work in this hot path.

Comment on lines +835 to 837
PyErr_Clear();
Py_INCREF(Py_None);
PyList_SET_ITEM(row, col - 1, Py_None);
Copilot AI commented Feb 23, 2026

ProcessChar clears decode errors and stores None in the result when decoding fails. This differs from the SQLGetData_wrap / FetchLobColumnData paths which fall back to returning raw bytes, and it can silently turn non-NULL database values into None (data loss) on Windows when PyUnicode_Decode(..., "strict") fails. Consider falling back to a bytes object (or propagating the exception) instead of returning None after PyErr_Clear() so behavior is consistent across fetch paths.

Suggested change
-            PyErr_Clear();
-            Py_INCREF(Py_None);
-            PyList_SET_ITEM(row, col - 1, Py_None);
+            // Decoding failed: clear the error and fall back to raw bytes to avoid data loss.
+            PyErr_Clear();
+            PyObject* pyBytes = PyBytes_FromStringAndSize(dataPtr, numCharsInData);
+            if (!pyBytes) {
+                // As a last resort, preserve previous behavior and return None.
+                Py_INCREF(Py_None);
+                PyList_SET_ITEM(row, col - 1, Py_None);
+            } else {
+                PyList_SET_ITEM(row, col - 1, pyBytes);
+            }

Comment on lines +2959 to +2963
// Use columnSize * 4 + 1 to accommodate worst-case UTF-8 expansion.
// columnSize is in characters, but on Linux the ODBC driver returns
// UTF-8 encoded bytes where each character can be up to 4 bytes.
// Without this, data at exact column boundary with multi-byte chars
// (e.g., CP1252 é in VARCHAR(10)) causes truncation and corruption.
Copilot AI commented Feb 23, 2026

The comment explains columnSize * 4 + 1 as a Linux/macOS UTF-8 expansion workaround, but the code applies this buffer sizing unconditionally on all platforms. Either guard this sizing with the same #if defined(__APPLE__) || defined(__linux__) used elsewhere, or update the comment to reflect that the over-allocation is intentional cross-platform (e.g., to also handle Windows when charEncoding is UTF-8).

Suggested change
- // Use columnSize * 4 + 1 to accommodate worst-case UTF-8 expansion.
- // columnSize is in characters, but on Linux the ODBC driver returns
- // UTF-8 encoded bytes where each character can be up to 4 bytes.
- // Without this, data at exact column boundary with multi-byte chars
- // (e.g., CP1252 é in VARCHAR(10)) causes truncation and corruption.
+ // Use columnSize * 4 + 1 to accommodate worst-case expansion for
+ // multi-byte encodings (e.g., UTF-8), where each character can be
+ // up to 4 bytes. columnSize is in characters, while SQL_C_CHAR
+ // returns encoded bytes. This intentional over-allocation is applied
+ // on all platforms to avoid truncation/corruption when data is at
+ // the exact column boundary and contains multi-byte characters.

Comment on lines +24 to +27
varchar_10 VARCHAR(10),
varchar_20 VARCHAR(20),
varchar_50 VARCHAR(50),
varchar_100 VARCHAR(100)
Copilot AI commented Feb 23, 2026

This test suite is intended to validate CP1252 behavior, but the VARCHAR columns are created without an explicit collation, so the actual code page depends on the database default (and could be UTF-8 or something non-CP1252 in some CI environments). To make the tests deterministic and ensure they actually exercise the CP1252 boundary, consider adding COLLATE SQL_Latin1_General_CP1_CI_AS (or another explicit CP1252 collation) to the VARCHAR columns.

Suggested change
- varchar_10 VARCHAR(10),
- varchar_20 VARCHAR(20),
- varchar_50 VARCHAR(50),
- varchar_100 VARCHAR(100)
+ varchar_10 VARCHAR(10) COLLATE SQL_Latin1_General_CP1_CI_AS,
+ varchar_20 VARCHAR(20) COLLATE SQL_Latin1_General_CP1_CI_AS,
+ varchar_50 VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS,
+ varchar_100 VARCHAR(100) COLLATE SQL_Latin1_General_CP1_CI_AS

