
feat(spark): add levenshtein with optional threshold support #20927

Open
davidlghellin wants to merge 3 commits into apache:main from davidlghellin:feat/spark_levenshtein

@davidlghellin
Contributor

Which issue does this PR close?

Rationale for this change

Spark's levenshtein(str1, str2, threshold) supports an optional third argument: when the computed distance exceeds the threshold, it returns -1 instead of the actual distance. DataFusion core's levenshtein only supports the 2-argument form. This PR adds the Spark-compatible version to datafusion-spark.
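
The threshold semantics described above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the code from this PR: the names `levenshtein` and `levenshtein_with_threshold` are hypothetical, the distance is computed over Unicode scalar values (`char`), and `Option<i32>` stands in for the optional third argument (`None` means the 2-argument form).

```rust
/// Plain Levenshtein distance using a two-row dynamic-programming table.
/// Assumption: edits are counted per Unicode scalar value (`char`), not per byte.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the already-processed prefix of `a`
    // and the first j characters of `b`.
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = Vec::with_capacity(b_chars.len() + 1);
        // Turning the first i + 1 characters of `a` into "" takes i + 1 deletions.
        curr.push(i + 1);
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != *cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b_chars.len()]
}

/// Hypothetical wrapper for the 3-argument form: returns -1 when the
/// distance is strictly greater than the threshold, otherwise the distance.
fn levenshtein_with_threshold(a: &str, b: &str, threshold: Option<i32>) -> i32 {
    let distance = levenshtein(a, b) as i32;
    match threshold {
        Some(t) if distance > t => -1,
        _ => distance,
    }
}
```

For example, the distance between `kitten` and `sitting` is 3, so a threshold of 2 yields -1 while a threshold of 3 yields 3. The sketch always computes the full distance; an implementation could stop early once the threshold is exceeded.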

What changes are included in this PR?

  • Add SparkLevenshtein UDF in datafusion/spark/src/function/string/levenshtein.rs:
    • 2-arg form: levenshtein(str1, str2) — standard Levenshtein distance
    • 3-arg form: levenshtein(str1, str2, threshold) — returns -1 when distance > threshold
    • Supports Utf8, Utf8View, and LargeUtf8 (with automatic type coercion)
    • NULL threshold is treated as 0 (Spark behavior)
    • NULL string arguments return NULL
  • Register the function in string/mod.rs
  • Comprehensive SLT tests covering basic usage, threshold, NULL handling, Unicode, column expressions, per-row threshold, and mixed type coercion
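
The "NULL string arguments return NULL" rule from the list above can be modelled with `Option`. This is only a sketch: `propagate_null` is a hypothetical helper, `None` stands in for SQL NULL, and the distance function is passed in as a parameter so the example stays independent of any particular implementation. NULL handling for the threshold argument is deliberately left out.

```rust
/// Hypothetical helper: applies `distance` only when both string arguments
/// are present; if either is `None` (SQL NULL), the result is `None`.
fn propagate_null<F>(a: Option<&str>, b: Option<&str>, distance: F) -> Option<i32>
where
    F: Fn(&str, &str) -> i32,
{
    // `Option::zip` yields `Some((a, b))` only if both inputs are `Some`.
    a.zip(b).map(|(a, b)| distance(a, b))
}
```

With a stand-in distance such as the difference in character counts, `propagate_null(Some("ab"), Some("abc"), ...)` produces `Some(1)`, while any call with a `None` argument produces `None` without invoking the distance function.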

Are these changes tested?

Yes:

  • 22 Rust unit tests in levenshtein.rs
  • 20 SQL logic tests in spark/string/levenshtein.slt

Are there any user-facing changes?

No breaking changes. Adds the levenshtein function to the Spark function library with optional threshold support.

Copilot AI review requested due to automatic review settings March 13, 2026 16:17
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) spark labels Mar 13, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds a Spark-compatible levenshtein UDF to datafusion-spark, including optional threshold semantics, and registers it in the Spark string function set with accompanying SQL logic tests.

Changes:

  • Introduces SparkLevenshtein UDF with 2-arg and 3-arg (threshold) forms, supporting Utf8/Utf8View/LargeUtf8 via coercion.
  • Registers levenshtein in datafusion/spark/src/function/string/mod.rs (UDF + expr_fn export).
  • Adds/expands Spark SQL logic tests for distances, threshold behavior, NULLs, Unicode, and type coercion.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| datafusion/sqllogictest/test_files/spark/string/levenshtein.slt | Adds extensive SLT coverage for levenshtein behavior including threshold and NULL cases |
| datafusion/spark/src/function/string/mod.rs | Wires the new levenshtein UDF into the Spark string module exports/registry |
| datafusion/spark/src/function/string/levenshtein.rs | Implements Spark-compatible levenshtein UDF (including optional threshold) plus Rust unit tests |

"Returns the Levenshtein distance between two strings. Optionally accepts a threshold; returns -1 if the distance exceeds it.",
str1 str2
));
export_functions!((
Comment on lines +94 to +111
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
    if let Some(coercion_data_type) =
        string_coercion(&arg_types[0], &arg_types[1])
            .or_else(|| binary_to_string_coercion(&arg_types[0], &arg_types[1]))
    {
        match coercion_data_type {
            DataType::LargeUtf8 => Ok(DataType::Int64),
            DataType::Utf8 | DataType::Utf8View => Ok(DataType::Int32),
            other => exec_err!(
                "levenshtein requires Utf8, LargeUtf8 or Utf8View, got {other}"
            ),
        }
    } else {
        exec_err!(
            "Unsupported data types for levenshtein. Expected Utf8, LargeUtf8 or Utf8View"
        )
    }
}
Comment on lines +99 to +105
match coercion_data_type {
    DataType::LargeUtf8 => Ok(DataType::Int64),
    DataType::Utf8 | DataType::Utf8View => Ok(DataType::Int32),
    other => exec_err!(
        "levenshtein requires Utf8, LargeUtf8 or Utf8View, got {other}"
    ),
}
Comment on lines +123 to +129
// Spark returns NULL when any scalar argument is NULL.
let null_int = |dt: &DataType| match dt {
    DataType::LargeUtf8 => ColumnarValue::Scalar(ScalarValue::Int64(None)),
    _ => ColumnarValue::Scalar(ScalarValue::Int32(None)),
};
for arg in &args {
    if matches!(arg, ColumnarValue::Scalar(s) if s.is_null()) {
Comment on lines +108 to +112
## Null threshold
query I
SELECT levenshtein('kitten', 'sitting', CAST(NULL AS INT));
----
NULL