feat(spark): add levenshtein with optional threshold support #20927
Open
davidlghellin wants to merge 3 commits into apache:main from
Pull request overview
Adds a Spark-compatible levenshtein UDF to datafusion-spark, including optional threshold semantics, and registers it in the Spark string function set with accompanying SQL logic tests.
Changes:
- Introduces `SparkLevenshtein` UDF with 2-arg and 3-arg (threshold) forms, supporting Utf8/Utf8View/LargeUtf8 via coercion.
- Registers `levenshtein` in `datafusion/spark/src/function/string/mod.rs` (UDF + expr_fn export).
- Adds/expands Spark SQL logic tests for distances, threshold behavior, NULLs, Unicode, and type coercion.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| datafusion/sqllogictest/test_files/spark/string/levenshtein.slt | Adds extensive SLT coverage for levenshtein behavior including threshold and NULL cases |
| datafusion/spark/src/function/string/mod.rs | Wires the new levenshtein UDF into the Spark string module exports/registry |
| datafusion/spark/src/function/string/levenshtein.rs | Implements Spark-compatible levenshtein UDF (including optional threshold) plus Rust unit tests |
    "Returns the Levenshtein distance between two strings. Optionally accepts a threshold; returns -1 if the distance exceeds it.",
    str1 str2
    ));
    export_functions!((
Comment on lines +94 to +111

    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
        if let Some(coercion_data_type) =
            string_coercion(&arg_types[0], &arg_types[1])
                .or_else(|| binary_to_string_coercion(&arg_types[0], &arg_types[1]))
        {
            match coercion_data_type {
                DataType::LargeUtf8 => Ok(DataType::Int64),
                DataType::Utf8 | DataType::Utf8View => Ok(DataType::Int32),
                other => exec_err!(
                    "levenshtein requires Utf8, LargeUtf8 or Utf8View, got {other}"
                ),
            }
        } else {
            exec_err!(
                "Unsupported data types for levenshtein. Expected Utf8, LargeUtf8 or Utf8View"
            )
        }
    }
Comment on lines +99 to +105

    match coercion_data_type {
        DataType::LargeUtf8 => Ok(DataType::Int64),
        DataType::Utf8 | DataType::Utf8View => Ok(DataType::Int32),
        other => exec_err!(
            "levenshtein requires Utf8, LargeUtf8 or Utf8View, got {other}"
        ),
    }
Comment on lines +123 to +129

    // Spark returns NULL when any scalar argument is NULL.
    let null_int = |dt: &DataType| match dt {
        DataType::LargeUtf8 => ColumnarValue::Scalar(ScalarValue::Int64(None)),
        _ => ColumnarValue::Scalar(ScalarValue::Int32(None)),
    };
    for arg in &args {
        if matches!(arg, ColumnarValue::Scalar(s) if s.is_null()) {
Comment on lines +108 to +112

    ## Null threshold
    query I
    SELECT levenshtein('kitten', 'sitting', CAST(NULL AS INT));
    ----
    NULL
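The NULL rule exercised by the snippets above (any NULL argument, including a NULL threshold, yields NULL) can be sketched in plain Rust with `Option`. This is a rough illustration, not the PR's code: the `Threshold` enum and the `levenshtein_nullable` function are hypothetical names, and the distance routine is passed in as a closure so the sketch stays independent of any particular implementation.

```rust
/// Hypothetical model of the optional third argument (not a DataFusion type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Threshold {
    /// Two-argument call: no threshold was supplied.
    Absent,
    /// Three-argument call whose threshold evaluated to SQL NULL.
    Null,
    /// Three-argument call with a concrete threshold.
    Value(i32),
}

/// Returns `None` (SQL NULL) when either string or the threshold is NULL;
/// otherwise applies the threshold rule to the distance from `distance`.
fn levenshtein_nullable<F>(
    a: Option<&str>,
    b: Option<&str>,
    threshold: Threshold,
    distance: F,
) -> Option<i32>
where
    F: Fn(&str, &str) -> i32,
{
    // A NULL string argument short-circuits to NULL.
    let (a, b) = (a?, b?);
    match threshold {
        Threshold::Null => None,
        Threshold::Absent => Some(distance(a, b)),
        Threshold::Value(t) => {
            let d = distance(a, b);
            Some(if d > t { -1 } else { d })
        }
    }
}
```

Modelling "argument not supplied" separately from "argument is NULL" keeps the 2-argument form from being mistaken for a NULL threshold.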
Which issue does this PR close?
- `datafusion-spark` Spark Compatible Functions #15914

Rationale for this change
Spark's `levenshtein(str1, str2, threshold)` supports an optional third argument: when the computed distance exceeds the threshold, it returns `-1` instead of the actual distance. DataFusion core's `levenshtein` only supports the 2-argument form. This PR adds the Spark-compatible version to `datafusion-spark`.

What changes are included in this PR?
- `SparkLevenshtein` UDF in `datafusion/spark/src/function/string/levenshtein.rs`:
  - `levenshtein(str1, str2)`: standard Levenshtein distance
  - `levenshtein(str1, str2, threshold)`: returns `-1` when distance > threshold
  - … `0` (Spark behavior)
- Registration in `string/mod.rs`

Are these changes tested?
Yes:
- Unit tests in `levenshtein.rs`
- SLT tests in `spark/string/levenshtein.slt`

Are there any user-facing changes?
No breaking changes. Adds the `levenshtein` function to the Spark function library with optional threshold support.
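
To make the threshold rule from the description concrete, here is a minimal standalone sketch in plain Rust. It is not the code in this PR: the function names are hypothetical, it uses a simple two-row dynamic-programming distance over Unicode scalar values, and it makes no attempt to reproduce Spark's handling of edge cases such as negative thresholds.

```rust
/// Two-row dynamic-programming Levenshtein distance over `char`s.
fn levenshtein_distance(a: &str, b: &str) -> i32 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution or match
                .min(prev[j + 1] + 1) // deletion
                .min(curr[j] + 1); // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()] as i32
}

/// Hypothetical wrapper: `None` models the 2-argument form; with
/// `Some(t)` the result is -1 whenever the distance exceeds `t`.
fn levenshtein_with_threshold(a: &str, b: &str, threshold: Option<i32>) -> i32 {
    let d = levenshtein_distance(a, b);
    match threshold {
        Some(t) if d > t => -1,
        _ => d,
    }
}
```

For example, `levenshtein_with_threshold("kitten", "sitting", Some(2))` gives `-1`, while a threshold of `Some(3)` or `None` gives `3`. A production version would likely stop early once the threshold is certain to be exceeded; this sketch always computes the full distance for clarity.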