feat: support optional threshold parameter for levenshtein function #20487
n0r0shi wants to merge 1 commit into apache:main from
@neilconway please let me know if anything else needs to be done for this PR.
I wonder, did you see datafusion-spark (the Spark-compatible function library)? That might also be a place to put this kind of change. I started the CI checks, but I am not likely to have time to review this PR soon.
Force-pushed from 850b40c to 2aac710
Move the Spark-specific levenshtein threshold logic from core DataFusion to datafusion-spark. The core 2-arg levenshtein remains unchanged. The Spark levenshtein(str1, str2, threshold) returns -1 when the edit distance exceeds the threshold, using a banded DP algorithm with early termination for better performance.
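To make the "return -1 past the threshold" contract concrete, here is a minimal Rust sketch. It is not the code in this PR: the function name `levenshtein_with_threshold` is hypothetical, the threshold is assumed to be non-negative (`usize`), and strings are compared by `char` (Unicode scalar value). For brevity it fills full DP rows and stops once a whole row exceeds the threshold, rather than restricting work to a diagonal band as the commit describes, so it shows the same observable behavior with a simpler early-termination rule.

```rust
/// Hypothetical helper for illustration only (not the PR's implementation).
/// Returns the edit distance between `a` and `b`, or -1 if it exceeds `threshold`.
fn levenshtein_with_threshold(a: &str, b: &str, threshold: usize) -> i32 {
    // Assumption: compare by Unicode scalar values, not bytes.
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // The distance is never smaller than the length difference,
    // so this case can be rejected without running the DP at all.
    if a.len().abs_diff(b.len()) > threshold {
        return -1;
    }

    // prev[j] holds the distance between a[..i-1] and b[..j] (previous row).
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];

    for i in 1..=a.len() {
        curr[0] = i;
        let mut row_min = curr[0];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1) // deletion
                .min(curr[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
            row_min = row_min.min(curr[j]);
        }
        // Row minima never decrease from one row to the next, so once an
        // entire row is above the threshold the final distance must be too.
        if row_min > threshold {
            return -1;
        }
        std::mem::swap(&mut prev, &mut curr);
    }

    let distance = prev[b.len()];
    if distance > threshold {
        -1
    } else {
        distance as i32
    }
}
```

With this sketch, `levenshtein_with_threshold("kitten", "sitting", 3)` returns 3, while the same pair with a threshold of 2 returns -1. A banded variant would additionally skip cells more than `threshold` positions away from the diagonal, which reduces the work per row from the full length of `b` to roughly twice the threshold.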
Force-pushed from 2aac710 to 7326754
@alamb Moved the threshold logic entirely into |
Thanks @n0r0shi, I kicked off the tests to run.
Hi @n0r0shi, sorry I didn't notice your PR was already open. I submitted #20927 covering the same function. Happy to close mine. Feel free to grab any of my test cases if they're useful: I have 22 unit tests and 20 SLT tests covering Utf8View support, per-row NULL threshold behavior, negative thresholds, unicode, long strings, and case sensitivity. One edge case worth checking: in Spark JVM, a scalar NULL threshold returns NULL (planner constant folding), but a per-row NULL threshold in a column is treated as 0 and returns -1. The two paths behave differently by design. Great work on this!
@davidlghellin Thanks for your work on this! I'd like to check your PR and port some of your test cases into mine. If I do, I'll add you as co-author so your contribution is recognized. Would that work for you?
Which issue does this PR close?
Closes #20488
Rationale for this change
Apache Spark's levenshtein(str1, str2, threshold) supports an optional third argument that caps the computation: if the edit distance exceeds the threshold, it returns -1 with early termination. This enables Apache DataFusion Comet to natively support Spark's 3-argument levenshtein instead of falling back to Spark.
What changes are included in this PR?
datafusion-spark: Added SparkLevenshtein UDF supporting both 2-arg and 3-arg (with threshold) signatures. The threshold algorithm uses banded dynamic programming with early termination, ported from Apache Spark's UTF8String.levenshteinDistance. Null threshold produces null output.
sqllogictests: Added Spark levenshtein tests covering thresholds, empty strings, and null handling.
Core levenshtein in datafusion-functions is unchanged (2-arg only).
Are these changes tested?
Are there any user-facing changes?
Yes. Spark-compatible levenshtein now accepts an optional third argument: