feat: support optional threshold parameter for levenshtein function by n0r0shi · Pull Request #20487 · apache/datafusion

n0r0shi · 2026-02-23T08:19:19Z

Which issue does this PR close?

Closes #20488

Rationale for this change

Apache Spark's levenshtein(str1, str2, threshold) supports an optional third argument that caps the computation: if the edit distance exceeds the threshold, it returns -1 with early termination. This enables Apache DataFusion Comet to natively support Spark's 3-argument levenshtein instead of falling back to Spark.

What changes are included in this PR?

datafusion-spark: Added SparkLevenshtein UDF supporting both 2-arg and 3-arg (with threshold) signatures. The threshold algorithm uses banded dynamic programming with early termination, ported from Apache Spark's UTF8String.levenshteinDistance. Null threshold produces null output.
sqllogictests: Added spark levenshtein tests covering thresholds, empty strings, and null handling.

Core levenshtein in datafusion-functions is unchanged (2-arg only).
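
The banded dynamic programming with early termination mentioned above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR and not Spark's UTF8String.levenshteinDistance: the function name levenshtein_with_threshold is hypothetical, strings are compared by Unicode scalar value, and a negative threshold is assumed to yield -1 because no distance can be at or below it.

```rust
/// Hypothetical sketch: edit distance capped at `threshold`.
/// Returns the distance when it is <= threshold, otherwise -1.
pub fn levenshtein_with_threshold(a: &str, b: &str, threshold: i32) -> i32 {
    // Assumption: a negative threshold can never be satisfied.
    if threshold < 0 {
        return -1;
    }
    let k = threshold as usize;

    // Keep `s` as the shorter string so each DP row has n + 1 cells.
    let mut s: Vec<char> = a.chars().collect();
    let mut t: Vec<char> = b.chars().collect();
    if s.len() > t.len() {
        std::mem::swap(&mut s, &mut t);
    }
    let (n, m) = (s.len(), t.len());

    // The length difference is a lower bound on the distance.
    if m - n > k {
        return -1;
    }
    if n == 0 {
        return m as i32; // m <= k is guaranteed by the check above
    }

    const INF: usize = usize::MAX / 2; // "outside the band", safe to add 1 to
    let mut prev = vec![INF; n + 1];
    let mut curr = vec![INF; n + 1];
    for i in 0..=n.min(k) {
        prev[i] = i; // row 0: distance from the empty prefix of `t`
    }

    for j in 1..=m {
        // Only cells with |i - j| <= k can hold a value <= k.
        let lo = if j > k { j - k } else { 1 };
        let hi = n.min(j.saturating_add(k));
        curr[0] = if j <= k { j } else { INF };
        if lo > 1 {
            curr[lo - 1] = INF; // cell just left of the band
        }
        let mut row_min = curr[0];
        for i in lo..=hi {
            let cost = if s[i - 1] == t[j - 1] { 0 } else { 1 };
            let v = (prev[i - 1] + cost) // substitute or match
                .min(prev[i] + 1) // skip a character of `t`
                .min(curr[i - 1] + 1); // skip a character of `s`
            curr[i] = v;
            row_min = row_min.min(v);
        }
        // Early termination: values never decrease along a DP path.
        if row_min > k {
            return -1;
        }
        std::mem::swap(&mut prev, &mut curr);
    }

    if prev[n] <= k { prev[n] as i32 } else { -1 }
}
```

Restricting each row to the band keeps the work at roughly O(min(len) * threshold) cells instead of the full table, and the row-minimum check lets the function stop as soon as no cell in the band can still end at or below the threshold.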

Are these changes tested?

4 unit tests (basic, threshold, null handling, edge cases)
Spark sqllogictest queries (2-arg and 3-arg with threshold)

Are there any user-facing changes?

Yes. Spark-compatible levenshtein now accepts an optional third argument:

SELECT levenshtein('kitten', 'sitting');        -- 3
SELECT levenshtein('kitten', 'sitting', 2);     -- -1 (distance 3 > threshold 2)
SELECT levenshtein('kitten', 'sitting', 5);     -- 3  (distance 3 <= threshold 5)
SELECT levenshtein('kitten', 'sitting', NULL);  -- NULL

neilconway

lgtm overall!

datafusion/common/src/utils/mod.rs

n0r0shi · 2026-03-04T20:55:01Z

@neilconway please let me know if anything else needs to be done for this PR.

neilconway · 2026-03-04T21:56:39Z

@n0r0shi Not from my perspective but I'm not a committer :) Perhaps @alamb has some time or can suggest a committer who might be interested.

alamb · 2026-03-04T22:58:38Z

I wonder, did you see datafusion-function-spark (the Spark-compatible function library)? That might also be a place to put this kind of change.

I started the CI checks but I am not likely to have time to review this PR soon

Move the Spark-specific levenshtein threshold logic from core DataFusion to datafusion-spark. The core 2-arg levenshtein remains unchanged. The Spark levenshtein(str1, str2, threshold) returns -1 when the edit distance exceeds the threshold, using a banded DP algorithm with early termination for better performance.

n0r0shi · 2026-03-07T09:38:21Z

@alamb Moved the threshold logic entirely into datafusion-spark. There are no longer any changes to datafusion-common or datafusion-functions. Thanks.

alamb · 2026-03-10T18:47:47Z

Thanks @n0r0shi -- I kicked off the tests to run

davidlghellin · 2026-03-14T15:01:36Z

Hi @n0r0shi, sorry I didn't notice your PR was already open — I submitted #20927 covering the same function. Happy to close mine.

Feel free to grab any of my test cases if they're useful — I have 22 unit tests and 20 SLT tests covering Utf8View support, per-row NULL threshold behavior, negative thresholds, unicode, long strings, and case sensitivity.

One edge case worth checking: in Spark JVM, a scalar NULL threshold returns NULL (planner constant folding), but a per-row NULL threshold in a column is treated as 0 and returns -1. The two paths behave differently by design.

Great work on this!
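
The two NULL paths described in this comment can be modeled with a small sketch. It only encodes the behavior as reported above (it has not been checked against Spark itself), and every name in it (edit_distance, capped, scalar_null_path, per_row_null_path) is hypothetical. A plain full-table distance is used here for brevity instead of the banded algorithm.

```rust
/// Plain two-row Levenshtein distance over Unicode scalar values.
fn edit_distance(a: &str, b: &str) -> i32 {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr.push((prev[j] + cost).min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()] as i32
}

/// Distance if it is <= threshold, otherwise -1.
fn capped(a: &str, b: &str, threshold: i32) -> i32 {
    let d = edit_distance(a, b);
    if d <= threshold { d } else { -1 }
}

/// Scalar path as reported: a NULL literal threshold yields NULL.
pub fn scalar_null_path(a: &str, b: &str, threshold: Option<i32>) -> Option<i32> {
    threshold.map(|t| capped(a, b, t))
}

/// Per-row path as reported: a NULL threshold in a column acts like 0.
pub fn per_row_null_path(a: &str, b: &str, threshold: Option<i32>) -> Option<i32> {
    Some(capped(a, b, threshold.unwrap_or(0)))
}
```

Under this model, a NULL threshold on 'kitten' and 'sitting' gives NULL on the scalar path and -1 on the per-row path. Equal strings on the per-row path would give 0, which is an inference from "treated as 0" rather than something stated in the comment.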

n0r0shi · 2026-03-14T17:21:24Z

@davidlghellin Thanks for your work on this! I'd like to check your PR and port some of your test cases into mine. If I do, I'll add you as co-author so your contribution is recognized. Would that work for you?

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)), common (Related to common crate), and functions (Changes to functions implementation) labels on Feb 23, 2026

neilconway approved these changes on Feb 26, 2026

datafusion/common/src/utils/mod.rs (outdated review thread)

n0r0shi force-pushed the levenshtein-threshold branch from 850b40c to 2aac710 on March 7, 2026 09:26

github-actions bot added the spark label and removed the functions (Changes to functions implementation) label on Mar 7, 2026

n0r0shi force-pushed the levenshtein-threshold branch from 2aac710 to 7326754 on March 7, 2026 09:36

github-actions bot removed the common (Related to common crate) label on Mar 7, 2026

n0r0shi wants to merge 1 commit into apache:main from n0r0shi:levenshtein-threshold
