StableTfl Similarity by txwei · Pull Request #16243 · apache/lucene

txwei · 2026-06-11T06:17:19Z

Description

Adds StableTflSimilarity, a new similarity that estimates term rarity from term length and document length instead of corpus-level stats (doc frequency, total doc count, avg doc length). I presented this model at Haystack 2026 in Charlottesville (slides).

It keeps the same overall shape as BM25 (a sum over query terms of a term-frequency saturation factor times a per-term rarity weight) but swaps IDF for a synthetic, corpus-independent term rarity (TR) weight, computed as the probability that a given term appears in a doc at least once, based only on the term length and doc length.
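To make that shape concrete, here is a minimal Java sketch. It is not the code in this PR. The class name, the `tf / (tf + k1)` saturation, the uniform 26-letter random-token model behind `appearanceProbability`, and the `-ln(p)` mapping from probability to weight are all assumptions chosen for illustration; the actual TR formula is the one in StableTflSimilarity. The rarity function is passed in as a parameter so the placeholder can be swapped out.

```java
import java.util.function.DoubleBinaryOperator;

/** Illustrative sketch of a BM25-shaped score that reads no corpus statistics. */
final class CorpusFreeScoreSketch {

  /** ASSUMPTION: BM25-style saturation without length normalization. */
  static double saturation(double tf, double k1) {
    return tf / (tf + k1);
  }

  /**
   * PLACEHOLDER model, not the PR's formula: probability that a term of
   * termLength characters appears at least once among docLength tokens,
   * if every token were an independent uniform string over 26 letters.
   */
  static double appearanceProbability(int termLength, int docLength) {
    double perToken = Math.pow(26.0, -termLength);
    // 1 - (1 - perToken)^docLength, computed in a numerically stable way.
    return -Math.expm1(docLength * Math.log1p(-perToken));
  }

  /** ASSUMPTION: turn the probability into a weight the way IDF does, via -ln(p). */
  static double placeholderRarity(double termLength, double docLength) {
    return -Math.log(appearanceProbability((int) termLength, (int) docLength));
  }

  /** Sum over matching query terms of saturation(tf) * rarity(termLength, docLength). */
  static double score(int[] termLengths, int[] termFreqs, int docLength,
                      double k1, DoubleBinaryOperator rarity) {
    double sum = 0.0;
    for (int i = 0; i < termLengths.length; i++) {
      if (termFreqs[i] == 0) {
        continue; // term does not occur in this document
      }
      sum += saturation(termFreqs[i], k1)
          * rarity.applyAsDouble(termLengths[i], docLength);
    }
    return sum;
  }

  public static void main(String[] args) {
    // Two query terms of length 4 and 9, occurring 2 and 1 times in a 150-token doc.
    double s = score(new int[] {4, 9}, new int[] {2, 1}, 150, 1.2,
        CorpusFreeScoreSketch::placeholderRarity);
    System.out.println("score = " + s);
  }
}
```

Every input to `score` comes from the query and the document itself, so under these assumptions the result cannot change when other documents are added, deleted, or merged.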

Why?

BM25 scores depend on per-field corpus stats that can differ across search nodes and shift over time as segments merge. When users paginate by relevance score, these drifting scores can produce duplicate or skipped results across pages. Since StableTfl doesn't rely on any of the drifting stats, it always produces a deterministic score: the same query and document get the same score on every node and at any point in time. If score and ranking consistency is more important than a few percentage points of recall loss, this is something worth exploring.
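The drift is easy to reproduce on paper. The sketch below uses the IDF formula documented for Lucene's BM25Similarity, `ln(1 + (N - n + 0.5) / (n + 0.5))`, in a simplified single-term BM25 (no boost, no norm quantization). The class name and the per-node statistics are hypothetical numbers picked for illustration, not measurements.

```java
/** Illustrative sketch: one document, one term, two nodes with different corpus stats. */
final class Bm25DriftSketch {

  /** IDF as documented for Lucene's BM25Similarity. */
  static double idf(long docCount, long docFreq) {
    return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  }

  /** Simplified single-term BM25 (assumption: no boost, exact doc length). */
  static double bm25(double tf, double docLen, double avgDocLen,
                     long docCount, long docFreq, double k1, double b) {
    double norm = k1 * (1 - b + b * docLen / avgDocLen);
    return idf(docCount, docFreq) * tf / (tf + norm);
  }

  public static void main(String[] args) {
    double k1 = 1.2;
    double b = 0.75;
    // Same document (tf = 3, length = 120) scored with hypothetical stats from two nodes.
    double nodeA = bm25(3, 120, 100.0, 1_000_000L, 1_000L, k1, b);
    double nodeB = bm25(3, 120, 104.0, 1_000_000L, 1_200L, k1, b);
    System.out.println("node A: " + nodeA);
    System.out.println("node B: " + nodeB);
  }
}
```

The document and the query are identical in both calls; only the doc frequency and average doc length differ, and that alone is enough to change the score and potentially the position of the hit between pages.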

Recall and NDCG performance

Benchmark results across 21 BEIR datasets:

Model	NDCG@10	Recall@10
BM25	0.346	0.388
StableTfl	0.315	0.350
Gap	0.031	0.038

Since StableTfl uses less information than BM25, it scores slightly lower on both metrics: about 0.03 lower NDCG@10 and 0.04 lower Recall@10 in absolute terms.

txwei added 5 commits June 5, 2026 00:03

stableTfl

6f8bf5c

optimization and unit tests

52b5d8e

support k3 and bulkScorer

7deec17

add comment

5563c0c

style fix

59bb341

github-actions Bot added the module:core/search label Jun 11, 2026

github-actions Bot added this to the 11.0.0 milestone Jun 11, 2026

txwei added 3 commits June 10, 2026 23:18

CHANGES.txt

6a341ec

fixes

91cc2aa

use 0f

59488c0

txwei mentioned this pull request Jun 11, 2026

Defer term collection when query term count is unknown #16222

Merged

txwei marked this pull request as ready for review June 11, 2026 19:14

txwei wants to merge 8 commits into apache:main from txwei:stableTfl
