Add Hangul composition char filter to analysis-nori by Incheonkirin · Pull Request #16242 · apache/lucene

Incheonkirin · 2026-06-11T05:00:28Z

Addresses #16241

Summary

Add an opt-in HangulCompositionCharFilter to analysis-nori. The filter composes modern Hangul conjoining-jamo sequences into precomposed Hangul syllables before KoreanTokenizer, so NFD-form Korean text can analyze like the equivalent NFC text while preserving offset correction back to the original input.

The filter is intentionally narrow: it handles only modern L/V/optional-T conjoining jamo sequences and leaves compatibility jamo, archaic jamo, partial sequences, already-precomposed Korean text, and non-Hangul text unchanged. For general Unicode normalization the ICU module's ICUNormalizer2CharFilter remains the right tool; this covers the common Korean-only case without adding the ICU dependency to nori deployments.
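To illustrate the scope described above, here is a minimal sketch of composing modern L/V/optional-T sequences with the standard Hangul syllable arithmetic. It is not the PR's implementation: the class name `HangulComposeSketch` and the `compose` method are hypothetical, it works on a whole `String` rather than a streaming `Reader`, and it omits the offset correction the real filter must maintain. It also assumes that a modern L+V pair followed by a non-modern trailing jamo composes to LV and leaves the trailing character as-is; the PR may treat that shape differently.

```java
public final class HangulComposeSketch {
  // Constants of the Unicode Hangul syllable composition arithmetic.
  static final int S_BASE = 0xAC00; // first precomposed syllable
  static final int L_BASE = 0x1100; // first modern leading consonant
  static final int V_BASE = 0x1161; // first modern vowel
  static final int T_BASE = 0x11A7; // one before the first modern trailing consonant
  static final int L_COUNT = 19;
  static final int V_COUNT = 21;
  static final int T_COUNT = 28; // includes the "no trailing consonant" slot

  /** Composes modern L V [T] runs; every other character is copied unchanged. */
  static String compose(String in) {
    StringBuilder out = new StringBuilder(in.length());
    int i = 0;
    while (i < in.length()) {
      char c = in.charAt(i);
      int l = c - L_BASE;
      if (l >= 0 && l < L_COUNT && i + 1 < in.length()) {
        int v = in.charAt(i + 1) - V_BASE;
        if (v >= 0 && v < V_COUNT) {
          int t = 0;
          int consumed = 2;
          if (i + 2 < in.length()) {
            int candidate = in.charAt(i + 2) - T_BASE;
            // candidate == 0 is U+11A7, which is not a trailing consonant.
            if (candidate > 0 && candidate < T_COUNT) {
              t = candidate;
              consumed = 3;
            }
          }
          out.append((char) (S_BASE + (l * V_COUNT + v) * T_COUNT + t));
          i += consumed;
          continue;
        }
      }
      // Lone jamo, compatibility jamo, precomposed syllables, non-Hangul text.
      out.append(c);
      i++;
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String nfd = "\u1112\u1161\u11AB\u1100\u1173\u11AF"; // decomposed form of "한글"
    System.out.println(compose(nfd).equals("\uD55C\uAE00"));
  }
}
```

Because the sketch only starts a composition at a leading consonant, a precomposed LV syllable followed by a trailing jamo passes through unchanged, which matches the out-of-scope shape listed in the tests.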

Tests

NFD Korean sentence through HangulCompositionCharFilter + KoreanTokenizer matches NFC KoreanTokenizer terms/POS
offsets from analyzed NFD text map back to the original NFD input
randomized modern Hangul NFD composition matches NFC
non-modern and partial jamo sequences unchanged
already-NFC and no-op inputs unchanged
precomposed-LV + trailing jamo passthrough (out-of-scope shape unchanged)
factory registration
bogus factory arguments
random analyzer data
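
The randomized NFD-versus-NFC check in the list above can be sketched with the JDK alone. This is only an illustration of the idea, not the PR's test: the class `NfdRoundTripSketch` and its methods are hypothetical, it uses `java.util.Random` with a fixed seed instead of the Lucene test framework, and it compares a bare arithmetic composer against `java.text.Normalizer` rather than exercising HangulCompositionCharFilter itself.

```java
import java.text.Normalizer;
import java.util.Random;

public final class NfdRoundTripSketch {
  /** Composes one decomposed modern syllable (L V or L V T) arithmetically. */
  static char composeOne(String jamo) {
    int l = jamo.charAt(0) - 0x1100;
    int v = jamo.charAt(1) - 0x1161;
    int t = jamo.length() == 3 ? jamo.charAt(2) - 0x11A7 : 0;
    return (char) (0xAC00 + (l * 21 + v) * 28 + t);
  }

  /** Counts random syllables whose arithmetic composition differs from NFC. */
  static int countMismatches(long seed, int iterations) {
    Random random = new Random(seed);
    int mismatches = 0;
    for (int i = 0; i < iterations; i++) {
      // 11172 = 19 * 21 * 28 precomposed modern syllables starting at U+AC00.
      char syllable = (char) (0xAC00 + random.nextInt(11172));
      String nfd = Normalizer.normalize(String.valueOf(syllable), Normalizer.Form.NFD);
      String nfc = Normalizer.normalize(nfd, Normalizer.Form.NFC);
      if (nfc.length() != 1 || composeOne(nfd) != nfc.charAt(0)) {
        mismatches++;
      }
    }
    return mismatches;
  }

  public static void main(String[] args) {
    System.out.println(countMismatches(42L, 1000)); // expected to print 0
  }
}
```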

Verification

./gradlew :lucene:analysis:nori:tidy
./gradlew :lucene:analysis:nori:check
./gradlew check

rmuir

The use case makes sense to me.

The constants and formula match 3.12.3 Hangul Syllable Composition directly.
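
For reference, the standard composition arithmetic can be written as below. The worked example (the syllable 한) is chosen here for illustration and does not come from the PR or the review.

```latex
\begin{aligned}
S &= S_{\mathrm{Base}} + (L_{\mathrm{Index}} \cdot V_{\mathrm{Count}} + V_{\mathrm{Index}}) \cdot T_{\mathrm{Count}} + T_{\mathrm{Index}},\\
S_{\mathrm{Base}} &= \mathrm{AC00}_{16},\quad V_{\mathrm{Count}} = 21,\quad T_{\mathrm{Count}} = 28,\\
L_{\mathrm{Index}} &= \mathrm{1112}_{16} - \mathrm{1100}_{16} = 18,\quad
V_{\mathrm{Index}} = \mathrm{1161}_{16} - \mathrm{1161}_{16} = 0,\quad
T_{\mathrm{Index}} = \mathrm{11AB}_{16} - \mathrm{11A7}_{16} = 4,\\
S &= 44032 + (18 \cdot 21 + 0) \cdot 28 + 4 = 54620 = \mathrm{D55C}_{16}.
\end{aligned}
```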

Add Hangul composition char filter

6dac6f5

github-actions Bot added the module:analysis label Jun 11, 2026

github-actions Bot added this to the 11.0.0 milestone Jun 11, 2026

rmuir approved these changes Jun 15, 2026

Incheonkirin wants to merge 1 commit into apache:main from Incheonkirin:hangul-composition-charfilter