Add Hangul composition char filter to analysis-nori#16242

Open
Incheonkirin wants to merge 1 commit into apache:main from Incheonkirin:hangul-composition-charfilter

@Incheonkirin

Addresses #16241

Summary

Add an opt-in HangulCompositionCharFilter to analysis-nori. The filter composes modern Hangul conjoining-jamo sequences into precomposed Hangul syllables before KoreanTokenizer, so that NFD-form Korean text is analyzed the same way as the equivalent NFC text while offset correction still maps back to the original input.

The filter is intentionally narrow: it handles only modern L/V/optional-T conjoining jamo sequences and leaves compatibility jamo, archaic jamo, partial sequences, already-precomposed Korean text, and non-Hangul text unchanged. For general Unicode normalization the ICU module's ICUNormalizer2CharFilter remains the right tool; this covers the common Korean-only case without adding the ICU dependency to nori deployments.
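As an illustration of the scope described above, here is a minimal standalone sketch of composing modern L/V/optional-T sequences while passing everything else through. It is not the code in this PR: the class name `HangulComposeSketch` and the method `compose` are hypothetical, and the constants are the ones published for Hangul syllable composition in the Unicode Standard.

```java
// Hypothetical illustration only; not the filter added by this PR.
final class HangulComposeSketch {
  // Constants from the Unicode Standard's Hangul syllable composition algorithm.
  static final int S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7;
  static final int L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;

  static String compose(String input) {
    StringBuilder out = new StringBuilder(input.length());
    int i = 0;
    while (i < input.length()) {
      char c = input.charAt(i);
      int l = c - L_BASE;
      // A syllable needs a modern leading consonant followed by a modern vowel.
      if (l >= 0 && l < L_COUNT && i + 1 < input.length()) {
        int v = input.charAt(i + 1) - V_BASE;
        if (v >= 0 && v < V_COUNT) {
          int t = 0; // index 0 means "no trailing consonant"
          int consumed = 2;
          if (i + 2 < input.length()) {
            int candidate = input.charAt(i + 2) - T_BASE;
            // Valid modern trailing consonants have indices 1..27.
            if (candidate > 0 && candidate < T_COUNT) {
              t = candidate;
              consumed = 3;
            }
          }
          out.append((char) (S_BASE + (l * V_COUNT + v) * T_COUNT + t));
          i += consumed;
          continue;
        }
      }
      // Compatibility jamo, archaic jamo, partial sequences, precomposed
      // syllables, and non-Hangul text all fall through unchanged.
      out.append(c);
      i++;
    }
    return out.toString();
  }
}
```

With this sketch, `compose("\u1112\u1161\u11AB")` yields the single syllable U+D55C, whereas a precomposed LV syllable followed by a trailing jamo (for example U+D558 U+11AB) is left as two code units, matching the passthrough behavior listed in the tests.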

Tests

  • NFD Korean sentence through HangulCompositionCharFilter + KoreanTokenizer matches NFC KoreanTokenizer terms/POS
  • offsets from analyzed NFD text map back to the original NFD input
  • randomized modern Hangul NFD composition matches NFC
  • non-modern and partial jamo sequences unchanged
  • already-NFC and no-op inputs unchanged
  • precomposed-LV + trailing jamo passthrough (out-of-scope shape unchanged)
  • factory registration
  • bogus factory arguments
  • random analyzer data
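The offset test above depends on the fact that composition shortens the text, so token offsets computed on the composed text have to be translated back to positions in the original NFD input. The following standalone sketch shows one way to record that mapping. It is an assumption-laden illustration rather than the PR's implementation: `OffsetMapSketch`, `of`, and `correctOffset` are hypothetical names. In Lucene itself, `BaseCharFilter` provides `addOffCorrectMap` for this kind of bookkeeping; whether this PR uses it is not stated here.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the filter added by this PR.
final class OffsetMapSketch {
  static final int S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7;
  static final int L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;

  final String composed;  // text after composition
  final int[] inputStart; // inputStart[k] = input index where output char k begins

  private OffsetMapSketch(String composed, int[] inputStart) {
    this.composed = composed;
    this.inputStart = inputStart;
  }

  private static boolean inRange(char c, int base, int count) {
    int d = c - base;
    return d >= 0 && d < count;
  }

  static OffsetMapSketch of(String input) {
    StringBuilder out = new StringBuilder(input.length());
    int[] starts = new int[input.length() + 1];
    int i = 0;
    while (i < input.length()) {
      starts[out.length()] = i; // remember where this output char came from
      char c = input.charAt(i);
      if (inRange(c, L_BASE, L_COUNT)
          && i + 1 < input.length()
          && inRange(input.charAt(i + 1), V_BASE, V_COUNT)) {
        int t = 0;
        int used = 2;
        // Trailing consonants occupy indices 1..27 relative to T_BASE.
        if (i + 2 < input.length() && inRange(input.charAt(i + 2), T_BASE + 1, T_COUNT - 1)) {
          t = input.charAt(i + 2) - T_BASE;
          used = 3;
        }
        int l = c - L_BASE;
        int v = input.charAt(i + 1) - V_BASE;
        out.append((char) (S_BASE + (l * V_COUNT + v) * T_COUNT + t));
        i += used;
      } else {
        out.append(c);
        i++;
      }
    }
    starts[out.length()] = input.length(); // end-of-text offset
    return new OffsetMapSketch(out.toString(), Arrays.copyOf(starts, out.length() + 1));
  }

  // Translate an offset in the composed text to an offset in the original input.
  int correctOffset(int composedOffset) {
    return inputStart[composedOffset];
  }
}
```

For the seven-unit NFD input U+1112 U+1161 U+11AB, space, U+1100 U+1173 U+11AF, the composed text has three code units, and a token covering composed offsets [2, 3) maps back to input offsets [4, 7).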

Verification

  • ./gradlew :lucene:analysis:nori:tidy
  • ./gradlew :lucene:analysis:nori:check
  • ./gradlew check

@github-actions (bot) added this to the 11.0.0 milestone on Jun 11, 2026

@rmuir left a comment

Member

the use-case makes sense to me.

constants and formula match directly with 3.12.3 Hangul Syllable Composition
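For readers without the specification at hand, the arithmetic being referred to can be written as follows. This is a restatement under the assumption that the filter uses the standard constants (SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7, VCount = 21, TCount = 28); the specification expresses the same value via an intermediate LVIndex.

$$
\begin{aligned}
\mathit{LIndex} &= L - \mathit{LBase}, \quad \mathit{VIndex} = V - \mathit{VBase}, \quad \mathit{TIndex} = T - \mathit{TBase} \ (\text{or } 0 \text{ if no } T) \\
S &= \mathit{SBase} + (\mathit{LIndex} \cdot \mathit{VCount} + \mathit{VIndex}) \cdot \mathit{TCount} + \mathit{TIndex}
\end{aligned}
$$

Worked example for the sequence U+1112, U+1161, U+11AB:

$$
\begin{aligned}
\mathit{LIndex} &= 18, \quad \mathit{VIndex} = 0, \quad \mathit{TIndex} = 4 \\
S &= 44032 + (18 \cdot 21 + 0) \cdot 28 + 4 = 44032 + 10584 + 4 = 54620 = \text{U+D55C}
\end{aligned}
$$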
