Add new tokenizer interface in monlp. Added gojieba for dictionary-based Chinese word breaking. #24272
fengttt wants to merge 8 commits into matrixorigin:main
Conversation
Add a Tokenizer interface with a single Tokenize([]byte) method. NewSimpleTokenizer() now takes no parameters; the input is passed to Tokenize, which resets internal state on each call. Update all callers in pkg/fulltext and pkg/sql/colexec/table_function accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds JiebaTokenizer implementing the Tokenizer interface using github.com/yanyiwu/gojieba for Chinese word segmentation. Factory NewJiebaTokenizer(useHmm bool) toggles HMM new-word discovery. Output drops pure breaker tokens, lowers Latin runs, and truncates to MAX_TOKEN_SIZE on a UTF-8 boundary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
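The commit mentions truncating tokens to MAX_TOKEN_SIZE on a UTF-8 boundary. A minimal sketch of that idea (the constant and helper names are illustrative, not MatrixOne's actual code): back up from the byte limit past any UTF-8 continuation bytes (of the form 0b10xxxxxx) so a multi-byte rune is never cut in half.

```go
package main

import "fmt"

// truncateUTF8 cuts s to at most limit bytes without splitting a rune:
// every UTF-8 continuation byte matches 0b10xxxxxx, so we back up until
// the byte at the cut point starts a new rune (or is plain ASCII).
func truncateUTF8(s []byte, limit int) []byte {
	if len(s) <= limit {
		return s
	}
	cut := limit
	for cut > 0 && s[cut]&0xC0 == 0x80 {
		cut--
	}
	return s[:cut]
}

func main() {
	// "你" is 3 bytes (e4 bd a0); a naive cut at 4 bytes would split "好".
	s := []byte("你好")
	fmt.Println(string(truncateUTF8(s, 4))) // prints 你
}
```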
Allows CREATE FULLTEXT INDEX ... WITH PARSER gojieba to use gojieba word segmentation for both index build and query tokenization, replacing ngram bigrams for Chinese text. - DDL accepts WITH PARSER gojieba; rejects unknown parsers as before. - Index build (fulltext_index_tokenize) uses HMM=false for stable, reproducible segmentation across deployments. - Query path (ParsePatternInNLMode + boolean-mode keyword reseg in GenTextSql) uses HMM=true for broader recall on unseen terms. - ParsePattern / ParsePatternInNLMode take parser; NewSearchAccum extracts it from the JSON params already threaded through fulltext_index_scan, so existing callers keep their signatures. - SharedJiebaTokenizer(useHmm) memoizes two singletons; the ~1s dictionary load is paid once per HMM setting per process. BVT: test/distributed/cases/fulltext/gojieba.sql mirrors the default simple-tokenizer coverage in fulltext.sql against gojieba. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
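The memoization described above (one singleton per HMM setting, paying the ~1s dictionary load once per process) is the classic one-sync.Once-per-variant pattern. A hedged sketch — names are illustrative, and the costly gojieba dictionary load is simulated with a counter:

```go
package main

import (
	"fmt"
	"sync"
)

// Tokenizer stands in for the real jieba-backed tokenizer.
type Tokenizer struct{ useHmm bool }

var (
	loads      int // counts how often the expensive constructor ran
	onceHmm    sync.Once
	onceNoHmm  sync.Once
	sharedHmm  *Tokenizer
	sharedNone *Tokenizer
)

func newTokenizer(useHmm bool) *Tokenizer {
	loads++ // simulate the ~1s dictionary load
	return &Tokenizer{useHmm: useHmm}
}

// SharedTokenizer returns a process-wide singleton per HMM setting;
// each sync.Once guarantees the load runs at most once even under
// concurrent callers.
func SharedTokenizer(useHmm bool) *Tokenizer {
	if useHmm {
		onceHmm.Do(func() { sharedHmm = newTokenizer(true) })
		return sharedHmm
	}
	onceNoHmm.Do(func() { sharedNone = newTokenizer(false) })
	return sharedNone
}

func main() {
	a := SharedTokenizer(true)
	b := SharedTokenizer(true)
	c := SharedTokenizer(false)
	fmt.Println(a == b, a == c, loads) // prints true false 2
}
```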
The catalog path that serializes fulltext index params into JSON (fullTextIndexParamsToMap) had its own parser allowlist that wasn't updated alongside build_ddl's. WITH PARSER gojieba reached DDL validation but failed during catalog write with "internal error: invalid parser gojieba". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gojieba is a Chinese-first segmenter: with HMM=false an English word
like "color" has no dictionary entry and falls through to per-character
tokens (c, o, l, o, r), which broke fulltext search on English columns
indexed WITH PARSER gojieba — "select ... where match against('red')"
returned no rows.
Split the input at the ASCII / non-ASCII byte boundary (always a safe
UTF-8 char edge since every multi-byte UTF-8 byte is >= 0x80). ASCII
spans go through SimpleTokenizer's Latin path (proper word splits,
lowercasing, MAX_TOKEN_SIZE truncation); the rest goes to gojieba with
the configured useHmm. This keeps useHmm=false stable for Chinese at
index build time without losing English/digit tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
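The split described above relies on a UTF-8 property: every byte of a multi-byte sequence is >= 0x80, so any ASCII/non-ASCII byte boundary is a safe character edge. A simplified sketch of the pre-split (the helper and chunk type are illustrative, not the PR's actual code); the offset field mirrors the role of chunkOff in global byte-position tracking:

```go
package main

import "fmt"

// chunk is a maximal run of bytes that are all ASCII or all non-ASCII.
type chunk struct {
	off   int // byte offset in the original input (cf. chunkOff)
	ascii bool
	data  []byte
}

// splitAsciiRuns partitions input at ASCII / non-ASCII byte boundaries.
// Every multi-byte UTF-8 byte is >= 0x80, so each boundary is a valid
// character edge and no rune is ever split across chunks.
func splitAsciiRuns(input []byte) []chunk {
	var out []chunk
	start := 0
	for i := 0; i <= len(input); i++ {
		atEnd := i == len(input)
		if !atEnd && (input[i] < 0x80) == (input[start] < 0x80) {
			continue // same class, keep extending the current run
		}
		if i > start {
			out = append(out, chunk{
				off:   start,
				ascii: input[start] < 0x80,
				data:  input[start:i],
			})
		}
		start = i
	}
	return out
}

func main() {
	// ASCII spans would go to SimpleTokenizer, the rest to gojieba.
	for _, c := range splitAsciiRuns([]byte("red颜色 color")) {
		fmt.Printf("off=%d ascii=%v %q\n", c.off, c.ascii, string(c.data))
	}
}
```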
Upstream gojieba.NewJieba() resolves its dict directory via runtime.Caller, baking the GOMODCACHE path into the binary at compile time. Multi-stage Docker builds and CI runners with stripped module caches panic at runtime with "Dictionary file does not exist". Vendor cppjieba's dict files under pkg/monlp/tokenizer/dict/ and pass them explicitly to NewJieba. Resolution checks MO_JIEBA_DICT_DIR first, then falls back to a runtime.Caller-relative path that is stable for source builds and go test. Production Dockerfiles copy the vendored dir to /usr/local/share/jieba and set MO_JIEBA_DICT_DIR.
Pull request overview
This PR adds a new fulltext tokenizer/parser option (gojieba) to support dictionary-based Chinese word segmentation, threading the parser choice from DDL/index params through planning, execution (tokenize + scan), and fulltext pattern parsing. It also introduces a tokenizer interface in pkg/monlp/tokenizer and adds BVT coverage plus runtime dictionary packaging for Docker images.
Changes:
- Add gojieba as an allowed FULLTEXT parser and propagate the parser selection through fulltext params → pattern parsing → SQL generation/execution.
- Introduce a Tokenizer interface, refactor SimpleTokenizer into a reusable tokenizer object, and add a new JiebaTokenizer implementation (with shared singletons).
- Add distributed tests for WITH PARSER gojieba and package jieba dictionaries into Docker images via MO_JIEBA_DICT_DIR.
Reviewed changes
Copilot reviewed 24 out of 29 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/distributed/cases/fulltext/gojieba.sql | New distributed SQL test coverage for fulltext indexing/search using WITH PARSER gojieba. |
| test/distributed/cases/fulltext/gojieba.result | Expected output for the new gojieba distributed test case. |
| pkg/sql/plan/fulltext.go | Pass parser selection into fulltext pattern parsing when generating fulltext scan SQL. |
| pkg/sql/plan/build_ddl.go | Allow gojieba as a supported FULLTEXT parser in DDL validation. |
| pkg/sql/colexec/table_function/fulltext.go | Thread table-function params into NewSearchAccum so runtime pattern parsing can respect parser choice. |
| pkg/sql/colexec/table_function/fulltext_tokenize.go | Add gojieba tokenization path for index build-time tokenization. |
| pkg/monlp/tokenizer/simple2_test.go | Update tests to new SimpleTokenizer construction + Tokenize(input) API. |
| pkg/monlp/tokenizer/simple.go | Add Tokenizer interface; refactor SimpleTokenizer to accept input per Tokenize() call. |
| pkg/monlp/tokenizer/simple_test.go | Update tests for the refactored SimpleTokenizer API. |
| pkg/monlp/tokenizer/jieba.go | New gojieba-backed tokenizer implementation, including shared singleton instances and ASCII/CJK handling. |
| pkg/monlp/tokenizer/jieba_test.go | Unit tests for gojieba tokenization semantics (HMM vs dictionary-only, byte offsets, truncation, interface conformance). |
| pkg/monlp/tokenizer/jieba_dict.go | Dictionary path resolution logic via MO_JIEBA_DICT_DIR or vendored dict directory. |
| pkg/monlp/tokenizer/dict/user.dict.utf8 | Add custom user dictionary entries for jieba. |
| pkg/monlp/tokenizer/dict/stop_words.utf8 | Add stop-words list for jieba. |
| pkg/monlp/tokenizer/dict/README.md | Document dictionary file purposes (CppJieba dictionary set). |
| pkg/monlp/tokenizer/dict/LICENSE | Add dictionary license file (MIT). |
| pkg/fulltext/types.go | Update helpers to call the new ParsePattern(pattern, mode, parser) signature. |
| pkg/fulltext/sql.go | Route NL-mode pattern splitting through parser-aware ParsePatternInNLMode(..., parser). |
| pkg/fulltext/jieba_test.go | New unit tests verifying parser routing, params parsing, and boolean-mode jieba behavior. |
| pkg/fulltext/fulltext.go | Add parser extraction from params, jieba NL parsing path, and parser-aware fulltext parsing API surface. |
| pkg/catalog/secondary_index_utils.go | Allow gojieba parser in index option parsing/validation. |
| optools/images/gpu/Dockerfile | Copy jieba dicts into image and set MO_JIEBA_DICT_DIR for runtime resolution. |
| optools/images/Dockerfile | Same: package jieba dicts and set MO_JIEBA_DICT_DIR. |
| go.sum | Add gojieba dependency checksums. |
| go.mod | Add github.com/yanyiwu/gojieba v1.4.7 dependency. |
| .gitignore | Ignore additional local files (e.g. CLAUDE.md, logs, scratch SQL). |
func resolveJiebaDictDir() string {
    if d := os.Getenv("MO_JIEBA_DICT_DIR"); d != "" {
        if _, err := os.Stat(filepath.Join(d, "jieba.dict.utf8")); err == nil {
            return d
        }
    }
    _, self, _, ok := runtime.Caller(0)
    if ok {
        d := filepath.Join(filepath.Dir(self), "dict")
        if _, err := os.Stat(filepath.Join(d, "jieba.dict.utf8")); err == nil {
            return d
        }
    }
    return ""
}
        list = append(list, &Pattern{Text: word, Operator: TEXT, Position: t.BytePos})
    }
    if len(list) == 0 {
        return nil, moerr.NewInternalErrorNoCtx("Invalid input search string. search string onverted to empty pattern")
aunjgr
left a comment
Revised Review: gojieba Chinese Word Breaking (deeper analysis)
I'm updating my previous review after deeper analysis. Some concerns are resolved, one remains blocking, and I have new findings.
Resolved from prior review
Thread safety: CONFIRMED SAFE. gojieba's C++ Cut methods are const — all mutable state is stack-local. The author confirmed thread safety in cppjieba issues #76/#82. SharedJiebaTokenizer concurrent access is safe as long as AddWord/RemoveWord are never called (they aren't in this PR).
Blocking Issues
1. resolveJiebaDictDir returns empty string on failure → panic at runtime
If neither MO_JIEBA_DICT_DIR is set nor the source tree exists on disk (standard case for compiled binaries on bare-metal/K8s without the env var), the function returns "". This feeds paths like "/jieba.dict.utf8" to gojieba.NewJieba(), which panics (not errors) if files are missing. The panic will happen inside sync.Once, making it unrecoverable.
Fix: return an explicit error from resolveJiebaDictDir and propagate it up through SharedJiebaTokenizer / NewJiebaTokenizer. Or at minimum, check the returned dir is non-empty and return a clear error from Tokenize.
The Dockerfile sets MO_JIEBA_DICT_DIR correctly, but not all deployment paths go through Docker (dev machines, bare-metal prod, systemd units).
2. ~610K lines of dictionary data in git (product decision, flagging for awareness)
I'll downgrade this from "blocking" to "strong advisory" — it's a legitimate tradeoff (no network dependency at build time vs. permanent clone bloat). But the team should explicitly decide. These files will exist in git history forever even if later removed. Consider git-lfs as a middle ground.
Advisory Issues
3. .gitignore adds unrelated entries
CLAUDE.md, mo-service.log, bug.sql are unrelated to gojieba. Should be a separate commit.
4. SimpleTokenizer struct reuse is not concurrency-safe (non-issue in practice)
The new Tokenize(input) mutates receiver state, making a shared *SimpleTokenizer unsafe for concurrent use. In practice every call site creates a fresh instance via NewSimpleTokenizer(), so this is fine. Just documenting for awareness — don't share instances.
5. parsePatternInNLModeJieba typo in error message
Line: "Invalid input search string. search string onverted to empty pattern" — "onverted" should be "converted", and the stray double space after the period should be collapsed.
Design Observations (positive)
- Index/query tokenization consistency is correct: both sides use SharedJiebaTokenizer(false) (HMM disabled), ensuring the same segmentation at index and query time.
- The ASCII/CJK pre-split in JiebaTokenizer.Tokenize is sound: it delegates Latin text to SimpleTokenizer's existing logic and avoids gojieba's per-character fallback for ASCII.
- Boolean mode threading is correct: ParsePatternInBooleanMode parses structure, then GenTextSql → ParsePatternInNLMode(text, "gojieba") re-tokenizes individual terms with jieba. fulltextIndexMatch now correctly passes tableFunction.Params instead of "", connecting the search side to the parser config.
- Token BytePos tracking with chunkOff + w.Start correctly computes global byte offsets across the ASCII/CJK split.
Summary
The architecture is well thought out. The one real blocker is the panic-on-missing-dict path — a production service shouldn't panic inside a sync.Once with no recovery path. Everything else is minor.
What type of PR is this?
Which issue(s) this PR fixes:
issue #24271
What this PR does / why we need it:
Add dictionary-based Chinese word breaking using gojieba.