Add new tokenizer interface in monlp. Added gojieba for dictionary-based Chinese word breaking. #24272
fengttt wants to merge 8 commits into matrixorigin:main
Conversation
Add a Tokenizer interface with a single Tokenize([]byte) method. NewSimpleTokenizer() now takes no parameters; the input is passed to Tokenize, which resets internal state on each call. Update all callers in pkg/fulltext and pkg/sql/colexec/table_function accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds JiebaTokenizer implementing the Tokenizer interface using github.com/yanyiwu/gojieba for Chinese word segmentation. Factory NewJiebaTokenizer(useHmm bool) toggles HMM new-word discovery. Output drops pure breaker tokens, lowers Latin runs, and truncates to MAX_TOKEN_SIZE on a UTF-8 boundary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
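The commit mentions truncating tokens to MAX_TOKEN_SIZE on a UTF-8 boundary. A minimal sketch of that idea (the constant and helper names are illustrative, not MatrixOne's actual code): back up from the byte limit past any UTF-8 continuation bytes (of the form 0b10xxxxxx) so a multi-byte rune is never cut in half.

```go
package main

import "fmt"

// truncateUTF8 cuts s to at most limit bytes without splitting a rune:
// every UTF-8 continuation byte matches 0b10xxxxxx, so we back up until
// the byte at the cut point starts a new rune (or is plain ASCII).
func truncateUTF8(s []byte, limit int) []byte {
	if len(s) <= limit {
		return s
	}
	cut := limit
	for cut > 0 && s[cut]&0xC0 == 0x80 {
		cut--
	}
	return s[:cut]
}

func main() {
	// "你" is 3 bytes (e4 bd a0); a naive cut at 4 bytes would split "好".
	s := []byte("你好")
	fmt.Println(string(truncateUTF8(s, 4))) // prints 你
}
```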
Allows CREATE FULLTEXT INDEX ... WITH PARSER gojieba to use gojieba word segmentation for both index build and query tokenization, replacing ngram bigrams for Chinese text. - DDL accepts WITH PARSER gojieba; rejects unknown parsers as before. - Index build (fulltext_index_tokenize) uses HMM=false for stable, reproducible segmentation across deployments. - Query path (ParsePatternInNLMode + boolean-mode keyword reseg in GenTextSql) uses HMM=true for broader recall on unseen terms. - ParsePattern / ParsePatternInNLMode take parser; NewSearchAccum extracts it from the JSON params already threaded through fulltext_index_scan, so existing callers keep their signatures. - SharedJiebaTokenizer(useHmm) memoizes two singletons; the ~1s dictionary load is paid once per HMM setting per process. BVT: test/distributed/cases/fulltext/gojieba.sql mirrors the default simple-tokenizer coverage in fulltext.sql against gojieba. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
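The memoization described above (one singleton per HMM setting, paying the ~1s dictionary load once per process) is the classic one-sync.Once-per-variant pattern. A hedged sketch — names are illustrative, and the costly gojieba dictionary load is simulated with a counter:

```go
package main

import (
	"fmt"
	"sync"
)

// Tokenizer stands in for the real jieba-backed tokenizer.
type Tokenizer struct{ useHmm bool }

var (
	loads      int // counts how often the expensive constructor ran
	onceHmm    sync.Once
	onceNoHmm  sync.Once
	sharedHmm  *Tokenizer
	sharedNone *Tokenizer
)

func newTokenizer(useHmm bool) *Tokenizer {
	loads++ // simulate the ~1s dictionary load
	return &Tokenizer{useHmm: useHmm}
}

// SharedTokenizer returns a process-wide singleton per HMM setting;
// each sync.Once guarantees the load runs at most once even under
// concurrent callers.
func SharedTokenizer(useHmm bool) *Tokenizer {
	if useHmm {
		onceHmm.Do(func() { sharedHmm = newTokenizer(true) })
		return sharedHmm
	}
	onceNoHmm.Do(func() { sharedNone = newTokenizer(false) })
	return sharedNone
}

func main() {
	a := SharedTokenizer(true)
	b := SharedTokenizer(true)
	c := SharedTokenizer(false)
	fmt.Println(a == b, a == c, loads) // prints true false 2
}
```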
The catalog path that serializes fulltext index params into JSON (fullTextIndexParamsToMap) had its own parser allowlist that wasn't updated alongside build_ddl's. WITH PARSER gojieba reached DDL validation but failed during catalog write with "internal error: invalid parser gojieba". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gojieba is a Chinese-first segmenter: with HMM=false an English word
like "color" has no dictionary entry and falls through to per-character
tokens (c, o, l, o, r), which broke fulltext search on English columns
indexed WITH PARSER gojieba — "select ... where match against('red')"
returned no rows.
Split the input at the ASCII / non-ASCII byte boundary (always a safe
UTF-8 char edge since every multi-byte UTF-8 byte is >= 0x80). ASCII
spans go through SimpleTokenizer's Latin path (proper word splits,
lowercasing, MAX_TOKEN_SIZE truncation); the rest goes to gojieba with
the configured useHmm. This keeps useHmm=false stable for Chinese at
index build time without losing English/digit tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
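The split described above relies on a UTF-8 property: every byte of a multi-byte sequence is >= 0x80, so any ASCII/non-ASCII byte boundary is a safe character edge. A simplified sketch of the pre-split (the helper and chunk type are illustrative, not the PR's actual code); the offset field mirrors the role of chunkOff in global byte-position tracking:

```go
package main

import "fmt"

// chunk is a maximal run of bytes that are all ASCII or all non-ASCII.
type chunk struct {
	off   int // byte offset in the original input (cf. chunkOff)
	ascii bool
	data  []byte
}

// splitAsciiRuns partitions input at ASCII / non-ASCII byte boundaries.
// Every multi-byte UTF-8 byte is >= 0x80, so each boundary is a valid
// character edge and no rune is ever split across chunks.
func splitAsciiRuns(input []byte) []chunk {
	var out []chunk
	start := 0
	for i := 0; i <= len(input); i++ {
		atEnd := i == len(input)
		if !atEnd && (input[i] < 0x80) == (input[start] < 0x80) {
			continue // same class, keep extending the current run
		}
		if i > start {
			out = append(out, chunk{
				off:   start,
				ascii: input[start] < 0x80,
				data:  input[start:i],
			})
		}
		start = i
	}
	return out
}

func main() {
	// ASCII spans would go to SimpleTokenizer, the rest to gojieba.
	for _, c := range splitAsciiRuns([]byte("red颜色 color")) {
		fmt.Printf("off=%d ascii=%v %q\n", c.off, c.ascii, string(c.data))
	}
}
```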
Upstream gojieba.NewJieba() resolves its dict directory via runtime.Caller, baking the GOMODCACHE path into the binary at compile time. Multi-stage Docker builds and CI runners with stripped module caches panic at runtime with "Dictionary file does not exist". Vendor cppjieba's dict files under pkg/monlp/tokenizer/dict/ and pass them explicitly to NewJieba. Resolution checks MO_JIEBA_DICT_DIR first, then falls back to a runtime.Caller-relative path that is stable for source builds and go test. Production Dockerfiles copy the vendored dir to /usr/local/share/jieba and set MO_JIEBA_DICT_DIR.
Pull request overview
This PR adds a new fulltext tokenizer/parser option (gojieba) to support dictionary-based Chinese word segmentation, threading the parser choice from DDL/index params through planning, execution (tokenize + scan), and fulltext pattern parsing. It also introduces a tokenizer interface in pkg/monlp/tokenizer and adds BVT coverage plus runtime dictionary packaging for Docker images.
Changes:
- Add gojieba as an allowed FULLTEXT parser and propagate the parser selection through fulltext params → pattern parsing → SQL generation/execution.
- Introduce a Tokenizer interface, refactor SimpleTokenizer into a reusable tokenizer object, and add a new JiebaTokenizer implementation (with shared singletons).
- Add distributed tests for WITH PARSER gojieba and package jieba dictionaries into Docker images via MO_JIEBA_DICT_DIR.
Reviewed changes
Copilot reviewed 24 out of 29 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/distributed/cases/fulltext/gojieba.sql | New distributed SQL test coverage for fulltext indexing/search using WITH PARSER gojieba. |
| test/distributed/cases/fulltext/gojieba.result | Expected output for the new gojieba distributed test case. |
| pkg/sql/plan/fulltext.go | Pass parser selection into fulltext pattern parsing when generating fulltext scan SQL. |
| pkg/sql/plan/build_ddl.go | Allow gojieba as a supported FULLTEXT parser in DDL validation. |
| pkg/sql/colexec/table_function/fulltext.go | Thread table-function params into NewSearchAccum so runtime pattern parsing can respect parser choice. |
| pkg/sql/colexec/table_function/fulltext_tokenize.go | Add gojieba tokenization path for index build-time tokenization. |
| pkg/monlp/tokenizer/simple2_test.go | Update tests to new SimpleTokenizer construction + Tokenize(input) API. |
| pkg/monlp/tokenizer/simple.go | Add Tokenizer interface; refactor SimpleTokenizer to accept input per Tokenize() call. |
| pkg/monlp/tokenizer/simple_test.go | Update tests for the refactored SimpleTokenizer API. |
| pkg/monlp/tokenizer/jieba.go | New gojieba-backed tokenizer implementation, including shared singleton instances and ASCII/CJK handling. |
| pkg/monlp/tokenizer/jieba_test.go | Unit tests for gojieba tokenization semantics (HMM vs dictionary-only, byte offsets, truncation, interface conformance). |
| pkg/monlp/tokenizer/jieba_dict.go | Dictionary path resolution logic via MO_JIEBA_DICT_DIR or vendored dict directory. |
| pkg/monlp/tokenizer/dict/user.dict.utf8 | Add custom user dictionary entries for jieba. |
| pkg/monlp/tokenizer/dict/stop_words.utf8 | Add stop-words list for jieba. |
| pkg/monlp/tokenizer/dict/README.md | Document dictionary file purposes (CppJieba dictionary set). |
| pkg/monlp/tokenizer/dict/LICENSE | Add dictionary license file (MIT). |
| pkg/fulltext/types.go | Update helpers to call the new ParsePattern(pattern, mode, parser) signature. |
| pkg/fulltext/sql.go | Route NL-mode pattern splitting through parser-aware ParsePatternInNLMode(..., parser). |
| pkg/fulltext/jieba_test.go | New unit tests verifying parser routing, params parsing, and boolean-mode jieba behavior. |
| pkg/fulltext/fulltext.go | Add parser extraction from params, jieba NL parsing path, and parser-aware fulltext parsing API surface. |
| pkg/catalog/secondary_index_utils.go | Allow gojieba parser in index option parsing/validation. |
| optools/images/gpu/Dockerfile | Copy jieba dicts into image and set MO_JIEBA_DICT_DIR for runtime resolution. |
| optools/images/Dockerfile | Same: package jieba dicts and set MO_JIEBA_DICT_DIR. |
| go.sum | Add gojieba dependency checksums. |
| go.mod | Add github.com/yanyiwu/gojieba v1.4.7 dependency. |
| .gitignore | Ignore additional local files (e.g. CLAUDE.md, logs, scratch SQL). |
func resolveJiebaDictDir() string {
    if d := os.Getenv("MO_JIEBA_DICT_DIR"); d != "" {
        if _, err := os.Stat(filepath.Join(d, "jieba.dict.utf8")); err == nil {
            return d
        }
    }
    _, self, _, ok := runtime.Caller(0)
    if ok {
        d := filepath.Join(filepath.Dir(self), "dict")
        if _, err := os.Stat(filepath.Join(d, "jieba.dict.utf8")); err == nil {
            return d
        }
    }
    return ""
}
        list = append(list, &Pattern{Text: word, Operator: TEXT, Position: t.BytePos})
    }
    if len(list) == 0 {
        return nil, moerr.NewInternalErrorNoCtx("Invalid input search string. search string onverted to empty pattern")
aunjgr
left a comment
Revised Review: gojieba Chinese Word Breaking (deeper analysis)
I'm updating my previous review after deeper analysis. Some concerns are resolved, one remains blocking, and I have new findings.
Resolved from prior review
Thread safety: CONFIRMED SAFE. gojieba's C++ Cut methods are const — all mutable state is stack-local. The author confirmed thread safety in cppjieba issues #76/#82. SharedJiebaTokenizer concurrent access is safe as long as AddWord/RemoveWord are never called (they aren't in this PR).
Blocking Issues
1. resolveJiebaDictDir returns empty string on failure → panic at runtime
If neither MO_JIEBA_DICT_DIR is set nor the source tree exists on disk (standard case for compiled binaries on bare-metal/K8s without the env var), the function returns "". This feeds paths like "/jieba.dict.utf8" to gojieba.NewJieba(), which panics (not errors) if files are missing. The panic will happen inside sync.Once, making it unrecoverable.
Fix: return an explicit error from resolveJiebaDictDir and propagate it up through SharedJiebaTokenizer / NewJiebaTokenizer. Or at minimum, check the returned dir is non-empty and return a clear error from Tokenize.
The Dockerfile sets MO_JIEBA_DICT_DIR correctly, but not all deployment paths go through Docker (dev machines, bare-metal prod, systemd units).
2. ~610K lines of dictionary data in git (product decision, flagging for awareness)
I'll downgrade this from "blocking" to "strong advisory" — it's a legitimate tradeoff (no network dependency at build time vs. permanent clone bloat). But the team should explicitly decide. These files will exist in git history forever even if later removed. Consider git-lfs as a middle ground.
Advisory Issues
3. .gitignore adds unrelated entries
CLAUDE.md, mo-service.log, bug.sql are unrelated to gojieba. Should be a separate commit.
4. SimpleTokenizer struct reuse is not concurrency-safe (non-issue in practice)
The new Tokenize(input) mutates receiver state, making a shared *SimpleTokenizer unsafe for concurrent use. In practice every call site creates a fresh instance via NewSimpleTokenizer(), so this is fine. Just documenting for awareness — don't share instances.
5. parsePatternInNLModeJieba typo in error message
Line: "Invalid input search string. search string onverted to empty pattern" — "onverted" should be "converted", and the stray double space after the period should be collapsed.
Design Observations (positive)
- Index/query tokenization consistency is correct: both sides use SharedJiebaTokenizer(false) (HMM disabled), ensuring the same segmentation at index and query time.
- The ASCII/CJK pre-split in JiebaTokenizer.Tokenize is sound: it delegates Latin text to SimpleTokenizer's existing logic and avoids gojieba's per-character fallback for ASCII.
- Boolean mode threading is correct: ParsePatternInBooleanMode parses structure, then GenTextSql → ParsePatternInNLMode(text, "gojieba") re-tokenizes individual terms with jieba. fulltextIndexMatch now correctly passes tableFunction.Params instead of "", connecting the search side to the parser config.
- Token BytePos tracking with chunkOff + w.Start correctly computes global byte offsets across the ASCII/CJK split.
Summary
The architecture is well thought out. The one real blocker is the panic-on-missing-dict path — a production service shouldn't panic inside a sync.Once with no recovery path. Everything else is minor.
What type of PR is this?
Which issue(s) this PR fixes:
issue #24271
What this PR does / why we need it:
Add dictionary-based Chinese word breaking using gojieba.