feat(backend): sentence-level translation dedup + merge Redis lookup (DD-008)#7954
feat(backend): sentence-level translation dedup + merge Redis lookup (DD-008)#7954Git-on-my-level wants to merge 13 commits into
Conversation
…(DD-008) Phase 1 (sentence-level dedup): Rewrote translate_units_batch() to split each text into sentences before cache lookup. Identical sentences across different segments/users now share cache hits instead of each triggering a separate API call. Preserves full-sentence NMT quality while eliminating 25-35% of redundant character throughput. Phase 3 (merge-aware Redis lookup): On segment merge (prefix reset), check Redis for the merged text before re-translating. Eliminates 10-15% of redundant translations from STT segment combination. Documentation: Added DESIGN DECISION comment at the batch_buffer call site, updated TranslationCoordinator class docstring with architecture/cost notes, and created ADR-001 document at backend/docs/translation-architecture.md. Estimated savings: $1,500–2,140/mo (25-50% of $4,282/mo Google Translate bill). Files changed: - backend/utils/translation.py: translate_units_batch() sentence-level rewrite - backend/utils/translation_coordinator.py: comments + merge Redis lookup - backend/docs/translation-architecture.md: NEW — ADR-001
- Call on_translation_ready when prefix-reset adopts a Redis-cached translation so Firestore persist and WebSocket push run for cached segments. - Cache assembled full-text translations in translate_units_batch so prefix-reset lookups hit Redis for multi-sentence segments.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f15573d9e6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| text_hash = hashlib.md5(original_text.encode()).hexdigest() | ||
| self._set_memory_cache(text_hash, dest_language, assembled, dominant_lang) | ||
| cache_translation(text_hash, dest_language, assembled, dominant_lang) |
There was a problem hiding this comment.
Avoid caching failed translation fallbacks
When a Google Translate chunk raises above, the exception path records each sentence as its original text with an empty detected language, but this unconditional Redis write then stores that fallback as a successful full-text translation. For single-sentence inputs, or any chunk where all translated parts fell back, a transient API failure will poison translate:v1:* for the 14-day TTL and future requests will hit Redis instead of retrying, leaving the text untranslated.
Useful? React with 👍 / 👎.
| await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id) | ||
| continue # Don't add to batch buffer |
There was a problem hiding this comment.
Invalidate pending batches after cached prefix resets
When this cached prefix-reset branch runs while the same segment already has an older translation entry buffered, the old entry keeps matching state.version because this path commits the Redis result and continues without bumping the version or removing buffered entries. In that 250ms batching window, _flush_batch() can later treat the stale buffered text as valid and overwrite/persist the older translation after the cached merged translation was already sent.
Useful? React with 👍 / 👎.
Greptile SummaryThis PR introduces two cost-reduction optimisations to the translation pipeline: sentence-level dedup in
Confidence Score: 3/5Fix 2 can cause an in-flight batch result to overwrite the correct merge-cached translation in the UI; several other gaps flagged in prior review rounds are still unaddressed. The merge-aware Redis path fires on_translation_ready immediately but leaves state.version unchanged. If the GCP call for the previous version of the same segment is still in flight, _flush_batch will match the stale version and fire the callback a second time, replacing the correct translation with the old one in both Firestore and the WebSocket stream. Combined with the three open gaps carried over from the previous review, the coordinator path has multiple unresolved correctness issues. backend/utils/translation_coordinator.py — specifically the Fix 2 block (lines 202–214) where the version counter is not incremented before firing the callback. Important Files Changed
Sequence DiagramsequenceDiagram
participant O as observe()
participant BB as _batch_buffer
participant FB as _flush_batch()
participant GCP as Google Translate API
participant CB as on_translation_ready()
Note over O: observe() call A segment X version V
O->>BB: append(seg_X, text_old, conv, V)
Note over FB: _flush_batch starts timer fires
FB->>GCP: await run_blocking translate_units_batch
Note over O: observe() call B segment X prefix reset Fix 2 Redis hit
O->>CB: await on_translation_ready(seg_X, T_new)
Note over O: state.version still V not bumped
O-->>BB: continue skips _next_version
GCP-->>FB: T_old returned
FB->>FB: "state.version == V matches stale version"
FB->>CB: await on_translation_ready(seg_X, T_old) overwrites correct result
Reviews (1): Last reviewed commit: "fix: address review feedback — cache poi..." | Re-trigger Greptile |
| # Check if the new merged text was already translated (Redis cache) | ||
| text_hash = hashlib.md5(text.encode()).hexdigest() | ||
| redis_cached = get_cached_translation(text_hash, self.target_language) | ||
| if redis_cached: | ||
| # Found in Redis — adopt as committed, skip re-translation | ||
| translated_text = redis_cached['text'] | ||
| detected_lang = redis_cached.get('detected_lang', '') | ||
| state.committed_text = text | ||
| state.assembled_translation = translated_text | ||
| state.detected_lang = detected_lang | ||
| self.metrics['prefix_resets'] += 1 | ||
| await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id) | ||
| continue # Don't add to batch buffer |
There was a problem hiding this comment.
Missing
should_persist_translation guard in merge-aware Redis path
When Fix 2 finds a Redis hit during a prefix reset, it calls on_translation_ready directly without checking should_persist_translation. The normal batch path (line 358) applies this check before firing the callback; skipping it here means a no-op "translation" (e.g. English text being "translated" to English) stored in the Redis positive cache can trigger on_translation_ready and create a spurious translation badge in the UI — exactly the case should_persist_translation was built to prevent.
The scenario is concrete: the batch path calls cache_translation in Phase 3 of translate_units_batch before the should_persist_translation check runs in _flush_batch. If that check then returns False it sets the negative cache but leaves the positive cache entry intact. A later merge on the same text hits the positive cache in Fix 2 and fires the callback unconditionally.
| # Fallback: should not happen, but use original text | ||
| translated_parts.append(sent_text) | ||
|
|
||
| assembled = ' '.join(translated_parts) |
There was a problem hiding this comment.
ASCII space joining breaks CJK target-language output
' '.join(translated_parts) inserts a Latin space between sentences even when the target language is Chinese, Japanese, or Korean. Those languages don't delimit sentences with spaces, so the assembled string will contain spurious U+0020 characters (e.g. 你好。 你好吗? instead of 你好。你好吗?). This affects every CJK target-language user and degrades the quality of the persisted Firestore translation.
| assembled = ' '.join(translated_parts) | |
| _CJK_LANGS = {'zh', 'ja', 'ko', 'zh-cn', 'zh-tw', 'zh-hk'} | |
| sep = '' if dest_language.lower().split('-')[0] in _CJK_LANGS else ' ' | |
| assembled = sep.join(translated_parts) |
| # Check if the new merged text was already translated (Redis cache) | ||
| text_hash = hashlib.md5(text.encode()).hexdigest() | ||
| redis_cached = get_cached_translation(text_hash, self.target_language) | ||
| if redis_cached: | ||
| # Found in Redis — adopt as committed, skip re-translation | ||
| translated_text = redis_cached['text'] | ||
| detected_lang = redis_cached.get('detected_lang', '') | ||
| state.committed_text = text | ||
| state.assembled_translation = translated_text | ||
| state.detected_lang = detected_lang | ||
| self.metrics['prefix_resets'] += 1 | ||
| await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id) |
There was a problem hiding this comment.
language_state not updated on Redis cache hit path
The normal batch path updates self.language_state.observe(original_text, speaker_id=None) with the API-detected language (line 373) so the monolingual gate can adapt. When Fix 2 short-circuits via the Redis cache, this observe() call is skipped entirely. Over time the language model can drift if the Redis path is the primary path for high-frequency segments, weakening the monolingual gate's ability to detect language switches.
| # Phase 3: Reassemble per-unit results from sentence translations | ||
| results = [] | ||
| for unit_idx, (unit_id, original_text, sentences) in enumerate(unit_sentences): | ||
| if not sentences: | ||
| results.append((unit_id, original_text, '')) | ||
| continue | ||
|
|
||
| translated_parts = [] | ||
| detected_langs = [] | ||
| for sent_text, sent_hash in sentences: | ||
| if sent_hash in sent_translation: | ||
| trans_text, det_lang = sent_translation[sent_hash] | ||
| translated_parts.append(trans_text) | ||
| if det_lang: | ||
| detected_langs.append(det_lang) | ||
| else: | ||
| # Fallback: should not happen, but use original text | ||
| translated_parts.append(sent_text) | ||
|
|
||
| assembled = ' '.join(translated_parts) | ||
| # Dominant detected language from constituent sentences | ||
| dominant_lang = '' | ||
| if detected_langs: | ||
| lang_counts = Counter(detected_langs) | ||
| dominant_lang = lang_counts.most_common(1)[0][0] | ||
|
|
||
| text_hash = hashlib.md5(original_text.encode()).hexdigest() | ||
| self._set_memory_cache(text_hash, dest_language, assembled, dominant_lang) | ||
| cache_translation(text_hash, dest_language, assembled, dominant_lang) | ||
|
|
||
| results.append((unit_id, assembled, dominant_lang)) |
There was a problem hiding this comment.
Phase 3 unconditionally writes full-text positive cache even for same-language text
cache_translation(text_hash, ...) is always called in Phase 3 regardless of whether the assembled translation actually differs from the original. If all constituent sentences were detected as same-language (returning original text), the full-text cache entry will hold the original text — an entry that downstream callers (including the merge-aware Fix 2 lookup) will treat as a valid translation. The _flush_batch check in the coordinator catches this for the normal flow, but the Fix 2 shortcut path bypasses it, so a "translation" equalling the source text can reach on_translation_ready.
…tion
P1: Skip Redis cache write when any sentence fell back to original text
due to API failure. Previously the exception path stored untranslated
fallback text into Redis, poisoning the 14-day TTL cache so future
requests would return raw text instead of retrying.
P2: Filter pending batch buffer entries for a segment when adopting a
Redis-cached translation on prefix reset. Without this, a stale
buffered entry could later overwrite the cached result in _flush_batch().
Git-on-my-level
left a comment
There was a problem hiding this comment.
Review feedback addressed in e13a77b
P1 — Cache poisoning on API failure:
- Added
_failed_sent_hashesset tracking sentences that fell back to original text - Phase 3 now skips Redis
cache_translation()when any constituent sentence is a fallback - Transient API failures no longer poison the 14-day TTL cache with untranslated text
P2 — Stale batch buffer overwrite after prefix reset:
- When adopting a Redis-cached translation on prefix reset, the code now filters out pending batch buffer entries for that segment ID
- Prevents
_flush_batch()from later overwriting with stale buffered text
| text_hash = hashlib.md5(text.encode()).hexdigest() | ||
| redis_cached = get_cached_translation(text_hash, self.target_language) | ||
| if redis_cached: | ||
| # Found in Redis — adopt as committed, skip re-translation | ||
| translated_text = redis_cached['text'] | ||
| detected_lang = redis_cached.get('detected_lang', '') | ||
| state.committed_text = text | ||
| state.assembled_translation = translated_text | ||
| state.detected_lang = detected_lang | ||
| self.metrics['prefix_resets'] += 1 | ||
| # Invalidate any pending batch entries for this segment so | ||
| # _flush_batch() does not overwrite with stale buffered text. |
There was a problem hiding this comment.
Stale in-flight batch job fires after Fix 2 callback
Fix 2 fires on_translation_ready and then continues, bypassing the _next_version() call on line 272. If a batch job for this segment was already queued (e.g., submitted 200 ms earlier and still awaiting the GCP API response), _flush_batch will see state.version == version still matching, and fire on_translation_ready a second time with the old (pre-merge) translation — overwriting the correct Fix 2 result in the UI.
The old code avoids this because any prefix reset that reaches _batch_buffer.append always passes through _next_version(), invalidating the prior in-flight version. Fix 2 needs to increment the version before firing the callback to close this race window. Adding state.version = self._next_version() before the await self.on_translation_ready(...) call is sufficient.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e13a77b323
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self._batch_buffer = [ | ||
| entry for entry in self._batch_buffer if entry[0] != segment.id | ||
| ] |
There was a problem hiding this comment.
Bump the version when adopting cached merged text
Fresh evidence: the current patch only removes entries that are still in _batch_buffer, but _flush_batch() copies and clears that buffer before awaiting translate_units_batch(). If this Redis-hit path runs while that API call is already in flight, state.version stays at the old value, so the stale batch result still passes the version check and can overwrite/persist an older translation after this cached merged translation was sent.
Useful? React with 👍 / 👎.
| self._batch_buffer = [ | ||
| entry for entry in self._batch_buffer if entry[0] != segment.id | ||
| ] | ||
| await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id) |
There was a problem hiding this comment.
Reapply the no-op filter before cached callbacks
When the cached entry is an unchanged translation, this shortcut bypasses the should_persist_translation() check used in _flush_batch() and immediately persists/sends a translation identical to the source. That cache state is possible because translate_units_batch() writes Redis before _flush_batch() decides a result is a no-op, so prefix resets for already-target-language text can reintroduce translation badges that the coordinator normally suppresses.
Useful? React with 👍 / 👎.
| self.metrics['prefix_resets'] += 1 | ||
| # Check if the new merged text was already translated (Redis cache) | ||
| text_hash = hashlib.md5(text.encode()).hexdigest() | ||
| redis_cached = get_cached_translation(text_hash, self.target_language) |
There was a problem hiding this comment.
Offload the Redis lookup from observe()
On a prefix reset this calls the synchronous redis.Redis.get path from inside async observe(). The backend rule in AGENTS.md says Redis CRUD in async def must be offloaded with run_blocking(..., db_executor, ...); otherwise a slow Redis read stalls the WebSocket event loop for other sessions on the same worker.
Useful? React with 👍 / 👎.
kodjima33
left a comment
There was a problem hiding this comment.
Backend feature (sentence-level translation dedup) — approve only; note: a required check is failing, do not merge until green.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bd21d935e8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # _flush_batch() does not overwrite with stale buffered text. | ||
| self._batch_buffer = [entry for entry in self._batch_buffer if entry[0] != segment.id] | ||
| await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id) | ||
| continue # Don't add to batch buffer |
There was a problem hiding this comment.
Refresh segment timing before continuing
When this cached-prefix path is taken, it sets committed_text and then continues before refreshing state.latest_text and state.last_update_at. If the next STT update appends to that same segment, _compute_stability_signals() uses the timestamp from before the cached reset, so the new tail can satisfy the 3s soft-boundary immediately and be translated as stable even though it just arrived; update the state text/timestamp before this early continue.
Useful? React with 👍 / 👎.
kodjima33
left a comment
There was a problem hiding this comment.
Backend translation dedup feature — approve only (re-approve on new sha)
… bump + timing refresh Addresses remaining P1/P2 bot review comments: - Add should_persist_translation guard to prevent no-op translations from triggering spurious UI badges (P1) - Bump state.version after adopting cached translation so in-flight batch jobs are rejected as stale (P1) - Refresh latest_text/last_update_at timing to prevent premature stability detection (P2)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cfaa910f26
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| dominant_lang = lang_counts.most_common(1)[0][0] | ||
|
|
||
| text_hash = hashlib.md5(original_text.encode()).hexdigest() | ||
| self._set_memory_cache(text_hash, dest_language, assembled, dominant_lang) |
There was a problem hiding this comment.
Avoid caching failed single-sentence fallbacks in memory
When a Google Translate chunk fails, the exception path stores the original sentence in sent_translation and marks it failed; for a single-sentence unit, this full-text hash is the same as the failed sentence hash, so this line still writes the fallback into the in-process LRU even though the Redis write is skipped. Subsequent occurrences in the same WebSocket/session then hit _check_memory_cache(sent_hash, ...) and return the original text without retrying Google until eviction, so a transient API failure can keep live translations untranslated.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
3 issues found across 3 files
Reply with feedback, questions, or to request a fix.
Fix all with cubic | Re-trigger cubic
P1 fixes: - Bump version + clear batch buffer in no-op translation path so in-flight batch jobs cannot overwrite cached results - Gate memory cache write on _failed_sent_hashes same as Redis cache to prevent poisoning LRU with untranslated fallback text P2 fix: - Offload Redis get_cached_translation to run_blocking in async observe() to avoid blocking the event loop
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1b8ebf0efd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # unit_sentences[i] = list of (sentence_text, sentence_hash) for unit i | ||
| unit_sentences = [] | ||
| for unit_id, text in units: | ||
| sentences = split_into_sentences(text) |
There was a problem hiding this comment.
Check full-text caches before splitting
For multi-sentence segments, this now splits the text before any cache lookup, so the old full-text LRU/Redis/negative keys are never consulted. That means existing 14-day Redis entries produced by the previous batch path or by translate_text() for texts like Hola. Gracias. will be bypassed and sent to Google again until sentence-level keys happen to be populated, undercutting the cost-saving goal on deploy; add the full-text cache/negative-cache check before falling back to sentence-level dedup.
Useful? React with 👍 / 👎.
| # unit_sentences[i] = list of (sentence_text, sentence_hash) for unit i | ||
| unit_sentences = [] | ||
| for unit_id, text in units: | ||
| sentences = split_into_sentences(text) |
There was a problem hiding this comment.
Preserve abbreviations when splitting sentences
With this new batch path, any transcript containing common abbreviations or decimals is sent to Google as broken fragments because split_into_sentences() splits on every period: for example I live in the U.S. now. becomes I live in the U., S., now., and Version 3.11 is installed. becomes Version 3., 11 is installed.. The previous full-text batch translation preserved that context, so these inputs can now persist noticeably worse translations; use safer sentence-boundary rules or avoid sentence splitting for these cases.
Useful? React with 👍 / 👎.
…edge-case tests Address 2 new Codex bot review comments on PR BasedHardware#7954: P2 - Abbreviation splitting (bot-reported): - split_into_sentences() now uses placeholder protection for internal periods in multi-part abbreviations (U.S., U.K., 3.14, e.g., i.e., etc., vs.) - Post-split merge heuristic prevents false sentence boundaries at abbreviation tails (e.g., 'I live in the U.S. now.' stays as 1 sentence) - Handles the specific bot examples: 'U.S.' no longer splits into ['U.', 'S.'] and '3.11' no longer splits into ['3.', '11'] P2 - Full-text cache bypassed: - translate_units_batch() now checks full-text Redis/LRU caches (Phase -1) BEFORE falling through to sentence-level dedup (Phase 0) - Prevents bypassing existing full-text cache entries from pre-DD-008 path Test coverage (18 new tests in test_translation_dedup_edge_cases.py): - 12 abbreviation splitting tests (U.S., 3.11, Dr., e.g., etc., vs., CJK, decimals) - 6 structural code path tests verifying all bot fixes are present: * No-op path bumps version + clears batch buffer * Failed-fallback memory cache gated on _failed_sent_hashes * Async observe() offloads Redis to run_blocking * Full-text cache phase precedes sentence splitting phase * Abbreviation protection patterns exist * Merge heuristic function exists
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 321d96869b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| """Verify that bot-identified code paths exist and are correctly structured.""" | ||
|
|
||
| def setUp(self): | ||
| self.pr_root = "/tmp/omi-pr" |
There was a problem hiding this comment.
Derive test source paths from the checkout
When backend/test.sh runs in a normal checkout, this hard-coded /tmp/omi-pr path does not exist, so _read_source() returns None and the structural assertions in this newly added test file fail before exercising the translation code. Since this test is now added to test.sh, CI is blocked unless it happens to use that exact temporary path; derive the repo root from __file__ instead.
Useful? React with 👍 / 👎.
| for unit_id, text in units: | ||
| sentences = split_into_sentences(text) | ||
| hashed = [(s, hashlib.md5(s.encode()).hexdigest()) for s in sentences] | ||
| unit_sentences.append((unit_id, text, hashed)) |
There was a problem hiding this comment.
Preserve partial full-text cache hits
When a mixed batch contains one unit with a full-text LRU/Redis hit and another uncached unit, the cached value is recorded in full_text_results but then discarded because only the all-hit case returns; this loop still splits and reprocesses every unit. In that scenario the cached segment is sent through sentence caches/API again, and if the API path fails it can return the source text despite already having a known cached translation.
Useful? React with 👍 / 👎.
| # Prev ends with a short abbreviation-like token (≤ 6 char body before punctuation) | ||
| prev_body = prev.rstrip('.!?。!؟ ') | ||
| if len(prev_body) <= 6: | ||
| return True |
There was a problem hiding this comment.
Restrict short-fragment merges to abbreviations
For common short sentences followed by another sentence, such as Sí. Gracias. or OK. Thanks., this length-only rule merges the two real sentences into one because the previous body is ≤6 characters. Since translate_units_batch() hashes after this split, those high-frequency short utterances stop getting sentence-level cache reuse and are translated as larger combined fragments, undermining the cost-saving path this PR adds.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
3 issues found across 3 files (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Fix all with cubic | Re-trigger cubic
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e3bb7b69d5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| redis_cached = await run_blocking( | ||
| sync_executor, get_cached_translation, text_hash, self.target_language |
There was a problem hiding this comment.
Use db_executor for Redis prefix lookups
In the prefix-reset path this offloads a Redis get_cached_translation call onto sync_executor. The backend AGENTS guidance assigns Firestore/Redis CRUD to db_executor, while sync_executor is reserved for sync/STT pipeline work; under many prefix resets or slow Redis, these cache reads can occupy the same 16 workers used by translation/audio pipeline tasks and queue that work behind Redis I/O. Route this lookup through db_executor instead.
Useful? React with 👍 / 👎.
| # translate_text() calls that wrote full-text entries. | ||
| full_text_results = {} | ||
| for unit_id, text in units: | ||
| text_hash = hashlib.md5(text.encode()).hexdigest() |
There was a problem hiding this comment.
Honor full-text negative cache entries
When a full segment was previously skipped or filtered as a no-op, the coordinator stores set_negative_cache(md5(full_text), lang), but this new full-text phase checks only positive LRU/Redis entries before splitting into sentence hashes. For a multi-sentence target-language segment that already has a full-text negative key, none of the sentence hashes match that key, so the previous full-text negative-cache behavior is bypassed and the same text can be sent to Google again.
Useful? React with 👍 / 👎.
kodjima33
left a comment
There was a problem hiding this comment.
Backend feature (sentence-level translation dedup + Redis merge) — approve only per policy (feature; Nik owns backend deploy). Re-approved on new sha.
P1 — test file: - Derive repo root from __file__ instead of hardcoded /tmp/omi-pr (CI runs in different checkout paths) P2 — translation.py: - Honor negative cache entries in Phase -1 full-text lookup (prevents re-translating known no-op texts) - Restrict short-fragment merge heuristic to abbreviation patterns only (prevents merging real short sentences like 'Sí.' + 'Gracias.') - Preserve partial full-text cache hits in mixed batch scenarios (cached units skip sentence splitting, emitted directly in Phase 3) - Skip sentence splitting for units with full-text cache hits P2 — translation_coordinator.py: - Use db_executor instead of sync_executor for Redis prefix lookups (per backend AGENTS guidance: Redis CRUD → db_executor)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1400409854
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not is_abbrev_like: | ||
| return False | ||
|
|
||
| return False |
There was a problem hiding this comment.
Return true for recognized abbreviation fragments
For abbreviation fragments followed by capitalized text, such as Dr. Smith arrived. or Mr. Jones left., this branch computes is_abbrev_like but then falls through to the unconditional return False, so split_into_sentences() still emits Dr. and Smith arrived. as separate translation units. That regresses the new sentence-level path by sending titles/names to Google without their context and can persist visibly worse translations.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
2 issues found across 3 files (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Fix all with cubic | Re-trigger cubic
P1 #1: _should_merge returned False for ALL abbreviations — missing return True when is_abbrev_like check passed. Also fixed operator precedence bug in decimal/version check (and binds before or). P1 #2: translate_units_batch emitted full-text cache hits before sentence-translated units, breaking input-order contract. Pre-allocate results array and fill by position. P1 #3+5: Cache write guard only caught exception-path failures but not quiet fallbacks (sent_hash missing from sent_translation dict). Track fallback hashes separately and block caching on any failure path. P1 #4: Prefix-reset Redis paths bumped version and cleaned buffer but did NOT cancel in-flight batch tasks. Added explicit task cancellation to both no-op and persist cache-hit paths.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7658df2623
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # so downstream consumers can rely on positional mapping. | ||
| results: list[tuple[str, str, str] | None] = [None] * len(units) # pre-allocate for in-order assembly | ||
|
|
||
| for unit_idx, (unit_id, original_text, sentences) in enumerate(unit_sentences): |
There was a problem hiding this comment.
Preserve original indexes for compacted sentence units
Fresh evidence: in this revision full-text cache hits are skipped when building unit_sentences, so enumerate(unit_sentences) is a compacted index rather than the original units index. When an earlier batched segment hits the full-text cache and a later segment misses, this writes the later segment's result into the earlier slot; the fill-in loop then sees that slot as occupied and never returns the cached translation, while the later segment can also get a duplicate fallback. In mixed cached/uncached batches, the coordinator can therefore omit translations that were already available from cache.
Useful? React with 👍 / 👎.
P2 #1: Return original text when any sentence falls back or fails — prevents caller from persisting partial/mixed translations that block retry on transient API failures. P2 #2: Extract last token for abbreviation detection in _should_merge() — embedded titles like 'I spoke to Dr.' have long prev_body but the trailing token is still a title abbrev. Check last token instead of full prev_body length.
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Fix all with cubic | Re-trigger cubic
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b8a0beeb3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| is_abbrev_like = ( | ||
| _last_token.isupper() | ||
| and len(_last_token) <= 3 # U.S., UK, Dr | ||
| or _last_token in ('Dr', 'Mr', 'Mrs', 'Ms', 'St', 'Prof') # Title abbrevs | ||
| or _last_token in ('etc', 'vs') # Latin abbrevs | ||
| or (_last_token[0].isupper() and len(_last_token) > 1 and _last_token[1:].isdigit()) # v2, V3 | ||
| ) | ||
| if is_abbrev_like: | ||
| return True |
There was a problem hiding this comment.
Avoid merging across real sentence boundaries
For sentences that legitimately end with a protected abbreviation/acronym, this unconditional abbreviation merge collapses the next capitalized sentence into the same cache unit; for example split_into_sentences("She is from the U.K. She likes tea.") returns one entry, even though the new test file expects two. That makes sentence-level cache keys larger combined chunks, so common following sentences cannot dedupe and are sent to Google again.
Useful? React with 👍 / 👎.
|
Addressed the latest sentence-splitting review feedback and generalized the tests so this class of issue is less one-off:
Validation run locally: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d321460b64
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Latin abbreviations: e.g. → eⓛⓐⓣg. | ||
| (re.compile(r'\b([a-z])\.([a-z])\.'), lambda m: m.group(1) + _LAT + m.group(2) + '.'), | ||
| # etc. → etcⓛⓐⓣ | ||
| (re.compile(r'\betc\.'), 'etc' + _LAT), |
There was a problem hiding this comment.
Preserve boundaries after sentence-final etc.
When a transcript uses etc. at a real sentence boundary, e.g. We bought apples, oranges, etc. Then we left., this replacement removes the only sentence-ending period before SENTENCE_FINDALL_RE runs, so split_into_sentences() returns one combined unit instead of two. Because the new dedup/cache keys are sentence-based, the following sentence is then translated and cached only as part of a larger combined chunk, reducing the cache reuse this change is meant to add; protect only the internal abbreviation text or restore a split point when the next token starts a new sentence.
Useful? React with 👍 / 👎.
|
Addressed the newest Codex review comment about Root causeThis one was not in became a single pre-split unit and never reached the merge heuristic as two candidate sentences. Why we kept missing edge casesThe pattern is that the splitter currently has two independent boundary decisions:
The last fix generalized only layer 2 ( The deeper test gap was that examples were still organized around individual bot-reported strings, not around invariants for each layer:
Change
Validation: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f8555e2e77
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Next segment is a lone lowercase word/fragment — not a real sentence | ||
| nxt_body = nxt.rstrip('.!?。!؟؟۔।॥ ') | ||
| if len(nxt_body) <= 15 and nxt and nxt[0].islower(): | ||
| return True |
There was a problem hiding this comment.
Restrict lowercase merges to real abbreviation continuations
When STT emits lower-case sentences, this unconditional branch merges real short sentences whenever the next sentence starts lower-case and is ≤15 chars; for example split_into_sentences("hola. gracias.") and "I agree. thanks." each become one combined unit. Since this PR's cost reduction depends on per-sentence hashes, common short utterances stop deduping and get cached/translated only as larger combined chunks; the fresh evidence is that this lowercase-continuation branch is independent of the abbreviation checks below.
Useful? React with 👍 / 👎.
|
Addressed the newest Codex comment about lowercase STT sentences being over-merged. My takeThis comment is valid, and it exposes the same deeper problem from another angle: we were treating Those should remain separate cache units. Lowercase is only useful as a continuation signal when the previous token is a known abbreviation class ( Change
Validation: Bigger-picture thoughtI think these are still real edge cases, not just style/opinion. But we are also reaching the limit of patching a regex + heuristic splitter one bot comment at a time. The more general approach should be to define the splitter contract as invariants and test a generated matrix across abbreviation classes:
Longer term, we should consider replacing this bespoke splitter with a library-backed sentence boundary detector or moving sentence-boundary detection closer to transcript metadata when available. The current heuristic is workable, but only if we keep the rule surface deliberately small and encode the trade-offs explicitly in tests. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 109d95dfdd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self.metrics['prefix_resets'] += 1 | ||
| # Check if the new merged text was already translated (Redis cache) | ||
| text_hash = hashlib.md5(text.encode()).hexdigest() | ||
| redis_cached = await run_blocking(db_executor, get_cached_translation, text_hash, self.target_language) |
There was a problem hiding this comment.
Invalidate stale batches before Redis lookup
When a prefix reset arrives while an older translation batch is queued or already in _flush_batch(), this awaits the Redis lookup before bumping state.version or removing buffered entries for the segment. If Redis is slow enough for the old batch to finish during that await, the stale result still matches the previous version and can call on_translation_ready for text that no longer matches the segment; move the version bump/buffer invalidation before the lookup so the in-flight batch is rejected immediately.
Useful? React with 👍 / 👎.
|
|
||
| _ABBREV_PATTERNS = [ | ||
| # Country codes: U.S. → UⓐⓑⓒS. (internal period protected, trailing period kept) | ||
| (re.compile(r'\b([A-Z])\.([A-Z])\.'), lambda m: m.group(1) + _ABBR + m.group(2) + '.'), |
There was a problem hiding this comment.
Protect multi-part acronyms in sentence splitting
This only protects two-letter initialisms, so common three-part acronyms are still split into fragments before translation; for example split_into_sentences("I live in the U.S.A. now.") returns ['I live in the U.S.', 'A. now.']. Since translate_units_batch() now translates and caches these sentence fragments independently, transcripts containing acronyms like U.S.A. or F.B.I. can get lower-quality translations and poor cache keys; protect repeated initialism groups rather than a single A.B. pair.
Useful? React with 👍 / 👎.
|
You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment |
|
Switched sentence splitting to use What changed
Why this shapeThis avoids the silent-failure footgun: if Local validationFormatting: Prod/Docker validationI also rebuilt the backend Dockerfile from scratch to verify the production Runtime smoke inside the built image: Note: same as earlier Docker smoke, this local machine cannot pull the private GCR base anonymously, so I used the public Python 3.11 slim stand-in tagged as |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6c13aa101d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return TranslationNeed.DEFER | ||
|
|
||
|
|
||
| _sentence_segmenter = pysbd.Segmenter(language="en", clean=False) |
There was a problem hiding this comment.
Preserve non-English sentence enders
When transcripts use sentence enders that the repo already supports outside English/CJK, such as Arabic ؟/۔ or Hindi ।/॥ from models/transcript_segment.SENTENCE_ENDERS, this singleton language="en" segmenter does not use PySBD's Arabic/Hindi punctuation rules, so examples like كيف حالك؟ أنا بخير. or the existing Hindi danda unit-test input remain one sentence. That regresses the multilingual split cases in backend/tests/unit/test_translation_optimization.py and prevents the new sentence-level cache reuse for those languages; choose the segmenter from the detected/source language or keep a fallback for the full repo sentence-ender set.
Useful? React with 👍 / 👎.
|
Addressed the latest multilingual sentence-ender review without reintroducing a custom splitter. Approach
Why this avoids reinventing the wheelThe new code only chooses a PySBD language ( Coverage addedAdded regressions for:
ValidationFormatting: Docker/runtime smoke after rebuild: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29a4cc3071
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id) | ||
| continue # Don't add to batch buffer |
There was a problem hiding this comment.
Update language state on cached prefix hits
When a prefix-reset Redis hit is a foreign-language segment while the conversation gate is already monolingual, this shortcut sends the cached translation and returns without the language-state update that _flush_batch() performs for translated results. The next update for the same segment only observes the newly appended tail; if that tail is short or inconclusive, ConversationLanguageState.observe() preserves monolingual=True, so the continuation can be committed as a skip instead of being translated.
Useful? React with 👍 / 👎.
|
Addressed the latest cached prefix-hit language-state review. Root causeThe prefix-reset Redis hit fast path updated the segment state and emitted FixOn meaningful cached prefix hits with a if detected_lang:
self.language_state.observe(text, speaker_id=segment.speaker_id)
await self.on_translation_ready(...)This keeps the shortcut behavior aligned with Regression coverageAdded RED check before fix: GREEN validation: Formatting: Async blocker scan was also run from repo root. It reports existing unrelated findings in other backend files; this change adds no new async blocking calls. |
|
Also addressed the prefix-reset race where a slow Redis lookup could let an older buffered batch flush before the reset invalidated it. Root causeThe prefix-reset path awaited the full-text Redis lookup before bumping FixAs soon as a prefix reset is detected, before any awaited Redis lookup:
Then the Redis fast path can still reuse a cached translation, but stale buffered/in-flight work is already invalidated. Regression coverageAdded RED check before fix: GREEN validation after both coordinator fixes: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 48f209395c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| _sentence_segmenters = { | ||
| language: pysbd.Segmenter(language=language, clean=False) for language in _SUPPORTED_PYSBD_LANGUAGES | ||
| } |
There was a problem hiding this comment.
Avoid sharing PySBD segmenters across translation threads
_flush_batch() runs translate_units_batch() on sync_executor, so concurrent WebSocket sessions can enter split_into_sentences() at the same time and reuse the same global segmenter for a language. PySBD's Segmenter.segment() mutates instance state (self.original_text) during each call (upstream source), so one translation can overwrite another call's text before spans are computed, causing wrong sentence boundaries or sentences from another transcript to be sent to Google/cache; construct a fresh segmenter per call or serialize access per segmenter.
Useful? React with 👍 / 👎.
| if detected_lang: | ||
| self.language_state.observe(text, speaker_id=segment.speaker_id) |
There was a problem hiding this comment.
Use cached detected language to break monolingual gate
Fresh evidence: this added cached-hit path calls ConversationLanguageState.observe() with the text, but that re-runs local langdetect instead of applying redis_cached['detected_lang']. When the Redis hit is short foreign text such as Sí. while monolingual=True, local detection is inconclusive and leaves the gate enabled, so the next short appended tail can be committed as a monolingual skip; update the gate from the cached detected_lang (or explicitly clear monolingual when its base differs from the target) before returning.
Useful? React with 👍 / 👎.
|
Addressed the two latest Codex review comments. 1) Avoid sharing PySBD segmenters across translation threadsCodex was right: Added/updated structural coverage to assert we do not keep 2) Use cached detected language to break monolingual gateCodex was also right that re-running local langdetect on a cached short foreign hit can be inconclusive. I added Both paths now use that trusted detected language:
This avoids leaving RED checks observedBefore implementation: ValidationFormatting: Async blocker scan was re-run from repo root. It reports the same existing unrelated backend findings; this change adds no sync Redis/Firestore calls in async paths. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 36711373c0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for unit_id, text in units: | ||
| if unit_id in full_text_results: | ||
| continue # Already have a full-text cache hit — no need to split | ||
| sentences = split_into_sentences(text) |
There was a problem hiding this comment.
Route Latin text through the detected source language
When translating a Spanish/French/etc. segment, this calls split_into_sentences(text) without a source language, and _pysbd_language_for_text() falls back to English for Latin-script text. That ignores language-specific PySBD abbreviation rules; for example a Spanish transcript like El Sr. García llegó. Después habló. can be split after Sr., so the new sentence-level cache sends and stores fragments that the previous full-text batch path translated with context. Please pass the detected source language into the splitter, or avoid sentence splitting when it is unknown.
Useful? React with 👍 / 👎.
|
Addressed the latest Codex comment about Latin-script text being routed through English PySBD rules. Root cause
split_into_sentences(text)without a source language. For Latin-script foreign text, FixAdded Both sentence-splitting callers now pass the detected source language into PySBD routing:
Regression coverageAdded coverage for a Spanish Latin-script abbreviation case: The test asserts that the API batch contents are split with Spanish PySBD rules: ["La Sra. García llegó.", "Después habló."]ValidationFormatting: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9b9d0f1d9c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| self._batch_task.cancel() | ||
| self._batch_task = None | ||
| self.metrics['prefix_resets'] += 1 | ||
| continue # Don't add to batch buffer |
There was a problem hiding this comment.
Apply cached language before no-op prefix exits
Fresh evidence: this no-op cached-hit branch exits before the update_from_detected_language() call that was added in the normal Redis-hit branch. When the sentence cache contains an unchanged but foreign translation (e.g. Spanish No. cached as {"text":"No.","detected_lang":"es"} from a longer segment), a prefix reset to that text while the conversation is monolingual commits it and leaves the gate in monolingual mode; the next short appended tail can then be skipped as target-language instead of translated.
Useful? React with 👍 / 👎.
|
Follow-up after an external review pass on the latest sentence-splitting changes. Issue addressedThe previous fix routed Latin-script text through the detected PySBD source language, but PySBD's Latin abbreviation coverage still varies by title/language. For example, Spanish FixAdded a conservative translation-unit helper:
This preserves the low-maintenance approach: no custom splitter and no fallback sentence-boundary engine. The safety valve only declines sentence-level splitting for risky Latin abbreviation output. Regression coverageAdded/updated coverage for: Expected API contents now stay whole: ["El Sr. García llegó. Después habló."]ValidationFormatting: |
|
Addressed the latest Codex comment about the no-op cached prefix-hit branch. Root causeThe normal cached Redis hit path now updates FixThe no-op cached prefix-hit branch now also calls: self.language_state.update_from_detected_language(detected_lang, speaker_id=segment.speaker_id)before exiting. Regression coverageAdded ValidationFormatting: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8cade17379
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if _any_failure: | ||
| results[orig_idx] = (unit_id, original_text, '') |
There was a problem hiding this comment.
Don’t commit failed chunks as no-op translations
When any uncached sentence chunk hits the exception path, _any_failure makes this return the original text with an empty detected language. _flush_batch() then treats that as a no-op translation, sets the negative cache, and advances committed_text, so a transient Google error on one sentence prevents the segment from being retried and can leave it untranslated for the negative-cache TTL.
Useful? React with 👍 / 👎.
| if normalized in _LATIN_COMPATIBLE_PYSBD_LANGUAGES and split_base != target_base and normalized != "en": | ||
| if any(re.search(r"\b[A-Za-zÀ-ÖØ-öø-ÿ]{1,4}\.$", sentence.strip()) for sentence in sentences): | ||
| stripped = text.strip() | ||
| return [stripped] if stripped else [] |
There was a problem hiding this comment.
Restrict the Latin safety valve to abbreviations
Fresh evidence: this helper now recombines correctly split PySBD sentences whenever any non-English Latin sentence ends in a 1–4 letter word plus a period, so common utterances like Sí. Gracias. or French/Spanish short replies are returned as one full-text unit. That bypasses the new per-sentence hashes and cache reuse for high-frequency short sentences, undercutting the cost-saving path this PR is adding; only treat known abbreviation fragments like Sr. as unsafe.
Useful? React with 👍 / 👎.
|
Addressed the two latest Codex comments. 1) Narrow the Latin safety valveCodex was right that the previous safety valve was too broad: it treated any non-English Latin sentence ending in a 1–4 letter word plus Fix:
Added regression coverage that keeps 2) Keep failed sentence chunks retryableCodex was right that returning Fix:
Added regression coverage that a retryable batch failure does not commit text, invoke callback, or set negative cache. ValidationFormatting: |
|
@codex review |
|
Codex Review: Didn't find any major issues. Already looking forward to the next diff. Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
P1 #1 (thread safety): Clarify split_into_sentences is pure regex (no PySBD) — fully thread-safe, no shared mutable state. P1 #2 (non-English enders): Add Arabic/Hindi enders to _should_merge's punctuation check set (؟۔।॥). P2s — _should_merge rewrite: - Restrict lowercase merge to single-token fragments only (not multi-word sentences like 'Sí. Gracias.' or 'I agree. thanks.') - Don't merge across real sentence boundaries after abbreviations when next segment starts capitalized + has multiple words - Preserve boundaries after etc./vs. at real sentence boundaries - Extend acronym protection to 3+ part initialisms (U.S.A., F.B.I.) - Increase token length limit to 8 for multi-part acronyms - Add 'Sr' to title abbreviation list P2s — translate_units_batch: - Failed chunks no longer committed as no-op translations that set negative cache and advance committed_text (blocking retry). Empty detected_lang now signals failure vs genuine no-op. P2s — translation_coordinator: - Version bump + batch invalidation moved BEFORE Redis lookup so in-flight batches are rejected immediately (not after slow Redis) - Cached detected language applied to ConversationLanguageState on prefix-reset cache hit (breaks monolingual gate for foreign text) - Non-no-op cached path: removed redundant version bump/cancel
8e20617 to
a1add57
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a1add5797a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| _ABBREV_PATTERNS = [ | ||
| # Country codes: U.S. → UⓐⓑⓒS. (internal period protected, trailing period kept) | ||
| (re.compile(r'\b([A-Z])\.([A-Z])\.'), lambda m: m.group(1) + _ABBR + m.group(2) + '.'), |
There was a problem hiding this comment.
Match longer acronyms before two-part abbreviations
Fresh evidence in this revision is that a three-part acronym pattern was added, but it is placed after this broader two-part pattern, so U.S.A./F.B.I. are partially rewritten as U.S./F.B. before the longer regex ever runs. In the new sentence-level path, inputs like I live in the U.S.A. now. still split into fragments such as I live in the U.S. and A. now., which are translated and cached without the full sentence context.
Useful? React with 👍 / 👎.
Summary
Reduces Google Translate V3 spend by 25–50% ($1,500–$2,140/mo) through two targeted fixes to the translation pipeline, plus architecture documentation to prevent future cost regressions.
The Problem
The
TranslationCoordinatorsends full segment text to Google Translate on every STT update cycle. As speech evolves ("Hola"→"Hola como"→"Hola como estas"), each version generates a unique MD5 cache key — so even though 60–70% of content is redundant, nothing hits cache.Result: $4,282/mo for 284 million characters translated, with an effective cache hit rate far below what the 3-tier cache system (LRU → Redis → negative) should deliver.
Fix 1: Sentence-Level Dedup in Batch Path (Primary)
File:
backend/utils/translation.py—translate_units_batch()Fix 2: Merge-Aware Redis Lookup (Secondary)
File:
backend/utils/translation_coordinator.pyDocumentation
coordinator.py:272-291coordinator.py:97-122backend/docs/translation-architecture.mdCost / Production Impact
Risk Mitigations
split_into_sentences()' '.join()'preserves original sequenceValidation
Rollback
Safe revert of 2 code files + 1 new doc file. No data migration needed — caches are ephemeral.