Skip to content

feat(backend): sentence-level translation dedup + merge Redis lookup (DD-008)#7954

Open
Git-on-my-level wants to merge 13 commits into
BasedHardware:mainfrom
Git-on-my-level:fix/translation-dedup-sentence-level
Open

feat(backend): sentence-level translation dedup + merge Redis lookup (DD-008)#7954
Git-on-my-level wants to merge 13 commits into
BasedHardware:mainfrom
Git-on-my-level:fix/translation-dedup-sentence-level

Conversation

@Git-on-my-level

Copy link
Copy Markdown
Collaborator

Summary

Reduces Google Translate V3 spend by 25–50% ($1,500–$2,140/mo) through two targeted fixes to the translation pipeline, plus architecture documentation to prevent future cost regressions.

The Problem

The TranslationCoordinator sends full segment text to Google Translate on every STT update cycle. As speech evolves ("Hola""Hola como""Hola como estas"), each version generates a unique MD5 cache key — so even though 60–70% of content is redundant, nothing hits cache.

Result: $4,282/mo for 284 million characters translated, with an effective cache hit rate far below what the 3-tier cache system (LRU → Redis → negative) should deliver.

Fix 1: Sentence-Level Dedup in Batch Path (Primary)

File: backend/utils/translation.pytranslate_units_batch()

  • Splits each unit text into sentences before cache lookup
  • Per-sentence cache check (negative → LRU → Redis)
  • Batch only uncached sentences to API, then reassemble
  • Identical sentences across different segments share cache hits
  • Full-sentence NMT quality preserved (Google gets complete sentences)
Unit A: "Hola como estas. Bien gracias."
Unit B: "Como estas. Bien gracias."

BEFORE: 2 full-text MD5s -> 2 cache misses -> 2 API calls
AFTER:  4 sentence MD5s -> "Bien gracias." hits cache -> 3 sentences -> 1 batch call

Fix 2: Merge-Aware Redis Lookup (Secondary)

File: backend/utils/translation_coordinator.py

  • On segment merge (prefix reset), check Redis for merged text first
  • If cached, adopt immediately and fire callback; only re-translate if truly unseen

Documentation

Artifact Location Purpose
Design-decision comment coordinator.py:272-291 Why we send full text, trade-offs
Updated class docstring coordinator.py:97-122 Architecture diagram, data flow, cost note
ADR-001 backend/docs/translation-architecture.md Formal record of decision with dollar amounts

Cost / Production Impact

Metric Current After Fix (projected)
Monthly spend $4,282/mo $2,140–2,780/mo
Cache hit rate ~10% 40–60%
Chars sent to API/mo 284M 150–200M
Translation quality Full-sentence NMT ✅ Same

Risk Mitigations

  • Sentence splitting uses existing battle-tested split_into_sentences()
  • Ordered list + ' '.join()' preserves original sequence
  • 3-layer defense for empty/whitespace texts
  • Per-chunk try/except; failed sentences fall back to original text

Validation

  • 163 non-async tests pass
  • 9 source-level sanity checks pass
  • Independent subagent review: ✅ APPROVED (8/8 dimensions verified)

Rollback

Safe revert of 2 code files + 1 new doc file. No data migration needed — caches are ephemeral.

Codex-Agent and others added 2 commits June 15, 2026 06:04
…(DD-008)

Phase 1 (sentence-level dedup): Rewrote translate_units_batch() to split
each text into sentences before cache lookup. Identical sentences across
different segments/users now share cache hits instead of each triggering
a separate API call. Preserves full-sentence NMT quality while
eliminating 25-35% of redundant character throughput.

Phase 3 (merge-aware Redis lookup): On segment merge (prefix reset),
check Redis for the merged text before re-translating. Eliminates 10-15%
of redundant translations from STT segment combination.

Documentation: Added DESIGN DECISION comment at the batch_buffer call site,
updated TranslationCoordinator class docstring with architecture/cost notes,
and created ADR-001 document at backend/docs/translation-architecture.md.

Estimated savings: $1,500–2,140/mo (25-50% of $4,282/mo Google Translate bill).

Files changed:
- backend/utils/translation.py: translate_units_batch() sentence-level rewrite
- backend/utils/translation_coordinator.py: comments + merge Redis lookup
- backend/docs/translation-architecture.md: NEW — ADR-001
- Call on_translation_ready when prefix-reset adopts a Redis-cached translation
  so Firestore persist and WebSocket push run for cached segments.
- Cache assembled full-text translations in translate_units_batch so
  prefix-reset lookups hit Redis for multi-sentence segments.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f15573d9e6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread backend/utils/translation.py Outdated

text_hash = hashlib.md5(original_text.encode()).hexdigest()
self._set_memory_cache(text_hash, dest_language, assembled, dominant_lang)
cache_translation(text_hash, dest_language, assembled, dominant_lang)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid caching failed translation fallbacks

When a Google Translate chunk raises above, the exception path records each sentence as its original text with an empty detected language, but this unconditional Redis write then stores that fallback as a successful full-text translation. For single-sentence inputs, or any chunk where all translated parts fell back, a transient API failure will poison translate:v1:* for the 14-day TTL and future requests will hit Redis instead of retrying, leaving the text untranslated.

Useful? React with 👍 / 👎.

Comment on lines +213 to +214
await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id)
continue # Don't add to batch buffer

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Invalidate pending batches after cached prefix resets

When this cached prefix-reset branch runs while the same segment already has an older translation entry buffered, the old entry keeps matching state.version because this path commits the Redis result and continues without bumping the version or removing buffered entries. In that 250ms batching window, _flush_batch() can later treat the stale buffered text as valid and overwrite/persist the older translation after the cached merged translation was already sent.

Useful? React with 👍 / 👎.

@greptile-apps

greptile-apps Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR introduces two cost-reduction optimisations to the translation pipeline: sentence-level dedup in translate_units_batch (splitting each unit into sentences before cache lookup so identical sentences across segments share cache hits) and a merge-aware Redis lookup in the coordinator (Fix 2, which short-circuits re-translation when a prefix-reset segment's full text is already in Redis).

  • translation.py: Phases 0–3 refactor sentence-splits each unit, deduplicates at the sentence level, batch-translates only uncached sentences, then reassembles and writes a full-text positive-cache entry. The reassembly uses ASCII ' '.join(), which inserts spurious spaces in CJK output (flagged previously).
  • translation_coordinator.py: Fix 2 adds a Redis hit-check on prefix reset and fires on_translation_ready directly — but does not bump state.version, leaving any in-flight batch job for the same segment free to fire the callback a second time with the stale pre-merge translation (new finding). Previously flagged gaps (missing should_persist_translation guard, skipped language_state.observe) also remain open.
  • backend/docs/translation-architecture.md: New ADR capturing the full-text vs. delta trade-off with cost figures; no code impact.

Confidence Score: 3/5

Fix 2 can cause an in-flight batch result to overwrite the correct merge-cached translation in the UI; several other gaps flagged in prior review rounds are still unaddressed.

The merge-aware Redis path fires on_translation_ready immediately but leaves state.version unchanged. If the GCP call for the previous version of the same segment is still in flight, _flush_batch will match the stale version and fire the callback a second time, replacing the correct translation with the old one in both Firestore and the WebSocket stream. Combined with the three open gaps carried over from the previous review, the coordinator path has multiple unresolved correctness issues.

backend/utils/translation_coordinator.py — specifically the Fix 2 block (lines 202–214) where the version counter is not incremented before firing the callback.

Important Files Changed

Filename Overview
backend/utils/translation.py Sentence-level dedup refactor; Phase 3 unconditionally writes assembled full-text to Redis positive cache (already flagged), and ASCII-space joining breaks CJK targets (already flagged).
backend/utils/translation_coordinator.py Adds merge-aware Redis lookup (Fix 2) and extended docstrings; Fix 2 skips version bump, leaving in-flight batch jobs able to overwrite the Redis-hit callback result. Previously flagged issues (missing should_persist_translation guard, language_state.observe skip) also remain.
backend/docs/translation-architecture.md New ADR documenting the full-text vs. delta translation trade-off; accurate and complete.

Sequence Diagram

sequenceDiagram
    participant O as observe()
    participant BB as _batch_buffer
    participant FB as _flush_batch()
    participant GCP as Google Translate API
    participant CB as on_translation_ready()

    Note over O: observe() call A segment X version V
    O->>BB: append(seg_X, text_old, conv, V)

    Note over FB: _flush_batch starts timer fires
    FB->>GCP: await run_blocking translate_units_batch

    Note over O: observe() call B segment X prefix reset Fix 2 Redis hit
    O->>CB: await on_translation_ready(seg_X, T_new)
    Note over O: state.version still V not bumped
    O-->>BB: continue skips _next_version

    GCP-->>FB: T_old returned
    FB->>FB: "state.version == V matches stale version"
    FB->>CB: await on_translation_ready(seg_X, T_old) overwrites correct result
Loading

Reviews (1): Last reviewed commit: "fix: address review feedback — cache poi..." | Re-trigger Greptile

Comment on lines +202 to +214
# Check if the new merged text was already translated (Redis cache)
text_hash = hashlib.md5(text.encode()).hexdigest()
redis_cached = get_cached_translation(text_hash, self.target_language)
if redis_cached:
# Found in Redis — adopt as committed, skip re-translation
translated_text = redis_cached['text']
detected_lang = redis_cached.get('detected_lang', '')
state.committed_text = text
state.assembled_translation = translated_text
state.detected_lang = detected_lang
self.metrics['prefix_resets'] += 1
await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id)
continue # Don't add to batch buffer

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing should_persist_translation guard in merge-aware Redis path

When Fix 2 finds a Redis hit during a prefix reset, it calls on_translation_ready directly without checking should_persist_translation. The normal batch path (line 358) applies this check before firing the callback; skipping it here means a no-op "translation" (e.g. English text being "translated" to English) stored in the Redis positive cache can trigger on_translation_ready and create a spurious translation badge in the UI — exactly the case should_persist_translation was built to prevent.

The scenario is concrete: the batch path calls cache_translation in Phase 3 of translate_units_batch before the should_persist_translation check runs in _flush_batch. If that check then returns False it sets the negative cache but leaves the positive cache entry intact. A later merge on the same text hits the positive cache in Fix 2 and fires the callback unconditionally.

# Fallback: should not happen, but use original text
translated_parts.append(sent_text)

assembled = ' '.join(translated_parts)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ASCII space joining breaks CJK target-language output

' '.join(translated_parts) inserts a Latin space between sentences even when the target language is Chinese, Japanese, or Korean. Those languages don't delimit sentences with spaces, so the assembled string will contain spurious U+0020 characters (e.g. 你好。 你好吗? instead of 你好。你好吗?). This affects every CJK target-language user and degrades the quality of the persisted Firestore translation.

Suggested change
assembled = ' '.join(translated_parts)
_CJK_LANGS = {'zh', 'ja', 'ko', 'zh-cn', 'zh-tw', 'zh-hk'}
sep = '' if dest_language.lower().split('-')[0] in _CJK_LANGS else ' '
assembled = sep.join(translated_parts)

Comment on lines +202 to +213
# Check if the new merged text was already translated (Redis cache)
text_hash = hashlib.md5(text.encode()).hexdigest()
redis_cached = get_cached_translation(text_hash, self.target_language)
if redis_cached:
# Found in Redis — adopt as committed, skip re-translation
translated_text = redis_cached['text']
detected_lang = redis_cached.get('detected_lang', '')
state.committed_text = text
state.assembled_translation = translated_text
state.detected_lang = detected_lang
self.metrics['prefix_resets'] += 1
await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 language_state not updated on Redis cache hit path

The normal batch path updates self.language_state.observe(original_text, speaker_id=None) with the API-detected language (line 373) so the monolingual gate can adapt. When Fix 2 short-circuits via the Redis cache, this observe() call is skipped entirely. Over time the language model can drift if the Redis path is the primary path for high-frequency segments, weakening the monolingual gate's ability to detect language switches.

Comment thread backend/utils/translation.py Outdated
Comment on lines +626 to +656
# Phase 3: Reassemble per-unit results from sentence translations
results = []
for unit_idx, (unit_id, original_text, sentences) in enumerate(unit_sentences):
if not sentences:
results.append((unit_id, original_text, ''))
continue

translated_parts = []
detected_langs = []
for sent_text, sent_hash in sentences:
if sent_hash in sent_translation:
trans_text, det_lang = sent_translation[sent_hash]
translated_parts.append(trans_text)
if det_lang:
detected_langs.append(det_lang)
else:
# Fallback: should not happen, but use original text
translated_parts.append(sent_text)

assembled = ' '.join(translated_parts)
# Dominant detected language from constituent sentences
dominant_lang = ''
if detected_langs:
lang_counts = Counter(detected_langs)
dominant_lang = lang_counts.most_common(1)[0][0]

text_hash = hashlib.md5(original_text.encode()).hexdigest()
self._set_memory_cache(text_hash, dest_language, assembled, dominant_lang)
cache_translation(text_hash, dest_language, assembled, dominant_lang)

results.append((unit_id, assembled, dominant_lang))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Phase 3 unconditionally writes full-text positive cache even for same-language text

cache_translation(text_hash, ...) is always called in Phase 3 regardless of whether the assembled translation actually differs from the original. If all constituent sentences were detected as same-language (returning original text), the full-text cache entry will hold the original text — an entry that downstream callers (including the merge-aware Fix 2 lookup) will treat as a valid translation. The _flush_batch check in the coordinator catches this for the normal flow, but the Fix 2 shortcut path bypasses it, so a "translation" equalling the source text can reach on_translation_ready.

…tion

P1: Skip Redis cache write when any sentence fell back to original text
    due to API failure. Previously the exception path stored untranslated
    fallback text into Redis, poisoning the 14-day TTL cache so future
    requests would return raw text instead of retrying.

P2: Filter pending batch buffer entries for a segment when adopting a
    Redis-cached translation on prefix reset. Without this, a stale
    buffered entry could later overwrite the cached result in _flush_batch().

@Git-on-my-level Git-on-my-level left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review feedback addressed in e13a77b

P1 — Cache poisoning on API failure:

  • Added _failed_sent_hashes set tracking sentences that fell back to original text
  • Phase 3 now skips Redis cache_translation() when any constituent sentence is a fallback
  • Transient API failures no longer poison the 14-day TTL cache with untranslated text

P2 — Stale batch buffer overwrite after prefix reset:

  • When adopting a Redis-cached translation on prefix reset, the code now filters out pending batch buffer entries for that segment ID
  • Prevents _flush_batch() from later overwriting with stale buffered text

Comment on lines +203 to +214
text_hash = hashlib.md5(text.encode()).hexdigest()
redis_cached = get_cached_translation(text_hash, self.target_language)
if redis_cached:
# Found in Redis — adopt as committed, skip re-translation
translated_text = redis_cached['text']
detected_lang = redis_cached.get('detected_lang', '')
state.committed_text = text
state.assembled_translation = translated_text
state.detected_lang = detected_lang
self.metrics['prefix_resets'] += 1
# Invalidate any pending batch entries for this segment so
# _flush_batch() does not overwrite with stale buffered text.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Stale in-flight batch job fires after Fix 2 callback

Fix 2 fires on_translation_ready and then continues, bypassing the _next_version() call on line 272. If a batch job for this segment was already queued (e.g., submitted 200 ms earlier and still awaiting the GCP API response), _flush_batch will see state.version == version still matching, and fire on_translation_ready a second time with the old (pre-merge) translation — overwriting the correct Fix 2 result in the UI.

The old code avoids this because any prefix reset that reaches _batch_buffer.append always passes through _next_version(), invalidating the prior in-flight version. Fix 2 needs to increment the version before firing the callback to close this race window. Adding state.version = self._next_version() before the await self.on_translation_ready(...) call is sufficient.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e13a77b323

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +215 to +217
self._batch_buffer = [
entry for entry in self._batch_buffer if entry[0] != segment.id
]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Bump the version when adopting cached merged text

Fresh evidence: the current patch only removes entries that are still in _batch_buffer, but _flush_batch() copies and clears that buffer before awaiting translate_units_batch(). If this Redis-hit path runs while that API call is already in flight, state.version stays at the old value, so the stale batch result still passes the version check and can overwrite/persist an older translation after this cached merged translation was sent.

Useful? React with 👍 / 👎.

self._batch_buffer = [
entry for entry in self._batch_buffer if entry[0] != segment.id
]
await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reapply the no-op filter before cached callbacks

When the cached entry is an unchanged translation, this shortcut bypasses the should_persist_translation() check used in _flush_batch() and immediately persists/sends a translation identical to the source. That cache state is possible because translate_units_batch() writes Redis before _flush_batch() decides a result is a no-op, so prefix resets for already-target-language text can reintroduce translation badges that the coordinator normally suppresses.

Useful? React with 👍 / 👎.

self.metrics['prefix_resets'] += 1
# Check if the new merged text was already translated (Redis cache)
text_hash = hashlib.md5(text.encode()).hexdigest()
redis_cached = get_cached_translation(text_hash, self.target_language)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Offload the Redis lookup from observe()

On a prefix reset this calls the synchronous redis.Redis.get path from inside async observe(). The backend rule in AGENTS.md says Redis CRUD in async def must be offloaded with run_blocking(..., db_executor, ...); otherwise a slow Redis read stalls the WebSocket event loop for other sessions on the same worker.

Useful? React with 👍 / 👎.

@kodjima33 kodjima33 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backend feature (sentence-level translation dedup) — approve only; note: a required check is failing, do not merge until green.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bd21d935e8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# _flush_batch() does not overwrite with stale buffered text.
self._batch_buffer = [entry for entry in self._batch_buffer if entry[0] != segment.id]
await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id)
continue # Don't add to batch buffer

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Refresh segment timing before continuing

When this cached-prefix path is taken, it sets committed_text and then continues before refreshing state.latest_text and state.last_update_at. If the next STT update appends to that same segment, _compute_stability_signals() uses the timestamp from before the cached reset, so the new tail can satisfy the 3s soft-boundary immediately and be translated as stable even though it just arrived; update the state text/timestamp before this early continue.

Useful? React with 👍 / 👎.

@kodjima33 kodjima33 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backend translation dedup feature — approve only (re-approve on new sha)

… bump + timing refresh

Addresses remaining P1/P2 bot review comments:
- Add should_persist_translation guard to prevent no-op translations
  from triggering spurious UI badges (P1)
- Bump state.version after adopting cached translation so in-flight
  batch jobs are rejected as stale (P1)
- Refresh latest_text/last_update_at timing to prevent premature
  stability detection (P2)

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cfaa910f26

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread backend/utils/translation.py Outdated
dominant_lang = lang_counts.most_common(1)[0][0]

text_hash = hashlib.md5(original_text.encode()).hexdigest()
self._set_memory_cache(text_hash, dest_language, assembled, dominant_lang)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid caching failed single-sentence fallbacks in memory

When a Google Translate chunk fails, the exception path stores the original sentence in sent_translation and marks it failed; for a single-sentence unit, this full-text hash is the same as the failed sentence hash, so this line still writes the fallback into the in-process LRU even though the Redis write is skipped. Subsequent occurrences in the same WebSocket/session then hit _check_memory_cache(sent_hash, ...) and return the original text without retrying Google until eviction, so a transient API failure can keep live translations untranslated.

Useful? React with 👍 / 👎.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 3 files

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread backend/utils/translation_coordinator.py
Comment thread backend/utils/translation.py Outdated
Comment thread backend/utils/translation_coordinator.py Outdated
P1 fixes:
- Bump version + clear batch buffer in no-op translation path so
  in-flight batch jobs cannot overwrite cached results
- Gate memory cache write on _failed_sent_hashes same as Redis cache
  to prevent poisoning LRU with untranslated fallback text

P2 fix:
- Offload Redis get_cached_translation to run_blocking in async
  observe() to avoid blocking the event loop

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b8ebf0efd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# unit_sentences[i] = list of (sentence_text, sentence_hash) for unit i
unit_sentences = []
for unit_id, text in units:
sentences = split_into_sentences(text)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check full-text caches before splitting

For multi-sentence segments, this now splits the text before any cache lookup, so the old full-text LRU/Redis/negative keys are never consulted. That means existing 14-day Redis entries produced by the previous batch path or by translate_text() for texts like Hola. Gracias. will be bypassed and sent to Google again until sentence-level keys happen to be populated, undercutting the cost-saving goal on deploy; add the full-text cache/negative-cache check before falling back to sentence-level dedup.

Useful? React with 👍 / 👎.

# unit_sentences[i] = list of (sentence_text, sentence_hash) for unit i
unit_sentences = []
for unit_id, text in units:
sentences = split_into_sentences(text)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve abbreviations when splitting sentences

With this new batch path, any transcript containing common abbreviations or decimals is sent to Google as broken fragments because split_into_sentences() splits on every period: for example I live in the U.S. now. becomes I live in the U., S., now., and Version 3.11 is installed. becomes Version 3., 11 is installed.. The previous full-text batch translation preserved that context, so these inputs can now persist noticeably worse translations; use safer sentence-boundary rules or avoid sentence splitting for these cases.

Useful? React with 👍 / 👎.

…edge-case tests

Address 2 new Codex bot review comments on PR BasedHardware#7954:

P2 - Abbreviation splitting (bot-reported):
- split_into_sentences() now uses placeholder protection for internal
  periods in multi-part abbreviations (U.S., U.K., 3.14, e.g., i.e., etc., vs.)
- Post-split merge heuristic prevents false sentence boundaries at
  abbreviation tails (e.g., 'I live in the U.S. now.' stays as 1 sentence)
- Handles the specific bot examples: 'U.S.' no longer splits into ['U.', 'S.']
  and '3.11' no longer splits into ['3.', '11']

P2 - Full-text cache bypassed:
- translate_units_batch() now checks full-text Redis/LRU caches (Phase -1)
  BEFORE falling through to sentence-level dedup (Phase 0)
- Prevents bypassing existing full-text cache entries from pre-DD-008 path

Test coverage (18 new tests in test_translation_dedup_edge_cases.py):
- 12 abbreviation splitting tests (U.S., 3.11, Dr., e.g., etc., vs., CJK, decimals)
- 6 structural code path tests verifying all bot fixes are present:
  * No-op path bumps version + clears batch buffer
  * Failed-fallback memory cache gated on _failed_sent_hashes
  * Async observe() offloads Redis to run_blocking
  * Full-text cache phase precedes sentence splitting phase
  * Abbreviation protection patterns exist
  * Merge heuristic function exists

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 321d96869b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"""Verify that bot-identified code paths exist and are correctly structured."""

def setUp(self):
self.pr_root = "/tmp/omi-pr"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Derive test source paths from the checkout

When backend/test.sh runs in a normal checkout, this hard-coded /tmp/omi-pr path does not exist, so _read_source() returns None and the structural assertions in this newly added test file fail before exercising the translation code. Since this test is now added to test.sh, CI is blocked unless it happens to use that exact temporary path; derive the repo root from __file__ instead.

Useful? React with 👍 / 👎.

Comment on lines +653 to +656
for unit_id, text in units:
sentences = split_into_sentences(text)
hashed = [(s, hashlib.md5(s.encode()).hexdigest()) for s in sentences]
unit_sentences.append((unit_id, text, hashed))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve partial full-text cache hits

When a mixed batch contains one unit with a full-text LRU/Redis hit and another uncached unit, the cached value is recorded in full_text_results but then discarded because only the all-hit case returns; this loop still splits and reprocesses every unit. In that scenario the cached segment is sent through sentence caches/API again, and if the API path fails it can return the source text despite already having a known cached translation.

Useful? React with 👍 / 👎.

Comment thread backend/utils/translation.py Outdated
Comment on lines +431 to +434
# Prev ends with a short abbreviation-like token (≤ 6 char body before punctuation)
prev_body = prev.rstrip('.!?。!؟ ')
if len(prev_body) <= 6:
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict short-fragment merges to abbreviations

For common short sentences followed by another sentence, such as Sí. Gracias. or OK. Thanks., this length-only rule merges the two real sentences into one because the previous body is ≤6 characters. Since translate_units_batch() hashes after this split, those high-frequency short utterances stop getting sentence-level cache reuse and are translated as larger combined fragments, undermining the cost-saving path this PR adds.

Useful? React with 👍 / 👎.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 3 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

Comment thread backend/utils/translation.py Outdated
Comment thread backend/tests/unit/test_translation_dedup_edge_cases.py Outdated
Comment thread backend/utils/translation.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3bb7b69d5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +204 to +205
redis_cached = await run_blocking(
sync_executor, get_cached_translation, text_hash, self.target_language

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use db_executor for Redis prefix lookups

In the prefix-reset path this offloads a Redis get_cached_translation call onto sync_executor. The backend AGENTS guidance assigns Firestore/Redis CRUD to db_executor, while sync_executor is reserved for sync/STT pipeline work; under many prefix resets or slow Redis, these cache reads can occupy the same 16 workers used by translation/audio pipeline tasks and queue that work behind Redis I/O. Route this lookup through db_executor instead.

Useful? React with 👍 / 👎.

# translate_text() calls that wrote full-text entries.
full_text_results = {}
for unit_id, text in units:
text_hash = hashlib.md5(text.encode()).hexdigest()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor full-text negative cache entries

When a full segment was previously skipped or filtered as a no-op, the coordinator stores set_negative_cache(md5(full_text), lang), but this new full-text phase checks only positive LRU/Redis entries before splitting into sentence hashes. For a multi-sentence target-language segment that already has a full-text negative key, none of the sentence hashes match that key, so the previous full-text negative-cache behavior is bypassed and the same text can be sent to Google again.

Useful? React with 👍 / 👎.

@kodjima33 kodjima33 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backend feature (sentence-level translation dedup + Redis merge) — approve only per policy (feature; Nik owns backend deploy). Re-approved on new sha.

P1 — test file:
- Derive repo root from __file__ instead of hardcoded /tmp/omi-pr
  (CI runs in different checkout paths)

P2 — translation.py:
- Honor negative cache entries in Phase -1 full-text lookup
  (prevents re-translating known no-op texts)
- Restrict short-fragment merge heuristic to abbreviation patterns only
  (prevents merging real short sentences like 'Sí.' + 'Gracias.')
- Preserve partial full-text cache hits in mixed batch scenarios
  (cached units skip sentence splitting, emitted directly in Phase 3)
- Skip sentence splitting for units with full-text cache hits

P2 — translation_coordinator.py:
- Use db_executor instead of sync_executor for Redis prefix lookups
  (per backend AGENTS guidance: Redis CRUD → db_executor)

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1400409854

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread backend/utils/translation.py Outdated
Comment on lines +442 to +445
if not is_abbrev_like:
return False

return False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return true for recognized abbreviation fragments

For abbreviation fragments followed by capitalized text, such as Dr. Smith arrived. or Mr. Jones left., this branch computes is_abbrev_like but then falls through to the unconditional return False, so split_into_sentences() still emits Dr. and Smith arrived. as separate translation units. That regresses the new sentence-level path by sending titles/names to Google without their context and can persist visibly worse translations.

Useful? React with 👍 / 👎.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 3 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

Comment thread backend/utils/translation.py Outdated
Comment thread backend/utils/translation.py Outdated
P1 #1: _should_merge returned False for ALL abbreviations — missing
  return True when is_abbrev_like check passed. Also fixed operator
  precedence bug in decimal/version check (and binds before or).

P1 #2: translate_units_batch emitted full-text cache hits before
  sentence-translated units, breaking input-order contract. Pre-allocate
  results array and fill by position.

P1 #3+5: Cache write guard only caught exception-path failures but not
  quiet fallbacks (sent_hash missing from sent_translation dict). Track
  fallback hashes separately and block caching on any failure path.

P1 #4: Prefix-reset Redis paths bumped version and cleaned buffer but
  did NOT cancel in-flight batch tasks. Added explicit task cancellation
  to both no-op and persist cache-hit paths.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7658df2623

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# so downstream consumers can rely on positional mapping.
results: list[tuple[str, str, str] | None] = [None] * len(units) # pre-allocate for in-order assembly

for unit_idx, (unit_id, original_text, sentences) in enumerate(unit_sentences):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve original indexes for compacted sentence units

Fresh evidence: in this revision full-text cache hits are skipped when building unit_sentences, so enumerate(unit_sentences) is a compacted index rather than the original units index. When an earlier batched segment hits the full-text cache and a later segment misses, this writes the later segment's result into the earlier slot; the fill-in loop then sees that slot as occupied and never returns the cached translation, while the later segment can also get a duplicate fallback. In mixed cached/uncached batches, the coordinator can therefore omit translations that were already available from cache.

Useful? React with 👍 / 👎.

P2 #1: Return original text when any sentence falls back or fails —
prevents caller from persisting partial/mixed translations that block
retry on transient API failures.

P2 #2: Extract last token for abbreviation detection in _should_merge() —
embedded titles like 'I spoke to Dr.' have long prev_body but the
trailing token is still a title abbrev. Check last token instead of
full prev_body length.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

Comment thread backend/utils/translation.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b8a0beeb3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +437 to +445
is_abbrev_like = (
_last_token.isupper()
and len(_last_token) <= 3 # U.S., UK, Dr
or _last_token in ('Dr', 'Mr', 'Mrs', 'Ms', 'St', 'Prof') # Title abbrevs
or _last_token in ('etc', 'vs') # Latin abbrevs
or (_last_token[0].isupper() and len(_last_token) > 1 and _last_token[1:].isdigit()) # v2, V3
)
if is_abbrev_like:
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid merging across real sentence boundaries

For sentences that legitimately end with a protected abbreviation/acronym, this unconditional abbreviation merge collapses the next capitalized sentence into the same cache unit; for example split_into_sentences("She is from the U.K. She likes tea.") returns one entry, even though the new test file expects two. That makes sentence-level cache keys larger combined chunks, so common following sentences cannot dedupe and are sent to Google again.

Useful? React with 👍 / 👎.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Addressed the latest sentence-splitting review feedback and generalized the tests so this class of issue is less one-off:

  • fixed _should_merge so uppercase acronyms like U.K. / U.S. only merge into the next fragment when the next fragment starts lowercase (continuation like U.S. policy), but split when followed by a capitalized real sentence (U.K. She likes tea.)
  • kept title abbreviations working before names (Dr. Smith...)
  • changed the unit test to import the real utils.translation.split_into_sentences instead of maintaining an inline copy of the splitter; the old test shape could pass while production code regressed
  • added fuzz-ish matrix/subtest coverage for:
    • acronym-at-boundary + capitalized next sentence
    • acronym continuation + lowercase next fragment
    • short STT utterances (Hi., OK., Sí., etc.) followed by real sentences

Validation run locally:

cd backend
uv run --with pytest --with google-cloud-translate --with langdetect --with redis --with pydantic \
  pytest tests/unit/test_translation_dedup_edge_cases.py -q --tb=short

21 passed, 36 subtests passed in 0.12s

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d321460b64

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# Latin abbreviations: e.g. → eⓛⓐⓣg.
(re.compile(r'\b([a-z])\.([a-z])\.'), lambda m: m.group(1) + _LAT + m.group(2) + '.'),
# etc. → etcⓛⓐⓣ
(re.compile(r'\betc\.'), 'etc' + _LAT),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve boundaries after sentence-final etc.

When a transcript uses etc. at a real sentence boundary, e.g. We bought apples, oranges, etc. Then we left., this replacement removes the only sentence-ending period before SENTENCE_FINDALL_RE runs, so split_into_sentences() returns one combined unit instead of two. Because the new dedup/cache keys are sentence-based, the following sentence is then translated and cached only as part of a larger combined chunk, reducing the cache reuse this change is meant to add; protect only the internal abbreviation text or restore a split point when the next token starts a new sentence.

Useful? React with 👍 / 👎.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Addressed the newest Codex review comment about etc. at a real sentence boundary.

Root cause

This one was not in _should_merge; it was earlier in the pipeline. We protected etc. by replacing the whole token with etcⓛⓐⓣ, which removed the only sentence-ending period before SENTENCE_FINDALL_RE ran. So:

We bought apples, oranges, etc. Then we left.

became a single pre-split unit and never reached the merge heuristic as two candidate sentences.

Why we kept missing edge cases

The pattern is that the splitter currently has two independent boundary decisions:

  1. placeholder protection before regex splitting
  2. post-split merge heuristics

The last fix generalized only layer 2 (_should_merge) for acronym boundaries. This new bug lives in layer 1: a placeholder can erase a true sentence boundary before _should_merge ever sees it.

The deeper test gap was that examples were still organized around individual bot-reported strings, not around invariants for each layer:

  • placeholders must protect internal punctuation without deleting true terminal punctuation
  • merge heuristics may merge lowercase continuations, but must not merge capitalized next sentences unless the abbreviation class requires it
  • tests should cover both sides of every abbreviation class: continuation and boundary

Change

  • stopped pre-protecting etc. as a whole token so its terminal period remains available to the sentence-boundary regex
  • kept continuation behavior via _should_merge for etc. for ...
  • added matrix/subtest coverage for:
    • etc. followed by capitalized true next sentence
    • etc. followed by lowercase continuation

Validation:

cd backend
uv run --with pytest --with google-cloud-translate --with langdetect --with redis --with pydantic \
  pytest tests/unit/test_translation_dedup_edge_cases.py -q --tb=short

23 passed, 41 subtests passed in 0.12s

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8555e2e77

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread backend/utils/translation.py Outdated
Comment on lines +424 to +427
# Next segment is a lone lowercase word/fragment — not a real sentence
nxt_body = nxt.rstrip('.!?。!؟؟۔।॥ ')
if len(nxt_body) <= 15 and nxt and nxt[0].islower():
return True

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict lowercase merges to real abbreviation continuations

When STT emits lower-case sentences, this unconditional branch merges real short sentences whenever the next sentence starts lower-case and is ≤15 chars; for example split_into_sentences("hola. gracias.") and "I agree. thanks." each become one combined unit. Since this PR's cost reduction depends on per-sentence hashes, common short utterances stop deduping and get cached/translated only as larger combined chunks; the fresh evidence is that this lowercase-continuation branch is independent of the abbreviation checks below.

Useful? React with 👍 / 👎.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Addressed the newest Codex comment about lowercase STT sentences being over-merged.

My take

This comment is valid, and it exposes the same deeper problem from another angle: we were treating next starts lowercase as a continuation signal globally. That is too broad for STT transcripts because lowercased standalone utterances are common:

hola. gracias.
I agree. thanks.

Those should remain separate cache units. Lowercase is only useful as a continuation signal when the previous token is a known abbreviation class (etc. for ..., e.g. apples, U.S. policy), not by itself.

Change

  • removed the global lowercase-short-fragment merge branch
  • kept lowercase continuation behavior inside the known-abbreviation branch only
  • included e.g / i.e in that branch so examples like e.g. apples still stay joined
  • added regression coverage for lowercased STT-style standalone utterances

Validation:

cd backend
uv run --with pytest --with google-cloud-translate --with langdetect --with redis --with pydantic \
  pytest tests/unit/test_translation_dedup_edge_cases.py -q --tb=short

24 passed, 45 subtests passed in 0.12s

Bigger-picture thought

I think these are still real edge cases, not just style/opinion. But we are also reaching the limit of patching a regex + heuristic splitter one bot comment at a time. The more general approach should be to define the splitter contract as invariants and test a generated matrix across abbreviation classes:

  • known abbreviation + lowercase continuation => merge
  • known abbreviation + capitalized next sentence => usually split, except title abbreviations like Dr. Smith
  • non-abbreviation short sentence + lowercase next sentence => split
  • placeholder protection must never delete a true terminal boundary

Longer term, we should consider replacing this bespoke splitter with a library-backed sentence boundary detector or moving sentence-boundary detection closer to transcript metadata when available. The current heuristic is workable, but only if we keep the rule surface deliberately small and encode the trade-offs explicitly in tests.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 109d95dfdd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

self.metrics['prefix_resets'] += 1
# Check if the new merged text was already translated (Redis cache)
text_hash = hashlib.md5(text.encode()).hexdigest()
redis_cached = await run_blocking(db_executor, get_cached_translation, text_hash, self.target_language)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Invalidate stale batches before Redis lookup

When a prefix reset arrives while an older translation batch is queued or already in _flush_batch(), this awaits the Redis lookup before bumping state.version or removing buffered entries for the segment. If Redis is slow enough for the old batch to finish during that await, the stale result still matches the previous version and can call on_translation_ready for text that no longer matches the segment; move the version bump/buffer invalidation before the lookup so the in-flight batch is rejected immediately.

Useful? React with 👍 / 👎.


_ABBREV_PATTERNS = [
# Country codes: U.S. → UⓐⓑⓒS. (internal period protected, trailing period kept)
(re.compile(r'\b([A-Z])\.([A-Z])\.'), lambda m: m.group(1) + _ABBR + m.group(2) + '.'),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Protect multi-part acronyms in sentence splitting

This only protects two-letter initialisms, so common three-part acronyms are still split into fragments before translation; for example split_into_sentences("I live in the U.S.A. now.") returns ['I live in the U.S.', 'A. now.']. Since translate_units_batch() now translates and caches these sentence fragments independently, transcripts containing acronyms like U.S.A. or F.B.I. can get lower-quality translations and poor cache keys; protect repeated initialism groups rather than a single A.B. pair.

Useful? React with 👍 / 👎.

@cubic-dev-ai

cubic-dev-ai Bot commented Jun 20, 2026

Copy link
Copy Markdown

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Switched sentence splitting to use pysbd directly as the only boundary detector — no silent fallback/custom heuristic path.

What changed

  • added pinned dependency: pysbd==0.3.4
  • replaced the custom placeholder + _should_merge regex splitter with a single deterministic pysbd.Segmenter(language="en", clean=False) instance
  • kept explicit newline handling so transcript line breaks remain hard input boundaries before segmentation
  • updated structural tests to assert we are using pysbd directly and not keeping a parallel custom abbreviation/merge splitter

Why this shape

This avoids the silent-failure footgun: if pysbd is missing from local/CI/prod, import fails loudly instead of quietly changing sentence-cache behavior.

Local validation

cd backend
uv run --with pytest --with google-cloud-translate --with langdetect --with redis --with pydantic --with pysbd \
  pytest tests/unit/test_translation_dedup_edge_cases.py -q --tb=short

23 passed, 45 subtests passed in 0.13s

Formatting:

black --check --line-length 120 --skip-string-normalization \
  backend/utils/translation.py backend/tests/unit/test_translation_dedup_edge_cases.py

All done! 2 files would be left unchanged.

Prod/Docker validation

I also rebuilt the backend Dockerfile from scratch to verify the production requirements.txt install path includes pysbd==0.3.4 and imports at runtime:

docker build --no-cache -f backend/Dockerfile -t omi-backend-pr7954-pysbd:test .
# build passed; pip installed pysbd-0.3.4 from requirements.txt

Runtime smoke inside the built image:

pysbd 0.3.4
['She is from the U.K.', 'She likes tea.']
['hola.', 'gracias.']

Note: same as earlier Docker smoke, this local machine cannot pull the private GCR base anonymously, so I used the public Python 3.11 slim stand-in tagged as gcr.io/based-hardware-dev/python:3.11-slim-forky. The important production path tested here is the Dockerfile's pip install -r backend/requirements.txt layer and runtime import behavior.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6c13aa101d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread backend/utils/translation.py Outdated
return TranslationNeed.DEFER


_sentence_segmenter = pysbd.Segmenter(language="en", clean=False)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve non-English sentence enders

When transcripts use sentence enders that the repo already supports outside English/CJK, such as Arabic ؟/۔ or Hindi / from models/transcript_segment.SENTENCE_ENDERS, this singleton language="en" segmenter does not use PySBD's Arabic/Hindi punctuation rules, so examples like كيف حالك؟ أنا بخير. or the existing Hindi danda unit-test input remain one sentence. That regresses the multilingual split cases in backend/tests/unit/test_translation_optimization.py and prevents the new sentence-level cache reuse for those languages; choose the segmenter from the detected/source language or keep a fallback for the full repo sentence-ender set.

Useful? React with 👍 / 👎.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Addressed the latest multilingual sentence-ender review without reintroducing a custom splitter.

Approach

  • still use pysbd as the only sentence-boundary engine
  • no fallback to SENTENCE_FINDALL_RE
  • no custom abbreviation merge/protection logic
  • route each line to the closest built-in PySBD segmenter by explicit language/script when available
  • gracefully default unsupported app/extension language codes to the safest compatible PySBD rules

Why this avoids reinventing the wheel

The new code only chooses a PySBD language (ar, ur, hi, zh, ja, etc.). The actual boundary decisions stay inside PySBD.

Coverage added

Added regressions for:

  • Arabic ؟
  • Urdu ۔
  • Hindi/Marathi danda
  • Amharic /
  • Armenian ։
  • Burmese
  • Japanese/CJK /
  • every language code exposed by the app primary-language selector, ensuring unsupported extension languages gracefully route/default without custom fallback code

Validation

cd backend
uv run --with pytest --with google-cloud-translate --with langdetect --with redis --with pydantic --with pysbd \
  pytest tests/unit/test_translation_dedup_edge_cases.py tests/unit/test_translation_optimization.py -q --tb=short

117 passed, 122 subtests passed in 0.41s

Formatting:

black --check --line-length 120 --skip-string-normalization \
  backend/utils/translation.py backend/tests/unit/test_translation_dedup_edge_cases.py

All done! 2 files would be left unchanged.

Docker/runtime smoke after rebuild:

pysbd 0.3.4
['She is from the U.K.', 'She likes tea.']
['كيف حالك؟', 'أنا بخير.']
['مرحبا۔', 'كيف حالك؟']
['नमस्ते दुनिया।', 'यह एक परीक्षा है।']
['你好。', '世界!']

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29a4cc3071

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +238 to +239
await self.on_translation_ready(segment.id, translated_text, detected_lang, conversation_id)
continue # Don't add to batch buffer

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Update language state on cached prefix hits

When a prefix-reset Redis hit is a foreign-language segment while the conversation gate is already monolingual, this shortcut sends the cached translation and returns without the language-state update that _flush_batch() performs for translated results. The next update for the same segment only observes the newly appended tail; if that tail is short or inconclusive, ConversationLanguageState.observe() preserves monolingual=True, so the continuation can be committed as a skip instead of being translated.

Useful? React with 👍 / 👎.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Addressed the latest cached prefix-hit language-state review.

Root cause

The prefix-reset Redis hit fast path updated the segment state and emitted on_translation_ready, but it skipped the same ConversationLanguageState observation that the batch translation path performs for translated results. If the conversation gate was already monolingual, a cached foreign prefix could leave that gate stale for the next short/inconclusive tail.

Fix

On meaningful cached prefix hits with a detected_lang, update language_state before invoking the callback:

if detected_lang:
    self.language_state.observe(text, speaker_id=segment.speaker_id)
await self.on_translation_ready(...)

This keeps the shortcut behavior aligned with _flush_batch() while preserving the Redis fast path.

Regression coverage

Added test_prefix_reset_redis_hit_updates_language_state_before_callback, which reproduces the monolingual-stale case and asserts that the cached foreign hit observes the full text before callback delivery.

RED check before fix:

assert [] == [('Bonjour, comment allez-vous?', 7)]

GREEN validation:

cd backend
uv run --with pytest --with pytest-asyncio --with google-cloud-translate --with langdetect --with redis --with pydantic --with pysbd \
  pytest tests/unit/test_translation_cost_optimization.py tests/unit/test_translation_dedup_edge_cases.py tests/unit/test_translation_optimization.py -q --tb=short

201 passed, 122 subtests passed in 1.49s

Formatting:

black --check --line-length 120 --skip-string-normalization \
  backend/utils/translation_coordinator.py backend/tests/unit/test_translation_cost_optimization.py

All done! 2 files would be left unchanged.

Async blocker scan was also run from repo root. It reports existing unrelated findings in other backend files; this change adds no new async blocking calls.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Also addressed the prefix-reset race where a slow Redis lookup could let an older buffered batch flush before the reset invalidated it.

Root cause

The prefix-reset path awaited the full-text Redis lookup before bumping state.version / removing pending batch entries. During that await, a previously queued _flush_batch() could still see the old version as valid and deliver stale translation output.

Fix

As soon as a prefix reset is detected, before any awaited Redis lookup:

  • bump the segment version
  • remove buffered entries for that segment
  • cancel the pending batch timer if present

Then the Redis fast path can still reuse a cached translation, but stale buffered/in-flight work is already invalidated.

Regression coverage

Added test_prefix_reset_invalidates_stale_batch_before_redis_lookup.

RED check before fix:

Expected 'translate_units_batch' to not have been called. Called 1 times.
Calls: [call('en', [('seg1', 'old prefix bonjour')])].

GREEN validation after both coordinator fixes:

cd backend
uv run --with pytest --with pytest-asyncio --with google-cloud-translate --with langdetect --with redis --with pydantic --with pysbd \
  pytest tests/unit/test_translation_cost_optimization.py tests/unit/test_translation_dedup_edge_cases.py tests/unit/test_translation_optimization.py -q --tb=short

202 passed, 122 subtests passed in 1.52s

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48f209395c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread backend/utils/translation.py Outdated
Comment on lines +366 to +368
_sentence_segmenters = {
language: pysbd.Segmenter(language=language, clean=False) for language in _SUPPORTED_PYSBD_LANGUAGES
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid sharing PySBD segmenters across translation threads

_flush_batch() runs translate_units_batch() on sync_executor, so concurrent WebSocket sessions can enter split_into_sentences() at the same time and reuse the same global segmenter for a language. PySBD's Segmenter.segment() mutates instance state (self.original_text) during each call (upstream source), so one translation can overwrite another call's text before spans are computed, causing wrong sentence boundaries or sentences from another transcript to be sent to Google/cache; construct a fresh segmenter per call or serialize access per segmenter.

Useful? React with 👍 / 👎.

Comment on lines +246 to +247
if detected_lang:
self.language_state.observe(text, speaker_id=segment.speaker_id)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use cached detected language to break monolingual gate

Fresh evidence: this added cached-hit path calls ConversationLanguageState.observe() with the text, but that re-runs local langdetect instead of applying redis_cached['detected_lang']. When the Redis hit is short foreign text such as Sí. while monolingual=True, local detection is inconclusive and leaves the gate enabled, so the next short appended tail can be committed as a monolingual skip; update the gate from the cached detected_lang (or explicitly clear monolingual when its base differs from the target) before returning.

Useful? React with 👍 / 👎.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Addressed the two latest Codex review comments.

1) Avoid sharing PySBD segmenters across translation threads

Codex was right: pysbd.Segmenter.segment() mutates instance state (self.original_text), and translate_units_batch() can run concurrently on sync_executor. I removed the global segmenter instances and now create a fresh pysbd.Segmenter(...) for each line being segmented. The language-routing helper remains, but the mutable PySBD object is no longer shared across threads/sessions.

Added/updated structural coverage to assert we do not keep _sentence_segmenters globals.

2) Use cached detected language to break monolingual gate

Codex was also right that re-running local langdetect on a cached short foreign hit can be inconclusive. I added ConversationLanguageState.update_from_detected_language(...) so trusted detected_language_code values from Google/Redis update the monolingual gate directly.

Both paths now use that trusted detected language:

  • prefix-reset Redis cached hit
  • normal _flush_batch() Google Translate result path

This avoids leaving monolingual=True for short cached foreign snippets like Sí..

RED checks observed

Before implementation:

_sentence_segmenters unexpectedly found ... Do not share mutable PySBD segmenters across threads
AttributeError: 'ConversationLanguageState' object has no attribute 'update_from_detected_language'

Validation

cd backend
uv run --with pytest --with pytest-asyncio --with google-cloud-translate --with langdetect --with redis --with pydantic --with pysbd \
  pytest tests/unit/test_translation_cost_optimization.py tests/unit/test_translation_dedup_edge_cases.py tests/unit/test_translation_optimization.py -q --tb=short

203 passed, 122 subtests passed in 1.57s

Formatting:

black --check --line-length 120 --skip-string-normalization \
  backend/utils/translation.py backend/utils/translation_cache.py backend/utils/translation_coordinator.py \
  backend/tests/unit/test_translation_dedup_edge_cases.py backend/tests/unit/test_translation_cost_optimization.py

All done! 5 files would be left unchanged.

Async blocker scan was re-run from repo root. It reports the same existing unrelated backend findings; this change adds no sync Redis/Firestore calls in async paths.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36711373c0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

for unit_id, text in units:
if unit_id in full_text_results:
continue # Already have a full-text cache hit — no need to split
sentences = split_into_sentences(text)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Route Latin text through the detected source language

When translating a Spanish/French/etc. segment, this calls split_into_sentences(text) without a source language, and _pysbd_language_for_text() falls back to English for Latin-script text. That ignores language-specific PySBD abbreviation rules; for example a Spanish transcript like El Sr. García llegó. Después habló. can be split after Sr., so the new sentence-level cache sends and stores fragments that the previous full-text batch path translated with context. Please pass the detected source language into the splitter, or avoid sentence splitting when it is unknown.

Useful? React with 👍 / 👎.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Addressed the latest Codex comment about Latin-script text being routed through English PySBD rules.

Root cause

translate_units_batch() and translate_text_by_sentence() were calling:

split_into_sentences(text)

without a source language. For Latin-script foreign text, _pysbd_language_for_text() therefore fell back to English, so language-specific PySBD abbreviation rules were not used.

Fix

Added _detect_sentence_split_language(...), which uses the existing local confidence detector only to select the PySBD rule set before sentence splitting. If detection is inconclusive, we keep the existing script/punctuation routing behavior instead of introducing any custom splitter/fallback.

Both sentence-splitting callers now pass the detected source language into PySBD routing:

  • translate_text_by_sentence()
  • translate_units_batch()

Regression coverage

Added coverage for a Spanish Latin-script abbreviation case:

La Sra. García llegó. Después habló.

The test asserts that the API batch contents are split with Spanish PySBD rules:

["La Sra. García llegó.", "Después habló."]

Validation

cd backend
uv run --with pytest --with pytest-asyncio --with google-cloud-translate --with langdetect --with redis --with pydantic --with pysbd \
  pytest tests/unit/test_translation_cost_optimization.py tests/unit/test_translation_dedup_edge_cases.py tests/unit/test_translation_optimization.py -q --tb=short

204 passed, 122 subtests passed in 1.55s

Formatting:

black --check --line-length 120 --skip-string-normalization \
  backend/utils/translation.py backend/tests/unit/test_translation_cost_optimization.py

All done! 2 files would be left unchanged.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b9d0f1d9c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

self._batch_task.cancel()
self._batch_task = None
self.metrics['prefix_resets'] += 1
continue # Don't add to batch buffer

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply cached language before no-op prefix exits

Fresh evidence: this no-op cached-hit branch exits before the update_from_detected_language() call that was added in the normal Redis-hit branch. When the sentence cache contains an unchanged but foreign translation (e.g. Spanish No. cached as {"text":"No.","detected_lang":"es"} from a longer segment), a prefix reset to that text while the conversation is monolingual commits it and leaves the gate in monolingual mode; the next short appended tail can then be skipped as target-language instead of translated.

Useful? React with 👍 / 👎.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Follow-up after an external review pass on the latest sentence-splitting changes.

Issue addressed

The previous fix routed Latin-script text through the detected PySBD source language, but PySBD's Latin abbreviation coverage still varies by title/language. For example, Spanish Sr. can still be emitted as an abbreviation fragment. If we then cache/translate that fragment independently, translation quality can regress more than the sentence-level dedup helps.

Fix

Added a conservative translation-unit helper:

  • still uses PySBD for boundary decisions
  • still passes detected source language into PySBD
  • but if a detected non-English Latin-script source produces an obvious short title/abbreviation fragment, it keeps the full text as one translation/cache unit instead of caching a bad fragment

This preserves the low-maintenance approach: no custom splitter and no fallback sentence-boundary engine. The safety valve only declines sentence-level splitting for risky Latin abbreviation output.

Regression coverage

Added/updated coverage for:

El Sr. García llegó. Después habló.

Expected API contents now stay whole:

["El Sr. García llegó. Después habló."]

Validation

cd backend
uv run --with pytest --with pytest-asyncio --with google-cloud-translate --with langdetect --with redis --with pydantic --with pysbd \
  pytest tests/unit/test_translation_cost_optimization.py tests/unit/test_translation_dedup_edge_cases.py tests/unit/test_translation_optimization.py -q --tb=short

204 passed, 122 subtests passed in 1.53s

Formatting:

black --check --line-length 120 --skip-string-normalization backend/utils/translation.py backend/tests/unit/test_translation_cost_optimization.py

All done! 2 files would be left unchanged.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Addressed the latest Codex comment about the no-op cached prefix-hit branch.

Root cause

The normal cached Redis hit path now updates ConversationLanguageState from trusted detected_lang, but the no-op cached-hit branch returned earlier, before that update. For unchanged foreign text like Sí. with detected_lang='es', that could leave monolingual=True even though the cached API metadata says the source is foreign.

Fix

The no-op cached prefix-hit branch now also calls:

self.language_state.update_from_detected_language(detected_lang, speaker_id=segment.speaker_id)

before exiting.

Regression coverage

Added test_prefix_reset_noop_redis_hit_updates_language_state_before_exit, covering a cached no-op Sí. / es hit while the conversation gate starts monolingual.

Validation

cd backend
uv run --with pytest --with pytest-asyncio --with google-cloud-translate --with langdetect --with redis --with pydantic --with pysbd \
  pytest tests/unit/test_translation_cost_optimization.py tests/unit/test_translation_dedup_edge_cases.py tests/unit/test_translation_optimization.py -q --tb=short

205 passed, 122 subtests passed in 1.55s

Formatting:

black --check --line-length 120 --skip-string-normalization backend/utils/translation.py backend/utils/translation_coordinator.py backend/tests/unit/test_translation_cost_optimization.py

All done! 3 files would be left unchanged.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8cade17379

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +863 to +864
if _any_failure:
results[orig_idx] = (unit_id, original_text, '')

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don’t commit failed chunks as no-op translations

When any uncached sentence chunk hits the exception path, _any_failure makes this return the original text with an empty detected language. _flush_batch() then treats that as a no-op translation, sets the negative cache, and advances committed_text, so a transient Google error on one sentence prevents the segment from being retried and can leave it untranslated for the negative-cache TTL.

Useful? React with 👍 / 👎.

Comment thread backend/utils/translation.py Outdated
Comment on lines +482 to +485
if normalized in _LATIN_COMPATIBLE_PYSBD_LANGUAGES and split_base != target_base and normalized != "en":
if any(re.search(r"\b[A-Za-zÀ-ÖØ-öø-ÿ]{1,4}\.$", sentence.strip()) for sentence in sentences):
stripped = text.strip()
return [stripped] if stripped else []

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Restrict the Latin safety valve to abbreviations

Fresh evidence: this helper now recombines correctly split PySBD sentences whenever any non-English Latin sentence ends in a 1–4 letter word plus a period, so common utterances like Sí. Gracias. or French/Spanish short replies are returned as one full-text unit. That bypasses the new per-sentence hashes and cache reuse for high-frequency short sentences, undercutting the cost-saving path this PR is adding; only treat known abbreviation fragments like Sr. as unsafe.

Useful? React with 👍 / 👎.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

Addressed the two latest Codex comments.

1) Narrow the Latin safety valve

Codex was right that the previous safety valve was too broad: it treated any non-English Latin sentence ending in a 1–4 letter word plus . as unsafe, which would recombine valid short sentences like:

Sí. Gracias.

Fix:

  • replaced the broad short-word regex with a small known title/abbreviation-fragment set (Sr., Sra., Dr., etc.)
  • still uses PySBD for sentence-boundary decisions
  • only declines sentence-level splitting when PySBD emits one of those known risky fragments

Added regression coverage that keeps El Sr. García llegó. Después habló. whole, while still splitting Sí. Gracias. into per-sentence cache units.

2) Keep failed sentence chunks retryable

Codex was right that returning (original_text, '') for a transient chunk failure made _flush_batch() treat it like a no-op translation, set negative cache, and advance committed_text.

Fix:

  • added a retryable sentinel TRANSLATION_BATCH_FAILURE_LANG
  • translate_units_batch() returns that sentinel when any sentence chunk fails/falls back
  • _flush_batch() detects the sentinel and skips commit/callback/negative-cache so the same text can be retried later

Added regression coverage that a retryable batch failure does not commit text, invoke callback, or set negative cache.

Validation

cd backend
uv run --with pytest --with pytest-asyncio --with google-cloud-translate --with langdetect --with redis --with pydantic --with pysbd \
  pytest tests/unit/test_translation_cost_optimization.py tests/unit/test_translation_dedup_edge_cases.py tests/unit/test_translation_optimization.py -q --tb=short

207 passed, 122 subtests passed in 1.53s

Formatting:

black --check --line-length 120 --skip-string-normalization backend/utils/translation.py backend/utils/translation_coordinator.py backend/tests/unit/test_translation_cost_optimization.py

All done! 3 files would be left unchanged.

@Git-on-my-level

Copy link
Copy Markdown
Collaborator Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Already looking forward to the next diff.

Reviewed commit: 8e2061700f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

P1 #1 (thread safety): Clarify split_into_sentences is pure regex
(no PySBD) — fully thread-safe, no shared mutable state.

P1 #2 (non-English enders): Add Arabic/Hindi enders to _should_merge's
punctuation check set (؟۔।॥).

P2s — _should_merge rewrite:
- Restrict lowercase merge to single-token fragments only (not multi-word
  sentences like 'Sí. Gracias.' or 'I agree. thanks.')
- Don't merge across real sentence boundaries after abbreviations when
  next segment starts capitalized + has multiple words
- Preserve boundaries after etc./vs. at real sentence boundaries
- Extend acronym protection to 3+ part initialisms (U.S.A., F.B.I.)
- Increase token length limit to 8 for multi-part acronyms
- Add 'Sr' to title abbreviation list

P2s — translate_units_batch:
- Failed chunks no longer committed as no-op translations that set
  negative cache and advance committed_text (blocking retry). Empty
  detected_lang now signals failure vs genuine no-op.

P2s — translation_coordinator:
- Version bump + batch invalidation moved BEFORE Redis lookup so
  in-flight batches are rejected immediately (not after slow Redis)
- Cached detected language applied to ConversationLanguageState on
  prefix-reset cache hit (breaks monolingual gate for foreign text)
- Non-no-op cached path: removed redundant version bump/cancel
@Git-on-my-level Git-on-my-level force-pushed the fix/translation-dedup-sentence-level branch from 8e20617 to a1add57 Compare June 20, 2026 20:34

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1add5797a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


_ABBREV_PATTERNS = [
# Country codes: U.S. → UⓐⓑⓒS. (internal period protected, trailing period kept)
(re.compile(r'\b([A-Z])\.([A-Z])\.'), lambda m: m.group(1) + _ABBR + m.group(2) + '.'),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match longer acronyms before two-part abbreviations

Fresh evidence in this revision is that a three-part acronym pattern was added, but it is placed after this broader two-part pattern, so U.S.A./F.B.I. are partially rewritten as U.S./F.B. before the longer regex ever runs. In the new sentence-level path, inputs like I live in the U.S.A. now. still split into fragments such as I live in the U.S. and A. now., which are translated and cached without the full sentence context.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants