
Conversation

@chenghao-mou
Member

@chenghao-mou chenghao-mou commented Jan 26, 2026

This adds commit_user_turn support for realtime models.

This allows users to use turn_detection="manual" with a realtime model.
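
For illustration, a minimal sketch of what this enables, following the manual-turns pattern from the LiveKit docs; the RPC name end_turn, the RealtimeModel turn_detection=None option, and the rest of the wiring are assumptions for the example rather than part of this PR:

from livekit import agents, rtc
from livekit.plugins import openai


async def entrypoint(ctx: agents.JobContext):
    session = agents.AgentSession(
        # turn_detection=None assumed here to disable server-side VAD so turns are fully manual
        llm=openai.realtime.RealtimeModel(turn_detection=None),
        turn_detection="manual",
    )

    # Caller signals the end of their turn (e.g. push-to-talk release)
    @ctx.room.local_participant.register_rpc_method("end_turn")
    async def end_turn(data: rtc.RpcInvocationData):
        session.input.set_audio_enabled(False)  # stop listening
        session.commit_user_turn()  # with this PR, also commits the turn on the realtime session

    await session.start(
        agent=agents.Agent(instructions="You are a helpful assistant."),
        room=ctx.room,
    )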

Summary by CodeRabbit

  • New Features
    • Added an explicit "commit user turn" action for real-time agent sessions to finalize a user's turn.
    • OpenAI provider: committing a user turn also triggers response creation.
    • Google/AWS/Ultravox providers: the action exists but logs warnings or acts as a placeholder where unsupported.
    • Voice agent: when a real-time session is active, the realtime commit path is invoked before continuing audio processing.


@chenghao-mou chenghao-mou requested a review from a team January 26, 2026 15:44
@coderabbitai
Contributor

coderabbitai bot commented Jan 26, 2026

📝 Walkthrough

Adds an abstract commit_user_turn() to RealtimeSession and implements it across realtime provider plugins; agent_activity now invokes the realtime session's commit_user_turn() when present and still proceeds to the audio-recognition commit path. Provider implementations either perform finalization (OpenAI) or log unsupported warnings.

Changes

  • Abstract Interface: livekit-agents/livekit/agents/llm/realtime.py
    Adds an abstract def commit_user_turn(self) -> None to RealtimeSession.
  • Agent Layer: livekit-agents/livekit/agents/voice/agent_activity.py
    commit_user_turn now calls self._rt_session.commit_user_turn() when _rt_session exists, then continues to assert and call the existing audio-recognition commit_user_turn (no early return).
  • AWS Realtime Plugin: livekit-plugins/livekit-plugins-aws/.../realtime/realtime_model.py
    Adds commit_user_turn(), which logs a warning that the Nova Sonic Realtime API does not support committing a user turn.
  • Google Realtime Plugin: livekit-plugins/livekit-plugins-google/.../realtime/realtime_api.py
    Adds commit_user_turn() (warning only). Replaces the commit_audio() and clear_audio() no-ops with warning logs noting that the Gemini Realtime API does not support those actions.
  • OpenAI Realtime Plugin (stable & beta): livekit-plugins/livekit-plugins-openai/.../realtime/realtime_model.py, .../realtime_model_beta.py
    Implements commit_user_turn() to warn when auto-response/turn-detection is enabled, call commit_audio(), and emit a ResponseCreateEvent (empty params) to finalize the user turn (see the sketch after this list).
  • Ultravox Realtime Plugin: livekit-plugins/livekit-plugins-ultravox/.../realtime/realtime_model.py
    Adds commit_user_turn(), which logs an unsupported warning; changes push_video() from a no-op to a warning log.
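
A rough sketch of the provider-side behavior summarized above (not the actual PR diff; logger is standard logging, and commit_audio()/create_response() stand in for the plugins' real internals, which for OpenAI emit input_audio_buffer.commit and response.create):

import logging

logger = logging.getLogger(__name__)


class ManualCommitSupported:
    """Illustrative OpenAI-style behavior: flush the audio buffer, then request a response."""

    def commit_user_turn(self) -> None:
        # The real implementation warns when auto-response / server-side turn detection is on
        if getattr(self, "server_turn_detection_enabled", False):
            logger.warning("commit_user_turn() called while server-side turn detection is enabled")
        self.commit_audio()     # roughly: input_audio_buffer.commit
        self.create_response()  # roughly: response.create with empty params


class ManualCommitUnsupported:
    """Providers without an equivalent API (Nova Sonic, Gemini Live, Ultravox) just warn."""

    def commit_user_turn(self) -> None:
        logger.warning("commit_user_turn is not supported by this realtime API")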

Sequence Diagram(s)

mermaid
sequenceDiagram
    participant AgentActivity as AgentActivity
    participant RTSession as RealtimeSession
    participant AudioRec as AudioRecognition
    Note over AgentActivity,AudioRec: High-level user-turn commit flow
    alt RT session exists
        AgentActivity->>RTSession: commit_user_turn()
        RTSession-->>AgentActivity: ack / handle turn finalization
        AgentActivity->>AudioRec: commit_user_turn(...)
        AudioRec-->>AgentActivity: commit result
    else No RT session
        AgentActivity->>AudioRec: commit_user_turn(...)
        AudioRec-->>AgentActivity: commit result
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I thumped my paw to mark the turn,

A gentle hop, the paddles churn,
Some plugins warn, some send a bell,
A tiny hop to say all's well,
— rabbit clap and carrot swell 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 12.50%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title accurately describes the main change: adding a new commit_user_turn method across realtime model implementations to support manual turn detection.


Contributor

@bml1g12 bml1g12 left a comment


This looks good to me, as it means the docs here https://docs.livekit.io/agents/logic/turns/#manual would also apply to realtime models.

It might make sense to have clear_user_turn() also call self.clear_audio() for realtime models? I say this because I think then https://docs.livekit.io/agents/logic/turns/#manual would fully apply:

# When user starts speaking
@ctx.room.local_participant.register_rpc_method("start_turn")
async def start_turn(data: rtc.RpcInvocationData):
    session.interrupt()  # Stop any current agent speech
    session.clear_user_turn()  # Clear any previous input
    session.input.set_audio_enabled(True)  # Start listening

For cascaded models clear_user_turn() clears any previous model input, but for a realtime model I think we also need to clear the audio.

response=RealtimeResponseCreateParams(),
)
)
self.clear_audio()
Contributor


why is a clear_audio needed here?

Member Author


I know, right? It seems redundant, but it is required for OpenAI according to their doc:
[Screenshot of the OpenAI Realtime API documentation, 2026-01-27]

Contributor

@longcw longcw Jan 27, 2026


Send input_audio_buffer.clear before beginning a new user input.

I think it means you need to clear the buffer before the next time you want to start new user speech, not that it's required after response.create for this turn.

Maybe it's similar to the session.clear_user_turn in the example @bml1g12 mentioned above:

# When user starts speaking
@ctx.room.local_participant.register_rpc_method("start_turn")
async def start_turn(data: rtc.RpcInvocationData):
    session.interrupt()  # Stop any current agent speech
    session.clear_user_turn()  # Clear any previous input
    session.input.set_audio_enabled(True)  # Start listening

Member Author

@chenghao-mou chenghao-mou Jan 27, 2026


Yeah, I can put that call in the clear_user_turn part.

Turns out we don't need this if we call session.clear_user_turn.

Contributor

@bml1g12 bml1g12 Jan 27, 2026


Ah yes, it seems clear_user_turn already calls clear_audio() under the hood, so you probably want to clear audio when you start the new turn, not when you end it; and indeed that means it's probably not needed in this PR.
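
For reference, the per-turn ordering this sub-thread converges on, written out as data (OpenAI Realtime API event names on the right; the session.* mapping on the left is an assumption drawn from the discussion, not from the PR diff):

# One manual turn, roughly in order:
manual_turn_event_order = [
    ("session.clear_user_turn()", "input_audio_buffer.clear"),    # new turn: drop stale audio
    ("session.input.set_audio_enabled(True)", None),              # start listening
    ("<audio frames streaming>", "input_audio_buffer.append"),
    ("session.commit_user_turn()", "input_audio_buffer.commit"),  # end of turn: flush the buffer
    (None, "response.create"),                                    # ask the model to respond
]
# i.e. input_audio_buffer.clear belongs to the start of the next turn, not after response.create.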

Contributor

@longcw longcw left a comment


lgtm, my last question:

def commit_user_turn(self, *, transcript_timeout: float, stt_flush_duration: float) -> None:
    if self._rt_session is not None:
        self._rt_session.commit_user_turn()
        return
Contributor


can we commit to both rt session and audio recognition?

Member Author

@chenghao-mou chenghao-mou Jan 30, 2026


Yeah, that's a good point. I don't see why we can't. I will skip the return here.

Contributor

@bml1g12 bml1g12 Jan 30, 2026


I think that would be relevant when using external STT with the openai realtime model, right? Actually we are currently exploring that, i.e. using manual turn taking with a discrete STT engine (not using openai server-side whisper as STT).

I wonder, if we take this approach, whether it's possible to ensure the local and remote chat contexts stay consistent, and also that we have a local chat context we can use to recover the realtime session if there are any networking issues.

It seems to me that when we commit the STT transcripts to the local context (which maybe is what would happen if we commit the audio recognition in this code?), we get a healthy-looking local livekit chat context, but that somehow (presumably by livekit design) gets upserted automatically to the remote openai chat context. I'm not sure whether that's safe, given that from the OpenAI server-side perspective they automatically add an empty user "content" message when server-side STT is disabled (which I assume represents OpenAI's internal representation of those audio tokens, which is not human readable, hence no "content").

This Slack post explains in more detail what I am referring to.

It looks to me that if using an external STT, we would ideally want to put the user STT transcriptions into the local chat context (i.e. so it can be recovered if the server connection is broken), but there is no need to upsert them to the remote context (because the OpenAI realtime API model has no need at all for user-side STT transcriptions, to my understanding, given it processes raw audio tokens). Is that possible in livekit, I wonder? I imagine typically livekit tries to keep local and remote contexts in sync, which creates a (theoretical) problem here when we have local STT that the remote model doesn't need. Although this is a theoretical problem, I should note I have not yet seen any adverse effects from it; I'm just being cautious, as it would be tough to spot if there was an issue.

Member Author

@chenghao-mou chenghao-mou Jan 30, 2026


STT is essentially just emitting the final transcript for the realtime model in this case, and the realtime model doesn't really process or respond to final transcripts (different from generate_reply, which is direct user input).

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/voice/agent_activity.py`:
- Around line 978-981: Wrap the realtime call in commit_user_turn so exceptions
from self._rt_session.commit_user_turn() (e.g., RealtimeError) are caught and do
not prevent the STT commit from running; specifically, in the commit_user_turn
method surround the call to self._rt_session.commit_user_turn() with a
try/except that catches RealtimeError (or a broad Exception if RealtimeError
isn't imported) and ignores or logs it, then ensure the subsequent STT commit
still executes unconditionally.
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd8f80a and 6e82766.

📒 Files selected for processing (1)
  • livekit-agents/livekit/agents/voice/agent_activity.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/voice/agent_activity.py
🧠 Learnings (1)
📚 Learning: 2026-01-22T03:28:16.289Z
Learnt from: longcw
Repo: livekit/agents PR: 4563
File: livekit-agents/livekit/agents/beta/tools/end_call.py:65-65
Timestamp: 2026-01-22T03:28:16.289Z
Learning: In code paths that check capabilities or behavior of the LLM processing the current interaction, prefer using the activity's LLM obtained via ctx.session.current_agent._get_activity_or_raise().llm instead of ctx.session.llm. The session-level LLM may be a fallback and not reflect the actual agent handling the interaction. Use the activity LLM to determine capabilities and to make capability checks or feature toggles relevant to the current processing agent.

Applied to files:

  • livekit-agents/livekit/agents/voice/agent_activity.py
🧬 Code graph analysis (1)
livekit-agents/livekit/agents/voice/agent_activity.py (2)
livekit-agents/livekit/agents/voice/audio_recognition.py (1)
  • commit_user_turn (248-320)
livekit-agents/livekit/agents/voice/agent_session.py (1)
  • commit_user_turn (978-998)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: livekit-plugins-deepgram
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.9)
  • GitHub Check: type-check (3.13)


Comment on lines 978 to +981
def commit_user_turn(self, *, transcript_timeout: float, stt_flush_duration: float) -> None:
    if self._rt_session is not None:
        self._rt_session.commit_user_turn()

Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check for commit_user_turn implementation and exception types
rg -n "def commit_user_turn" --type py -A 5

Repository: livekit/agents

Length of output: 6596


🏁 Script executed:

#!/bin/bash
# Search for llm.RealtimeError or similar exception definitions
rg -n "class.*RealtimeError|RealtimeError" --type py -A 2

Repository: livekit/agents

Length of output: 9008


🏁 Script executed:

#!/bin/bash
# Check the context around lines 978-981 to see what follows commit_user_turn
sed -n '975,990p' livekit-agents/livekit/agents/voice/agent_activity.py

Repository: livekit/agents

Length of output: 782


🏁 Script executed:

#!/bin/bash
# Search for AudioRecognition.commit_user_turn to understand the full flow
rg -n "AudioRecognition.*commit_user_turn|\.commit_user_turn" livekit-agents/livekit/agents/voice/agent_activity.py -B 2 -A 2

Repository: livekit/agents

Length of output: 494


Wrap realtime commit to ensure STT commit always runs.

The _rt_session.commit_user_turn() call should be guarded with exception handling. While current provider implementations don't raise exceptions, this ensures the STT commit at line 983 always executes even if a future provider implementation raises RealtimeError. This pattern is already established elsewhere in the codebase for other realtime operations.

Suggested fix
        if self._rt_session is not None:
-            self._rt_session.commit_user_turn()
+            try:
+                self._rt_session.commit_user_turn()
+            except llm.RealtimeError:
+                logger.exception("failed to commit user turn on realtime session")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def commit_user_turn(self, *, transcript_timeout: float, stt_flush_duration: float) -> None:
        if self._rt_session is not None:
-            self._rt_session.commit_user_turn()
+            try:
+                self._rt_session.commit_user_turn()
+            except llm.RealtimeError:
+                logger.exception("failed to commit user turn on realtime session")
