AGT-2474: add commit user turn support for realtime models #4622
Conversation
📝 Walkthrough
Adds an abstract commit_user_turn() to RealtimeSession and implements it across realtime provider plugins; agent_activity now invokes the realtime session's commit_user_turn() when present and still proceeds to the audio-recognition commit path. Provider implementations either perform finalization (OpenAI) or log unsupported warnings.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
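For a rough sense of the shape described above, here is a minimal sketch; aside from commit_user_turn itself, the class and method names are illustrative assumptions, not the PR's actual code:

import logging
from abc import ABC, abstractmethod

logger = logging.getLogger(__name__)

class RealtimeSession(ABC):
    @abstractmethod
    def commit_user_turn(self) -> None:
        """Commit the buffered user input and ask the model to respond."""

class UnsupportedProviderSession(RealtimeSession):
    # providers without a manual-commit concept just log a warning,
    # as the walkthrough notes
    def commit_user_turn(self) -> None:
        logger.warning("commit_user_turn is not supported by this realtime model")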
bml1g12
left a comment
This looks good to me, as it means the docs here https://docs.livekit.io/agents/logic/turns/#manual would also apply to realtime models.
It might make sense to have clear_user_turn() also call self.clear_audio() for realtime models? I say this because I think https://docs.livekit.io/agents/logic/turns/#manual would then fully apply:
# When user starts speaking
@ctx.room.local_participant.register_rpc_method("start_turn")
async def start_turn(data: rtc.RpcInvocationData):
    session.interrupt()  # Stop any current agent speech
    session.clear_user_turn()  # Clear any previous input
    session.input.set_audio_enabled(True)  # Start listening
For cascaded models clear_user_turn() clears any previous model input, but for realtime models I think we also need to clear the audio (see the sketch below).
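For illustration, a minimal sketch of that idea, assuming the activity keeps a reference to the realtime session; the exact attribute and method names are assumptions, not the actual implementation:

def clear_user_turn(self) -> None:
    if self._rt_session is not None:
        # hypothetical: also drop any audio already buffered on the realtime
        # session so the next turn starts from an empty input_audio_buffer
        self._rt_session.clear_audio()
    # keep the existing cascaded-path behavior of discarding pending user input
    self._audio_recognition.clear_user_turn()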
        response=RealtimeResponseCreateParams(),
    )
)
self.clear_audio()
why a clear_audio is needed here?
Send input_audio_buffer.clear before beginning a new user input.
I think this means you need to clear the buffer before the next time you want to start a new user speech, not that it's required after response.create for this turn.
Maybe it's similar to the session.clear_user_turn() in the example @bml1g12 mentioned above.
Yeah, I can put that call in the clear_user_turn part.
Turns out we don't need this if we call session.clear_user_turn.
Ah yes, clear_user_turn already calls clear_audio() under the hood it seems, so indeed you probably want to clear audio when you start a new turn, not when you end the turn - and that indeed means it's probably not needed in this PR.
longcw
left a comment
lgtm, my last question:
def commit_user_turn(self, *, transcript_timeout: float, stt_flush_duration: float) -> None:
    if self._rt_session is not None:
        self._rt_session.commit_user_turn()
        return
can we commit to both rt session and audio recognition?
Yeah, that's a good point. I don't see why we can't. I will skip the return here.
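For illustration, dropping the early return might look roughly like this; the audio-recognition call and its signature are assumptions based on the existing STT commit path, not the exact diff:

def commit_user_turn(self, *, transcript_timeout: float, stt_flush_duration: float) -> None:
    if self._rt_session is not None:
        # commit the buffered audio on the realtime session
        self._rt_session.commit_user_turn()
    # no early return: also run the audio-recognition (STT) commit path
    self._audio_recognition.commit_user_turn(
        transcript_timeout=transcript_timeout,
        stt_flush_duration=stt_flush_duration,
    )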
I think that would be relevant when using external STT with the OpenAI realtime model, right? Actually we are currently exploring that, i.e. using manual turn taking with a discrete STT engine (not using OpenAI's server-side Whisper as STT).
I wonder, if we take this approach, whether it's possible to ensure the local and remote chat contexts are consistent, and also that we have a local chat context we can use to recover the realtime session if there are any networking issues.
It seems to me that when we commit the STT transcripts to the local context (which maybe is what would happen if we commit the audio recognition in this code?) we get a healthy-looking local LiveKit chat context, but that somehow (presumably by LiveKit design) gets upserted automatically to the remote OpenAI chat context, and I'm not sure whether that's safe, given that from the OpenAI server-side perspective they automatically add an empty user "content" message when server-side STT is disabled (which I assume represents OpenAI's internal representation of those audio tokens, which is not human readable, hence no "content").
This Slack post explains in more detail what I am referring to.
It looks to me that if using an external STT, we would ideally want to put the user STT transcriptions into the local chat context (i.e. so it can be recovered if the server connection is broken), but there is no need to upsert them to the remote context (because the OpenAI realtime API model has no need at all for user-side STT transcriptions, to my understanding, given it processes raw audio tokens). Is that possible in LiveKit, I wonder? I imagine LiveKit typically tries to keep local and remote contexts in sync, which creates a (theoretical) problem here when we have local STT not needed by the remote model. Although this is a theoretical problem, I should note I have not yet seen any adverse effects of it; I'm just being cautious, as it would be tough to spot if there were an issue.
STT is essentially emitting the final transcript for the realtime model in this case, and the realtime model doesn't really process/respond to final transcripts (different from generate_reply, which is direct user input).
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/voice/agent_activity.py`:
- Around line 978-981: Wrap the realtime call in commit_user_turn so exceptions
from self._rt_session.commit_user_turn() (e.g., RealtimeError) are caught and do
not prevent the STT commit from running; specifically, in the commit_user_turn
method surround the call to self._rt_session.commit_user_turn() with a
try/except that catches RealtimeError (or a broad Exception if RealtimeError
isn't imported) and ignores or logs it, then ensure the subsequent STT commit
still executes unconditionally.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
livekit-agents/livekit/agents/voice/agent_activity.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings
Files:
livekit-agents/livekit/agents/voice/agent_activity.py
🧠 Learnings (1)
📚 Learning: 2026-01-22T03:28:16.289Z
Learnt from: longcw
Repo: livekit/agents PR: 4563
File: livekit-agents/livekit/agents/beta/tools/end_call.py:65-65
Timestamp: 2026-01-22T03:28:16.289Z
Learning: In code paths that check capabilities or behavior of the LLM processing the current interaction, prefer using the activity's LLM obtained via ctx.session.current_agent._get_activity_or_raise().llm instead of ctx.session.llm. The session-level LLM may be a fallback and not reflect the actual agent handling the interaction. Use the activity LLM to determine capabilities and to make capability checks or feature toggles relevant to the current processing agent.
Applied to files:
livekit-agents/livekit/agents/voice/agent_activity.py
🧬 Code graph analysis (1)
livekit-agents/livekit/agents/voice/agent_activity.py (2)
livekit-agents/livekit/agents/voice/audio_recognition.py (1)
  commit_user_turn (248-320)
livekit-agents/livekit/agents/voice/agent_session.py (1)
  commit_user_turn (978-998)
⏰ Context from checks skipped due to timeout of 90000ms (4). You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).
- GitHub Check: livekit-plugins-deepgram
- GitHub Check: unit-tests
- GitHub Check: type-check (3.9)
- GitHub Check: type-check (3.13)
def commit_user_turn(self, *, transcript_timeout: float, stt_flush_duration: float) -> None:
    if self._rt_session is not None:
        self._rt_session.commit_user_turn()
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check for commit_user_turn implementation and exception types
rg -n "def commit_user_turn" --type py -A 5Repository: livekit/agents
Length of output: 6596
🏁 Script executed:
#!/bin/bash
# Search for llm.RealtimeError or similar exception definitions
rg -n "class.*RealtimeError|RealtimeError" --type py -A 2Repository: livekit/agents
Length of output: 9008
🏁 Script executed:
#!/bin/bash
# Check the context around lines 978-981 to see what follows commit_user_turn
sed -n '975,990p' livekit-agents/livekit/agents/voice/agent_activity.py
Repository: livekit/agents
Length of output: 782
🏁 Script executed:
#!/bin/bash
# Search for AudioRecognition.commit_user_turn to understand the full flow
rg -n "AudioRecognition.*commit_user_turn|\.commit_user_turn" livekit-agents/livekit/agents/voice/agent_activity.py -B 2 -A 2Repository: livekit/agents
Length of output: 494
Wrap realtime commit to ensure STT commit always runs.
The _rt_session.commit_user_turn() call should be guarded with exception handling. While current provider implementations don't raise exceptions, this ensures the STT commit at line 983 always executes even if a future provider implementation raises RealtimeError. This pattern is already established elsewhere in the codebase for other realtime operations.
Suggested fix
  if self._rt_session is not None:
-     self._rt_session.commit_user_turn()
+     try:
+         self._rt_session.commit_user_turn()
+     except llm.RealtimeError:
+         logger.exception("failed to commit user turn on realtime session")
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def commit_user_turn(self, *, transcript_timeout: float, stt_flush_duration: float) -> None:
    if self._rt_session is not None:
        try:
            self._rt_session.commit_user_turn()
        except llm.RealtimeError:
            logger.exception("failed to commit user turn on realtime session")
This adds commit_user_turn support for realtime models. This allows users to use turn_detection="manual" with a realtime model.
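For context, a minimal usage sketch mirroring the manual-turn docs example quoted earlier in this thread; the end_turn RPC name and its wiring are illustrative, not part of this PR:

# When the user finishes speaking (e.g. push-to-talk release)
@ctx.room.local_participant.register_rpc_method("end_turn")
async def end_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)  # Stop listening
    session.commit_user_turn()  # Commit the pending input so the realtime model responds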