feat(realtime): Add audio conversations #6245
Conversation
It's not clear to me if we have audio support in llama.cpp: ggml-org/llama.cpp#15194
My initial thought on this was to use the whisper backend to transcribe the audio segments produced by VAD and feed the text to a text-to-text backend; that way we can always fall back on this approach. There is also an interface that was created exactly for this, so a pipeline can be treated as "drag and drop" until omni models are really capable. That said, yes, audio input is actually supported by llama.cpp and our backends: try qwen2-omni and you will be able to give it audio as input, but it isn't very accurate (transcription works better for now).
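The VAD → transcription → text-to-text flow described above can be sketched as a chain of swappable stages. This is an illustrative sketch, not LocalAI's actual interfaces; the type and field names here are invented for the example:

```go
package main

import (
	"fmt"
	"strings"
)

// Each stage of the hypothetical pipeline is a plain function type, so
// backends can be swapped "drag and drop" style until omni models mature.
type (
	VADFunc        func(pcm []byte) bool            // detect end of speech
	TranscribeFunc func(pcm []byte) (string, error) // e.g. a whisper backend
	ChatFunc       func(text string) (string, error)
	SpeakFunc      func(text string) ([]byte, error) // a TTS backend
)

type Pipeline struct {
	VAD        VADFunc
	Transcribe TranscribeFunc
	Chat       ChatFunc
	Speak      SpeakFunc
}

// Run pushes one committed audio buffer through the whole chain.
func (p Pipeline) Run(pcm []byte) ([]byte, error) {
	if !p.VAD(pcm) {
		return nil, nil // still speaking, nothing to commit yet
	}
	text, err := p.Transcribe(pcm)
	if err != nil {
		return nil, err
	}
	reply, err := p.Chat(text)
	if err != nil {
		return nil, err
	}
	return p.Speak(reply)
}

func main() {
	// Stub stages stand in for real whisper/LLM/TTS backends.
	p := Pipeline{
		VAD:        func(pcm []byte) bool { return len(pcm) > 0 },
		Transcribe: func(pcm []byte) (string, error) { return string(pcm), nil },
		Chat:       func(text string) (string, error) { return "echo: " + text, nil },
		Speak:      func(text string) ([]byte, error) { return []byte(strings.ToUpper(text)), nil },
	}
	out, err := p.Run([]byte("hello"))
	fmt.Println(string(out), err)
}
```

Because each stage is just a function value, replacing whisper with an omni model later only means collapsing Transcribe and Chat into one stage.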
OK, I tried Qwen 2 omni and had issues with accuracy and context length, which aren't a problem for a pipeline.
OpenAI made quite a few changes to the API which possibly would have been better to handle before this, but there are also changes in flight to the Go realtime API library AFAICT. I really want to get something working, so I am ignoring these changes for now and will address them afterwards.
And it works! There is a long list of issues; however, I have the full pipeline working.
To be clear, probably nobody will want to use this in its current state, but we could merge it for my own experimentation and so I don't have to keep rebasing on master. Next I need to update the API to the current OpenAI GA. @mudler
Build error: "E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/libc/libcaca/libcaca0_0.99.beta20-4ubuntu0.1_amd64.deb 404 Not Found [IP: 91.189.91.83 80]". Strange, since I can download this file.
Pull request was converted to draft
```diff
 // So possibly we could have a way to configure a composite model that can be used in situations where any-to-any is expected
 pipeline := config.Pipeline{
-	VAD: vadModel,
+	VAD: defaultVADModel,
```
Wouldn't this hardcode models? Can't we stick to the configuration that can be defined in:

LocalAI/core/config/model_config.go, line 136 in 5fe9bf9
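A configurable, non-hardcoded pipeline could be expressed in the model YAML along these lines. This is a sketch only: the field and model names below are illustrative, not the actual schema from model_config.go:

```yaml
# Hypothetical composite-model definition: each stage of the realtime
# pipeline references an already-configured model instead of a hardcoded one.
name: voice-assistant
pipeline:
  vad: silero-vad
  transcription: whisper-base
  llm: qwen3-4b-instruct
  tts: piper-en-us
```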
I have done this now with a new interface similar to the various backend functions, instead of the grpc.Backend-like interface that was there before. I tried doing it the grpc.Backend way by integrating Pipeline models into the backend loader, then ran into issues, started refactoring, and things started to get crazy: https://github.com/mudler/LocalAI/compare/master...richiejp:LocalAI:chore/realtime-refactor-model-loader?expand=1

Probably, though, there is a better abstraction that would combine ModelLoader, ModelConfigLoader, ApplicationConfig, etc. into one argument in all the places we are passing them around, but that basically touches the entire codebase.
Yes, totally, there is room for improvement here and to uplevel the abstraction a bit to make it easier to pass around the various components. I have slowly started to refactor things around https://github.com/mudler/LocalAI/tree/master/core/application so there is one component that also orchestrates the various parts at a high level (stopping and starting components too).

But I guess we can slowly get into using it step by step, or have a separate PR addressing the refactoring.
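The kind of single "orchestrator" component being discussed could look roughly like this. It is only a sketch of the idea of bundling the separately-passed pieces into one aggregate; the struct fields and contents are invented for illustration, not the actual core/application code:

```go
package main

import "fmt"

// Simplified stand-ins for the components currently passed around
// separately throughout the codebase.
type ModelLoader struct{ basePath string }
type ModelConfigLoader struct{ configs map[string]string }
type ApplicationConfig struct{ Threads int }

// Application bundles them so handlers take one argument instead of three.
type Application struct {
	ModelLoader       *ModelLoader
	ModelConfigLoader *ModelConfigLoader
	Config            *ApplicationConfig
}

func NewApplication() *Application {
	return &Application{
		ModelLoader:       &ModelLoader{basePath: "models"},
		ModelConfigLoader: &ModelConfigLoader{configs: map[string]string{}},
		Config:            &ApplicationConfig{Threads: 4},
	}
}

func main() {
	app := NewApplication()
	fmt.Println(app.Config.Threads)
}
```

The orchestrator can then also own component lifecycle (start/stop), as mentioned, since it already holds references to everything.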
Signed-off-by: Richard Palethorpe <io@richiejp.com>
fix(realtime): Send delta and done events for tool calls and audio transcripts

Ensure that content is sent in both deltas and done events for function call arguments and audio transcripts. This fixes compatibility with clients that rely on delta events for parsing.

💘 Generated with Crush
Signed-off-by: Richard Palethorpe <io@richiejp.com>
fix(realtime): Improve tool call handling and error reporting

- Refactor Model interface to accept []types.ToolUnion and *types.ToolChoiceUnion instead of JSON strings, eliminating unnecessary marshal/unmarshal cycles
- Fix Parameters field handling: support both map[string]any and JSON string formats
- Add PredictConfig() method to Model interface for accessing model configuration
- Add comprehensive debug logging for tool call parsing and function config
- Add missing return statement after prediction error (critical bug fix)
- Add warning logs for NoAction function argument parsing failures
- Improve error visibility throughout generateResponse function

💘 Generated with Crush
Assisted-by: Claude Sonnet 4.5 via Crush <crush@charm.land>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
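The first change in that commit, passing typed tool values instead of JSON strings, amounts to an interface change along these lines. The types below are simplified stand-ins for illustration, not the real types package:

```go
package main

import "fmt"

// Simplified stand-ins for the realtime API's tool types.
type ToolUnion struct{ Name string }
type ToolChoiceUnion struct{ Mode string }

// Before, tools arrived as JSON strings and every implementation had to
// unmarshal them; now the interface takes typed values directly.
type Model interface {
	Predict(prompt string, tools []ToolUnion, choice *ToolChoiceUnion) (string, error)
}

type echoModel struct{}

func (echoModel) Predict(prompt string, tools []ToolUnion, choice *ToolChoiceUnion) (string, error) {
	// No marshal/unmarshal cycle: tools are used as-is.
	return fmt.Sprintf("%s (tools: %d)", prompt, len(tools)), nil
}

func main() {
	var m Model = echoModel{}
	out, _ := m.Predict("hi", []ToolUnion{{Name: "get_weather"}}, nil)
	fmt.Println(out)
}
```

Moving the (un)marshalling to the API boundary also means a malformed tool payload fails once at decode time rather than inside every model implementation.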
* feat(realtime): Add audio conversations
* chore(realtime): Vendor the updated API and modify for server side
* feat(realtime): Update to the GA realtime API
* chore: Document realtime API and add docs to AGENTS.md
* feat: Filter reasoning from spoken output
* fix(realtime): Send delta and done events for tool calls and audio transcripts
* fix(realtime): Improve tool call handling and error reporting

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Description
Add enough realtime API features to allow talking with an LLM using only audio.
Presently the realtime API only supports transcription which is a minor use-case for it. This PR should allow it to be used with a basic voice assistant.
This PR ignores many of the options and edge cases. Instead it will, for example, rely on server-side VAD to commit conversation items.
Notes for Reviewers
Extra:
Fixes: #3714 (but we'll need follow-up issues)
Signed commits