feat(realtime): Add audio conversations #6245
Conversation
It's not clear to me if we have audio support in llama.cpp: ggml-org/llama.cpp#15194
My initial thought on this was to use the whisper backend to transcribe the audio segments produced by VAD and feed the text to a text-to-text backend; that way we can always fall back on this approach. There is also an interface that was created exactly for this, so a pipeline can be treated as "drag and drop" until omni models are really capable. That said, yes, audio input is actually supported by llama.cpp and our backends: try qwen2-omni and you will be able to give it audio as input, but it isn't very accurate (transcription works better for now).
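The VAD → transcription → text-to-text flow described above can be sketched as a chain of swappable stages. This is an illustrative sketch, not LocalAI's actual interfaces; the type and field names here are invented for the example:

```go
package main

import (
	"fmt"
	"strings"
)

// Each stage of the hypothetical pipeline is a plain function type, so
// backends can be swapped "drag and drop" style until omni models mature.
type (
	VADFunc        func(pcm []byte) bool            // detect end of speech
	TranscribeFunc func(pcm []byte) (string, error) // e.g. a whisper backend
	ChatFunc       func(text string) (string, error)
	SpeakFunc      func(text string) ([]byte, error) // a TTS backend
)

type Pipeline struct {
	VAD        VADFunc
	Transcribe TranscribeFunc
	Chat       ChatFunc
	Speak      SpeakFunc
}

// Run pushes one committed audio buffer through the whole chain.
func (p Pipeline) Run(pcm []byte) ([]byte, error) {
	if !p.VAD(pcm) {
		return nil, nil // still speaking, nothing to commit yet
	}
	text, err := p.Transcribe(pcm)
	if err != nil {
		return nil, err
	}
	reply, err := p.Chat(text)
	if err != nil {
		return nil, err
	}
	return p.Speak(reply)
}

func main() {
	// Stub stages stand in for real whisper/LLM/TTS backends.
	p := Pipeline{
		VAD:        func(pcm []byte) bool { return len(pcm) > 0 },
		Transcribe: func(pcm []byte) (string, error) { return string(pcm), nil },
		Chat:       func(text string) (string, error) { return "echo: " + text, nil },
		Speak:      func(text string) ([]byte, error) { return []byte(strings.ToUpper(text)), nil },
	}
	out, err := p.Run([]byte("hello"))
	fmt.Println(string(out), err)
}
```

Because each stage is just a function value, replacing whisper with an omni model later only means collapsing Transcribe and Chat into one stage.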
OK, I tried Qwen 2 omni and had issues with accuracy and context length, which aren't a problem for a pipeline.
OpenAI made quite a few changes to the API which possibly would have been better to handle before this, but there are also changes in flight to the Go realtime API library AFAICT. I really want to get something working, so I am ignoring these changes for now and will address them afterwards.
And it works! There is a long list of issues; however, I have the full pipeline working.
To be clear, probably nobody will want to use this in its current state, but we could merge it for my own experimentation and so I don't have to keep rebasing on master. Next I need to update the API to the current OpenAI GA. @mudler
Build error: "E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/libc/libcaca/libcaca0_0.99.beta20-4ubuntu0.1_amd64.deb 404 Not Found [IP: 91.189.91.83 80]". Strange, since I can download this file.
Pull request was converted to draft
```diff
 // So possibly we could have a way to configure a composite model that can be used in situations where any-to-any is expected
 pipeline := config.Pipeline{
-	VAD: vadModel,
+	VAD: defaultVADModel,
```
Wouldn't this hardcode models? Can't we stick to the configuration that can be defined in:

LocalAI/core/config/model_config.go, line 136 in 5fe9bf9
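A configurable, non-hardcoded pipeline could be expressed in the model YAML along these lines. This is a sketch only: the field and model names below are illustrative, not the actual schema from model_config.go:

```yaml
# Hypothetical composite-model definition: each stage of the realtime
# pipeline references an already-configured model instead of a hardcoded one.
name: voice-assistant
pipeline:
  vad: silero-vad
  transcription: whisper-base
  llm: qwen3-4b-instruct
  tts: piper-en-us
```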
I have done this now with a new interface similar to the various backend functions, instead of the grpc.Backend-like interface that was there before. I tried doing it the grpc.Backend way by integrating Pipeline models into the backend loader, then ran into issues, started refactoring, and things started to get crazy: https://github.com/mudler/LocalAI/compare/master...richiejp:LocalAI:chore/realtime-refactor-model-loader?expand=1

Probably, though, there is a better abstraction that would combine ModelLoader, ModelConfigLoader, ApplicationConfig, etc. into one argument in all the places we are passing them around, but that basically touches the entire codebase.
Yes, totally, there is room for improvement here and to uplevel the abstraction a bit to make it easier to pass around the various components. I have slowly started to refactor things around https://github.com/mudler/LocalAI/tree/master/core/application so there is one component that also orchestrates the various parts at a high level (stopping and starting components too).

But I guess we can slowly get into using it step by step, or have a separate PR addressing the refactoring.
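The kind of single "orchestrator" component being discussed could look roughly like this. It is only a sketch of the idea of bundling the separately-passed pieces into one aggregate; the struct fields and contents are invented for illustration, not the actual core/application code:

```go
package main

import "fmt"

// Simplified stand-ins for the components currently passed around
// separately throughout the codebase.
type ModelLoader struct{ basePath string }
type ModelConfigLoader struct{ configs map[string]string }
type ApplicationConfig struct{ Threads int }

// Application bundles them so handlers take one argument instead of three.
type Application struct {
	ModelLoader       *ModelLoader
	ModelConfigLoader *ModelConfigLoader
	Config            *ApplicationConfig
}

func NewApplication() *Application {
	return &Application{
		ModelLoader:       &ModelLoader{basePath: "models"},
		ModelConfigLoader: &ModelConfigLoader{configs: map[string]string{}},
		Config:            &ApplicationConfig{Threads: 4},
	}
}

func main() {
	app := NewApplication()
	fmt.Println(app.Config.Threads)
}
```

The orchestrator can then also own component lifecycle (start/stop), as mentioned, since it already holds references to everything.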
Signed-off-by: Richard Palethorpe <io@richiejp.com>
fix(realtime): Send delta and done events for tool calls and audio transcripts

Ensure that content is sent in both deltas and done events for function call arguments and audio transcripts. This fixes compatibility with clients that rely on delta events for parsing.

💘 Generated with Crush
Signed-off-by: Richard Palethorpe <io@richiejp.com>
fix(realtime): Improve tool call handling and error reporting

- Refactor Model interface to accept []types.ToolUnion and *types.ToolChoiceUnion instead of JSON strings, eliminating unnecessary marshal/unmarshal cycles
- Fix Parameters field handling: support both map[string]any and JSON string formats
- Add PredictConfig() method to Model interface for accessing model configuration
- Add comprehensive debug logging for tool call parsing and function config
- Add missing return statement after prediction error (critical bug fix)
- Add warning logs for NoAction function argument parsing failures
- Improve error visibility throughout generateResponse function

💘 Generated with Crush
Assisted-by: Claude Sonnet 4.5 via Crush <crush@charm.land>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
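The first change in that commit, passing typed tool values instead of JSON strings, amounts to an interface change along these lines. The types below are simplified stand-ins for illustration, not the real types package:

```go
package main

import "fmt"

// Simplified stand-ins for the realtime API's tool types.
type ToolUnion struct{ Name string }
type ToolChoiceUnion struct{ Mode string }

// Before, tools arrived as JSON strings and every implementation had to
// unmarshal them; now the interface takes typed values directly.
type Model interface {
	Predict(prompt string, tools []ToolUnion, choice *ToolChoiceUnion) (string, error)
}

type echoModel struct{}

func (echoModel) Predict(prompt string, tools []ToolUnion, choice *ToolChoiceUnion) (string, error) {
	// No marshal/unmarshal cycle: tools are used as-is.
	return fmt.Sprintf("%s (tools: %d)", prompt, len(tools)), nil
}

func main() {
	var m Model = echoModel{}
	out, _ := m.Predict("hi", []ToolUnion{{Name: "get_weather"}}, nil)
	fmt.Println(out)
}
```

Moving the (un)marshalling to the API boundary also means a malformed tool payload fails once at decode time rather than inside every model implementation.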
* feat(realtime): Add audio conversations
* chore(realtime): Vendor the updated API and modify for server side
* feat(realtime): Update to the GA realtime API
* chore: Document realtime API and add docs to AGENTS.md
* feat: Filter reasoning from spoken output
* fix(realtime): Send delta and done events for tool calls and audio transcripts
* fix(realtime): Improve tool call handling and error reporting

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Description
Add enough realtime API features to allow talking with an LLM using only audio.
Presently the realtime API only supports transcription which is a minor use-case for it. This PR should allow it to be used with a basic voice assistant.
This PR ignores many of the options and edge cases. Instead it will, for example, rely on server-side VAD to commit conversation items.
Notes for Reviewers
Extra:
Fixes: #3714 (but we'll need follow-up issues)
Signed commits