
# ai-video-voice-agent

## 1. MVP Overview

Goal:

Deliver a voice-first, camera‑aware multimodal agent that users can speak to naturally.

The agent listens, sees through the webcam (or uploaded images), and responds via voice + text.

Key Capabilities:

- Hands-free operation (wake word + continuous listening)
- Real-time speech-to-text (STT)
- Real-time natural speech output (TTS)
- Vision input via webcam or uploaded images
- Multimodal reasoning using your preferred LLM (GPT-5-1 or others)
- Multi-step action handling (plans, summaries, instructions)

User Flow (MVP version):

1. User clicks Start (or says wake word)
2. Mic + camera turn on
3. User speaks a request
4. App captures audio and/or frame(s)
5. Sends STT + image(s) + context to LLM
6. LLM returns answer + optional action structure
7. TTS plays reply
8. Loop

## 2. Functional Requirements

### Voice

- Continuous microphone stream or push-to-talk
- Local VAD (voice activity detection) to avoid unnecessary STT cost
- Real-time STT using the OpenAI Realtime API (recommended) or Whisper locally
- Natural TTS via OpenAI Realtime voice output, or a simple local TTS engine such as Piper
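The local VAD above can start as a plain energy gate before reaching for an ML-based detector. A minimal sketch (the 0.01 RMS threshold is an assumed default to calibrate per microphone, not a standard value):

```typescript
// Minimal energy-based VAD: treat an audio frame as speech when its
// RMS energy crosses a threshold. Enough to gate STT requests in an MVP.
export function rmsEnergy(frame: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

export function isSpeech(frame: Float32Array, threshold = 0.01): boolean {
  return rmsEnergy(frame) > threshold;
}
```

Feed it the `Float32Array` chunks coming out of an AudioWorklet and only forward chunks to STT while `isSpeech` is true.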

### Vision

- Periodic frame capture from webcam while listening
- Manual image capture button
- Computer vision tasks:
  - Object recognition
  - Scene understanding
  - OCR (if needed)
  - Image-based follow-up questions
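Frames sent to a vision model are usually downscaled first to cut bandwidth and token cost. A small helper for computing the target size (the 768 px cap is an assumed default, not a requirement of any particular API):

```typescript
// Compute output dimensions for a captured frame so its longest side is
// at most `maxSide`, preserving aspect ratio. Apply the result to a
// <canvas> before exporting the JPEG that gets base64-encoded.
export function fitDimensions(
  width: number,
  height: number,
  maxSide = 768
): { width: number; height: number } {
  const longest = Math.max(width, height);
  if (longest <= maxSide) return { width, height };
  const scale = maxSide / longest;
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}
```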

### LLM Reasoning Layer

The agent should:

- Combine the STT transcript + visual context
- Produce structured output:
  - natural language
  - action JSON (optional)
- Store conversation + vision context in memory

JSON Action format example:

```json
{ "action": "answer", "content": "The plant looks underwatered." }
```
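On the client, this action format can be typed and parsed defensively, since models sometimes wrap JSON in prose. A sketch (the extraction regex is a pragmatic assumption, not a full JSON scanner):

```typescript
// Action structure returned by the LLM, matching the example above.
export interface AgentAction {
  action: string;   // e.g. "answer"
  content: string;
}

// Extract the first {...} span from the raw reply and validate its shape.
export function parseAction(raw: string): AgentAction | null {
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    const obj = JSON.parse(match[0]);
    if (typeof obj.action === "string" && typeof obj.content === "string") {
      return obj as AgentAction;
    }
    return null;
  } catch {
    return null;
  }
}
```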

## 3. System Architecture (MVP)

### Client (Next.js or React)

- Mic capture
- Webcam capture
- Streams audio + optional frames to backend
- Receives LLM response + TTS stream
- Plays TTS audio
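Streaming audio usually means converting the browser's Float32 samples to 16-bit PCM before sending them on. A sketch of that conversion (the exact wire format your STT endpoint expects is an assumption to verify):

```typescript
// Convert Web Audio Float32 samples (range [-1, 1]) to 16-bit signed
// PCM, the format commonly expected by STT endpoints.
export function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range samples
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;        // scale to int16 range
  }
  return out;
}
```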

### Backend

- Optional if using OpenAI Realtime directly from the frontend
- Responsible for:
  - STT if using Whisper locally
  - Frame processing if using server-side CV models
  - Proxying LLM requests for safety or secret protection
  - Session memory store (SQLite or Redis)

### LLM (GPT-5-1 or Realtime model)

- Receives text + image inputs
- Produces a structured response

## 4. File & Folder Structure (ideal for Cursor)

```
/app
  /api
    realtime.ts          # optional backend proxy
  /components
    VoiceController.tsx
    VisionController.tsx
    AgentConsole.tsx
  /utils
    audio.ts             # mic utils
    camera.ts            # webcam utils
    llm.ts               # calls to LLM / Realtime API
    memory.ts            # session memory
  page.tsx               # main UI

/backend
  server.js              # (or python/main.py)
```

## 5. Interface Specifications

### 5.1 STT Stream Contract

Client → Backend or directly → OpenAI Realtime

```json
{
  "audio": "<base64 audio chunk>",
  "session_id": "abc123",
  "timestamp": 123456
}
```

Response:

```json
{ "text": "User speech transcript" }
```

### 5.2 Vision Input Contract

```json
{
  "image_base64": "...",
  "session_id": "abc123",
  "context": "last user utterance"
}
```
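These two contracts can be pinned down as shared TypeScript types used by both client and backend. Field names follow the JSON above; nothing here is an external API:

```typescript
// 5.1 — STT stream contract
export interface SttRequest {
  audio: string;      // base64-encoded audio chunk
  session_id: string;
  timestamp: number;  // ms since epoch
}

export interface SttResponse {
  text: string;
}

// 5.2 — Vision input contract
export interface VisionRequest {
  image_base64: string;
  session_id: string;
  context: string;    // last user utterance
}

// Small builder that stamps the timestamp so callers can't forget it.
export function makeSttRequest(audio: string, sessionId: string): SttRequest {
  return { audio, session_id: sessionId, timestamp: Date.now() };
}
```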

### 5.3 LLM Prompt Template (MVP)

```
SYSTEM: You are a voice-first, vision-aware assistant.
Use speech-friendly responses.
If images are present, reason based on them.
Output as JSON:

{
  "answer": "",
  "steps": ["...optional reasoning..."],
  "actions": ["...optional actions user should take..."]
}
```
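Because the model is asked for JSON but may still return plain prose, it's worth parsing its reply defensively and falling back to speaking the raw text. A sketch against the template above:

```typescript
export interface AgentReply {
  answer: string;
  steps?: string[];
  actions?: string[];
}

// Parse the LLM reply against the prompt template; if it isn't valid
// JSON with an "answer" field, treat the whole reply as the answer.
export function parseReply(raw: string): AgentReply {
  try {
    const obj = JSON.parse(raw);
    if (obj && typeof obj.answer === "string") {
      return {
        answer: obj.answer,
        steps: Array.isArray(obj.steps) ? obj.steps : undefined,
        actions: Array.isArray(obj.actions) ? obj.actions : undefined,
      };
    }
  } catch {
    // fall through to the plain-text fallback
  }
  return { answer: raw };
}
```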

## 6. Core Agent Loop (Pseudo-code)

```
loop:
    stt_text = listen_for_speech()
    frame = capture_frame()

    payload = {
        "user_text": stt_text,
        "image": frame
    }

    llm_response = callLLM(payload)

    play_tts(llm_response.answer)

    update_memory(llm_response)
```
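The pseudocode maps almost directly onto an async TypeScript function. The version below injects its I/O as parameters, which keeps a single iteration unit-testable; every function name here is a placeholder, not an existing API:

```typescript
export interface AgentDeps {
  listenForSpeech: () => Promise<string>;
  captureFrame: () => Promise<string | null>; // base64 frame, or null if camera is off
  callLLM: (payload: { user_text: string; image: string | null }) => Promise<{ answer: string }>;
  playTTS: (text: string) => Promise<void>;
  updateMemory: (reply: { answer: string }) => void;
}

// One pass of the core loop; the real app would run this repeatedly
// until the user stops the session.
export async function agentTick(deps: AgentDeps): Promise<string> {
  const userText = await deps.listenForSpeech();
  const frame = await deps.captureFrame();
  const reply = await deps.callLLM({ user_text: userText, image: frame });
  await deps.playTTS(reply.answer);
  deps.updateMemory(reply);
  return reply.answer;
}
```

Injecting the dependencies also makes it trivial to swap the local-Whisper path for the Realtime API later without touching the loop itself.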

## 7. MVP Scope (tight version)

➤ Audio in → text

➤ Optional camera frame → vision reasoning

➤ LLM answer

➤ Voice out via TTS

➤ Simple UI

Nothing more.

## 8. Stretch Features (for later)

- Wake-word detection (Picovoice Porcupine or similar)
- Local vision inference (SSD, YOLOv8, etc.)
- Multi-agent orchestration
- Tool calling (control browser, email, notes)
- Long-term memory
- Semantic search over past conversations

## 9. Ready-to-Develop Tasks for Cursor

Task 1 — Create minimal UI

- Microphone button
- Camera on/off toggle
- Stream logs panel

Task 2 — Implement mic stream → Realtime API

Task 3 — Implement webcam capture

Task 4 — Implement multimodal LLM call with:

- STT transcript
- Image (if any)
- System template

Task 5 — Implement TTS playback

Task 6 — Add conversation memory

Simple JSON store per session
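The per-session JSON store can start as an in-memory map with a serializable shape, swapped for SQLite or Redis later. A minimal sketch:

```typescript
export interface Turn {
  role: "user" | "assistant";
  text: string;
  timestamp: number;
}

// Simplest possible session memory: a Map of session_id -> turns,
// serializable to JSON for persistence between restarts.
export class SessionMemory {
  private sessions = new Map<string, Turn[]>();

  append(sessionId: string, role: Turn["role"], text: string): void {
    const turns = this.sessions.get(sessionId) ?? [];
    turns.push({ role, text, timestamp: Date.now() });
    this.sessions.set(sessionId, turns);
  }

  history(sessionId: string): Turn[] {
    return this.sessions.get(sessionId) ?? [];
  }

  toJSON(): Record<string, Turn[]> {
    return Object.fromEntries(this.sessions);
  }
}
```

`JSON.stringify(memory)` then yields the per-session store, ready to write to disk or hand to a prompt builder.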
