1. MVP Overview
Goal:
Deliver a voice-first, camera‑aware multimodal agent that users can speak to naturally.
The agent listens, sees through the webcam (or uploaded images), and responds via voice + text.
Key Capabilities:
- Hands-free operation (wake word + continuous listening)
- Real-time speech-to-text (STT)
- Real-time natural speech output (TTS)
- Vision input via webcam or uploaded images
- Multimodal reasoning using your preferred LLM (GPT-5-1 or others)
- Multi-step action handling (plans, summaries, instructions)
User Flow (MVP version):
1. User clicks Start (or says the wake word)
2. Mic + camera turn on
3. User speaks a request
4. App captures audio and/or frame(s)
5. Sends STT text + image(s) + context to the LLM
6. LLM returns an answer + optional action structure
7. TTS plays the reply
8. Loop back to step 3
2. Functional Requirements
Voice
- Continuous microphone stream or push-to-talk
- Local VAD (voice activity detection) to avoid unnecessary STT cost
- Real-time STT using the OpenAI Realtime API (recommended) or Whisper run locally
- Natural TTS via OpenAI Realtime voice output, or a simple local TTS engine such as Piper
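As a starting point, here is a minimal sketch of browser mic capture with a naive energy-based VAD for `utils/audio.ts`. The helper name, threshold, and chunk interval are illustrative assumptions; a dedicated VAD model would be more robust in noisy environments.

```typescript
// utils/audio.ts (capture half) - minimal mic capture with a naive energy-based VAD.
export async function startMicCapture(onSpeechChunk: (chunk: Blob) => void) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // Measure signal level with an AnalyserNode to decide when the user is speaking.
  const audioCtx = new AudioContext();
  const source = audioCtx.createMediaStreamSource(stream);
  const analyser = audioCtx.createAnalyser();
  analyser.fftSize = 2048;
  source.connect(analyser);

  const samples = new Float32Array(analyser.fftSize);
  let speaking = false;

  // Record in short chunks; only forward chunks captured while speech is detected,
  // so silence never reaches the STT service.
  const recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (e) => {
    if (speaking && e.data.size > 0) onSpeechChunk(e.data);
  };
  recorder.start(250); // emit a chunk every 250 ms

  const SPEECH_THRESHOLD = 0.02; // illustrative value; tune for your mic/environment
  setInterval(() => {
    analyser.getFloatTimeDomainData(samples);
    const rms = Math.sqrt(samples.reduce((sum, s) => sum + s * s, 0) / samples.length);
    speaking = rms > SPEECH_THRESHOLD;
  }, 100);
}
```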
Vision
- Periodic frame capture from the webcam while listening
- Manual image capture button
- Computer vision tasks:
  - Object recognition
  - Scene understanding
  - OCR (if needed)
  - Image-based follow-up questions
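A minimal sketch of webcam start-up and single-frame capture for `utils/camera.ts`, assuming frames are sent as base64 JPEG per the 5.2 contract below (function names are illustrative):

```typescript
// utils/camera.ts - illustrative webcam start + single-frame capture as base64 JPEG.
let videoEl: HTMLVideoElement | null = null;

export async function startCamera(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  videoEl = document.createElement("video");
  videoEl.srcObject = stream;
  await videoEl.play();
}

export function captureFrame(): string | undefined {
  if (!videoEl || videoEl.videoWidth === 0) return undefined; // camera off or not ready yet

  const canvas = document.createElement("canvas");
  canvas.width = videoEl.videoWidth;
  canvas.height = videoEl.videoHeight;
  canvas.getContext("2d")!.drawImage(videoEl, 0, 0);

  // Strip the data URL prefix so only the base64 payload is sent (see the 5.2 contract).
  return canvas.toDataURL("image/jpeg", 0.8).split(",")[1];
}
```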
LLM Reasoning Layer
The agent should:
- Combine the STT transcript + visual context
- Produce structured output:
  - natural language
  - action JSON (optional)
- Store conversation + vision context in memory
JSON Action format example:
```json
{
  "action": "answer",
  "content": "The plant looks underwatered."
}
```
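A possible TypeScript mirror of this format on the client, assuming the MVP only emits the `answer` action; the type and parser names are illustrative.

```typescript
// Illustrative types for the MVP action format; extend as new action kinds are added.
type AgentAction = {
  action: "answer";   // MVP only emits "answer"
  content: string;    // speech-friendly text to be read aloud by TTS
};

// Hypothetical guard: validates raw LLM output before it reaches TTS.
function parseAgentAction(raw: string): AgentAction | null {
  try {
    const parsed = JSON.parse(raw);
    if (parsed && parsed.action === "answer" && typeof parsed.content === "string") {
      return parsed as AgentAction;
    }
    return null;
  } catch {
    return null;
  }
}
```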
3. System Architecture (MVP)
Client (Next.js or React)
- Mic capture
- Webcam capture
- Streams audio + optional frames to the backend
- Receives LLM response + TTS stream
- Plays TTS audio
Backend
Optional if using OpenAI Realtime directly from the frontend. Responsible for:
- STT if using Whisper locally
- Frame processing if using server-side CV models
- Proxying LLM requests for safety or secret protection
- Session memory store (SQLite or Redis)
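If a backend proxy is used, a minimal sketch might look like the following, assuming an Express server with `OPENAI_API_KEY` held server-side; the endpoint path and port are placeholders.

```typescript
// backend/server.ts - hypothetical proxy so the API key never reaches the browser.
import express from "express";

const app = express();
app.use(express.json({ limit: "10mb" })); // allow base64 frames in the payload

// Forward a pre-built chat payload to OpenAI and return the raw response.
app.post("/api/llm", async (req, res) => {
  const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(req.body),
  });
  res.status(upstream.status).json(await upstream.json());
});

app.listen(3001, () => console.log("LLM proxy listening on :3001"));
```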
LLM (GPT-5-1 or Realtime model)
- Receives text + image inputs
- Produces a structured response
4. File & Folder Structure (ideal for Cursor)
```
/app
  /api
    realtime.ts          # optional backend proxy
  /components
    VoiceController.tsx
    VisionController.tsx
    AgentConsole.tsx
  /utils
    audio.ts             # mic utils
    camera.ts            # webcam utils
    llm.ts               # calls to LLM / Realtime API
    memory.ts            # session memory
  page.tsx               # main UI
/backend
  server.js (or python/main.py)
```
5. Interface Specifications
5.1 STT Stream Contract
Client → Backend or directly → OpenAI Realtime
```json
{
  "audio": "<base64-encoded audio chunk>",
  "session_id": "abc123",
  "timestamp": 123456
}
```
Response:
```json
{
  "text": "User speech transcript"
}
```
5.2 Vision Input Contract
```json
{
  "image_base64": "...",
  "session_id": "abc123",
  "context": "last user utterance"
}
```
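For reference, the 5.1 and 5.2 contracts above could be mirrored as TypeScript interfaces on the client; the `audio` field is assumed to be a base64 string.

```typescript
// Illustrative TypeScript mirrors of the 5.1 and 5.2 contracts (field names from the spec above).
interface SttChunk {
  audio: string;        // base64-encoded audio chunk
  session_id: string;
  timestamp: number;    // client capture time, ms
}

interface SttResult {
  text: string;         // "User speech transcript"
}

interface VisionInput {
  image_base64: string;
  session_id: string;
  context: string;      // last user utterance
}
```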
5.3 LLM Prompt Template (MVP)
```
SYSTEM:
You are a voice-first, vision-aware assistant.
Use speech-friendly responses.
If images are present, reason based on them.
Output as JSON:
{
  "answer": "",
  "steps": [ "...optional reasoning..." ],
  "actions": [ "...optional actions user should take..." ]
}
```
6. Core Agent Loop (Pseudo-code)
```
loop:
    stt_text = listen_for_speech()
    frame = capture_frame()

    payload = {
        "user_text": stt_text,
        "image": frame
    }

    llm_response = callLLM(payload)
    play_tts(llm_response.answer)
    update_memory(llm_response)
```
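A concrete TypeScript version of the same loop, assuming the helper functions sketched earlier live in the utils modules from section 4; `listenForSpeech` is an assumed wrapper over the mic/VAD capture that resolves with a transcript once speech ends.

```typescript
// Hypothetical concrete version of the loop, wiring the utils from the folder structure.
import { listenForSpeech, playTTS } from "./utils/audio";
import { captureFrame } from "./utils/camera";
import { callLLM } from "./utils/llm";
import { updateMemory } from "./utils/memory";

export async function runAgentLoop(sessionId: string, isRunning: () => boolean) {
  while (isRunning()) {
    const sttText = await listenForSpeech();     // resolves when VAD detects end of speech
    const frame = captureFrame();                // base64 JPEG, or undefined if camera is off

    const reply = await callLLM(sttText, frame); // multimodal request (see the 5.3 sketch)

    await playTTS(reply.answer);                 // speak the answer back to the user
    updateMemory(sessionId, { user: sttText, agent: reply });
  }
}
```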
7. MVP Scope (tight version)
➤ Audio in → text
➤ Optional camera frame → vision reasoning
➤ LLM answer
➤ Voice out via TTS
➤ Simple UI
Nothing more.
8. Stretch Features (for later)
- Wake-word detection (Porcupine or Picovoice)
- Local vision inference (SSD, YOLOv8, etc.)
- Multi-agent orchestration
- Tool calling (control browser, email, notes)
- Long-term memory
- Semantic search over past conversations
9. Ready-to-Develop Tasks for Cursor
Task 1 — Create minimal UI
- Microphone button
- Camera on/off toggle
- Stream logs panel
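A bare-bones sketch of this UI as `AgentConsole.tsx`, assuming React with hooks; wiring the buttons to the actual mic/camera utils comes in Tasks 2-3.

```tsx
// components/AgentConsole.tsx - illustrative minimal UI for Task 1 (names are assumptions).
import { useState } from "react";

export default function AgentConsole() {
  const [micOn, setMicOn] = useState(false);
  const [cameraOn, setCameraOn] = useState(false);
  const [logs, setLogs] = useState<string[]>([]);

  const log = (line: string) => setLogs((prev) => [...prev, line]);

  return (
    <div>
      <button onClick={() => { setMicOn(!micOn); log(micOn ? "Mic off" : "Mic on"); }}>
        {micOn ? "Stop microphone" : "Start microphone"}
      </button>
      <button onClick={() => { setCameraOn(!cameraOn); log(cameraOn ? "Camera off" : "Camera on"); }}>
        {cameraOn ? "Turn camera off" : "Turn camera on"}
      </button>
      {/* Stream logs panel */}
      <pre>{logs.join("\n")}</pre>
    </div>
  );
}
```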
Task 2 — Implement mic stream → Realtime API
Task 3 — Implement webcam capture
Task 4 — Implement multimodal LLM call with:
- STT transcript
- Image (if any)
- System template
Task 5 — Implement TTS playback
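A simple playback sketch, assuming a hypothetical `/api/tts` backend endpoint that returns raw audio bytes (e.g. MP3) for the given text:

```typescript
// utils/audio.ts (playback half) - hypothetical TTS playback via a backend /api/tts endpoint.
export async function playTTS(text: string): Promise<void> {
  const res = await fetch("/api/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });

  const blob = await res.blob();
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);

  // Resolve once playback finishes so the agent loop can go back to listening.
  await new Promise<void>((resolve, reject) => {
    audio.onended = () => resolve();
    audio.onerror = () => reject(new Error("TTS playback failed"));
    audio.play().catch(reject);
  });

  URL.revokeObjectURL(url);
}
```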
Task 6 — Add conversation memory
Simple JSON store per session
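A minimal per-session store sketch for `utils/memory.ts`; this keeps turns in memory only, and could later be swapped for the SQLite or Redis store mentioned in section 3.

```typescript
// utils/memory.ts - illustrative per-session JSON store for the MVP (in-memory, non-persistent).
export interface Turn {
  user: string;                  // STT transcript
  agent: { answer: string };     // parsed LLM reply
  timestamp: number;
}

const sessions = new Map<string, Turn[]>();

export function updateMemory(sessionId: string, turn: Omit<Turn, "timestamp">): void {
  const history = sessions.get(sessionId) ?? [];
  history.push({ ...turn, timestamp: Date.now() });
  sessions.set(sessionId, history);
}

export function getMemory(sessionId: string): Turn[] {
  return sessions.get(sessionId) ?? [];
}
```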