diff --git a/notes/CLOUD_ARCHITECTURE.md b/notes/CLOUD_ARCHITECTURE.md
new file mode 100644
index 00000000..49ca0943
--- /dev/null
+++ b/notes/CLOUD_ARCHITECTURE.md
@@ -0,0 +1,498 @@
# Cloud Mode Architecture

## The Challenge

Cloud coding agents face a fundamental tension: you want them to feel like your laptop, but they're not. You want the experience of running locally—real-time feedback, files on your disk, your IDE, full control—but also the convenience of interacting from your phone or Slack while you're away.

This creates two distinct experiences:

**Interactive Mode** — "I'm watching"

- Real-time feedback as the agent works
- Files sync to your local machine instantly
- You can interrupt, redirect, answer questions
- Feels like pair programming

**Background Mode** — "Wake me when it's done"

- Agent works autonomously
- You check in when you're ready
- Review changes, pull them locally, continue
- Feels like delegating to a colleague

Most cloud agent implementations force you to choose one or the other. The goal here is to support both seamlessly—and let you switch between them without friction.

### Key Goals

1. **Local-first feel** — Edit in Array or your IDE, changes sync automatically
2. **Survive disconnection** — Close your laptop, the agent keeps working
3. **Seamless resume** — Reconnect and catch up instantly
4. **Multiple clients** — Laptop, phone, Slack, API—all work
5. **Simple recovery** — If the sandbox dies, state is recoverable

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                                 CLIENTS                                  │
│          Array Desktop  │  Slack Bot  │  API  │  Mobile App              │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     │ Streamable HTTP (SSE + POST)
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                 BACKEND                                  │
│                                                                          │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐       │
│  │    Sync API     │    │    Temporal     │    │   S3 Storage    │       │
│  │   (FastAPI)     │◄──►│    Workflow     │◄──►│                 │       │
│  └─────────────────┘    └─────────────────┘    │  - Event logs   │       │
│                                                │  - File content │       │
│                                                └─────────────────┘       │
└─────────────────────────────────────────────────────────────────────────┘
          ▲                                               │
          │ SSE (events)                                  │ SSE (commands)
          │                                               ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                 SANDBOX                                  │
│                                                                          │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐       │
│  │  Agent Server   │◄──►│  File Watcher   │───►│   S3 Upload     │       │
│  │  (message loop) │    │                 │    │   (by hash)     │       │
│  └─────────────────┘    └─────────────────┘    └─────────────────┘       │
│          │                                                               │
│          └──────────────► Git Repository ◄───────────────────────────────│
└─────────────────────────────────────────────────────────────────────────┘
```

**Data flow:**

1. Agent writes files in the sandbox
2. File watcher uploads content to S3 (by hash) and emits events
3. Events flow to clients via the SSE stream
4. Clients fetch file content from S3 and write it locally
5. Local edits reverse the flow: upload to S3, notify the sandbox

**Key insight:** S3 is the source of truth for file content. The event log is the source of truth for what happened. Git commits serve as durable checkpoints.

---

## File Synchronization

### Content-Addressed Storage

Files are stored in S3 by hash. Events contain only metadata (path + hash), not the content, which may be large.

```
S3 Structure:
  files/
    sha256_abc123...   → file content (any file with this hash)
    sha256_def456...   → file content (deduplicated)

  logs/
    run_{id}.jsonl     → event log
```

**Benefits:**

- Deduplication (same content = same hash = stored once)
- Multiple clients fetch from S3 (the sandbox doesn't need to be alive)
- Files survive sandbox death
- Cache indefinitely (content never changes for a given hash)
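To make this write path concrete, here is a minimal sketch of a content-addressed upload, assuming `boto3`; the `agent-sync` bucket and the `upload_file_change` helper are hypothetical names for illustration, not the actual implementation:

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "agent-sync"  # hypothetical bucket name


def upload_file_change(path: str, content: bytes, action: str = "modified") -> dict:
    """Store content by hash and return the metadata-only event to emit."""
    digest = "sha256_" + hashlib.sha256(content).hexdigest()
    key = f"files/{digest}"

    # Content-addressed: if an object with this key already exists, its content
    # is byte-identical, so the upload can be skipped (deduplication).
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
    except ClientError:
        s3.put_object(Bucket=BUCKET, Key=key, Body=content)

    # The event carries only path + hash; clients fetch content from S3.
    return {
        "jsonrpc": "2.0",
        "method": "_posthog/file_change",
        "params": {"path": path, "action": action, "hash": digest},
    }
```

Because the key is derived from the content, re-uploading an unchanged file is a no-op, and a fetched object can safely be cached forever.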
### Sandbox → Client Flow

```
Agent writes file
      │
      ▼
File watcher detects change
      │
      ├──► Hash content (sha256)
      │
      ├──► PUT to S3: files/{hash}
      │
      └──► Emit event: { path, hash, action }
               │
               ▼
         Event log (S3) ───► SSE stream to clients
                                   │
                                   ▼
                         Client receives event
                                   │
                                   ├──► GET from S3: files/{hash}
                                   │
                                   └──► Write to local filesystem
```

### Client → Sandbox Flow (Local Edits)

```
User edits file locally
      │
      ▼
Local file watcher detects change
      │
      ├──► Hash content (sha256)
      │
      ├──► PUT to S3: files/{hash}
      │
      └──► POST event: { path, hash, action }
               │
               ▼
         Backend pushes to agent via SSE
               │
               ▼
         Agent fetches from S3, overwrites file
               │
               ▼
         Agent sees file changed, adapts
```

### Conflict Resolution: Local Wins

There is no merge logic. When the user edits locally:

1. Content goes to S3
2. Sandbox is notified
3. Sandbox overwrites its version
4. Agent notices and adapts

The agent is expected to gracefully handle situations where a local edit overwrites its own changes.

---

## State & Recovery

### The Event Log Is the Source of Truth

There is no separate checkpointing mechanism. The combination of:

- Event log (what happened)
- S3 content-addressed files (file contents by hash)
- Git commits (durable snapshots)

...gives us everything needed to recover.

```
Event Log (run_{id}.jsonl):

  { method: "_posthog/git_commit",  params: { sha: "abc123" } }
  { method: "_posthog/file_change", params: { path: "src/foo.py", hash: "sha256_aaa" } }
  { method: "_posthog/file_change", params: { path: "src/bar.py", hash: "sha256_bbb" } }
  { method: "_posthog/git_commit",  params: { sha: "def456" } }   ◄── latest commit
  { method: "_posthog/file_change", params: { path: "src/foo.py", hash: "sha256_ccc" } }
  { method: "_posthog/file_change", params: { path: "src/baz.py", hash: "sha256_ddd" } }
```

**To recover current state:**

1. Find the latest `git_commit` event → check out that commit
2. Replay the `file_change` events since that commit → apply uncommitted changes
3. Fetch file contents from S3 by hash

### Recovery Flow

```
Sandbox dies or needs recovery
      │
      ▼
Read event log from S3
Find latest git_commit
      │
      ▼
Provision new sandbox
git checkout {commit_sha}
      │
      ▼
For each file_change after commit:
  - Fetch content from S3 by hash
  - Write to filesystem
      │
      ▼
Resume agent with conversation history
```

### Agent Commits as Durable Checkpoints

The agent commits periodically on significant changes. This creates permanent checkpoints — even if S3 files expire (30-day TTL), we can always recover to any commit.
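A rough sketch of that recovery procedure, reusing the hypothetical `boto3` client and bucket from the earlier sketch (deletions and conversation-history restore are omitted for brevity):

```python
import json
import subprocess
from pathlib import Path

import boto3

s3 = boto3.client("s3")
BUCKET = "agent-sync"  # hypothetical bucket, as in the earlier sketch


def recover_working_tree(run_id: str, repo_dir: Path) -> None:
    """Rebuild the working tree: check out the latest commit, then replay file changes."""
    log = s3.get_object(Bucket=BUCKET, Key=f"logs/run_{run_id}.jsonl")
    events = [json.loads(line) for line in log["Body"].iter_lines() if line]

    # Step 1: find the latest durable checkpoint and check it out.
    commit_idx = max(
        (i for i, e in enumerate(events) if e["method"] == "_posthog/git_commit"),
        default=-1,
    )
    if commit_idx >= 0:
        sha = events[commit_idx]["params"]["sha"]
        subprocess.run(["git", "checkout", sha], cwd=repo_dir, check=True)

    # Step 2: replay uncommitted file changes recorded after that commit.
    for event in events[commit_idx + 1:]:
        if event["method"] != "_posthog/file_change":
            continue
        path, digest = event["params"]["path"], event["params"]["hash"]
        content = s3.get_object(Bucket=BUCKET, Key=f"files/{digest}")["Body"].read()
        target = repo_dir / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
```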
### Data Retention

| Data | Retention | Recovery |
|------|-----------|----------|
| Git commits | Permanent | Always recoverable |
| S3 file content | 30 days | Uncommitted changes recoverable for 30 days |
| Event log | Indefinite | History/debugging |

---

## Agent Architecture

### Server Mode

The agent runs in a message loop rather than as a single execution:

```
┌─────────────────────────────────────────────────────────────────┐
│                          AGENT SERVER                           │
│                                                                 │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐         │
│   │   Message   │    │   Control   │    │    Event    │         │
│   │    Queue    │    │    Queue    │    │   Emitter   │         │
│   │   (input)   │    │  (signals)  │    │  (output)   │         │
│   └──────┬──────┘    └──────┬──────┘    └──────▲──────┘         │
│          │                  │                  │                │
│          └──────────────────┼──────────────────┘                │
│                             │                                   │
│                    ┌────────▼────────┐                          │
│                    │                 │                          │
│                    │   Agent Loop    │                          │
│                    │                 │                          │
│                    │  - Process msg  │                          │
│                    │  - Run tools    │                          │
│                    │  - Emit events  │                          │
│                    │  - Ask questions│                          │
│                    │                 │                          │
│                    └─────────────────┘                          │
└─────────────────────────────────────────────────────────────────┘
```

**Message types:**

- `user_message` — New prompt or response to a question
- `file_sync` — Files changed locally, the agent should notice
- `cancel` — Stop the current operation
- `close` — Shut down gracefully

**How commands reach the agent:** On startup, the agent opens an outbound SSE connection to the backend. Client commands (prompts, cancel, mode switch) are pushed directly through this connection to keep latency as close to the local experience as possible. This bypasses Temporal for real-time operations.

### Temporal Workflow

Temporal handles **lifecycle only**, not message routing:

```python
import asyncio
from datetime import timedelta

from temporalio import workflow

# provision_sandbox, start_agent_server, and cleanup_sandbox are helpers
# wrapping workflow.execute_activity calls (elided here).


@workflow.defn
class CloudSessionWorkflow:

    def __init__(self) -> None:
        self.should_close = False

    @workflow.signal
    def close(self):
        self.should_close = True

    @workflow.run
    async def run(self, input):
        sandbox_id = await provision_sandbox(input)
        await start_agent_server(sandbox_id)  # Agent connects to backend via SSE

        # Just wait for close or timeout - messages go direct via SSE
        while not self.should_close:
            try:
                await workflow.wait_condition(
                    lambda: self.should_close,
                    timeout=timedelta(minutes=10),
                )
            except asyncio.TimeoutError:
                break  # Inactivity timeout

        await cleanup_sandbox(sandbox_id)
```

**Key behaviors:**

- Temporal provisions the sandbox and handles cleanup
- Messages/commands go directly via SSE, not through Temporal (see the sketch below)
- A 10-minute inactivity timeout triggers cleanup
- Client disconnection doesn't stop the agent
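Since real-time commands bypass Temporal, the agent's receive path is essentially an SSE consumer feeding the message loop. A minimal sketch; the `httpx` usage, the endpoint URL, and the `agent` methods are assumptions for illustration, not the actual implementation:

```python
import json

import httpx

SYNC_URL = "https://backend.example.com/api/projects/1/tasks/2/runs/3/sync"  # hypothetical


async def command_loop(agent) -> None:
    """Consume commands pushed by the backend and dispatch them into the agent loop."""
    headers = {"Accept": "text/event-stream", "Session-Id": agent.run_id}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", SYNC_URL, headers=headers) as response:
            async for line in response.aiter_lines():
                if not line.startswith("data:"):
                    continue  # skip SSE ids, comments, and blank keep-alives
                message = json.loads(line[len("data:"):])
                method = message["method"]
                if method == "_posthog/user_message":
                    await agent.enqueue_message(message["params"])    # new prompt or answer
                elif method == "_posthog/file_sync":
                    await agent.apply_remote_edit(message["params"])  # local edit pushed down
                elif method == "_posthog/cancel":
                    agent.cancel_current()                            # stop current operation
                elif method == "_posthog/session_close":
                    break                                             # shut down gracefully
```

Everything latency-sensitive rides this one connection; Temporal only observes the session opening and closing.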
+ +``` +Renderer ──tRPC──► AgentService ──► Connection + │ + ┌─────────────┴─────────────┐ + │ │ + ▼ ▼ + LocalConnection CloudConnection + (in-process SDK) (SSE to backend) +``` + +**The connection interface** (simplified): + +```typescript +interface AgentConnection { + prompt(params: { sessionId: string; prompt: string }): AsyncIterable + cancel(params: { sessionId: string }): Promise + setMode(params: { sessionId: string; mode: string }): Promise + onEvent(handler: (event: AcpMessage) => void): void +} +``` + +**Key files:** + +- `apps/array/src/main/services/agent/service.ts` — AgentService, picks connection type +- `apps/array/src/main/services/agent/local-connection.ts` — Current ACP/SDK logic (extract) +- `apps/array/src/main/services/agent/cloud-connection.ts` — New, ~200 lines + +--- + +## Communication Protocol + +### Streamable HTTP + +Following [MCP's pattern](https://modelcontextprotocol.io/specification/2025-03-26/basic/transports#streamable-http): + +- **POST** — Client sends messages (user input, file sync, cancel) +- **GET** — Client opens SSE stream for server events +- **Session-Id header** — Identifies the session (run ID) +- **Last-Event-ID header** — Resume from where you left off + +### Endpoint + +``` +/api/projects/{project_id}/tasks/{task_id}/runs/{run_id}/sync +``` + +### Sending Messages (POST) + +```http +POST /sync +Content-Type: application/json +Session-Id: {run_id} + +{ + "jsonrpc": "2.0", + "method": "_posthog/user_message", + "params": { "content": "Please fix the auth bug" } +} +``` + +Response: `202 Accepted` + +### Receiving Events (GET) + +```http +GET /sync +Accept: text/event-stream +Session-Id: {run_id} +Last-Event-ID: 123 +``` + +```http +HTTP/1.1 200 OK +Content-Type: text/event-stream + +id: 124 +data: {"jsonrpc":"2.0","method":"_posthog/file_change","params":{"path":"src/auth.py","hash":"abc123"}} + +id: 125 +data: {"jsonrpc":"2.0","method":"agent_message_chunk","params":{"text":"I found the issue..."}} +``` + +### Why Not WebSocket? + +- SSE will work much better with our infrastructure (load balancing across multiple pods) +- Built-in resumability via `Last-Event-ID` +- Easier to manage (stateless servers) + +--- + +## Client Modes + +### Interactive (Connected) + +``` +Client Backend Sandbox + │ │ │ + │── GET /sync (SSE) ────────────►│ │ + │ │ │ + │◄── file_change ────────────────│◄── file written ──────────────│ + │◄── agent_message ──────────────│◄── agent output ──────────────│ + │ │ │ + │── POST /sync {message} ───────►│── push via SSE ──────────────►│ + │◄── 202 Accepted ───────────────│ │ +``` + +### Background (Disconnected) + +``` + Backend Sandbox + │ │ + │◄── agent keeps working ───────│ + │◄── events to S3 log ──────────│ + │ │ + │ (no client connected) │ +``` + +Agent continues autonomously. Events accumulate in S3. + +### Resume (Reconnect) + +``` +Client Backend + │ │ + │── GET /sync ──────────────────►│ + │ Last-Event-ID: 50 │ + │ │ + │◄── id:51 (from S3 log) ────────│ Replay missed events + │◄── id:52 ──────────────────────│ + │◄── ... ────────────────────────│ + │◄── id:100 (live) ──────────────│ Switch to live stream +``` + +Client catches up instantly, then receives live events. 
+ +--- + +## Event Format + +JSON-RPC 2.0 notifications, stored as NDJSON: + +```typescript +{ + "jsonrpc": "2.0", + "method": "_posthog/file_change", + "params": { + "path": "src/auth.py", + "action": "modified", + "hash": "sha256_abc123" + } +} +``` + +### Event Types + +**File sync:** + +- `_posthog/file_change` — File created/modified/deleted (sandbox → client) +- `_posthog/file_sync` — Client pushing local changes (client → sandbox) +- `_posthog/git_commit` — Agent committed changes + +**Agent interaction:** + +- `agent_message_chunk` — Agent output +- `_posthog/agent_question` — Agent asking user +- `_posthog/user_message` — User input +- `tool_call` / `tool_result` — Tool usage + +**Session:** + +- `_posthog/session_start` — Session began +- `_posthog/session_close` — Session ended + +**Control:** + +- `_posthog/cancel` — Cancel current operation +- `_posthog/ack` — Acknowledgment + +--- + +## References + +- [MCP Streamable HTTP Transport](https://modelcontextprotocol.io/specification/2025-03-26/basic/transports) +- [Temporal Signals](https://docs.temporal.io/workflows#signal)