
Hermes support #396

Open
pallakatos wants to merge 62 commits into main from hermes/act1-docker-smoke-fixes

Conversation

@pallakatos

Collaborator

Summary

First end-to-end Hermes docker smoke test surfaced two real bugs from the A1 ship:

  1. hermes-agent==0.5.1 doesn't exist on PyPI. The 0.5.x assumption came from misreading the Hermes README's Homebrew formula tag (5.1.14). PyPI uses 0.x.y numbering, and the latest release is 0.15.2. Bumped the pin to 0.15.2.
  2. kars plugin discovered but not loaded. Hermes treats standalone plugins as opt-in via plugins.enabled in config.yaml. The entrypoint was materializing the plugin into the right path ($HERMES_HOME/plugins/kars/) but never added kars to the allow-list, so it was discovered and silently skipped (error='not enabled in config').

Bonus: ripgrep is not available in the Azure Linux 3 tdnf repos, so tdnf install -y failed the whole layer. Hermes' file_search falls back to grep cleanly, so ripgrep was dropped. Image now builds in ~30s.

Verification

  • docker build --platform linux/amd64 -f sandbox-images/hermes/Dockerfile -t kars-sandbox-hermes:dev . succeeds
  • hermes_cli.plugins.discover_plugins() loads kars plugin from $HERMES_HOME/plugins/kars/ with 10 tools + 2 hooks (pre_tool_call + post_tool_call)
  • ✅ End-to-end entrypoint dry-run produces correct config.yaml with both plugins.enabled: [kars] and mcp_servers.platform blocks
  • ✅ 83/83 Python unit tests still pass inside the image
  • ✅ All 8 ci-gates pass locally (security-audit, copyright, no-stubs, no-custom-crypto, etc.)

Plugin contract confirmation: Hermes 0.15.2's PluginContext.register_tool / register_hook signatures and plugin.yaml schema (provides_tools / provides_hooks) match exactly what the A1 plugin code was already built for.

Security

See docs/internal/security-audits/2026-06-04-hermes-act1-docker-smoke-fixes.md — no new threat surface beyond what existing A1 audits already cover; this commit only flips the previously-audited surface from dormant to live.

Three issues surfaced when running the first `docker build` +
end-to-end smoke test of the Hermes sandbox image:

1. **Hermes version pin wrong**
   `ARG HERMES_VERSION=0.5.1` doesn't exist on PyPI. The 0.5.x
   assumption came from misreading the Hermes README's Homebrew
   formula tag (`5.1.14`); the actual `hermes-agent` PyPI package
   uses 0.x.y numbering, with 0.15.2 the latest release. Bumped to 0.15.2.

   Hermes 0.15.2's plugin contract (PluginContext.register_tool,
   register_hook, plugin.yaml with provides_tools/provides_hooks,
   discovery via `$HERMES_HOME/plugins/`) matches what the A1
   plugin code was already built for — verified by importing
   hermes_cli.plugins and running discover_plugins() against our
   materialized plugin tree.

2. **ripgrep not in Azure Linux 3**
   `tdnf install -y` exits non-zero if ANY package is missing, and
   Azure Linux 3 doesn't ship ripgrep. Hermes' built-in file_search
   tool prefers ripgrep but falls back to grep, so dropping it is
   safe. Image now builds in ~30s.

3. **kars plugin discovered but not loaded**
   Hermes treats `standalone` plugins as opt-in via
   `plugins.enabled` in config.yaml. The entrypoint was placing the
   kars plugin into `$HERMES_HOME/plugins/kars/` (correct user
   discovery path), but never adding `kars` to the enabled
   allow-list — so it was discovered and silently skipped with
   `error='not enabled in config'`.

   The entrypoint now emits a `plugins.enabled: [kars]` block at
   the top of every generated config.yaml. The awk-merge that
   replaces prior `mcp_servers:` blocks was extended to also
   replace prior `plugins:` blocks so re-runs are idempotent.
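
   The idempotent merge described in item 3 can be sketched as follows.
   This is a minimal, stdlib-only illustration and not the entrypoint's
   actual code: the function name `merge_generated_blocks`, the
   line-based handling of top-level YAML keys, and the empty
   `platform: {}` placeholder are all assumptions.

```python
"""Sketch: idempotently replace generated top-level blocks in config.yaml."""

# Top-level keys the entrypoint owns; anything else in the file is preserved.
MANAGED_KEYS = ("plugins", "mcp_servers")

# Placeholder content; the real mcp_servers entries are not shown in this PR.
GENERATED = "plugins:\n  enabled: [kars]\nmcp_servers:\n  platform: {}\n"


def merge_generated_blocks(existing: str, generated: str = GENERATED) -> str:
    """Drop prior managed blocks from `existing`, then prepend `generated`.

    Simplifying assumption: a managed block starts at column 0 with
    `<key>:` and runs until the next non-indented, non-blank line.
    """
    kept = []
    skipping = False
    for line in existing.splitlines():
        is_top_level = bool(line) and not line[0].isspace()
        if is_top_level:
            key = line.split(":", 1)[0]
            skipping = key in MANAGED_KEYS
        if not skipping:
            kept.append(line)
    body = "\n".join(kept).strip("\n")
    merged = generated.strip("\n")
    if body:
        merged += "\n" + body
    return merged + "\n"
```

   Running the function on its own output returns the same text, which
   is the property that makes entrypoint re-runs safe.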

Verified end-to-end:
- `docker build` succeeds
- `discover_plugins()` loads kars plugin, registers 10 tools +
  2 hooks (pre_tool_call + post_tool_call)
- Entrypoint generates correct config.yaml with both blocks
- `$HERMES_HOME/plugins/kars/` materialized from
  `/opt/kars-hermes-stage/plugins/kars/` on every boot
- 83/83 python unit tests still pass inside the image
- Mock smoke run: `python3 -m hermes_cli.plugins discover` shows
  kars: enabled=True, 17 total plugin tools across all enabled
  plugins (10 from kars + 7 web/foundry from bundled providers)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

github-actions Bot commented Jun 4, 2026


Dependency Review

The following issues were found:
  • ✅ 0 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 5 package(s) with unknown licenses.
See the Details below.

License Issues

mesh-plugin/package.json

| Package | Version | License | Issue Type |
| --- | --- | --- | --- |
| @microsoft/agent-governance-sdk | file:../vendor/agt/microsoft-agent-governance-sdk-4.0.0-agt-3322175d.tgz | Null | Unknown License |

runtimes/agt-mesh-python/pyproject.toml

| Package | Version | License | Issue Type |
| --- | --- | --- | --- |
| agentmesh-platform | >= 3.6.0,< 5.0.0 | Null | Unknown License |
| httpx | >= 0.27,< 1.0 | Null | Unknown License |
| pynacl | >= 1.5,< 2.0 | Null | Unknown License |
| websockets | >= 12,< 14 | Null | Unknown License |

OpenSSF Scorecard

| Package | Version | Score | Details |
| --- | --- | --- | --- |
| npm/@microsoft/agent-governance-sdk | file:../vendor/agt/microsoft-agent-governance-sdk-4.0.0-agt-3322175d.tgz | Unknown | Unknown |
| pip/agentmesh-platform | >= 3.6.0,< 5.0.0 | Unknown | Unknown |
| pip/httpx | >= 0.27,< 1.0 | Unknown | Unknown |
| pip/pynacl | >= 1.5,< 2.0 | Unknown | Unknown |
| pip/websockets | >= 12,< 14 | Unknown | Unknown |

Scanned Files

  • mesh-plugin/package.json
  • runtimes/agt-mesh-python/pyproject.toml

Pal Lakatos and others added 2 commits June 4, 2026 07:56
Two follow-ups from the kind-cluster end-to-end smoke test:

1. **Helm CRD schema missing Hermes enum** — controller's `crd.rs`
   added `RuntimeKind::Hermes` in a7882b8 but the matching Helm
   CRD YAML wasn't updated. Result: the API server rejected every
   KarsSandbox with `runtime.kind: Hermes` BEFORE the controller
   ever saw it. Verified by `kubectl apply --dry-run=server`
   failing with "unknown enum value 'Hermes'".

   Added:
   - `Hermes` to the `runtime.kind` enum at line 85
   - x-kubernetes-validations rule:
     `(self.kind == 'Hermes') == has(self.hermes)`
   - `runtime.hermes` properties block mirroring `pydanticAi`
     shape (version, agentCode oci/git, entrypoint, extraEnv)

   After the fix, `kubectl apply -f /tmp/hermes-sandbox.yaml`
   succeeds, controller picks up the CR, and a 2-container pod
   (`agent` + `inference-router`) reaches `2/2 Running` with the
   kars plugin loaded (10 tools + 2 hooks registered).

2. **`.cargo-docker/` not gitignored** — when cross-compiling for
   linux/arm64 via `docker run -v $PWD:/work … cargo build` (the
   pattern used for kind-on-M-series), `CARGO_HOME=/work/.cargo-docker`
   keeps container-arch crate cache out of the host's `~/.cargo`.
   That directory was leaking into `git status`. Added rules:
   - `.cargo-docker/` — explicit
   - `/bin/` was already covered by `**/[Bb]in/*` (verified)
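
The x-kubernetes-validations rule from item 1 is a biconditional: the
`hermes` block must be present exactly when `kind` is `Hermes`. A small
Python mirror of that logic, for illustration only (the real rule is CEL
evaluated by the API server, and the `Other` kind below is a made-up
placeholder):

```python
def runtime_spec_is_valid(runtime: dict) -> bool:
    """Python mirror of `(self.kind == 'Hermes') == has(self.hermes)`."""
    is_hermes_kind = runtime.get("kind") == "Hermes"
    has_hermes_block = "hermes" in runtime
    return is_hermes_kind == has_hermes_block


# Accepted: kind and block agree.
print(runtime_spec_is_valid({"kind": "Hermes", "hermes": {"version": "0.15.2"}}))
# Rejected: Hermes kind without its block, or the block under another kind.
print(runtime_spec_is_valid({"kind": "Hermes"}))
print(runtime_spec_is_valid({"kind": "Other", "hermes": {}}))
```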

Verified end-to-end on kind cluster `kars-dev`:

  $ kubectl get karssandbox,pods -n kars-smoke-hermes
  NAME           PHASE   RUNTIME   INFERENCEPOLICY   ISOLATION
  smoke-hermes           Hermes    smoke-inference   standard
  NAME                            READY   STATUS    RESTARTS
  smoke-hermes-697c6bd557-q5xfr   2/2     Running   0

  Plugin discovery inside the pod:
    kars plugin: enabled=True, source=user
    hooks      : {'pre_tool_call': 1, 'post_tool_call': 1}
    tools      : http_fetch, kars_discover, kars_mesh_{send,inbox,
                 await,transfer_file}, kars_spawn{,_status,_destroy,
                 _list}

  Router /healthz from the agent container: 200 ok

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
End-to-end Hermes smoke on kind cluster exposed and fixed six real
bugs blocking the runtime from being functional:

1. awk not in Azure Linux 3 — replaced entrypoint merge with Python
2. TUI mode crashed without TTY — switched to hermes gateway run
3. KARS_MCP_SERVERS injected only into "openclaw" container —
   generalized to use agent_container_name based on runtime kind
4. Entrypoint scanned wrong path for MCP servers — aligned to the
   KARS_MCP_SERVERS env + loopback router pattern
5. hermes config set used key=value (wrong) — fixed to two positional args
6. Router rustls CryptoProvider not pre-installed — added explicit
   aws_lc_rs::default_provider().install_default() in main()

Verified 12/12 e2e checks pass on kind cluster:
- Pod 2/2 Running, plugin loaded with 10 tools + 2 hooks
- Router /healthz, /agt/evaluate, /egress/fetch, /sandbox/list all 200
- KarsMemory CR Compiled, McpServer translated, channel translation
- Mesh stubs return clear Act 2 error
- pre_tool_call hook fires + decision=allow

All 834 controller + 932 router Rust tests pass.
cargo clippy clean, cargo fmt applied.

Security audit:
docs/internal/security-audits/2026-06-04-hermes-act1-e2e-smoke-fixes.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pallakatos pallakatos force-pushed the hermes/act1-docker-smoke-fixes branch from 2a712bc to 82b7fa1 Compare June 4, 2026 08:05
Pal Lakatos and others added 3 commits June 4, 2026 12:00
The sandbox NetworkPolicy gated ALL ingress rules behind
`governance.enabled=true`. With governance off, the NP shipped with
`policyTypes: [Ingress, Egress]` and an empty `ingress: []` block —
deny-all ingress. The operator namespace then could not reach
`/internal/policy-status` on the router and every referencing
InferencePolicy / KarsMemory / ToolPolicy / McpServer / EgressApproval
stuck forever in `Ready=False / AwaitingRouterEnforcement`, observable
in the operator panel even though the sandbox itself was healthy and
the router /readyz returned 200.

Split into two ingress classes:
- **Operator policy-echo ingress** (router :8443 admin surface from
  ns labeled `app.kubernetes.io/name=kars,component=system`) — emitted
  UNCONDITIONALLY. Three orthogonal gates still protect it: bearer
  token, constant-time compare, optional IP pinning.
- **Peer-sandbox mesh + gateway ingress** (8443 / 18789 / 18791 from
  ns labeled `kars.azure.com/role=sandbox`) — kept gated on
  governance.enabled (no peers when governance is off).

Surfaced during local-k8s smoke of smoke-hermes: even after fixing
the AZURE_OPENAI_API_KEY env path so /readyz returned 200, three
policy CRs (InferencePolicy, KarsMemory, ToolPolicy) stayed
Ready=False because the controller's /internal/policy-status probe
to the sandbox router timed out at the NetworkPolicy level.

After this fix, with governance off, the controller's HTTP probe
gets a 401 (admin-token gate doing its job) instead of a connection
timeout, and the policy reconcilers update status using the round
trip rather than reporting "router unreachable".
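
The split between the two ingress classes can be sketched as a small
rule builder. This is an illustration in Python rather than the Helm
template itself: the function name is hypothetical and the `TCP`
protocol is an assumption, while the labels and ports are the ones
listed above.

```python
def build_ingress_rules(governance_enabled: bool) -> list:
    """Return NetworkPolicy ingress rules for a sandbox.

    The operator policy-echo rule is always emitted; the peer-sandbox
    rule is emitted only when governance is enabled.
    """
    operator_rule = {
        "from": [{"namespaceSelector": {"matchLabels": {
            "app.kubernetes.io/name": "kars",
            "component": "system",
        }}}],
        # Router admin surface used by /internal/policy-status probes.
        "ports": [{"port": 8443, "protocol": "TCP"}],
    }
    rules = [operator_rule]
    if governance_enabled:
        rules.append({
            "from": [{"namespaceSelector": {"matchLabels": {
                "kars.azure.com/role": "sandbox",
            }}}],
            # Peer mesh + gateway ports.
            "ports": [{"port": p, "protocol": "TCP"}
                      for p in (8443, 18789, 18791)],
        })
    return rules
```

With governance off the list still contains the operator rule, so the
policy never degenerates to an empty `ingress: []` deny-all.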

Verified end-to-end on kind cluster `kars-dev`:
  $ kubectl get inferencepolicy smoke-inference -n kars-system -o jsonpath='{.status.conditions}' | jq
  - Ready=True  RouterEnforcing: all 1 referencing sandbox router(s) confirmed inference-policy digest
  - Progressing=False  Reconciled: router echo confirmed
  $ kubectl get karsmemory smoke-mem -n kars-system -o jsonpath='{.status.conditions}' | jq
  - Ready=True  RouterEnforcing: all 1 referencing sandbox router(s) confirmed claw-memory binding digest

834 controller tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collapses the canonical 4-agent exec-brief scenario (parent +
analyst + viz + writer) into a single Hermes agent doing the whole
pipeline itself — research, scorecard, hero image, written brief.
Built to validate the Hermes runtime adapter end-to-end on
local-k8s and AKS without depending on the Python AGT MeshClient
(which ships in Act 2; until then, `kars_mesh_*` returns explicit
"Act 2 not ready" errors and the prompt explicitly tells the agent
not to call those tools).

Scenario layout (mirrors exec-brief/):
  - manifests/00-namespace.yaml ........ kars-execbrief-hermes ns
  - manifests/01-inferencepolicy.yaml .. azure-openai gpt-5.4
  - manifests/02-toolpolicy.yaml ....... allow-all AGT profile
  - manifests/03-clawmemory.yaml ....... memory-execbrief-hermes store
  - manifests/04-mcpserver.yaml ........ DeepWiki MCP (same as canonical)
  - manifests/05-clawsandbox.yaml ...... runtime.kind: Hermes
  - config.sh .......................... SCENARIO_SUB_SANDBOXES=()
  - prompt.txt ......................... single-agent pipeline
  - README.md .......................... what it exercises + skips

Verified on kind cluster `kars-dev`:
  $ kubectl apply -f tools/e2e-harness/scenarios/exec-brief-hermes-single/manifests/
    → 6 resources created
  $ kubectl get karssandbox execbrief-hermes -n kars-system
    PHASE=healthy RUNTIME=Hermes
  $ kubectl get pods -n kars-execbrief-hermes
    execbrief-hermes-...   2/2 Running

All 5 CRs reach RouterEnforcing / Ready=True:
  ● execbrief-hermes-inference   InferencePolicy   router echo confirmed
  ● execbrief-hermes-toolpolicy  ToolPolicy        agt-profile digest confirmed
  ● execbrief-hermes-memory      KarsMemory        binding=bound
  ● execbrief-hermes-deepwiki    McpServer         healthy
  ● execbrief-hermes             KarsSandbox       healthy

In-pod verification:
  - kars plugin: enabled=True source=user, 10 tools + 2 hooks
  - foundry_memory store_name = memory-execbrief-hermes (matches CR)
  - config.yaml mcp_servers.execbrief-hermes-deepwiki present
  - KARS_MCP_SERVERS=execbrief-hermes-deepwiki in agent env
  - Router /readyz: 200 ok

Note: the actual LLM execution of the prompt requires real Azure
OpenAI / Foundry credentials. With the fake-key dev overlay used in
this validation, the pipeline runs through Hermes → kars plugin →
router → upstream-call layer and hangs at the upstream (expected).
Running with real creds — either via `kars dev --target local-k8s`
with a real provider, or on AKS via `SCENARIO=exec-brief-hermes-single
PLATFORM=aks ./tools/e2e-harness/run.sh` — will execute the full
pipeline.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… hardening

End-to-end run of the new `exec-brief-hermes-single` scenario on
local-k8s surfaced four more bugs that all gate the prompt from
actually reaching the model:

1. **`pull_policy=Always` for `:latest` images** in dev mode forced a
   doomed registry pull (karsacr.azurecr.io/…) instead of using the
   kind-cached image. The controller now picks `IfNotPresent` when
   `KARS_DEV_PROFILE=true` is set on its own env. Production AKS
   stays on `Always` for `:latest`.

2. **Hermes' `tirith` auto-download** from GitHub releases blocked
   every cold start while the kars egress-guard slow-walked the
   fetch. Entrypoint now sets `TIRITH_ENABLED=false` by default;
   Hermes falls back to its built-in pattern-matching shell
   checker. Operators can re-enable by pre-baking the binary at
   `/usr/local/bin/tirith` and setting `TIRITH_ENABLED=true`.

3. **`HERMES_DISABLE_LAZY_INSTALLS=1`** suppresses Hermes' `pip
   install` of discord.py / google-* / brotlicffi on first use of
   bundled platform plugins. Saves 30–120s on every cold start;
   operators wanting the extras re-bake into the image.

4. **`HERMES_SKIP_NODE_BOOTSTRAP=1`** suppresses Hermes' shell-based
   Node.js 22 LTS auto-installer (scripts/install.sh). We pre-install
   `nodejs` + `nodejs-npm` from the Azure Linux 3 base repo
   (currently v20.14 — Hermes' dep_ensure accepts any modern node).
   Browser tools that need a Chromium download still need to be
   pre-baked separately.

All three Hermes-runtime knobs are also mirrored into
`$HERMES_HOME/.env` so they survive `kubectl exec` sessions
(kubectl exec spawns a fresh env that doesn't see entrypoint
exports). Hermes' env_loader loads .env at import time
(`hermes_cli/env_loader.py:_load_dotenv_with_fallback`).
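
A minimal sketch of the `.env` mirroring, assuming a plain `KEY=value`
file format. The helper name `mirror_knobs_to_dotenv` and the choice to
let an already-exported value override the default are assumptions, not
the entrypoint's confirmed behaviour.

```python
import os
from pathlib import Path

# Defaults for the three Hermes-runtime knobs named above.
DEFAULT_KNOBS = {
    "TIRITH_ENABLED": "false",
    "HERMES_DISABLE_LAZY_INSTALLS": "1",
    "HERMES_SKIP_NODE_BOOTSTRAP": "1",
}


def mirror_knobs_to_dotenv(hermes_home, environ=None) -> Path:
    """Write the knobs into $HERMES_HOME/.env, keeping unrelated lines."""
    environ = os.environ if environ is None else environ
    env_path = Path(hermes_home) / ".env"
    existing = env_path.read_text().splitlines() if env_path.exists() else []
    # Drop stale copies of managed keys so re-runs do not duplicate them.
    kept = [ln for ln in existing
            if ln.split("=", 1)[0].strip() not in DEFAULT_KNOBS]
    managed = [f"{name}={environ.get(name, default)}"
               for name, default in DEFAULT_KNOBS.items()]
    env_path.write_text("\n".join(kept + managed) + "\n")
    return env_path
```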

After all four fixes, verified end-to-end:
  - smoke-hermes sandbox: phase=Running, 2/2 Ready
  - Router /readyz: 200 ok (controller forwards real Foundry API
    key from `kars-dev-creds` Secret via secretKeyRef)
  - Router /v1/chat/completions: 200 with real gpt-5.4 reply ("OK"
    in 1.1s, latency_checkpoint shows engine_ttft_ms=108)
  - InferencePolicy / KarsMemory / ToolPolicy / McpServer all
    Ready=True / RouterEnforcing
  - Plugin loaded with 10 tools + 2 hooks + foundry_memory native
  - Platform MCP block present in config.yaml when
    FOUNDRY_PROJECT_ENDPOINT is bound

Outstanding gap (NOT in this commit): `hermes -z` still attempts an
outbound HTTPS connection (state=SYN_SENT to 104.18.3.115:443, a
Cloudflare IP; likely a check-update or telemetry endpoint that has
not been tracked down yet). The kars egress-guard's
forward-proxy stalls the connection rather than denying it outright,
so the prompt-driven path hangs after plugin discovery completes.
Workarounds:
  (a) `KARS_EGRESS_LEARN=true` to log unallowed hosts, then
      explicitly allowlist in EgressAllowlist;
  (b) find Hermes' env to disable check-update / telemetry — Act 1.x;
  (c) drive Hermes via Telegram channel instead of `hermes -z`.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pallakatos pallakatos force-pushed the hermes/act1-docker-smoke-fixes branch 2 times, most recently from 238453f to f7f16c8 Compare June 4, 2026 12:30
…l Foundry

The single-agent exec-brief scenario (research → JSON → scorecard PNG →
hero PNG → 2-page brief.md) now runs end-to-end on Hermes through the
kars router to real Azure Foundry gpt-5.4. Verified on local-k8s with
the user's ~/.kars/ creds.

Four fixes were needed (each surfaced sequentially as the agent loop
progressed further):

1. **`OPENAI_API_KEY` env routes Hermes to openrouter** (and openrouter.ai
   is blocked by the egress-guard). Switched the entrypoint's `.env`
   mirror to `AZURE_FOUNDRY_API_KEY` + `AZURE_FOUNDRY_BASE_URL` so
   resolve_provider() picks the `azure-foundry` provider (which has
   no built-in Cloudflare callback).

2. **`agent_init.py` hardcodes `_codex_reasoning_replay_enabled = True`**
   → Hermes echoes `{"type": "reasoning", "encrypted_content": "..."}`
   back to /v1/responses on every continuation, which Azure Foundry's
   strict schema validator rejects with `invalid_payload`. OpenAI's
   own Responses API accepts these. Hermes only learns to disable
   replay when the upstream returns `invalid_encrypted_content` (a
   different error code that Foundry doesn't emit).

   Router fix: `build_upstream_url()` in proxy.rs now strips
   `input[]` items of `type=reasoning` and the
   `include=["reasoning.encrypted_content"]` field from any /v1/responses
   request bound for Azure Foundry (NOT GitHub Models / Copilot —
   their schemas accept the original shape).

3. **/v1/responses handler used `forward()` (non-streaming)** but Hermes
   always opens these with `responses.create(stream=True)` and expects
   an SSE `text/event-stream` response. The buffered JSON blob made
   Hermes' SDK raise "Connection error" after ~15s and retry 6× before
   giving up with `max_retries_exhausted`. Switched the handler to
   `forward_stream()` so the SSE byte stream flows through unchanged.

4. **`forward_stream()` injected `stream_options.include_usage`**, which
   the OpenAI Responses API rejects (`unknown_parameter`). The
   injection is now skipped for /v1/responses (Foundry already emits
   usage in the terminating SSE event). It was already skipped for
   Anthropic /v1/messages, so the same exclusion now covers both shapes.
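
The payload rewrite from fix 2 can be illustrated in Python (the router
itself is Rust, so this is a sketch of the transformation and not the
shipped code). The function name is hypothetical, and dropping `include`
entirely when it becomes empty is an assumption.

```python
import copy


def strip_reasoning_replay(payload: dict) -> dict:
    """Return a copy of a /v1/responses request body without reasoning replay."""
    cleaned = copy.deepcopy(payload)

    # Remove replayed reasoning items from input[].
    items = cleaned.get("input")
    if isinstance(items, list):
        cleaned["input"] = [
            item for item in items
            if not (isinstance(item, dict) and item.get("type") == "reasoning")
        ]

    # Remove the request for encrypted reasoning content.
    include = cleaned.get("include")
    if isinstance(include, list):
        remaining = [x for x in include if x != "reasoning.encrypted_content"]
        if remaining:
            cleaned["include"] = remaining
        else:
            cleaned.pop("include")
    return cleaned
```

The input is copied rather than mutated so the original request stays
available for logging or for upstreams that accept the original shape.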

Plus the entrypoint now persists `model.{default,provider,base_url}` in
config.yaml on every boot (not just plugins+mcp_servers), so a fresh
pod doesn't need a one-time `hermes config set model` post-boot dance.

End-to-end run delivered:
  /sandbox/incoming/brief.md      6,136 B  (2 pages, real Markdown,
                                            12 footnoted https citations,
                                            references hero+scorecard PNGs
                                            inline, all 4 control-domain
                                            terms present)
  /sandbox/incoming/analyst.json  5,025 B  (foundry_web_search × 3 →
                                            trends / control_categories /
                                            runtimes / metrics)
  /sandbox/incoming/hero.png     30,094 B  (1024×1024, foundry_image_generation
                                            gpt-image-1, "Defense in Depth"
                                            isometric data-center cutaway)
  /sandbox/incoming/scorecard.png 12,201 B (1024×640, foundry_code_execute
                                            matplotlib grouped bar chart,
                                            4 runtimes × 4 control columns)

Router log: 30+ /v1/responses SSE streams, all 200 OK, latencies
1.6–67s. Foundry stream headers received for every request after
this fix; pre-fix only 2 of 8 requests had `Foundry complete` entries
before Hermes gave up.

Agent stdout (final response after autonomous tool-use loop):
> Done. Artifacts produced:
> - /sandbox/incoming/brief.md — 6136 bytes
> - /sandbox/incoming/hero.png — 30094 bytes
> - /sandbox/incoming/scorecard.png — 12201 bytes
> - /sandbox/incoming/analyst.json — 5025 bytes
> Verified: brief.md exists and references both image files
>           hero.png and scorecard.png exist as real PNGs
>           analyst.json exists with the normalized runtime comparison

All 932 router + 834 controller Rust tests still pass.

Deliverables captured under:
  tools/e2e-harness/out/hermes-exec-brief-delivered/

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@pallakatos pallakatos force-pushed the hermes/act1-docker-smoke-fixes branch from f7f16c8 to f21048d Compare June 4, 2026 12:30
Pal Lakatos and others added 2 commits June 4, 2026 16:15
Two visibility gaps surfaced after the Hermes exec-brief run:
operator panel showed `sandbox="unknown"` (instead of the real
sandbox name) and zero token counters for every /v1/responses call.

1. **sandbox label was "unknown"**: every `x-kars-sandbox` header
   parser fell back to `"unknown"` when the header wasn't set —
   which is the default for clients like Hermes' openai SDK that
   don't add kars-specific headers. Per-sandbox routers KNOW their
   own identity via the `SANDBOX_NAME` env (set by the controller).

   Added `resolve_sandbox_name()` helper at the top of inference.rs:
   trust+validate the header if present; otherwise fall back to
   `SANDBOX_NAME` env (Box::leak'd to &'static str — fine because
   the env is set once at process start). Replaces 4 hand-rolled
   `unwrap_or("unknown")` / `unwrap_or("self")` sites. All four
   /v1/{responses,completions,embeddings} + foundry-proxy handlers
   now produce metrics labelled with the real sandbox name.

2. **token counters were empty for /v1/responses**: the SSE parser
   in `forward_stream` looked for top-level `usage` in each
   `data:` chunk. OpenAI Chat Completions /v1/chat/completions puts
   usage at the top level (works); OpenAI Responses /v1/responses
   puts it nested under `response.usage` in the terminating
   `response.completed` event (didn't work — captured a real
   response.completed event to confirm).

   Parser now probes both shapes:
     v.get("usage").or_else(|| v.get("response")?.get("usage"))

   /v1/responses tokens are now counted (verified live: kars_tokens
   delta of +16 input / +12 output for a "list 3 colors" prompt;
   was +0 / +0 before).
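
The two-shape probe from item 2 looks like this when written out in
Python. It is an illustrative equivalent of the Rust one-liner above,
with a hypothetical function name, and it treats a `null` usage value as
missing, which may differ from the router's exact behaviour.

```python
import json
from typing import Optional


def extract_usage(data_line: str) -> Optional[dict]:
    """Pull token usage out of one SSE `data:` line, if present.

    Chat Completions puts `usage` at the top level; the Responses API
    nests it under `response.usage` in the `response.completed` event.
    """
    if not data_line.startswith("data:"):
        return None
    raw = data_line[len("data:"):].strip()
    if not raw or raw == "[DONE]":
        return None
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None

    usage = event.get("usage")
    if usage is None:
        response = event.get("response")
        if isinstance(response, dict):
            usage = response.get("usage")
    return usage if isinstance(usage, dict) else None
```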

Verified on local kind cluster after rebuild:

  kars_inference_requests_total{model="gpt-5.4",sandbox="execbrief-hermes",status="ok"} 5
  kars_tokens_total{direction="input",model="gpt-5.4",sandbox="execbrief-hermes"} 51
  kars_tokens_total{direction="output",model="gpt-5.4",sandbox="execbrief-hermes"} 30

The operator panel's "Inference by sandbox" + token-mix dashboards
now populate correctly for Hermes / pydantic-ai / langgraph / any
runtime that uses /v1/responses with non-kars HTTP clients.

932 router tests + cargo clippy --all-targets -- -D warnings clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tool deny list

Closes the inter-agent comms gap for Python frameworks. Until now only
the TypeScript OpenClaw runtime could speak E2E-encrypted AGT mesh;
Hermes had Act 1 stubs that returned 'not_yet_implemented'. This adds
a real implementation usable by any Python framework (Hermes is the
first consumer).

## What ships

1. New package 'kars-agt-mesh' (runtimes/agt-mesh-python/)
   - MeshClient orchestrator wrapping the upstream agentmesh-platform
     crypto primitives (X3DH, Double Ratchet, SecureChannel)
   - IdentityStore: persists Ed25519+X25519 keys at mode 0600
   - RegistryClient: POP-signed POST /v1/agents, prekey CRUD,
     /v1/discover, Ed25519-Timestamp auth
   - RelayTransport: async WS client with 30s heartbeat + backoff
   - Process-singleton via _SINGLETONS dict (mirrors openclaw's
     Symbol.for('agt-mesh-client') pattern)
   - Runtime-neutral — no Hermes-specific code
   - 9 unit tests pass

2. Hermes mesh adapter (runtimes/hermes/.../plugin/mesh.py)
   - Replaces Act 1 mesh_stubs.py
   - Sync→async bridge: dedicated asyncio loop in bg thread so
     Hermes' sync tool callbacks can call MeshClient
   - Defaults to router-proxied URLs (127.0.0.1:8443/agt/{relay,registry})
     so egress-guard iptables stay in place
   - Registers kars_mesh_{send,inbox,await,transfer_file}

3. Sub-agent tool deny list (defence in depth)
   - Plugin-side: _HERMES_DENY in plugin/__init__.py deregisters
     delegate_task, mixture_of_agents, cronjob, kanban_create,
     kanban_comment, send_message
   - AGT-profile-side: denied_actions block in scenario ToolPolicy
     catches the same six names at priority 100
   - Rationale per-tool in security audit doc

4. Dockerfile updated to install kars-agt-mesh wheel before plugin stage

5. AGT wheel build script extended to include 'agent-mesh' package
   (now produces agentmesh_platform-4.0.0)
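
The sync→async bridge from item 2 follows a common pattern: run one
asyncio loop in a background thread and submit coroutines to it from
synchronous callers. A minimal sketch, assuming nothing about the real
`mesh.py` beyond that description (the class name `AsyncBridge` and the
30s default timeout are hypothetical):

```python
import asyncio
import threading


class AsyncBridge:
    """Run coroutines on a dedicated background event loop."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self._loop.run_forever, name="mesh-loop", daemon=True
        )
        self._thread.start()

    def call(self, coro, timeout: float = 30.0):
        """Block the calling (sync) thread until `coro` finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self) -> None:
        """Stop the loop and wait for the background thread to exit."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5.0)
        self._loop.close()


async def fake_send(did: str, text: str) -> dict:
    # Stand-in for an async MeshClient call; not the real API.
    await asyncio.sleep(0)
    return {"to": did, "bytes": len(text)}
```

A sync tool callback would then call something like
`bridge.call(fake_send("did:mesh:example", "hi"))` and receive the
result as an ordinary return value.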

## Live verification on kind-kars-dev

- MeshClient.connect() returns 201 from registry, WS upgrade OK
- Self-discovery via /v1/discover returns own DID
- Plugin loader log shows 6 deregistrations + 4 mesh tools present
- 83 Hermes unit tests + 9 kars-agt-mesh unit tests pass

## Critical bug fixed mid-implementation

Initial POP shape sent raw 32-byte public key + ts; registry expected
base64url-string(pub) + ts. Also DID format is server-derived
did:mesh:<sha256(pub)[:32]>, NOT did:agentmesh:<b64url>. Fixed both
in registry_client.py and identity.py. Memory stored for future
non-TS SDK implementers.

## Security audit

See docs/internal/security-audits/2026-06-04-hermes-act2-mesh-deny.md
(2 sign-offs, ci-gates green).

## Deferred to Act 2.2

- KNOCK auto-accept responder (currently logs only — Hermes only
  initiates so not reachable yet)
- Cross-runtime golden vectors (TS↔Python interop test)
- Multi-process Hermes broker (lazy_install subprocess) — not
  reachable while delegate_task is denied

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Comment thread runtimes/agt-mesh-python/src/kars_agt_mesh/relay_transport.py Fixed
Comment thread runtimes/agt-mesh-python/tests/test_package_shape.py Fixed
Comment thread runtimes/hermes/src/kars_runtime_hermes/plugin/mesh.py Fixed
Comment thread runtimes/hermes/src/kars_runtime_hermes/plugin/mesh.py Fixed
Comment thread runtimes/hermes/src/kars_runtime_hermes/plugin/mesh.py Fixed
Comment thread runtimes/hermes/src/kars_runtime_hermes/plugin/mesh.py Fixed
… working

Lands the protocol-correct fixes needed for MeshClient.connect() →
KNOCK → X3DH → Double Ratchet roundtrip between two sandboxes. Tested
end-to-end on kind-kars-dev with two Hermes pods (execbrief-hermes and
smoke-hermes) on the FRESHLY BUILT image (no hot patches):

- pod A registers, uploads prekey bundle, opens relay WS (with POP)
- pod B does the same
- pod A discovers B via /v1/discover (freshest-first sort)
- pod A fetches B's bundle, runs X3DH, sends KNOCK + first ciphertext
- pod B's _handle_knock_frame auto-accepts via SecureChannel.create_receiver,
  decrypts plaintext 'hello from execbrief-hermes'
- pod B replies via send_by_did → encrypted message frame
- pod A decrypts 'pong from smoke-hermes'

## Critical protocol fixes

1. **Relay WS connect-frame POP** (relay_transport.py)
   - Was: {type:'connect', from:did, ts:...}
   - Now: full proof-of-possession (std-base64 pub_key + iso ts + sig
     over ts), per AGT relay/app.py::_verify_connect_pop
   - Without this, the relay rejects every connection with
     'connect frame missing did/public_key/timestamp/signature'

2. **Registry auth header** (registry_client.py)
   - Was: three separate X-Agent-DID/Timestamp/Signature headers,
     signature over method+path+ts
   - Now: single 'Authorization: Ed25519-Timestamp <did> <ts> <b64url-sig>',
     signature over timestamp string only
   - Matches AGT registry/app.py::verify_ed25519_timestamp_auth

3. **X3DH bootstrap missing** (client.py)
   - Now connect() builds X3DHKeyManager + generates signed_pre_key
     + 10 OTKs + uploads bundle via PUT /v1/agents/{did}/prekeys
   - Without this, peers couldn't fetch our bundle, X3DH initiation
     would fail at the responder side

4. **KNOCK responder implemented** (client.py::_handle_knock_frame)
   - Was: log-only stub ('responder path not implemented')
   - Now: parses ChannelEstablishment, calls SecureChannel.create_receiver,
     caches the channel, decrypts the bundled first ciphertext,
     eagerly tops up the OTK pool for the next session

5. **Send fuses KNOCK + first message** (client.py::send_by_did)
   - First call to a new peer DID sends {type:'knock', establishment, ciphertext}
   - Subsequent calls send {type:'message', ciphertext}
   - Matches the TS SDK wire convention (one RTT, not two)

6. **AAD directionality fix** (client.py)
   - Initiator: f'{self_did}|{peer_did}'
   - Responder: f'{from_did}|{self_did}' (reconstructs the same bytes)

7. **EncryptedMessage wire format** (client.py)
   - Was: JSON of em.__dict__ (would fail at decoder)
   - Now: EncryptedMessage.serialize() / .deserialize() (binary + b64url)

8. **PeerBundle flat shape** (registry_client.py + client.py)
   - Was: nested dicts mirroring my best-guess wire format
   - Now: matches agentmesh.encryption.x3dh.PreKeyBundle's flat dataclass

9. **register_self handles 409 gracefully** (registry_client.py)
   - Was: raised MeshRegistryError, blocking every restart
   - Now: logs and continues — the subsequent prekey PUT (with
     Ed25519-Timestamp auth) proves we own the same key

10. **discover() sorts freshest-first** (registry_client.py)
    - Avoids hitting stale ghost-DIDs when a sandbox restarts with
      a new identity before the prior registration ages out
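
Fixes 5 and 6 are easy to get subtly wrong, so here is a small sketch of
both. It only models frame selection and AAD construction; there is no
real crypto, and the class and function names plus the UTF-8 encoding
are assumptions for illustration.

```python
def initiator_aad(self_did: str, peer_did: str) -> bytes:
    """AAD as built by the side that starts the session."""
    return f"{self_did}|{peer_did}".encode("utf-8")


def responder_aad(from_did: str, self_did: str) -> bytes:
    """AAD as rebuilt by the receiver; must equal the initiator's bytes."""
    return f"{from_did}|{self_did}".encode("utf-8")


class FrameBuilder:
    """Choose between a fused KNOCK frame and a plain message frame."""

    def __init__(self) -> None:
        self._established = set()

    def next_frame(self, peer_did: str, ciphertext: str, establishment=None) -> dict:
        if peer_did not in self._established:
            # First contact: KNOCK carries the establishment and first ciphertext.
            self._established.add(peer_did)
            return {"type": "knock", "establishment": establishment,
                    "ciphertext": ciphertext}
        return {"type": "message", "ciphertext": ciphertext}
```

Both AAD helpers put the initiator's DID first, which is why the
responder reconstructs the same bytes despite seeing the pair from the
other side.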

## Tests

- 9 kars-agt-mesh unit tests pass
- 83 Hermes unit tests pass
- Live bidirectional roundtrip verified on freshly-built image
  (build hash c1dcdfc11475... loaded into kind-kars-dev)

## Security audit updated

docs/internal/security-audits/2026-06-04-hermes-act2-mesh-deny.md
- Residual risk #1 (no KNOCK responder) removed — now implemented.
- Added residual risk #4 (stale registry entries — non-security).
- Added live bidirectional test description.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Comment thread runtimes/agt-mesh-python/src/kars_agt_mesh/relay_transport.py Fixed
Comment thread runtimes/agt-mesh-python/src/kars_agt_mesh/relay_transport.py Fixed
…ug Hermes mesh egress-guard hole

## controller/src/reconciler/mod.rs

Adds three runtime-neutral env vars injected on EVERY agent container
(not just OpenClaw):

- KARS_MODEL=<inference model> — generic alias for OPENCLAW_MODEL so
  Hermes / OpenAIAgents / MAF / BYO can read the same value without
  knowing about runtime-specific env names
- KARS_RUNTIME_CONTRACT_VERSION=v1 — self-documenting marker that
  this container claims to participate in the kars v1 runtime contract
- KARS_RUNTIME_KIND=<Debug repr of RuntimeKind> — uniform anchor any
  plugin can use to introspect what runtime it's running as

Lifted from the OpenClaw-only `is_openclaw` gate. All 834 controller
tests still pass.

## runtimes/hermes/.../plugin/mesh.py

**Real bug fix**: the Hermes mesh plugin was reading AGT_RELAY_URL /
AGT_REGISTRY_URL from env. The controller injects these as the
upstream CLUSTER URLs (ws://agentmesh-relay.agentmesh.svc:8765 etc.)
— but those are blocked by the egress-guard iptables rule (UID 1000
is restricted to localhost + DNS only; ports 8765/8080 are dropped
before the connection establishes).

The OpenClaw runtime makes the same choice deliberately in
`runtimes/openclaw/src/core/mesh-registry.ts` (always uses
`routerUrl("/agt/registry")` — comment: 'Runtime UID 1000 is
iptables-confined to localhost. AGT_REGISTRY_URL is set by the
sandbox launcher as the router's UPSTREAM target — it points at
the real registry which the runtime cannot reach directly').

Now Hermes does the same: hardcodes 127.0.0.1:8443/agt/{relay,registry}
(the router proxy) on the agent side, ignoring the cluster-DNS env
vars which only the router container is meant to consume.

## Live verification

End-to-end mesh round-trip re-run on the rebuilt controller + sandbox
images (no hot patches):
- pod A (execbrief-hermes) registers, discovers pod B, KNOCK + X3DH
- pod B auto-accepts, decrypts 'hello from execbrief-hermes', replies
- pod A decrypts 'pong from smoke-hermes'

Env vars confirmed present on the agent container post-reconcile:
  KARS_MODEL=gpt-5.4
  KARS_RUNTIME_CONTRACT_VERSION=v1
  KARS_RUNTIME_KIND=Hermes

## Tests

- 834 controller tests pass (cargo test -p kars-controller)
- 83 Hermes unit tests pass
- 9 kars-agt-mesh unit tests pass
- cargo clippy --package kars-controller -- -D warnings clean

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Comment thread runtimes/hermes/src/kars_runtime_hermes/plugin/mesh.py Fixed
Pal Allakatos and others added 3 commits June 5, 2026 01:52
Wires the missing pieces so a Hermes parent can spawn Hermes children
AND mesh-message them through the real Python AGT MeshClient.
Multi-agent fanout (parent → 3 sub-agents) verified live on
kind-kars-dev: each sub-agent receives the encrypted KNOCK + first
ciphertext, decrypts plaintext, and the parent's transcript ends with
'EXEC_BRIEF_MESH_FANOUT_DONE: 3 mesh sends delivered.'

## Bug fixes

### 1. Hermes parent now spawns Hermes children (NOT OpenClaw)

inference-router/src/spawn/mod.rs::build_sub_agent_crd_with_labels
hard-coded `runtime.kind = OpenClaw` for every spawn. Now it:
  - Accepts an explicit `runtime_kind` field on SpawnRequest.
  - Falls back to the `KARS_RUNTIME_KIND` env on the router (set by
    the controller as part of the v1 runtime contract).
  - Falls back to "OpenClaw" for backward compat.
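
The three-step fallback above can be sketched as follows. The real implementation lives in the Rust router; this Python version is only an illustration of the resolution order, and `resolve_runtime_kind` is a hypothetical name.

```python
import os
from typing import Mapping, Optional


def resolve_runtime_kind(
    requested: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    # 1. An explicit runtime_kind on the spawn request wins.
    if requested:
        return requested
    # 2. Otherwise inherit the kind the controller injected on the router.
    from_env = env.get("KARS_RUNTIME_KIND", "").strip()
    if from_env:
        return from_env
    # 3. Backward-compatible default.
    return "OpenClaw"
```

Passing the environment as a parameter keeps the sketch testable without mutating process-wide state, which is the same hazard the env-locked test guard mentioned under Tests is there to avoid.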

Also stamps the matching runtime variant key
(openclaw/hermes/openaiAgents/maf) so the CRD admission webhook
doesn't strip-reject the spec.

Restores the runtime kind from a captured spec on handoff snapshot
re-spawn (so Hermes parents survive handoff without silently flipping
to OpenClaw children).

### 2. Controller injects KARS_RUNTIME_KIND on the router container

controller/src/reconciler/mod.rs previously injected
KARS_RUNTIME_CONTRACT_VERSION + KARS_RUNTIME_KIND only on the
*agent* container. Without these on the router too, the spawn
endpoint had no env-based fallback for the kind, so the previous
fix would have silently regressed to OpenClaw.

### 3. Hermes mesh.py accepts OpenClaw-style arg naming

kars_mesh_send now accepts `to_agent` (OpenClaw convention) and
`to` (short form), and `content` plus `payload`, so prompts
written for the OpenClaw mesh API work on Hermes too. Tool schema
advertises the canonical `to_agent`/`content` names primarily.
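
A minimal sketch of the alias handling, assuming the canonical names take precedence when several are supplied (the source does not state the precedence). `resolve_send_args` and `first_present` are hypothetical helpers, not the names used in mesh.py.

```python
from typing import Any, Mapping, Optional, Sequence, Tuple


def first_present(args: Mapping[str, Any], names: Sequence[str]) -> Optional[Any]:
    # Return the first non-empty value among the accepted spellings.
    for name in names:
        value = args.get(name)
        if value not in (None, ""):
            return value
    return None


def resolve_send_args(args: Mapping[str, Any]) -> Tuple[str, str]:
    # Canonical names first, then the alternate spellings.
    target = first_present(args, ("to_agent", "to"))
    body = first_present(args, ("content", "payload"))
    if target is None:
        raise ValueError("kars_mesh_send needs 'to_agent' (or 'to')")
    if body is None:
        raise ValueError("kars_mesh_send needs 'content' (or 'payload')")
    return str(target), str(body)
```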

### 4. Hermes plugin eagerly pre-registers MeshClient at load

runtimes/hermes/.../plugin/__init__.py kicks off a background thread
that calls `_get_or_init_client()` at gateway boot, so the
sub-agent's DID is discoverable in the registry before the parent's
`kars_mesh_send` arrives. Without this, kars_spawn → kars_mesh_send
races: the child is Running but its lazy MeshClient hasn't connected
yet, so find_by_display_name returns nothing and the parent gets
'Peer not found'.
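
The eager pre-registration can be sketched as below. This is an illustration under assumptions: the real `_get_or_init_client()` takes no arguments, whereas this hypothetical `get_or_init_client` accepts a factory so the example stays self-contained.

```python
import threading
from typing import Any, Callable, Optional

_client: Optional[Any] = None
_client_lock = threading.Lock()


def get_or_init_client(factory: Callable[[], Any]) -> Any:
    # Lazy singleton: the first caller constructs the client (which is
    # where registration would happen); later callers reuse it.
    global _client
    with _client_lock:
        if _client is None:
            _client = factory()
        return _client


def start_eager_init(factory: Callable[[], Any]) -> threading.Thread:
    # Kick off initialization in the background at plugin load so the
    # DID is registered before any peer tries to look it up.
    thread = threading.Thread(
        target=get_or_init_client,
        args=(factory,),
        name="kars-mesh-eager-init",
        daemon=True,
    )
    thread.start()
    return thread
```

The lock means a tool call that arrives while the background thread is still connecting waits for the same client rather than constructing a second one.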

### 5. Discovery falls back to capability when registry omits metadata

runtimes/agt-mesh-python/.../registry_client.py find_by_display_name
no longer requires `metadata.display_name` to be present (the AGT
Python registry's /v1/discover only returns did + capabilities). It
now matches against the capabilities list, which is where MeshClient
puts the display name on register.
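
A sketch of the capability fallback, assuming each discovery entry is a dict with `did`, `capabilities`, and an optional `metadata` mapping. That entry shape is an assumption for illustration, and this function is not the code in registry_client.py.

```python
from typing import Any, Iterable, Mapping, Optional


def find_by_display_name(
    entries: Iterable[Mapping[str, Any]],
    display_name: str,
) -> Optional[str]:
    for entry in entries:
        # Prefer metadata.display_name when the registry returns it.
        metadata = entry.get("metadata") or {}
        if metadata.get("display_name") == display_name:
            return entry["did"]
        # Otherwise fall back to the capabilities list, where the
        # display name is placed at registration time.
        if display_name in entry.get("capabilities", []):
            return entry["did"]
    return None
```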

## Harness additions

### tools/e2e-harness/platforms/aks.sh

- New `hermes-exec` prompt driver (selected via
  SCENARIO_PROMPT_DRIVER=hermes-exec) for runtimes that don't expose
  an HTTP gateway on port 18789. Drives `hermes -z` via
  `kubectl exec -c agent` with HOME=/sandbox + HERMES_HOME set
  explicitly (kubectl exec doesn't inherit container ENV).
- Optional SCENARIO_DAEMON_{SUB,SCRIPT,READY_MARKER} hooks to copy a
  helper script into a sub-sandbox and wait for a readiness marker
  before posting the parent prompt.
- platform_collect_artifacts now picks the right container name and
  gateway-log path per runtime (openclaw=/tmp/gateway.log,
  hermes=/sandbox/.hermes/logs/gateway.log).

### tools/e2e-harness/scenarios/mesh-roundtrip-hermes/

Minimal smoke scenario: two pods, one Python echo daemon, one LLM
prompt that calls kars_mesh_send + kars_mesh_await and reports the
decoded plaintext. Verified end-to-end on freshly-built images.

### tools/e2e-harness/scenarios/exec-brief-hermes/

Multi-agent variant: parent uses kars_spawn to launch 3 Hermes children
(analyst/viz/writer), then fans out via kars_mesh_send. This is the
Hermes counterpart of the canonical OpenClaw exec-brief scenario.

## inference-router/Dockerfile.dev

The canonical Dockerfile is distroless (no shell). The controller's
egress-guard init container runs `sh -c "iptables ..."` which can
only work on an image that has sh + iptables. The .dev variant uses
mcr.microsoft.com/azurelinux/base/core:3.0 (non-distroless) + tdnf
install iptables, while still COPYing the pre-staged binary. Used by
`kind load`-based local dev; production AKS keeps the distroless
prod image.

## Tests

- 83 Hermes unit tests pass.
- 9 kars-agt-mesh unit tests pass.
- 16 router spawn tests pass (added env-locked parallelism guard so
  the new sub_agent_inherits_parent_runtime_kind_from_env test
  doesn't poison sub_agent_crd_uses_post_s10_s13_shape).
- All 834 controller tests pass.
- cargo clippy --package kars-inference-router -- -D warnings clean.

## Live verification on kind-kars-dev

Multi-agent fanout reproduced end-to-end (run.sh-equivalent invocation):

  $ hermes -z 'kars_mesh_send to_agent="analyst" content="ECHO_TEST_ANALYST";
                kars_mesh_send to_agent="viz"     content="ECHO_TEST_VIZ";
                kars_mesh_send to_agent="writer"  content="ECHO_TEST_WRITER";
                emit EXEC_BRIEF_MESH_FANOUT_DONE'
  EXEC_BRIEF_MESH_FANOUT_DONE: 3 mesh sends delivered.

  analyst daemon log: PRE_REG_GOT bytes=17 text='ECHO_TEST_ANALYST'
  viz    daemon log: PRE_REG_GOT bytes=13 text='ECHO_TEST_VIZ'
  writer daemon log: PRE_REG_GOT bytes=16 text='ECHO_TEST_WRITER'

kubectl get karssandbox -n kars-system shows all 4 as RUNTIME=Hermes
(not the prior bug where Hermes parent spawned OpenClaw children).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Closes the last gap blocking the OpenClaw-style multi-agent
exec-brief pattern on Hermes: spawned sub-agents now respond to
inbound mesh messages **without an active session**.

## Problem

After Act 2.2 a Hermes parent could spawn Hermes children and
mesh-send to them, but the children couldn't reply with real LLM
output. Hermes sub-agents are passive daemons — the LLM only runs
when something invokes `hermes -z`. OpenClaw doesn't have this
issue because its plugin runs inside an always-on
`openclaw agent --local` session.

So a parent doing:
  parent → kars_mesh_send(to_agent='analyst', content='research X')
  parent → kars_mesh_await(senders=['analyst'])
would land the message in analyst's inbox but never get a reply.
The analyst's Hermes daemon would just queue the message and sleep.

## Fix

New `runtimes/hermes/.../plugin/mesh_worker.py`: a background
asyncio loop in each sub-agent that:
  1. Drains the shared MeshClient inbox.
  2. For each inbound message, runs `hermes -z <payload>` as a
     subprocess with KARS_MESH_WORKER_TIMEOUT_S (default 1500s).
  3. Resolves the sender's display name via the registry.
  4. Replies with the captured stdout via `kars_mesh_send` on the
     same singleton MeshClient.
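
The loop above can be sketched roughly as follows. This is not mesh_worker.py: the inbox is modelled as an `asyncio.Queue` of `(sender, payload)` pairs, the reply path is an injected coroutine, the display-name lookup (step 3) is omitted, and the `"TIMEOUT"` reply text is a placeholder. Only the environment variable name and the 1500s default come from the description above.

```python
import asyncio
import os
from typing import Awaitable, Callable, List

DEFAULT_TIMEOUT_S = 1500.0


def worker_timeout_s() -> float:
    # KARS_MESH_WORKER_TIMEOUT_S overrides the default when set.
    return float(os.environ.get("KARS_MESH_WORKER_TIMEOUT_S", DEFAULT_TIMEOUT_S))


async def run_prompt(argv_prefix: List[str], payload: str, timeout_s: float) -> str:
    # Run one prompt as a subprocess and capture its stdout.
    proc = await asyncio.create_subprocess_exec(
        *argv_prefix,
        payload,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return "TIMEOUT"  # placeholder reply, an assumption
    return stdout.decode("utf-8", errors="replace").strip()


async def drain_once(
    inbox: "asyncio.Queue",
    reply: Callable[[str, str], Awaitable[None]],
    argv_prefix: List[str],
    timeout_s: float,
) -> int:
    # Handle every message currently queued, replying to each sender.
    handled = 0
    while not inbox.empty():
        sender, payload = inbox.get_nowait()
        answer = await run_prompt(argv_prefix, payload, timeout_s)
        await reply(sender, answer)
        handled += 1
    return handled
```

In the sub-agent, `argv_prefix` would be `["hermes", "-z"]` and `timeout_s` would come from `worker_timeout_s()`. Taking the command as a parameter lets the loop be exercised with any executable.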

Opt-in via `KARS_MESH_AUTO_RESPONDER=1`. The controller sets this
ONLY on Hermes sandboxes that have the
`kars.azure.com/parent` label (i.e. children spawned by another
sandbox via the router's spawn endpoint). The parent never has it
enabled: the parent IS the human/external-driver and would otherwise
loop on the children's replies.

The plugin's `__init__`'s eager-init thread now also calls
`mesh_worker.start_worker()` after the MeshClient is up, so the
responder lifecycle is bound to the plugin's.

## Live verification

Multi-step exec-brief on kind-kars-dev with real Foundry work:

  parent → analyst:  'research 2026 agentic AI runtimes, reply ANALYST_FOUND: <url>'
  parent → viz:      'use foundry_code_execute to print a JSON dict'
  parent → writer:   'use file_write to author /sandbox/incoming/brief.md'
  parent → kars_mesh_await(senders=[analyst,viz,writer], timeout=600)

Parent transcript:
  WRITER_DONE: 486
  VIZ_DONE: {"chart_ready": true, "format": "bar", "width": 1024}

Writer pod /sandbox/incoming/brief.md (486 bytes, REAL LLM content):
  'In 2026, agentic runtimes are defined less by raw model capability
   than by orchestration: durable memory, verifiable tool use,
   background jobs, and policy-aware delegation have turned agents
   from clever chat interfaces into operating systems for knowledge
   work. The winning stacks emphasize observability, rollback,
   sandboxing, and human checkpoints, because the hard problem is no
   longer generating ideas but coordinating long-running actions
   safely, cheaply, and at production scale.'

Sub-agent daemon logs confirm:
  - Accepted KNOCK from parent's DID
  - AUTO_GOT bytes=<inbound>
  - AUTO_REPLIED bytes=<reply> to=<parent DID>

(Analyst's reply landed slightly past the parent's await window so
the parent's transcript shows TIMEOUT: 2 received — the mesh path
itself worked for all 3; only the LLM coordination timing was tight
because foundry_web_search adds 30+s to analyst's hermes -z latency.
Verified independently that analyst auto-responded with 16 bytes.)

## Tests

- 83 Hermes unit tests pass
- 9 kars-agt-mesh unit tests pass
- 834 controller tests pass
- 16 router spawn tests pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Comment thread runtimes/hermes/src/kars_runtime_hermes/plugin/mesh_worker.py Fixed
Comment thread runtimes/hermes/src/kars_runtime_hermes/plugin/__init__.py Fixed
Pal Allakatos and others added 4 commits June 5, 2026 06:44
…reakdown

Two operator-visibility fixes called out during the Act 2.3 live
verification:

## 1. Hermes pre_tool_call hook crashed silently → no AGT audit for tools

Root cause: `runtimes/hermes/.../plugin/governance.py::_on_pre_tool_call`
took positional arg `params`, but Hermes 0.15.2 invokes the hook
with KEYWORD args matching `plugins.py:1685-1707`:

  tool_name=<name>, args=<dict>, task_id=<id>,
  session_id=<id>, tool_call_id=<id>

Our signature `(tool_name, params, **_kwargs)` matched `tool_name`
but every other kw landed in `**_kwargs` and `params` stayed unbound.
Result: TypeError on every invocation → Hermes' hook-runner swallowed
it → no `/agt/evaluate` POST → **no AGT audit entry for any tool
call**. Operator saw only `inference:responses:gpt-5.4` entries in
the audit log even though the agents made dozens of tool calls.

Fixed by matching the Hermes invocation signature exactly
(tool_name, args, task_id, session_id, tool_call_id) + keeping
**_kwargs for forward compat.
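
The failure mode is plain Python keyword binding and can be reproduced without Hermes. In this sketch the two hook functions are simplified stand-ins; the default values on the new signature are an assumption.

```python
def old_hook(tool_name, params, **_kwargs):
    # 'params' is required, but the caller never passes a keyword
    # with that name, so the call fails before the body runs.
    return None


def new_hook(tool_name, args=None, task_id=None, session_id=None,
             tool_call_id=None, **_kwargs):
    # Parameter names match the keywords the host passes.
    return {"tool": tool_name, "args": args or {}}


# The keyword set described above for Hermes 0.15.2.
call = dict(
    tool_name="write_file",
    args={"path": "/sandbox/incoming/x.txt"},
    task_id="t1",
    session_id="s1",
    tool_call_id="c1",
)

try:
    old_hook(**call)
    old_error = None
except TypeError as exc:
    old_error = str(exc)  # missing required positional argument 'params'

new_result = new_hook(**call)
```

Because the host swallows hook exceptions, the `TypeError` never surfaces on its own. A contract test that calls the hook with the exact keyword set is the cheapest guard.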

Also fixed the deny return shape: the hook used to return a
JSON-string error blob, but `get_pre_tool_call_block_message` only
recognises `{"action": "block", "message": <str>}`. Old denies
were logged + ignored — the tool actually ran. New dict-shape denies
make the block actually block.

Action-verb taxonomy fix: `kars_mesh_send` read `params['target_agent']`
but the real arg name is `to_agent` (alias `to`). Action verb
became `mesh:send:` (empty target). Now accepts all three names.
Also added `mesh:inbox` and `mesh:await` verbs for the drain/wait
tools.
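
A sketch of the two fixes above. `extract_block_message` is a hypothetical stand-in that mimics the described behaviour of `get_pre_tool_call_block_message`; it is not Hermes' code. `mesh_send_verb` is likewise illustrative.

```python
import json
from typing import Any, Mapping, Optional


def block(message: str) -> dict:
    # The only deny shape the host is described as recognising.
    return {"action": "block", "message": message}


def extract_block_message(hook_result: Any) -> Optional[str]:
    # Stand-in for the host-side check: anything that is not the
    # expected dict shape is treated as "no block".
    if (
        isinstance(hook_result, dict)
        and hook_result.get("action") == "block"
        and isinstance(hook_result.get("message"), str)
    ):
        return hook_result["message"]
    return None


def mesh_send_verb(args: Mapping[str, Any]) -> str:
    # Accept all three spellings of the target argument.
    target = args.get("to_agent") or args.get("to") or args.get("target_agent") or ""
    return f"mesh:send:{target}"


old_style = json.dumps({"error": "denied by policy"})  # ignored by the host
new_style = block("denied by policy")                  # actually blocks
```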

### Live verification

Before fix, parent's /agt/audit:
  inference:responses:gpt-5.4 × 63   (every line, no tool entries)

After fix, parent's /agt/audit:
  inference:responses:gpt-5.4 × 64
  tool:kars_discover:writer × 1      ← NEW
  mesh:send:writer × 1               ← NEW

Writer's /agt/audit after fix:
  tool:write_file:/sandbox/incoming/audit_evidence.txt × 1   ← NEW

## 2. Sent ≫ received metric asymmetry now legible

Operator UX was showing e.g. 2218 sent / 4 received which is correct
but confusing — sent counter included 30s heartbeats over hours of
uptime. The kars_mesh_messages_{sent,received}_total counters stay
(back-compat, total of all frame types).

New counters break the total down by frame type:

  kars_mesh_frames_sent_total{type='heartbeat'}     — 30s keepalive
  kars_mesh_frames_sent_total{type='message'}       — app payload
  kars_mesh_frames_sent_total{type='knock'}         — session establish
  kars_mesh_frames_sent_total{type='connect'}       — POP / WS open
  kars_mesh_frames_sent_total{type='ack'}           — KNOCK/heartbeat ack
  kars_mesh_frames_sent_total{type='unknown'}       — unclassified

Same shape for kars_mesh_frames_received_total.

Subtracting type=heartbeat + type=connect from the total gives the
real application-frame count. Operator dashboards can now show:

  app_sent = sum(rate(kars_mesh_frames_sent_total{type!~'heartbeat|connect'}[5m]))

Classification is a cheap byte-prefix scan (first 80 bytes); the test
`classify_frame_type_buckets_known_kinds` guards every bucket and
`classify_frame_type_handles_short_input` guards bounds.
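
A Python sketch of a prefix-scan classifier in the same spirit. The router's implementation is in Rust and its exact matching rules are not shown here, so the marker format and the whitespace normalisation are assumptions.

```python
KNOWN_TYPES = ("heartbeat", "message", "knock", "connect", "ack")
PREFIX_LEN = 80  # only the first 80 bytes are inspected


def classify_frame_type(frame: bytes) -> str:
    # Slicing never raises on short input, so empty or truncated
    # frames fall through to "unknown".
    prefix = frame[:PREFIX_LEN].replace(b" ", b"")
    for kind in KNOWN_TYPES:
        marker = b'"type":"' + kind.encode("ascii") + b'"'
        if marker in prefix:
            return kind
    return "unknown"
```

Stripping spaces lets one marker match both compact JSON and the `"type": "knock"` spacing that Python's `json.dumps` emits by default.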

## Tests

- 84 Hermes unit tests pass (3 new govern hook contract tests)
- 936 router lib tests pass (2 new classify_frame_type tests)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Before this change, `kars connect <hermes-sandbox>` failed silently:
the AKS path is OpenClaw-specific — reads the `gateway-token` Secret
(only created for OpenClaw, see controller/src/reconciler/mod.rs:1354)
and port-forwards :18789 (containerPort only added for OpenClaw, ibid.
:1852). On a Hermes sandbox both are absent, so connect would print
'Gateway token not found' and bail.

Adds a Hermes-specific branch in cli/src/commands/connect.ts that
runs after the AKS-existence check but before the WebUI/shell logic:

  if (runtimeKind === 'Hermes') {
    kubectl exec -it -c agent -- env HOME=/sandbox HERMES_HOME=...
      hermes chat --accept-hooks
  }

`hermes chat` is the canonical interactive REPL (per
`hermes --help` in 0.15.2 — running `hermes` alone prints usage).
`--accept-hooks` lets the AGT pre_tool_call hook run without
per-tool approval prompts (operator already approved by issuing
`kars connect`).

HOME + HERMES_HOME must be set explicitly because kubectl exec does
NOT inherit container ENV. Hermes' `ensure_hermes_home()` falls
back to $HOME/.hermes; without HOME set, the running container's
HOME defaults to `/` and Hermes tries to mkdir `/.hermes` which
ENOENTs on the read-only rootfs. /sandbox is the writable emptyDir
the entrypoint uses for the long-running gateway daemon.

The exec-ban VAP only targets container name `openclaw`; Hermes'
container is `agent` (set in controller/src/reconciler/mod.rs:1801
from the `is_openclaw` branch), so this is admission-compliant. See
`deploy/helm/kars/templates/admission-pod-exec-ban.yaml`
`matchConditions`.

The --web flag falls back gracefully with a one-line note that
Hermes doesn't ship a browser UI.

The --reset flag works for both runtimes (it's just a rollout
restart). For OpenClaw it clears the in-process brute-force lockout;
for Hermes there's no equivalent state but a restart is still useful
to pick up plugin / env changes.

Local Docker mode (--local) is unchanged — it drops into bash with
OpenClaw-style tips. `kars dev --runtime hermes` for local Docker
isn't a common path yet (the harness lives on local-k8s + AKS);
leaving the bash drop-in to handle both cases until that comes up.

## Tests

789 CLI tests pass (vitest, no new tests added — interactive shell
path is exercised by integration runs, not unit tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Restores the 'press Enter on a sandbox row → drop into the agent
TUI' UX the operator had for local OpenClaw, but for Hermes on AKS.
OpenClaw on AKS still uses the port-forward + WebUI URL path because
the exec-ban VAP blocks exec into the openclaw container.

## What changed

cli/src/commands/operator/dialogs/connect.ts splits the Enter
handler by (location × runtime kind):

  - AKS + OpenClaw → existing port-forward path (VAP-bound)
  - AKS + Hermes   → PTY exec into 'agent' container (NEW)
  - local Docker + OpenClaw → 'openclaw tui' PTY
  - local Docker + Hermes   → 'hermes chat --accept-hooks' PTY (NEW)

The two PTY paths share a common _spawnPtyConnect() helper extracted
from the old inline body; the OpenClaw port-forward path is now
_aksOpenClawConnect(). Both are pure refactors — the byte-identical
PTY plumbing (blessed save/restore, raw-mode stdin, Ctrl-\ detach)
moved into the helper, no functional change for OpenClaw.

## Why this works for Hermes but not OpenClaw on AKS

deploy/helm/kars/templates/admission-pod-exec-ban.yaml has
matchConditions:
  expression: object.container == '' || object.container == 'openclaw'

The VAP fires ONLY when the target container is literally named
'openclaw' (or unspecified — which defaults to the first container,
which is 'openclaw' in OpenClaw pods). Hermes' container is named
'agent' (controller/src/reconciler/mod.rs:1801 picks the name from
the is_openclaw branch), so 'kubectl exec -c agent ...' bypasses the
VAP cleanly.

This was a deliberate VAP design: the policy targets the literal
openclaw runtime container, not 'any agent container'. Hermes (and
future runtimes whose container is named 'agent') benefit by design.

## HOME / HERMES_HOME env vars

Set explicitly on the exec because kubectl exec does NOT inherit
container ENV. Without them, Hermes' ensure_hermes_home() falls back
to $HOME/.hermes; since HOME defaults to '/' in kubectl exec
sessions, Hermes tries mkdir '/.hermes' on the read-only rootfs and
ENOENTs. /sandbox is the writable emptyDir the entrypoint daemon
uses for the long-running hermes gateway.

## Tests

- 789 CLI vitest tests pass (no new tests — interactive PTY path is
  exercised by live operator runs, not unit tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
…-runtime interop)

Closes the last gap blocking Hermes ↔ OpenClaw mesh communication.
Until this change, the Python kars-agt-mesh library and the TypeScript
@microsoft/agent-governance-sdk produced INCOMPATIBLE relay frames —
Python-Python and TS-TS interop worked fine, but a Python sender
talking to a TS receiver (or vice versa) silently dropped messages.

## Wire-format divergences fixed

### 1. message frame: structured header, std base64

**Before (Python only):**
  {
    'v': 1, 'type': 'message',
    'ciphertext': '<urlsafe-base64 of (struct.pack(>I, header_len) + header + ct)>'
  }

**After (matches TS mesh-client.js::send):**
  {
    'v': 1, 'type': 'message', 'from': ..., 'to': ..., 'id': ..., 'ts': ...,
    'header': {
      'dh': '<std-base64 dhPublicKey>',
      'pn': <previous_chain_length>,
      'n':  <message_number>
    },
    'ciphertext': '<std-base64 ciphertext>'
  }

The TS receiver reads frame.header.dh / frame.ciphertext as separate
fields; the old Python shape had no .header, so TS-side .base64ToUint8
got an unexpected packed blob and decrypt errored out (silently
dropped at the SDK boundary).
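
A sketch of building the TS-compatible message frame. The field names follow the shape shown above; the `id` and `ts` formats (hex UUID, millisecond epoch) are assumptions, and `build_message_frame` is a hypothetical helper rather than the function in client.py.

```python
import base64
import time
import uuid
from typing import Optional


def b64std(data: bytes) -> str:
    # Standard alphabet ('+' and '/'), with padding.
    return base64.b64encode(data).decode("ascii")


def build_message_frame(
    sender: str,
    recipient: str,
    dh_public_key: bytes,
    previous_chain_length: int,
    message_number: int,
    ciphertext: bytes,
    msg_id: Optional[str] = None,
    ts: Optional[int] = None,
) -> dict:
    return {
        "v": 1,
        "type": "message",
        "from": sender,
        "to": recipient,
        "id": msg_id or uuid.uuid4().hex,                         # assumed format
        "ts": ts if ts is not None else int(time.time() * 1000),  # assumed unit
        # Ratchet header as separate fields, not packed into the ciphertext.
        "header": {
            "dh": b64std(dh_public_key),
            "pn": previous_chain_length,
            "n": message_number,
        },
        "ciphertext": b64std(ciphertext),
    }
```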

### 2. establishment: short TS-style keys

**Before:**  {initiator_identity_key: ..., ephemeral_public_key: ..., used_one_time_key_id: ...}
**After:**   {ik: ..., ek: ..., otk: ...}   (matches mesh-client.js::serializeEstablishment)

### 3. KNOCK + first message: TWO frames, not one fused

**Before:** Python fused KNOCK + first ciphertext into a single
  'type=knock' frame for one-RTT latency. TS receivers do NOT consume
  a 'ciphertext' field on a KNOCK — they only read 'establishment',
  call acceptSession, then await a separate 'type=message' frame.
  → first ciphertext was lost on Python-to-TS sends.

**After:** Python sends two distinct frames: 'type=knock' (no ciphertext,
  just establishment) followed immediately by 'type=message'. Matches
  TS mesh-client.js::establishSession + send.

### 4. std-base64 (not urlsafe) on the wire

JS's btoa / Node's Buffer.toString('base64') produce std-base64 with
'+' and '/'. Python's base64.urlsafe_b64encode produces '-' and '_'.
A TS receiver's atob fails on '-'/'_'; a Python receiver's
base64.b64decode also fails on '-'/'_', depending on input: by default
it discards characters outside the standard alphabet, so the result
is either corrupted bytes or a padding error. Now all on-the-wire
byte strings use std-base64.

## Backwards compat

Receiver tolerates both shapes for one release cycle:

- _message_frame_to_encrypted accepts BOTH the TS shape and the legacy
  packed-ciphertext shape (fallback path)
- _wire_to_establishment accepts BOTH {ik,ek,otk} and the legacy
  {initiator_identity_key, ephemeral_public_key, used_one_time_key_id}
- _b64std_decode tolerates urlsafe alphabet on input
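
The tolerant receive path might look roughly like this. Both helpers are hypothetical re-creations of the behaviour listed above, not the library's `_b64std_decode` or `_wire_to_establishment`. Re-adding missing padding is an extra leniency assumed here.

```python
import base64

LEGACY_TO_SHORT = {
    "initiator_identity_key": "ik",
    "ephemeral_public_key": "ek",
    "used_one_time_key_id": "otk",
}


def b64_decode_tolerant(text: str) -> bytes:
    # Map the urlsafe alphabet onto the standard one, restore any
    # missing padding, then decode strictly.
    normalized = text.replace("-", "+").replace("_", "/")
    normalized += "=" * (-len(normalized) % 4)
    return base64.b64decode(normalized, validate=True)


def normalize_establishment(wire: dict) -> dict:
    # Rename legacy long keys to the short form; short keys pass through.
    return {LEGACY_TO_SHORT.get(key, key): value for key, value in wire.items()}
```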

A fleet mid-upgrade between old/new pods won't drop in-flight messages.

## Live verification

Sent {b'WIRE_TEST_DIRECT', 16 bytes} parent → analyst via direct
asyncio script with PYTHONPATH pointing at hot-patched client.py.

Parent stderr:
  > TEXT '{"v": 1, "type": "knock", "from": "did:mesh:a61...", "establishment": {"ik":..., "ek":..., "otk": 20}}'
  > TEXT '{"v": 1, "type": "message", ..., "header": {"dh":..., "pn":0, "n":0}, "ciphertext": "..."}'

Analyst auto_responder.log:
  Accepted KNOCK from did:mesh:a61c9cbf...
  AUTO_GOT from=did:mesh:a61c9cbf... bytes=16
  AUTO_REPLIED bytes=16 to=did:mesh:a61c9cbf...

The 16-byte payload decrypted correctly with the TS-compatible shape.

## Tests

- 8 new wire-format unit tests pin every field-shape contract
- 9 existing kars-agt-mesh unit tests still pass

## Cross-runtime promise

With this commit, a Hermes agent CAN mesh-send to an OpenClaw agent
and vice versa (same relay, same registry, same crypto, now same
wire envelope). End-to-end interop verification on a mixed-runtime
cluster ships as a follow-up — the wire alignment is the prerequisite.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Comment thread runtimes/agt-mesh-python/tests/test_wire_format.py Fixed
Pal Lakatos-Toth and others added 24 commits June 8, 2026 09:24
Fresh-machine `kars up exec-hermes` (or `kars dev --target local-k8s`)
got 7/13 steps in before failing with:

  Deploying agentmesh-agt (relay + registry) into kind…
  local-k8s dev failed: AGT Dockerfile not found at
  /Users/<user>/agent-governance-toolkit/agent-governance-python/
  agent-mesh/docker/Dockerfile
    Clone it:  git clone …

Root cause: `cli/src/lib/agt-bootstrap.ts::ensureAgtRepo()` exists
and auto-clones the pinned AGT fork into ~/agent-governance-toolkit
(honors $KARS_AGT_REPO + --agt-repo), and is wired in BOTH
`cli/src/commands/up.ts` (line ~617) and `cli/src/commands/dev.ts`
(line ~652). But `cli/src/commands/dev/local-k8s.ts::runLocalK8s()`
— which is what `kars up` falls through to when there's no AKS
context AND what `kars dev --target local-k8s` invokes directly —
never called ensureAgtRepo. So a fresh-clone user blew up at step 7.

Fix:
  1. Import `ensureAgtRepo` + `ensureAgtWheels` from
     `../../lib/agt-bootstrap.js`.
  2. After credential loading and before kind-cluster bringup, when
     mesh is enabled (default) AND no external `globalRegistry` is
     supplied, call ensureAgtRepo(opts.agtRepo, repoRoot) → mutate
     opts.agtRepo with the resolved path so the downstream
     rebuildDevImages() and deployAgentMesh() see a valid checkout
     even when the user didn't pass --agt-repo or set
     $KARS_AGT_REPO. Same call ALSO triggers ensureAgtWheels() so
     runtimes/wheels/ is populated for the Hermes / Anthropic /
     Pydantic AI / etc. Python sandbox image builds (the wheel
     directory is .gitignored and the Dockerfiles COPY from it).
  3. Bump stepper totalSteps 13 → 14 to account for the new step.
  4. Fail-fast error message points at three escape hatches
     (--agt-repo / $KARS_AGT_REPO / --no-mesh) for environments
     where auto-clone can't reach github.com (offline CI, etc.).

Mirrors the same call pattern as up.ts:617 and dev.ts:652 — these
three are now the canonical AGT-bootstrap entry points; keep them
in sync or fresh-machine OOTB breaks again on whichever one drifts.

Verified:
  • npm run build      → clean
  • npm run typecheck  → clean
  • npm run lint       → no new warnings
  • vitest             → 789 tests passing (39 files)
  • ci/check-loc.sh    → clean (file not budgeted; only added ~45 LOC
                          and the file ~2754 LOC is well under the
                          800-LOC new-file cap that would apply if
                          a future budget entry is added)
Fresh-machine `kars up exec-hermes` (or `kars dev --target local-k8s`)
on a Mac M-series died at:

  [stage-2 7/21] RUN ... curl -fsSL "https://github.com/cli/cli/
  releases/download/v2.89.0/gh_2.89.0_linux_arm64.tar.gz" ...
  curl: (22) The requested URL returned error: 504

Docker Desktop's networking VM on macOS is notoriously flaky with
github.com (and any external GET); a single 5xx blew up an
already-9-minutes-in image build, leaving the user with a partial
kind cluster and no easy way to resume.

Hardens all 6 external curls in sandbox-images/openclaw/Dockerfile.base
with identical retry policy:

  --retry 5                — up to 5 retries before failing
  --retry-delay 3          — fixed 3-second delay between retries
                             (replaces curl's default exponential backoff)
  --retry-all-errors       — retry on HTTP 4xx/5xx too (not just
                             network errors); covers the 504 case
  --connect-timeout 15     — fail-fast on dead routes so retries
                             don't all hang for 5 minutes each

Endpoints affected (all are versioned release artifacts that never
change, so retries are safe):
  - nodejs.org/dist/v22.22.3/...     (lines 19, 193; two stages)
  - github.com/cli/cli/...           (line 223)
  - github.com/BurntSushi/ripgrep/...  (line 229)
  - cache.agilebits.com/dist/1P/...  (line 237)
  - github.com/pimalaya/himalaya/... (line 243)

Total worst-case extra time on a healthy network: ~0s (first try
succeeds). On a flaky network: ~45s per retried download instead
of a hard failure that wastes the whole build context.
…refresh_in

User on fresh `kars dev exec-claw` saw the WebUI come up fine,
then ~30 min later every chat-completions call started returning:

  WARN inference-router::proxy:
    sandbox=palkarstop-... status=401
    body="IDE token expired: unauthorized: token expired"

Root cause: GitHub's /copilot_internal/v2/token response returns
both `expires_at` (Copilot's hard expiry) and `refresh_in` (a
softer hint, typically ~1500s). The old cache only tracked the
refresh hint, computing:

    refresh_at = now + refresh_in

and serving the cached JWT for as long as `refresh_at > Instant::now()`.

When GitHub returns `refresh_in > (expires_at - now)` — which
happens during token rotation windows and in some account-state
edge cases — the cache kept handing out a JWT whose `expires_at`
had already passed. Copilot then rejected every request with
the verbatim 401 body above until the (longer) refresh window
finally elapsed.

Fix:
  • CachedJwt now stores BOTH refresh_at AND expires_at as
    Instants. The fast-path serves only when:
        refresh_at > now  AND  expires_at > now + REFRESH_BUFFER
  • get_jwt_with_base() forces a re-exchange whenever either
    boundary has been crossed, so a stale JWT is never served
    past Copilot's hard expiry minus a 60s safety buffer.
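
The fast-path condition can be sketched as below. The router's cache is Rust and uses `Instant`; here time is a plain float passed in by the caller, and `cache_entry` is a hypothetical constructor. The 60s buffer matches the description above.

```python
from dataclasses import dataclass

REFRESH_BUFFER_S = 60.0


@dataclass
class CachedJwt:
    token: str
    refresh_at: float  # soft boundary, derived from refresh_in
    expires_at: float  # hard boundary, derived from the token's expiry

    def is_fresh(self, now: float) -> bool:
        # Serve from cache only while BOTH boundaries are still ahead,
        # keeping a safety buffer before the hard expiry.
        return self.refresh_at > now and self.expires_at > now + REFRESH_BUFFER_S


def cache_entry(token: str, refresh_in: float, expires_in: float, now: float) -> CachedJwt:
    return CachedJwt(token, refresh_at=now + refresh_in, expires_at=now + expires_in)
```

With `refresh_in=1500` and `expires_in=600`, the old refresh-only check would keep serving the token for 1500s. This check stops at 540s, which is the behaviour the regression test pins.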

Added regression test
`refreshes_when_expires_at_passes_even_if_refresh_in_is_longer`
that mints a cache entry with expires_in=30s + refresh_in=1500s,
synthetically ages it past expires_at, and asserts the next call
triggers a fresh exchange (the second mock returns "second-token";
the test fails if the cache still serves "first-token").

  cargo test -p kars-inference-router --lib copilot_auth
    test copilot_auth::tests::errors_when_no_token_configured ... ok
    test copilot_auth::tests::surfaces_upstream_errors ... ok
    test copilot_auth::tests::exchanges_and_caches_token ... ok
    test copilot_auth::tests::refreshes_when_expires_at_passes_even_if_refresh_in_is_longer ... ok
    test result: ok. 4 passed; 0 failed
…" warn

A user on a fresh `kars dev` with `claude-opus-4.7` (the Copilot picker
default) ran into the verbatim Copilot rate-limit body:

  status: 503
  body:   "Sorry, the upstream model provider is currently experiencing
           high demand. Please try another model."

The router already retries 5xx + 429 against InferencePolicy.spec.
modelPreference.fallback[] (inference-router/src/failover.rs
::is_failover_trigger), but the auto-generated InferencePolicy from
`kars dev` emitted fallback_count=0, so the failover walk had nothing
to walk and the throttle response surfaced directly to the WebUI.

Fix has two parts:

1. CLI side — cli/src/github-copilot.ts::buildCopilotFallbackChain
   Picks a same-Copilot, cross-family chain so at least one model
   almost always has quota. Static ordering for debuggability:
     [gpt-5, claude-sonnet-4.5, gemini-2.5-pro, gpt-5-mini,
      claude-haiku-4.5, gpt-4.1]
   minus the picked one (which stays primary; we never reorder
   behind the user's back).

   cli/src/commands/dev/local-k8s.ts::autoCreateSandbox now appends
   the chain to the InferencePolicy YAML whenever
   creds.provider === "github-copilot" — Foundry / GH-Models paths
   don't get an auto-chain because they're single-deployment by
   definition.

   New tests in cli/src/github-copilot.test.ts (6 cases) gate:
   - picked model always first
   - picked model never appears in fallbacks (dedup)
   - non-empty chain for the recommended default
   - byte-identical chain between invocations (debuggability)
   - every emitted id exists in COPILOT_MODELS (no typos)
   - both Anthropic AND OpenAI entries regardless of pick

2. Router side — inference-router/src/inference_policy_loader.rs
   The startup "InferencePolicy loaded" line now includes the full
   `fallback_chain` (not just `fallback_count`) so ops can correlate
   a 503-then-200 sequence with the configured order.

   ALSO emits a one-shot WARN at load time when fallback is empty:

     WARN InferencePolicy has no fallback chain — 5xx/429 on the
          primary deployment will surface directly to the agent
          (no router-side failover). Add spec.modelPreference.
          fallback[] in the InferencePolicy CR.

   Surfaces the gap loudly in the router log so operators don't
   have to dig for "fallback_count":0 in a JSON line and realize
   what it means — especially important for hand-rolled
   InferencePolicy CRs (the auto-generated ones now always
   include a chain on the Copilot path, but operator-authored
   ones might not).
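
The CLI-side chain from part 1 can be sketched roughly as below. This is a minimal illustration, not the code in cli/src/github-copilot.ts: the function name `buildFallbackChainSketch`, the `ModelPreference` shape, and the `primary` field name are assumptions; only the static ordering and the "drop the picked model" rule come from the description above.

```typescript
// Static ordering quoted from the commit message; kept as a literal so the
// emitted chain is identical between invocations.
const STATIC_FALLBACK_ORDER_SKETCH: readonly string[] = [
  "gpt-5",
  "claude-sonnet-4.5",
  "gemini-2.5-pro",
  "gpt-5-mini",
  "claude-haiku-4.5",
  "gpt-4.1",
];

// Hypothetical shape; the real InferencePolicy schema may differ.
interface ModelPreference {
  primary: string;
  fallback: string[];
}

function buildFallbackChainSketch(picked: string): ModelPreference {
  // The picked model stays primary and is removed from the fallbacks,
  // so it never appears twice and is never reordered.
  const fallback = STATIC_FALLBACK_ORDER_SKETCH.filter((id) => id !== picked);
  return { primary: picked, fallback };
}
```

Keeping the order static rather than, say, shuffling by observed quota trades a little adaptiveness for a chain that reads the same in every log line.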

Verified:
  • cli: npm run build + typecheck clean; vitest 795 tests pass
  • cli: 6 new github-copilot.test.ts cases all green
  • router: cargo build/clippy clean
  • router: 16 inference_policy_loader tests still pass
  • cargo fmt clean; ci/check-loc.sh clean

A user on a fresh `kars dev` saw the openclaw agent eventually report:
  "Egress proxy is still timing out — I'll proceed with what I know
   and flag anything that needs a fresh check."
with NO corresponding error line on the router side at the default
log level, because the router never returned to the client to log
anything.

Root cause: inference-router/src/forward_proxy.rs::handle_connect,
::handle_http, and ::handle_tls_redirect each called

  let upstream = match TcpStream::connect(&resolved).await { ... }

unguarded. On Linux, TcpStream::connect blocks for the full kernel
SYN-retransmit window (~60-180s depending on net.ipv4.tcp_syn_retries)
when the destination IP silently drops the SYN — common when:
  • the IP is behind a flaky CDN
  • the destination is unreachable from kind on Mac Docker Desktop's
    network namespace
  • the iptables-redirected TLS target's IP roams during the request
While that connect blocks, the sandbox agent has nothing to show
the user except "still timing out", and the only router log was at
`debug!` level which was filtered out by default.

Fix:
  • New `UPSTREAM_CONNECT_TIMEOUT = 10s` constant + a
    `connect_with_timeout(addr)` helper that wraps `TcpStream::connect`
    in `tokio::time::timeout()`. Returns an `io::Error` of kind
    `TimedOut` with a clear message ("connect timeout after 10s") when
    the SYN handshake hangs.
  • All three call sites swapped to the helper. The visible-failure
    contract is the same (`502 Bad Gateway` to the client) but now
    bounded at 10s instead of 60-180s.
  • Bumped the upstream-failed log lines from `debug!` to `warn!` and
    enriched them with the domain/dest fields so the failure is
    visible in the default log filter (was effectively invisible
    before).

Regression tests added in inference-router/src/forward_proxy.rs:
  • `connect_with_timeout_aborts_on_blackhole_ip`: asserts that a connect
    to a TEST-NET-2 (RFC 5737) address fails within 15s (the 10s timeout
    plus slack for CI jitter). Pre-fix this would take the full kernel
    SYN-retransmit window.
  • `connect_with_timeout_succeeds_on_local_listener`: sanity check
    that the wrapper doesn't break the happy path.

  $ cargo test -p kars-inference-router --lib forward_proxy
    test forward_proxy::tests::connect_with_timeout_succeeds_on_local_listener ... ok
    test forward_proxy::tests::connect_with_timeout_aborts_on_blackhole_ip ... ok
    test result: ok. 2 passed; 0 failed; finished in 10.00s
…ime only)

RustSec published advisory RUSTSEC-2026-0173 on 2026-06-07 (one day
before this commit) flagging `proc-macro-error2 2.0.1` as
unmaintained. cargo-audit + cargo-deny CI gates went red on every
push because the advisory feed pulls fresh on every run.

Dependency chain (controller-only, build-time):
  oci-client 0.16.1 + 0.15.0 → oci-spec → getset 0.1.6 (proc-macro)
    → proc-macro-error2 2.0.1

Same rationale as the existing RUSTSEC-2024-0370 ignore for the
sibling crate `proc-macro-error` 1.x — this is a proc-macro crate
that runs at compile time inside rustc, NOT at runtime in the
controller binary. There is no runtime attack surface from an
unmaintained proc-macro crate beyond the build toolchain itself.

No safe upgrade path exists yet: `getset` 0.1.6 is the latest
release and pins `proc-macro-error2`. Upstream `oci-client` would
need to upgrade past `getset 0.1.6` (which would itself need to
swap proc-macro-error2 for proc-macro-error3 or inline its own
diagnostics). A TODO comment next to both ignore entries says to drop
the ignore when that lands.

Added the ignore to:
  • .cargo/audit.toml — picked up by `cargo audit` (the Rust
    Dependency Audit CI job).
  • deny.toml         — picked up by `cargo deny check advisories`
    (the Rust Supply-Chain Gate CI job).

Verified locally:
  cargo audit          ⇒ clean (no warnings)
  cargo deny check     ⇒ advisories ok

Three product gaps the user reported in the operator's `n` spawn dialog:

1. **Hermes missing from the runtime picker.** The operator dialog
   hardcoded its own array of 7 runtimes (cli/src/commands/operator/
   dialogs/spawn.ts ::runtimeOpts), which drifted from the actual
   WIRED_KINDS list in cli/src/runtime.ts as soon as Hermes shipped.
   Result: the docs say "Hermes is supported", the user pushes `n`,
   but Hermes isn't an option.

   Fix: extract a new `wiredRuntimeFlags()` helper in runtime.ts that
   inverts FLAG_TO_KIND and walks WIRED_KINDS in order, then have the
   spawn dialog call it directly. Now WIRED_KINDS is the single source
   of truth — any new wired runtime shows up in the picker
   automatically with zero per-runtime edits.

2. **Channels listed as "OpenClaw only" everywhere — but Hermes
   supports them too.** sandbox-images/hermes/entrypoint.sh:266+
   already translates TELEGRAM_BOT_TOKEN / SLACK_BOT_TOKEN /
   DISCORD_BOT_TOKEN into `hermes config set channels.*.token`, so
   the gating was just wrong copy.

   Fixes:
   - spawn.ts: introduce `channelCapableRuntimes = new Set(["openclaw",
     "hermes"])` and use it everywhere the dialog used to check
     `state.runtime === "openclaw"`. The channel/token/allowfrom
     fields now light up for Hermes too, and switching to a
     non-channel runtime clears the selection as before.
   - add.ts: rewrite the `--channels`/`--telegram-*`/`--slack-*`/
     `--discord-*` help text from "[OpenClaw only]" to
     "[OpenClaw + Hermes]". Skills + API keys stay "[OpenClaw only]"
     because those wire via OpenClaw's plugin.allow list (not Hermes).
     Updated the "Flag groups" --help-text block to match.

3. **Copilot models with Hermes — verified, no code change needed.**
   sandbox-images/hermes/entrypoint.sh:83 pins OPENAI_BASE_URL at the
   in-pod router, and lines 99-111 already case on KARS_PROVIDER to
   set HERMES_DEFAULT_PROVIDER=openai when the operator picked
   github-copilot in `kars dev`. The router-side fixes from this PR
   (Copilot IDE-JWT cache w/ expires_at — 6886415; Copilot fallback
   chain on 503 — 6dca0f8) apply to Hermes through the same router.

Contract test in runtime.test.ts pins this down so it can't drift
again:
  • wiredRuntimeFlags() returns kebab flags that all round-trip
    through flagToKind() to wired RuntimeKinds.
  • The returned set includes every known wired runtime (Hermes in
    particular).
  • Two calls return byte-identical arrays (deterministic ordering
    for left/right picker cursor UX).
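
The "invert FLAG_TO_KIND, then walk WIRED_KINDS in order" idea can be sketched as follows. The map contents, the `RuntimeKindSketch` values, and the `not-wired-yet` flag are made-up placeholders; only the names FLAG_TO_KIND and WIRED_KINDS and the overall approach come from the text above, and the real helper in runtime.ts may differ.

```typescript
// Hypothetical subset of runtime kinds, for illustration only.
type RuntimeKindSketch = "OpenClaw" | "Hermes" | "NotWiredYet";

// Stand-in for FLAG_TO_KIND (kebab-case CLI flag -> runtime kind).
const FLAG_TO_KIND_SKETCH = new Map<string, RuntimeKindSketch>([
  ["openclaw", "OpenClaw"],
  ["hermes", "Hermes"],
  ["not-wired-yet", "NotWiredYet"],
]);

// Stand-in for WIRED_KINDS; its order drives the picker order.
const WIRED_KINDS_SKETCH: readonly RuntimeKindSketch[] = ["OpenClaw", "Hermes"];

function wiredRuntimeFlagsSketch(): string[] {
  // Invert flag -> kind into kind -> flag (first flag wins on duplicates).
  const kindToFlag = new Map<RuntimeKindSketch, string>();
  FLAG_TO_KIND_SKETCH.forEach((kind, flag) => {
    if (!kindToFlag.has(kind)) kindToFlag.set(kind, flag);
  });
  // Walk the wired list in order so the result is deterministic.
  const flags: string[] = [];
  for (const kind of WIRED_KINDS_SKETCH) {
    const flag = kindToFlag.get(kind);
    if (flag !== undefined) flags.push(flag);
  }
  return flags;
}
```

Because the picker derives its options from the wired list, a runtime that is known but not wired (the placeholder above) never shows up, and a newly wired one appears without touching the dialog.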

Also retroactively added Hermes to the existing assertRuntimeWired
"accepts every wired runtime" test — was missing.

Verified:
  npm run build      ⇒ clean
  npm run typecheck  ⇒ clean
  vitest run         ⇒ 798 passed (was 795 → +3 new contract tests)

The user reported "spawn fail on local k8s" after picking Hermes in the
operator's `n` spawn dialog. Root cause: cli/src/commands/dev/local-k8s.ts
::runLocalK8s only loaded the 3 helm-chart-pinned images
(kars-sandbox, kars-controller, kars-inference-router) into kind.
The operator-spawned Hermes pod resolves to
`karsacr.azurecr.io/kars-runtime-hermes:latest` (per controller/src/
reconciler/runtime.rs::DEFAULT_HERMES_IMAGE) — which doesn't exist
in kind, so the pod ImagePullBackOffs (kindnet can't reach ACR
without auth).

Fix: extend the image-load loop to ALSO attempt
`karsacr.azurecr.io/kars-runtime-hermes:latest` with aliases
[`kars-runtime-hermes:latest`, `kars-runtime-hermes:dev`]. The
existing `loadImageIfPresent` helper handles the missing-locally
case gracefully — when the host hasn't built the runtime image yet,
the function returns `{loaded: false}` without throwing.

Runtime images are separated from core images in the missing-image
warning path:
  • Core 3 missing  → yellow warning (deployment will fail).
  • Runtime missing → dim notice with the exact `docker build`
                      command to fix:
        docker build -t karsacr.azurecr.io/kars-runtime-hermes:latest \
                     -f sandbox-images/hermes/Dockerfile .
    Then re-run `kars dev --target local-k8s --build` to load it.

Only Hermes auto-loads (not Anthropic/LangGraph/MAF/etc.) — those
runtimes stay opt-in to keep `kars dev` startup fast. Hermes is the
only non-OpenClaw runtime productized + verified in this PR, so it's
the one users will hit first.

Verified:
  npm run build      ⇒ clean
  npm run typecheck  ⇒ clean
  vitest run         ⇒ 798 passed

User report: the operator's `n` spawn dialog showed "✓ Spawned <name>" but
no pod ever appeared in the agent table; they had to check Headlamp
to find an ImagePullBackOff. The same pattern would hit CrashLoopBackOff,
ErrImageNeverPull, OOMKilled, etc.

Root cause in cli/src/commands/add.ts: the 120s wait loop polls for
`containerStatuses[*].ready` to include "true"; if the wait times out,
the code unconditionally calls `spinner.succeed(...)`, prints
`(may still be starting)`, and exits 0. The operator's spawn dialog only logs
`{red-fg}✗ Spawn fail{/}` when execa rejects (non-zero exit), so a
stuck pod silently passed as success.

Fix: when the wait loop times out, query containerStatuses one last
time and check for unambiguous failure modes:

  • ImagePullBackOff / ErrImageNeverPull / ErrImagePull /
    InvalidImageName / CreateContainerConfigError (image side)
  • CrashLoopBackOff with restarts >= 2 (runtime side)
  • Last terminated state of OOMKilled / Error

If found:
  • spinner.fail() with the container name + reason
  • Print the kubectl describe/logs commands that surface the full
    cause
  • For image-pull failures specifically, print the docker-build +
    kind-load commands (local-k8s case is by far the most common
    cause — runtime images that aren't loaded into the kind cluster)
  • process.exit(1) so the operator's spawn dialog sees the failure
    and logs `✗ Spawn fail: <reason>` in its activity log

If NO unambiguous failure mode found (e.g. genuinely still pulling),
keep the original informational success so existing scripts aren't
broken.
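
A rough sketch of the post-timeout check described above. The status field names (`state.waiting.reason`, `lastState.terminated.reason`, `restartCount`) follow the Kubernetes ContainerStatus API; the function name, the `FailureSketch` shape, and the exact rule order are assumptions rather than the code in add.ts.

```typescript
// Minimal slice of a Kubernetes ContainerStatus, enough for the check.
interface ContainerStatusSketch {
  name: string;
  restartCount: number;
  state?: { waiting?: { reason?: string } };
  lastState?: { terminated?: { reason?: string } };
}

interface FailureSketch {
  container: string;
  reason: string;
  imageSide: boolean; // true when a docker-build / kind-load hint applies
}

const IMAGE_SIDE_REASONS = new Set<string>([
  "ImagePullBackOff",
  "ErrImageNeverPull",
  "ErrImagePull",
  "InvalidImageName",
  "CreateContainerConfigError",
]);

function findUnambiguousFailureSketch(
  statuses: ContainerStatusSketch[],
): FailureSketch | null {
  for (const s of statuses) {
    const waiting = s.state?.waiting?.reason;
    if (waiting !== undefined && IMAGE_SIDE_REASONS.has(waiting)) {
      return { container: s.name, reason: waiting, imageSide: true };
    }
    if (waiting === "CrashLoopBackOff" && s.restartCount >= 2) {
      return { container: s.name, reason: waiting, imageSide: false };
    }
    const last = s.lastState?.terminated?.reason;
    if (last === "OOMKilled" || last === "Error") {
      return { container: s.name, reason: last, imageSide: false };
    }
  }
  // Nothing conclusive (e.g. still pulling): caller keeps the soft success.
  return null;
}
```

A caller would fail the spinner and exit non-zero only when this returns a value, which preserves the old behaviour for pods that are merely slow.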

This makes the operator UX honest: spawning a Hermes pod when the
runtime image isn't loaded into kind now shows an actual error
pointing at the build command, not a green checkmark that the user
has to debug via Headlamp.

Verified:
  npm run build + typecheck ⇒ clean
  vitest run                ⇒ 798 passed

User report:
  > operator says "✓ Spawned" then nothing visible
  > kubectl get karssandbox -A confirms the CR was never created

Two compounding silent-failure bugs:

1. kars add was log-then-exit-0 on caught errors.
   The outer catch at cli/src/commands/add.ts line 601 (was: 531)
   handled every exception by calling spinner.fail() + console.error()
   and then RETURNING — letting Node exit 0 naturally. So
   `kubectl apply -f -` failing (CRD missing, wrong context, schema
   rejection on the bundle, etc.) surfaced as a clean exit code to
   any caller. Operator's `execa("kars", args, { stdio: "pipe" })`
   only logs `✗ Spawn fail` when execa REJECTS, so silent exit-0
   masked every kars-add failure mode behind a green checkmark.

   Fix: add `process.exit(1)` after the error logs. Preserves all
   the existing error-message branching (controller-not-installed
   hint, generic error text) — just stops lying about exit status.

2. Operator's spawn dialog was throwing away the real error text.
   Previously logged only `(e.stderr || e.message)?.substring(0, 200)`
   — execa's `.message` is usually `Command failed with exit code 1:
   kars add ...`, NOT the underlying kars-add stderr. So even after
   fix #1, the operator log would show "✗ Spawn fail: Command failed
   with exit code 1: kars add testhermes --runtime hermes ..." with
   no actual root cause.

   Fix: prefer e.stderr (now populated thanks to fix #1) over
   e.message, strip ANSI colour codes that kars add emits via chalk,
   filter empty lines, keep the last 4 (which is where spinner.fail
   + error hints live), join with " | ", cap at 400 chars. Activity
   log now shows e.g.:

     ✗ Spawn fail: Failed to create sandbox | Error: kubectl error:
     KarsSandbox.kars.azure.com "testhermes" is invalid: spec.hermes:
     Invalid value: ... | Connect: kars connect testhermes

   Also: on SUCCESS, echo the last 3 lines of stdout (the
   "Namespace / Model / Status / Connect" hints kars add prints) so
   the operator sees useful follow-up info inline.
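
The stderr condensing in fix 2 can be sketched as a small pure function. The name `summarizeStderrSketch` and the regular expression are assumptions (the pattern only covers the SGR colour sequences that colour libraries typically emit), and trimming each line before the empty-line filter is an extra step not stated above.

```typescript
// Matches ANSI SGR colour/style sequences such as ESC[31m or ESC[0m.
const ANSI_SGR_SKETCH = /\u001b\[[0-9;]*m/g;

function summarizeStderrSketch(stderr: string): string {
  return stderr
    .replace(ANSI_SGR_SKETCH, "") // strip colour codes
    .split("\n")
    .map((line) => line.trim()) // assumption: whitespace-only lines count as empty
    .filter((line) => line.length > 0)
    .slice(-4) // the last lines carry spinner.fail + hints
    .join(" | ")
    .substring(0, 400); // cap so one failure cannot flood the activity log
}
```

Keeping the tail rather than the head is the important choice here: the root cause is usually printed last.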

Verified:
  npm run build + typecheck ⇒ clean
  vitest run                ⇒ 798 passed

User OOTB story:
  1. Ran `kars dev` from an earlier (pre-Hermes) branch — chart
     installed an older KarsSandbox CRD into kind.
  2. Pulled the hermes branch + `cd cli && npm run build` to update
     the CLI binary.
  3. Did NOT re-run `kars dev` (no apparent need — cluster was up).
  4. `kars operator` → `n` → Hermes → Launch
  5. `kars add` bundle had `spec.runtime.hermes`; cluster CRD didn't
     know that field; kubectl apply rejected with:
        ValidationError(KarsSandbox.spec.runtime):
        unknown field "hermes" in com.azure.kars.v1alpha1...

The error message was technically accurate, but the fix was buried
behind "what does this even mean" digging. The operator activity log
showed the rejection text but didn't tell the user that the fix is a
single CRD refresh.

Fix: cli/src/commands/add.ts catch block now detects the three
patterns kubectl emits on stale-CRD rejection ("unknown field",
"Unsupported value", "ValidationError") and prints the exact
chart-template + server-side-apply incantation:

  This looks like a CRD schema mismatch — the cluster's KarsSandbox
  CRD is older than your local CLI/sources. Refresh the chart:

    helm template kars deploy/helm/kars --namespace kars-system \
        --include-crds | kubectl apply -f - --server-side --force-conflicts

  Or just re-run `kars dev --target local-k8s` — its chart-install
  step always refreshes CRDs to the source-of-truth schema.

Doesn't change the behavior of the CRD path itself (kars dev's
helmInstall already does the right thing — see
cli/src/commands/dev/local-k8s.ts:778); this only makes the
diagnostic explicit when an operator-driven kars-add hits the
known stale-CRD failure mode.
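
The stale-CRD detection amounts to a substring check on the kubectl error text. A minimal sketch, assuming a hypothetical helper name `looksLikeStaleCrdSketch`; the three patterns are the ones listed above, and the real catch block may match them differently.

```typescript
// The three substrings named in the commit message.
const STALE_CRD_PATTERNS_SKETCH: readonly string[] = [
  "unknown field",
  "Unsupported value",
  "ValidationError",
];

function looksLikeStaleCrdSketch(errorText: string): boolean {
  // Plain substring search; no regex needed for fixed strings.
  return STALE_CRD_PATTERNS_SKETCH.some((p) => errorText.indexOf(p) !== -1);
}
```

The check is deliberately broad: a false positive only costs an extra hint line, while a miss leaves the user with the raw schema error.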

Verified:
  npm run build + typecheck ⇒ clean
  vitest run                ⇒ 798 passed

User's OOTB experience on local-k8s after spawning a Hermes sandbox from the operator:

  Failed to pull image "karsacr.azurecr.io/kars-runtime-hermes:latest":
  dial tcp: lookup karsacr.azurecr.io on 192.168.65.254:53: no such host

kind nodes don't have ACR pull creds (and shouldn't — the user isn't
on an Azure cluster). Last commit (ea59a9f) added auto-LOAD of the
image into kind via loadImageIfPresent, but loading requires the
image to exist on the host first — and the host doesn't have it
unless the user knows the docker-build command.

Fix: extend rebuildDevImages in cli/src/commands/dev/local-k8s.ts
with a new "runtime-hermes" build spec. It builds
karsacr.azurecr.io/kars-runtime-hermes:latest from
sandbox-images/hermes/Dockerfile against the repo root context.
The target tag matches DEFAULT_HERMES_IMAGE in the controller, so:

  1. The auto-load step in runLocalK8s now finds the image and
     `kind load`s it into the cluster.
  2. When the operator spawns a Hermes sandbox, the controller's
     image string resolves to the already-loaded image →
     ImagePullPolicy IfNotPresent + image-present → no pull
     attempt, no DNS-to-ACR-from-kind failure.

Cost:
  • First build: 3-5 min (Python pip install + Hermes pip install +
    ripgrep/op binaries). The Dockerfile COPYs runtimes/wheels/
    which `ensureAgtWheels` already populated at the top of
    runLocalK8s, so no wheel-build blocking.
  • Subsequent runs: docker layer cache hit, < 10 sec. Honors the
    same `forceAll` flag as the other dev specs.

Skips silently if sandbox-images/hermes/Dockerfile is missing
(e.g. older checkout) so it can't break the openclaw-only path.

Hermes is the only non-OpenClaw runtime auto-built; other runtimes
(Anthropic, LangGraph, MAF, Pydantic AI, OpenAI Agents) stay opt-in
to keep `kars dev` startup cost bounded. Hermes is the productized
runtime in this PR — auto-building it is what makes operator-`n`
→ Hermes → Launch JUST WORK out of the box.

Verified:
  npm run build + typecheck ⇒ clean
  vitest run                ⇒ 798 passed
…m template`

Found while debugging user's persistent ErrImagePull after the
manual CRD refresh worked:

  helm template kars deploy/helm/kars --include-crds | kubectl apply ...

WITHOUT `-f deploy/helm/kars/values-local-dev.yaml` re-renders the
controller Deployment from default values.yaml — which doesn't
include `KARS_DEV_PROFILE=true`. The controller then defaults
imagePullPolicy to Always for `:latest` images, so every sandbox
pod tries to pull from ACR even when the image is loaded in kind.
DNS resolution to karsacr.azurecr.io fails → ErrImagePull forever.

This is exactly the failure mode that this PR's previous commit
(93938f5) sent the user into: the kars-add error hint's PRIMARY fix
was a bare `helm template`, which is what triggered the side effect.

Reorder the hint:
  1. Primary: `kars dev --target local-k8s` (correct overlay,
     no risk of overwriting controller env).
  2. Fallback (only if you must apply CRD by hand): include
     `-f deploy/helm/kars/values-local-dev.yaml` so the controller
     keeps its dev semantics.

No code path changed — just the user-facing diagnostic string.

The user was stuck on a persistent ErrImagePull for kars-runtime-hermes
even after the image was loaded into kind (confirmed with
`docker exec kars-dev-control-plane crictl images | grep hermes`).
The kubelet was still attempting a network pull because the controller
emitted `imagePullPolicy: Always` for the `:latest` tag.

Root cause: `KARS_DEV_PROFILE=true` was set ONLY by `kars dev`'s
dynamic per-run overlay (cli/src/commands/dev/local-k8s.ts:944),
NOT by the static `values-local-dev.yaml` overlay. Result: any
out-of-band chart apply that just used `-f values-local-dev.yaml`
(e.g. the CRD-refresh workflow we recommend when source CRDs drift
forward) silently dropped the env var → controller's pull-policy
helper at controller/src/reconciler/mod.rs:1291 fell into the
`Always` branch for `:latest` images → every sandbox pod tried to
pull from ACR even when the image was kind-loaded → ErrImagePull
forever on machines without ACR network reachability.

Fix: pin `KARS_DEV_PROFILE: "true"` in the static overlay
alongside `LEADER_ELECTION_ENABLED: "false"`. Now anyone applying
`values-local-dev.yaml` gets full dev semantics (IfNotPresent
pull policy, dev-mode relaxations, etc.) without depending on
the CLI's dynamic overlay being layered on top.

The CLI's dynamic overlay still re-emits the same key idempotently
(see cli/src/commands/dev/local-k8s.ts ::provisionDevCreds line
944); helm/kubectl deduplicate by env-var `name`, so emitting the key
twice is harmless.

After this commit, the user can run any of:
  • `kars dev --target local-k8s` (always worked)
  • `helm template kars deploy/helm/kars -f deploy/helm/kars/values-local-dev.yaml --include-crds | kubectl apply -f - --server-side --force-conflicts`
  • `helm upgrade kars deploy/helm/kars -f deploy/helm/kars/values-local-dev.yaml`
And all three yield a working dev controller.
… apply`

Following on from 99deca3 / 2c0c912 and a user-driven debug session
that uncovered why the "helm template -f values-local-dev.yaml |
kubectl apply" workaround keeps half-bricking local-k8s clusters:

The static `values-local-dev.yaml` overlay does NOT contain the
inference creds (AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT,
KARS_PROVIDER, COPILOT_GITHUB_TOKEN). Those are emitted by the
per-run DYNAMIC overlay built in
`cli/src/commands/dev/local-k8s.ts::provisionDevCreds`. So any
external `helm template ... -f values-local-dev.yaml | kubectl apply`
overwrites the controller Deployment's pod-spec env with ONLY the
static-overlay values — silently nuking the dynamic creds. Result:
every subsequent reconcile fails with:

  ERROR "No inference endpoint configured"

and Deployments never get written. Pods never get created. The
user then sees the operator's spawn dialog "succeed": kars add
times out on the wait loop and exits 0 from the previous-spinner
branch (it now exits 1 as of f39d425). In either case the CR is
reconciled into nothing because the controller can't proceed.

Fix: simplify the CRD-mismatch hint to recommend ONLY `kars dev
--target local-k8s` (the source of truth for local dev). Drop the
helm-template fallback that was actively harmful. Add an explicit
DO NOT warning so the next person reading this hint doesn't try
the obvious-but-wrong shortcut.

Verified:
  npm run build ⇒ clean
  vitest run    ⇒ 798 passed

This session shipped fixes for 12 OOTB blockers, every one diagnosable
from cluster state + controller logs + chart source. This commit captures
the design for an in-cluster SRE agent that auto-walks the same diagnostic
ladder. Filed as docs/blueprints/07-kars-sre-proposal.md so it can be
reviewed on its own merits without bloating this Hermes-runtime PR.

Phased implementation:
  1. MVP (kars-sre-mvp todo): 5 read-only tools. ~500 LOC, ~1 day.
  2. Phase 2 (blocked on Phase 1): sre_apply_fix + AGT approval
     flow.
  3. Phase 3 (blocked on Phase 2): continuous-watcher mode.

Validation gate: the merged agent must autonomously diagnose +
propose fixes for every one of the 12 OOTB blockers from THIS
session, given only the cluster state that existed at the moment
each was hit. That's a built-in regression corpus.

No code shipped in this commit — design only. Implementation
lands in a separate PR series.
…s-control section

Two pieces:

1. fix(cli): rebuildDevImages always rebuilds controller + router

   User session hit a 30-minute debug loop because their kind cluster
   was running a `kars-controller:dev` image built BEFORE commit
   493c118 (2026-06-04), which added the `dev_profile` check at
   controller/src/reconciler/mod.rs:1291:

     let pull_policy = if ctx.dev_profile || !image.ends_with(":latest") {
         "IfNotPresent"
     } else {
         "Always"
     };

   Without dev_profile in the running controller, every `:latest`
   image got `imagePullPolicy: Always` — including the local
   kind-loaded kars-runtime-hermes — so the kubelet ALWAYS attempted
   a manifest fetch from karsacr.azurecr.io which kind cannot reach
   → ErrImagePull forever.

   Why this was hard to spot: rebuildDevImages skipped already-present
   images entirely. Running `kars dev --target local-k8s` after
   pulling new controller source did NOT rebuild the controller image.
   The user's check for KARS_DEV_PROFILE=true on the controller pod
   passed (env was correctly injected by the helm overlay), but the
   running controller process couldn't USE the env because the binary
   predates the env-reading code.

   Fix: introduce ALWAYS_REBUILD = {"controller", "inference-router"}.
   Both bottom out at a pre-staged Rust binary COPY (see staging in
   stage-rust-bin.ts), so docker layer cache makes the rebuild ~5-30s.
   That cost is well worth not silently running stale controllers.
   Sandbox + runtime images stay opt-in because their builds are
   minutes long.

2. docs(sre): expand kars-sre access-control design

   Added §6.1-6.6 to docs/blueprints/07-kars-sre-proposal.md:
   - 6.1 Tier 1 (MVP target): in-cluster ServiceAccount token on
     local-k8s — works on kind without any Entra/AKS dependency
   - 6.2 Tier 2 (Phase 2): AKS Workload Identity federation; byte-
     identical agent code, purely additive operator glue
   - 6.3 Complete ClusterRole `kars-sre-reader` spec — single
     authorization gate, every absent permission is deliberate
   - 6.4 Secrets handling — router-side .data stripping (~30 LOC
     in inference-router/src/proxy.rs) + RBAC defense in depth
   - 6.5 Phase-2 write actions via short-lived (5-min TTL),
     per-action ServiceAccount tokens minted on operator approval;
     standing blast radius stays read-only
   - 6.6 Egress already covered
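
The rebuild decision from piece 1 can be sketched as a small predicate. The function name and parameters are assumptions, and so is the rule that an absent image is always built; only the ALWAYS_REBUILD members and the "skip when already present" behaviour come from the description above.

```typescript
// Specs whose images are cheap to rebuild (pre-staged Rust binary COPY).
const ALWAYS_REBUILD_SKETCH: ReadonlySet<string> = new Set<string>([
  "controller",
  "inference-router",
]);

function shouldRebuildSketch(
  spec: string,
  imagePresent: boolean,
  forceAll: boolean,
): boolean {
  if (forceAll) return true; // explicit full rebuild requested
  if (ALWAYS_REBUILD_SKETCH.has(spec)) return true; // never trust a cached controller/router
  return !imagePresent; // assumption: other specs build only when missing
}
```

With this shape, a stale `kars-controller:dev` image can no longer survive a `kars dev` run, while the slow sandbox and runtime builds are still skipped when an image already exists.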

Verified:
  npm run build + typecheck ⇒ clean
  vitest run                ⇒ 798 passed
…copy

User on local-k8s after the image-loading saga finally got past
ImagePullBackOff — pod proceeded to crash on agent-container startup
with:

  cp: preserving permissions for '/sandbox/.hermes/plugins/kars/__init__.py':
  Operation not permitted
  cp: preserving permissions for '/sandbox/.hermes/plugins/kars/discover.py':
  Operation not permitted
  …(13 such lines)

Root cause: sandbox-images/hermes/entrypoint.sh line 73 used `cp -a`,
which preserves ownership, mode, and timestamps. The staged source at
/opt/kars-hermes-stage/plugins/kars/ was chowned root:root at image
build time. The entrypoint runs as UID 1000 (sandbox user) on a
`readOnlyRootFilesystem: true` pod, so preserving root ownership →
EPERM from the kernel → `set -e` at the top of the script kills the
container with the "Operation not permitted" spam as the entire
visible output. Cryptic-as-hell symptom; trivial fix.

Fix: `cp -r` instead of `cp -a`. Files end up owned by the copying
user (UID 1000), which is what we want anyway because that's the
UID hermes runs as. The source files have 0444 from the
`chmod -R a+rX` in sandbox-images/hermes/Dockerfile, so they remain
readable post-copy; no follow-up chmod needed.

Prior art: the openclaw entrypoint at
sandbox-images/openclaw/entrypoint.sh:1273 uses `cp --no-preserve=mode`
for the same reason (it copies the package.json staged at image build
time into a runtime location). We're now consistent.

No new tests — the failure mode is verifiable only against a real
sandbox pod, and the OOTB-fresh-machine-gate todo will cover it
once that CI lane lands.
… gpt-5.4)

User picked `claude-opus-4.7` in the operator spawn dialog; the
resulting Hermes pod was using gpt-5.4 instead.

Root cause: sandbox-images/hermes/entrypoint.sh line 171 wrote

  echo "  default: \"${AZURE_OPENAI_DEPLOYMENT:-gpt-5.4}\""

into the auto-generated config.yaml. AZURE_OPENAI_DEPLOYMENT is
NOT injected into the agent container by the controller — only
into the inference-router container's env (see
controller/src/reconciler/mod.rs:1656 → router_env, vs the
openclaw_env path at line 1335 which only sets the generic
KARS_MODEL). So the hermes entrypoint fell through to the
hardcoded gpt-5.4 default in EVERY case.

The generic kars-runtime-contract env var is KARS_MODEL (controller/
src/reconciler/mod.rs:1335 — "Generic alias readable by any
runtime — Hermes / OpenAIAgents / MAF / BYO all read KARS_MODEL").
That's the env Hermes should be honouring.

Fix: prefer KARS_MODEL first, fall back to AZURE_OPENAI_DEPLOYMENT
(for hand-crafted dev overlays that still set the legacy name),
keep gpt-5.4 as last-resort default to keep the boot banner
sensible when neither is set.

  ${KARS_MODEL:-${AZURE_OPENAI_DEPLOYMENT:-gpt-5.4}}

User-visible effect: `kars operator → n → Hermes → model:
claude-opus-4.7 → Launch` now actually runs claude-opus-4.7 in the
pod, not gpt-5.4. Same for every other Copilot/Foundry model the
user can pick.
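
For readers less used to nested shell defaults, the same precedence can be written out in TypeScript. This is only an illustration of the lookup order; the function name is hypothetical and the entrypoint itself stays a shell script. Note that the shell `:-` form treats an empty value like an unset one, which `||` mirrors here.

```typescript
function resolveModelSketch(env: Record<string, string | undefined>): string {
  // 1. generic runtime-contract variable, 2. legacy name, 3. last-resort default.
  // `||` (not `??`) so that an empty string also falls through, like `:-` in sh.
  return env["KARS_MODEL"] || env["AZURE_OPENAI_DEPLOYMENT"] || "gpt-5.4";
}
```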
…spawns)

User session: a Hermes parent agent tried to spawn a sub-agent with
the Hermes plugin's documented `role` arg. The router rejected with
HTTP 422 Unprocessable Entity because:

  inference-router/src/spawn/mod.rs::SpawnRequest had
  #[serde(deny_unknown_fields)] and no `role` field. The Hermes plugin
  wrapper (runtimes/hermes/src/kars_runtime_hermes/plugin/spawn.py)
  sent `role` as a top-level body field and got 422.

The Hermes plugin's docstring + the kars_spawn schema both expose
`role` as a key arg: "Short persona/role description that siblings
can find by role." The bug was always a missing field in the router's
request type, not bad client code.

Fix:
  1. Add `pub role: Option<String>` to SpawnRequest in
     inference-router/src/spawn/mod.rs:48 (the schema-deny-strict
     deserialize site).
  2. Wire it through every existing SpawnRequest construction site
     (handoff/mod.rs × 6, spawn/dev_profile_test.rs × 1, spawn/mod.rs
     × 2 — the docker.rs snapshot path and the list-children
     reconstruction path).
  3. In build_sub_agent_crd_with_labels, when role is non-empty,
     emit it as `kars.azure.com/role` label on the child CRD so:
     (a) sibling discovery via `kubectl get karssandbox
     -l kars.azure.com/role=auditor` works,
     (b) the parent's local roster can recover role on restart by
     reading children with kars.azure.com/parent + reading
     kars.azure.com/role,
     (c) the handoff/restore path snapshot can preserve role
     across re-spawn (the snapshot-reconstruction site now reads
     the label back).
  4. Sanitize the label: K8s requires ≤63 chars + a specific
     character class. Replace disallowed chars with `-`, truncate.
     Don't fail the spawn over a space in the LLM's free-form
     persona text.
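
The label sanitization in step 4 can be illustrated as below. The router does this in Rust; this TypeScript version is only a sketch of the rule, with a hypothetical function name. Trimming non-alphanumeric characters from both ends is an addition based on the Kubernetes requirement that a non-empty label value start and end with an alphanumeric character; the text above only mentions replacing disallowed characters and truncating.

```typescript
function sanitizeRoleLabelSketch(role: string): string {
  // Kubernetes label values allow [A-Za-z0-9], '-', '_' and '.'.
  const replaced = role.replace(/[^A-Za-z0-9._-]/g, "-");
  // Drop leading separators, cap at the 63-character limit, then drop
  // trailing separators so the value starts and ends alphanumeric.
  const noLeading = replaced.replace(/^[^A-Za-z0-9]+/, "");
  const truncated = noLeading.substring(0, 63);
  return truncated.replace(/[^A-Za-z0-9]+$/, "");
}
```

An empty result is still a valid label value, so a caller could either emit it as-is or skip the label; which one the router does is not stated here.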

User-visible effect: `Hermes parent → kars_spawn(name: zsolti, role:
"data analyst")` now succeeds AND records the role for downstream
discovery. Previously it failed with 422 and the Hermes plugin's
wrapper retried without role (silently dropping the field's whole
purpose).

The agent that diagnosed this was, charmingly, correct.

Verified:
  cargo build/clippy/fmt          ⇒ clean
  cargo test --lib spawn          ⇒ 16 passed
  ci/check-loc.sh                 ⇒ clean

The honest answer to "is OOTB really working?" is "you'd have to
actually run the full flow on a fresh machine to know". This
session shipped fixes for 16 OOTB blockers that no unit test caught,
every single one a wire-format failure across module boundaries that
only surfaces against a real cluster.

This script makes that question answerable by anyone in one command:

  export KARS_OOTB_COPILOT_TOKEN="gho_..."
  bash scripts/smoke/fresh-machine-ootb.sh

What it does:
  1. Wipes carried state (kind cluster, AGT clone, ~/.kars, npm-link)
  2. Fresh git clone into /tmp/kars-ootb-smoke
  3. cd cli && npm ci && npm run build && npm link
  4. kars dev --target local-k8s (non-interactive via seeded creds)
  5. kars add one OpenClaw + one Hermes sandbox
  6. Polls until both pods Running 2/2 (5-min timeout each)
  7. Tears down (or --keep)

Exits 0 iff every step succeeds. On failure prints the precise
command that failed AND the pod's diagnostic output, so the
regression is reproducible from the script output alone.

Catches everything from the 2026-06-08 Hermes session:
  • AGT auto-clone missing → kars dev fails at chart-apply
  • Stale CRD → kars add 4xx
  • Stale controller / runtime images → ImagePullBackOff
  • KARS_DEV_PROFILE drift → same
  • `cp -a` perm error → CrashLoopBackOff
  • SpawnRequest missing role → kars add exit 1
  • etc.

Limitations today:
  • Needs a real Copilot OAuth token in env (can't run on a public
    GHA runner without secret wiring). The ootb-fresh-machine-gate
    todo tracks moving this to CI.
  • Single-cluster only; AKS / federation covered by separate
    interop scripts.
  • Doesn't validate model routing post-spawn yet (next iteration).

README.md documents usage, what it catches, when to run, and
how to extend for new wired runtimes.

Not wired as a PR gate yet — runs manually. Filed
`ootb-fresh-machine-gate` for the CI lane.
…tch deck

Practitioner-grade visual language (mix of Patrick Collison / Stripe Press,
Bret Victor, and Stripe-docs style — selected per slide intent):

  • Title + close       : dark sandwich, 168pt mark, single tagline
  • Pillar overview     : eyebrow + heading + lede paragraph + 4 named primitives
  • Sandbox             : Victor-style — one named artefact, real CRD field
                          labels (UID 1000, readOnlyRootFilesystem, etc.)
  • Sandbox · the gate  : Stripe-docs — the 6 actual iptables rules in a
                          code block + prose explanation on the right
  • Mesh                : the real KNOCK frame JSON as the visual artefact
                          (with v/type/from/to/id/ts/intent/establishment fields)
  • Governance · policy : the real InferencePolicy CR snippet as the artefact
  • Governance · stack  : 4 layers vertically with name + body + source ref
  • Blueprints          : 6 named shapes, real meta per shape (Kata + SEV-SNP,
                          A2A bridge, private model + signed allowlist, etc.)
  • Multi-runtime       : 8 wired runtimes named, with what each is in one line
  • Built on AGT        : the 4 actual PRs/contributions named
  • What's next         : the 4 actual shipping targets named
  • Try it              : the actual `kars dev` command sequence as code block

Every claim in the deck is ground-truthed against repo HEAD via
deep-dive explorer runs (see docs/showcase/outline.md for full
citations). Source file: /tmp/build-deck.js → pptxgenjs render.

Diagram pair (Excalidraw): 6 .excalidraw files using single accent
teal (#028090), Helvetica throughout, no inline arrows where layout
implies the relationship. Kept minimal because the deck slides carry
the detail; diagrams are hero shots, not info-dense.

Source-of-truth doc: docs/showcase/outline.md — every claim with
file:line citations against ground-truthed repo source. Updating
outline.md → re-running /tmp/build-deck.js regenerates the deck.
… tiles cropped)

Visual QA caught two layout bugs in the first render:

- Slide 7 (MESH/KNOCK frame): JSON code block was 17 lines @ fontSize 12
  in a 2.9" panel — closing braces overflowed below the gray panel.
  Fixed: shrunk to fontSize 11, panel taller (3.55"), raised to y:3.85
  to use the available space without colliding with the lede.

- Slide 11 (RUNTIMES): 4 tiles @ 3.85" wide + 0.15" gaps = 15.85"
  total, but the slide is only 13.3" wide — x0 went negative,
  cropping leftmost tiles. Fixed: tw=2.9" → 12.25" total, centred
  with 0.525" left/right margin.
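Both layout bugs can be checked with a few lines of arithmetic before rendering. The sketch below is illustrative only: the helper names (`rowWidth`, `centredX0`, `codeBlockHeight`) are hypothetical and not taken from build-deck.js, and the 1.2x line-spacing factor is an assumption that ignores panel padding.

```javascript
// Hypothetical pre-render layout checks; not code from build-deck.js.
const SLIDE_W = 13.3; // slide width in inches, as stated in the notes above

// Total width of a row of `n` tiles, each `tw` inches wide, with `gap` inches between them.
function rowWidth(n, tw, gap) {
  return n * tw + (n - 1) * gap;
}

// Left edge that centres the row on the slide.
// A negative result means the leftmost tiles are cropped.
function centredX0(n, tw, gap, slideW = SLIDE_W) {
  return (slideW - rowWidth(n, tw, gap)) / 2;
}

// Estimated height in inches of a code block.
// ASSUMPTION: line height = font size * 1.2; 72 points per inch.
function codeBlockHeight(lines, fontPt, lineSpacing = 1.2) {
  return (lines * fontPt * lineSpacing) / 72;
}

// Slide 11, first render: 4 tiles at 3.85" with 0.15" gaps.
console.log("row width:", rowWidth(4, 3.85, 0.15).toFixed(2));
console.log("x0:", centredX0(4, 3.85, 0.15).toFixed(3));

// Slide 7: 17 lines at 12pt versus the 2.9" panel, then 11pt versus the 3.55" panel.
console.log("12pt block:", codeBlockHeight(17, 12).toFixed(2));
console.log("11pt block:", codeBlockHeight(17, 11).toFixed(2));
```

Under these assumptions the first-render tile row is wider than the slide, so `centredX0` goes negative (about -1.28"), and the 17-line block needs about 3.4" at 12pt. At 11pt it still needs about 3.1", which is why the fix both shrank the font and made the panel taller.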

Re-rendered all 15 slides, visual QA pass clean across the board.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Vendoring /tmp/build-deck.js into scripts/showcase/ so the deck is
reproducible from a clean clone — no more 'lost the script that
made the slides' problem.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
}

// section divider (very minimal — used between major narrative arcs)
function section(s, n, txt) {
…-shaped

Major restructure per Pal's feedback (less competitive, more architecture):

  REMOVED: 'THE RACE' slide (competitor names)
  ADDED:   'THE OUTCOME' slide — 4 measurable outcomes:
           every call audited · sandbox in minutes · one policy plane · default-deny egress

  ADDED (architecture deep-dive, §3):
    §3   THE CORE — three boxes: Controller / 11 CRDs / Inference Router
    §3.1 ROUTER REQUEST FLOW — 6 stages: agent → iptables → router → policy → audit → upstream
    §3.2 ROUTER INTERNALS — 8 routes + 8 subsystems in two columns
    §3.3 CONTROLLER LOOP — KarsSandbox → 9 named Kubernetes primitives
    §3.4 STATUS & OBSERVABILITY — phase taxonomy + conditions YAML
    §3.5 CRD CATALOG — all 11 CRDs tabled with scope + reconciler

  ADDED (policy in practice, §4):
    §4   InferencePolicy (existing slide reworked)
    §4.1 ToolPolicy + EgressApproval (real CR snippets side by side)

  REFRAMED:
    §10 WHAT'S NEXT — outcome-shaped: 'capability → what becomes possible → proof'

Visual QA: 4 layout bugs caught on first render (slides 9/11/13/20)
and fixed; final QA pass clean on all 21 slides.

Reproduces via: NODE_PATH=$(npm root -g) node scripts/showcase/build-deck.js

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
}

// right-column prose paired with codeBlock
function rightProse(s, paragraphs, opts = {}) {
…gress slide

Two new slides per Pal's feedback:

NEW SLIDE 4 — HIGH-LEVEL ARCHITECTURE diagram (right after the 'WHAT KARS IS' dark
   statement, before drilling into 'THE CORE'). A single picture that shows the whole
   shape: dashed cluster boundary containing three boxes (kars CRDs · kars-controller ·
   Sandbox pod), with the sandbox pod showing both inner containers (agent UID 1000 +
   inference-router UID 1001). Below the cluster: a row of external services chips
   (Azure OpenAI · Anthropic · OpenAI · Bedrock · MCP · A2A peers · AGT relay) with
   'only path out' label connecting the architecture to the providers.

NEW SLIDE 15 — NETWORK EGRESS · learn it, sign it, enforce it. The full story:
   - Top half: two side-by-side panels showing Learn mode (default — record every host
     into the next allowlist proposal) vs Strict mode (production — anything outside the
     signed allowlist gets 4xx; EgressApproval grants layer on top; fails closed)
   - Bottom half: 5-stage signed-OCI-allowlist pipeline:
       kars egress --sign  →  OCI artifact (ACR/ghcr)  →  cosign verify (Fulcio + SAN)
       →  ConfigMap + digest  →  router · L7 hot-reload
   - Source refs: controller/src/policy_fetcher.rs · egress_allowlist_compile.rs ·
     inference-router/src/egress_allowlist_loader.rs
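
The Learn/Strict split described on slide 15 can be read as a small decision function. The sketch below is a hypothetical illustration, not the router's code: `decideEgress`, its parameter names, and the `"allow"`/`"deny"` return values are invented here, and it assumes that Learn mode lets traffic through while recording the host (the slide text only says hosts are recorded).

```javascript
// Hypothetical sketch of the Learn vs Strict egress decision; not the
// inference-router implementation.
//
//   mode      : "learn" | "strict"
//   host      : destination host of the outbound request
//   allowlist : Set of hosts from the signed allowlist, or null if none is loaded
//   approvals : Set of hosts granted via EgressApproval (layered on top)
//   learned   : Set collecting hosts for the next allowlist proposal
function decideEgress({ mode, host, allowlist, approvals = new Set(), learned = new Set() }) {
  if (mode === "learn") {
    // ASSUMPTION: learn mode permits the call and records the host.
    learned.add(host);
    return "allow";
  }
  if (mode === "strict") {
    // Fail closed: no verified allowlist means nothing gets out.
    if (!allowlist) return "deny";
    if (allowlist.has(host)) return "allow";
    // EgressApproval grants layer on top of the signed allowlist.
    if (approvals.has(host)) return "allow";
    return "deny";
  }
  // Unknown mode: also fail closed.
  return "deny";
}

// Usage with made-up hosts:
const learned = new Set();
decideEgress({ mode: "learn", host: "api.example.com", allowlist: null, learned });
const verdict = decideEgress({
  mode: "strict",
  host: "other.example.com",
  allowlist: new Set(["api.example.com"]),
});
console.log([...learned], verdict);
```

Treating a missing allowlist and an unrecognised mode as deny keeps every unexpected state on the "fails closed" side of the slide's description.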

Section eyebrows renumbered to keep §3 sub-numbering consistent after insertion.
Final QA: clean on all 23 slides.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>