Hermes support#396
Open
pallakatos wants to merge 62 commits into
Three real bugs surfaced when running the first `docker build` +
end-to-end smoke test of the Hermes sandbox image:
1. **Hermes version pin wrong**: `ARG HERMES_VERSION=0.5.1` doesn't
exist on PyPI. The 0.5.x assumption came from misreading the Hermes
README's Homebrew formula tag (`5.1.14`); the actual `hermes-agent`
PyPI package uses 0.x.y numbering, with 0.15.2 the latest. Bumped to
0.15.2. Hermes 0.15.2's plugin contract (`PluginContext.register_tool`,
`register_hook`, `plugin.yaml` with `provides_tools`/`provides_hooks`,
discovery via `$HERMES_HOME/plugins/`) matches what the A1 plugin code
was already built for, verified by importing `hermes_cli.plugins` and
running `discover_plugins()` against our materialized plugin tree.
2. **ripgrep not in Azure Linux 3**: `tdnf install -y` exits non-zero
if ANY package is missing, and Azure Linux 3 doesn't ship ripgrep.
Hermes' built-in file_search tool prefers ripgrep but falls back to
grep, so dropping it is safe. Image now builds in ~30s.
3. **kars plugin discovered but not loaded**: Hermes treats
`standalone` plugins as opt-in via `plugins.enabled` in config.yaml.
The entrypoint was placing the kars plugin into
`$HERMES_HOME/plugins/kars/` (the correct user discovery path), but
never adding `kars` to the enabled allow-list, so it was discovered
and silently skipped with `error='not enabled in config'`. The
entrypoint now emits a `plugins.enabled: [kars]` block at the top of
every generated config.yaml. The awk-merge that replaces prior
`mcp_servers:` blocks was extended to also replace prior `plugins:`
blocks so re-runs are idempotent.
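The idempotent block replacement from item 3 can be sketched roughly as below. This is an illustration only, not the entrypoint's code: the function name `replace_top_level_block` is hypothetical, and it assumes top-level YAML keys start in column 0 with their children indented.

```python
def replace_top_level_block(text: str, key: str, new_block: str) -> str:
    """Drop any existing top-level `key:` block, then put new_block at the top.

    Assumption: top-level keys start in column 0 and nested lines are indented.
    """
    kept, skipping = [], False
    for line in text.splitlines():
        top_level = bool(line) and not line[0].isspace() and not line.startswith("#")
        if top_level:
            # A new top-level key ends whatever block was being skipped.
            skipping = line.startswith(f"{key}:")
        if not skipping:
            kept.append(line)
    body = "\n".join(kept).strip("\n")
    return new_block.rstrip("\n") + "\n" + (body + "\n" if body else "")


PLUGINS_BLOCK = "plugins:\n  enabled: [kars]\n"

# Running the merge twice yields the same file, which is the idempotency property.
once = replace_top_level_block("model:\n  default: x\n", "plugins", PLUGINS_BLOCK)
twice = replace_top_level_block(once, "plugins", PLUGINS_BLOCK)
```

Replacing the whole block instead of appending to it is what keeps repeated boots from accumulating duplicate `plugins:` sections.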
Verified end-to-end:
- `docker build` succeeds
- `discover_plugins()` loads kars plugin, registers 10 tools + 2 hooks
(pre_tool_call + post_tool_call)
- Entrypoint generates correct config.yaml with both blocks
- `$HERMES_HOME/plugins/kars/` materialized from
`/opt/kars-hermes-stage/plugins/kars/` on every boot
- 83/83 python unit tests still pass inside the image
- Mock smoke run: `python3 -m hermes_cli.plugins discover` shows
kars: enabled=True, 17 total plugin tools across all enabled plugins
(10 from kars + 7 web/foundry from bundled providers)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Dependency Review
The following issues were found:
License Issues
- mesh-plugin/package.json
- runtimes/agt-mesh-python/pyproject.toml
Two follow-ups from the kind-cluster end-to-end smoke test:
1. **Helm CRD schema missing Hermes enum**: controller's `crd.rs` added
`RuntimeKind::Hermes` in a7882b8 but the matching Helm CRD YAML wasn't
updated. Result: the API server rejected every KarsSandbox with
`runtime.kind: Hermes` BEFORE the controller ever saw it. Verified by
`kubectl apply --dry-run=server` failing with "unknown enum value
'Hermes'". Added:
- `Hermes` to the `runtime.kind` enum at line 85
- x-kubernetes-validations rule:
`(self.kind == 'Hermes') == has(self.hermes)`
- `runtime.hermes` properties block mirroring `pydanticAi` shape
(version, agentCode oci/git, entrypoint, extraEnv)
After the fix, `kubectl apply -f /tmp/hermes-sandbox.yaml` succeeds,
the controller picks up the CR, and a 2-container pod (`agent` +
`inference-router`) reaches `2/2 Running` with the kars plugin loaded
(10 tools + 2 hooks registered).
2. **`.cargo-docker/` not gitignored**: when cross-compiling for
linux/arm64 via `docker run -v $PWD:/work … cargo build` (the pattern
used for kind-on-M-series), `CARGO_HOME=/work/.cargo-docker` keeps the
container-arch crate cache out of the host's `~/.cargo`. That directory
was leaking into `git status`. Added rules:
- `.cargo-docker/` (explicit)
- `/bin/` was already covered by `**/[Bb]in/*` (verified)
Verified end-to-end on kind cluster `kars-dev`:
$ kubectl get karssandbox,pods -n kars-smoke-hermes
NAME           PHASE   RUNTIME   INFERENCEPOLICY   ISOLATION
smoke-hermes           Hermes    smoke-inference   standard
NAME                            READY   STATUS    RESTARTS
smoke-hermes-697c6bd557-q5xfr   2/2     Running   0
Plugin discovery inside the pod:
kars plugin: enabled=True, source=user
hooks : {'pre_tool_call': 1, 'post_tool_call': 1}
tools : http_fetch, kars_discover,
kars_mesh_{send,inbox,await,transfer_file},
kars_spawn{,_status,_destroy,_list}
Router /healthz from the agent container: 200 ok
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
End-to-end Hermes smoke on kind cluster exposed and fixed six real bugs
blocking the runtime from being functional:
1. awk not in Azure Linux 3: replaced entrypoint merge with Python
2. TUI mode crashed without TTY: switched to hermes gateway run
3. KARS_MCP_SERVERS injected only into "openclaw" container:
generalized to use agent_container_name based on runtime kind
4. Entrypoint scanned wrong path for MCP servers: aligned to the
KARS_MCP_SERVERS env + loopback router pattern
5. hermes config set used key=value (wrong): fixed to two positional args
6. Router rustls CryptoProvider not pre-installed: added explicit
aws_lc_rs::default_provider().install_default() in main()
Verified 12/12 e2e checks pass on kind cluster:
- Pod 2/2 Running, plugin loaded with 10 tools + 2 hooks
- Router /healthz, /agt/evaluate, /egress/fetch, /sandbox/list all 200
- KarsMemory CR Compiled, McpServer translated, channel translation
- Mesh stubs return clear Act 2 error
- pre_tool_call hook fires + decision=allow
All 834 controller + 932 router Rust tests pass. cargo clippy clean,
cargo fmt applied.
Security audit:
docs/internal/security-audits/2026-06-04-hermes-act1-e2e-smoke-fixes.md
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2a712bc to
82b7fa1
The sandbox NetworkPolicy gated ALL ingress rules behind
`governance.enabled=true`. With governance off, the NP shipped with
`policyTypes: [Ingress, Egress]` and an empty `ingress: []` block —
deny-all ingress. The operator namespace then could not reach
`/internal/policy-status` on the router and every referencing
InferencePolicy / KarsMemory / ToolPolicy / McpServer / EgressApproval
stuck forever in `Ready=False / AwaitingRouterEnforcement`, observable
in the operator panel even though the sandbox itself was healthy and
the router /readyz returned 200.
Split into two ingress classes:
- **Operator policy-echo ingress** (router :8443 admin surface from
ns labeled `app.kubernetes.io/name=kars,component=system`) — emitted
UNCONDITIONALLY. Three orthogonal gates still protect it: bearer
token, constant-time compare, optional IP pinning.
- **Peer-sandbox mesh + gateway ingress** (8443 / 18789 / 18791 from
ns labeled `kars.azure.com/role=sandbox`) — kept gated on
governance.enabled (no peers when governance is off).
Surfaced during local-k8s smoke of smoke-hermes: even after fixing
the AZURE_OPENAI_API_KEY env path so /readyz returned 200, three
policy CRs (InferencePolicy, KarsMemory, ToolPolicy) stayed
Ready=False because the controller's /internal/policy-status probe
to the sandbox router timed out at the NetworkPolicy level.
After this fix, with governance off, the controller's HTTP probe
gets a 401 (admin-token gate doing its job) instead of a connection
timeout, and the policy reconcilers update status using the round
trip rather than reporting "router unreachable".
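The two ingress classes can be sketched as a small rule builder. This is a hypothetical Python rendering of the template logic, not the chart itself: `build_ingress_rules` is an invented name, and the label keys are copied literally from the description above, so the real selectors may be spelled differently.

```python
def build_ingress_rules(governance_enabled: bool) -> list:
    """Return NetworkPolicy-style ingress rules for a sandbox pod (illustrative)."""
    rules = [
        {
            # Operator policy-echo ingress: emitted unconditionally so the
            # controller can always reach the router's admin surface.
            "from": [{"namespaceSelector": {"matchLabels": {
                "app.kubernetes.io/name": "kars",
                "component": "system",
            }}}],
            "ports": [{"protocol": "TCP", "port": 8443}],
        }
    ]
    if governance_enabled:
        # Peer-sandbox mesh + gateway ingress: only when governance is on.
        rules.append({
            "from": [{"namespaceSelector": {"matchLabels": {
                "kars.azure.com/role": "sandbox",
            }}}],
            "ports": [{"protocol": "TCP", "port": p} for p in (8443, 18789, 18791)],
        })
    return rules
```

With governance off the result still contains one rule, so the policy is no longer deny-all for the operator namespace.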
Verified end-to-end on kind cluster `kars-dev`:
$ kubectl get inferencepolicy smoke-inference -n kars-system -o jsonpath='{.status.conditions}' | jq
- Ready=True RouterEnforcing: all 1 referencing sandbox router(s) confirmed inference-policy digest
- Progressing=False Reconciled: router echo confirmed
$ kubectl get karsmemory smoke-mem -n kars-system -o jsonpath='{.status.conditions}' | jq
- Ready=True RouterEnforcing: all 1 referencing sandbox router(s) confirmed claw-memory binding digest
834 controller tests still pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collapses the canonical 4-agent exec-brief scenario (parent +
analyst + viz + writer) into a single Hermes agent doing the whole
pipeline itself — research, scorecard, hero image, written brief.
Built to validate the Hermes runtime adapter end-to-end on
local-k8s and AKS without depending on the Python AGT MeshClient
(which ships in Act 2; until then, `kars_mesh_*` returns explicit
"Act 2 not ready" errors and the prompt explicitly tells the agent
not to call those tools).
Scenario layout (mirrors exec-brief/):
- manifests/00-namespace.yaml ........ kars-execbrief-hermes ns
- manifests/01-inferencepolicy.yaml .. azure-openai gpt-5.4
- manifests/02-toolpolicy.yaml ....... allow-all AGT profile
- manifests/03-clawmemory.yaml ....... memory-execbrief-hermes store
- manifests/04-mcpserver.yaml ........ DeepWiki MCP (same as canonical)
- manifests/05-clawsandbox.yaml ...... runtime.kind: Hermes
- config.sh .......................... SCENARIO_SUB_SANDBOXES=()
- prompt.txt ......................... single-agent pipeline
- README.md .......................... what it exercises + skips
Verified on kind cluster `kars-dev`:
$ kubectl apply -f tools/e2e-harness/scenarios/exec-brief-hermes-single/manifests/
→ 6 resources created
$ kubectl get karssandbox execbrief-hermes -n kars-system
PHASE=healthy RUNTIME=Hermes
$ kubectl get pods -n kars-execbrief-hermes
execbrief-hermes-... 2/2 Running
All 5 CRs reach RouterEnforcing / Ready=True:
● execbrief-hermes-inference InferencePolicy router echo confirmed
● execbrief-hermes-toolpolicy ToolPolicy agt-profile digest confirmed
● execbrief-hermes-memory KarsMemory binding=bound
● execbrief-hermes-deepwiki McpServer healthy
● execbrief-hermes KarsSandbox healthy
In-pod verification:
- kars plugin: enabled=True source=user, 10 tools + 2 hooks
- foundry_memory store_name = memory-execbrief-hermes (matches CR)
- config.yaml mcp_servers.execbrief-hermes-deepwiki present
- KARS_MCP_SERVERS=execbrief-hermes-deepwiki in agent env
- Router /readyz: 200 ok
Note: the actual LLM execution of the prompt requires real Azure
OpenAI / Foundry credentials. With the fake-key dev overlay used in
this validation, the pipeline runs through Hermes → kars plugin →
router → upstream-call layer and hangs at the upstream (expected).
Running with real creds — either via `kars dev --target local-k8s`
with a real provider, or on AKS via `SCENARIO=exec-brief-hermes-single
PLATFORM=aks ./tools/e2e-harness/run.sh` — will execute the full
pipeline.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… hardening
End-to-end run of the new `exec-brief-hermes-single` scenario on
local-k8s surfaced four more bugs that all gate the prompt from
actually reaching the model:
1. **`pull_policy=Always` for `:latest` images** in dev mode forced a
doomed registry pull (karsacr.azurecr.io/…) instead of using the
kind-cached image. The controller now picks `IfNotPresent` when
`KARS_DEV_PROFILE=true` is set on its own env. Production AKS
stays on `Always` for `:latest`.
2. **Hermes' `tirith` auto-download** from GitHub releases blocked
every cold start while the kars egress-guard slow-walked the
fetch. Entrypoint now sets `TIRITH_ENABLED=false` by default;
Hermes falls back to its built-in pattern-matching shell
checker. Operators can re-enable by pre-baking the binary at
`/usr/local/bin/tirith` and setting `TIRITH_ENABLED=true`.
3. **`HERMES_DISABLE_LAZY_INSTALLS=1`** suppresses Hermes' `pip
install` of discord.py / google-* / brotlicffi on first use of
bundled platform plugins. Saves 30–120s on every cold start;
operators wanting the extras re-bake into the image.
4. **`HERMES_SKIP_NODE_BOOTSTRAP=1`** suppresses Hermes' shell-based
Node.js 22 LTS auto-installer (scripts/install.sh). We pre-install
`nodejs` + `nodejs-npm` from the Azure Linux 3 base repo
(currently v20.14 — Hermes' dep_ensure accepts any modern node).
Browser tools that need a Chromium download still need to be
pre-baked separately.
All three Hermes-runtime knobs are also mirrored into
`$HERMES_HOME/.env` so they survive `kubectl exec` sessions
(kubectl exec spawns a fresh env that doesn't see entrypoint
exports). Hermes' env_loader loads .env at import time
(`hermes_cli/env_loader.py:_load_dotenv_with_fallback`).
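A minimal sketch of mirroring the three knobs into a dotenv file follows. The helper names `resolve_knobs` and `mirror_knobs` are hypothetical, and the assumption that an operator-provided value wins over the default is mine, not something the entrypoint is confirmed to do.

```python
from pathlib import Path
from typing import Mapping

# Defaults named in the commit message above.
DEFAULTS = {
    "TIRITH_ENABLED": "false",
    "HERMES_DISABLE_LAZY_INSTALLS": "1",
    "HERMES_SKIP_NODE_BOOTSTRAP": "1",
}


def resolve_knobs(environ: Mapping[str, str]) -> dict:
    """Assumption: a value already set in the environment overrides the default."""
    return {key: environ.get(key, default) for key, default in DEFAULTS.items()}


def mirror_knobs(env_path: Path, knobs: Mapping[str, str]) -> None:
    """Upsert KEY=value lines in a dotenv file, leaving unrelated lines alone."""
    lines = env_path.read_text().splitlines() if env_path.exists() else []
    kept = [ln for ln in lines if ln.split("=", 1)[0].strip() not in knobs]
    kept += [f"{key}={value}" for key, value in knobs.items()]
    env_path.write_text("\n".join(kept) + "\n")
```

Because existing entries for the same keys are dropped before the new ones are appended, running this on every boot does not grow the file.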
After all four fixes, verified end-to-end:
- smoke-hermes sandbox: phase=Running, 2/2 Ready
- Router /readyz: 200 ok (controller forwards real Foundry API
key from `kars-dev-creds` Secret via secretKeyRef)
- Router /v1/chat/completions: 200 with real gpt-5.4 reply ("OK"
in 1.1s, latency_checkpoint shows engine_ttft_ms=108)
- InferencePolicy / KarsMemory / ToolPolicy / McpServer all
Ready=True / RouterEnforcing
- Plugin loaded with 10 tools + 2 hooks + foundry_memory native
- Platform MCP block present in config.yaml when
FOUNDRY_PROJECT_ENDPOINT is bound
Outstanding gap (NOT in this commit): Hermes' `hermes -z` still
makes an outbound HTTPS handshake (state=SYN_SENT to 104.18.3.115
:443, a Cloudflare IP — likely a check-update or telemetry endpoint
the harness hasn't tracked down). The kars egress-guard's
forward-proxy stalls the connection rather than denying outright,
so the prompt-driven path hangs after plugin discovery completes.
Workarounds:
(a) `KARS_EGRESS_LEARN=true` to log unallowed hosts, then
explicitly allowlist in EgressAllowlist;
(b) find Hermes' env to disable check-update / telemetry — Act 1.x;
(c) drive Hermes via Telegram channel instead of `hermes -z`.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
238453f to
f7f16c8
…l Foundry
The single-agent exec-brief scenario (research → JSON → scorecard PNG →
hero PNG → 2-page brief.md) now runs end-to-end on Hermes through the
kars router to real Azure Foundry gpt-5.4. Verified on local-k8s with
the user's ~/.kars/ creds.
Four fixes were needed (each surfaced sequentially as the agent loop
progressed further):
1. **`OPENAI_API_KEY` env routes Hermes to openrouter** (and openrouter.ai
is blocked by the egress-guard). Switched the entrypoint's `.env`
mirror to `AZURE_FOUNDRY_API_KEY` + `AZURE_FOUNDRY_BASE_URL` so
resolve_provider() picks the `azure-foundry` provider (which has
no built-in Cloudflare callback).
2. **`agent_init.py` hardcodes `_codex_reasoning_replay_enabled = True`**
→ Hermes echoes `{"type": "reasoning", "encrypted_content": "..."}`
back to /v1/responses on every continuation, which Azure Foundry's
strict schema validator rejects with `invalid_payload`. OpenAI's
own Responses API accepts these. Hermes only learns to disable
replay when the upstream returns `invalid_encrypted_content` (a
different error code that Foundry doesn't emit).
Router fix: `build_upstream_url()` in proxy.rs now strips
`input[]` items of `type=reasoning` and the
`include=["reasoning.encrypted_content"]` field from any /v1/responses
request bound for Azure Foundry (NOT GitHub Models / Copilot —
their schemas accept the original shape).
3. **/v1/responses handler used `forward()` (non-streaming)** but Hermes
always opens these with `responses.create(stream=True)` and expects
an SSE `text/event-stream` response. The buffered JSON blob made
Hermes' SDK raise "Connection error" after ~15s and retry 6× before
giving up with `max_retries_exhausted`. Switched the handler to
`forward_stream()` so the SSE byte stream flows through unchanged.
4. **`forward_stream()` injected `stream_options.include_usage`** which
the OpenAI Responses API rejects (`unknown_parameter`). Skip the
injection for /v1/responses (Foundry already emits usage in the
terminating SSE event); was already skipped for Anthropic
/v1/messages — same exclusion now covers both shapes.
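The request rewrite from fix 2 can be illustrated in Python as below. The router itself is Rust, so this is only a sketch of the transformation: `strip_reasoning_replay` is a hypothetical name, and dropping the `include` key entirely once it becomes empty is an assumption.

```python
import copy


def strip_reasoning_replay(body: dict) -> dict:
    """Return a copy of a /v1/responses request body without reasoning replay."""
    out = copy.deepcopy(body)

    # Remove replayed reasoning items from input[].
    items = out.get("input")
    if isinstance(items, list):
        out["input"] = [
            item for item in items
            if not (isinstance(item, dict) and item.get("type") == "reasoning")
        ]

    # Remove the request for encrypted reasoning content.
    include = out.get("include")
    if isinstance(include, list):
        remaining = [v for v in include if v != "reasoning.encrypted_content"]
        if remaining:
            out["include"] = remaining
        else:
            out.pop("include")  # assumption: omit the field when nothing is left
    return out
```

Working on a deep copy keeps the caller's original payload intact, which matters if the same request may also be forwarded unchanged to a provider that accepts the original shape.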
Plus the entrypoint now persists `model.{default,provider,base_url}` in
config.yaml on every boot (not just plugins+mcp_servers), so a fresh
pod doesn't need a one-time `hermes config set model` post-boot dance.
End-to-end run delivered:
/sandbox/incoming/brief.md 6,136 B (2 pages, real Markdown,
12 footnoted https citations,
references hero+scorecard PNGs
inline, all 4 control-domain
terms present)
/sandbox/incoming/analyst.json 5,025 B (foundry_web_search × 3 →
trends / control_categories /
runtimes / metrics)
/sandbox/incoming/hero.png 30,094 B (1024×1024, foundry_image_generation
gpt-image-1, "Defense in Depth"
isometric data-center cutaway)
/sandbox/incoming/scorecard.png 12,201 B (1024×640, foundry_code_execute
matplotlib grouped bar chart,
4 runtimes × 4 control columns)
Router log: 30+ /v1/responses SSE streams, all 200 OK, latencies
1.6–67s. Foundry stream headers received for every request after
this fix; pre-fix only 2 of 8 requests had `Foundry complete` entries
before Hermes gave up.
Agent stdout (final response after autonomous tool-use loop):
> Done. Artifacts produced:
> - /sandbox/incoming/brief.md — 6136 bytes
> - /sandbox/incoming/hero.png — 30094 bytes
> - /sandbox/incoming/scorecard.png — 12201 bytes
> - /sandbox/incoming/analyst.json — 5025 bytes
> Verified: brief.md exists and references both image files
> hero.png and scorecard.png exist as real PNGs
> analyst.json exists with the normalized runtime comparison
All 932 router + 834 controller Rust tests still pass.
Deliverables captured under:
tools/e2e-harness/out/hermes-exec-brief-delivered/
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f7f16c8 to
f21048d
Two visibility gaps surfaced after the Hermes exec-brief run:
operator panel showed `sandbox="unknown"` (instead of the real
sandbox name) and zero token counters for every /v1/responses call.
1. **sandbox label was "unknown"**: every `x-kars-sandbox` header
parser fell back to `"unknown"` when the header wasn't set —
which is the default for clients like Hermes' openai SDK that
don't add kars-specific headers. Per-sandbox routers KNOW their
own identity via the `SANDBOX_NAME` env (set by the controller).
Added `resolve_sandbox_name()` helper at the top of inference.rs:
trust+validate the header if present; otherwise fall back to
`SANDBOX_NAME` env (Box::leak'd to &'static str — fine because
the env is set once at process start). Replaces 4 hand-rolled
`unwrap_or("unknown")` / `unwrap_or("self")` sites. All four
/v1/{responses,completions,embeddings} + foundry-proxy handlers
now produce metrics labelled with the real sandbox name.
2. **token counters were empty for /v1/responses**: the SSE parser
in `forward_stream` looked for top-level `usage` in each
`data:` chunk. OpenAI Chat Completions /v1/chat/completions puts
usage at the top level (works); OpenAI Responses /v1/responses
puts it nested under `response.usage` in the terminating
`response.completed` event (didn't work — captured a real
response.completed event to confirm).
Parser now probes both shapes:
v.get("usage").or_else(|| v.get("response")?.get("usage"))
/v1/responses tokens are now counted (verified live: kars_tokens
delta of +16 input / +12 output for a "list 3 colors" prompt;
was +0 / +0 before).
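For readers following along in Python, the two-shape probe looks roughly like this. It is a sketch of the parsing idea only, not the router's Rust parser, and `extract_usage` is a hypothetical name.

```python
import json
from typing import Optional


def extract_usage(data_line: str) -> Optional[dict]:
    """Pull a usage object out of one SSE `data:` line, if it carries one."""
    if not data_line.startswith("data:"):
        return None
    payload = data_line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    try:
        event = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None

    # Chat Completions shape: usage at the top level.
    usage = event.get("usage")
    if usage is None:
        # Responses shape: usage nested under `response` in the final event.
        response = event.get("response")
        if isinstance(response, dict):
            usage = response.get("usage")
    return usage if isinstance(usage, dict) else None
```

Probing the top level first preserves the existing Chat Completions behavior, and the nested lookup only runs when that probe finds nothing.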
Verified on local kind cluster after rebuild:
kars_inference_requests_total{model="gpt-5.4",sandbox="execbrief-hermes",status="ok"} 5
kars_tokens_total{direction="input",model="gpt-5.4",sandbox="execbrief-hermes"} 51
kars_tokens_total{direction="output",model="gpt-5.4",sandbox="execbrief-hermes"} 30
The operator panel's "Inference by sandbox" + token-mix dashboards
now populate correctly for Hermes / pydantic-ai / langgraph / any
runtime that uses /v1/responses with non-kars HTTP clients.
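The header-then-env fallback from item 1 can be sketched as follows. This is a Python illustration of a Rust helper: the validation rule (a DNS-label style name) and the final `"unknown"` fallback when the env is also unset are both assumptions.

```python
import re
from typing import Mapping, Optional

# Assumption: sandbox names follow Kubernetes DNS-label rules.
_NAME_RE = re.compile(r"^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$")


def resolve_sandbox_name(header_value: Optional[str],
                         environ: Mapping[str, str]) -> str:
    """Prefer a valid x-kars-sandbox header, else the router's own SANDBOX_NAME."""
    if header_value and _NAME_RE.match(header_value):
        return header_value
    # Header missing or invalid: fall back to the identity the controller set.
    return environ.get("SANDBOX_NAME") or "unknown"
```

Centralizing the lookup in one helper is what lets every handler produce the same label for the same request.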
932 router tests + cargo clippy --all-targets -- -D warnings clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tool deny list
Closes the inter-agent comms gap for Python frameworks. Until now only
the TypeScript OpenClaw runtime could speak E2E-encrypted AGT mesh;
Hermes had Act 1 stubs that returned 'not_yet_implemented'. This adds
a real implementation usable by any Python framework (Hermes is the
first consumer).
## What ships
1. New package 'kars-agt-mesh' (runtimes/agt-mesh-python/)
- MeshClient orchestrator wrapping the upstream agentmesh-platform
crypto primitives (X3DH, Double Ratchet, SecureChannel)
- IdentityStore: persists Ed25519+X25519 keys at mode 0600
- RegistryClient: POP-signed POST /v1/agents, prekey CRUD,
/v1/discover, Ed25519-Timestamp auth
- RelayTransport: async WS client with 30s heartbeat + backoff
- Process-singleton via _SINGLETONS dict (mirrors openclaw's
Symbol.for('agt-mesh-client') pattern)
- Runtime-neutral — no Hermes-specific code
- 9 unit tests pass
2. Hermes mesh adapter (runtimes/hermes/.../plugin/mesh.py)
- Replaces Act 1 mesh_stubs.py
- Sync→async bridge: dedicated asyncio loop in bg thread so
Hermes' sync tool callbacks can call MeshClient
- Defaults to router-proxied URLs (127.0.0.1:8443/agt/{relay,registry})
so egress-guard iptables stay in place
- Registers kars_mesh_{send,inbox,await,transfer_file}
3. Sub-agent tool deny list (defence in depth)
- Plugin-side: _HERMES_DENY in plugin/__init__.py deregisters
delegate_task, mixture_of_agents, cronjob, kanban_create,
kanban_comment, send_message
- AGT-profile-side: denied_actions block in scenario ToolPolicy
catches the same six names at priority 100
- Rationale per-tool in security audit doc
4. Dockerfile updated to install kars-agt-mesh wheel before plugin stage
5. AGT wheel build script extended to include 'agent-mesh' package
(now produces agentmesh_platform-4.0.0)
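The sync-to-async bridge mentioned in item 2 follows a common pattern, sketched here under stated assumptions. `AsyncBridge` and `fake_send` are hypothetical stand-ins; the real adapter wraps MeshClient, whose API is not shown in this PR text.

```python
import asyncio
import threading
from typing import Any, Coroutine


class AsyncBridge:
    """Run coroutines on a dedicated event loop living in a background thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, coro: Coroutine[Any, Any, Any], timeout: float = 30.0) -> Any:
        # Schedule on the background loop and block the calling (sync) thread.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5)
        self._loop.close()


async def fake_send(to: str, content: str) -> dict:
    """Stand-in for an async mesh send; not the real MeshClient API."""
    await asyncio.sleep(0)
    return {"to": to, "bytes": len(content)}
```

A synchronous tool callback would then call something like `bridge.call(fake_send("analyst", "hi"))`. Keeping one long-lived loop means a WebSocket opened on it can outlive any single tool call.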
## Live verification on kind-kars-dev
- MeshClient.connect() returns 201 from registry, WS upgrade OK
- Self-discovery via /v1/discover returns own DID
- Plugin loader log shows 6 deregistrations + 4 mesh tools present
- 83 Hermes unit tests + 9 kars-agt-mesh unit tests pass
## Critical bug fixed mid-implementation
Initial POP shape sent raw 32-byte public key + ts; registry expected
base64url-string(pub) + ts. Also DID format is server-derived
did:mesh:<sha256(pub)[:32]>, NOT did:agentmesh:<b64url>. Fixed both
in registry_client.py and identity.py. Memory stored for future
non-TS SDK implementers.
## Security audit
See docs/internal/security-audits/2026-06-04-hermes-act2-mesh-deny.md
(2 sign-offs, ci-gates green).
## Deferred to Act 2.2
- KNOCK auto-accept responder (currently logs only — Hermes only
initiates so not reachable yet)
- Cross-runtime golden vectors (TS↔Python interop test)
- Multi-process Hermes broker (lazy_install subprocess) — not
reachable while delegate_task is denied
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
… working
Lands the protocol-correct fixes needed for MeshClient.connect() →
KNOCK → X3DH → Double Ratchet roundtrip between two sandboxes. Tested
end-to-end on kind-kars-dev with two Hermes pods (execbrief-hermes and
smoke-hermes) on the FRESHLY BUILT image (no hot patches):
- pod A registers, uploads prekey bundle, opens relay WS (with POP)
- pod B does the same
- pod A discovers B via /v1/discover (freshest-first sort)
- pod A fetches B's bundle, runs X3DH, sends KNOCK + first ciphertext
- pod B's _handle_knock_frame auto-accepts via SecureChannel.create_receiver,
decrypts plaintext 'hello from execbrief-hermes'
- pod B replies via send_by_did → encrypted message frame
- pod A decrypts 'pong from smoke-hermes'
## Critical protocol fixes
1. **Relay WS connect-frame POP** (relay_transport.py)
- Was: {type:'connect', from:did, ts:...}
- Now: full proof-of-possession (std-base64 pub_key + iso ts + sig
over ts), per AGT relay/app.py::_verify_connect_pop
- Without this, the relay rejects every connection with
'connect frame missing did/public_key/timestamp/signature'
2. **Registry auth header** (registry_client.py)
- Was: three separate X-Agent-DID/Timestamp/Signature headers,
signature over method+path+ts
- Now: single 'Authorization: Ed25519-Timestamp <did> <ts> <b64url-sig>',
signature over timestamp string only
- Matches AGT registry/app.py::verify_ed25519_timestamp_auth
3. **X3DH bootstrap missing** (client.py)
- Now connect() builds X3DHKeyManager + generates signed_pre_key
+ 10 OTKs + uploads bundle via PUT /v1/agents/{did}/prekeys
- Without this, peers couldn't fetch our bundle, X3DH initiation
would fail at the responder side
4. **KNOCK responder implemented** (client.py::_handle_knock_frame)
- Was: log-only stub ('responder path not implemented')
- Now: parses ChannelEstablishment, calls SecureChannel.create_receiver,
caches the channel, decrypts the bundled first ciphertext,
eagerly tops up the OTK pool for the next session
5. **Send fuses KNOCK + first message** (client.py::send_by_did)
- First call to a new peer DID sends {type:'knock', establishment, ciphertext}
- Subsequent calls send {type:'message', ciphertext}
- Matches the TS SDK wire convention (one RTT, not two)
6. **AAD directionality fix** (client.py)
- Initiator: f'{self_did}|{peer_did}'
- Responder: f'{from_did}|{self_did}' (reconstructs the same bytes)
7. **EncryptedMessage wire format** (client.py)
- Was: JSON of em.__dict__ (would fail at decoder)
- Now: EncryptedMessage.serialize() / .deserialize() (binary + b64url)
8. **PeerBundle flat shape** (registry_client.py + client.py)
- Was: nested dicts mirroring my best-guess wire format
- Now: matches agentmesh.encryption.x3dh.PreKeyBundle's flat dataclass
9. **register_self handles 409 gracefully** (registry_client.py)
- Was: raised MeshRegistryError, blocking every restart
- Now: logs and continues — the subsequent prekey PUT (with
Ed25519-Timestamp auth) proves we own the same key
10. **discover() sorts freshest-first** (registry_client.py)
- Avoids hitting stale ghost-DIDs when a sandbox restarts with
a new identity before the prior registration ages out
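Fixes 5 and 6 are easy to get subtly wrong, so here is a small sketch of both. The function names are hypothetical, UTF-8 encoding of the AAD is an assumption, and real frames may carry more fields than the ones listed above.

```python
from typing import Optional


def aad_for_initiator(self_did: str, peer_did: str) -> bytes:
    """AAD as computed by the side that sends the KNOCK."""
    return f"{self_did}|{peer_did}".encode("utf-8")


def aad_for_responder(from_did: str, self_did: str) -> bytes:
    """AAD as reconstructed by the receiver; must equal the initiator's bytes."""
    return f"{from_did}|{self_did}".encode("utf-8")


def next_frame(open_channels: set, peer_did: str, ciphertext: str,
               establishment: Optional[dict] = None) -> dict:
    """First send to a peer fuses KNOCK + ciphertext; later sends are plain messages."""
    if peer_did in open_channels:
        return {"type": "message", "ciphertext": ciphertext}
    if establishment is None:
        raise ValueError("first send to a new peer needs an X3DH establishment")
    open_channels.add(peer_did)
    return {"type": "knock", "establishment": establishment, "ciphertext": ciphertext}
```

Both sides order the AAD as initiator DID first, responder DID second. If either side swaps the order, authenticated decryption fails even though the keys agree.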
## Tests
- 9 kars-agt-mesh unit tests pass
- 83 Hermes unit tests pass
- Live bidirectional roundtrip verified on freshly-built image
(build hash c1dcdfc11475... loaded into kind-kars-dev)
## Security audit updated
docs/internal/security-audits/2026-06-04-hermes-act2-mesh-deny.md
- Residual risk #1 (no KNOCK responder) removed — now implemented.
- Added residual risk #4 (stale registry entries — non-security).
- Added live bidirectional test description.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
…ug Hermes mesh egress-guard hole
## controller/src/reconciler/mod.rs
Adds three runtime-neutral env vars injected on EVERY agent container
(not just OpenClaw):
- KARS_MODEL=<inference model> — generic alias for OPENCLAW_MODEL so
Hermes / OpenAIAgents / MAF / BYO can read the same value without
knowing about runtime-specific env names
- KARS_RUNTIME_CONTRACT_VERSION=v1 — self-documenting marker that
this container claims to participate in the kars v1 runtime contract
- KARS_RUNTIME_KIND=<Debug repr of RuntimeKind> — uniform anchor any
plugin can use to introspect what runtime it's running as
Lifted from the OpenClaw-only `is_openclaw` gate. All 834 controller
tests still pass.
## runtimes/hermes/.../plugin/mesh.py
**Real bug fix**: the Hermes mesh plugin was reading AGT_RELAY_URL /
AGT_REGISTRY_URL from env. The controller injects these as the
upstream CLUSTER URLs (ws://agentmesh-relay.agentmesh.svc:8765 etc.)
— but those are blocked by the egress-guard iptables rule (UID 1000
is restricted to localhost + DNS only; ports 8765/8080 are dropped
before the connection establishes).
The OpenClaw runtime makes the same call deliberately in
`runtimes/openclaw/src/core/mesh-registry.ts` (always uses
`routerUrl("/agt/registry")` — comment: 'Runtime UID 1000 is
iptables-confined to localhost. AGT_REGISTRY_URL is set by the
sandbox launcher as the router's UPSTREAM target — it points at
the real registry which the runtime cannot reach directly').
Now Hermes does the same: hardcodes 127.0.0.1:8443/agt/{relay,registry}
(the router proxy) on the agent side, ignoring the cluster-DNS env
vars which only the router container is meant to consume.
## Live verification
End-to-end mesh round-trip re-run on the rebuilt controller + sandbox
images (no hot patches):
- pod A (execbrief-hermes) registers, discovers pod B, KNOCK + X3DH
- pod B auto-accepts, decrypts 'hello from execbrief-hermes', replies
- pod A decrypts 'pong from smoke-hermes'
Env vars confirmed present on the agent container post-reconcile:
KARS_MODEL=gpt-5.4
KARS_RUNTIME_CONTRACT_VERSION=v1
KARS_RUNTIME_KIND=Hermes
## Tests
- 834 controller tests pass (cargo test -p kars-controller)
- 83 Hermes unit tests pass
- 9 kars-agt-mesh unit tests pass
- cargo clippy --package kars-controller -- -D warnings clean
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Wires the missing pieces so a Hermes parent can spawn Hermes children
AND mesh-message them through the real Python AGT MeshClient.
Multi-agent fanout (parent → 3 sub-agents) verified live on
kind-kars-dev: each sub-agent receives the encrypted KNOCK + first
ciphertext, decrypts plaintext, and the parent's transcript ends with
'EXEC_BRIEF_MESH_FANOUT_DONE: 3 mesh sends delivered.'
## Bug fixes
### 1. Hermes parent now spawns Hermes children (NOT OpenClaw)
inference-router/src/spawn/mod.rs::build_sub_agent_crd_with_labels
hard-coded `runtime.kind = OpenClaw` for every spawn. Now it:
- Accepts an explicit `runtime_kind` field on SpawnRequest.
- Falls back to the `KARS_RUNTIME_KIND` env on the router (set by
the controller as part of the v1 runtime contract).
- Falls back to "OpenClaw" for backward compat.
Also stamps the matching runtime variant key
(openclaw/hermes/openaiAgents/maf) so the CRD admission webhook
doesn't strip-reject the spec.
Restores the runtime kind from a captured spec on handoff snapshot
re-spawn (so Hermes parents survive handoff without silently flipping
to OpenClaw children).
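The fallback chain and variant-key stamping can be sketched as below. The router code is Rust; this Python version only illustrates the precedence. The mapping between kind names and variant keys is inferred from the names in this PR and may not match the CRD exactly.

```python
from typing import Mapping, Optional

# Assumption: kind names and their spec keys as they appear in this PR.
VARIANT_KEYS = {
    "OpenClaw": "openclaw",
    "Hermes": "hermes",
    "OpenAIAgents": "openaiAgents",
    "MAF": "maf",
}


def resolve_runtime_kind(requested: Optional[str],
                         environ: Mapping[str, str]) -> str:
    """Explicit request field, then KARS_RUNTIME_KIND env, then OpenClaw."""
    return requested or environ.get("KARS_RUNTIME_KIND") or "OpenClaw"


def runtime_spec(kind: str) -> dict:
    """Build the `runtime` stanza with the matching variant key stamped in."""
    spec = {"kind": kind}
    key = VARIANT_KEYS.get(kind)
    if key is not None:
        spec[key] = {}
    return spec
```

Stamping the variant key alongside `kind` matters when the CRD enforces a rule of the form `(self.kind == 'Hermes') == has(self.hermes)`.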
### 2. Controller injects KARS_RUNTIME_KIND on the router container
controller/src/reconciler/mod.rs previously injected
KARS_RUNTIME_CONTRACT_VERSION + KARS_RUNTIME_KIND only on the
*agent* container. Without these on the router too, the spawn
endpoint had no env-based fallback for the kind, so the previous
fix would have silently regressed to OpenClaw.
### 3. Hermes mesh.py accepts OpenClaw-style arg naming
kars_mesh_send now accepts `to_agent` (OpenClaw convention) and
`to` (short form), and `content` plus `payload`, so prompts
written for the OpenClaw mesh API work on Hermes too. Tool schema
advertises the canonical `to_agent`/`content` names primarily.
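A sketch of the alias handling is below. `normalize_send_args` is a hypothetical helper, and the precedence (canonical names win when both spellings are present) is an assumption, not something the plugin is confirmed to implement.

```python
from typing import Tuple


def normalize_send_args(args: dict) -> Tuple[str, str]:
    """Map to_agent/to and content/payload onto one (recipient, content) pair."""
    # Assumption: the canonical name takes precedence over its alias.
    recipient = args.get("to_agent") or args.get("to")
    content = args.get("content")
    if content is None:
        content = args.get("payload")
    if not recipient:
        raise ValueError("kars_mesh_send needs 'to_agent' (or 'to')")
    if content is None:
        raise ValueError("kars_mesh_send needs 'content' (or 'payload')")
    return recipient, content
```

Normalizing once at the tool boundary keeps the rest of the adapter unaware of which naming convention the prompt used.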
### 4. Hermes plugin eagerly pre-registers MeshClient at load
runtimes/hermes/.../plugin/__init__.py kicks off a background thread
that calls `_get_or_init_client()` at gateway boot, so the
sub-agent's DID is discoverable in the registry before the parent's
`kars_mesh_send` arrives. Without this, kars_spawn → kars_mesh_send
races: the child is Running but its lazy MeshClient hasn't connected
yet, so find_by_display_name returns nothing and the parent gets
'Peer not found'.
### 5. Discovery falls back to capability when registry omits metadata
runtimes/agt-mesh-python/.../registry_client.py find_by_display_name
no longer requires `metadata.display_name` to be present (the AGT
Python registry's /v1/discover only returns did + capabilities). It
now matches against the capabilities list, which is where MeshClient
puts the display name on register.
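The lookup with its capability fallback might look like the sketch below. The entry shape (`did`, optional `metadata`, `capabilities`) is taken from the description above; treating the display name as a bare capability string is an assumption, and this is not the package's actual function.

```python
from typing import Iterable, Optional


def find_by_display_name(entries: Iterable[dict], name: str) -> Optional[str]:
    """Return the DID of the first entry matching `name`, or None."""
    for entry in entries:
        metadata = entry.get("metadata") or {}
        if metadata.get("display_name") == name:
            return entry.get("did")
        # Fallback: the registry may return only did + capabilities.
        if name in (entry.get("capabilities") or []):
            return entry.get("did")
    return None
```

Returning the first match means the result depends on input order, so this pairs naturally with a discover call that sorts freshest-first.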
## Harness additions
### tools/e2e-harness/platforms/aks.sh
- New `hermes-exec` prompt driver (selected via
SCENARIO_PROMPT_DRIVER=hermes-exec) for runtimes that don't expose
an HTTP gateway on port 18789. Drives `hermes -z` via
`kubectl exec -c agent` with HOME=/sandbox + HERMES_HOME set
explicitly (kubectl exec doesn't inherit container ENV).
- Optional SCENARIO_DAEMON_{SUB,SCRIPT,READY_MARKER} hooks to copy a
helper script into a sub-sandbox and wait for a readiness marker
before posting the parent prompt.
- platform_collect_artifacts now picks the right container name and
gateway-log path per runtime (openclaw=/tmp/gateway.log,
hermes=/sandbox/.hermes/logs/gateway.log).
### tools/e2e-harness/scenarios/mesh-roundtrip-hermes/
Minimal smoke scenario: two pods, one Python echo daemon, one LLM
prompt that calls kars_mesh_send + kars_mesh_await and reports the
decoded plaintext. Verified end-to-end on freshly-built images.
### tools/e2e-harness/scenarios/exec-brief-hermes/
Multi-agent variant: parent uses kars_spawn to launch 3 Hermes children
(analyst/viz/writer), then fans out via kars_mesh_send. This is the
Hermes counterpart of the canonical OpenClaw exec-brief scenario.
## inference-router/Dockerfile.dev
The canonical Dockerfile is distroless (no shell). The controller's
egress-guard init container runs `sh -c "iptables ..."` which can
only work on an image that has sh + iptables. The .dev variant uses
mcr.microsoft.com/azurelinux/base/core:3.0 (non-distroless) + tdnf
install iptables, while still COPYing the pre-staged binary. Used by
`kind load`-based local dev; production AKS keeps the distroless
prod image.
## Tests
- 83 Hermes unit tests pass.
- 9 kars-agt-mesh unit tests pass.
- 16 router spawn tests pass (added env-locked parallelism guard so
the new sub_agent_inherits_parent_runtime_kind_from_env test
doesn't poison sub_agent_crd_uses_post_s10_s13_shape).
- All 834 controller tests pass.
- cargo clippy --package kars-inference-router -- -D warnings clean.
## Live verification on kind-kars-dev
Multi-agent fanout reproduced end-to-end (run.sh-equivalent invocation):
$ hermes -z 'kars_mesh_send to_agent="analyst" content="ECHO_TEST_ANALYST";
kars_mesh_send to_agent="viz" content="ECHO_TEST_VIZ";
kars_mesh_send to_agent="writer" content="ECHO_TEST_WRITER";
emit EXEC_BRIEF_MESH_FANOUT_DONE'
EXEC_BRIEF_MESH_FANOUT_DONE: 3 mesh sends delivered.
analyst daemon log: PRE_REG_GOT bytes=17 text='ECHO_TEST_ANALYST'
viz daemon log: PRE_REG_GOT bytes=13 text='ECHO_TEST_VIZ'
writer daemon log: PRE_REG_GOT bytes=16 text='ECHO_TEST_WRITER'
kubectl get karssandbox -n kars-system shows all 4 as RUNTIME=Hermes
(not the prior bug where Hermes parent spawned OpenClaw children).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Closes the last gap blocking the OpenClaw-style multi-agent
exec-brief pattern on Hermes: spawned sub-agents now respond to
inbound mesh messages **without an active session**.
## Problem
After Act 2.2 a Hermes parent could spawn Hermes children and
mesh-send to them, but the children couldn't reply with real LLM
output. Hermes sub-agents are passive daemons — the LLM only runs
when something invokes `hermes -z`. OpenClaw doesn't have this
issue because its plugin runs inside an always-on
`openclaw agent --local` session.
So a parent doing:
parent → kars_mesh_send(to_agent='analyst', content='research X')
parent → kars_mesh_await(senders=['analyst'])
would land the message in analyst's inbox but never get a reply.
The analyst's Hermes daemon would just queue the message and sleep.
## Fix
New `runtimes/hermes/.../plugin/mesh_worker.py`: a background
asyncio loop in each sub-agent that:
1. Drains the shared MeshClient inbox.
2. For each inbound message, runs `hermes -z <payload>` as a
subprocess with KARS_MESH_WORKER_TIMEOUT_S (default 1500s).
3. Resolves the sender's display name via the registry.
4. Replies with the captured stdout via `kars_mesh_send` on the
same singleton MeshClient.
Opt-in via `KARS_MESH_AUTO_RESPONDER=1`. The controller sets this
ONLY on Hermes sandboxes that have the
`kars.azure.com/parent` label (i.e. children spawned by another
sandbox via the router's spawn endpoint). The parent never has it
enabled: the parent is driven by the human/external driver and would
otherwise loop on the children's replies.
The plugin's `__init__`'s eager-init thread now also calls
`mesh_worker.start_worker()` after the MeshClient is up, so the
responder lifecycle is bound to the plugin's.
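The worker loop above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `mesh_worker.py`: the inbox is modelled as an `asyncio.Queue` of `(sender, payload)` pairs, and `send` and `build_argv` are hypothetical injection points so the loop can run without a mesh or a Hermes install.

```python
import asyncio
import os

# Default mirrors the 1500s quoted above; the env var name is from the source.
TIMEOUT_S = float(os.environ.get("KARS_MESH_WORKER_TIMEOUT_S", "1500"))


async def run_agent(argv, timeout_s=TIMEOUT_S):
    """Run one command and return its stdout; kill it if it exceeds the timeout."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        out, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        return "TIMEOUT"
    return out.decode(errors="replace").strip()


async def worker(inbox, send, build_argv):
    """Drain the inbox forever: run the agent per message, reply to the sender."""
    while True:
        sender, payload = await inbox.get()
        reply = await run_agent(build_argv(payload))
        await send(sender, reply)
        inbox.task_done()
```

In the real sub-agent, `build_argv` would presumably be `lambda payload: ["hermes", "-z", payload]`, and `send` would go through the shared MeshClient.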
## Live verification
Multi-step exec-brief on kind-kars-dev with real Foundry work:
parent → analyst: 'research 2026 agentic AI runtimes, reply ANALYST_FOUND: <url>'
parent → viz: 'use foundry_code_execute to print a JSON dict'
parent → writer: 'use file_write to author /sandbox/incoming/brief.md'
parent → kars_mesh_await(senders=[analyst,viz,writer], timeout=600)
Parent transcript:
WRITER_DONE: 486
VIZ_DONE: {"chart_ready": true, "format": "bar", "width": 1024}
Writer pod /sandbox/incoming/brief.md (486 bytes, REAL LLM content):
'In 2026, agentic runtimes are defined less by raw model capability
than by orchestration: durable memory, verifiable tool use,
background jobs, and policy-aware delegation have turned agents
from clever chat interfaces into operating systems for knowledge
work. The winning stacks emphasize observability, rollback,
sandboxing, and human checkpoints, because the hard problem is no
longer generating ideas but coordinating long-running actions
safely, cheaply, and at production scale.'
Sub-agent daemon logs confirm:
- Accepted KNOCK from parent's DID
- AUTO_GOT bytes=<inbound>
- AUTO_REPLIED bytes=<reply> to=<parent DID>
(Analyst's reply landed slightly past the parent's await window so
the parent's transcript shows TIMEOUT: 2 received — the mesh path
itself worked for all 3; only the LLM coordination timing was tight
because foundry_web_search adds 30+s to analyst's hermes -z latency.
Verified independently that analyst auto-responded with 16 bytes.)
## Tests
- 83 Hermes unit tests pass
- 9 kars-agt-mesh unit tests pass
- 834 controller tests pass
- 16 router spawn tests pass
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
…reakdown
Two operator-visibility fixes called out during the Act 2.3 live
verification:
## 1. Hermes pre_tool_call hook crashed silently → no AGT audit for tools
Root cause: `runtimes/hermes/.../plugin/governance.py::_on_pre_tool_call`
took positional arg `params`, but Hermes 0.15.2 invokes the hook
with KEYWORD args matching `plugins.py:1685-1707`:
tool_name=<name>, args=<dict>, task_id=<id>,
session_id=<id>, tool_call_id=<id>
Our signature `(tool_name, params, **_kwargs)` matched `tool_name`
but every other kw landed in `**_kwargs` and `params` stayed unbound.
Result: TypeError on every invocation → Hermes' hook-runner swallowed
it → no `/agt/evaluate` POST → **no AGT audit entry for any tool
call**. Operator saw only `inference:responses:gpt-5.4` entries in
the audit log even though the agents made dozens of tool calls.
Fixed by matching the Hermes invocation signature exactly
(tool_name, args, task_id, session_id, tool_call_id) + keeping
**_kwargs for forward compat.
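The failure mode is plain Python argument binding and can be reproduced in isolation. In this sketch the hook bodies are placeholders and the `None` defaults are an assumption; only the parameter names come from the description above.

```python
def old_hook(tool_name, params, **_kwargs):
    """Old signature: `params` is required but Hermes never passes it."""
    return params


def new_hook(tool_name, args=None, task_id=None, session_id=None,
             tool_call_id=None, **_kwargs):
    """New signature: names match the keyword arguments Hermes passes."""
    return args


# Keyword-only invocation, shaped like the call described above.
call = dict(
    tool_name="kars_mesh_send",
    args={"to_agent": "writer"},
    task_id="t1",
    session_id="s1",
    tool_call_id="c1",
)

try:
    old_hook(**call)
except TypeError as exc:
    # `args` and the ids were absorbed by **_kwargs; `params` stayed unbound.
    print("old signature failed:", exc)

print("new signature received:", new_hook(**call))
```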
Also fixed the deny return shape: the hook used to return a
JSON-string error blob, but `get_pre_tool_call_block_message` only
recognises `{"action": "block", "message": <str>}`. Old denies
were logged + ignored — the tool actually ran. New dict-shape denies
make the block actually block.
Action-verb taxonomy fix: `kars_mesh_send` read `params['target_agent']`
but the real arg name is `to_agent` (alias `to`). Action verb
became `mesh:send:` (empty target). Now accepts all three names.
Also added `mesh:inbox` and `mesh:await` verbs for the drain/wait
tools.
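A small sketch of the two fixes, assuming the hook receives the tool arguments as a dict. The helper names and the alias precedence (`to_agent`, then `to`, then `target_agent`) are illustrative assumptions; the block shape is the one quoted above.

```python
def mesh_send_action(args: dict) -> str:
    """Build the audit verb for kars_mesh_send, accepting all three target names."""
    target = (
        args.get("to_agent")
        or args.get("to")
        or args.get("target_agent")
        or ""
    )
    return f"mesh:send:{target}"


def deny(reason: str) -> dict:
    """Return a deny in the dict shape the source says Hermes treats as a block."""
    return {"action": "block", "message": reason}
```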
### Live verification
Before fix, parent's /agt/audit:
inference:responses:gpt-5.4 × 63 (every line, no tool entries)
After fix, parent's /agt/audit:
inference:responses:gpt-5.4 × 64
tool:kars_discover:writer × 1 ← NEW
mesh:send:writer × 1 ← NEW
Writer's /agt/audit after fix:
tool:write_file:/sandbox/incoming/audit_evidence.txt × 1 ← NEW
## 2. Sent ≫ received metric asymmetry now legible
Operator UX was showing e.g. 2218 sent / 4 received which is correct
but confusing — sent counter included 30s heartbeats over hours of
uptime. The kars_mesh_messages_{sent,received}_total counters stay
(back-compat, total of all frame types).
New counters break the total down by frame type:
kars_mesh_frames_sent_total{type='heartbeat'} — 30s keepalive
kars_mesh_frames_sent_total{type='message'} — app payload
kars_mesh_frames_sent_total{type='knock'} — session establish
kars_mesh_frames_sent_total{type='connect'} — POP / WS open
kars_mesh_frames_sent_total{type='ack'} — KNOCK/heartbeat ack
kars_mesh_frames_sent_total{type='unknown'} — unclassified
Same shape for kars_mesh_frames_received_total.
Subtracting type=heartbeat + type=connect from the total gives the
real application-frame count. Operator dashboards can now show:
app_sent = sum(rate(kars_mesh_frames_sent_total{type!~'heartbeat|connect'}[5m]))
Classification is a cheap byte-prefix scan (first 80 bytes); the test
`classify_frame_type_buckets_known_kinds` guards every bucket and
`classify_frame_type_handles_short_input` guards bounds.
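The real classifier lives in the Rust router; the following Python sketch only illustrates the idea of a bounded prefix scan. It assumes frames are JSON text with a `"type"` field near the start, and the two needle spellings are an assumption, not the router's actual matching rules.

```python
KNOWN_TYPES = ("heartbeat", "message", "knock", "connect", "ack")
SCAN_LIMIT = 80  # only the first 80 bytes are inspected


def classify_frame_type(frame: bytes) -> str:
    """Bucket a frame by scanning a short prefix for a known type marker."""
    prefix = frame[:SCAN_LIMIT]  # slicing is safe on short or empty input
    for kind in KNOWN_TYPES:
        for needle in (f'"type":"{kind}"', f'"type": "{kind}"'):
            if needle.encode() in prefix:
                return kind
    return "unknown"
```

Bounding the scan keeps the per-frame cost constant regardless of payload size, at the price of bucketing a frame as `unknown` when its type field sits past the limit.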
## Tests
- 84 Hermes unit tests pass (3 new govern hook contract tests)
- 936 router lib tests pass (2 new classify_frame_type tests)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Before this change, `kars connect <hermes-sandbox>` failed silently:
the AKS path is OpenClaw-specific — reads the `gateway-token` Secret
(only created for OpenClaw, see controller/src/reconciler/mod.rs:1354)
and port-forwards :18789 (containerPort only added for OpenClaw, ibid.
:1852). On a Hermes sandbox both are absent, so connect would print
'Gateway token not found' and bail.
Adds a Hermes-specific branch in cli/src/commands/connect.ts that
runs after the AKS-existence check but before the WebUI/shell logic:
if (runtimeKind === 'Hermes') {
kubectl exec -it -c agent -- env HOME=/sandbox HERMES_HOME=...
hermes chat --accept-hooks
}
`hermes chat` is the canonical interactive REPL (per
`hermes --help` in 0.15.2 — running `hermes` alone prints usage).
`--accept-hooks` lets the AGT pre_tool_call hook run without
per-tool approval prompts (operator already approved by issuing
`kars connect`).
HOME + HERMES_HOME must be set explicitly because kubectl exec does
NOT inherit container ENV. Hermes' `ensure_hermes_home()` falls
back to $HOME/.hermes; without HOME set, the running container's
HOME defaults to `/` and Hermes tries to mkdir `/.hermes` which
ENOENTs on the read-only rootfs. /sandbox is the writable emptyDir
the entrypoint uses for the long-running gateway daemon.
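To make the environment handling concrete, here is a sketch of how the exec argv could be assembled. The CLI itself is TypeScript; this Python helper and its name are hypothetical, and `hermes_home` is left as a required parameter because the source elides its value.

```python
def hermes_exec_argv(pod: str, namespace: str, hermes_home: str) -> list:
    """Build a kubectl exec argv that sets HOME and HERMES_HOME explicitly."""
    return [
        "kubectl", "exec", "-it",
        "-n", namespace,
        pod,
        "-c", "agent",          # Hermes container name, per the source
        "--",
        "env", "HOME=/sandbox", f"HERMES_HOME={hermes_home}",
        "hermes", "chat", "--accept-hooks",
    ]
```

Wrapping the command in `env` sets both variables for the exec'd process without relying on the container's own environment.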
The exec-ban VAP only targets container name `openclaw`; Hermes'
container is `agent` (set in controller/src/reconciler/mod.rs:1801 from
the `is_openclaw` branch), so this is admission-compliant. See
`deploy/helm/kars/templates/admission-pod-exec-ban.yaml`
`matchConditions`.
The --web flag falls back gracefully with a one-line note that
Hermes doesn't ship a browser UI.
The --reset flag works for both runtimes (it's just a rollout
restart). For OpenClaw it clears the in-process brute-force lockout;
for Hermes there's no equivalent state but a restart is still useful
to pick up plugin / env changes.
Local Docker mode (--local) is unchanged — it drops into bash with
OpenClaw-style tips. `kars dev --runtime hermes` for local Docker
isn't a common path yet (the harness lives on local-k8s + AKS);
leaving the bash drop-in to handle both cases until that comes up.
## Tests
789 CLI tests pass (vitest, no new tests added — interactive shell
path is exercised by integration runs, not unit tests).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Restores the 'press Enter on a sandbox row → drop into the agent TUI'
UX the operator had for local OpenClaw, but for Hermes on AKS.
OpenClaw on AKS still uses the port-forward + WebUI URL path because
the exec-ban VAP blocks exec into the openclaw container.
## What changed
cli/src/commands/operator/dialogs/connect.ts splits the Enter handler
by (location × runtime kind):
- AKS + OpenClaw → existing port-forward path (VAP-bound)
- AKS + Hermes → PTY exec into 'agent' container (NEW)
- local Docker + OpenClaw → 'openclaw tui' PTY
- local Docker + Hermes → 'hermes chat --accept-hooks' PTY (NEW)
The two PTY paths share a common _spawnPtyConnect() helper extracted
from the old inline body; the OpenClaw port-forward path is now
_aksOpenClawConnect(). Both are pure refactors: the byte-identical
PTY plumbing (blessed save/restore, raw-mode stdin, Ctrl-\ detach)
moved into the helper, with no functional change for OpenClaw.
## Why this works for Hermes but not OpenClaw on AKS
deploy/helm/kars/templates/admission-pod-exec-ban.yaml has
matchConditions:
expression: object.container == '' || object.container == 'openclaw'
The VAP fires ONLY when the target container is literally named
'openclaw' (or unspecified, which defaults to the first container,
which is 'openclaw' in OpenClaw pods). Hermes' container is named
'agent' (controller/src/reconciler/mod.rs:1801 picks the name from the
is_openclaw branch), so 'kubectl exec -c agent ...' bypasses the VAP
cleanly.
This was a deliberate VAP design: the policy targets the literal
openclaw runtime container, not 'any agent container'. Hermes (and
future runtimes whose container is named 'agent') benefit by design.
## HOME / HERMES_HOME env vars
Set explicitly on the exec because kubectl exec does NOT inherit
container ENV. Without them, Hermes' ensure_hermes_home() falls back
to $HOME/.hermes; since HOME defaults to '/' in kubectl exec sessions,
Hermes tries mkdir '/.hermes' on the read-only rootfs and ENOENTs.
/sandbox is the writable emptyDir the entrypoint daemon uses for the
long-running hermes gateway.
## Tests
- 789 CLI vitest tests pass (no new tests; the interactive PTY path is
exercised by live operator runs, not unit tests).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
…-runtime interop)
Closes the last gap blocking Hermes ↔ OpenClaw mesh communication.
Until this change, the Python kars-agt-mesh library and the TypeScript
@microsoft/agent-governance-sdk produced INCOMPATIBLE relay frames —
Python-Python and TS-TS interop worked fine, but a Python sender
talking to a TS receiver (or vice versa) silently dropped messages.
## Wire-format divergences fixed
### 1. message frame: structured header, std base64
**Before (Python only):**
{
'v': 1, 'type': 'message',
'ciphertext': '<urlsafe-base64 of (struct.pack(">I", header_len) + header + ct)>'
}
**After (matches TS mesh-client.js::send):**
{
'v': 1, 'type': 'message', 'from': ..., 'to': ..., 'id': ..., 'ts': ...,
'header': {
'dh': '<std-base64 dhPublicKey>',
'pn': <previous_chain_length>,
'n': <message_number>
},
'ciphertext': '<std-base64 ciphertext>'
}
The TS receiver reads frame.header.dh / frame.ciphertext as separate
fields; the old Python shape had no .header, so TS-side .base64ToUint8
got an unexpected packed blob and decrypt errored out (silently
dropped at the SDK boundary).
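A sketch of building the new frame shape in Python, assuming the ratchet header values and ciphertext are already available as raw bytes and ints. The function name, the UUID `id`, and the millisecond `ts` are assumptions for illustration; the source elides those values.

```python
import base64
import json
import time
import uuid


def b64std(raw: bytes) -> str:
    """Standard-alphabet base64 ('+' and '/'), as btoa / Buffer produce."""
    return base64.b64encode(raw).decode("ascii")


def build_message_frame(sender, recipient, dh_public_key, pn, n, ciphertext):
    """Serialize a 'message' frame with a structured header and std base64."""
    frame = {
        "v": 1,
        "type": "message",
        "from": sender,
        "to": recipient,
        "id": str(uuid.uuid4()),        # assumption: id format is not given
        "ts": int(time.time() * 1000),  # assumption: timestamp units not given
        "header": {"dh": b64std(dh_public_key), "pn": pn, "n": n},
        "ciphertext": b64std(ciphertext),
    }
    return json.dumps(frame)
```

Keeping `header` and `ciphertext` as separate fields is what lets a receiver read `frame.header.dh` directly instead of unpacking a length-prefixed blob.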
### 2. establishment: short TS-style keys
**Before:** {initiator_identity_key: ..., ephemeral_public_key: ..., used_one_time_key_id: ...}
**After:** {ik: ..., ek: ..., otk: ...} (matches mesh-client.js::serializeEstablishment)
### 3. KNOCK + first message: TWO frames, not one fused
**Before:** Python fused KNOCK + first ciphertext into a single
'type=knock' frame for one-RTT latency. TS receivers do NOT consume
a 'ciphertext' field on a KNOCK — they only read 'establishment',
call acceptSession, then await a separate 'type=message' frame.
→ first ciphertext was lost on Python-to-TS sends.
**After:** Python sends two distinct frames: 'type=knock' (no ciphertext,
just establishment) followed immediately by 'type=message'. Matches
TS mesh-client.js::establishSession + send.
### 4. std-base64 (not urlsafe) on the wire
JS's btoa / Node's Buffer.toString('base64') produce std-base64 with
'+' and '/'. Python's base64.urlsafe_b64encode produces '-' and '_'.
A TS receiver's atob fails on '-'/'_'; a Python receiver's
base64.b64decode fails on '+'/'_' depending on input. Now all on-the-
wire byte strings use std-base64.
## Backwards compat
Receiver tolerates both shapes for one release cycle:
- _message_frame_to_encrypted accepts BOTH the TS shape and the legacy
packed-ciphertext shape (fallback path)
- _wire_to_establishment accepts BOTH {ik,ek,otk} and the legacy
{initiator_identity_key, ephemeral_public_key, used_one_time_key_id}
- _b64std_decode tolerates urlsafe alphabet on input
A fleet mid-upgrade between old/new pods won't drop in-flight messages.
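The tolerant receive path could look roughly like this. Both helpers are sketches under assumptions: the names drop the leading underscore, unpadded input is accepted, and the short keys are preferred over the legacy ones. None of these details are confirmed by the source.

```python
import base64


def b64std_decode(text: str) -> bytes:
    """Decode std base64, also accepting the urlsafe alphabet and no padding."""
    normalized = text.replace("-", "+").replace("_", "/")
    normalized += "=" * (-len(normalized) % 4)
    return base64.b64decode(normalized, validate=True)


def wire_to_establishment(wire: dict) -> dict:
    """Accept both the short {ik, ek, otk} keys and the legacy long keys."""
    def pick(short, legacy):
        # Membership test rather than `or`, so a key id of 0 is kept.
        return wire[short] if short in wire else wire[legacy]

    return {
        "ik": pick("ik", "initiator_identity_key"),
        "ek": pick("ek", "ephemeral_public_key"),
        "otk": pick("otk", "used_one_time_key_id"),
    }
```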
## Live verification
Sent {b'WIRE_TEST_DIRECT', 16 bytes} parent → analyst via direct
asyncio script with PYTHONPATH pointing at hot-patched client.py.
Parent stderr:
> TEXT '{"v": 1, "type": "knock", "from": "did:mesh:a61...", "establishment": {"ik":..., "ek":..., "otk": 20}}'
> TEXT '{"v": 1, "type": "message", ..., "header": {"dh":..., "pn":0, "n":0}, "ciphertext": "..."}'
Analyst auto_responder.log:
Accepted KNOCK from did:mesh:a61c9cbf...
AUTO_GOT from=did:mesh:a61c9cbf... bytes=16
AUTO_REPLIED bytes=16 to=did:mesh:a61c9cbf...
The 16-byte payload decrypted correctly with the TS-compatible shape.
## Tests
- 8 new wire-format unit tests pin every field-shape contract
- 9 existing kars-agt-mesh unit tests still pass
## Cross-runtime promise
With this commit, a Hermes agent CAN mesh-send to an OpenClaw agent
and vice versa (same relay, same registry, same crypto, now same
wire envelope). End-to-end interop verification on a mixed-runtime
cluster ships as a follow-up — the wire alignment is the prerequisite.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pal Allakatos <pallakatos@microsoft.com>
Fresh-machine `kars up exec-hermes` (or `kars dev --target local-k8s`)
got 7/13 steps in before failing with:
Deploying agentmesh-agt (relay + registry) into kind…
local-k8s dev failed: AGT Dockerfile not found at
/Users/<user>/agent-governance-toolkit/agent-governance-python/
agent-mesh/docker/Dockerfile
Clone it: git clone …
Root cause: `cli/src/lib/agt-bootstrap.ts::ensureAgtRepo()` exists
and auto-clones the pinned AGT fork into ~/agent-governance-toolkit
(honors $KARS_AGT_REPO + --agt-repo), and is wired in BOTH
`cli/src/commands/up.ts` (line ~617) and `cli/src/commands/dev.ts`
(line ~652). But `cli/src/commands/dev/local-k8s.ts::runLocalK8s()`
— which is what `kars up` falls through to when there's no AKS
context AND what `kars dev --target local-k8s` invokes directly —
never called ensureAgtRepo. So a fresh-clone user blew up at step 7.
Fix:
1. Import `ensureAgtRepo` + `ensureAgtWheels` from
`../../lib/agt-bootstrap.js`.
2. After credential loading and before kind-cluster bringup, when
mesh is enabled (default) AND no external `globalRegistry` is
supplied, call ensureAgtRepo(opts.agtRepo, repoRoot) → mutate
opts.agtRepo with the resolved path so the downstream
rebuildDevImages() and deployAgentMesh() see a valid checkout
even when the user didn't pass --agt-repo or set
$KARS_AGT_REPO. Same call ALSO triggers ensureAgtWheels() so
runtimes/wheels/ is populated for the Hermes / Anthropic /
Pydantic AI / etc. Python sandbox image builds (the wheel
directory is .gitignored and the Dockerfiles COPY from it).
3. Bump stepper totalSteps 13 → 14 to account for the new step.
4. Fail-fast error message points at three escape hatches
(--agt-repo / $KARS_AGT_REPO / --no-mesh) for environments
where auto-clone can't reach github.com (offline CI, etc.).
Mirrors the same call pattern as up.ts:617 and dev.ts:652 — these
three are now the canonical AGT-bootstrap entry points; keep them
in sync or fresh-machine OOTB breaks again on whichever one drifts.
Verified:
• npm run build → clean
• npm run typecheck → clean
• npm run lint → no new warnings
• vitest → 789 tests passing (39 files)
• ci/check-loc.sh → clean (file not budgeted; the change adds only
~45 LOC to a ~2754-LOC file, well under the
800-LOC new-file cap that would apply if
a future budget entry is added)
Fresh-machine `kars up exec-hermes` (or `kars dev --target local-k8s`)
on a Mac M-series died at:
[stage-2 7/21] RUN ... curl -fsSL "https://github.com/cli/cli/releases/download/v2.89.0/gh_2.89.0_linux_arm64.tar.gz" ...
curl: (22) The requested URL returned error: 504
Docker Desktop's networking VM on macOS is notoriously flaky with
github.com (and any external GET); a single 5xx blew up an
already-9-minutes-in image build, leaving the user with a partial
kind cluster and no easy way to resume.
Hardens all 6 external curls in sandbox-images/openclaw/Dockerfile.base
with an identical retry policy:
--retry 5: up to 5 retries before failing
--retry-delay 3: fixed 3-second delay between retries (setting it
replaces curl's default exponential backoff)
--retry-all-errors: retry on HTTP 4xx/5xx too (not just network
errors); covers the 504 case
--connect-timeout 15: fail fast on dead routes so retries don't all
hang for 5 minutes each
Endpoints affected (all are versioned release artifacts that never
change, so retries are safe):
- nodejs.org/dist/v22.22.3/... (lines 19, 193; two stages)
- github.com/cli/cli/... (line 223)
- github.com/BurntSushi/ripgrep/... (line 229)
- cache.agilebits.com/dist/1P/... (line 237)
- github.com/pimalaya/himalaya/... (line 243)
Total worst-case extra time on a healthy network: ~0s (first try
succeeds). On a flaky network: ~45s per retried download instead of a
hard failure that wastes the whole build context.
…refresh_in
User on fresh `kars dev exec-claw` saw the WebUI come up fine,
then ~30 min later every chat-completions call started returning:
WARN inference-router::proxy:
sandbox=palkarstop-... status=401
body="IDE token expired: unauthorized: token expired"
Root cause: GitHub's /copilot_internal/v2/token response returns
both `expires_at` (Copilot's hard expiry) and `refresh_in` (a
softer hint, typically ~1500s). The old cache only tracked the
refresh hint, computing:
refresh_at = now + refresh_in
and serving the cached JWT for as long as `refresh_at > Instant::now()`.
When GitHub returns `refresh_in > (expires_at - now)` — which
happens during token rotation windows and in some account-state
edge cases — the cache kept handing out a JWT whose `expires_at`
had already passed. Copilot then rejected every request with
the verbatim 401 body above until the (longer) refresh window
finally elapsed.
Fix:
• CachedJwt now stores BOTH refresh_at AND expires_at as
Instants. The fast-path serves only when:
refresh_at > now AND expires_at > now + REFRESH_BUFFER
• get_jwt_with_base() forces a re-exchange whenever either
boundary has been crossed, so a stale JWT is never served
past Copilot's hard expiry minus a 60s safety buffer.
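The freshness rule can be sketched in a few lines. The router implements this in Rust with `Instant`s; this Python version uses plain seconds, and the names `is_fresh` and `cache_entry` are illustrative rather than the router's API.

```python
from dataclasses import dataclass

REFRESH_BUFFER_S = 60.0  # the 60s safety buffer mentioned above


@dataclass
class CachedJwt:
    token: str
    refresh_at: float  # soft boundary: now + refresh_in
    expires_at: float  # hard boundary reported by the token endpoint

    def is_fresh(self, now: float) -> bool:
        """Serve from cache only while BOTH boundaries are still ahead."""
        return self.refresh_at > now and self.expires_at > now + REFRESH_BUFFER_S


def cache_entry(token: str, refresh_in: float, expires_in: float, now: float) -> CachedJwt:
    """Build a cache entry from relative lifetimes measured at `now`."""
    return CachedJwt(token, now + refresh_in, now + expires_in)
```

With `refresh_in=1500` and a hard lifetime of 600 seconds, the entry stops being served at 540 seconds, well before the refresh hint alone would have triggered a re-exchange.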
Added regression test
`refreshes_when_expires_at_passes_even_if_refresh_in_is_longer`
that mints a cache entry with expires_in=30s + refresh_in=1500s,
synthetically ages it past expires_at, and asserts the next call
triggers a fresh exchange (the second mock returns "second-token";
the test fails if the cache still serves "first-token").
cargo test -p kars-inference-router --lib copilot_auth
test copilot_auth::tests::errors_when_no_token_configured ... ok
test copilot_auth::tests::surfaces_upstream_errors ... ok
test copilot_auth::tests::exchanges_and_caches_token ... ok
test copilot_auth::tests::refreshes_when_expires_at_passes_even_if_refresh_in_is_longer ... ok
test result: ok. 4 passed; 0 failed
…" warn
User on a fresh `kars dev` with `claude-opus-4.7` (the Copilot picker default)
ran into the verbatim Copilot rate-limit body:
status: 503
body: "Sorry, the upstream model provider is currently experiencing
high demand. Please try another model."
The router already retries 5xx + 429 against InferencePolicy.spec.
modelPreference.fallback[] (inference-router/src/failover.rs
::is_failover_trigger), but the auto-generated InferencePolicy from
`kars dev` emitted fallback_count=0, so the failover walk had nothing
to walk and the throttle response surfaced directly to the WebUI.
Fix has two parts:
1. CLI side — cli/src/github-copilot.ts::buildCopilotFallbackChain
Picks a same-Copilot, cross-family chain so at least one model
almost always has quota. Static ordering for debuggability:
[gpt-5, claude-sonnet-4.5, gemini-2.5-pro, gpt-5-mini,
claude-haiku-4.5, gpt-4.1]
minus the picked one (which stays primary; we never reorder
behind the user's back).
cli/src/commands/dev/local-k8s.ts::autoCreateSandbox now appends
the chain to the InferencePolicy YAML whenever
creds.provider === "github-copilot" — Foundry / GH-Models paths
don't get an auto-chain because they're single-deployment by
definition.
New tests in cli/src/github-copilot.test.ts (6 cases) gate:
- picked model always first
- picked model never appears in fallbacks (dedup)
- non-empty chain for the recommended default
- byte-identical chain between invocations (debuggability)
- every emitted id exists in COPILOT_MODELS (no typos)
- both Anthropic AND OpenAI entries regardless of pick
2. Router side — inference-router/src/inference_policy_loader.rs
The startup "InferencePolicy loaded" line now includes the full
`fallback_chain` (not just `fallback_count`) so ops can correlate
a 503-then-200 sequence with the configured order.
ALSO emits a one-shot WARN at load time when fallback is empty:
WARN InferencePolicy has no fallback chain — 5xx/429 on the
primary deployment will surface directly to the agent
(no router-side failover). Add spec.modelPreference.
fallback[] in the InferencePolicy CR.
Surfaces the gap loudly in the router log so operators don't
have to dig for "fallback_count":0 in a JSON line and realize
what it means — especially important for hand-rolled
InferencePolicy CRs (the auto-generated ones now always
include a chain on the Copilot path, but operator-authored
ones might not).
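The CLI-side chain construction from part 1 can be sketched as below. The real helper is TypeScript; this Python version assumes the returned list starts with the picked model followed by the static order with the pick removed, which is one reading of "picked model always first".

```python
# Static order quoted in the commit message above.
STATIC_CHAIN = (
    "gpt-5",
    "claude-sonnet-4.5",
    "gemini-2.5-pro",
    "gpt-5-mini",
    "claude-haiku-4.5",
    "gpt-4.1",
)


def build_copilot_fallback_chain(picked: str) -> list:
    """Return [picked, *fallbacks]: static order, never repeating the pick."""
    return [picked] + [model for model in STATIC_CHAIN if model != picked]
```

A fixed tuple rather than a quota-aware ranking keeps the output identical across invocations, which is the debuggability property the tests gate on.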
Verified:
• cli: npm run build + typecheck clean; vitest 795 tests pass
• cli: 6 new github-copilot.test.ts cases all green
• router: cargo build/clippy clean
• router: 16 inference_policy_loader tests still pass
• cargo fmt clean; ci/check-loc.sh clean
User on fresh `kars dev` saw the openclaw agent eventually report:
"Egress proxy is still timing out — I'll proceed with what I know
and flag anything that needs a fresh check."
with NO corresponding error line on the router side at the default
log level, because the router never returned to the client to log
anything.
Root cause: inference-router/src/forward_proxy.rs::handle_connect,
::handle_http, and ::handle_tls_redirect each called
let upstream = match TcpStream::connect(&resolved).await { ... }
unguarded. On Linux, TcpStream::connect blocks for the full kernel
SYN-retransmit window (~60-180s depending on net.ipv4.tcp_syn_retries)
when the destination IP silently drops the SYN — common when:
• the IP is behind a flaky CDN
• the destination is unreachable from kind on Mac Docker Desktop's
network namespace
• the iptables-redirected TLS target's IP roams during the request
While that connect blocks, the sandbox agent has nothing to show
the user except "still timing out", and the only router log was at
`debug!` level which was filtered out by default.
Fix:
• New `UPSTREAM_CONNECT_TIMEOUT = 10s` constant + a
`connect_with_timeout(addr)` helper that wraps `TcpStream::connect`
in `tokio::time::timeout()`. Returns an `io::Error` of kind
`TimedOut` with a clear message ("connect timeout after 10s") when
the SYN handshake hangs.
• All three call sites swapped to the helper. The visible-failure
contract is the same (`502 Bad Gateway` to the client) but now
bounded at 10s instead of 60-180s.
• Bumped the upstream-failed log lines from `debug!` to `warn!` and
enriched them with the domain/dest fields so the failure is
visible in the default log filter (was effectively invisible
before).
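The same bounded-connect idea, sketched with asyncio in place of tokio. This illustrates the technique and is not the router's code: the helper name and the 10s constant mirror the description above, while the error type and message are assumptions.

```python
import asyncio

UPSTREAM_CONNECT_TIMEOUT_S = 10.0


async def connect_with_timeout(host, port, timeout_s=UPSTREAM_CONNECT_TIMEOUT_S):
    """Open a TCP connection, giving up after timeout_s instead of waiting
    for the kernel's SYN-retransmit window to run out."""
    try:
        return await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        raise TimeoutError(f"connect timeout after {timeout_s:g}s") from None
```

The caller still sees a connect failure either way; the wrapper only caps how long that failure takes to surface.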
Regression tests added in inference-router/src/forward_proxy.rs:
• `connect_with_timeout_aborts_on_blackhole_ip`: asserts that a
connect to a TEST-NET-2 (RFC 5737) address fails within 15s (the 10s
timeout plus slack for CI jitter). Pre-fix this would take the full
kernel SYN-retransmit window.
• `connect_with_timeout_succeeds_on_local_listener` — sanity check
that the wrapper doesn't break the happy path.
$ cargo test -p kars-inference-router --lib forward_proxy
test forward_proxy::tests::connect_with_timeout_succeeds_on_local_listener ... ok
test forward_proxy::tests::connect_with_timeout_aborts_on_blackhole_ip ... ok
test result: ok. 2 passed; 0 failed; finished in 10.00s
…ime only)
RustSec published advisory RUSTSEC-2026-0173 on 2026-06-07 (one day
before this commit) flagging `proc-macro-error2 2.0.1` as
unmaintained. cargo-audit + cargo-deny CI gates went red on every
push because the advisory feed pulls fresh on every run.
Dependency chain (controller-only, build-time):
oci-client 0.16.1 + 0.15.0 → oci-spec → getset 0.1.6 (proc-macro)
→ proc-macro-error2 2.0.1
Same rationale as the existing RUSTSEC-2024-0370 ignore for the
sibling crate `proc-macro-error` 1.x — this is a proc-macro crate
that runs at compile time inside rustc, NOT at runtime in the
controller binary. There is no runtime attack surface from an
unmaintained proc-macro crate beyond the build toolchain itself.
No safe upgrade path exists yet: `getset` 0.1.6 is the latest
release and pins `proc-macro-error2`. Upstream `oci-client` would
need to upgrade past `getset 0.1.6` (which would itself need to
swap proc-macro-error2 for proc-macro-error3 or inline its own
diagnostics). TODO comment added at both call sites to drop the
ignore when that lands.
Added the ignore to:
• .cargo/audit.toml — picked up by `cargo audit` (the Rust
Dependency Audit CI job).
• deny.toml — picked up by `cargo deny check advisories`
(the Rust Supply-Chain Gate CI job).
Verified locally:
cargo audit ⇒ clean (no warnings)
cargo deny check ⇒ advisories ok
Three product gaps a user reported in the operator (`n` spawn dialog):
1. **Hermes missing from the runtime picker.** The operator dialog
hardcoded its own array of 7 runtimes (cli/src/commands/operator/
dialogs/spawn.ts ::runtimeOpts), which drifted from the actual
WIRED_KINDS list in cli/src/runtime.ts as soon as Hermes shipped.
Result: the docs say "Hermes is supported", the user pushes `n`,
but Hermes isn't an option.
Fix: extract a new `wiredRuntimeFlags()` helper in runtime.ts that
inverts FLAG_TO_KIND and walks WIRED_KINDS in order, then have the
spawn dialog call it directly. Now WIRED_KINDS is the single source
of truth — any new wired runtime shows up in the picker
automatically with zero per-runtime edits.
2. **Channels listed as "OpenClaw only" everywhere — but Hermes
supports them too.** sandbox-images/hermes/entrypoint.sh:266+
already translates TELEGRAM_BOT_TOKEN / SLACK_BOT_TOKEN /
DISCORD_BOT_TOKEN into `hermes config set channels.*.token`, so
the gating was just wrong copy.
Fixes:
- spawn.ts: introduce `channelCapableRuntimes = Set(["openclaw",
"hermes"])` and use it everywhere the dialog used to check
`state.runtime === "openclaw"`. The channel/token/allowfrom
fields now light up for Hermes too, and switching to a
non-channel runtime clears the selection as before.
- add.ts: rewrite the `--channels`/`--telegram-*`/`--slack-*`/
`--discord-*` help text from "[OpenClaw only]" to
"[OpenClaw + Hermes]". Skills + API keys stay "[OpenClaw only]"
because those wire via OpenClaw's plugin.allow list (not Hermes).
Updated the "Flag groups" --help-text block to match.
3. **Copilot models with Hermes — verified, no code change needed.**
sandbox-images/hermes/entrypoint.sh:83 pins OPENAI_BASE_URL at the
in-pod router, and lines 99-111 already case on KARS_PROVIDER to
set HERMES_DEFAULT_PROVIDER=openai when the operator picked
github-copilot in `kars dev`. The router-side fixes from this PR
(Copilot IDE-JWT cache w/ expires_at — 6886415; Copilot fallback
chain on 503 — 6dca0f8) apply to Hermes through the same router.
Contract test in runtime.test.ts pins this down so it can't drift
again:
• wiredRuntimeFlags() returns kebab flags that all round-trip
through flagToKind() to wired RuntimeKinds.
• The returned set includes every known wired runtime (Hermes in
particular).
• Two calls return byte-identical arrays (deterministic ordering
for left/right picker cursor UX).
Also retroactively added Hermes to the existing assertRuntimeWired
"accepts every wired runtime" test — was missing.
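The single-source-of-truth helper can be sketched as an inversion of the flag map walked in wired order. The real code is TypeScript in cli/src/runtime.ts; the map contents below are hypothetical placeholders, and only the names FLAG_TO_KIND and WIRED_KINDS come from the source.

```python
# Hypothetical contents; the real tables live in cli/src/runtime.ts.
FLAG_TO_KIND = {
    "openclaw": "OpenClaw",
    "hermes": "Hermes",
    "langgraph": "LangGraph",
}
WIRED_KINDS = ["OpenClaw", "Hermes"]


def wired_runtime_flags() -> list:
    """Invert FLAG_TO_KIND and emit flags in WIRED_KINDS order."""
    kind_to_flag = {kind: flag for flag, kind in FLAG_TO_KIND.items()}
    return [kind_to_flag[kind] for kind in WIRED_KINDS if kind in kind_to_flag]
```

Deriving the picker options from WIRED_KINDS means a newly wired runtime appears without touching the dialog, and the ordering stays deterministic because it follows the list rather than the map.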
Verified:
npm run build ⇒ clean
npm run typecheck ⇒ clean
vitest run ⇒ 798 passed (was 795 → +3 new contract tests)
User reported "spawn fail on local k8s" after picking Hermes in the
operator's `n` spawn dialog. Root cause: cli/src/commands/dev/local-k8s.ts
::runLocalK8s only loaded the 3 helm-chart-pinned images
(kars-sandbox, kars-controller, kars-inference-router) into kind.
The operator-spawned Hermes pod resolves to
`karsacr.azurecr.io/kars-runtime-hermes:latest` (per controller/src/
reconciler/runtime.rs::DEFAULT_HERMES_IMAGE) — which doesn't exist
in kind, so the pod ImagePullBackOffs (kindnet can't reach ACR
without auth).
Fix: extend the image-load loop to ALSO attempt
`karsacr.azurecr.io/kars-runtime-hermes:latest` with aliases
[`kars-runtime-hermes:latest`, `kars-runtime-hermes:dev`]. The
existing `loadImageIfPresent` helper handles the missing-locally
case gracefully — when the host hasn't built the runtime image yet,
the function returns `{loaded: false}` without throwing.
Runtime images are separated from core images in the missing-image
warning path:
• Core 3 missing → yellow warning (deployment will fail).
• Runtime missing → dim notice with the exact `docker build`
command to fix:
docker build -t karsacr.azurecr.io/kars-runtime-hermes:latest \
-f sandbox-images/hermes/Dockerfile .
Then re-run `kars dev --target local-k8s --build` to load it.
Only Hermes auto-loads (not Anthropic/LangGraph/MAF/etc.) — those
runtimes stay opt-in to keep `kars dev` startup fast. Hermes is the
only non-OpenClaw runtime productized + verified in this PR, so it's
the one users will hit first.
Verified:
npm run build ⇒ clean
npm run typecheck ⇒ clean
vitest run ⇒ 798 passed
User report: operator's `n`/spawn dialog showed "✓ Spawned <name>" but
no pod ever appeared in the agent table — they had to check Headlamp
to find an ImagePullBackOff. Same pattern would hit CrashLoopBackOff,
ErrImageNeverPull, OOMKilled, etc.
Root cause in cli/src/commands/add.ts: the 120s wait loop polls for
`containerStatuses[*].ready` to include "true"; if the wait times out,
the code unconditionally calls `spinner.succeed(...)`, exits 0, prints
`(may still be starting)`. The operator's spawn dialog only logs
`{red-fg}✗ Spawn fail{/}` when execa rejects (non-zero exit), so a
stuck pod silently passed as success.
Fix: when the wait loop times out, query containerStatuses one last
time and check for unambiguous failure modes:
• ImagePullBackOff / ErrImageNeverPull / ErrImagePull /
InvalidImageName / CreateContainerConfigError (image side)
• CrashLoopBackOff with restarts >= 2 (runtime side)
• Last terminated state of OOMKilled / Error
If found:
• spinner.fail() with the container name + reason
• Print the kubectl describe/logs commands that surface the full
cause
• For image-pull failures specifically, print the docker-build +
kind-load commands (local-k8s case is by far the most common
cause — runtime images that aren't loaded into the kind cluster)
• process.exit(1) so the operator's spawn dialog sees the failure
and logs `✗ Spawn fail: <reason>` in its activity log
If NO unambiguous failure mode found (e.g. genuinely still pulling),
keep the original informational success so existing scripts aren't
broken.
This makes the operator UX honest: spawning a Hermes pod when the
runtime image isn't loaded into kind now shows an actual error
pointing at the build command, not a green checkmark that the user
has to debug via Headlamp.
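The timeout check described in this commit could look roughly like the sketch below. The reason strings and the `restarts >= 2` threshold come from the commit message; the `ContainerStatus` subset follows the Kubernetes pod status JSON, and `SpawnFailure` and `findUnambiguousFailure` are hypothetical names, not the ones used in add.ts.

```typescript
// Minimal subset of the Kubernetes ContainerStatus fields used here.
interface ContainerStatus {
  name: string;
  restartCount: number;
  state?: { waiting?: { reason?: string } };
  lastState?: { terminated?: { reason?: string } };
}

// Image-side waiting reasons listed in the commit message.
const IMAGE_FAILURES = new Set([
  "ImagePullBackOff",
  "ErrImageNeverPull",
  "ErrImagePull",
  "InvalidImageName",
  "CreateContainerConfigError",
]);

// Hypothetical result type for illustration.
interface SpawnFailure {
  container: string;
  reason: string;
  imageSide: boolean;
}

// Returns the first unambiguous failure, or null if the pod may still be starting.
function findUnambiguousFailure(statuses: ContainerStatus[]): SpawnFailure | null {
  for (const cs of statuses) {
    const waiting = cs.state?.waiting?.reason;
    if (waiting !== undefined && IMAGE_FAILURES.has(waiting)) {
      return { container: cs.name, reason: waiting, imageSide: true };
    }
    if (waiting === "CrashLoopBackOff" && cs.restartCount >= 2) {
      return { container: cs.name, reason: waiting, imageSide: false };
    }
    const terminated = cs.lastState?.terminated?.reason;
    if (terminated === "OOMKilled" || terminated === "Error") {
      return { container: cs.name, reason: terminated, imageSide: false };
    }
  }
  return null;
}
```

A `null` result maps to the original informational success; any other result maps to `spinner.fail()` plus a non-zero exit.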
Verified:
npm run build + typecheck ⇒ clean
vitest run ⇒ 798 passed
User report:
> operator says "✓ Spawned" then nothing visible
> kubectl get karssandbox -A confirms the CR was never created
Two compounding silent-failure bugs:
1. kars add was log-then-exit-0 on caught errors.
The outer catch at cli/src/commands/add.ts line 601 (was: 531)
handled every exception by calling spinner.fail() + console.error()
and then RETURNING — letting Node exit 0 naturally. So
`kubectl apply -f -` failing (CRD missing, wrong context, schema
rejection on the bundle, etc.) surfaced as a clean exit code to
any caller. Operator's `execa("kars", args, { stdio: "pipe" })`
only logs `✗ Spawn fail` when execa REJECTS, so silent exit-0
masked every kars-add failure mode behind a green checkmark.
Fix: add `process.exit(1)` after the error logs. Preserves all
the existing error-message branching (controller-not-installed
hint, generic error text) — just stops lying about exit status.
2. Operator's spawn dialog was throwing away the real error text.
Previously logged only `(e.stderr || e.message)?.substring(0, 200)`
— execa's `.message` is usually `Command failed with exit code 1:
kars add ...`, NOT the underlying kars-add stderr. So even after
fix #1, the operator log would show "✗ Spawn fail: Command failed
with exit code 1: kars add testhermes --runtime hermes ..." with
no actual root cause.
Fix: prefer e.stderr (now populated thanks to fix #1) over
e.message, strip ANSI colour codes that kars add emits via chalk,
filter empty lines, keep the last 4 (which is where spinner.fail
+ error hints live), join with " | ", cap at 400 chars. Activity
log now shows e.g.:
✗ Spawn fail: Failed to create sandbox | Error: kubectl error:
KarsSandbox.kars.azure.com "testhermes" is invalid: spec.hermes:
Invalid value: ... | Connect: kars connect testhermes
Also: on SUCCESS, echo the last 3 lines of stdout (the
"Namespace / Model / Status / Connect" hints kars add prints) so
the operator sees useful follow-up info inline.
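The stderr clean-up in fix 2 can be sketched as a small pure function. The steps (prefer stderr, strip ANSI, drop empty lines, keep the last 4, join with " | ", cap at 400 chars) are taken from the commit message; the function name `summarizeStderr` and the exact ANSI regex are assumptions for illustration.

```typescript
// Matches CSI escape sequences such as "\u001b[31m" (the colour codes chalk emits).
const ANSI_PATTERN = /\u001b\[[0-9;]*[A-Za-z]/g;

// Hypothetical helper: condenses a failed command's output into one log line.
function summarizeStderr(stderr: string, message: string): string {
  // Prefer the child's stderr; fall back to the wrapper's message.
  const raw = stderr.trim().length > 0 ? stderr : message;
  return raw
    .replace(ANSI_PATTERN, "")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .slice(-4) // the last lines are where spinner.fail + hints live
    .join(" | ")
    .substring(0, 400);
}
```

Taking the tail rather than the head of stderr matters here because the failure summary and hints are printed last.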
Verified:
npm run build + typecheck ⇒ clean
vitest run ⇒ 798 passed
User OOTB story:
1. Ran `kars dev` from an earlier (pre-Hermes) branch — chart
installed an older KarsSandbox CRD into kind.
2. Pulled the hermes branch + `cd cli && npm run build` to update
the CLI binary.
3. Did NOT re-run `kars dev` (no apparent need — cluster was up).
4. `kars operator` → `n` → Hermes → Launch
5. `kars add` bundle had `spec.runtime.hermes`; cluster CRD didn't
know that field; kubectl apply rejected with:
ValidationError(KarsSandbox.spec.runtime):
unknown field "hermes" in com.azure.kars.v1alpha1...
The error message was technically accurate but the FIX was buried
behind "what does this even mean" digging. Operator activity log
showed the rejection text but didn't tell the user it's a one-shot
CRD refresh away.
Fix: cli/src/commands/add.ts catch block now detects the three
patterns kubectl emits on stale-CRD rejection ("unknown field",
"Unsupported value", "ValidationError") and prints the exact
chart-template + server-side-apply incantation:
This looks like a CRD schema mismatch — the cluster's KarsSandbox
CRD is older than your local CLI/sources. Refresh the chart:
helm template kars deploy/helm/kars --namespace kars-system \
--include-crds | kubectl apply -f - --server-side --force-conflicts
Or just re-run `kars dev --target local-k8s` — its chart-install
step always refreshes CRDs to the source-of-truth schema.
Doesn't change the behavior of the CRD path itself (kars dev's
helmInstall already does the right thing — see
cli/src/commands/dev/local-k8s.ts:778); this only makes the
diagnostic explicit when an operator-driven kars-add hits the
known stale-CRD failure mode.
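The stale-CRD detection amounts to a substring match on the kubectl error text. The three patterns below are the ones named in the commit message; `looksLikeStaleCrd` and `staleCrdHint` are hypothetical names, and the hint wording is abbreviated for illustration.

```typescript
// Substrings kubectl emits when the cluster CRD rejects a newer bundle.
const STALE_CRD_PATTERNS = ["unknown field", "Unsupported value", "ValidationError"];

// True when the error text matches any known stale-CRD pattern.
function looksLikeStaleCrd(errorText: string): boolean {
  return STALE_CRD_PATTERNS.some((pattern) => errorText.includes(pattern));
}

// Returns a remediation hint, or null when the error is unrelated.
function staleCrdHint(errorText: string): string | null {
  if (!looksLikeStaleCrd(errorText)) return null;
  return "This looks like a CRD schema mismatch. Re-run `kars dev --target local-k8s` to refresh CRDs.";
}
```

A plain substring match is deliberately loose: a false positive only prints an extra hint, while a false negative leaves the user with the raw kubectl rejection.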
Verified:
npm run build + typecheck ⇒ clean
vitest run ⇒ 798 passed
User OOTB on local-k8s after spawning a Hermes sandbox from operator:
Failed to pull image "karsacr.azurecr.io/kars-runtime-hermes:latest":
dial tcp: lookup karsacr.azurecr.io on 192.168.65.254:53: no such host
kind nodes don't have ACR pull creds (and shouldn't, because the user
isn't on an Azure cluster). Last commit (ea59a9f) added auto-LOAD of the
image into kind via loadImageIfPresent, but loading requires the image
to exist on the host first, and the host doesn't have it unless the
user knows the docker-build command.
Fix: extend rebuildDevImages in cli/src/commands/dev/local-k8s.ts with
a new "runtime-hermes" build spec. It builds
karsacr.azurecr.io/kars-runtime-hermes:latest from
sandbox-images/hermes/Dockerfile against the repo root context. The
target tag matches DEFAULT_HERMES_IMAGE in the controller, so:
1. The auto-load step in runLocalK8s now finds the image and
`kind load`s it into the cluster.
2. When the operator spawns a Hermes sandbox, the controller's image
string resolves to the already-loaded image → ImagePullPolicy
IfNotPresent + image-present → no pull attempt, no
DNS-to-ACR-from-kind failure.
Cost:
• First build: 3-5 min (Python pip install + Hermes pip install +
ripgrep/op binaries). The Dockerfile COPYs runtimes/wheels/ which
`ensureAgtWheels` already populated at the top of runLocalK8s, so
no wheel-build blocking.
• Subsequent runs: docker layer cache hit, < 10 sec.
Honors the same `forceAll` flag as the other dev specs. Skips silently
if sandbox-images/hermes/Dockerfile is missing (e.g. older checkout) so
it can't break the openclaw-only path.
Hermes is the only non-OpenClaw runtime auto-built; other runtimes
(Anthropic, LangGraph, MAF, Pydantic AI, OpenAI Agents) stay opt-in to
keep `kars dev` startup cost bounded. Hermes is the productized runtime
in this PR, and auto-building it is what makes operator-`n` → Hermes →
Launch JUST WORK out of the box.
Verified:
npm run build + typecheck ⇒ clean
vitest run ⇒ 798 passed
…m template`
Found while debugging user's persistent ErrImagePull after the manual
CRD refresh worked:
helm template kars deploy/helm/kars --include-crds | kubectl apply ...
WITHOUT `-f deploy/helm/kars/values-local-dev.yaml` re-renders the
controller Deployment from default values.yaml, which doesn't include
`KARS_DEV_PROFILE=true`. The controller then defaults imagePullPolicy to
Always for `:latest` images, so every sandbox pod tries to pull from ACR
even when the image is loaded in kind. DNS resolution to
karsacr.azurecr.io fails → ErrImagePull forever.
This is exactly the failure mode I just sent the user into with this
PR's previous commit (93938f5): the kars-add error hint's PRIMARY fix
was a naked `helm template`, which is what triggered the side-effect.
Reorder the hint:
1. Primary: `kars dev --target local-k8s` (correct overlay, no risk of
overwriting controller env).
2. Fallback (only if you must apply CRD by hand): include
`-f deploy/helm/kars/values-local-dev.yaml` so the controller keeps
its dev semantics.
No code path changed, just the user-facing diagnostic string.
User stuck on persistent ErrImagePull for kars-runtime-hermes even after
the image was loaded into kind (`docker exec kars-dev-control-plane
crictl images | grep hermes` confirmed). The kubelet was still
attempting a network pull because the controller emitted
`imagePullPolicy: Always` for the `:latest` tag.
Root cause: `KARS_DEV_PROFILE=true` was set ONLY by `kars dev`'s dynamic
per-run overlay (cli/src/commands/dev/local-k8s.ts:944), NOT by the
static `values-local-dev.yaml` overlay. Result: any out-of-band chart
apply that just used `-f values-local-dev.yaml` (e.g. the CRD-refresh
workflow we recommend when source CRDs drift forward) silently dropped
the env var → controller's pull-policy helper at
controller/src/reconciler/mod.rs:1291 fell into the `Always` branch for
`:latest` images → every sandbox pod tried to pull from ACR even when
the image was kind-loaded → ErrImagePull forever on machines without
ACR network reachability.
Fix: pin `KARS_DEV_PROFILE: "true"` in the static overlay alongside
`LEADER_ELECTION_ENABLED: "false"`. Now anyone applying
`values-local-dev.yaml` gets full dev semantics (IfNotPresent pull
policy, dev-mode relaxations, etc.) without depending on the CLI's
dynamic overlay being layered on top.
The CLI's dynamic overlay still re-emits the same key idempotently (see
cli/src/commands/dev/local-k8s.ts::provisionDevCreds line 944);
helm/kubectl deduplicate by env-var `name`, so no double emission
downside.
After this commit, the user can run any of:
• `kars dev --target local-k8s` (always worked)
• `helm template kars deploy/helm/kars -f deploy/helm/kars/values-local-dev.yaml --include-crds | kubectl apply -f - --server-side --force-conflicts`
• `helm upgrade kars deploy/helm/kars -f deploy/helm/kars/values-local-dev.yaml`
And all three yield a working dev controller.
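The pull-policy behaviour this commit depends on can be restated as a small decision function. The real helper lives in the Rust controller; the TypeScript below is only an illustrative mirror of the rule as the commit messages describe it, and `pullPolicyFor` is a hypothetical name. It assumes non-`:latest` tags get `IfNotPresent` regardless of the dev profile.

```typescript
type PullPolicy = "IfNotPresent" | "Always";

// Mirrors the described rule: dev profile never forces a network pull,
// otherwise only `:latest` images are always re-pulled.
function pullPolicyFor(image: string, devProfile: boolean): PullPolicy {
  if (devProfile) return "IfNotPresent";
  return image.endsWith(":latest") ? "Always" : "IfNotPresent";
}
```

With `devProfile` false, a kind-loaded `:latest` image still triggers a registry pull, which is the ErrImagePull loop described above.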
… apply`
Following on from 99deca3 / 2c0c912 and a user-driven debug session
that uncovered why the "helm template -f values-local-dev.yaml | kubectl
apply" workaround keeps half-bricking local-k8s clusters:
The static `values-local-dev.yaml` overlay does NOT contain the
inference creds (AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT,
KARS_PROVIDER, COPILOT_GITHUB_TOKEN). Those are emitted by the per-run
DYNAMIC overlay built in
`cli/src/commands/dev/local-k8s.ts::provisionDevCreds`. So any external
`helm template ... -f values-local-dev.yaml | kubectl apply` overwrites
the controller Deployment's pod-spec env with ONLY the static-overlay
values, silently nuking the dynamic creds.
Result: every subsequent reconcile fails with:
ERROR "No inference endpoint configured"
and Deployments never get written. Pods never get created. The user
then sees the operator's spawn dialog "succeed": kars add times out on
the wait loop and exits 0 from the previous-spinner branch (or exits 1
as of f39d425), but in either case the CR is reconciled into nothing
because the controller can't proceed.
Fix: simplify the CRD-mismatch hint to recommend ONLY
`kars dev --target local-k8s` (the source of truth for local dev). Drop
the helm-template fallback that was actively harmful. Add an explicit
DO NOT warning so the next person reading this hint doesn't try the
obvious-but-wrong shortcut.
Verified:
npm run build ⇒ clean
vitest run ⇒ 798 passed
This session shipped fixes for 12 OOTB blockers, every one diagnosable
from cluster state + controller logs + chart source. This commit
captures the design for an in-cluster SRE agent that auto-walks the same
diagnostic ladder. Filed as docs/blueprints/07-kars-sre-proposal.md so
it stays discoverable and gets PR review on its own merits without
bloating this Hermes-runtime PR.
Phased implementation:
1. MVP (kars-sre-mvp todo): 5 read-only tools. ~500 LOC, ~1 day.
2. Phase 2 (blocked on Phase 1): sre_apply_fix + AGT approval
flow.
3. Phase 3 (blocked on Phase 2): continuous-watcher mode.
Validation gate: the merged agent must autonomously diagnose +
propose fixes for every one of the 12 OOTB blockers from THIS
session, given only the cluster state that existed at the moment
each was hit. That's a built-in regression corpus.
No code shipped in this commit — design only. Implementation
lands in a separate PR series.
…s-control section
Two pieces:
1. fix(cli): rebuildDevImages always rebuilds controller + router
User session hit a 30-minute debug loop because their kind cluster was
running a `kars-controller:dev` image built BEFORE commit 493c118
(2026-06-04), which added the `dev_profile` check at
controller/src/reconciler/mod.rs:1291:
    let pull_policy = if ctx.dev_profile || !image.ends_with(":latest") {
        "IfNotPresent"
    } else {
        "Always"
    };
Without dev_profile in the running controller, every `:latest` image
got `imagePullPolicy: Always`, including the local kind-loaded
kars-runtime-hermes, so the kubelet ALWAYS attempted a manifest fetch
from karsacr.azurecr.io which kind cannot reach → ErrImagePull forever.
Why this was hard to spot: rebuildDevImages skipped already-present
images entirely. Running `kars dev --target local-k8s` after pulling
new controller source did NOT rebuild the controller image. The user's
check for KARS_DEV_PROFILE=true on the controller pod passed (env was
correctly injected by the helm overlay), but the running controller
process couldn't USE the env because the binary predates the
env-reading code.
Fix: introduce ALWAYS_REBUILD = {"controller", "inference-router"}.
Both bottom out at a pre-staged Rust binary COPY (see staging in
stage-rust-bin.ts), so docker layer cache makes the rebuild ~5-30s.
That cost is well worth not silently running stale controllers.
Sandbox + runtime images stay opt-in because their builds are minutes
long.
2. docs(sre): expand kars-sre access-control design
Added §6.1-6.6 to docs/blueprints/07-kars-sre-proposal.md:
- 6.1 Tier 1 (MVP target): in-cluster ServiceAccount token on
local-k8s; works on kind without any Entra/AKS dependency
- 6.2 Tier 2 (Phase 2): AKS Workload Identity federation;
byte-identical agent code, purely additive operator glue
- 6.3 Complete ClusterRole `kars-sre-reader` spec: single
authorization gate, every absent permission is deliberate
- 6.4 Secrets handling: router-side .data stripping (~30 LOC in
inference-router/src/proxy.rs) + RBAC defense in depth
- 6.5 Phase-2 write actions via short-lived (5-min TTL), per-action
ServiceAccount tokens minted on operator approval; standing blast
radius stays read-only
- 6.6 Egress already covered
Verified:
npm run build + typecheck ⇒ clean
vitest run ⇒ 798 passed
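The ALWAYS_REBUILD rule from the first piece of this commit can be sketched as follows. Only the set contents and the `forceAll` flag name come from the commit messages; `shouldRebuild` is a hypothetical name, and the assumption that every other spec is built only when its image is absent is inferred from "skipped already-present images".

```typescript
// Specs whose images are rebuilt on every run (per the commit message).
const ALWAYS_REBUILD: ReadonlySet<string> = new Set(["controller", "inference-router"]);

// Decides whether a dev image spec should be rebuilt on this run.
function shouldRebuild(spec: string, imagePresent: boolean, forceAll: boolean): boolean {
  if (forceAll) return true;                 // explicit full rebuild requested
  if (ALWAYS_REBUILD.has(spec)) return true; // cheap rebuild, avoids stale binaries
  return !imagePresent;                      // assumed: others build only when missing
}
```

The asymmetry is a cost trade-off: the two always-rebuilt images finish in seconds thanks to the layer cache, while sandbox and runtime builds take minutes.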
…copy
User on local-k8s after the image-loading saga finally got past
ImagePullBackOff; the pod then crashed on agent-container startup with:
cp: preserving permissions for '/sandbox/.hermes/plugins/kars/__init__.py': Operation not permitted
cp: preserving permissions for '/sandbox/.hermes/plugins/kars/discover.py': Operation not permitted
…(13 such lines)
Root cause: sandbox-images/hermes/entrypoint.sh line 73 used `cp -a`,
which preserves owner + mode + timestamp metadata. The staged source at
/opt/kars-hermes-stage/plugins/kars/ was chowned root:root at image
build time. The entrypoint runs as UID 1000 (sandbox user) on a
`readOnlyRootFilesystem: true` pod, so preserving root ownership →
EPERM from the kernel → `set -e` at the top of the script kills the
container with the "Operation not permitted" spam as the entire visible
output. Cryptic-as-hell symptom; trivial fix.
Fix: `cp -r` instead of `cp -a`. Files end up owned by the copying user
(UID 1000), which is what we want anyway because that's the UID hermes
runs as. The source files have 0444 from the `chmod -R a+rX` in
sandbox-images/hermes/Dockerfile, so they remain readable post-copy; no
follow-up chmod needed.
Prior art: the openclaw entrypoint at
sandbox-images/openclaw/entrypoint.sh:1273 uses `cp --no-preserve=mode`
for the same reason (it copies the package.json staged at image build
time into a runtime location). We're now consistent.
No new tests: the failure mode is verifiable only against a real
sandbox pod, and the OOTB-fresh-machine-gate todo will cover it once
that CI lane lands.
… gpt-5.4)
User picked `claude-opus-4.7` in the operator spawn dialog; the
resulting Hermes pod was using gpt-5.4 instead.
Root cause: sandbox-images/hermes/entrypoint.sh line 171 wrote
echo " default: \"${AZURE_OPENAI_DEPLOYMENT:-gpt-5.4}\""
into the auto-generated config.yaml. AZURE_OPENAI_DEPLOYMENT is
NOT injected into the agent container by the controller — only
into the inference-router container's env (see
controller/src/reconciler/mod.rs:1656 → router_env, vs the
openclaw_env path at line 1335 which only sets the generic
KARS_MODEL). So the hermes entrypoint fell through to the
hardcoded gpt-5.4 default in EVERY case.
The generic kars-runtime-contract env var is KARS_MODEL (controller/
src/reconciler/mod.rs:1335 — "Generic alias readable by any
runtime — Hermes / OpenAIAgents / MAF / BYO all read KARS_MODEL").
That's the env Hermes should be honouring.
Fix: prefer KARS_MODEL first, fall back to AZURE_OPENAI_DEPLOYMENT
(for hand-crafted dev overlays that still set the legacy name),
keep gpt-5.4 as last-resort default to keep the boot banner
sensible when neither is set.
${KARS_MODEL:-${AZURE_OPENAI_DEPLOYMENT:-gpt-5.4}}
User-visible effect: `kars operator → n → Hermes → model:
claude-opus-4.7 → Launch` now actually runs claude-opus-4.7 in the
pod, not gpt-5.4. Same for every other Copilot/Foundry model the
user can pick.
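The fallback chain `${KARS_MODEL:-${AZURE_OPENAI_DEPLOYMENT:-gpt-5.4}}` lives in a shell entrypoint; the TypeScript below is only a sketch of the same precedence for readers who want to reason about it in a test. `resolveModel` is a hypothetical name. Note that shell `:-` treats an empty variable the same as an unset one, which the sketch mirrors.

```typescript
// Picks the model name using the same precedence as the shell expansion.
function resolveModel(env: Record<string, string | undefined>): string {
  // `${VAR:-default}` falls through on unset AND on empty values.
  const pick = (value: string | undefined): string | undefined =>
    value !== undefined && value.length > 0 ? value : undefined;
  return pick(env["KARS_MODEL"]) ?? pick(env["AZURE_OPENAI_DEPLOYMENT"]) ?? "gpt-5.4";
}
```

The legacy variable stays in the chain only so that hand-crafted overlays keep working; the controller-injected `KARS_MODEL` always wins when present.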
…spawns)
User session: a Hermes parent agent tried to spawn a sub-agent with
the Hermes plugin's documented `role` arg. The router rejected with
HTTP 422 Unprocessable Entity because:
inference-router/src/spawn/mod.rs::SpawnRequest had #[serde(
deny_unknown_fields)] and no `role` field. The Hermes plugin
wrapper (runtimes/hermes/src/kars_runtime_hermes/plugin/spawn.py)
sent `role` as a top-level body field and got 422.
The Hermes plugin's docstring + the kars_spawn schema both expose
`role` as a key arg: "Short persona/role description that siblings
can find by role." The bug was always a missing field in the router's
request type, not bad client code.
Fix:
1. Add `pub role: Option<String>` to SpawnRequest in
inference-router/src/spawn/mod.rs:48 (the schema-deny-strict
deserialize site).
2. Wire it through every existing SpawnRequest construction site
(handoff/mod.rs × 6, spawn/dev_profile_test.rs × 1, spawn/mod.rs
× 2 — the docker.rs snapshot path and the list-children
reconstruction path).
3. In build_sub_agent_crd_with_labels, when role is non-empty,
emit it as `kars.azure.com/role` label on the child CRD so:
(a) sibling discovery via `kubectl get karssandbox
-l kars.azure.com/role=auditor` works,
(b) the parent's local roster can recover role on restart by
reading children with kars.azure.com/parent + reading
kars.azure.com/role,
(c) the handoff/restore path snapshot can preserve role
across re-spawn (the snapshot-reconstruction site now reads
the label back).
4. Sanitize the label: K8s requires ≤63 chars + a specific
character class. Replace disallowed chars with `-`, truncate.
Don't fail the spawn over a space in the LLM's free-form
persona text.
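Step 4 can be sketched as below. The router implements this in Rust; this TypeScript version is an illustration only, and `sanitizeLabelValue` is a hypothetical name. The replace-then-truncate behaviour comes from the commit message; trimming non-alphanumeric characters from both ends is an added assumption, based on the Kubernetes rule that a non-empty label value must begin and end with an alphanumeric character.

```typescript
// Turns free-form persona text into a valid Kubernetes label value.
function sanitizeLabelValue(raw: string): string {
  return raw
    .replace(/[^A-Za-z0-9._-]/g, "-") // swap disallowed characters for "-"
    .substring(0, 63)                 // enforce the 63-character cap
    .replace(/^[^A-Za-z0-9]+/, "")    // assumed: must start with an alphanumeric
    .replace(/[^A-Za-z0-9]+$/, "");   // assumed: must end with an alphanumeric
}
```

For example, `sanitizeLabelValue("data analyst")` yields `data-analyst`, so a space in the LLM's persona text no longer fails the spawn.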
User-visible effect: `Hermes parent → kars_spawn(name: zsolti, role:
"data analyst")` now succeeds AND records the role for downstream
discovery. Previously it failed with 422 and the Hermes plugin's
wrapper retried without role (silently dropping the field's whole
purpose).
The agent that diagnosed this was, charmingly, correct.
Verified:
cargo build/clippy/fmt ⇒ clean
cargo test --lib spawn ⇒ 16 passed
ci/check-loc.sh ⇒ clean
The honest answer to "is OOTB really working?" is "you'd have to
actually run the full flow on a fresh machine to know". This
session shipped fixes for 16 OOTB blockers that no unit test caught,
every single one a wire-format failure across module boundaries that
only surfaces against a real cluster.
This script makes that question answerable by anyone in one command:
export KARS_OOTB_COPILOT_TOKEN="gho_..."
bash scripts/smoke/fresh-machine-ootb.sh
What it does:
1. Wipes carried state (kind cluster, AGT clone, ~/.kars, npm-link)
2. Fresh git clone into /tmp/kars-ootb-smoke
3. cd cli && npm ci && npm run build && npm link
4. kars dev --target local-k8s (non-interactive via seeded creds)
5. kars add one OpenClaw + one Hermes sandbox
6. Polls until both pods Running 2/2 (5-min timeout each)
7. Tears down (or --keep)
Exits 0 iff every step succeeds. On failure prints the precise
command that failed AND the pod's diagnostic output, so the
regression is reproducible from the script output alone.
Catches everything from the 2026-06-08 Hermes session:
• AGT auto-clone missing → kars dev fails at chart-apply
• Stale CRD → kars add 4xx
• Stale controller / runtime images → ImagePullBackOff
• KARS_DEV_PROFILE drift → same
• `cp -a` perm error → CrashLoopBackOff
• SpawnRequest missing role → kars add exit 1
• etc.
Limitations today:
• Needs a real Copilot OAuth token in env (can't run on a public
GHA runner without secret wiring). The ootb-fresh-machine-gate
todo tracks moving this to CI.
• Single-cluster only; AKS / federation covered by separate
interop scripts.
• Doesn't validate model routing post-spawn yet (next iteration).
README.md documents usage, what it catches, when to run, and
how to extend for new wired runtimes.
Not wired as a PR gate yet — runs manually. Filed
`ootb-fresh-machine-gate` for the CI lane.
…tch deck
Practitioner-grade visual language (mix of Patrick Collison / Stripe Press,
Bret Victor, and Stripe-docs style — selected per slide intent):
• Title + close : dark sandwich, 168pt mark, single tagline
• Pillar overview : eyebrow + heading + lede paragraph + 4 named primitives
• Sandbox : Victor-style — one named artefact, real CRD field
labels (UID 1000, readOnlyRootFilesystem, etc.)
• Sandbox · the gate : Stripe-docs — the 6 actual iptables rules in a
code block + prose explanation on the right
• Mesh : the real KNOCK frame JSON as the visual artefact
(with v/type/from/to/id/ts/intent/establishment fields)
• Governance · policy : the real InferencePolicy CR snippet as the artefact
• Governance · stack : 4 layers vertically with name + body + source ref
• Blueprints : 6 named shapes, real meta per shape (Kata + SEV-SNP,
A2A bridge, private model + signed allowlist, etc.)
• Multi-runtime : 8 wired runtimes named, with what each is in one line
• Built on AGT : the 4 actual PRs/contributions named
• What's next : the 4 actual shipping targets named
• Try it : the actual `kars dev` command sequence as code block
Every claim in the deck is ground-truthed against repo HEAD via
deep-dive explorer runs (see docs/showcase/outline.md for full
citations). Source file: /tmp/build-deck.js → pptxgenjs render.
Diagram pair (Excalidraw): 6 .excalidraw files using single accent
teal (#028090), Helvetica throughout, no inline arrows where layout
implies the relationship. Kept minimal because the deck slides carry
the detail; diagrams are hero shots, not info-dense.
Source-of-truth doc: docs/showcase/outline.md — every claim with
file:line citations against ground-truthed repo source. Updating
outline.md → re-running /tmp/build-deck.js regenerates the deck.
… tiles cropped)
Visual QA caught two layout bugs in the first render:
- Slide 7 (MESH/KNOCK frame): JSON code block was 17 lines @ fontSize 12
in a 2.9" panel, so the closing braces overflowed below the gray panel.
Fixed: shrunk to fontSize 11, panel taller (3.55"), raised to y:3.85 to
use the available space without colliding with the lede.
- Slide 11 (RUNTIMES): 4 tiles @ 3.85" wide + 0.15" gaps = 15.85" total,
but the slide is only 13.3" wide, so x0 went negative, cropping the
leftmost tiles. Fixed: tw=2.9" → 12.25" total, centred with 0.525"
left/right margin.
Re-rendered all 15 slides, visual QA pass clean across the board.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
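The slide 11 cropping comes down to centring arithmetic, which can be checked with a short helper. This is a sketch, not code from build-deck.js: `centredRowStart` is a hypothetical name, and it assumes the row width is the tiles plus the gaps between them.

```typescript
// Left x-coordinate (inches) that centres a row of equal-width tiles on a slide.
function centredRowStart(slideWidth: number, tiles: number, tileWidth: number, gap: number): number {
  const rowWidth = tiles * tileWidth + (tiles - 1) * gap;
  return (slideWidth - rowWidth) / 2;
}

// A negative start means the row is wider than the slide and gets cropped.
function rowFits(slideWidth: number, tiles: number, tileWidth: number, gap: number): boolean {
  return centredRowStart(slideWidth, tiles, tileWidth, gap) >= 0;
}
```

With the original 3.85" tiles on a 13.3" slide the start comes out at about -1.275", which is why the leftmost tiles were cut off.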
Vendoring /tmp/build-deck.js into scripts/showcase/ so the deck is reproducible from a clean clone, with no more 'lost the script that made the slides' problem.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-shaped
Major restructure per Pal's feedback (less competitive, more architecture):
REMOVED: 'THE RACE' slide (competitor names)
ADDED: 'THE OUTCOME' slide — 4 measurable outcomes:
every call audited · sandbox in minutes · one policy plane · default-deny egress
ADDED (architecture deep-dive, §3):
§3 THE CORE — three boxes: Controller / 11 CRDs / Inference Router
§3.1 ROUTER REQUEST FLOW — 6 stages: agent → iptables → router → policy → audit → upstream
§3.2 ROUTER INTERNALS — 8 routes + 8 subsystems in two columns
§3.3 CONTROLLER LOOP — KarsSandbox → 9 named Kubernetes primitives
§3.4 STATUS & OBSERVABILITY — phase taxonomy + conditions YAML
§3.5 CRD CATALOG — all 11 CRDs tabled with scope + reconciler
ADDED (policy in practice, §4):
§4 InferencePolicy (existing slide reworked)
§4.1 ToolPolicy + EgressApproval (real CR snippets side by side)
REFRAMED:
§10 WHAT'S NEXT — outcome-shaped: 'capability → what becomes possible → proof'
Visual QA: 4 layout bugs caught on first render (slides 9/11/13/20)
and fixed; final QA pass clean on all 21 slides.
Reproduces via: NODE_PATH=$(npm root -g) node scripts/showcase/build-deck.js
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gress slide
Two new slides per Pal's feedback:
NEW SLIDE 4 — HIGH-LEVEL ARCHITECTURE diagram (right after the 'WHAT KARS IS' dark
statement, before drilling into 'THE CORE'). A single picture that shows the whole
shape: dashed cluster boundary containing three boxes (kars CRDs · kars-controller ·
Sandbox pod), with the sandbox pod showing both inner containers (agent UID 1000 +
inference-router UID 1001). Below the cluster: a row of external services chips
(Azure OpenAI · Anthropic · OpenAI · Bedrock · MCP · A2A peers · AGT relay) with
'only path out' label connecting the architecture to the providers.
NEW SLIDE 15 — NETWORK EGRESS · learn it, sign it, enforce it. The full story:
- Top half: two side-by-side panels showing Learn mode (default — record every host
into the next allowlist proposal) vs Strict mode (production — anything outside the
signed allowlist gets 4xx; EgressApproval grants layer on top; fails closed)
- Bottom half: 5-stage signed-OCI-allowlist pipeline:
kars egress --sign → OCI artifact (ACR/ghcr) → cosign verify (Fulcio + SAN)
→ ConfigMap + digest → router · L7 hot-reload
- Source refs: controller/src/policy_fetcher.rs · egress_allowlist_compile.rs ·
inference-router/src/egress_allowlist_loader.rs
Section eyebrows renumbered to keep §3 sub-numbering consistent after insertion.
Final QA: clean on all 23 slides.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
First end-to-end Hermes docker smoke test surfaced two real bugs from the A1 ship:
1. `hermes-agent==0.5.1` doesn't exist on PyPI. The 0.5.x assumption came from misreading the Hermes README's Homebrew formula tag (`5.1.14`). PyPI uses 0.x.y numbering, with 0.15.2 the latest. Bumped pin to 0.15.2.
2. Hermes treats `standalone` plugins as opt-in via `plugins.enabled` in config.yaml. The entrypoint was materializing the plugin into the right path (`$HERMES_HOME/plugins/kars/`) but never adding `kars` to the allow-list, so it was discovered and silently skipped (`error='not enabled in config'`).
Bonus: `ripgrep` not being in Azure Linux 3 tdnf caused `tdnf install -y` to fail the whole layer. Hermes' `file_search` falls back to grep cleanly, so dropped it. Image now builds in ~30s.
Verification
- `docker build --platform linux/amd64 -f sandbox-images/hermes/Dockerfile -t kars-sandbox-hermes:dev .` succeeds
- `hermes_cli.plugins.discover_plugins()` loads `kars` plugin from `$HERMES_HOME/plugins/kars/` with 10 tools + 2 hooks (pre_tool_call + post_tool_call)
- Entrypoint generates config.yaml with both `plugins.enabled: [kars]` and `mcp_servers.platform` blocks
Plugin contract confirmation: Hermes 0.15.2's `PluginContext.register_tool` / `register_hook` signatures and `plugin.yaml` schema (`provides_tools` / `provides_hooks`) match exactly what the A1 plugin code was already built for.
Security
See `docs/internal/security-audits/2026-06-04-hermes-act1-docker-smoke-fixes.md`: no new threat surface beyond what existing A1 audits already cover; this commit only flips the previously-audited surface from dormant to live.