
feat(browser_execute): auto-attach Page.captureScreenshot results as image attachments #47

Merged
Alezander9 merged 1 commit into main from feat/auto-screenshot-attachments
May 9, 2026



@Alezander9
Member

Summary

Every successful Page.captureScreenshot made during a browser_execute call is now auto-attached as an image part on the next assistant turn. The model sees the screenshot natively as vision input — no decode → write → read dance from inside the snippet, no helper bound into snippet scope, no prompt change required.

Closes the screenshot-handling gap reported by a downstream agent: previously the base64 PNG came back in the output text and either got truncated (forcing a workaround through the read tool on a manually saved file) or arrived as raw bytes the model couldn't interpret.

Mechanism

Pure Level-1 addition under packages/bcode-browser/. Zero upstream diff.

  1. cdp/session.ts — new onCallResult(fn) listener API, symmetric with the existing onEvent. Fires after every successful _call resolve. Keeps the Session agnostic of any one method's semantics; Page.captureScreenshot is a consumer, not baked in.

  2. browser-execute.ts (Level 1): execute() subscribes for the duration of each call, filters to Page.captureScreenshot, and accumulates results into a per-call screenshots collector returned alongside output/result. When the BCODE_SCREENSHOT_DIR env var is set, the same tap also writes each screenshot to disk (best-effort, fire-and-forget) so eval harnesses can collect them for an LLM judge, making it a second consumer of the same hook.

  3. tool/browser-execute.ts (Level 2) — maps the collector into the existing attachments[] field on ExecuteResult. This is the same channel read.ts and webfetch.ts already use when they surface images. We're adding browser_execute as a third producer.
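A minimal TypeScript sketch of how items 1 and 2 could fit together: a generic onCallResult hook on a simplified Session, and an execute() that subscribes only for its own duration and keeps Page.captureScreenshot results. The Transport type, the listener signature, mimeFor, and the Session internals are assumptions for illustration, not the code in cdp/session.ts or browser-execute.ts. Only the CDP method name, its `format` values, and its base64 `data` result come from the Chrome DevTools Protocol.

```ts
// Hypothetical, simplified stand-ins: the real Session (cdp/session.ts) and the
// real execute() (browser-execute.ts) are not shown in this PR.
type Transport = (method: string, params: unknown) => Promise<unknown>;
type CallResultListener = (method: string, params: unknown, result: unknown) => void;

class Session {
  private send: Transport;
  private callListeners = new Set<CallResultListener>();

  constructor(send: Transport) {
    this.send = send;
  }

  // Generic hook, analogous to onEvent; returns an unsubscribe function.
  onCallResult(fn: CallResultListener): () => void {
    this.callListeners.add(fn);
    return () => {
      this.callListeners.delete(fn);
    };
  }

  async _call(method: string, params: unknown = {}): Promise<unknown> {
    const result = await this.send(method, params); // a rejection skips the listeners
    for (const fn of this.callListeners) {
      try {
        fn(method, params, result);
      } catch {
        // A faulty listener must not fail the caller's CDP call.
      }
    }
    return result;
  }
}

interface Screenshot {
  data: string; // base64 payload from the CDP result's `data` field
  mime: string;
}

// CDP's Page.captureScreenshot `format` is "jpeg" | "png" | "webp" (png by default).
function mimeFor(format: unknown): string {
  if (format === "jpeg") return "image/jpeg";
  if (format === "webp") return "image/webp";
  return "image/png";
}

// Subscribes only for the lifetime of one call; the Session itself knows
// nothing about screenshots.
async function execute(
  session: Session,
  snippet: (s: Session) => Promise<unknown>,
): Promise<{ result: unknown; screenshots: Screenshot[] }> {
  const screenshots: Screenshot[] = [];
  const unsubscribe = session.onCallResult((method, params, result) => {
    if (method !== "Page.captureScreenshot") return;
    const data = (result as { data?: unknown } | null)?.data;
    if (typeof data !== "string") return;
    const format = (params as { format?: unknown } | null)?.format;
    screenshots.push({ data, mime: mimeFor(format) });
  });
  try {
    return { result: await snippet(session), screenshots };
  } finally {
    unsubscribe();
  }
}
```

Unsubscribing in `finally` keeps a screenshot from leaking into a later call's collector, even when the snippet throws.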

Surface area

  • cdp/session.ts: +28 lines (listener API + fire on _call resolve)
  • browser-execute.ts Level 1: +61 lines (collector, env-var dump, threading; mostly mime/format helpers)
  • tool/browser-execute.ts Level 2: +15 lines (attachments mapping + screenshot-count footer in tool output)
  • skills/BROWSER.md: 4 lines amended (one screenshot example block)
  • skills/interaction-skills/screenshots.md: 6 lines added (new "Auto-attached" callout)
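The Level 2 change (attachments mapping plus a screenshot-count footer) is small enough to sketch in full. This sketch assumes each attachment carries the image inline as a `data:` URL and uses a trimmed-down, hypothetical FileAttachment shape; the real opencode attachment type, the ExecuteResult fields, and the footer's exact wording are not shown in this PR.

```ts
// Hypothetical, trimmed-down shapes; opencode's real types carry more fields.
interface Screenshot {
  data: string; // base64, as returned by Page.captureScreenshot
  mime: string; // e.g. "image/png"
}

interface FileAttachment {
  type: "file";
  mime: string;
  url: string;
}

// Level 2: turn the per-call collector into attachment parts.
function toAttachments(screenshots: Screenshot[]): FileAttachment[] {
  return screenshots.map(
    (s): FileAttachment => ({
      type: "file",
      mime: s.mime,
      url: `data:${s.mime};base64,${s.data}`, // inline image, nothing touches disk
    }),
  );
}

// Illustrative wording only: the PR says a screenshot-count footer is added to
// the tool output but does not show its text.
function screenshotFooter(count: number): string {
  if (count === 0) return "";
  return `\n[${count} screenshot${count === 1 ? "" : "s"} attached]`;
}

const shots: Screenshot[] = [
  { data: "iVBORw0KGgo=", mime: "image/png" },
  { data: "/9j/4AAQ", mime: "image/jpeg" },
];
const attachments = toAttachments(shots);
const output = "snippet output" + screenshotFooter(attachments.length);
console.log(attachments[1].url); // data:image/jpeg;base64,/9j/4AAQ
```

Keeping the footer in the text output gives a textual record of how many images were attached, alongside the image parts themselves.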

Tests

Two new smoke tests in test/browser-execute.test.ts (gated on BCODE_SMOKE_CHROME=1, same as existing tests):

  • Page.captureScreenshot is collected into result.screenshots — verifies png + jpeg round-trip with correct mime tags.
  • BCODE_SCREENSHOT_DIR dumps screenshots to disk — verifies the env-var disk-dump path lands .png files.

Local run against headless Chrome 147 (Linux x64): 8/8 pass (4 pre-existing Chrome smokes + 2 new screenshot smokes + 2 unit tests). bun typecheck clean across all packages.

Concurrency

Parallel execute() calls against the same Session would each subscribe and each see every screenshot produced during their lifetime, so a screenshot taken by one call can also appear in the other's collector. This is rare: it requires two in-flight tool calls under one sessionID, which opencode serializes within one assistant message. Documented as acceptable for v1.
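To make the overlap concrete, here is a self-contained simulation with hypothetical names (onCallResult, fireCallResult, openCollector) standing in for the Session hook and two execute() calls. Call B starts while call A is still running, so the screenshot B takes in that window lands in both collectors.

```ts
// Minimal stand-in for Session.onCallResult (hypothetical names): a listener
// set fired once per successful CDP call result.
type Listener = (method: string, id: string) => void;
const listeners = new Set<Listener>();

function onCallResult(fn: Listener): () => void {
  listeners.add(fn);
  return () => {
    listeners.delete(fn);
  };
}

function fireCallResult(method: string, id: string): void {
  for (const fn of listeners) fn(method, id);
}

// Each execute() opens its own collector for exactly its own lifetime.
function openCollector(): { seen: string[]; close: () => void } {
  const seen: string[] = [];
  const close = onCallResult((method, id) => {
    if (method === "Page.captureScreenshot") seen.push(id);
  });
  return { seen, close };
}

const a = openCollector(); // call A starts
fireCallResult("Page.captureScreenshot", "shot-1"); // taken by A
const b = openCollector(); // call B starts while A is still in flight
fireCallResult("Page.captureScreenshot", "shot-2"); // taken by B
a.close(); // call A returns
fireCallResult("Page.captureScreenshot", "shot-3"); // taken by B
b.close(); // call B returns

console.log(a.seen.join(","), "|", b.seen.join(",")); // shot-1,shot-2 | shot-2,shot-3
```

Both collectors are removed once their calls return, so the cross-talk is bounded to the window in which the two calls overlap.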

Notes for the eval consumer

When BCODE_SCREENSHOT_DIR=<path> is set, every successful Page.captureScreenshot writes <sessionID>-<startedAt>-<seq>.<ext> to the directory. The disk dump fires unconditionally on success, independent of the attachments[] path, so it survives any future change to opencode's attachment handling.
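A sketch of the naming scheme and the best-effort write, assuming Node or Bun file APIs. screenshotFileName and dumpScreenshot are hypothetical helpers. The numeric startedAt, the `.jpg` extension for JPEG, and creating the directory up front are assumptions, not details from the PR.

```ts
import { mkdir, writeFile } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical helper following the documented pattern
// <sessionID>-<startedAt>-<seq>.<ext>. Assumed: startedAt is a numeric
// timestamp and JPEG maps to ".jpg"; the PR specifies neither.
function screenshotFileName(sessionID: string, startedAt: number, seq: number, mime: string): string {
  const ext = mime === "image/jpeg" ? "jpg" : mime === "image/webp" ? "webp" : "png";
  return `${sessionID}-${startedAt}-${seq}.${ext}`;
}

// Best-effort and fire-and-forget: nothing awaits the promise and every error
// is swallowed, so a bad path or a full disk never fails the browser_execute
// call. Creating the directory with mkdir is an assumption of this sketch.
function dumpScreenshot(dir: string, fileName: string, base64: string): void {
  void mkdir(dir, { recursive: true })
    .then(() => writeFile(join(dir, fileName), Buffer.from(base64, "base64")))
    .catch(() => {
      // Ignore: the dump only feeds eval harnesses.
    });
}

// Possible call site inside the call-result tap (sketch):
//   const dir = process.env.BCODE_SCREENSHOT_DIR;
//   if (dir) dumpScreenshot(dir, screenshotFileName(sessionID, startedAt, seq++, shot.mime), shot.data);
```

Because the write is detached from the attachment mapping, a harness can rely on the files even if the attachments[] channel changes.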

…image attachments

Every successful Page.captureScreenshot made during a browser_execute call
is now collected from the CDP transport and surfaced as a FilePart on the
tool result. The opencode runner appends those attachments to the next
assistant turn as image parts, so the model sees the screenshot natively
as vision input. No more decode-write-read dance from inside the snippet.

Same channel that read.ts and webfetch.ts already use when they surface
images; we're adding browser_execute as a third producer.

Mechanism (Level 1, zero upstream diff):

- cdp/session.ts: new generic onCallResult(fn) listener API, symmetric with
  existing onEvent. Fires after every successful _call resolve. Keeps the
  Session agnostic of any one method's semantics.
- browser-execute.ts (Level 1): subscribes for the duration of each
  execute() call, filters to Page.captureScreenshot, accumulates results
  into a per-call collector returned alongside output/result. When
  BCODE_SCREENSHOT_DIR is set, the same tap also writes each screenshot
  to disk (eval-judge consumption — second consumer of the same hook).
- tool/browser-execute.ts (Level 2): maps the collector to attachments[]
  on the ExecuteResult.

BROWSER.md and interaction-skills/screenshots.md updated to tell the agent
about the auto-attach behavior. Two new smoke tests (gated on BCODE_SMOKE_CHROME)
verify the screenshot round-trip and the env-var disk dump.
Alezander9 merged commit a17676f into main on May 9, 2026
2 checks passed
