feat(browser_execute): auto-attach Page.captureScreenshot results as image attachments by Alezander9 · Pull Request #47 · browser-use/browsercode

Alezander9 · 2026-05-09T03:12:40Z

Summary

Every successful Page.captureScreenshot made during a browser_execute call is now auto-attached as an image part on the next assistant turn. The model sees the screenshot natively as vision input — no decode → write → read dance from inside the snippet, no helper bound into snippet scope, no prompt change required.

Closes the screenshot-handling gap reported by a downstream agent: previously the base64 PNG came back in the output text and either got truncated (forcing a workaround through the read tool on a manually-saved file) or arrived as raw bytes the model couldn't interpret.

Mechanism

Pure Level-1 addition under packages/bcode-browser/. Zero upstream diff.

cdp/session.ts — new onCallResult(fn) listener API, symmetric with the existing onEvent. Fires after every successful _call resolve. Keeps the Session agnostic of any one method's semantics; Page.captureScreenshot is a consumer, not baked in.
browser-execute.ts (Level 1) — execute() subscribes for the duration of each call, filters to Page.captureScreenshot, accumulates results into a per-call screenshots collector returned alongside output/result. When BCODE_SCREENSHOT_DIR env var is set, the same tap also writes each screenshot to disk (best-effort, fire-and-forget) so eval harnesses can collect them for an LLM judge — second consumer of the same hook.
tool/browser-execute.ts (Level 2) — maps the collector into the existing attachments[] field on ExecuteResult. This is the same channel read.ts and webfetch.ts already use when they surface images. We're adding browser_execute as a third producer.
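
For reviewers who want the shape of the hook without opening the diff, here is a minimal sketch of the listener plus per-call collector described above. Every name in it (SessionSketch, executeSketch, mimeFor, the Screenshot shape, the listener signature) is an assumption made for illustration; the actual code in cdp/session.ts and browser-execute.ts may be structured differently. The only protocol facts relied on are that Page.captureScreenshot accepts an optional format of "png", "jpeg" or "webp" and returns the image as base64 in a data field.

```typescript
// All names below are hypothetical; the real Session and execute() may differ.
type CallResultListener = (method: string, params: unknown, result: unknown) => void;
type Transport = (method: string, params: Record<string, unknown>) => Promise<unknown>;

interface Screenshot {
  mime: string;   // e.g. "image/png"
  base64: string; // payload exactly as CDP returned it
}

class SessionSketch {
  private listeners = new Set<CallResultListener>();
  private transport: Transport;

  constructor(transport: Transport) {
    this.transport = transport;
  }

  // Register a listener and get back an unsubscribe function.
  onCallResult(fn: CallResultListener): () => void {
    this.listeners.add(fn);
    return () => {
      this.listeners.delete(fn);
    };
  }

  // Stand-in for _call: listeners fire only after the call resolves successfully.
  async call(method: string, params: Record<string, unknown> = {}): Promise<unknown> {
    const result = await this.transport(method, params);
    this.listeners.forEach((fn) => {
      try {
        fn(method, params, result);
      } catch {
        // A misbehaving listener must not fail the CDP call itself.
      }
    });
    return result;
  }
}

// Map the CDP format parameter to a mime tag; CDP defaults to png when omitted.
function mimeFor(format: unknown): string {
  if (format === "jpeg") return "image/jpeg";
  if (format === "webp") return "image/webp";
  return "image/png";
}

// Subscribe for the lifetime of one call, keep only screenshot results.
async function executeSketch(
  session: SessionSketch,
  snippet: (session: SessionSketch) => Promise<unknown>,
): Promise<{ result: unknown; screenshots: Screenshot[] }> {
  const screenshots: Screenshot[] = [];
  const unsubscribe = session.onCallResult((method, params, result) => {
    if (method !== "Page.captureScreenshot") return;
    if (typeof result !== "object" || result === null) return;
    const data = (result as { data?: unknown }).data;
    if (typeof data !== "string") return;
    const format = (params as { format?: unknown } | undefined)?.format;
    screenshots.push({ mime: mimeFor(format), base64: data });
  });
  try {
    const result = await snippet(session);
    return { result, screenshots };
  } finally {
    unsubscribe(); // always detach, even if the snippet throws
  }
}
```

The Session side stays generic: it only announces that some call resolved, and the filter on the method name lives entirely in the consumer. That is the property the PR relies on to let the disk dump reuse the same tap.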

Surface area

cdp/session.ts: +28 lines (listener API + fire on _call resolve)
browser-execute.ts Level 1: +61 lines (collector, env-var dump, threading; mostly mime/format helpers)
tool/browser-execute.ts Level 2: +15 lines (attachments mapping + screenshot-count footer in tool output)
skills/BROWSER.md: 4 lines amended (one screenshot example block)
skills/interaction-skills/screenshots.md: 6 lines added (new "Auto-attached" callout)
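
The Level 2 change counted above (attachments mapping plus a screenshot-count footer) could look roughly like the sketch below. The AttachmentSketch fields, the data-URL encoding, the generated filenames and the footer wording are all assumptions for illustration; they are not copied from opencode's attachment types or from this PR's diff.

```typescript
// Hypothetical shapes; field names are illustrative, not copied from opencode.
interface Screenshot {
  mime: string;
  base64: string;
}

interface AttachmentSketch {
  type: "file";
  mime: string;
  filename: string;
  url: string; // data URL carrying the base64 payload
}

// Pick a file extension for the attachment name (assumed naming).
function extensionFor(mime: string): string {
  if (mime === "image/jpeg") return "jpg";
  if (mime === "image/webp") return "webp";
  return "png";
}

// Turn the per-call collector into attachment entries, one per screenshot.
function toAttachments(screenshots: Screenshot[]): AttachmentSketch[] {
  return screenshots.map((shot, index): AttachmentSketch => ({
    type: "file",
    mime: shot.mime,
    filename: `screenshot-${index + 1}.${extensionFor(shot.mime)}`,
    url: `data:${shot.mime};base64,${shot.base64}`,
  }));
}

// Footer appended to the tool's text output; empty when nothing was captured.
function screenshotFooter(count: number): string {
  if (count === 0) return "";
  return `\n[${count} screenshot${count === 1 ? "" : "s"} attached]`;
}
```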

Tests

Two new smoke tests in test/browser-execute.test.ts (gated on BCODE_SMOKE_CHROME=1, same as existing tests):

Page.captureScreenshot is collected into result.screenshots — verifies png + jpeg round-trip with correct mime tags.
BCODE_SCREENSHOT_DIR dumps screenshots to disk — verifies the env-var disk-dump path lands .png files.

Local run against headless Chrome 147 (Linux x64): 8/8 pass (4 pre-existing Chrome smokes + 2 new screenshot smokes + 2 unit tests). bun typecheck clean across all packages.

Concurrency

Parallel execute() calls against the same Session would each subscribe and each see every screenshot produced during their lifetime, including screenshots triggered by the other call. This case is rare: it requires two in-flight tool calls under one sessionID, and opencode serializes tool calls within one assistant message. Documented as acceptable for v1.
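
To make the overlap concrete, here is a small synchronous sketch of two subscribers whose lifetimes intersect. CallResultHub and overlapDemo are hypothetical stand-ins, not code from this PR; the point is only that a listener set with no per-call ownership delivers each result to every subscriber active at that moment.

```typescript
// Hypothetical stand-in for the Session's listener set.
type Listener = (method: string, label: string) => void;

class CallResultHub {
  private listeners = new Set<Listener>();

  onCallResult(fn: Listener): () => void {
    this.listeners.add(fn);
    return () => {
      this.listeners.delete(fn);
    };
  }

  // Deliver one resolved call to every listener subscribed right now.
  emit(method: string, label: string): void {
    this.listeners.forEach((fn) => fn(method, label));
  }
}

function overlapDemo(): { a: string[]; b: string[] } {
  const hub = new CallResultHub();
  const a: string[] = [];
  const b: string[] = [];

  const offA = hub.onCallResult((method, label) => {
    if (method === "Page.captureScreenshot") a.push(label);
  });
  hub.emit("Page.captureScreenshot", "first"); // only A is subscribed

  const offB = hub.onCallResult((method, label) => {
    if (method === "Page.captureScreenshot") b.push(label);
  });
  hub.emit("Page.captureScreenshot", "second"); // A and B both collect this one

  offA();
  hub.emit("Page.captureScreenshot", "third"); // only B remains
  offB();

  return { a, b };
}
```

In this sketch the "second" screenshot ends up in both collectors, which mirrors the duplication accepted for v1.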

Notes for the eval consumer

When BCODE_SCREENSHOT_DIR=<path> is set, every Page.captureScreenshot writes <sessionID>-<startedAt>-<seq>.<ext> to the directory. Disk-dump fires unconditionally on success — independent of the attachments[] path — so it survives any future change to opencode's attachment handling.
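
For harness authors matching files back to calls, the naming scheme can be sketched as a pure function. This is an illustration only: planScreenshotDump is a hypothetical helper, and the sketch assumes startedAt is an epoch-millisecond timestamp, seq counts from zero within one call, and the extension mirrors the CDP format name. None of those details is confirmed by the description above, so check the actual files before depending on them.

```typescript
// Hypothetical helper: computes the paths a dump would write, without touching disk.
function planScreenshotDump(
  dir: string | undefined, // value of BCODE_SCREENSHOT_DIR, if any
  sessionID: string,
  startedAt: number,       // assumed: epoch milliseconds of the execute() call
  formats: string[],       // assumed: CDP format per screenshot, e.g. "png"
): string[] {
  if (!dir) return []; // env var unset: nothing is written
  const base = dir.endsWith("/") ? dir.slice(0, -1) : dir;
  // Pattern from the PR description: <sessionID>-<startedAt>-<seq>.<ext>
  return formats.map(
    (format, seq) => `${base}/${sessionID}-${startedAt}-${seq}.${format}`,
  );
}
```

A harness could group files by the sessionID and startedAt prefix to recover which screenshots belong to the same browser_execute call.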

…image attachments

Every successful Page.captureScreenshot made during a browser_execute call is now collected from the CDP transport and surfaced as a FilePart on the tool result. The opencode runner appends those attachments to the next assistant turn as image parts, so the model sees the screenshot natively as vision input. No more decode-write-read dance from inside the snippet.

Same channel that read.ts and webfetch.ts already use when they surface images; we're adding browser_execute as a third producer.

Mechanism (Level 1, zero upstream diff):

- cdp/session.ts: new generic onCallResult(fn) listener API, symmetric with existing onEvent. Fires after every successful _call resolve. Keeps the Session agnostic of any one method's semantics.
- browser-execute.ts (Level 1): subscribes for the duration of each execute() call, filters to Page.captureScreenshot, accumulates results into a per-call collector returned alongside output/result. When BCODE_SCREENSHOT_DIR is set, the same tap also writes each screenshot to disk (eval-judge consumption; a second consumer of the same hook).
- tool/browser-execute.ts (Level 2): maps the collector to attachments[] on the ExecuteResult.

BROWSER.md and interaction-skills/screenshots.md updated to tell the agent about the auto-attach behavior.

Two new smoke tests (gated on BCODE_SMOKE_CHROME) verify that screenshots round-trip and that the env-var disk dump works.

Alezander9 merged commit a17676f into main May 9, 2026
2 checks passed

Alezander9 merged 1 commit into main from feat/auto-screenshot-attachments
