Browser Automation

Architecture

Live Host

The browser session manager. Maintains up to 5 Chromium session slots, each bound to a random port in the 46000–47000 range. Sessions persist across tool calls within a turn.

src/browser/live-host.ts manages a pool of Playwright Chromium sessions. Each slot has its own port, session ID, and current URL. When a tool call needs a browser, it claims a slot from the pool. The shared browser runtime daemon in runtime-daemon.ts keeps the Playwright process alive between tool invocations.

# Live host configuration
portRange: 46000–47000     # random port selection
maxSessionSlots: 5         # concurrent browser sessions
requiredPortCount: 6       # ports per slot (main + util)
bindHost: 127.0.0.1        # loopback only — never exposed

# Session slot status
{
  slotIndex: 0,
  port: 46031,
  sessionId: "sess_01J5…",
  currentUrl: "https://docs.python.org/asyncio",
  updatedAt: "2026-03-06T…"
}

# Enable browser tools in config
opta config set browser.mcp.enabled true

Loopback-only binding: All browser sessions bind to 127.0.0.1. The LOOPBACK_HOSTS set validates that no session is ever exposed to external network interfaces — important when the user is on a shared network.
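
The slot pool and loopback check described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of src/browser/live-host.ts: the `SlotPool` class, its method names, and the members of `LOOPBACK_HOSTS` are hypothetical, and the sketch does not check whether a port is actually free at the OS level.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the actual
// contents of src/browser/live-host.ts.
const PORT_MIN = 46000;
const PORT_MAX = 47000;
const MAX_SESSION_SLOTS = 5;
// Assumed members; the source only says the set validates loopback binding.
const LOOPBACK_HOSTS = new Set<string>(["127.0.0.1", "::1", "localhost"]);

interface SessionSlot {
  slotIndex: number;
  port: number;
  sessionId: string;
  currentUrl: string | null;
  updatedAt: string;
}

class SlotPool {
  private readonly slots = new Map<number, SessionSlot>();
  private readonly random: () => number;

  constructor(bindHost: string = "127.0.0.1", random: () => number = Math.random) {
    if (!LOOPBACK_HOSTS.has(bindHost)) {
      throw new Error(`Refusing non-loopback bind host: ${bindHost}`);
    }
    this.random = random;
  }

  /** Claims the lowest free slot, or returns null when all slots are busy. */
  claim(sessionId: string): SessionSlot | null {
    for (let i = 0; i < MAX_SESSION_SLOTS; i++) {
      if (this.slots.has(i)) continue;
      const slot: SessionSlot = {
        slotIndex: i,
        port: this.pickPort(),
        sessionId,
        currentUrl: null,
        updatedAt: new Date().toISOString(),
      };
      this.slots.set(i, slot);
      return slot;
    }
    return null;
  }

  /** Frees a slot so a later tool call can claim it. */
  release(slotIndex: number): void {
    this.slots.delete(slotIndex);
  }

  /** Picks a random in-range port not used by another claimed slot.
   *  A real pool would also verify that the OS port is free. */
  private pickPort(): number {
    const used = new Set<number>(Array.from(this.slots.values(), (s) => s.port));
    for (;;) {
      const port = PORT_MIN + Math.floor(this.random() * (PORT_MAX - PORT_MIN + 1));
      if (!used.has(port)) return port;
    }
  }
}
```

Rejecting a non-loopback host in the constructor means a misconfigured bind address fails before any browser is launched.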

Architecture

MCP Interceptor

Middleware layer between the agent loop and Playwright. Every browser tool call passes through the interceptor for policy evaluation, artifact recording, and risk scoring before execution.

src/browser/mcp-interceptor.ts wraps the @playwright/mcp server. It intercepts every tool call, evaluates it with the policy engine, records the action to the artifact log, and then either executes or denies. This design means policy can be changed without touching Playwright internals.

# Interceptor pipeline per tool call
tool_call = { name: "browser_click", params: { selector: "#submit" } }

1. evaluateBrowserPolicyAction(tool_call)
   → { decision: "allow" | "ask" | "deny", reason }

2. if "ask": pause for user confirmation
   if "deny": return error immediately

3. artifacts.record(tool_call, timestamp)

4. playwright.executeTool(tool_call)
   → { result, screenshot? }

5. artifacts.record(result)
6. return result to agent loop

Artifact log: Every browser action and its result is recorded to the session artifact log. This enables the visual diff pipeline to compare before/after states, and enables session replay.
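
The six pipeline steps can be sketched as a single async function with its collaborators injected. This is a minimal sketch, not the code in src/browser/mcp-interceptor.ts: the `InterceptorDeps` shape, the `confirm` callback, and the `InterceptResult` union are assumptions made for illustration.

```typescript
// Sketch only: every type below is an assumed shape, not the real interface.
type Decision = "allow" | "ask" | "deny";

interface ToolCall {
  name: string;
  params: Record<string, unknown>;
}

interface ArtifactEntry {
  kind: "call" | "result";
  payload: unknown;
  timestamp: number;
}

interface InterceptorDeps {
  evaluate: (call: ToolCall) => { decision: Decision; reason: string };
  confirm: (call: ToolCall, reason: string) => Promise<boolean>;
  record: (entry: ArtifactEntry) => void;
  execute: (call: ToolCall) => Promise<unknown>;
}

type InterceptResult =
  | { ok: true; result: unknown }
  | { ok: false; error: string };

async function interceptToolCall(
  call: ToolCall,
  deps: InterceptorDeps,
): Promise<InterceptResult> {
  // 1. Policy evaluation.
  const { decision, reason } = deps.evaluate(call);
  // 2. "deny" returns immediately; "ask" pauses for user confirmation.
  if (decision === "deny") return { ok: false, error: `denied: ${reason}` };
  if (decision === "ask" && !(await deps.confirm(call, reason))) {
    return { ok: false, error: `declined: ${reason}` };
  }
  // 3-5. Record the call, execute it, then record the result.
  deps.record({ kind: "call", payload: call, timestamp: Date.now() });
  const result = await deps.execute(call);
  deps.record({ kind: "result", payload: result, timestamp: Date.now() });
  // 6. Hand the result back to the agent loop.
  return { ok: true, result };
}
```

Because the policy, recorder, and executor are passed in, each can be swapped or tested without touching the others, which matches the stated goal of changing policy without touching Playwright internals.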

Architecture

Policy Engine

Risk-based policy evaluation for browser tools. Each tool is classified into allow / ask / deny tiers, with domain allowlists for navigation and pattern-based classification for dynamic URLs.

The policy engine in src/browser/policy-engine.ts evaluates each browser tool call. Navigation tools check the target URL against a domain allowlist. Interaction tools (click, type) check the session's current domain. High-risk tools (evaluate, file_upload) require explicit config opt-in.

# Policy config
opta config set browser.policy.allowedDomains "github.com,docs.python.org"
opta config set browser.policy.allowEvaluate false  # default

# Policy evaluation logic
function evaluateBrowserPolicyAction(call) {
  if (call.name === "browser_evaluate") {
    return config.browser.policy.allowEvaluate
      ? "ask" : "deny"
  }
  if (isNavigationTool(call.name)) {
    return isAllowedDomain(call.params.url) ? "allow" : "ask"
  }
  return TIER_MAP[call.name] // allow | ask | deny
}

Autonomy floor: The global autonomy level is resolved with Math.floor rather than Math.round, so a fractional level always falls to the lower, stricter tier. This prevents a floating-point autonomy score from accidentally escalating permissions.
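
A domain allowlist check and the flooring rule might be sketched as follows. Both helpers are assumptions for illustration: the source does not show how `isAllowedDomain` is implemented, and matching subdomains of an allowed domain is a design choice made here, not confirmed behaviour. `resolveAutonomyLevel` is a hypothetical name.

```typescript
// Sketch under assumptions: subdomain matching is an illustrative choice,
// not confirmed behaviour of src/browser/policy-engine.ts.
function isAllowedDomain(rawUrl: string, allowedDomains: string[]): boolean {
  let hostname: string;
  try {
    hostname = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return false; // unparseable URLs are never auto-allowed
  }
  return allowedDomains.some((domain) => {
    const d = domain.trim().toLowerCase();
    // Match the domain itself or a subdomain, never a lookalike suffix.
    return d.length > 0 && (hostname === d || hostname.endsWith(`.${d}`));
  });
}

// Hypothetical helper: flooring keeps a fractional score in the lower tier,
// so 1.9 resolves to 1 where rounding would have produced 2.
function resolveAutonomyLevel(score: number): number {
  return Math.floor(score);
}
```

Comparing parsed hostnames rather than raw strings avoids treating `evil-github.com` as a match for `github.com`.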

Navigation

Navigate & Tabs

Full browser navigation control: URL loading, forward/back history, and multi-tab management. Navigation to a domain on the allowlist is allowed without confirmation; any other domain prompts the user first.

# Navigation tools
browser_navigate     { url: string }
browser_go_back      {}
browser_go_forward   {}

# Tab management
browser_tabs         {}                 → [{ id, url, title }, …]
browser_tab_new      { url?: string }
browser_tab_close    { tabId: string }

# Example: multi-tab comparison
→ browser_navigate    https://github.com/repo/v1
→ browser_tab_new     https://github.com/repo/v2
→ browser_screenshot  (tab 1)
→ browser_screenshot  (tab 2)
→ [model compares screenshots]

Session persistence: Navigation state (cookies, localStorage) persists within a session slot for the duration of a conversation. The agent can log in once and navigate multiple pages without re-authenticating.

Navigation

Wait & Timing

Smart waiting for elements, navigation events, and network idle. Prevents race conditions where the agent acts before dynamic content has loaded.

# Wait tools
browser_wait_for {
  selector?: string,      # CSS selector to appear
  text?: string,          # text content to appear
  event?: "load" | "networkidle" | "domcontentloaded",
  timeout?: number        # ms, default 30000
}

# Pattern: navigate then wait for content
→ browser_navigate https://app.example.com/dashboard
→ browser_wait_for { event: "networkidle" }
→ browser_wait_for { selector: ".data-table" }
→ browser_snapshot  ← now safe to read DOM
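
The waiting behaviour above amounts to polling a condition until it holds or a deadline passes. The helper below is a generic sketch of that idea, not the implementation behind browser_wait_for; `waitForCondition` and its 50 ms polling interval are assumptions, while the 30000 ms default timeout comes from the tool signature.

```typescript
// Generic polling sketch; the function name and 50 ms interval are assumed.
async function waitForCondition(
  check: () => boolean,
  timeoutMs: number = 30000,
  intervalMs: number = 50,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    // Check first so an already-satisfied condition returns without sleeping.
    if (check()) return true;
    if (Date.now() >= deadline) return false;
    await new Promise<void>((resolve) => setTimeout(() => resolve(), intervalMs));
  }
}
```

Returning `false` on timeout, rather than throwing, lets the caller decide whether a missing element is an error or just a reason to take a fresh snapshot.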

Ask — Interaction

Click & Type

Element interaction tools. CSS selectors target elements; text labels work for buttons and links. Confirmation required — these tools mutate page state.

# Click
browser_click { selector: string }
→ browser_click { selector: "button[type=submit]" }
→ browser_click { selector: "text=Sign In" }

# Type
browser_type  { selector: string, text: string }
→ browser_type { selector: "#email", text: "[email protected]" }

# Hover (for tooltips / menus)
browser_hover { selector: string }

# Scroll
browser_scroll { selector?: string, direction: "up"|"down", distance?: number }

Selector priority: Playwright recommends ARIA roles and accessible text labels over CSS selectors. When browser_snapshot exposes accessible names, prefer them; they are more resilient than selectors built on fragile CSS classes.
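
One way to apply this priority when turning a snapshot entry into a selector is sketched below. The `SnapshotElement` shape and `pickSelector` are hypothetical; the only details taken from the text are the `text=` selector form and the note that text labels work for buttons and links.

```typescript
// Hypothetical helper: prefers an accessible text label, falls back to CSS.
interface SnapshotElement {
  role: string;
  accessibleName?: string;
  cssSelector?: string;
}

// Roles for which the text says label-based selectors work.
const TEXT_SELECTOR_ROLES = new Set<string>(["button", "link"]);

function pickSelector(element: SnapshotElement): string | null {
  if (element.accessibleName && TEXT_SELECTOR_ROLES.has(element.role)) {
    return `text=${element.accessibleName}`;
  }
  return element.cssSelector ?? null;
}
```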

Ask — Interaction

Form Filling

Fill multiple form fields in one tool call, or use select_option for dropdowns. Handles text inputs, textareas, checkboxes, and radio buttons.

# Fill multiple fields in one call
browser_fill_form {
  fields: [
    { selector: "#name",    value: "Matthew Byrden" },
    { selector: "#email",   value: "[email protected]" },
    { selector: "#company", value: "Opta Operations" }
  ]
}

# Select dropdown option
browser_select_option {
  selector: "#country",
  value: "AU"          # or label: "Australia"
}

# Pattern: scrape form then fill
→ browser_snapshot  ← read form fields + selectors
→ browser_fill_form { fields: […] }
→ browser_click     { selector: "[type=submit]" }

Ask — Interaction

Keyboard & Drag

Keyboard shortcuts, key chords, and drag-and-drop for UI interactions that can't be achieved with click and type alone.

# Key press (single or chord)
browser_press_key { key: "Enter" }
browser_press_key { key: "Control+A" }
browser_press_key { key: "Escape" }
browser_press_key { key: "Tab" }

# Drag and drop
browser_drag {
  startSelector: ".draggable-item",
  endSelector:   ".drop-zone"
}

# Use cases
→ Ctrl+A then Delete to clear a field
→ Tab navigation through a form
→ Escape to close modals
→ Drag list items to reorder

Ask — Interaction

Dialog Handling

Intercept and respond to browser alert, confirm, and prompt dialogs that would otherwise block automation.

# Handle next dialog before triggering action
browser_handle_dialog {
  action: "accept" | "dismiss",
  promptText?: string    # for window.prompt()
}

# Pattern: handle delete confirmation
→ browser_handle_dialog { action: "accept" }
→ browser_click { selector: "#delete-btn" }
# Dialog intercepted and accepted automatically

# Pattern: dismiss unsaved changes dialog
→ browser_handle_dialog { action: "dismiss" }
→ browser_navigate { url: "https://app.example.com" }
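
The "arm first, then trigger" ordering can be modelled as a one-shot handler that stores a response and consumes it when the next dialog appears. This is a sketch of the pattern only; `OneShotDialogHandler` is a hypothetical class and says nothing about how the tool is actually wired to Playwright.

```typescript
// Hypothetical model of the arm-then-trigger pattern.
type DialogAction = "accept" | "dismiss";

interface DialogResponse {
  action: DialogAction;
  promptText?: string; // only meaningful for window.prompt()
}

class OneShotDialogHandler {
  private armed: DialogResponse | null = null;

  /** Mirrors browser_handle_dialog: stores the response for the next dialog. */
  arm(response: DialogResponse): void {
    this.armed = response;
  }

  /** Called when the page opens a dialog. Consumes the armed response, so a
   *  second dialog gets null unless the handler was armed again. */
  onDialog(): DialogResponse | null {
    const response = this.armed;
    this.armed = null;
    return response;
  }
}
```

Consuming the response after one use keeps an earlier "accept" from silently approving an unrelated dialog later in the session.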

Capture

Screenshot

Full-page or viewport PNG screenshots returned as base64. Used by the agent to see the current page state — the primary visual input for understanding layouts and reading dynamic content.

# Basic screenshot
browser_screenshot {}
→ { data: "data:image/png;base64,…", mimeType: "image/png" }

# Full-page screenshot (scrolls entire page)
browser_screenshot { fullPage: true }

# Screenshot is passed to vision model
# Agent can read text, see layout, understand UI

# Peekaboo alternative (macOS only)
# capturePeekabooScreenPng() — captures from
# Chromium window buffer regardless of focus
# 500ms frame cache to avoid redundant captures

Vision pipeline: Screenshots are base64-encoded PNG blobs that go directly into the model's vision input. The agent can read text, count elements, identify UI patterns, and understand visual hierarchy from screenshots alone.
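
Before handing a screenshot to the vision input, it can be useful to split the data URL and work out the decoded size without decoding it. The helper below is a sketch; `parseScreenshotDataUrl` is a hypothetical name, and it relies only on the `data:<mime>;base64,<payload>` shape shown in the tool output.

```typescript
// Hypothetical helper: validates the data URL and derives the decoded size
// arithmetically (every 4 base64 characters encode 3 bytes, minus padding).
interface ParsedScreenshot {
  mimeType: string;
  base64: string;
  byteLength: number;
}

function parseScreenshotDataUrl(dataUrl: string): ParsedScreenshot | null {
  const match = /^data:([^;,]+);base64,([A-Za-z0-9+/]*={0,2})$/.exec(dataUrl);
  if (!match) return null;
  const mimeType = match[1] ?? "";
  const base64 = match[2] ?? "";
  if (base64.length % 4 !== 0) return null; // malformed payload
  const padding = base64.endsWith("==") ? 2 : base64.endsWith("=") ? 1 : 0;
  return { mimeType, base64, byteLength: (base64.length / 4) * 3 - padding };
}
```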

Capture

DOM Snapshot

Accessibility tree snapshot of the current page. Gives the agent element roles, names, and selectors without needing to parse raw HTML. More reliable than screenshots for identifying interactive elements.

# DOM snapshot
browser_snapshot {}
→ accessibility tree (ARIA roles + names + selectors)

# Example output (excerpt)
button "Sign In" [selector: button[type=submit]]
textbox "Email" [selector: #email]
link "Forgot password?" [selector: a.forgot-link]
heading "Welcome back" [level: 1]

# When to use snapshot vs screenshot:
browser_snapshot  ← find selectors, read form fields
browser_screenshot ← understand visual layout, read images
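
The excerpt format above is regular enough to parse line by line. The sketch below assumes every line follows the `role "name" [key: value]` shape shown in the excerpt; the real snapshot output may contain other forms, and `parseSnapshotLine` is a hypothetical helper.

```typescript
// Sketch: parses one line of the excerpt format shown above.
interface SnapshotNode {
  role: string;
  label: string;
  attrKey: string;   // e.g. "selector" or "level"
  attrValue: string; // e.g. "#email" or "1"
}

// The greedy (.+) followed by \]$ keeps nested brackets such as
// button[type=submit] inside the captured value.
const SNAPSHOT_LINE = /^(\w+) "([^"]*)" \[(\w+): (.+)\]$/;

function parseSnapshotLine(line: string): SnapshotNode | null {
  const match = SNAPSHOT_LINE.exec(line.trim());
  if (!match) return null;
  return {
    role: match[1] ?? "",
    label: match[2] ?? "",
    attrKey: match[3] ?? "",
    attrValue: match[4] ?? "",
  };
}
```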

Capture

Network Requests

Intercept and inspect all network requests made by the page. Useful for understanding API calls, capturing auth tokens in flight, and debugging SPA data fetching.

# Get all network requests since page load
browser_network_requests {}
→ [
    { method: "GET", url: "https://api.example.com/user",
      status: 200, responseBodySize: 1240 },
    { method: "POST", url: "https://api.example.com/session",
      status: 201, headers: { authorization: "Bearer …" } }
  ]

# Use cases:
→ Find the API endpoint a page calls for its data
→ Inspect auth headers to understand auth flow
→ Debug why a page isn't loading data
→ Discover undocumented internal APIs
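
A typical post-processing step is to narrow the request list to one API and mask credentials before they are logged. The helpers below are a sketch; `NetworkRequestEntry` mirrors the fields in the sample output, while `filterByUrlPrefix` and `redactAuthorization` are hypothetical names.

```typescript
// Sketch: field names follow the sample output; helper names are assumed.
interface NetworkRequestEntry {
  method: string;
  url: string;
  status: number;
  responseBodySize?: number;
  headers?: Record<string, string>;
}

function filterByUrlPrefix(
  entries: NetworkRequestEntry[],
  prefix: string,
): NetworkRequestEntry[] {
  return entries.filter((entry) => entry.url.startsWith(prefix));
}

/** Returns a copy with the authorization token masked, keeping the scheme. */
function redactAuthorization(entry: NetworkRequestEntry): NetworkRequestEntry {
  if (!entry.headers) return entry;
  const source = entry.headers;
  const headers: Record<string, string> = {};
  for (const key of Object.keys(source)) {
    const value = source[key] ?? "";
    headers[key] =
      key.toLowerCase() === "authorization"
        ? `${value.split(" ")[0] ?? ""} [redacted]`
        : value;
  }
  return { ...entry, headers };
}
```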

Policy Gated — High Risk

JS Evaluate

Execute arbitrary JavaScript in the page context. Extremely powerful — can read DOM, call page APIs, modify state — but requires explicit config opt-in due to injection risk.

# Enable in config first
opta config set browser.policy.allowEvaluate true

# Execute JS in page context
browser_evaluate { js: "document.title" }
→ "GitHub - opta/cli"

browser_evaluate { js: "window.__APP_STATE__" }
→ { userId: "user_01…", featureFlags: […] }

browser_evaluate {
  js: "Array.from(document.querySelectorAll('h2')).map(e=>e.textContent)"
}
→ ["Setup", "Installation", "Configuration", …]

# Even when enabled, each call still requires confirmation (the policy returns "ask")

Injection risk: browser_evaluate executes in the page's JS context and can access cookies, localStorage, and page state. Only enable on pages you trust. The policy engine denies by default — enable deliberately per-session.

Advanced

Peekaboo TUI

macOS-only: captures the Chromium window buffer directly (regardless of focus) and streams it into the TUI sidebar. Watch what the agent is doing in real time without switching windows.

Peekaboo uses macOS screen capture APIs to grab the Chromium window buffer by app name (PLAYWRIGHT_BROWSER_APP_NAME = "Chromium"). The frame is cached for 500ms to avoid redundant captures. Sensitive text is redacted from the log output before display.

# Peekaboo functions (src/browser/peekaboo.ts)
capturePeekabooScreenPng()      → PNG buffer (500ms cache)
isPeekabooAvailable()           → bool (macOS only)
peekabooClickLabel(label)       → click by accessible label
peekabooPressKey(key)           → synthetic key event
peekabooTypeText(text)          → type into active field
redactPeekabooSensitiveText(t)  → mask passwords/tokens

# TUI sidebar shows live browser preview
# No focus change needed — works behind other windows
# Enabled automatically when browser.mcp.enabled=true (macOS)

macOS only: Peekaboo depends on macOS screen capture permissions. On Windows/Linux, use browser_screenshot for visual feedback instead.
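
The 500 ms frame cache can be sketched as a small wrapper around any capture function. This is an illustration of the caching idea, not the code in src/browser/peekaboo.ts; `createFrameCache` and the injectable clock are assumptions added so the behaviour can be tested without real time passing.

```typescript
// Hypothetical wrapper: reuses the last frame while it is younger than ttlMs.
function createFrameCache<T>(
  capture: () => T,
  ttlMs: number = 500,
  now: () => number = Date.now, // injectable clock for tests
): () => T {
  let cached: { frame: T; capturedAt: number } | null = null;
  return (): T => {
    const currentTime = now();
    if (cached !== null && currentTime - cached.capturedAt < ttlMs) {
      return cached.frame; // still fresh, skip the expensive capture
    }
    const fresh = { frame: capture(), capturedAt: currentTime };
    cached = fresh;
    return fresh.frame;
  };
}
```

With this shape, a sidebar that redraws many times per second triggers at most one capture per 500 ms window.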

Advanced

Visual Diff

Pixel-level before/after screenshot comparison. Used automatically by the interceptor to detect unexpected page changes after interactions, and available as an explicit tool for regression testing.

# Visual diff pipeline (src/browser/visual-diff.ts)

# Automatic: MCP interceptor captures before/after
before = capture()         ← before action
execute_tool(action)
after  = capture()         ← after action
diff = pixelDiff(before, after)
→ { changed: true, changedRegions: [{ x, y, w, h }] }

# Diff result available in tool result payload
tool_result.visualDiff = {
  changedPixelPercent: 0.12,
  significantChange: true
}

# Agent uses diff to verify actions had expected effect
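
A pixel diff over two raw RGBA buffers could be sketched as below. Several details are assumptions rather than facts about src/browser/visual-diff.ts: the buffers are taken to be already decoded to RGBA, `changedPixelPercent` is treated as a 0 to 100 percentage, the 0.1 significance threshold is a placeholder, and all changes are reported as one bounding box rather than separate regions.

```typescript
// Sketch under assumptions: RGBA input, percentage scale, single bounding box.
interface DiffResult {
  changed: boolean;
  changedPixelPercent: number; // 0 to 100
  significantChange: boolean;
  changedRegions: Array<{ x: number; y: number; w: number; h: number }>;
}

function pixelDiff(
  before: Uint8Array,
  after: Uint8Array,
  width: number,
  height: number,
  significantPercent: number = 0.1, // assumed threshold
): DiffResult {
  const expectedLength = width * height * 4;
  if (before.length !== expectedLength || after.length !== expectedLength) {
    throw new Error("buffers must hold width * height * 4 bytes (RGBA)");
  }
  let changedPixels = 0;
  let minX = width, minY = height, maxX = -1, maxY = -1;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      const differs =
        before[i] !== after[i] || before[i + 1] !== after[i + 1] ||
        before[i + 2] !== after[i + 2] || before[i + 3] !== after[i + 3];
      if (!differs) continue;
      changedPixels++;
      minX = Math.min(minX, x); maxX = Math.max(maxX, x);
      minY = Math.min(minY, y); maxY = Math.max(maxY, y);
    }
  }
  const totalPixels = width * height;
  const changedPixelPercent = totalPixels === 0 ? 0 : (changedPixels / totalPixels) * 100;
  return {
    changed: changedPixels > 0,
    changedPixelPercent,
    significantChange: changedPixelPercent >= significantPercent,
    changedRegions: changedPixels > 0
      ? [{ x: minX, y: minY, w: maxX - minX + 1, h: maxY - minY + 1 }]
      : [],
  };
}
```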

Advanced

Sub-Agent Delegation

Delegate complex multi-step browser tasks to a full-peer specialist agent. The sub-agent gets its own browser session, tool set, and conversation context — returning a structured result to the parent agent.

delegateToBrowserSubAgent() in src/browser/sub-agent-delegator.ts spawns a complete child agent with a browser-focused system prompt and tool permissions. The parent agent continues once the sub-agent completes or hits a defined checkpoint. Ideal for goals like "go to this site and fill out the form".

# Sub-agent delegation
opta do "go to linear.app and create an issue for the bug in #9234"

Parent agent:
→ delegateToBrowserSubAgent({ goal: "create Linear issue for #9234" })

Browser sub-agent (full peer):
→ browser_navigate https://linear.app
→ browser_click { selector: "text=New Issue" }
→ browser_fill_form { fields: [{ title: "Bug: …" }] }
→ browser_click { selector: "text=Create Issue" }
→ return { issueId: "ENG-421", url: "https://…" }

Parent agent resumes with result.

MCP alternative: If the target site has an MCP server (e.g. Linear, GitHub), prefer the MCP tool over browser delegation — it's more reliable and doesn't require UI interaction. Browser delegation is for sites without MCP coverage.

Advanced

Session Replay

Deterministic replay of a recorded browser session from its action log. Useful for regression testing, debugging, and demonstrating a workflow without re-running the full agent.

src/browser/replay.ts reads the artifact action log from a previous session and re-executes each browser action in order. Actions are timed to match original delays. Screenshots are captured at each step for side-by-side comparison with the original.

# Replay a recorded session
opta browser replay --session sess_01J5X3…

◆ Loading action log (42 actions)…
  → browser_navigate https://app.example.com     ✓
  → browser_fill_form { email, password }         ✓
  → browser_click { selector: "#submit" }         ✓
  → browser_wait_for { event: "networkidle" }     ✓
  …
✓ Replay complete. 42/42 actions succeeded.
  Visual diff: 3 regions changed vs original.

# Action log stored in session artifact dir
# ~/.config/opta/sessions/sess_01J5…/artifacts/
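
Matching the original delays comes down to turning recorded timestamps into per-step waits. The sketch below assumes each log entry carries a millisecond `timestamp` and that the log is already in execution order; the actual artifact log format is not shown in this document, so `RecordedAction` and `buildReplaySchedule` are hypothetical.

```typescript
// Sketch: the log entry shape is assumed, not taken from the artifact format.
interface RecordedAction {
  tool: string;
  params: Record<string, unknown>;
  timestamp: number; // milliseconds, assumed
}

interface ScheduledAction {
  action: RecordedAction;
  delayMs: number; // wait before executing this action
}

function buildReplaySchedule(log: RecordedAction[]): ScheduledAction[] {
  return log.map((action, index) => {
    const previous = index > 0 ? log[index - 1] : undefined;
    const previousTimestamp = previous ? previous.timestamp : action.timestamp;
    // Clamp at zero so out-of-order timestamps never yield a negative wait.
    return { action, delayMs: Math.max(0, action.timestamp - previousTimestamp) };
  });
}
```

A replay loop would then wait `delayMs` before each step, execute the action, and capture a screenshot for comparison with the original run.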