
9 min read

Playwright Is Not Good Enough for Agents

April 15, 2026

Humans survive the mess of the modern web because we bring intuition.

Traditional automation survives it by bringing test scripts, retries, explicit selectors, and lots of maintenance.

Agents survive it badly because we keep handing them tools designed for one of those two worlds and pretending that is enough.

Browser automation for agents should not be a thin wrapper around a human testing framework.

If you build agentic products, you need a browser runtime designed for agent loops: a system you can operate in production.

We need browser automation designed for AI agents, CI pipelines, and developers who want programmatic control without the overhead of a full testing framework.

We need a fast, native browser automation CLI powered by the Chrome DevTools Protocol, with commands designed to answer the following question:

What is the smallest, most deterministic, most inspectable control surface I can expose to a model that needs to perceive a page, take action, and recover from ambiguity?

Let’s see how we can achieve that with gsd-browser:

  • Understanding the browser problem in agent systems
  • An agent-native approach to browser automation
  • Setup and first run
  • Building reliable browser automation workflows

The Real Browser Problem in Agent Systems

When people say browser automation, they often collapse three very different jobs into one bucket:

  • End-to-end testing
  • Robotic process automation
  • Agent execution in an uncertain environment

Those are not the same workload.

End-to-end testing assumes the engineer knows what should happen, writes an explicit script, and mainly needs reproducibility.

RPA assumes a relatively stable enterprise flow, where the point is often to imitate user actions across legacy systems.

Agent execution is different: the model is trying to infer state, choose a next step, act under uncertainty, and then re-plan after each result.

That last requirement is where most current tooling starts to crack.

Playwright’s public API docs still introduce automation through a flow that launches a browser, opens a page, navigates, performs actions, and closes the browser.

Its core strengths are obvious and valuable for testing: cross-browser support, multiple language bindings, and auto-waiting behavior.

But those are the strengths of a general browser automation framework, not necessarily the strengths of an agent-native interaction model.

If you are a developer writing test code, a script-oriented abstraction is fine. You can inspect the DOM, craft selectors, add retries, and debug failures locally.

If you are an agent, every one of those assumptions becomes friction:

  • The agent often does not know the selector.
  • The page may re-render between observation and action.
  • The agent needs machine-readable state, not just raw HTML.
  • The cost of one wrong interaction compounds across the whole task.
  • Cold-start latency and process overhead show up in every tool call.

This is why browser automation is such a revealing benchmark for agent system design.

It forces you to answer a deeper question: do you actually have an execution environment for machine actors, or do you just have human tooling with an LLM duct-taped on top?

I have become increasingly convinced that many agent failures are really interface failures.

We point to the planner, the model, the prompt, or the reasoning depth, but often the system died because the environment handed the model the wrong primitives.

If the only way an agent can succeed is by guessing CSS selectors, parsing brittle HTML, or reconstructing page state from screenshots, we just built a failure amplifier.

What Agent-Native Looks Like

The most interesting thing about gsd-browser is that its features compose into an operating model that seems designed around agent failure modes rather than browser-engine elegance.

It ships as one binary, with an install path centered on curling a script that pulls down the binary and Chromium, so you can be up and running quickly.

If every command has --json, the tool starts behaving like a local machine API with a human-friendly shell.

A snapshot returning handles like @v1:e1 turns page interaction into stateful coordination.

The version tells the agent which page state produced the reference, and the element handle tells it what it can act on from that state. That prevents a common failure mode where the model “remembers” a selector or label from an earlier page state and blindly reuses it after the DOM has shifted.
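That contract is easy to enforce mechanically. Here is a small TypeScript sketch; the `@version:element` shape comes from the article, but `parseRef` and `isStale` are hypothetical helpers, not part of gsd-browser:

```typescript
// Split a versioned handle like "@v1:e1" into its parts so an orchestrator
// can reject references that don't come from the current page state.
type ElementRef = { version: string; element: string };

function parseRef(ref: string): ElementRef {
  const match = /^@(v\d+):(e\d+)$/.exec(ref);
  if (!match) throw new Error(`Malformed element ref: ${ref}`);
  return { version: match[1], element: match[2] };
}

// A ref is stale if it was minted against an older snapshot version.
function isStale(ref: string, currentVersion: string): boolean {
  return parseRef(ref).version !== currentVersion;
}
```

With this in place, reusing a `v1` handle after the page has moved to `v2` becomes a hard error instead of a silent misclick.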

Semantic intents are the other half of the bet.

If the agent can call something like act login or act accept-cookies, you are moving one level above browser mechanics into action semantics.

There is also a systems-design insight hiding in the daemon architecture.

Instead of treating each invocation as a fresh browser session, gsd-browser keeps a daemon process with a persistent CDP connection and sends commands over local IPC.

When the first command boots the daemon and everything after that is effectively instant, the browser stops feeling like a heavyweight sidecar and starts feeling like a local execution substrate.

This matters in at least four ways:

  • Lower latency means tighter action-observation loops.
  • Persistent process state means fewer reconnection and session-handling bugs.
  • Local IPC means the automation boundary is cheaper than a remote service hop.
  • CDP persistence makes multi-step workflows feel like one continuous interaction rather than a chain of mini-scripts.

It provides 63 commands spanning navigation, screenshots, accessibility trees, form analysis, network mocking, HAR export, visual regression diffing, encrypted auth vaults, test generation, device emulation, and frame management.

Let’s see gsd-browser in action.

Setup and First Run

Install gsd-browser:

```bash
curl -fsSL https://raw.githubusercontent.com/gsd-build/gsd-browser/main/install.sh | bash
```

The daemon starts automatically on first use:

Navigate to a page:

```bash
gsd-browser navigate https://example.com
```

On example.com the only interactive element is the "More information..." link:

```bash
gsd-browser click-ref @v1:e1
```

Wait for navigation and assert the result:

```bash
gsd-browser wait-for --condition network_idle
gsd-browser assert --checks '[{"kind":"url_contains","text":"iana.org"}]'
```

Capture a PNG:

```bash
gsd-browser screenshot --output page.png --format png
```

A better hello world for an agent-native browser is: observe state, choose action, execute deterministically, and re-observe, instead of "open page, click link, exit."

That loop is what autonomous systems actually need.
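In TypeScript, that loop has a simple skeleton. The browser calls are injected here, so everything except the loop shape (observe, act, re-observe) is an assumption rather than gsd-browser's actual API:

```typescript
// Minimal observe-decide-act-reobserve loop. The concrete snapshot and act
// implementations are supplied by the caller; only the control flow matters.
type Snapshot = { version: string; done: boolean };

type Browser = {
  snapshot(): Promise<Snapshot>;
  act(state: Snapshot): Promise<void>;
};

async function runLoop(browser: Browser, maxSteps = 10): Promise<Snapshot> {
  let state = await browser.snapshot();          // observe before anything else
  for (let step = 0; step < maxSteps && !state.done; step++) {
    await browser.act(state);                    // act only on the freshest state
    state = await browser.snapshot();            // re-observe before deciding again
  }
  return state;
}
```

The `maxSteps` bound is deliberate: an agent loop without a step budget turns any ambiguity into an infinite retry.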

Here’s a minimal observation loop:

Snapshot interactive elements:

```bash
gsd-browser snapshot --json
```

Illustrative output:

```json
{
  "version": "v1",
  "url": "https://app.example.com",
  "title": "Example App",
  "elements": [
    {
      "ref": "@v1:e1",
      "role": "button",
      "name": "Accept all cookies"
    },
    {
      "ref": "@v1:e2",
      "role": "button",
      "name": "Log in"
    }
  ]
}
```

This has a page-state version, durable element references scoped to that version, and enough semantic labeling to decide what to do next without parsing a huge DOM blob.
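Consuming that payload in an orchestrator is straightforward. The types below mirror the illustrative JSON above; the field names come from that example, not from a published schema:

```typescript
// Pick an actionable element from a snapshot by role and accessible name,
// instead of guessing CSS selectors against raw HTML.
type SnapshotElement = { ref: string; role: string; name: string };

type PageSnapshot = {
  version: string;
  url: string;
  elements: SnapshotElement[];
};

function findElement(
  snap: PageSnapshot,
  role: string,
  namePattern: RegExp
): SnapshotElement | undefined {
  return snap.elements.find(
    (el) => el.role === role && namePattern.test(el.name)
  );
}
```

Because the lookup works over roles and names, the model's decision input stays small and semantic, and the returned ref is already scoped to the snapshot version it came from.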

Prefer intent before mechanics:

```bash
# Let the tool resolve a common action semantically
gsd-browser act accept-cookies --json

# Then handle authentication
gsd-browser act login --json
```

Semantic intents remove a whole class of brittle prompt-time reasoning. Instead of making the model infer which button among five candidates corresponds to the right business action, the browser layer can absorb that ambiguity and return a structured success or failure payload.
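A structured payload makes that branching explicit. This is one plausible shape for an intent result, not gsd-browser's documented output:

```typescript
// Discriminated union an orchestrator can branch on without parsing prose.
// All field names here are illustrative assumptions.
type IntentResult =
  | { status: "ok"; intent: string; snapshotVersion: string }
  | { status: "ambiguous"; intent: string; candidates: string[] }
  | { status: "failed"; intent: string; reason: string };

function nextStep(result: IntentResult): "continue" | "disambiguate" | "recover" {
  switch (result.status) {
    case "ok":
      return "continue";
    case "ambiguous":
      return "disambiguate"; // fall back to a fresh snapshot and versioned refs
    case "failed":
      return "recover";      // retry, re-observe, or escalate
  }
}
```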

Fall back to versioned refs when needed:

Re-snapshot after the page changes:

```bash
gsd-browser snapshot --json
```

For example, the model should select a concrete element from the latest state and act on @v2:e4 rather than reusing an old handle from v1.

Even when you need low-level control, the versioned ref model gives you something much better than free-form selectors.

It encourages a discipline that every serious agent stack should adopt: no action without fresh observation, and no reference reuse across state transitions unless the tool explicitly guarantees it.

Use the Browser Like Infrastructure

The CLI shape also makes it easy to wrap in your orchestration layer.

A product team does not need a heavy browser SDK dependency if the local process contract is already stable and machine-readable.

For example, a TypeScript service can treat gsd-browser as a subprocess-backed tool:

```ts
import { execa } from "execa";

type BrowserResult<T = unknown> = {
  ok: boolean;
  data?: T;
  error?: string;
};

async function runBrowser(args: string[]): Promise<BrowserResult> {
  try {
    const { stdout } = await execa("gsd-browser", [...args, "--json"]);
    return { ok: true, data: JSON.parse(stdout) };
  } catch (error: any) {
    return { ok: false, error: error.stderr || error.message };
  }
}

async function loginFlow() {
  const snapshot1 = await runBrowser(["snapshot"]);
  if (!snapshot1.ok) throw new Error(snapshot1.error);

  const accept = await runBrowser(["act", "accept-cookies"]);
  if (!accept.ok) throw new Error(accept.error);

  const login = await runBrowser(["act", "login"]);
  if (!login.ok) throw new Error(login.error);

  const snapshot2 = await runBrowser(["snapshot"]);
  if (!snapshot2.ok) throw new Error(snapshot2.error);

  return snapshot2.data;
}
```

This is the right level of abstraction for many production systems.

Your application code does not know anything about selectors, accessibility-tree parsing, or browser driver lifecycle, but it knows it can invoke a stable local binary, receive JSON, record the result, and decide the next action.

Building Reliable Workflows

Reliability comes from workflow rules, and gsd-browser is compelling because its model lends itself to rules you can actually enforce.

Here is a workflow pattern I would recommend.

1. Observe Before Every Decision

Never let the model act off memory when the page may have changed. Require a fresh snapshot before each decision boundary, and attach the snapshot version to the planner state.

```ts
type PlannerState = {
  snapshotVersion: string;
  lastSnapshot: unknown;
  objective: string;
};

function requireFreshVersion(state: PlannerState, actionRef: string) {
  const version = actionRef.split(":")[0].replace("@", "");
  if (version !== state.snapshotVersion) {
    throw new Error(
      `Stale element reference. Expected ${state.snapshotVersion}, got ${version}`
    );
  }
}
```

This seems strict, but it turns a fuzzy browser problem into a tractable state-management problem.

Once the tool exposes versioned refs, your orchestrator can enforce causal consistency instead of hoping the model implicitly keeps track of it.

2. Prefer Semantic Actions First

When an intent exists, use it before low-level interaction.

Common high-value tasks should be expressed as intents because both users and product flows understand them semantically (e.g. built-in intents such as login and accept-cookies).

A practical action policy looks like this:

  • Try semantic intent.
  • If intent is unavailable or ambiguous, inspect the latest snapshot.
  • Select a ref from the latest version.
  • Re-snapshot immediately after the action.
  • Evaluate whether the page moved toward the task goal.

This is a reusable system contract.
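The policy reads almost directly as code. The three callbacks are injected, so nothing below depends on gsd-browser's real command surface; it is a sketch of the contract, with the caller responsible for re-snapshotting after the action:

```typescript
// Intent-first action policy: try the semantic intent, and only fall back to
// a versioned ref selected from a fresh snapshot when the intent fails.
type Outcome = { ok: boolean; ref?: string };

async function actWithPolicy(
  tryIntent: () => Promise<Outcome>,           // e.g. `act login --json`
  freshSnapshotRef: () => Promise<string>,     // re-snapshot, pick a ref
  clickRef: (ref: string) => Promise<Outcome>  // e.g. `click-ref @v2:e4`
): Promise<Outcome> {
  const viaIntent = await tryIntent();         // 1. try the semantic intent
  if (viaIntent.ok) return viaIntent;
  const ref = await freshSnapshotRef();        // 2-3. inspect latest state, pick a ref
  return clickRef(ref);                        // 4. act; caller re-snapshots after
}
```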

3. Separate Action from Evaluation

One reason browser agents degrade over long sessions is that the same model is often responsible for acting and judging success.

With JSON outputs and deterministic refs, you can split those responsibilities cleanly.

  • The actor chooses the next browser command.
  • The evaluator checks whether the returned state matches the expected progress marker.
  • The recovery policy decides whether to retry, fall back, or escalate.

Because gsd-browser is designed around structured outputs, these roles can operate over explicit data rather than natural-language summaries.
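One way to encode that split: make each role a function over structured data, so no role has to interpret another's prose. All the types and thresholds here are illustrative:

```typescript
// Three separated roles: the actor proposes, the evaluator judges against an
// explicit progress marker, and the recovery policy decides what happens next.
type Action = { command: string; args: string[] };
type Observation = { url: string; progressMarker?: string };
type Verdict = "progress" | "stalled";
type Recovery = "retry" | "fallback" | "escalate";

const actor = (goal: string): Action =>
  ({ command: "act", args: [goal] });

const evaluator = (obs: Observation, expected: string): Verdict =>
  obs.progressMarker === expected ? "progress" : "stalled";

// null means no recovery needed; the retry budget (2) is an arbitrary choice.
const recoveryPolicy = (verdict: Verdict, attempts: number): Recovery | null =>
  verdict === "progress" ? null : attempts < 2 ? "retry" : "escalate";
```

Keeping the evaluator blind to the actor's reasoning is the point: it can only see the observation and the expected marker, so it cannot rationalize a bad action after the fact.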

4. Treat Screenshots, Trees, and Diffs as Different Sensing Modes

Agents should not have one sensing mode; they should have several.

Use them intentionally:

  • Accessibility-tree or snapshot mode for compact planning and deterministic actions.
  • Screenshot mode for UI verification, visual anomalies, or human review.
  • Diff mode for regression checks and “did anything actually change?” verification.
  • Form analysis for structured field discovery before filling workflows.
  • HAR and network features when the browser is really a debugging instrument.

A mature agent system needs both action and evidence.

5. Build for Warm Sessions, Not One-Shot Scripts

The daemon-backed architecture changes optimal behavior.

Since the first command pays the startup cost and subsequent commands ride a persistent CDP connection, your system should exploit that.

That means:

  • Keep browser sessions alive across related actions.
  • Batch logically adjacent tasks into warm windows.
  • Use the same session for observe-act-verify loops.
  • Avoid spawning a new environment per micro-step unless isolation is necessary.

Agent systems do better when the browser feels like a long-lived collaborator.

Concluding Thoughts

Playwright MCP exposes browser capabilities to agents and centers accessibility snapshots as the state representation. Agent-browser emphasizes refs, sessions, compact output, and a native CLI for agent use.

They are converging on the same diagnosis: browser automation for models needs different primitives than browser automation for human-authored tests.

gsd-browser takes that diagnosis and pushes it toward a cleaner engineering stance:

  • Installation should be trivial enough to disappear.
  • Runtime dependencies should be minimal enough to standardize anywhere.
  • Output should be structured enough to plug into orchestrators and evaluators.
  • Element references should be deterministic enough to survive real planning loops.
  • Common user actions should be semantic enough to reduce model burden.
  • Sessions should be warm enough that latency stops dominating the loop.
  • Artifact generation should be rich enough that debugging is evidence-based.

That is why this project is a more opinionated answer to the question the whole space is now circling around: what does browser control look like when the software operator is no longer a human at a keyboard, but a reasoning system trying to act with bounded context and imperfect certainty?