Why UI Automation Becomes Brittle
The hardest part of maintaining UI automation is not writing the tests — it’s keeping them alive. Minor UI changes repeatedly break locators and flows, forcing testers to spend cycles chasing failures that are unrelated to real product defects.
As maintenance effort increases, coverage usually shrinks to only the most common paths. Edge cases, recovery flows, and unexpected user behavior often remain untested because the cost of maintaining those scripts becomes too high.
Brittle Selectors
A renamed field ID or reordered component breaks your suite. Someone has to manually hunt the changed selector and fix it.
Low Coverage
Quality engineer script the happy path. Edge cases, error states, and non-obvious flows never get written — until they break in production.
High Maintenance
UI automation frameworks often become difficult to maintain over time, especially when ownership is spread across multiple teams and priorities shift toward feature delivery.
Describe What to Test,
Not How to Test It
Instead of scripting every click and selector, you write a plain English mission. The LLM navigates the live UI, explores it like a real user, and produces a report — always against the current state of the app.
When the UI changes, you re-run the mission. The agent rediscovers the pathway. No debugging session. No selector archaeology.
Three Components,
~70 Lines of Code
The entire POC is intentionally minimal. The goal is to demonstrate the concept clearly, not to over-engineer infrastructure.
Anthropic Python SDK
Connects to the Claude API and manages the agent loop — decide, act, observe, repeat.
Playwright MCP
Microsoft's MCP server. Gives Claude real browser control: navigate, click, fill forms, read the DOM.
MCP Python SDK
Bridges Python to the Playwright MCP server using the Model Context Protocol standard.
Python 3.10+
Async runtime powering the agent loop and tool call handling.
Agentic vs. Scripted Testing
The shift isn't just technical — it's a change in how you think about test coverage. You stop asking "did someone write a test for this?" and start asking "did the agent explore this?" Those are very different questions, and the second one scales.
| Dimension | Traditional UI Automation | Agentic Testing |
|---|---|---|
| Test creation | Script every click and selector manually | Describe the goal in plain English |
| After a UI refactor | Hunt broken selectors, fix scripts | Re-run the mission against the live UI |
| Coverage | Happy path — what the developer planned | Happy path + edge cases + error flows |
| Maintenance | High — always someone's debt | Low — mission descriptions don't break |
| Best for | Regression on known stable flows | Discovery, new features, exploratory testing |
What a Real Run Looks Like
Below is the actual terminal output from running the agent against
saucedemo.com — a standard demo e-commerce app used for QA practice.
Run It Yourself
Five steps from zero to a running agent.
Install Python dependencies
pip install anthropic mcp
Install Playwright MCP and browsers
npm install -g @playwright/mcp npx playwright install chromium
Set your Anthropic API key
Never hardcode keys in source files. Set it as an environment variable.
# Windows PowerShell $env:ANTHROPIC_API_KEY="your-key-here" # Mac / Linux export ANTHROPIC_API_KEY="your-key-here"
Edit the mission in qa_agent.py
TARGET_URL = "https://your-app.com" MISSION = """ Test the checkout flow as two personas: 1. A quick shopper who knows what they want 2. An indecisive shopper who adds, removes, changes items Flag anything broken, confusing, or incomplete. """
Run the agent
python qa_agent.py
What I'd Add
in a Real Codebase
This is an intentionally simple POC — the goal is concept clarity, not production infrastructure. In a real engineering context, I'd layer in the following:
CI/CD Integration
GitHub Actions with PR label or comment triggers. Agent runs when a human intends it, not on every commit.
Codification Pipeline
Auto-convert agent-discovered pathways into deterministic Playwright regression scripts for repeatability.
Multi-Persona Missions
Multiple user personas per run to simulate diverse behaviors — power user, first-time user, error-prone user.
Screenshot Audit Trail
Capture screenshots at key steps. Visual evidence for bug reports and stakeholder communication.
Cost Controls
Turn limits and max_tokens caps to prevent runaway agent loops in CI. Predictable API spend per run.
Environment Separation
Agent always runs against staging. Never production. Test data only — nothing sensitive in the context window.