AI-Powered QA Agent — Portfolio POC

01 — The Problem

Why UI Automation Becomes Brittle

The hardest part of maintaining UI automation is not writing the tests — it’s keeping them alive. Minor UI changes repeatedly break locators and flows, forcing testers to spend cycles chasing failures that are unrelated to real product defects.

As maintenance effort increases, coverage usually shrinks to only the most common paths. Edge cases, recovery flows, and unexpected user behavior often remain untested because the cost of maintaining those scripts becomes too high.

🔧

Brittle Selectors

A renamed field ID or reordered component breaks your suite. Someone has to manually hunt the changed selector and fix it.

📉

Low Coverage

Quality engineer script the happy path. Edge cases, error states, and non-obvious flows never get written — until they break in production.

⏳

High Maintenance

UI automation frameworks often become difficult to maintain over time, especially when ownership is spread across multiple teams and priorities shift toward feature delivery.

02 — The Approach

Describe What to Test,
Not How to Test It

Instead of scripting every click and selector, you write a plain English mission. The LLM navigates the live UI, explores it like a real user, and produces a report — always against the current state of the app.

When the UI changes, you re-run the mission. The agent rediscovers the pathway. No debugging session. No selector archaeology.

Input

Plain English Mission

→

LLM
Claude Agent

↔

Protocol

Playwright MCP

→

Output

Bug Report

03 — Tech Stack

Three Components,
~70 Lines of Code

The entire POC is intentionally minimal. The goal is to demonstrate the concept clearly, not to over-engineer infrastructure.

Anthropic Python SDK

Connects to the Claude API and manages the agent loop — decide, act, observe, repeat.

Playwright MCP

Microsoft's MCP server. Gives Claude real browser control: navigate, click, fill forms, read the DOM.

MCP Python SDK

Bridges Python to the Playwright MCP server using the Model Context Protocol standard.

Python 3.10+

Async runtime powering the agent loop and tool call handling.

04 — Why It Matters

Agentic vs. Scripted Testing

The shift isn't just technical — it's a change in how you think about test coverage. You stop asking "did someone write a test for this?" and start asking "did the agent explore this?" Those are very different questions, and the second one scales.

Dimension	Traditional UI Automation	Agentic Testing
Test creation	Script every click and selector manually	Describe the goal in plain English
After a UI refactor	Hunt broken selectors, fix scripts	Re-run the mission against the live UI
Coverage	Happy path — what the developer planned	Happy path + edge cases + error flows
Maintenance	High — always someone's debt	Low — mission descriptions don't break
Best for	Regression on known stable flows	Discovery, new features, exploratory testing

05 — Live Run Output

What a Real Run Looks Like

Below is the actual terminal output from running the agent against saucedemo.com — a standard demo e-commerce app used for QA practice.

python qa_agent.py

🤖 Agent starting...

Navigating to https://www.saucedemo.com

🔧 Using tool: browser_navigate

🔧 Using tool: browser_snapshot

🔧 Using tool: browser_fill_form

🔧 Using tool: browser_click

## Scenario 1: Happy Path

✅ Login page loads correctly

✅ Valid credentials accepted — reached inventory page

✅ Product listings display with images, prices, descriptions

⚠️ Add to cart button has no visible effect

⚠️ Cart badge does not update after clicking Add to Cart

⚠️ Cart remains empty — items not persisting

🔧 Using tool: browser_navigate

🔧 Using tool: browser_type

🔧 Using tool: browser_click

## Scenario 2: Error Path — Invalid Login

✅ Invalid credentials do not grant access

✅ User remains on login page

⚠️ No error message shown to the user

⚠️ Error container exists in DOM but is never populated

⚠️ Locked-out user receives no feedback

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

VERDICT: 🔴 2 critical bugs found

Cart functionality broken. Login error handling missing.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Done.

06 — Setup

Run It Yourself

Five steps from zero to a running agent.

Install Python dependencies

pip install anthropic mcp

Install Playwright MCP and browsers

npm install -g @playwright/mcp
npx playwright install chromium

Set your Anthropic API key

Never hardcode keys in source files. Set it as an environment variable.

# Windows PowerShell
$env:ANTHROPIC_API_KEY="your-key-here"

# Mac / Linux
export ANTHROPIC_API_KEY="your-key-here"

Edit the mission in qa_agent.py

TARGET_URL = "https://your-app.com"

MISSION = """
Test the checkout flow as two personas:
1. A quick shopper who knows what they want
2. An indecisive shopper who adds, removes, changes items
Flag anything broken, confusing, or incomplete.
"""

Run the agent

python qa_agent.py

View the Complete Source Code →

Explore the full Python implementation of the QA Agent, including the agent loop, tool execution, and Playwright MCP integration.

07 — Production Considerations

What I'd Add
in a Real Codebase

This is an intentionally simple POC — the goal is concept clarity, not production infrastructure. In a real engineering context, I'd layer in the following:

CI/CD Integration

GitHub Actions with PR label or comment triggers. Agent runs when a human intends it, not on every commit.

Codification Pipeline

Auto-convert agent-discovered pathways into deterministic Playwright regression scripts for repeatability.

Multi-Persona Missions

Multiple user personas per run to simulate diverse behaviors — power user, first-time user, error-prone user.

Screenshot Audit Trail

Capture screenshots at key steps. Visual evidence for bug reports and stakeholder communication.

Cost Controls

Turn limits and max_tokens caps to prevent runaway agent loops in CI. Predictable API spend per run.

Environment Separation

Agent always runs against staging. Never production. Test data only — nothing sensitive in the context window.

Why UI Automation Becomes Brittle

Brittle Selectors

Low Coverage

High Maintenance

Describe What to Test,Not How to Test It

Three Components,~70 Lines of Code

Anthropic Python SDK

Playwright MCP

MCP Python SDK

Python 3.10+

Agentic vs. Scripted Testing

What a Real Run Looks Like

Run It Yourself

Install Python dependencies

Install Playwright MCP and browsers

Set your Anthropic API key

Edit the mission in qa_agent.py

Run the agent

What I'd Addin a Real Codebase

CI/CD Integration

Codification Pipeline

Multi-Persona Missions

Screenshot Audit Trail

Cost Controls

Environment Separation

Describe What to Test,
Not How to Test It

Three Components,
~70 Lines of Code

What I'd Add
in a Real Codebase