Portfolio · Proof of Concept · QA Engineering

AI-Powered
QA Agent

An LLM that navigates your app, executes test scenarios described in plain English, and produces a structured bug report — without a single hardcoded selector.

Stack Python · Claude API · Playwright MCP
Type Agentic Testing POC
Lines of Code ~70 meaningful lines
Scroll to explore
01 — The Problem

Why UI Automation Becomes Brittle

The hardest part of maintaining UI automation is not writing the tests — it’s keeping them alive. Minor UI changes repeatedly break locators and flows, forcing testers to spend cycles chasing failures that are unrelated to real product defects.

As maintenance effort increases, coverage usually shrinks to only the most common paths. Edge cases, recovery flows, and unexpected user behavior often remain untested because the cost of maintaining those scripts becomes too high.

🔧

Brittle Selectors

A renamed field ID or reordered component breaks your suite. Someone has to manually hunt the changed selector and fix it.

📉

Low Coverage

Quality engineer script the happy path. Edge cases, error states, and non-obvious flows never get written — until they break in production.

High Maintenance

UI automation frameworks often become difficult to maintain over time, especially when ownership is spread across multiple teams and priorities shift toward feature delivery.

02 — The Approach

Describe What to Test,
Not How to Test It

Instead of scripting every click and selector, you write a plain English mission. The LLM navigates the live UI, explores it like a real user, and produces a report — always against the current state of the app.

When the UI changes, you re-run the mission. The agent rediscovers the pathway. No debugging session. No selector archaeology.

Input
Plain English Mission
LLM
Claude Agent
Protocol
Playwright MCP
Output
Bug Report
03 — Tech Stack

Three Components,
~70 Lines of Code

The entire POC is intentionally minimal. The goal is to demonstrate the concept clearly, not to over-engineer infrastructure.

Anthropic Python SDK

Connects to the Claude API and manages the agent loop — decide, act, observe, repeat.

Playwright MCP

Microsoft's MCP server. Gives Claude real browser control: navigate, click, fill forms, read the DOM.

MCP Python SDK

Bridges Python to the Playwright MCP server using the Model Context Protocol standard.

Python 3.10+

Async runtime powering the agent loop and tool call handling.

04 — Why It Matters

Agentic vs. Scripted Testing

The shift isn't just technical — it's a change in how you think about test coverage. You stop asking "did someone write a test for this?" and start asking "did the agent explore this?" Those are very different questions, and the second one scales.

Dimension Traditional UI Automation Agentic Testing
Test creation Script every click and selector manually Describe the goal in plain English
After a UI refactor Hunt broken selectors, fix scripts Re-run the mission against the live UI
Coverage Happy path — what the developer planned Happy path + edge cases + error flows
Maintenance High — always someone's debt Low — mission descriptions don't break
Best for Regression on known stable flows Discovery, new features, exploratory testing
05 — Live Run Output

What a Real Run Looks Like

Below is the actual terminal output from running the agent against saucedemo.com — a standard demo e-commerce app used for QA practice.

python qa_agent.py
🤖 Agent starting...

Navigating to https://www.saucedemo.com
🔧 Using tool: browser_navigate
🔧 Using tool: browser_snapshot
🔧 Using tool: browser_fill_form
🔧 Using tool: browser_click

## Scenario 1: Happy Path

✅ Login page loads correctly
✅ Valid credentials accepted — reached inventory page
✅ Product listings display with images, prices, descriptions
⚠️ Add to cart button has no visible effect
⚠️ Cart badge does not update after clicking Add to Cart
⚠️ Cart remains empty — items not persisting

🔧 Using tool: browser_navigate
🔧 Using tool: browser_type
🔧 Using tool: browser_click

## Scenario 2: Error Path — Invalid Login

✅ Invalid credentials do not grant access
✅ User remains on login page
⚠️ No error message shown to the user
⚠️ Error container exists in DOM but is never populated
⚠️ Locked-out user receives no feedback

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
VERDICT: 🔴 2 critical bugs found
Cart functionality broken. Login error handling missing.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Done.
06 — Setup

Run It Yourself

Five steps from zero to a running agent.

1

Install Python dependencies

pip install anthropic mcp
2

Install Playwright MCP and browsers

npm install -g @playwright/mcp
npx playwright install chromium
3

Set your Anthropic API key

Never hardcode keys in source files. Set it as an environment variable.

# Windows PowerShell
$env:ANTHROPIC_API_KEY="your-key-here"

# Mac / Linux
export ANTHROPIC_API_KEY="your-key-here"
4

Edit the mission in qa_agent.py

TARGET_URL = "https://your-app.com"

MISSION = """
Test the checkout flow as two personas:
1. A quick shopper who knows what they want
2. An indecisive shopper who adds, removes, changes items
Flag anything broken, confusing, or incomplete.
"""
5

Run the agent

python qa_agent.py
07 — Production Considerations

What I'd Add
in a Real Codebase

This is an intentionally simple POC — the goal is concept clarity, not production infrastructure. In a real engineering context, I'd layer in the following:

CI/CD Integration

GitHub Actions with PR label or comment triggers. Agent runs when a human intends it, not on every commit.

Codification Pipeline

Auto-convert agent-discovered pathways into deterministic Playwright regression scripts for repeatability.

Multi-Persona Missions

Multiple user personas per run to simulate diverse behaviors — power user, first-time user, error-prone user.

Screenshot Audit Trail

Capture screenshots at key steps. Visual evidence for bug reports and stakeholder communication.

Cost Controls

Turn limits and max_tokens caps to prevent runaway agent loops in CI. Predictable API spend per run.

Environment Separation

Agent always runs against staging. Never production. Test data only — nothing sensitive in the context window.