Click here to go the App and skip the explanation

Build Summary

RAG Evaluation Suite

This application is a simple FastAPI that connects a managed Retrieval-Augmented Generation (RAG) service RAG chatbot with an OpenAI-powered evaluation judge. The result is a single-page tool for testing chatbot answers live and running a small human-reviewed regression suite against known ground-truth answers.

Application TypeSingle-page RAG evaluator

BackendPython + FastAPI

RAG Layermanaged Retrieval-Augmented Generation (RAG) service

Judge LayerOpenAI GPT-4o

What I Built

The application is a lightweight RAG testing workbench. It lets a user ask questions against a private-document managed Retrieval-Augmented Generation (RAG) service, review the live answer, and then audit the quality of that response using an LLM judge. It also includes a regression-testing tab where predefined questions are compared against expected ground-truth answers.

RAG TestingLLM-as-a-JudgeGolden DatasetFastAPImanaged Retrieval-Augmented Generation (RAG) serviceOpenAI GPT-4o

Why I Built It

The goal was to demonstrate a practical quality-engineering approach for validating a RAG chatbot. Instead of only checking whether the app responds, this project evaluates whether the response is grounded in the private knowledge base and whether it answers the user's question.

Quality engineering goal: move chatbot validation beyond simple “it works” checks and toward measurable faithfulness, answer relevance, regression coverage, and human review.

How the Application Works

User asks a questionThe frontend captures the question from the chat box.

FastAPI receives itThe backend exposes a /rag/chat endpoint.

managed Retrieval-Augmented Generation (RAG) service respondsThe managed Retrieval-Augmented Generation (RAG) service retrieves context and generates an answer.

OpenAI audits itGPT-4o scores faithfulness and relevancy on demand.

Human reviews resultsThe regression suite supports manual Pass/Fail verdicts.

User Interface

The page is organized into two clear sections: a live chat experience and a QA regression suite.

Tab 1 — Live Chat Box

User: Which regulations had to be followed by the AI Chatbot?

RAG Response: The chatbot response appears here after managed Retrieval-Augmented Generation (RAG) service returns an answer.

Run LLM Judge Audit

Faithfulness: 5/5Relevancy: 4/5

Tab 2 — QA Regression Suite

Question: What production problem did the triage system address?

Ground Truth
Expected answer from the private documents.

Live Answer
Brand-new answer generated by managed Retrieval-Augmented Generation (RAG) service.

✓ Pass✗ Fail

Run Automated Regression Test

Evaluation Layers Implemented

Layer 1: LLM-as-a-Judge

After the chatbot produces an answer, the user can run an audit using GPT-4o. The judge scores the answer on:

Faithfulness: whether the answer appears grounded in the retrieved knowledge.
Answer Relevancy: whether the answer directly addresses the question.

The scores are displayed as raw 1–5 values with a brief explanation.

Layer 2: Human Golden Dataset

The regression suite uses a hardcoded golden dataset of questions and expected answers. The application sends each question to the live RAG system, displays the new answer next to the ground truth, and lets the reviewer manually mark Pass or Fail.

This keeps final quality judgment in human hands while still automating the repetitive part of test execution.

Technology Stack

Layer	Technology	Purpose
Backend	Python 3.11 + FastAPI	Serves the web page and exposes the RAG and judge endpoints.
RAG	managed Retrieval-Augmented Generation (RAG) service SDK	Connects to a pre-existing managed Retrieval-Augmented Generation (RAG) service and returns grounded answers.
LLM Judge	OpenAI GPT-4o	Scores chatbot answers for faithfulness and relevancy.
Frontend	HTML, CSS, JavaScript	Provides the single-page chat and regression-test interface.
Server	Uvicorn	Runs the FastAPI application.
Hosting	Vercel	Provides the development environment, secrets, and deployment surface.

Core Backend Integration

The managed Retrieval-Augmented Generation (RAG) service integration followed the baseline execution pattern requested in the application design concept:

from managed Retrieval-Augmented Generation (RAG) service import managed Retrieval-Augmented Generation (RAG) service
from managed Retrieval-Augmented Generation (RAG) service_plugins.assistant.models.chat import Message pc = managed Retrieval-Augmented Generation (RAG) service(api_key=os.environ.get("managed Retrieval-Augmented Generation (RAG) service_API_KEY"))
assistant = pc.assistant.Assistant( assistant_name=os.environ.get("managed Retrieval-Augmented Generation (RAG) service_ASSISTANT_NAME")
) msg = Message(role="user", content=user_query)
res = assistant.chat(messages=[msg])
answer = res.message.content

API Endpoints

Method	Path	What It Does
GET	`/rag/`	Serves the single-page web application.
POST	`/rag/chat`	Sends the user question to managed Retrieval-Augmented Generation (RAG) service and returns the RAG answer.
POST	`/rag/judge`	Sends the question and answer to GPT-4o and returns judge scores.

Environment Variables

The application uses secrets for API keys and assistant configuration:

Variable	Purpose
`(RAG) service_API_KEY`	Authenticates with managed Retrieval-Augmented Generation (RAG) service.
`(RAG) service_ASSISTANT_NAME`	Identifies the pre-existing managed Retrieval-Augmented Generation (RAG) service.
`OPENAI_API_KEY`	Authenticates with OpenAI for the GPT-4o judge.

What This Demonstrates

This project shows how one can think about AI validation as a layered system instead of a one-time manual test. The application combines live exploratory testing, AI-assisted response scoring, and a repeatable regression workflow based on a golden dataset.

Product thinking: gives users a simple way to interact with the chatbot and evaluate answer quality.
QE strategy: separates live testing from regression testing and human review.
AI governance: checks faithfulness and relevancy instead of trusting chatbot output blindly.
Engineering practicality: uses a simple FastAPI and vanilla JavaScript implementation that can be extended later.

Possible Next Improvements

Add persistent test run history so Pass/Fail results can be tracked over time.
Export regression results to CSV or JSON.
Add automated pass/fail thresholds for judge scores while keeping human review available.
Capture retrieved source chunks to make faithfulness review easier.
Expand the golden dataset as the private knowledge base grows.
Add trend dashboards for faithfulness, relevancy, failure categories, and repeated hallucination patterns.

What I Built

Why I Built It

How the Application Works

User Interface

Evaluation Layers Implemented

Layer 1: LLM-as-a-Judge

Layer 2: Human Golden Dataset

Technology Stack

Core Backend Integration

API Endpoints

Environment Variables

What This Demonstrates

Possible Next Improvements

Application Design Concept