Build Summary

RAG Evaluation Suite

This application is a simple FastAPI app that connects a chatbot built on a managed Retrieval-Augmented Generation (RAG) service with an OpenAI-powered evaluation judge. The result is a single-page tool for testing chatbot answers live and running a small human-reviewed regression suite against known ground-truth answers.

Application Type: Single-page RAG evaluator
Backend: Python + FastAPI
RAG Layer: Managed Retrieval-Augmented Generation (RAG) service
Judge Layer: OpenAI GPT-4o

What I Built

The application is a lightweight RAG testing workbench. It lets a user ask questions against a private-document managed Retrieval-Augmented Generation (RAG) service, review the live answer, and then audit the quality of that response using an LLM judge. It also includes a regression-testing tab where predefined questions are compared against expected ground-truth answers.

RAG Testing · LLM-as-a-Judge · Golden Dataset · FastAPI · managed Retrieval-Augmented Generation (RAG) service · OpenAI GPT-4o

Why I Built It

The goal was to demonstrate a practical quality-engineering approach for validating a RAG chatbot. Instead of only checking whether the app responds, this project evaluates whether the response is grounded in the private knowledge base and whether it answers the user's question.

Quality engineering goal: move chatbot validation beyond simple “it works” checks and toward measurable faithfulness, answer relevance, regression coverage, and human review.

How the Application Works

1. User asks a question: The frontend captures the question from the chat box.
2. FastAPI receives it: The backend exposes a /rag/chat endpoint.
3. Managed Retrieval-Augmented Generation (RAG) service responds: The service retrieves context and generates an answer.
4. OpenAI audits it: GPT-4o scores faithfulness and relevancy on demand.
5. Human reviews results: The regression suite supports manual Pass/Fail verdicts.
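
The steps above can be sketched as plain Python, independent of the web framework. This is a minimal sketch under stated assumptions, not the application's actual code: `RagClient`, `JudgeClient`, `handle_chat`, and `handle_judge` are hypothetical names, the payload keys (`question`, `answer`) are illustrative, and both the RAG service and the GPT-4o judge are injected as callables so the flow can be followed without credentials.

```python
from typing import Callable, Dict

# Hypothetical type aliases: the real app calls the managed RAG service SDK
# and the OpenAI API; here both are injected as plain callables.
RagClient = Callable[[str], str]                     # question -> answer
JudgeClient = Callable[[str, str], Dict[str, int]]   # (question, answer) -> scores


def handle_chat(payload: Dict[str, str], rag: RagClient) -> Dict[str, str]:
    """Steps 1-3: take the question from the chat box and return the RAG answer."""
    question = payload.get("question", "").strip()
    if not question:
        raise ValueError("question must not be empty")
    return {"question": question, "answer": rag(question)}


def handle_judge(payload: Dict[str, str], judge: JudgeClient) -> Dict[str, int]:
    """Step 4: score an existing answer, only when the user asks for an audit."""
    return judge(payload["question"], payload["answer"])


if __name__ == "__main__":
    # Stand-in backends so the flow runs offline.
    def fake_rag(question: str) -> str:
        return f"(answer to: {question})"

    def fake_judge(question: str, answer: str) -> Dict[str, int]:
        return {"faithfulness": 5, "relevancy": 4}

    chat = handle_chat({"question": "Which regulations applied?"}, fake_rag)
    print(chat["answer"])
    print(handle_judge(chat, fake_judge))
```

Passing the two backends in as arguments keeps the chat and judge steps separate, which matches the on-demand audit described in step 4: the judge is only called when the user requests it.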

User Interface

The page is organized into two clear sections: a live chat experience and a QA regression suite.

Tab 1: Live Chat Box
  User: Which regulations had to be followed by the AI Chatbot?
  RAG Response: The chatbot response appears here after the managed Retrieval-Augmented Generation (RAG) service returns an answer.
  [Run LLM Judge Audit]
  Faithfulness: 5/5 | Relevancy: 4/5

Tab 2: QA Regression Suite
  Question: What production problem did the triage system address?
  Ground Truth: Expected answer from the private documents.
  Live Answer: Brand-new answer generated by the managed Retrieval-Augmented Generation (RAG) service.
  [✓ Pass] [✗ Fail]
  [Run Automated Regression Test]

Evaluation Layers Implemented

Layer 1: LLM-as-a-Judge

After the chatbot produces an answer, the user can run an audit using GPT-4o. The judge scores the answer on:

  • Faithfulness: whether the answer appears grounded in the retrieved knowledge.
  • Answer Relevancy: whether the answer directly addresses the question.

The scores are displayed as raw 1–5 values with a brief explanation.
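
One way to make the judge output reliable enough to display is to ask for JSON and validate it before it reaches the page. The sketch below is an assumption about how that could look, not the application's actual prompt or parser: the prompt wording, the JSON keys, and the names `JudgeResult`, `build_judge_prompt`, and `parse_judge_reply` are all illustrative, and the call to GPT-4o itself is left out.

```python
import json
from dataclasses import dataclass

# Illustrative instructions; the real judge prompt is not shown in this summary.
JUDGE_INSTRUCTIONS = (
    "You are auditing a RAG chatbot answer. Score it from 1 (poor) to 5 (excellent) "
    "on faithfulness (grounded in the retrieved knowledge) and relevancy "
    "(directly addresses the question). Reply with JSON only, using the keys "
    '"faithfulness", "relevancy", and "explanation".'
)


@dataclass
class JudgeResult:
    faithfulness: int
    relevancy: int
    explanation: str


def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble the text sent to the judge model."""
    return f"{JUDGE_INSTRUCTIONS}\n\nQuestion: {question}\nAnswer: {answer}"


def parse_judge_reply(raw: str) -> JudgeResult:
    """Validate the judge's JSON reply before the scores are displayed."""
    data = json.loads(raw)
    scores = {}
    for key in ("faithfulness", "relevancy"):
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
        scores[key] = value
    return JudgeResult(explanation=str(data.get("explanation", "")), **scores)
```

Rejecting out-of-range or non-integer scores keeps a malformed model reply from being shown as if it were a valid 1–5 rating.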

Layer 2: Human Golden Dataset

The regression suite uses a hardcoded golden dataset of questions and expected answers. The application sends each question to the live RAG system, displays the new answer next to the ground truth, and lets the reviewer manually mark Pass or Fail.

This keeps final quality judgment in human hands while still automating the repetitive part of test execution.
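
The split between automated execution and human judgment can be sketched as follows. This is a minimal illustration under assumptions: `RegressionCase`, `run_regression`, `record_verdict`, and `summarize` are hypothetical names, the ground-truth text is a placeholder, and the live RAG call is passed in as a callable rather than made through the real SDK.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class RegressionCase:
    question: str
    ground_truth: str
    live_answer: Optional[str] = None
    verdict: Optional[str] = None  # "pass" or "fail", set only by the reviewer


# Hypothetical golden dataset; the expected answer is a placeholder.
GOLDEN_DATASET = [
    RegressionCase(
        question="What production problem did the triage system address?",
        ground_truth="Expected answer from the private documents.",
    ),
]


def run_regression(cases: List[RegressionCase], ask: Callable[[str], str]) -> None:
    """Automated part: fetch a fresh live answer for every golden question."""
    for case in cases:
        case.live_answer = ask(case.question)
        case.verdict = None  # a new answer always needs a new review


def record_verdict(case: RegressionCase, passed: bool) -> None:
    """Manual part: store the reviewer's Pass/Fail decision."""
    case.verdict = "pass" if passed else "fail"


def summarize(cases: List[RegressionCase]) -> Dict[str, int]:
    """Count verdicts so far; unreviewed cases are reported as pending."""
    counts = {"pass": 0, "fail": 0, "pending": 0}
    for case in cases:
        counts[case.verdict or "pending"] += 1
    return counts
```

In this sketch the code never sets a verdict on its own: `run_regression` only collects answers, and a case stays pending until a reviewer calls `record_verdict`.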

Technology Stack

Layer | Technology | Purpose
Backend | Python 3.11 + FastAPI | Serves the web page and exposes the RAG and judge endpoints.
RAG | managed Retrieval-Augmented Generation (RAG) service SDK | Connects to a pre-existing managed Retrieval-Augmented Generation (RAG) service and returns grounded answers.
LLM Judge | OpenAI GPT-4o | Scores chatbot answers for faithfulness and relevancy.
Frontend | HTML, CSS, JavaScript | Provides the single-page chat and regression-test interface.
Server | Uvicorn | Runs the FastAPI application.
Hosting | Vercel | Provides the development environment, secrets, and deployment surface.

Core Backend Integration

The managed Retrieval-Augmented Generation (RAG) service integration followed the baseline execution pattern requested in the application design concept:

import os

from managed Retrieval-Augmented Generation (RAG) service import managed Retrieval-Augmented Generation (RAG) service
from managed Retrieval-Augmented Generation (RAG) service_plugins.assistant.models.chat import Message

# Create the SDK client and look up the pre-existing assistant by name.
pc = managed Retrieval-Augmented Generation (RAG) service(api_key=os.environ.get("managed Retrieval-Augmented Generation (RAG) service_API_KEY"))
assistant = pc.assistant.Assistant(
    assistant_name=os.environ.get("managed Retrieval-Augmented Generation (RAG) service_ASSISTANT_NAME")
)

# Send the user's question as a single chat message and read the answer text.
msg = Message(role="user", content=user_query)
res = assistant.chat(messages=[msg])
answer = res.message.content

API Endpoints

Method | Path | What It Does
GET | /rag/ | Serves the single-page web application.
POST | /rag/chat | Sends the user question to the managed Retrieval-Augmented Generation (RAG) service and returns the RAG answer.
POST | /rag/judge | Sends the question and answer to GPT-4o and returns judge scores.

Environment Variables

The application uses secrets for API keys and assistant configuration:

Variable | Purpose
(RAG) service_API_KEY | Authenticates with the managed Retrieval-Augmented Generation (RAG) service.
(RAG) service_ASSISTANT_NAME | Identifies the pre-existing managed Retrieval-Augmented Generation (RAG) service assistant.
OPENAI_API_KEY | Authenticates with OpenAI for the GPT-4o judge.
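
A small startup check can surface a missing secret immediately instead of on the first request. The sketch below is an assumption about how such a check might be written, not the application's code: `load_required` is a hypothetical helper, and `RAG_API_KEY` and `RAG_ASSISTANT_NAME` are stand-in names for the RAG service variables listed above. Only `OPENAI_API_KEY` is taken verbatim from the table.

```python
import os
from typing import Dict, Iterable, Mapping


def load_required(names: Iterable[str], env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the requested variables, failing fast if any is missing or empty."""
    values = {name: env.get(name, "").strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values


# Stand-in names for the two RAG service variables (assumption), plus the OpenAI key.
REQUIRED = ("RAG_API_KEY", "RAG_ASSISTANT_NAME", "OPENAI_API_KEY")
```

Calling `load_required(REQUIRED)` once when the app starts would report every missing variable in a single error message rather than one at a time.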

What This Demonstrates

This project treats AI validation as a layered system rather than a one-time manual test. The application combines live exploratory testing, AI-assisted response scoring, and a repeatable regression workflow based on a golden dataset.

Possible Next Improvements

Application Design Concept

The application was designed as a lightweight proof-of-concept for evaluating Retrieval-Augmented Generation (RAG) chatbot quality using multiple validation layers.

The design combines:

The goal was to demonstrate how modern AI systems can be evaluated using both automated and human-centered validation approaches while keeping the implementation intentionally simple and easy to understand.
