RAG Evaluation Suite
This application is a simple FastAPI that connects a managed Retrieval-Augmented Generation (RAG) service RAG chatbot with an OpenAI-powered evaluation judge. The result is a single-page tool for testing chatbot answers live and running a small human-reviewed regression suite against known ground-truth answers.
What I Built
The application is a lightweight RAG testing workbench. It lets a user ask questions against a private-document managed Retrieval-Augmented Generation (RAG) service, review the live answer, and then audit the quality of that response using an LLM judge. It also includes a regression-testing tab where predefined questions are compared against expected ground-truth answers.
Why I Built It
The goal was to demonstrate a practical quality-engineering approach for validating a RAG chatbot. Instead of only checking whether the app responds, this project evaluates whether the response is grounded in the private knowledge base and whether it answers the user's question.
How the Application Works
/rag/chat endpoint.User Interface
The page is organized into two clear sections: a live chat experience and a QA regression suite.
Expected answer from the private documents.
Brand-new answer generated by managed Retrieval-Augmented Generation (RAG) service.
Evaluation Layers Implemented
Layer 1: LLM-as-a-Judge
After the chatbot produces an answer, the user can run an audit using GPT-4o. The judge scores the answer on:
- Faithfulness: whether the answer appears grounded in the retrieved knowledge.
- Answer Relevancy: whether the answer directly addresses the question.
The scores are displayed as raw 1–5 values with a brief explanation.
Layer 2: Human Golden Dataset
The regression suite uses a hardcoded golden dataset of questions and expected answers. The application sends each question to the live RAG system, displays the new answer next to the ground truth, and lets the reviewer manually mark Pass or Fail.
This keeps final quality judgment in human hands while still automating the repetitive part of test execution.
Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Backend | Python 3.11 + FastAPI | Serves the web page and exposes the RAG and judge endpoints. |
| RAG | managed Retrieval-Augmented Generation (RAG) service SDK | Connects to a pre-existing managed Retrieval-Augmented Generation (RAG) service and returns grounded answers. |
| LLM Judge | OpenAI GPT-4o | Scores chatbot answers for faithfulness and relevancy. |
| Frontend | HTML, CSS, JavaScript | Provides the single-page chat and regression-test interface. |
| Server | Uvicorn | Runs the FastAPI application. |
| Hosting | Vercel | Provides the development environment, secrets, and deployment surface. |
Core Backend Integration
The managed Retrieval-Augmented Generation (RAG) service integration followed the baseline execution pattern requested in the application design concept:
from managed Retrieval-Augmented Generation (RAG) service import managed Retrieval-Augmented Generation (RAG) service
from managed Retrieval-Augmented Generation (RAG) service_plugins.assistant.models.chat import Message pc = managed Retrieval-Augmented Generation (RAG) service(api_key=os.environ.get("managed Retrieval-Augmented Generation (RAG) service_API_KEY"))
assistant = pc.assistant.Assistant( assistant_name=os.environ.get("managed Retrieval-Augmented Generation (RAG) service_ASSISTANT_NAME")
) msg = Message(role="user", content=user_query)
res = assistant.chat(messages=[msg])
answer = res.message.contentAPI Endpoints
| Method | Path | What It Does |
|---|---|---|
| GET | /rag/ | Serves the single-page web application. |
| POST | /rag/chat | Sends the user question to managed Retrieval-Augmented Generation (RAG) service and returns the RAG answer. |
| POST | /rag/judge | Sends the question and answer to GPT-4o and returns judge scores. |
Environment Variables
The application uses secrets for API keys and assistant configuration:
| Variable | Purpose |
|---|---|
(RAG) service_API_KEY | Authenticates with managed Retrieval-Augmented Generation (RAG) service. |
(RAG) service_ASSISTANT_NAME | Identifies the pre-existing managed Retrieval-Augmented Generation (RAG) service. |
OPENAI_API_KEY | Authenticates with OpenAI for the GPT-4o judge. |
What This Demonstrates
This project shows how one can think about AI validation as a layered system instead of a one-time manual test. The application combines live exploratory testing, AI-assisted response scoring, and a repeatable regression workflow based on a golden dataset.
- Product thinking: gives users a simple way to interact with the chatbot and evaluate answer quality.
- QE strategy: separates live testing from regression testing and human review.
- AI governance: checks faithfulness and relevancy instead of trusting chatbot output blindly.
- Engineering practicality: uses a simple FastAPI and vanilla JavaScript implementation that can be extended later.
Possible Next Improvements
- Add persistent test run history so Pass/Fail results can be tracked over time.
- Export regression results to CSV or JSON.
- Add automated pass/fail thresholds for judge scores while keeping human review available.
- Capture retrieved source chunks to make faithfulness review easier.
- Expand the golden dataset as the private knowledge base grows.
- Add trend dashboards for faithfulness, relevancy, failure categories, and repeated hallucination patterns.