Day-1
Part-1
This document introduces the Case Handler evaluation using RAGChecker, a framework for claim-level evaluation of RAG applications. Previously, we only compared answers with overall metrics such as precision and recall. We have now added context-based metrics, which measure how well the AI responses are grounded in the retrieved context. We also evaluate the quality of the retrieved context itself against the ground-truth answers. A key innovation is the use of AI agents to systematically investigate the results and identify root causes, which helps us discover systemic issues across multiple cases. RAGChecker works by comparing two texts, for example an AI answer and a ground-truth answer: it extracts atomic facts called claims, each consisting of a subject, a predicate, and an object.
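To make the claim representation concrete, here is a small illustrative sketch. It is not RAGChecker's internal code; the Claim dataclass and the example sentence are hypothetical and only show what subject-predicate-object claims look like once a response is decomposed, and how precision- and recall-style checks are framed around them.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """One atomic fact extracted from a text (illustrative only, not RAGChecker internals)."""
    subject: str
    predicate: str
    object: str


# A response such as "The router was rebooted by the on-call engineer at 09:00"
# might be decomposed into claims like these:
claims = [
    Claim(subject="the router", predicate="was rebooted by", object="the on-call engineer"),
    Claim(subject="the reboot", predicate="happened at", object="09:00"),
]

# Precision-style metrics then ask how many of the response's claims are entailed
# by the ground-truth answer (or by the retrieved context); recall-style metrics
# ask the reverse: how many ground-truth claims appear in the response.
for claim in claims:
    print(f"({claim.subject}, {claim.predicate}, {claim.object})")
```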
Part-2
The evaluation process consists of four main steps. First, we select high-quality test cases from existing knowledge base entries and retrieve the first email for each case to serve as the ground truth. It is crucial to check whether this ground truth is faithful to the reference material, because the quality of the evaluation depends heavily on the quality of these answers. In the second step, we run the Case Handler for the selected cases and save the traces. Step three is the full RAGChecker evaluation, in which we calculate metrics for both the generator and the retriever. Finally, in step four, we use AI agents to analyze the results, focusing especially on anomalies such as low metric scores in order to find their root causes.
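As a sketch of how steps two and three fit together, the snippet below assembles saved Case Handler traces into RAGChecker's input format and runs the full evaluation. The field names follow RAGChecker's published input format, but the build_checking_inputs helper, the example trace, and the model identifiers are assumptions for illustration and would need to be adapted to our actual traces and model setup.

```python
import json

from ragchecker import RAGChecker, RAGResults
from ragchecker.metrics import all_metrics  # generator, retriever and overall metrics


def build_checking_inputs(cases, output_path="checking_inputs.json"):
    """Convert saved Case Handler traces into RAGChecker's JSON input format.

    Each case is assumed to carry the query, the first email (ground truth),
    the generated answer, and the retrieved context chunks.
    """
    results = []
    for case in cases:
        results.append({
            "query_id": case["case_id"],
            "query": case["query"],
            "gt_answer": case["first_email"],        # step 1: ground truth
            "response": case["generated_answer"],    # step 2: Case Handler output
            "retrieved_context": [
                {"doc_id": chunk["id"], "text": chunk["text"]}
                for chunk in case["retrieved_chunks"]
            ],
        })
    with open(output_path, "w") as fp:
        json.dump({"results": results}, fp, indent=2)
    return output_path


# Hypothetical trace for a single case; real traces come from step 2.
example_cases = [{
    "case_id": "case-001",
    "query": "Customer reports the VPN connection drops every hour.",
    "first_email": "First agent email explaining the keep-alive setting ...",
    "generated_answer": "Answer produced by the Case Handler ...",
    "retrieved_chunks": [{"id": "kb-42", "text": "Knowledge base entry on VPN keep-alive ..."}],
}]

# Step 3: full RAGChecker evaluation (model identifiers are placeholders).
with open(build_checking_inputs(example_cases)) as fp:
    rag_results = RAGResults.from_json(fp.read())

evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
)
evaluator.evaluate(rag_results, all_metrics)
print(rag_results)  # per-case and aggregated generator and retriever metrics
```

The resulting generator and retriever metrics are what step four's agent-based analysis would then inspect for anomalies such as unusually low scores.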