Day-2

Part-1

We identified the top five cases by faithfulness score, ranging from 80% to 100%. The top-ranked case was a perfect match on an SQL migration issue; the second was near-perfect, and the rest were rated excellent. In the full evaluation we tracked precision, recall, and F1. For example, one case reached 82% recall but noticeably lower precision, while another matched the knowledge base perfectly. We also hit a failed case: the final output was an engineer's note rather than a draft email, so it could not be compared against the ground truth.
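To make the metrics concrete, here is a minimal sketch of how claim-level precision, recall, and F1 can be computed. The function and the claim sets are hypothetical placeholders, not the actual evaluation harness or data.

    # Minimal sketch of claim-level precision/recall/F1.
    # The claim sets below are hypothetical, not real evaluation output.

    def claim_metrics(output_claims: set[str], truth_claims: set[str]) -> dict[str, float]:
        """Precision = matched / claims in output; recall = matched / claims in ground truth."""
        matched = output_claims & truth_claims
        precision = len(matched) / len(output_claims) if output_claims else 0.0
        recall = len(matched) / len(truth_claims) if truth_claims else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}

    # Example: 9 of 11 ground-truth claims are covered (~82% recall), but the
    # output contains extra claims, which pulls precision down to 60%.
    output = {f"claim_{i}" for i in range(15)}                         # 15 claims in the draft
    truth = {f"claim_{i}" for i in range(9)} | {"claim_x", "claim_y"}  # 11 ground-truth claims
    print(claim_metrics(output, truth))

The numbers in the example are chosen only to mirror the 82%-recall case above; they are not taken from the evaluation run.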

Part-2

Our key findings: recall is generally high, often above 60%, which indicates the AI covers most of the claims in the ground truth. Precision is lower, between 33% and 75%, which suggests the AI adds extra or unrelated information, likely pulled from the wrong context. One anomaly stood out: 100% recall despite a 0% match with the knowledge base. We used Claude Code agents to investigate further, analyzing the output files and the claim-level judgments to understand exactly what happened during retrieval.
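The anomaly check itself can be scripted. The sketch below assumes a hypothetical layout where each case writes one JSON file containing per-case recall and knowledge-base match scores; the actual output files from our pipeline may be structured differently.

    import json
    from pathlib import Path

    # Sketch of an anomaly scan over claim-level judgment files.
    # Assumed (hypothetical) per-case layout:
    #   {"case_id": "...", "recall": 0.0-1.0, "kb_match": 0.0-1.0, "claims": [...]}

    def find_anomalies(results_dir: str, recall_floor: float = 1.0, kb_ceiling: float = 0.0):
        """Flag cases with high recall even though nothing matched the knowledge base."""
        anomalies = []
        for path in Path(results_dir).glob("*.json"):
            case = json.loads(path.read_text())
            if case["recall"] >= recall_floor and case["kb_match"] <= kb_ceiling:
                anomalies.append(case["case_id"])
        return anomalies

    if __name__ == "__main__":
        # The 100%-recall / 0%-KB-match case would be flagged by this scan.
        print(find_anomalies("eval_outputs"))

A scan like this only surfaces the suspicious cases; the claim-level judgments still have to be read by hand (or by an agent) to see where the retrieval went wrong.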