RAG Evaluation Metrics Explained: Faithfulness, Context Relevance & Answer Relevance

This article is part of the AI Testing series. Our earlier post, Engineering Trust: The Mandate for Testing Agentic AI & RAG, introduced the RAG triad as a diagnostic frame. This article goes one level deeper: how the three metrics are actually computed, and what they catch in a real-world financial-services setting.

Why this matters

RAG evaluation metrics are becoming critical for teams deploying Retrieval-Augmented Generation systems in regulated industries like financial services. When a model summarises a 10-K, answers a question about a fund factsheet, or explains a credit memo, “mostly right” is not a quality bar – it is a regulatory and reputational exposure. The risk is well documented. FinanceBench is a benchmark of 10,231 questions over public company filings; evaluating a 150-question sample from it, the authors found that GPT-4-Turbo paired with a retrieval system “incorrectly answered or refused to answer 81% of questions,” with hallucinations cited as a primary failure mode that “limit their suitability for use by enterprises” [1].

That is the gap the RAG triad is designed to measure. The triad does not ask whether the answer sounds right; it isolates which part of the pipeline failed.

What the triad actually measures

The triad comes from RAGAS (Retrieval Augmented Generation Assessment), introduced by Es, James, Espinosa-Anke and Schockaert at EACL 2024 [2]. The framework’s value is that it scores each dimension separately, without requiring a hand-labelled ground-truth answer for every question — what the authors call “reference-free” evaluation [2].

Three metrics carry most of the weight:

1. Context Relevance – did retrieval bring back the right evidence?

The retriever’s job is to pull the chunks needed to answer the question and nothing else. Context relevance penalises both misses and noise.

How it is computed. RAGAS prompts a judge LLM to extract only the sentences in the retrieved context that are necessary to answer the question, and computes the ratio of relevant sentences to total retrieved sentences [2]. A higher ratio means a cleaner, more focused retrieval.

Example. A wealth advisor asks: “What was Nike’s total revenue for fiscal year 2023?” The retriever returns three chunks: one from the FY2023 income statement, one from the FY2022 comparative column, and one from the risk-factors section. Only the first chunk is necessary. Assuming each chunk contributes a similar number of sentences, context relevance is roughly 1 in 3 – a retrieval problem that is visible before you ever look at the answer.
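The ratio in the Nike example can be sketched in a few lines. This is an illustration only, not the RAGAS implementation: the `JudgedSentence` type and `context_relevance` function are hypothetical names, the relevance verdicts are hard-coded where a judge LLM would supply them, and the sentence text is placeholder wording rather than anything quoted from a filing.

```python
from dataclasses import dataclass


@dataclass
class JudgedSentence:
    text: str
    relevant: bool  # verdict a judge LLM would supply; hard-coded here


def context_relevance(sentences: list[JudgedSentence]) -> float:
    """Ratio of judge-selected sentences to all retrieved sentences."""
    if not sentences:
        return 0.0
    return sum(s.relevant for s in sentences) / len(sentences)


# Hypothetical stand-ins for the three retrieved chunks, one sentence each.
retrieved = [
    JudgedSentence("FY2023 income statement: total revenues line.", True),
    JudgedSentence("FY2022 comparative column: prior-year revenues.", False),
    JudgedSentence("Risk factors: exposure to currency fluctuations.", False),
]

score = context_relevance(retrieved)
print(round(score, 2))  # one relevant sentence out of three
```

Because the score is a simple ratio, trimming noisy chunks raises it just as surely as finding the right one, which is why the metric penalises both misses and noise.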

The RAGAS authors flagged context relevance as the hardest of the three to score, with the weakest agreement to human raters [2] – useful to know when calibrating thresholds.

2. Faithfulness – is every claim in the answer supported by the retrieved evidence?

This is the hallucination metric. It does not check whether the answer is true in the world; it checks whether the answer is supported by what the system actually retrieved.

How it is computed. The generated answer is decomposed into atomic claims. Each claim is then checked – by a judge LLM, often with a natural-language-inference style prompt – against the retrieved context. Faithfulness is the proportion of claims that can be inferred from the context [2][3].

Example. A query asks: “Summarize the key drivers of operating margin change in HSBC’s most recent annual report.” The model answers with three drivers: net interest income expansion, a one-off litigation provision, and “increased exposure to APAC SME lending.” The first two appear verbatim in the retrieved MD&A section. The third is plausible – HSBC does have APAC SME exposure – but it is not in the retrieved chunks. With two of three claims supported, faithfulness drops to 0.67, and the unfaithful claim is exactly the kind of confident-sounding fabrication that gets through human review.
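The claim-level arithmetic behind that 0.67 can be sketched as follows. Everything here is an assumption made for illustration: `faithfulness` and `substring_judge` are hypothetical names, the judge is a naive case-insensitive substring check standing in for an LLM or NLI judge, and the context string is placeholder text, not a quotation from HSBC's report.

```python
from typing import Callable


def faithfulness(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> tuple[float, list[str]]:
    """Return (share of supported claims, list of unsupported claims)."""
    if not claims:
        return 0.0, []
    unsupported = [c for c in claims if not is_supported(c, context)]
    return (len(claims) - len(unsupported)) / len(claims), unsupported


def substring_judge(claim: str, context: str) -> bool:
    # Stand-in for the judge LLM: a real judge tests entailment, not string match.
    return claim.lower() in context.lower()


# Placeholder context and the three atomic claims from the example.
context = (
    "Operating margin moved on net interest income expansion "
    "and a one-off litigation provision."
)
claims = [
    "net interest income expansion",
    "a one-off litigation provision",
    "increased exposure to APAC SME lending",
]

score, unsupported = faithfulness(claims, context, substring_judge)
print(round(score, 2), unsupported)
```

Returning the unsupported claims alongside the score is a deliberate choice in this sketch: a reviewer usually needs to see which claim failed, not only that one did.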

This is the metric the RAGAS authors found tracks closest to human judgement [2], which is why it tends to be the most trusted of the three in production.

3. Answer Relevance – does the answer address what the user actually asked?

A faithful answer can still be the wrong answer if it addresses a different question.

How it is computed. RAGAS uses an LLM to generate N candidate questions that the given answer could plausibly be a response to, then measures the average cosine similarity between those generated questions and the original user query in an embedding space [2]. High similarity means the answer is on-topic; low similarity means it has drifted.

Example. User asks: “Is the dividend covered by free cash flow in JPMorgan’s 2024 annual report?” The model produces a detailed, accurate, well-grounded paragraph about JPMorgan’s capital adequacy ratios. Faithfulness is 1.0. Context relevance is high. But answer relevance is low – the user wanted a dividend-coverage analysis, not a CET1 ratio explainer. Without this metric, the failure looks like a success.
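A toy version of this computation is sketched below. It rests on loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the candidate questions are invented by hand where a judge LLM would generate them, and the function names are hypothetical. The absolute scores mean little; the point is the gap between an on-topic and an off-topic answer.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real setup would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_relevance(query: str, generated_questions: list[str]) -> float:
    """Mean cosine similarity between the query and questions inferred from the answer."""
    if not generated_questions:
        return 0.0
    q = embed(query)
    sims = [cosine(q, embed(g)) for g in generated_questions]
    return sum(sims) / len(sims)


query = "Is the dividend covered by free cash flow in JPMorgan's 2024 annual report?"

# Hand-written stand-ins for questions a judge LLM might infer from each answer.
from_capital_answer = [
    "What are JPMorgan's capital adequacy ratios?",
    "How strong is JPMorgan's CET1 ratio?",
]
from_dividend_answer = [
    "Does JPMorgan's free cash flow cover its dividend?",
    "Is JPMorgan's dividend covered by free cash flow?",
]

off_topic = answer_relevance(query, from_capital_answer)
on_topic = answer_relevance(query, from_dividend_answer)
print(off_topic < on_topic)
```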

The diagnostic pattern the triad surfaces

The three metrics work together because each failure mode looks different across them:

Failure pattern	Context Rel.	Faithfulness	Answer Rel.
Retriever pulled the wrong filings	Low	Often low	Variable
Right evidence, model hallucinated anyway	High	Low	High
Right evidence, model answered a different question	High	High	Low
Fluent confabulation across the board	Low	Low	High

The last row is the most dangerous one – the answer reads well, addresses the question, and is entirely invented. A single metric cannot catch it; the triad can.

At HIVE AI Testing Services, one thing we keep noticing in production RAG systems is that failures rarely come from a single layer. Retrieval can look fine and grounding can look mostly correct, yet the final answer still drifts just enough to become risky, especially in financial workflows where confident responses can slip through review.
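The failure-pattern table can be read as a lookup from three scores to a likely failing layer. The sketch below is one way to encode it, under stated assumptions: the `diagnose` function is hypothetical, and the single 0.7 cut-off separating "high" from "low" is an arbitrary placeholder that a team would calibrate per metric.

```python
def diagnose(
    context_rel: float,
    faithfulness: float,
    answer_rel: float,
    threshold: float = 0.7,  # assumed cut-off; calibrate per metric in practice
) -> str:
    """Map triad scores to the failure patterns in the table."""
    ctx, faith, ans = (s >= threshold for s in (context_rel, faithfulness, answer_rel))

    # Check the most dangerous row first: it overlaps with the retriever row.
    if not ctx and not faith and ans:
        return "fluent confabulation"
    if not ctx:
        return "retriever pulled the wrong evidence"
    if not faith and ans:
        return "right evidence, model hallucinated"
    if faith and not ans:
        return "right evidence, model answered a different question"
    if faith and ans:
        return "no failure flagged"
    return "multiple failures; inspect manually"


print(diagnose(0.9, 0.4, 0.9))
print(diagnose(0.9, 0.95, 0.3))
print(diagnose(0.2, 0.3, 0.9))
```

The ordering of the checks matters: the retriever row allows any answer-relevance value, so the confabulation pattern has to be tested before it or it would never be reported.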

The RAGAS paper reports human-agreement scores of 95% for faithfulness, 78% for answer relevance, and 70% for context relevance on the WikiEval benchmark [2]. Three implications for teams:

Treat the scores as signals, not verdicts. This applies especially to context relevance, where the judge LLM disagrees with human raters roughly one time in three.
Run each evaluation multiple times. RAG outputs are non-deterministic, and so is the judge. Report a mean and a confidence interval, not a single number.
Always pair RAGAS with a domain-specific golden dataset. RAGAS gives you breadth across the triad; a curated set of high-stakes Q&A pairs (dividend coverage, segment revenue, covenant breach disclosure) gives you depth where the cost of being wrong is highest.
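Reporting a mean with a confidence interval needs only the standard library. The sketch below uses a normal approximation, which is an assumption worth flagging: with only a handful of runs a t-distribution would give a wider and more honest interval. The `mean_with_ci` name is hypothetical and the five scores are invented for illustration, not measured results.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_with_ci(scores: list[float], confidence: float = 0.95) -> tuple[float, float, float]:
    """Return (mean, lower bound, upper bound) using a normal approximation."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    m = mean(scores)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return m, m - half_width, m + half_width


# Invented faithfulness scores from five repeated runs of the same test case.
runs = [0.67, 0.75, 0.67, 1.0, 0.75]

avg, low, high = mean_with_ci(runs)
print(f"mean={avg:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If a threshold is applied to these scores, comparing it against the lower bound rather than the mean is the more conservative reading.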

The FinanceBench authors made the same point: even when an LLM “appears to be giving reasonable responses, there remains a risk that its answers are hallucinations, out-of-date, logically incorrect, or given with the wrong confidence” [1]. The triad is how you stop accepting “reasonable” as a quality bar.

Key takeaways

The RAG triad isolates failure to a layer. Context relevance points at the retriever, faithfulness at the generator’s grounding, answer relevance at intent matching. A single accuracy score cannot do this.
Faithfulness is the most reliable of the three — 95% human agreement in the original RAGAS evaluation [2] — and it is the right metric to wire into a CI gate first.
Context relevance is the noisiest signal (~70% human agreement [2]); use it as a directional indicator, not an acceptance criterion.
Answer relevance catches the “right answer, wrong question” failure that faithfulness alone will miss — particularly common in finance, where adjacent topics (capital ratios vs. dividend coverage, gross vs. operating margin) trip models up.
In regulated domains, RAGAS is a floor, not a ceiling. Pair reference-free scores with a curated financial golden dataset and statistical confidence intervals across multiple runs.

References

[1] Islam, P., Kannappan, A., Kiela, D., Qian, R., Scherrer, N., & Barrett, B. (2023). FinanceBench: A New Benchmark for Financial Question Answering. arXiv:2311.11944. https://arxiv.org/abs/2311.11944

[2] Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). RAGAs: Automated Evaluation of Retrieval Augmented Generation. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 150–158. https://aclanthology.org/2024.eacl-demo.16/ (arXiv:2309.15217)

[3] Ragas documentation. Faithfulness metric. https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/
