- Context Precision: Of all the retrieved chunks, how many are relevant to the user query?
- Formula: Number of relevant chunks retrieved / Total retrieved chunks
- Example: If 5 chunks are retrieved and only 3 are relevant to the user query, context precision is 3/5 = 60%
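The formula above can be sketched as a small helper (the chunk IDs are hypothetical placeholders):

```python
def context_precision(retrieved, relevant):
    """Of all retrieved chunks, what fraction are relevant to the query?"""
    if not retrieved:
        return 0.0
    hits = [c for c in retrieved if c in relevant]
    return len(hits) / len(retrieved)

# 5 retrieved chunks, 3 of them relevant -> 0.6
score = context_precision(["c1", "c2", "c3", "c4", "c5"], {"c1", "c2", "c3"})
```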
- Context Recall: Of all the relevant chunks in your database, how many did you retrieve?
- Formula: Number of relevant chunks retrieved / Total relevant chunks in database
- Example: The database holds 8 relevant chunks and 5 chunks are retrieved as in the previous example (of which 3 were relevant), so context recall is 3/8 = 37.5%
What does this tell you?
- You missed 62.5% of the relevant information
- Answer might be incomplete
- Might need to retrieve more chunks (increase top-k)
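A matching sketch for recall, using the same hypothetical chunk IDs (8 relevant chunks in the database, 3 of them retrieved):

```python
def context_recall(retrieved, all_relevant):
    """Of all relevant chunks in the database, what fraction were retrieved?"""
    if not all_relevant:
        return 0.0
    hits = set(retrieved) & set(all_relevant)
    return len(hits) / len(all_relevant)

# 3 of the 8 relevant chunks were retrieved -> 0.375
all_relevant = {f"r{i}" for i in range(8)}
retrieved = ["r0", "r1", "r2", "x1", "x2"]  # 3 relevant + 2 irrelevant
score = context_recall(retrieved, all_relevant)
```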
- Context F1 Score : Harmonic mean of precision and recall
- Formula: 2 x (Context Precision x Context Recall) / (Context Precision + Context Recall)
Why F1 matters?
- High precision, low recall: e.g. precision 0.9, recall 0.3 gives F1 = 0.45. Meaning: the chunks you retrieve are very relevant, but you miss a lot.
- Low precision, high recall: e.g. precision 0.3, recall 0.9 also gives F1 = 0.45. Meaning: you retrieve most of the relevant chunks, but with lots of noise.
- Balanced: e.g. precision 0.8 and recall 0.7 gives F1 ≈ 0.75. Meaning: a good balance of relevance and coverage.
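The harmonic mean can be checked directly with the numbers from the examples above:

```python
def context_f1(precision, recall):
    """Harmonic mean of context precision and context recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# High precision / low recall and the reverse both land at 0.45;
# a balanced pair scores noticeably higher (~0.75).
f1_skewed = context_f1(0.9, 0.3)
f1_balanced = context_f1(0.8, 0.7)
```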
- Answer Relevance: Does the generated answer actually address the question?
- Method: use another LLM as a judge.
- Question : {Original_Question}
- Answer : {Generated_Answer}
- On a scale of 1-5 how well does this answer address the question? 1 is completely off topic and 5 is fully and directly answered.
- Judge score of 5/5 is perfect scenario.
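A minimal sketch of building the judge prompt. The template and function names are illustrative; the filled-in prompt would be sent to whichever judge LLM you use, and the returned integer parsed as the score:

```python
JUDGE_PROMPT = """Question: {question}
Answer: {answer}

On a scale of 1-5, how well does this answer address the question?
1 = completely off topic, 5 = fully and directly answered.
Reply with a single integer."""

def build_judge_prompt(question, answer):
    """Fill the template; send the result to any judge LLM and parse its reply."""
    return JUDGE_PROMPT.format(question=question, answer=answer)

prompt = build_judge_prompt("What is the free-tier rate limit?",
                            "1000 requests per hour.")
```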
- Faithfulness: Is the answer supported by the retrieved context? No hallucinations?
- Formula: (Number of claims supported by context) / Total claims in the answer
- Example: Retrieved context: "Our API rate limit is 1000 requests per hour for free tier users"
- Generated answer, broken into claims:
    - "The API has a rate limit of 1000 requests per hour for free tier users" - supported claim
    - "Premium users get 10000 requests per hour" - not in context: hallucination
    - "Contact support to increase limit" - not in context
Faithfulness = 1/3 = 33.3%. This is a poor score --> significant hallucinations.
Use an LLM to extract the claims and verify each one against the retrieved context.


