- Context Precision: Of all the retrieved chunks, how many are relevant to the user query?
- Formula: Number of relevant chunks retrieved / Total retrieved chunks
- Example: If 5 chunks are retrieved and only 3 are relevant to the user query, context precision is 3/5 = 60%
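The formula above can be sketched as a small helper (the chunk IDs are hypothetical placeholders):

```python
def context_precision(retrieved, relevant):
    """Of all retrieved chunks, what fraction are relevant to the query?"""
    if not retrieved:
        return 0.0
    hits = [c for c in retrieved if c in relevant]
    return len(hits) / len(retrieved)

# 5 retrieved chunks, 3 of them relevant -> 0.6
score = context_precision(["c1", "c2", "c3", "c4", "c5"], {"c1", "c2", "c3"})
```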
- Context Recall: Of all the relevant chunks in your database, how many did you retrieve?
- Formula: Number of relevant chunks retrieved / Total relevant chunks in database
- Example: The database holds 8 relevant chunks and 5 chunks are retrieved as in the previous example (of which 3 were relevant), so context recall is 3/8 = 37.5%
What does this tell you?
- You missed 62.5% of the relevant information
- Answer might be incomplete
- Might need to retrieve more chunks (increase top-k)
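A matching sketch for recall, using the same hypothetical chunk IDs (8 relevant chunks in the database, 3 of them retrieved):

```python
def context_recall(retrieved, all_relevant):
    """Of all relevant chunks in the database, what fraction were retrieved?"""
    if not all_relevant:
        return 0.0
    hits = set(retrieved) & set(all_relevant)
    return len(hits) / len(all_relevant)

# 3 of the 8 relevant chunks were retrieved -> 0.375
all_relevant = {f"r{i}" for i in range(8)}
retrieved = ["r0", "r1", "r2", "x1", "x2"]  # 3 relevant + 2 irrelevant
score = context_recall(retrieved, all_relevant)
```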
- Context F1 Score : Harmonic mean of precision and recall
- Formula: 2 x (Context Precision x Context Recall) / (Context Precision + Context Recall)
Why F1 matters?
- High precision, low recall: e.g. precision 0.9, recall 0.3 gives F1 = 0.45. Meaning: the chunks you retrieve are very relevant, but you miss a lot.
- Low precision, high recall: e.g. precision 0.3, recall 0.9 also gives F1 = 0.45. Meaning: you retrieve most of the relevant chunks, but with lots of noise.
- Balanced: e.g. precision 0.8 and recall 0.7 gives F1 ≈ 0.75. Meaning: a good balance of relevance and coverage.
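The harmonic mean can be checked directly with the numbers from the examples above:

```python
def context_f1(precision, recall):
    """Harmonic mean of context precision and context recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# High precision / low recall and the reverse both land at 0.45;
# a balanced pair scores noticeably higher (~0.75).
f1_skewed = context_f1(0.9, 0.3)
f1_balanced = context_f1(0.8, 0.7)
```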
- Answer Relevance: Does the generated answer actually address the question?
- Method: use another LLM as a judge.
- Question : {Original_Question}
- Answer : {Generated_Answer}
- On a scale of 1-5 how well does this answer address the question? 1 is completely off topic and 5 is fully and directly answered.
- Judge score of 5/5 is perfect scenario.
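A minimal sketch of building the judge prompt. The template and function names are illustrative; the filled-in prompt would be sent to whichever judge LLM you use, and the returned integer parsed as the score:

```python
JUDGE_PROMPT = """Question: {question}
Answer: {answer}

On a scale of 1-5, how well does this answer address the question?
1 = completely off topic, 5 = fully and directly answered.
Reply with a single integer."""

def build_judge_prompt(question, answer):
    """Fill the template; send the result to any judge LLM and parse its reply."""
    return JUDGE_PROMPT.format(question=question, answer=answer)

prompt = build_judge_prompt("What is the free-tier rate limit?",
                            "1000 requests per hour.")
```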
- Faithfulness: Is the answer supported by the retrieved context? No hallucinations?
- Formula: (Number of claims supported by context) / Total claims in the answer
- Example: Retrieved context: "Our API rate limit is 1000 requests per hour for free tier users"
- Generated answer, broken into claims:
    - "The API has a rate limit of 1000 requests per hour for free tier users" - supported claim
    - "Premium users get 10000 requests per hour" - not in context: hallucination
    - "Contact support to increase limit" - not in context
Faithfulness = 1/3 = 33.3%. This is a poor score --> significant hallucinations.
Use an LLM to extract the claims and verify each one against the retrieved context.


