Metrics to Evaluate Hybrid-AI LLM Pipelines
The aim here is to explore the evaluation metrics that can be applied to an LLM pipeline and to provide actionable insights for refining the pipeline further.
Metrics for the Interactive Graphene Documentation Pipeline
Metrics
1. overall_avg_star_rating
- Computes the mean rating received for READMEs generated across various Graphene Tutorials.
2. overall_avg_feedback_sentiment_score
- Determines the average sentiment polarity score derived from feedback gathered for READMEs generated across various Graphene Tutorials.
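Both metrics reduce to simple means once the ratings and sentiment scores have been collected. A minimal sketch (the function name and sample data are illustrative, not from the codebase):

```python
from statistics import mean

def overall_avg_star_rating(star_ratings):
    """Mean star rating across generated READMEs (illustrative helper)."""
    return mean(star_ratings)

# Example: ratings collected for five generated READMEs (made-up data)
print(overall_avg_star_rating([5, 4, 4, 3, 5]))  # -> 4.2
```

`overall_avg_feedback_sentiment_score` follows the same pattern, averaging per-feedback polarity scores instead of star ratings.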
Sentiment Analysis Process
To determine the sentiment scores for the user feedback, we use the `TextBlob` Python library. For each processed feedback text, TextBlob produces a sentiment polarity score in the range [-1, 1]:
- -1: negative sentiment
- 0: neutral sentiment
- 1: positive sentiment
Sample Output
Here's a sample of the output:
| Feedback Text | Corresponding Sentiment Score |
|---|---|
| Good explanation | 0.7 |
| The generated docker commands were not accurate. | -0.2 |
| Very good! | 1.0 |
| LLM can generate a better response! | 0.5 |
| Good generation | 0.7 |
For further insights into how these metrics are calculated within the codebase and the pipeline processes involved, please refer to the corresponding ticket: #23 graphene_llm_readme_gen
Metrics for the Grounding LLM pipeline
Metrics for RAG
1. Faithfulness
Faithfulness is a RAG metric that evaluates whether the LLM/generator in your RAG pipeline produces outputs that factually align with the information presented in the retrieval context.
2. Answer Relevancy
Answer relevancy is a RAG metric that assesses whether your RAG generator outputs concise answers. It can be calculated as the proportion of sentences in an LLM output that are relevant to the input (i.e., divide the number of relevant sentences by the total number of sentences). Related retrieval-side metrics include contextual relevancy, contextual precision, and contextual recall.
Libraries: DeepEval, Ragas
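The answer relevancy definition above reduces to a ratio. A minimal sketch of the final arithmetic (the per-sentence relevance judgments themselves would come from an evaluator, e.g. DeepEval or Ragas; this hypothetical helper only computes the proportion):

```python
def answer_relevancy(num_relevant_sentences, total_sentences):
    """Proportion of output sentences that are relevant to the input."""
    if total_sentences == 0:
        return 0.0  # no output sentences to judge
    return num_relevant_sentences / total_sentences

# Example: 3 of 4 generated sentences judged relevant to the input
print(answer_relevancy(3, 4))  # -> 0.75
```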