
Metrics to Evaluate Hybrid-AI LLM Pipelines

This page explores the evaluation metrics that can be applied to an LLM pipeline and the actionable insights they provide for refining the pipeline further.

Metrics for the Interactive Graphene Documentation Pipeline

Metrics

1. overall_avg_star_rating

  • Computes the mean rating received for READMEs generated across various Graphene Tutorials.

2. overall_avg_feedback_sentiment_score

  • Determines the average sentiment polarity score derived from feedback gathered for READMEs generated across various Graphene Tutorials; a sketch of both aggregations follows this list.
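
Both aggregations reduce to simple means over collected feedback records. The sketch below is illustrative only: the record fields (`star_rating`, `sentiment_score`) are assumed placeholders, not the actual schema from the pipeline codebase.

```python
from statistics import mean

# Hypothetical feedback records; the real schema lives in the pipeline codebase.
feedback_records = [
    {"star_rating": 4, "sentiment_score": 0.7},
    {"star_rating": 2, "sentiment_score": -0.2},
    {"star_rating": 5, "sentiment_score": 1.0},
]

# overall_avg_star_rating: mean rating across all generated READMEs.
overall_avg_star_rating = mean(r["star_rating"] for r in feedback_records)

# overall_avg_feedback_sentiment_score: mean sentiment polarity across all feedback.
overall_avg_feedback_sentiment_score = mean(r["sentiment_score"] for r in feedback_records)

print(overall_avg_star_rating, overall_avg_feedback_sentiment_score)
```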

Sentiment Analysis Process

To determine the sentiment scores for user feedback, we use the TextBlob Python library. For each processed feedback text, TextBlob generates a sentiment polarity score in the range [-1.0, 1.0] (a usage sketch follows this list):

  • -1.0: strongly negative sentiment
  • 0.0: neutral sentiment
  • +1.0: strongly positive sentiment
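
A minimal sketch of scoring a single feedback entry with TextBlob (the feedback string is illustrative):

```python
from textblob import TextBlob

# Illustrative feedback text; real inputs are user feedback on generated READMEs.
feedback = "The generated docker commands were not accurate."

# TextBlob exposes a polarity score in [-1.0, 1.0] via the sentiment property.
polarity = TextBlob(feedback).sentiment.polarity
print(polarity)  # a negative value is expected for this feedback
```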

Sample Output

Here is a sample of the output:

| Feedback Text | Sentiment Score |
| --- | --- |
| Good explanation | 0.7 |
| The generated docker commands were not accurate. | -0.2 |
| Very good! | 1.0 |
| LLM can generate a better response! | 0.5 |
| Good generation | 0.7 |


For further insights into how these metrics are calculated in the codebase and the pipeline processes involved, refer to the corresponding ticket: #23 (closed) in graphene_llm_readme_gen.

Metrics for the Grounding LLM Pipeline

Metrics for RAG

1. Faithfulness

Faithfulness is a RAG metric that evaluates whether the generator in your RAG pipeline produces outputs that factually align with the information presented in the retrieval context.

2. Answer Relevancy

Answer relevancy is a RAG metric that assesses whether your RAG generator outputs concise, on-topic answers. It can be calculated as the proportion of sentences in an LLM output that are relevant to the input (i.e., the number of relevant sentences divided by the total number of sentences); a toy sketch of this calculation follows. Related metrics include contextual relevancy, contextual precision, and contextual recall.
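
A toy sketch of that proportion, assuming a hypothetical `is_relevant` judge (in practice this is typically an LLM call or an embedding-similarity check against the input question):

```python
def answer_relevancy(sentences: list[str], is_relevant) -> float:
    """Proportion of output sentences judged relevant to the input."""
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if is_relevant(s)) / len(sentences)

# Hypothetical relevance judge for illustration only.
def judge(sentence: str) -> bool:
    return "Graphene" in sentence

output_sentences = [
    "Graphene maps GraphQL types onto Python classes.",
    "On an unrelated note, documentation is fun to write.",
]
print(answer_relevancy(output_sentences, judge))  # 0.5
```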

Libraries: Deepeval, Ragas
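
For example, Deepeval ships both metrics out of the box. The sketch below follows its documented API (exact signatures may vary by version, and the metrics require a configured evaluation model at runtime); the test-case contents are illustrative placeholders:

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Illustrative test case; input, output, and retrieval context are placeholders.
test_case = LLMTestCase(
    input="How do I define a Query type in Graphene?",
    actual_output="Subclass graphene.ObjectType and declare fields with resolvers.",
    retrieval_context=["Queries in Graphene are defined by subclassing graphene.ObjectType."],
)

# Both metrics use an LLM judge under the hood (e.g. via OPENAI_API_KEY).
faithfulness = FaithfulnessMetric(threshold=0.7)
faithfulness.measure(test_case)
print(faithfulness.score, faithfulness.reason)

relevancy = AnswerRelevancyMetric(threshold=0.7)
relevancy.measure(test_case)
print(relevancy.score, relevancy.reason)
```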
