LLM Evaluator
Available Frameworks
- DeepEval — Open-source, pytest-style, with 20+ built-in metrics (hallucination, answer relevancy, faithfulness, toxicity, etc.). Comes with a UI dashboard. https://github.com/confident-ai/deepeval (see the sketch after this list)
- RAGAS — Focused specifically on RAG pipelines. Metrics like context precision, context recall, and faithfulness.
- LangSmith — LangChain's evaluation + tracing platform. Tied to the LangChain ecosystem and partially paid.
- EleutherAI LM Evaluation Harness — The industry standard for benchmarking open-source models (used by the Hugging Face leaderboards). Hundreds of built-in tasks (MMLU, HellaSwag, GSM8K, etc.).
- BIG-bench — Google's massive benchmark suite. Overkill for most use cases, but comprehensive.
- OpenAI Evals — OpenAI's own framework. Model-agnostic despite the name; good for custom eval sets.
- Promptfoo — YAML-config-based, very easy to set up, supports side-by-side model comparisons with a nice web UI. Great for quick iteration. https://github.com/promptfoo/promptfoo
- Phoenix (Arize) — Focused on observability + evals together. Good for production monitoring.
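As referenced in the DeepEval entry above, here is a minimal sketch of what a pytest-style DeepEval test might look like. The names used (`assert_test`, `AnswerRelevancyMetric`, `LLMTestCase`) follow DeepEval's documented interface, but treat this as an illustration and verify against the version you install.

```python
# test_answer_relevancy.py — run with `deepeval test run test_answer_relevancy.py`
# Assumes `pip install deepeval` and an LLM judge configured (e.g. OPENAI_API_KEY set).
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        # In a real test this would be your LLM application's output.
        actual_output="The capital of France is Paris.",
    )
    # Passes only if the judged relevancy score meets the threshold.
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```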
G-Eval is a framework that uses an LLM as a judge to evaluate text quality against custom criteria you define in natural language. No ground truth or rigid metric formulas are needed.
How does it work?
- You provide a criterion description (e.g., "How coherent is this response?")
- The LLM judge generates a chain-of-thought evaluation plan
- It scores the output (typically 1-5 or 0-1) based on that reasoning
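Since DeepEval (listed above) ships a G-Eval implementation, a minimal sketch might look like the following. The `GEval` and `LLMTestCaseParams` names follow DeepEval's documented API, and the criterion and example texts are illustrative assumptions.

```python
# Assumes `pip install deepeval` and an LLM judge configured (e.g. OPENAI_API_KEY set).
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# The criterion is plain natural language — no ground truth required.
coherence = GEval(
    name="Coherence",
    criteria="How coherent and logically consistent is the actual output?",
    # Tell the judge which parts of the test case to consider.
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input="Summarize the plot of Hamlet in two sentences.",
    actual_output=(
        "Prince Hamlet seeks revenge after his father's ghost reveals he was "
        "murdered by Claudius. His feigned madness and hesitation end in a duel "
        "that leaves most of the court, including Hamlet, dead."
    ),
)

coherence.measure(test_case)
print(coherence.score)   # numeric score (DeepEval normalizes to 0–1)
print(coherence.reason)  # the judge's reasoning behind the score
```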
