Evaluations

Build confidence in your AI applications with systematic testing. Measure, compare, and improve model performance.

Why Evaluate?

- 40% of teams ship without evals
- 3x faster iteration with evals
- 85% fewer regressions in production
- 2 hrs to set up a pipeline

Evaluation Pipeline

1. 📝 Test Cases: define inputs & expected outputs
2. 🚀 Run: execute against the model
3. 📊 Score: apply metrics
4. 📈 Analyze: review & iterate
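The four stages above can be sketched end-to-end in a few lines of plain Python. This is an illustrative outline, not library code: `fake_model` is a hypothetical stand-in for a real model API call.

```python
def fake_model(prompt: str) -> str:
    # Stand-in for the "Run" stage; a real pipeline would call the model API here.
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")

# 1. Test Cases: inputs paired with expected outputs
test_cases = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Largest planet?", "expected": "Jupiter"},
]

# 2. Run + 3. Score: execute each case and apply an exact-match metric
results = []
for case in test_cases:
    output = fake_model(case["input"])
    results.append({"case": case, "output": output,
                    "passed": output == case["expected"]})

# 4. Analyze: aggregate into a pass rate to review and iterate on
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"Pass rate: {pass_rate:.1%}")
```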

Evaluation Types

🎯 Accuracy Evals

Measure correctness against known answers. Best for tasks with definitive right/wrong answers.

Exact Match, F1 Score, BLEU
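To make these metrics concrete, here is a minimal sketch of how Exact Match and token-level F1 are commonly computed. These are plain-Python illustrations, not the library's own metric classes.

```python
from collections import Counter

def exact_match(prediction: str, expected: str) -> bool:
    # Normalize whitespace and case before comparing.
    return prediction.strip().lower() == expected.strip().lower()

def token_f1(prediction: str, expected: str) -> float:
    # F1 over shared tokens: harmonic mean of precision and recall.
    pred_tokens = prediction.lower().split()
    exp_tokens = expected.lower().split()
    common = Counter(pred_tokens) & Counter(exp_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match is strict; token F1 gives partial credit, e.g. `token_f1("the capital is paris", "paris")` scores 0.4 rather than 0.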

📐 Factuality Evals

Check if responses contain accurate information without hallucinations.

Grounded, Citation Check, Fact Score
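The core idea behind a grounding check can be sketched with a simple heuristic: flag response sentences whose words mostly don't appear in the source passage. Real factuality evals are more sophisticated (entailment models, citation verification); this illustrative function only shows the shape of the check.

```python
def ungrounded_sentences(response: str, source: str) -> list:
    """Return sentences whose word overlap with the source is below 50%."""
    source_words = set(source.lower().split())
    flagged = []
    for sentence in response.split("."):
        words = set(sentence.lower().split())
        if not words:
            continue
        # Flag the sentence if fewer than half its words appear in the source.
        if len(words & source_words) / len(words) < 0.5:
            flagged.append(sentence.strip())
    return flagged
```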

🤖 Model-Graded Evals

Use AI to judge AI. Scale evaluation without human review.

LLM-as-Judge, Rubric Scoring, Pairwise
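In a pairwise setup, a judge model picks a winner for each head-to-head comparison, and the verdicts are aggregated into win rates. A hedged sketch of that aggregation step (the judge calls themselves are omitted):

```python
from collections import Counter

def win_rates(verdicts: list) -> dict:
    """Turn a list of preferred-model names into per-model win rates."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {model: count / total for model, count in counts.items()}

# Each entry names the model the judge preferred in one comparison.
verdicts = ["model_a", "model_a", "model_b", "model_a"]
print(win_rates(verdicts))  # model_a preferred in 3 of 4 comparisons
```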

👥 Human Evals

Get human feedback for nuanced quality assessment.

Preference, Rating Scale, A/B Test
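Human feedback still needs aggregation before it is useful. A minimal sketch for two common formats, a 1–5 rating scale and an A/B preference test:

```python
def mean_rating(ratings: list) -> float:
    # Average score on a 1-5 rating scale.
    return sum(ratings) / len(ratings)

def ab_winner(votes: list) -> str:
    # Return the variant preferred by the majority of raters.
    return max(set(votes), key=votes.count)
```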

Quick Start

Set up a basic evaluation pipeline in minutes:

```python
from mythicdot import MythicDot
from mythicdot.evals import Evaluation, ExactMatch

client = MythicDot()

# Define test cases
test_cases = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Largest planet?", "expected": "Jupiter"},
]

# Create evaluation
eval_run = Evaluation(
    name="basic_accuracy",
    model="mythic-4-mini",
    test_cases=test_cases,
    metrics=[ExactMatch()]
)

# Run evaluation
results = eval_run.run()

# View results
print(f"Accuracy: {results.accuracy:.1%}")
print(f"Pass rate: {results.pass_rate:.1%}")
```

Sample Results

| Model | Latency (p50) | Cost / 1K | Accuracy Score |
|---|---|---|---|
| mythic-4 | 420ms | $0.015 | 94% |
| mythic-4-mini | 180ms | $0.002 | 87% |
| mythic-o1 | 1.2s | $0.060 | 97% |
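Trade-offs like those in the table above can also be compared programmatically. A hedged sketch, using the sample numbers from the table: pick the cheapest model that meets a target accuracy.

```python
# Sample numbers from the results table above.
models = [
    {"name": "mythic-4", "latency_ms": 420, "cost_per_1k": 0.015, "score": 0.94},
    {"name": "mythic-4-mini", "latency_ms": 180, "cost_per_1k": 0.002, "score": 0.87},
    {"name": "mythic-o1", "latency_ms": 1200, "cost_per_1k": 0.060, "score": 0.97},
]

def cheapest_meeting(models: list, min_score: float) -> str:
    # Filter to models at or above the target score, then minimize cost.
    candidates = [m for m in models if m["score"] >= min_score]
    return min(candidates, key=lambda m: m["cost_per_1k"])["name"]

print(cheapest_meeting(models, 0.90))  # mythic-4: meets 90% at lower cost than mythic-o1
```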

LLM-as-Judge

For open-ended tasks, use a model to evaluate responses:

```python
from mythicdot.evals import LLMJudge

# Define rubric
judge = LLMJudge(
    criteria=[
        "Accuracy: Is the information correct?",
        "Clarity: Is the response easy to understand?",
        "Completeness: Does it fully answer the question?"
    ],
    scale=(1, 5),
    judge_model="mythic-4"
)

# Evaluate a response
score = judge.evaluate(
    question="Explain photosynthesis",
    response="Plants convert sunlight into energy..."
)

print(f"Score: {score.average}/5")
```

💡 Best Practice: Use a Stronger Judge

When evaluating with LLM-as-judge, use a more capable model (such as mythic-4 or mythic-o1) to judge responses from smaller models; a stronger judge produces more reliable scores.

Start Evaluating

Build confidence in your AI applications with systematic testing.

Read the Docs View Examples