Build confidence in your AI applications with systematic testing. Measure, compare, and improve model performance.
Measure correctness against known answers. Best for tasks with definitive right/wrong answers.
Check if responses contain accurate information without hallucinations.
Use AI to judge AI. Scale evaluation without human review.
Get human feedback for nuanced quality assessment.
Set up a basic evaluation pipeline in minutes:
```python
from mythicdot import MythicDot
from mythicdot.evals import Evaluation, ExactMatch

client = MythicDot()

# Define test cases
test_cases = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Largest planet?", "expected": "Jupiter"},
]

# Create evaluation
eval_run = Evaluation(
    name="basic_accuracy",
    model="mythic-4-mini",
    test_cases=test_cases,
    metrics=[ExactMatch()]
)

# Run evaluation
results = eval_run.run()

# View results
print(f"Accuracy: {results.accuracy:.1%}")
print(f"Pass rate: {results.pass_rate:.1%}")
```
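Under the hood, an exact-match metric is just a normalized string comparison over each test case. A minimal plain-Python sketch (illustrative only, not the SDK's implementation; the strip-and-lowercase normalization is an assumption):

```python
# Illustrative sketch of exact-match scoring, independent of the SDK.
# Normalization (strip + lowercase) is an assumption; a stricter metric
# could compare raw strings instead.
def exact_match(expected: str, actual: str) -> bool:
    return expected.strip().lower() == actual.strip().lower()

# Hypothetical model responses keyed by input
responses = {
    "What is 2 + 2?": "4",
    "Capital of France?": "paris",
    "Largest planet?": "Jupiter",
}

test_cases = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Largest planet?", "expected": "Jupiter"},
]

passed = sum(
    exact_match(case["expected"], responses[case["input"]])
    for case in test_cases
)
accuracy = passed / len(test_cases)
print(f"Accuracy: {accuracy:.1%}")  # → Accuracy: 100.0%
```

Note that "paris" still passes here because of the case-insensitive comparison; without normalization it would count as a failure.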
| Model | Accuracy | Latency (p50) | Cost / 1K |
|---|---|---|---|
| mythic-4 | 94% | 420ms | $0.015 |
| mythic-4-mini | 87% | 180ms | $0.002 |
| mythic-o1 | 97% | 1.2s | $0.060 |
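Benchmark tables like this one are most useful when turned into a selection rule. A small sketch (the helper function and accuracy floor are hypothetical, but the numbers come from the table above):

```python
# Benchmark figures from the comparison table above.
models = [
    {"name": "mythic-4",      "accuracy": 0.94, "latency_ms": 420,  "cost_per_1k": 0.015},
    {"name": "mythic-4-mini", "accuracy": 0.87, "latency_ms": 180,  "cost_per_1k": 0.002},
    {"name": "mythic-o1",     "accuracy": 0.97, "latency_ms": 1200, "cost_per_1k": 0.060},
]

def cheapest_meeting(accuracy_floor: float) -> str:
    """Hypothetical helper: cheapest model at or above an accuracy floor."""
    eligible = [m for m in models if m["accuracy"] >= accuracy_floor]
    return min(eligible, key=lambda m: m["cost_per_1k"])["name"]

print(cheapest_meeting(0.90))  # → mythic-4
print(cheapest_meeting(0.95))  # → mythic-o1
```

The same pattern extends to latency budgets: filter on `latency_ms` first, then minimize cost.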
For open-ended tasks, use a model to evaluate responses:
```python
from mythicdot.evals import LLMJudge

# Define rubric
judge = LLMJudge(
    criteria=[
        "Accuracy: Is the information correct?",
        "Clarity: Is the response easy to understand?",
        "Completeness: Does it fully answer the question?"
    ],
    scale=(1, 5),
    judge_model="mythic-4"
)

# Evaluate a response
score = judge.evaluate(
    question="Explain photosynthesis",
    response="Plants convert sunlight into energy..."
)

print(f"Score: {score.average}/5")
When evaluating with LLM-as-judge, use a more capable model (like mythic-4 or mythic-o1) to judge responses from smaller models for more reliable scores.
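The average printed above is the mean of the per-criterion scores. A plain-Python sketch of that aggregation (the criterion values here are made-up example scores, not real judge output):

```python
# Illustrative aggregation of per-criterion judge scores into one number.
# The scores below are assumed example values on the 1-5 scale.
criterion_scores = {
    "Accuracy": 5,
    "Clarity": 4,
    "Completeness": 4,
}

average = sum(criterion_scores.values()) / len(criterion_scores)
print(f"Score: {average:.1f}/5")  # → Score: 4.3/5
```

Keeping the per-criterion breakdown alongside the average is often worthwhile: a 4.3 driven by weak completeness calls for a different fix than one driven by weak clarity.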