Evaluations

Build confidence in your AI applications with systematic testing. Measure, compare, and improve model performance.

Why Evaluate?

- 40% of teams ship without evals
- 3x faster iteration with evals
- 85% fewer regressions in production
- 2 hrs to set up a pipeline

Evaluation Pipeline

1. 📝 Test Cases: define inputs & expected outputs
2. 🚀 Run: execute against the model
3. 📊 Score: apply metrics
4. 📈 Analyze: review & iterate
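The four stages above can be sketched end-to-end in a few lines of plain Python. This is an illustrative outline, not library code: `fake_model` is a hypothetical stand-in for a real model API call.

```python
def fake_model(prompt: str) -> str:
    # Stand-in for the "Run" stage; a real pipeline would call the model API here.
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")

# 1. Test Cases: inputs paired with expected outputs
test_cases = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Largest planet?", "expected": "Jupiter"},
]

# 2. Run + 3. Score: execute each case and apply an exact-match metric
results = []
for case in test_cases:
    output = fake_model(case["input"])
    results.append({"case": case, "output": output,
                    "passed": output == case["expected"]})

# 4. Analyze: aggregate into a pass rate to review and iterate on
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"Pass rate: {pass_rate:.1%}")
```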

Evaluation Types

🎯 Accuracy Evals

Measure correctness against known answers. Best for tasks with definitive right/wrong answers.

Exact Match, F1 Score, BLEU
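To make these metrics concrete, here is a minimal sketch of how Exact Match and token-level F1 are commonly computed. These are plain-Python illustrations, not the library's own metric classes.

```python
from collections import Counter

def exact_match(prediction: str, expected: str) -> bool:
    # Normalize whitespace and case before comparing.
    return prediction.strip().lower() == expected.strip().lower()

def token_f1(prediction: str, expected: str) -> float:
    # F1 over shared tokens: harmonic mean of precision and recall.
    pred_tokens = prediction.lower().split()
    exp_tokens = expected.lower().split()
    common = Counter(pred_tokens) & Counter(exp_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match is strict; token F1 gives partial credit, e.g. `token_f1("the capital is paris", "paris")` scores 0.4 rather than 0.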

📐 Factuality Evals

Check if responses contain accurate information without hallucinations.

Grounded, Citation Check, Fact Score
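The core idea behind a grounding check can be sketched with a simple heuristic: flag response sentences whose words mostly don't appear in the source passage. Real factuality evals are more sophisticated (entailment models, citation verification); this illustrative function only shows the shape of the check.

```python
def ungrounded_sentences(response: str, source: str) -> list:
    """Return sentences whose word overlap with the source is below 50%."""
    source_words = set(source.lower().split())
    flagged = []
    for sentence in response.split("."):
        words = set(sentence.lower().split())
        if not words:
            continue
        # Flag the sentence if fewer than half its words appear in the source.
        if len(words & source_words) / len(words) < 0.5:
            flagged.append(sentence.strip())
    return flagged
```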

🤖 Model-Graded Evals

Use AI to judge AI. Scale evaluation without human review.

LLM-as-Judge, Rubric Scoring, Pairwise
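In a pairwise setup, a judge model picks a winner for each head-to-head comparison, and the verdicts are aggregated into win rates. A hedged sketch of that aggregation step (the judge calls themselves are omitted):

```python
from collections import Counter

def win_rates(verdicts: list) -> dict:
    """Turn a list of preferred-model names into per-model win rates."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {model: count / total for model, count in counts.items()}

# Each entry names the model the judge preferred in one comparison.
verdicts = ["model_a", "model_a", "model_b", "model_a"]
print(win_rates(verdicts))  # model_a preferred in 3 of 4 comparisons
```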

👥 Human Evals

Get human feedback for nuanced quality assessment.

Preference, Rating Scale, A/B Test
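Human feedback still needs aggregation before it is useful. A minimal sketch for two common formats, a 1–5 rating scale and an A/B preference test:

```python
def mean_rating(ratings: list) -> float:
    # Average score on a 1-5 rating scale.
    return sum(ratings) / len(ratings)

def ab_winner(votes: list) -> str:
    # Return the variant preferred by the majority of raters.
    return max(set(votes), key=votes.count)
```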

Quick Start

Set up a basic evaluation pipeline in minutes:

```python
from mythicdot import MythicDot
from mythicdot.evals import Evaluation, ExactMatch

client = MythicDot()

# Define test cases
test_cases = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Largest planet?", "expected": "Jupiter"},
]

# Create evaluation
eval_run = Evaluation(
    name="basic_accuracy",
    model="mythic-4-mini",
    test_cases=test_cases,
    metrics=[ExactMatch()]
)

# Run evaluation
results = eval_run.run()

# View results
print(f"Accuracy: {results.accuracy:.1%}")
print(f"Pass rate: {results.pass_rate:.1%}")
```

Sample Results

| Model | Latency (p50) | Cost / 1K | Accuracy Score |
|---|---|---|---|
| mythic-4 | 420ms | $0.015 | 94% |
| mythic-4-mini | 180ms | $0.002 | 87% |
| mythic-o1 | 1.2s | $0.060 | 97% |
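Trade-offs like those in the table above can also be compared programmatically. A hedged sketch, using the sample numbers from the table: pick the cheapest model that meets a target accuracy.

```python
# Sample numbers from the results table above.
models = [
    {"name": "mythic-4", "latency_ms": 420, "cost_per_1k": 0.015, "score": 0.94},
    {"name": "mythic-4-mini", "latency_ms": 180, "cost_per_1k": 0.002, "score": 0.87},
    {"name": "mythic-o1", "latency_ms": 1200, "cost_per_1k": 0.060, "score": 0.97},
]

def cheapest_meeting(models: list, min_score: float) -> str:
    # Filter to models at or above the target score, then minimize cost.
    candidates = [m for m in models if m["score"] >= min_score]
    return min(candidates, key=lambda m: m["cost_per_1k"])["name"]

print(cheapest_meeting(models, 0.90))  # mythic-4: meets 90% at lower cost than mythic-o1
```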

LLM-as-Judge

For open-ended tasks, use a model to evaluate responses:

```python
from mythicdot.evals import LLMJudge

# Define rubric
judge = LLMJudge(
    criteria=[
        "Accuracy: Is the information correct?",
        "Clarity: Is the response easy to understand?",
        "Completeness: Does it fully answer the question?"
    ],
    scale=(1, 5),
    judge_model="mythic-4"
)

# Evaluate a response
score = judge.evaluate(
    question="Explain photosynthesis",
    response="Plants convert sunlight into energy..."
)

print(f"Score: {score.average}/5")
```

💡 Best Practice: Use a Stronger Judge

When evaluating with LLM-as-judge, use a more capable model (such as mythic-4 or mythic-o1) to judge responses from smaller models; a stronger judge produces more reliable scores.

Start Evaluating

Build confidence in your AI applications with systematic testing.

Read the Docs View Examples