🎯 Predicted Outputs

Speed up completions by supplying the expected output in advance. When you already know what most of the response will look like, give the model a head start.

  • 5x faster for code edits
  • 80% fewer output tokens
  • 60% cost reduction

How It Works

When you provide a prediction, the model uses speculative decoding to verify and extend your prediction rather than generating every token from scratch. Matching tokens are nearly free.
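The verification idea can be sketched in a few lines. This is a deliberately simplified, prefix-only model (real speculative decoding operates on tokenizer tokens and can re-sync after a mismatch), using word-level "tokens" for readability:

```python
def matched_prefix(predicted, actual):
    """Count how many leading tokens of the prediction the model accepts.

    Accepted tokens are verified cheaply; everything after the first
    mismatch falls back to normal, token-by-token generation.
    """
    count = 0
    for p, a in zip(predicted, actual):
        if p != a:
            break
        count += 1
    return count


predicted = ["def", "add", "(", "a", ",", "b", ")", ":"]
actual    = ["def", "add", "(", "x", ",", "y", ")", ":"]

# The first three tokens match, so only the rest costs full generation.
print(matched_prefix(predicted, actual))  # 3
```

The closer your prediction is to the real output, the longer the accepted prefix and the larger the speedup.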

Token Generation Comparison
  • Without prediction: 500 tokens generated at normal cost
  • With prediction: 400 tokens predicted (fast/cheap) + 100 new tokens generated at normal cost
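The arithmetic behind these numbers is straightforward. The sketch below assumes predicted tokens bill at a steep discount; the 0.1 rate is illustrative, not a published price:

```python
FULL_RATE = 1.0        # relative cost of a normally generated token
PREDICTED_RATE = 0.1   # assumed discount for accepted predicted tokens (illustrative)

without_prediction = 500 * FULL_RATE
with_prediction = 400 * PREDICTED_RATE + 100 * FULL_RATE

print(without_prediction)                     # 500.0
print(with_prediction)                        # 140.0
print(with_prediction / without_prediction)   # 0.28
```

Under these assumptions the predicted run costs 28% of the baseline; the exact savings depend on your provider's rate for accepted prediction tokens and on how much of the prediction actually matches.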

Example: Code Refactoring

The most common use case is code edits where most of the file stays the same:

Python
from mythicdot import MythicDot

client = MythicDot()

# Original code that needs a small change
original_code = """
def calculate_total(items):
    total = 0
    for item in items:
        total += item.price
    return total
"""

response = client.chat.completions.create(
    model="mythic-4",
    messages=[
        {
            "role": "user",
            "content": f"Add tax calculation (8%) to this function:\n{original_code}"
        }
    ],
    prediction={
        "type": "content",
        "content": original_code  # Predict most of output matches input
    }
)

# Model only generates the diff - 5x faster!
print(response.choices[0].message.content)

Best Use Cases

🔧

Code Refactoring

Small edits to large files - variable renames, function modifications, style fixes

📝

Document Editing

Grammar fixes, formatting changes, or small updates to existing text

🔄

Template Filling

Structured outputs where the template is known but values change

🌐

Translation Updates

Updating translations where most content stays the same
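For template filling, the fixed skeleton itself makes a strong prediction, since only the slotted values need fresh generation. A minimal sketch of building such a prediction payload (the invoice template and its bracketed placeholders are hypothetical, and the `prediction` dict is passed to `client.chat.completions.create` exactly as in the refactoring example above):

```python
# Hypothetical invoice template; only the bracketed values change per request.
template = """{
  "customer": "[NAME]",
  "total": [TOTAL],
  "currency": "USD",
  "status": "paid"
}"""

# Most tokens of the final output match this skeleton verbatim,
# so they are verified cheaply rather than generated from scratch.
prediction = {"type": "content", "content": template}

print(prediction["type"])  # content
```

Only `[NAME]` and `[TOTAL]` cost normal generation; the rest of the JSON structure rides on the prediction.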

💡 Tips for Best Results
  • Predict at least 50% of the expected output for significant speedups
  • Works best when the prediction is close to the actual output
  • Use for editing tasks, not creative generation
  • Combine with streaming for the best user experience
  • Mismatched predictions are handled gracefully - no errors, just normal generation

Speed Up Your Edits

Give the model a head start on predictable outputs.

Latency Guide → Streaming →