Latency Optimization

Techniques and strategies for building fast, responsive AI applications. Reduce time-to-first-token and improve user experience.

Understanding Latency

| Metric | Typical value |
|---|---|
| Time to First Token | ~200ms |
| Inter-token Latency | ~50ms |
| Output Tokens/sec | ~80/s |
| Avg Request (100 tok) | ~1.5s |

Optimization Techniques

⚡ Use Streaming High Impact

Stream responses instead of waiting for completion. Users see output immediately, dramatically improving perceived performance.

```python
response = client.chat.completions.create(
    model="mythic-4",
    messages=messages,
    stream=True,  # Enable streaming
)

for chunk in response:
    print(chunk.choices[0].delta.content, end="")
```
📦 Reduce Input Size High Impact

Fewer input tokens = faster time-to-first-token. Be concise in prompts, summarize long documents, and only include relevant context.
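One practical way to keep inputs small is to trim conversation history to a token budget before each request. A minimal sketch, using a rough 4-characters-per-token heuristic (swap in your provider's tokenizer for exact counts):

```python
def trim_history(messages, max_tokens=2000):
    """Keep the system prompt plus the most recent turns that fit the budget."""
    def approx_tokens(msg):
        # Rough heuristic: ~4 chars per token, plus overhead for role/formatting
        return len(msg["content"]) // 4 + 4

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(approx_tokens(m) for m in system)

    kept = []
    for msg in reversed(rest):  # walk newest-first, dropping the oldest turns
        cost = approx_tokens(msg)
        if budget - cost < 0:
            break
        budget -= cost
        kept.append(msg)
    return system + list(reversed(kept))
```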

🎯 Limit Output Tokens Medium Impact

Set max_tokens to the minimum needed. Less output = faster completion time.

```python
response = client.chat.completions.create(
    model="mythic-4",
    messages=messages,
    max_tokens=150,  # Limit output length
)
```
🚀 Choose the Right Model High Impact

Smaller models are faster. Use mythic-4-mini for simple tasks and reserve mythic-4 for complex reasoning.
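This can be made explicit with a small routing function. The task categories below are illustrative; tune the rules to your own workload:

```python
def pick_model(task: str, needs_reasoning: bool = False) -> str:
    """Illustrative routing policy: send simple tasks to the fast, small model."""
    if needs_reasoning:
        return "mythic-4-reasoning"
    # Short, formulaic tasks rarely benefit from the larger model
    if task in {"classify", "extract", "summarize-short"}:
        return "mythic-4-mini"
    return "mythic-4"
```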

💾 Context Caching High Impact

Cache repeated context (system prompts, documents) to skip reprocessing. Saves time and reduces costs.
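Exact caching APIs vary by provider, but a portable pattern is to keep the expensive static context byte-identical across requests, so server-side prefix caching (where available) can reuse its computation. A sketch with hypothetical prompt and document constants:

```python
SYSTEM_PROMPT = "You are a support assistant for Acme Corp."  # hypothetical
REFERENCE_DOC = "...long policy document..."                  # placeholder

def build_messages(user_question: str) -> list:
    """Put the static context first and keep it identical on every request;
    only the final user message varies."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Reference:\n{REFERENCE_DOC}"},
        {"role": "user", "content": user_question},  # only this part varies
    ]
```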

🔄 Parallel Requests Medium Impact

For multi-step workflows, run independent requests in parallel instead of sequentially.

```python
import asyncio

# Assumes an async-capable client; with a synchronous client, wrap each
# call in asyncio.to_thread() instead.
async def parallel_requests(prompts):
    tasks = [
        client.chat.completions.create(
            model="mythic-4",
            messages=[{"role": "user", "content": p}],
        )
        for p in prompts
    ]
    return await asyncio.gather(*tasks)
```

Model Comparison

| Model | TTFT | Tokens/sec | Best For |
|---|---|---|---|
| mythic-4-turbo | ~150ms | ~100/s | Speed-critical applications |
| mythic-4-mini | ~120ms | ~120/s | Simple tasks, high volume |
| mythic-4 | ~200ms | ~80/s | Balanced performance |
| mythic-4-reasoning | ~500ms | ~40/s | Complex reasoning |
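The table translates directly into per-model latency estimates for a given response length, which makes trade-offs concrete (for a 100-token response, mythic-4-mini finishes in under a second while mythic-4-reasoning takes about three):

```python
# (TTFT in seconds, output tokens/sec) from the comparison table above
MODELS = {
    "mythic-4-turbo":     (0.150, 100),
    "mythic-4-mini":      (0.120, 120),
    "mythic-4":           (0.200, 80),
    "mythic-4-reasoning": (0.500, 40),
}

def estimate(model: str, output_tokens: int) -> float:
    """Estimated total latency: TTFT plus generation time."""
    ttft, tps = MODELS[model]
    return ttft + output_tokens / tps
```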

💡 Pro Tip: Speculative Execution

For multi-turn conversations, you can speculatively pre-fetch likely follow-up responses while the user is reading the current response. This creates the illusion of instant replies.
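One way to structure this is a small cache that kicks off generation for a predicted follow-up in the background, then serves the finished (or in-flight) result if the user actually asks it. A minimal sketch, generic over any async generation function:

```python
import asyncio

class SpeculativeCache:
    """Start generating a likely follow-up in the background; if the user
    actually asks it, the response is already (partly or fully) done."""

    def __init__(self, generate):  # generate: async fn(prompt) -> str
        self._generate = generate
        self._pending = {}

    def prefetch(self, prompt: str) -> None:
        """Begin speculative generation (must be called inside a running loop)."""
        if prompt not in self._pending:
            self._pending[prompt] = asyncio.create_task(self._generate(prompt))

    async def get(self, prompt: str) -> str:
        """Return the prefetched result if available, else generate normally."""
        task = self._pending.pop(prompt, None)
        if task is not None:
            return await task  # near-instant if the prefetch already finished
        return await self._generate(prompt)
```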


Need More Speed?

Talk to our team about dedicated capacity and custom optimizations.

Contact Enterprise →