Latency Optimization

Techniques and strategies for building fast, responsive AI applications. Reduce time-to-first-token and improve user experience.

Understanding Latency

| Metric | Typical value |
|---|---|
| Time to First Token | ~200ms |
| Inter-token Latency | ~50ms |
| Output Tokens/sec | ~80/s |
| Avg Request (100 tok) | ~1.5s |

Optimization Techniques

⚡ Use Streaming High Impact

Stream responses instead of waiting for completion. Users see output immediately, dramatically improving perceived performance.

```python
response = client.chat.completions.create(
    model="mythic-4",
    messages=messages,
    stream=True,  # Enable streaming
)

for chunk in response:
    print(chunk.choices[0].delta.content, end="")
```
📦 Reduce Input Size High Impact

Fewer input tokens = faster time-to-first-token. Be concise in prompts, summarize long documents, and only include relevant context.
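One practical way to keep inputs small is to trim conversation history to a token budget before each request. A minimal sketch, using a rough 4-characters-per-token heuristic (swap in your provider's tokenizer for exact counts):

```python
def trim_history(messages, max_tokens=2000):
    """Keep the system prompt plus the most recent turns that fit the budget."""
    def approx_tokens(msg):
        # Rough heuristic: ~4 chars per token, plus overhead for role/formatting
        return len(msg["content"]) // 4 + 4

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(approx_tokens(m) for m in system)

    kept = []
    for msg in reversed(rest):  # walk newest-first, dropping the oldest turns
        cost = approx_tokens(msg)
        if budget - cost < 0:
            break
        budget -= cost
        kept.append(msg)
    return system + list(reversed(kept))
```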

🎯 Limit Output Tokens Medium Impact

Set max_tokens to the minimum needed. Less output = faster completion time.

```python
response = client.chat.completions.create(
    model="mythic-4",
    messages=messages,
    max_tokens=150,  # Limit output length
)
```
🚀 Choose the Right Model High Impact

Smaller models are faster. Use mythic-4-mini for simple tasks and reserve mythic-4 for complex reasoning.
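This can be made explicit with a small routing function. The task categories below are illustrative; tune the rules to your own workload:

```python
def pick_model(task: str, needs_reasoning: bool = False) -> str:
    """Illustrative routing policy: send simple tasks to the fast, small model."""
    if needs_reasoning:
        return "mythic-4-reasoning"
    # Short, formulaic tasks rarely benefit from the larger model
    if task in {"classify", "extract", "summarize-short"}:
        return "mythic-4-mini"
    return "mythic-4"
```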

💾 Context Caching High Impact

Cache repeated context (system prompts, documents) to skip reprocessing. Saves time and reduces costs.
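Exact caching APIs vary by provider, but a portable pattern is to keep the expensive static context byte-identical across requests, so server-side prefix caching (where available) can reuse its computation. A sketch with hypothetical prompt and document constants:

```python
SYSTEM_PROMPT = "You are a support assistant for Acme Corp."  # hypothetical
REFERENCE_DOC = "...long policy document..."                  # placeholder

def build_messages(user_question: str) -> list:
    """Put the static context first and keep it identical on every request;
    only the final user message varies."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Reference:\n{REFERENCE_DOC}"},
        {"role": "user", "content": user_question},  # only this part varies
    ]
```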

🔄 Parallel Requests Medium Impact

For multi-step workflows, run independent requests in parallel instead of sequentially.

```python
import asyncio

# Assumes an async-capable client; with a synchronous client, wrap each
# call in asyncio.to_thread() instead.
async def parallel_requests(prompts):
    tasks = [
        client.chat.completions.create(
            model="mythic-4",
            messages=[{"role": "user", "content": p}],
        )
        for p in prompts
    ]
    return await asyncio.gather(*tasks)
```

Model Comparison

| Model | TTFT | Tokens/sec | Best For |
|---|---|---|---|
| mythic-4-turbo | ~150ms | ~100/s | Speed-critical applications |
| mythic-4-mini | ~120ms | ~120/s | Simple tasks, high volume |
| mythic-4 | ~200ms | ~80/s | Balanced performance |
| mythic-4-reasoning | ~500ms | ~40/s | Complex reasoning |
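The table translates directly into per-model latency estimates for a given response length, which makes trade-offs concrete (for a 100-token response, mythic-4-mini finishes in under a second while mythic-4-reasoning takes about three):

```python
# (TTFT in seconds, output tokens/sec) from the comparison table above
MODELS = {
    "mythic-4-turbo":     (0.150, 100),
    "mythic-4-mini":      (0.120, 120),
    "mythic-4":           (0.200, 80),
    "mythic-4-reasoning": (0.500, 40),
}

def estimate(model: str, output_tokens: int) -> float:
    """Estimated total latency: TTFT plus generation time."""
    ttft, tps = MODELS[model]
    return ttft + output_tokens / tps
```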

💡 Pro Tip: Speculative Execution

For multi-turn conversations, you can speculatively pre-fetch likely follow-up responses while the user is reading the current response. This creates the illusion of instant replies.
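One way to structure this is a small cache that kicks off generation for a predicted follow-up in the background, then serves the finished (or in-flight) result if the user actually asks it. A minimal sketch, generic over any async generation function:

```python
import asyncio

class SpeculativeCache:
    """Start generating a likely follow-up in the background; if the user
    actually asks it, the response is already (partly or fully) done."""

    def __init__(self, generate):  # generate: async fn(prompt) -> str
        self._generate = generate
        self._pending = {}

    def prefetch(self, prompt: str) -> None:
        """Begin speculative generation (must be called inside a running loop)."""
        if prompt not in self._pending:
            self._pending[prompt] = asyncio.create_task(self._generate(prompt))

    async def get(self, prompt: str) -> str:
        """Return the prefetched result if available, else generate normally."""
        task = self._pending.pop(prompt, None)
        if task is not None:
            return await task  # near-instant if the prefetch already finished
        return await self._generate(prompt)
```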


Need More Speed?

Talk to our team about dedicated capacity and custom optimizations.

Contact Enterprise →