Techniques and strategies for building fast, responsive AI applications. Reduce time-to-first-token and improve user experience.
Stream responses instead of waiting for completion. Users see output immediately, dramatically improving perceived performance.
```python
response = client.chat.completions.create(
    model="mythic-4",
    messages=messages,
    stream=True,  # Enable streaming
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk's delta content may be None
        print(delta, end="", flush=True)
```
Fewer input tokens = faster time-to-first-token. Be concise in prompts, summarize long documents, and only include relevant context.
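As a sketch of trimming context before each request, the helper below keeps the system prompt plus only the most recent messages that fit a budget. The character limit is a rough stand-in for real token counting (a tokenizer would be more accurate), and `trim_history` is an illustrative name, not part of the API:

```python
def trim_history(messages, max_chars=4000):
    """Keep the system prompt plus the most recent messages that fit
    within a rough character budget (a proxy for token count)."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, total = [], 0
    # Walk backwards so the newest messages are kept first.
    for msg in reversed(rest):
        total += len(msg["content"])
        if total > max_chars:
            break
        kept.append(msg)
    return system + list(reversed(kept))
```

Call it on the conversation history right before each request so the prompt stays small as the conversation grows.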
Set max_tokens to the minimum needed. Less output = faster completion time.
```python
response = client.chat.completions.create(
    model="mythic-4",
    messages=messages,
    max_tokens=150,  # Limit output length
)
```
Smaller models are faster. Use mythic-4-mini for simple tasks and reserve mythic-4 for complex reasoning.
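A minimal routing sketch: send simple tasks to the smaller model and everything else to the full model. The `complexity` label and `pick_model` helper are assumptions for illustration; in practice you might classify by prompt length, task type, or a cheap classifier call:

```python
def pick_model(complexity: str) -> str:
    """Route simple tasks to the faster, smaller model;
    reserve the full model for complex reasoning."""
    return "mythic-4-mini" if complexity == "simple" else "mythic-4"
```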
Cache repeated context (system prompts, documents) to skip reprocessing. Saves time and reduces costs.
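Provider-side prompt caching typically applies automatically when the request prefix is byte-identical, so keep system prompts and documents stable across calls. As a complementary client-side sketch, fully identical requests can also be memoized locally so they never hit the network twice; `cached_completion` and the hashing scheme here are assumptions, not an official API:

```python
import hashlib
import json

_cache: dict = {}

def cached_completion(client, model, messages, **kwargs):
    """Memoize identical requests client-side. This is separate from
    provider-side prompt caching, which happens on the server."""
    key = hashlib.sha256(
        json.dumps([model, messages, kwargs], sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        resp = client.chat.completions.create(
            model=model, messages=messages, **kwargs
        )
        _cache[key] = resp.choices[0].message.content
    return _cache[key]
```

Note that a local cache only helps for exact repeats (e.g. popular FAQ queries); partially overlapping prompts still benefit only from server-side prefix caching.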
For multi-step workflows, run independent requests in parallel instead of sequentially.
```python
import asyncio

async def parallel_requests():
    # Note: this requires an async client; a synchronous client's
    # .create() is not awaitable and cannot be passed to gather().
    tasks = [
        client.chat.completions.create(...),
        client.chat.completions.create(...),
        client.chat.completions.create(...),
    ]
    results = await asyncio.gather(*tasks)
    return results
```
| Model | TTFT | Tokens/sec | Best For |
|---|---|---|---|
| mythic-4-turbo | ~150ms | ~100/s | Speed-critical applications |
| mythic-4-mini | ~120ms | ~120/s | Simple tasks, high volume |
| mythic-4 | ~200ms | ~80/s | Balanced performance |
| mythic-4-reasoning | ~500ms | ~40/s | Complex reasoning |
For multi-turn conversations, you can speculatively pre-fetch likely follow-up responses while the user is reading the current response. This creates the illusion of instant replies.
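A minimal sketch of the prefetch pattern, using a hypothetical `fetch_reply` stand-in for the async model call: start a task for the predicted follow-up while the user reads, then either await it (hit) or cancel it (miss):

```python
import asyncio

async def fetch_reply(prompt: str) -> str:
    # Hypothetical stand-in for an async model call.
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"

async def converse() -> str:
    # Start the speculative request while the user is still reading.
    prefetch = asyncio.create_task(fetch_reply("tell me more"))
    user_question = "tell me more"  # in practice, await real user input
    if user_question == "tell me more":
        return await prefetch       # hit: the reply feels instant
    prefetch.cancel()               # miss: discard the speculation
    return await fetch_reply(user_question)
```

Prefetching trades extra token spend on mispredictions for lower perceived latency, so it is best reserved for conversations with highly predictable next turns.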
Talk to our team about dedicated capacity and custom optimizations.
Contact Enterprise →