
Sub-200ms Open-Model Inference Without a Server: Fireworks.ai Quickstart
Chris Harper
Jun 30, 2026 · 04:11 UTC
TL;DR: Fireworks.ai hosts 200+ open models behind an OpenAI-compatible API with three latency tiers — change base_url and your existing SDK code gets 100+ tokens/second with no GPU to manage.
What you'll be able to do after this:
- Call DeepSeek V3.1, Kimi K2, Llama 3.3 70B, Qwen, and 200+ open models via REST with no infrastructure
- Pick Standard, Priority, or Fast tier to trade cost for latency guarantees
- Drop Fireworks into any existing OpenAI SDK codebase by changing two lines
Earlier posts covered vLLM (self-hosted) and OpenRouter (multi-provider routing). The missing piece of the hosted-inference picture is Fireworks.ai, which owns and optimizes its own GPU stack and processes 30T+ tokens per day. Notion reported that latency dropped from 2 seconds to 350 ms after switching.
Three service tiers, one API surface
| Tier | Pricing | Best for |
|---|---|---|
| Standard | Cheapest | Batch jobs, cost-sensitive workloads |
| Priority | ~1.5× Standard | Production; 0% of requests returned HTTP 503 vs 0.082% on Standard over 14 days |
| Fast | Standard price, -fast model suffix | Real-time agent loops needing 100+ tokens/sec |
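The table implies that a small fraction of Standard-tier requests can come back as HTTP 503. One common way to absorb that is retrying with exponential backoff. The sketch below is not part of any Fireworks SDK: `call_with_retry` is a hypothetical helper, and it assumes the raised exception exposes a `status_code` attribute (as the OpenAI SDK's `APIStatusError` does). The attempt count and delays are illustrative defaults, not recommended values.

```python
import time


def call_with_retry(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run call(); retry with exponential backoff when it fails with HTTP 503.

    Hypothetical helper: assumes the exception carries a `status_code`
    attribute. Any other error, or a 503 on the final attempt, is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            is_503 = getattr(exc, "status_code", None) == 503
            is_last_attempt = attempt == max_attempts - 1
            if not is_503 or is_last_attempt:
                raise
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
```

Wrapping a request then looks like `call_with_retry(lambda: client.chat.completions.create(...))`. The OpenAI SDK also has its own `max_retries` client option, which may be enough on its own.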
Quickstart walkthrough
# 1. Create a free account at fireworks.ai → Settings → API Keys
export FIREWORKS_API_KEY="fw_..."
pip install fireworks-ai # provides fireworks.client; the OpenAI and Anthropic SDKs also work natively
from fireworks.client import Fireworks
client = Fireworks() # reads FIREWORKS_API_KEY from env
response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Explain QLoRA in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
Already using the OpenAI SDK? Change two lines — nothing else:
from openai import OpenAI
client = OpenAI(
    api_key="fw_...",
    base_url="https://api.fireworks.ai/inference/v1",
)
# Streaming, tool calls, structured output — all work exactly as before
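To make the streaming claim concrete, here is a minimal sketch of consuming a streamed reply through the OpenAI SDK client configured above. `collect_stream` and `stream_reply` are hypothetical helper names. The chunk shape (`chunk.choices[0].delta.content`) is the OpenAI SDK's streaming format, which this sketch assumes Fireworks mirrors.

```python
def collect_stream(chunks, on_token=None):
    """Join the text deltas of an OpenAI-style chat completion stream."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices at all
            continue
        text = chunk.choices[0].delta.content
        if text:  # role-only and final chunks have no text content
            parts.append(text)
            if on_token is not None:
                on_token(text)
    return "".join(parts)


def stream_reply(client, prompt, model="accounts/fireworks/models/deepseek-v3p1"):
    """Request a streamed completion, print tokens as they arrive, return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,  # yields incremental chunks instead of one response
    )
    return collect_stream(stream, on_token=lambda t: print(t, end="", flush=True))
```

Calling `stream_reply(client, "Explain QLoRA in two sentences.")` prints tokens as they arrive and returns the assembled reply.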
For the Priority tier: add extra_body={"service_tier": "priority"} to your .create() call. For the Fast tier: use a model ID ending in -fast (e.g., glm-5p2-fast).
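The tier rules above can be folded into one small helper that builds the keyword arguments for `.create()`. This is a sketch, not part of any SDK: `tier_request_kwargs` is a hypothetical name, and appending `-fast` to an arbitrary model ID is an assumption, since not every model necessarily has a Fast variant. Check the model catalog before relying on it.

```python
def tier_request_kwargs(model, tier="standard"):
    """Build .create() keyword arguments for a given service tier.

    Hypothetical helper based on the tier rules described in the text.
    """
    if tier == "standard":
        return {"model": model}
    if tier == "priority":
        # Priority is selected per request through the request body.
        return {"model": model, "extra_body": {"service_tier": "priority"}}
    if tier == "fast":
        # Fast is selected by model ID. ASSUMPTION: a "-fast" variant
        # exists for this model; verify in the catalog first.
        fast_model = model if model.endswith("-fast") else model + "-fast"
        return {"model": fast_model}
    raise ValueError(f"unknown tier: {tier!r}")
```

Usage would look like `client.chat.completions.create(messages=messages, **tier_request_kwargs(model_id, "priority"))`, which keeps the tier choice in one place so it can be switched from configuration.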
Browse available models and per-token pricing at fireworks.ai/models. New accounts start with $1 in free credits.