Sub-200ms Open-Model Inference Without a Server: Fireworks.ai Quickstart

Chris Harper

2 min read

Jun 30, 2026 · 04:11 UTC

Tutorial

Self-Hosting

LLM

TL;DR: Fireworks.ai hosts 200+ open models behind an OpenAI-compatible API with three latency tiers — change base_url and your existing SDK code gets 100+ tokens/second with no GPU to manage.

What you'll be able to do after this:

Call DeepSeek V3.1, Kimi K2, Llama 3.3 70B, Qwen, and other open models from a catalog of 200+ via REST with no infrastructure
Pick the Standard, Priority, or Fast tier to trade cost for latency guarantees
Drop Fireworks into any existing OpenAI SDK codebase by changing two lines

With vLLM (self-hosted) and OpenRouter (multi-provider routing) already covered, the missing piece of the hosted-inference picture is Fireworks.ai, which owns and optimizes its own GPU stack and processes 30T+ tokens per day. Notion reported cutting latency from 2 seconds to 350ms after switching.

Three service tiers, one API surface

Tier	Pricing	Best for
Standard	Cheapest	Batch jobs, cost-sensitive workloads
Priority	~1.5× Standard	Production; 0% 503s vs 0.082% on Standard over 14 days
Fast	Standard price, `-fast` model suffix	Real-time agent loops needing 100+ tokens/sec
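
The 0.082% 503 rate listed for Standard is small but nonzero, so callers on that tier may want to retry. Below is a minimal sketch of retry with exponential backoff. The helper name `call_with_retry` is hypothetical, and it assumes the SDK's exception exposes the HTTP status as a `status_code` attribute (the OpenAI Python SDK's `APIStatusError` does; other clients may differ).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying with exponential backoff when it raises a 503-like error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Assumption: the error object carries the HTTP status as `status_code`.
            is_503 = getattr(exc, "status_code", None) == 503
            if not is_503 or attempt == max_attempts:
                raise
            # Backoff doubles each attempt (0.5s, 1s, 2s, ...) plus random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("loop always returns or raises")
```

A call would be wrapped as `call_with_retry(lambda: client.chat.completions.create(...))`. The OpenAI Python SDK also has its own `max_retries` client option, which may already be enough on its own.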

Quickstart walkthrough

# 1. Create a free account at fireworks.ai → Settings → API Keys
export FIREWORKS_API_KEY="fw_..."

pip install fireworks-ai   # provides the fireworks.client module; the OpenAI and Anthropic SDKs also work natively

from fireworks.client import Fireworks

client = Fireworks()   # reads FIREWORKS_API_KEY from env

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Explain QLoRA in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)

Already using the OpenAI SDK? Change two lines — nothing else:

from openai import OpenAI

client = OpenAI(
    api_key="fw_...",
    base_url="https://api.fireworks.ai/inference/v1",
)
# Streaming, tool calls, structured output — all work exactly as before
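
To check the latency claims against your own workload, one option is to time a streamed response. The sketch below assumes the stream yields OpenAI-style chunks, with text in `chunk.choices[0].delta.content`. The helper name `time_stream` is hypothetical. It counts time to the first content chunk, which only approximates time to first token, since a chunk can carry more than one token.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_stream(
    start_stream: Callable[[], Iterable],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, Optional[float], float]:
    """Start a stream and return (text, seconds_to_first_content, total_seconds)."""
    start = clock()
    # The request is sent here, so connection and queueing time are included.
    chunks = start_stream()
    first: Optional[float] = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        piece = chunk.choices[0].delta.content
        if piece:
            if first is None:
                first = clock() - start
            parts.append(piece)
    total = clock() - start
    return "".join(parts), first, total


# Usage with the client configured above (needs network access and a valid key):
# text, ttft, total = time_stream(lambda: client.chat.completions.create(
#     model="accounts/fireworks/models/deepseek-v3p1",
#     messages=[{"role": "user", "content": "Explain QLoRA in two sentences."}],
#     stream=True,
# ))
```

Passing a callable rather than an already-open stream keeps the request itself inside the timed window.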

For the Priority tier: add extra_body={"service_tier": "priority"} to your .create() call. For the Fast tier: use a model ID ending in -fast (e.g., glm-5p2-fast).
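
The two tier switches described above can be collected in one small helper that builds the keyword arguments for `.create()`. This is a sketch only. `build_chat_kwargs` is a hypothetical name, and the `service_tier` field and the `-fast` suffix are taken from the text above rather than checked against the API. Not every model necessarily has a `-fast` variant, so the helper expects the caller to pass one explicitly instead of appending the suffix itself.

```python
from typing import Any

# Model ID from the quickstart above.
BASE_MODEL = "accounts/fireworks/models/deepseek-v3p1"


def build_chat_kwargs(model: str, prompt: str, tier: str = "standard") -> dict[str, Any]:
    """Return keyword arguments for client.chat.completions.create()."""
    kwargs: dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    if tier == "priority":
        # Priority is selected per request through the request body.
        kwargs["extra_body"] = {"service_tier": "priority"}
    elif tier == "fast":
        # Fast is selected through the model ID, not a request field.
        if not model.endswith("-fast"):
            raise ValueError("Fast tier needs a model ID ending in -fast")
    elif tier != "standard":
        raise ValueError(f"unknown tier: {tier!r}")
    return kwargs
```

With a configured client, the call becomes `client.chat.completions.create(**build_chat_kwargs(BASE_MODEL, "Hello", tier="priority"))`.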

Browse available models and per-token pricing at fireworks.ai/models. New accounts start with $1 in free credits.

Sources:

CloudCodeTree
