Sub-200ms Open-Model Inference Without a Server: Fireworks.ai Quickstart

Chris Harper

Jun 30, 2026 · 04:11 UTC

Tags: AI, Tutorial, Self-Hosting, LLM

TL;DR: Fireworks.ai hosts 200+ open models behind an OpenAI-compatible API with three latency tiers — change base_url and your existing SDK code gets 100+ tokens/second with no GPU to manage.

What you'll be able to do after this:

  • Call DeepSeek V3.1, Kimi K2, Llama 3.3 70B, Qwen, and the rest of the 200+ open-model catalog via REST, with no infrastructure to run
  • Pick the Standard, Priority, or Fast tier to trade cost against latency and reliability
  • Drop Fireworks into any existing OpenAI SDK codebase by changing two lines

With vLLM (self-hosted) and OpenRouter (multi-provider routing) already covered, the missing piece of the hosted-inference picture is Fireworks.ai. They own and optimize the GPU stack and process 30T+ tokens per day. Notion reported that latency dropped from 2 seconds to 350ms after switching.

Three service tiers, one API surface

Tier     | Pricing                            | Best for
Standard | Cheapest                           | Batch jobs, cost-sensitive workloads
Priority | ~1.5× Standard                     | Production; 0% 503s vs 0.082% on Standard over 14 days
Fast     | Standard price, -fast model suffix | Real-time agent loops needing 100+ tokens/sec
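
To put the table's numbers in perspective, here is a back-of-the-envelope sketch. Only the 0.082% 503 rate and the ~1.5× multiplier come from the table; the request volume and the Standard bill are made-up placeholders, not Fireworks pricing.

```python
# Rough cost/reliability comparison between Standard and Priority.
# Only the 0.082% 503 rate and the ~1.5x multiplier come from the table;
# the request volume and the Standard bill below are hypothetical.

STANDARD_503_RATE = 0.082 / 100   # 0.082% of requests, per the table
PRIORITY_MULTIPLIER = 1.5         # "~1.5x Standard", per the table


def expected_503s(requests: int, rate: float = STANDARD_503_RATE) -> float:
    """Expected number of 503 responses for a given request volume."""
    return requests * rate


def priority_premium(standard_cost: float,
                     multiplier: float = PRIORITY_MULTIPLIER) -> float:
    """Extra spend from moving the same workload to Priority."""
    return standard_cost * (multiplier - 1)


if __name__ == "__main__":
    requests = 1_000_000      # hypothetical request volume
    standard_cost = 200.0     # hypothetical Standard bill in USD
    print(f"Expected 503s on Standard: {expected_503s(requests):.0f}")
    print(f"Extra cost on Priority:    ${priority_premium(standard_cost):.2f}")
```

At a million requests, a 0.082% error rate works out to roughly 820 failed calls, so the question is whether retrying those is cheaper than paying about half again as much for the whole workload.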

Quickstart walkthrough

# 1. Create a free account at fireworks.ai → Settings → API Keys
export FIREWORKS_API_KEY="fw_..."

# 2. Install the Python SDK (the OpenAI and Anthropic SDKs also work natively)
pip install fireworks-ai

# 3. Make a first chat-completion call (Python)
from fireworks.client import Fireworks

client = Fireworks()   # reads FIREWORKS_API_KEY from env

response = client.chat.completions.create(
    model="accounts/fireworks/models/deepseek-v3p1",
    messages=[{"role": "user", "content": "Explain QLoRA in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
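
If you would rather skip the SDK entirely, the same call can be sketched as a plain REST request using only the standard library. This assumes the endpoint is the base URL used in this article plus the OpenAI-style /chat/completions path, and that the response follows the OpenAI response shape; the helper names build_chat_request and send_chat_request are made up for illustration.

```python
import json
import urllib.request

# Assumption: base URL from this article + the OpenAI-style path.
FIREWORKS_CHAT_URL = "https://api.fireworks.ai/inference/v1/chat/completions"


def build_chat_request(prompt: str, model: str, api_key: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        FIREWORKS_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request,
                      timeout: float = 30.0) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body (choices[0].message.content).
    """
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

To use it, build a request with the model ID from the snippet above and the key from os.environ["FIREWORKS_API_KEY"], then pass it to send_chat_request. Splitting build from send keeps the payload easy to inspect without touching the network.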

Already using the OpenAI SDK? Change two lines — nothing else:

from openai import OpenAI

client = OpenAI(
    api_key="fw_...",
    base_url="https://api.fireworks.ai/inference/v1",
)
# Streaming, tool calls, structured output — all work exactly as before
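
As one illustration of "works as before", here is a minimal streaming sketch. It relies only on the OpenAI SDK's standard streaming interface (stream=True and chunk.choices[0].delta.content); the helper name stream_text is hypothetical, and the function takes the client as an argument so it stays independent of how you constructed it.

```python
from typing import Any, Iterator


def stream_text(client: Any, model: str, prompt: str) -> Iterator[str]:
    """Yield text fragments from a streamed chat completion.

    `client` is any OpenAI-SDK-style client, for example an OpenAI(...)
    instance configured with the Fireworks base_url.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            yield text
```

A typical call site loops over stream_text(client, "accounts/fireworks/models/deepseek-v3p1", "Explain QLoRA in two sentences.") and prints each piece with end="" and flush=True so tokens appear as they arrive.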

For the Priority tier: add extra_body={"service_tier": "priority"} to your .create() call. For the Fast tier: use a model ID ending in -fast (e.g., glm-5p2-fast).
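
The two tier rules above can be wrapped in a small helper so call sites do not repeat them. This is a sketch, not part of any SDK: tier_kwargs is a made-up name, and the "fast" branch simply appends the suffix, which assumes a -fast variant is actually published for that model.

```python
from typing import Any, Dict


def tier_kwargs(model: str, tier: str = "standard") -> Dict[str, Any]:
    """Return the .create() kwargs that select a service tier.

    Hypothetical helper: it encodes only the two rules described in the
    text (extra_body for Priority, a -fast model suffix for Fast).
    """
    if tier == "standard":
        return {"model": model}
    if tier == "priority":
        return {"model": model, "extra_body": {"service_tier": "priority"}}
    if tier == "fast":
        # Assumes a -fast variant exists for this model; check the catalog.
        fast_model = model if model.endswith("-fast") else f"{model}-fast"
        return {"model": fast_model}
    raise ValueError(f"unknown tier: {tier!r}")
```

With the OpenAI SDK client shown earlier, a Priority call then becomes client.chat.completions.create(messages=[...], max_tokens=256, **tier_kwargs("accounts/fireworks/models/deepseek-v3p1", "priority")).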

Browse available models and per-token pricing at fireworks.ai/models. New accounts start with $1 in free credits.
