
One API Key, 200+ Open Models: Run Serverless Inference with Together AI
Chris Harper
Jul 3, 2026 · 20:02 UTC
TL;DR: Together AI is pay-per-token serverless inference for 200+ open models: OpenAI SDK-compatible, no infrastructure to manage, with prices from $0.05 per 1M input tokens (GPT-OSS 20B) up to $7.00 per 1M output tokens (DeepSeek R1).
What you'll be able to do after this:
- Hit Llama 4, DeepSeek V3.1, Qwen 3.6, Kimi K2, and the rest of the 200+ open-model catalog through one REST endpoint
- Swap from the OpenAI SDK to Together in seconds: point the client at Together's base URL and API key, pick a Together model ID, and keep the rest of your code
- Pick the right model for your cost/quality tradeoff using a live pricing catalog
The "run & serve" problem for open models: you want production inference without managing GPUs. Together AI solves this with shared, serverless compute — you call an API, tokens run on their hardware, you pay per token. No provisioning, no replicas to size, no minimum spend.
Install and set your key
pip install together
export TOGETHER_API_KEY=your_key_here # get yours at api.together.ai
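Before making a request, it can help to fail fast when the key is missing. The following is a minimal sketch, assuming the key lives in the `TOGETHER_API_KEY` variable exported above; `require_api_key` is a hypothetical helper name used for illustration and is not part of the Together SDK.

```python
import os


def require_api_key(var_name: str = "TOGETHER_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of any SDK.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it in your shell before running."
        )
    return key
```

Calling `require_api_key()` once at startup turns a confusing authentication failure later on into an immediate, readable error.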
Make your first request — Together SDK
from together import Together
# The client reads TOGETHER_API_KEY from the environment.
client = Together()
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Explain HNSW indexing in 2 sentences."}],
)
print(response.choices[0].message.content)
Already using the OpenAI SDK? Swap one line
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],  # Together key instead of the OpenAI key
    base_url="https://api.together.xyz/v1",  # point the client at Together
)
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3-1",
    messages=[{"role": "user", "content": "What is LoRA?"}],
)
print(response.choices[0].message.content)
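The core chat-completions features of the OpenAI SDK (streaming, function/tool calls, structured output) go through the same client calls, though support for tool calls and structured output can vary by model. Code that calls OpenAI today needs three things changed to point at Together: the base URL, the API key, and the model ID.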
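One way to keep the switch in configuration rather than code is a small lookup of per-provider settings. This is a sketch under stated assumptions: `ProviderSettings`, `PROVIDERS`, and `client_kwargs` are hypothetical names introduced here, the Together URL is the one shown above, and the OpenAI URL is that SDK's default endpoint.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderSettings:
    base_url: str
    api_key_env: str  # name of the environment variable holding the key


# Hypothetical lookup table for illustration only.
PROVIDERS = {
    "openai": ProviderSettings("https://api.openai.com/v1", "OPENAI_API_KEY"),
    "together": ProviderSettings("https://api.together.xyz/v1", "TOGETHER_API_KEY"),
}


def client_kwargs(provider: str) -> dict:
    """Build keyword arguments for openai.OpenAI(...) for the chosen provider."""
    settings = PROVIDERS[provider]
    return {
        "base_url": settings.base_url,
        "api_key": os.environ[settings.api_key_env],
    }
```

With this in place, `OpenAI(**client_kwargs("together"))` builds a Together-backed client. Model IDs still differ between providers, so the `model` argument has to change along with the client.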
Streaming
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Name 3 open-model inference providers."}],
    stream=True,
)
for chunk in stream:
    # Skip chunks that carry no choices, then print each text delta as it arrives.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()  # finish with a newline
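When the full text is needed after streaming (for logging or post-processing), the loop above can be wrapped in a helper that accumulates the deltas. This is a minimal sketch; `collect_stream` is a hypothetical name, and it assumes only that each chunk exposes `choices[0].delta.content` as in the loop above.

```python
from typing import Iterable


def collect_stream(stream: Iterable) -> str:
    """Concatenate the text deltas of a chat-completions stream."""
    parts = []
    for chunk in stream:
        # Some chunks (for example a trailing usage-only chunk) may carry no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        text = getattr(delta, "content", None)
        if text:  # content can be None or empty on role-only or final chunks
            parts.append(text)
    return "".join(parts)
```

Passing the `stream` object returned by `client.chat.completions.create(..., stream=True)` to `collect_stream` returns the whole reply as one string.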
Model picker — 2026 price anchors
The full catalog is at together.ai/models. A few anchors:
| Model | Best for | Input / output price per 1M tokens (USD) |
|---|---|---|
| GPT-OSS 20B | Cheap, fast general tasks | $0.05 / $0.20 |
| Llama 4 Scout 17B | Balanced quality/cost | ~$0.18 / $0.59 |
| DeepSeek V3.1 | Flagship open-weight quality | $0.60 / $1.70 |
| Qwen 3.6 Plus | Multilingual / code | $0.50 / $3.00 |
| DeepSeek R1 | Deep reasoning | $3.00 / $7.00 |
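To turn these anchors into a budget, multiply your expected token volumes by the per-million prices. The sketch below hard-codes the figures from the table above (the Llama 4 Scout prices are approximate there); `estimate_cost` and `cheapest_first` are hypothetical helper names, and live prices should be checked against the catalog.

```python
# Prices in USD per 1M tokens as (input, output), copied from the table above.
PRICES_PER_MILLION = {
    "GPT-OSS 20B": (0.05, 0.20),
    "Llama 4 Scout 17B": (0.18, 0.59),  # approximate in the table
    "DeepSeek V3.1": (0.60, 1.70),
    "Qwen 3.6 Plus": (0.50, 3.00),
    "DeepSeek R1": (3.00, 7.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for the given token volumes."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cheapest_first(input_tokens: int, output_tokens: int) -> list:
    """Return (model, cost) pairs sorted from cheapest to most expensive."""
    costs = [
        (name, estimate_cost(name, input_tokens, output_tokens))
        for name in PRICES_PER_MILLION
    ]
    return sorted(costs, key=lambda pair: pair[1])
```

For example, 10,000 requests averaging 800 input and 300 output tokens come to 8M input and 3M output tokens: about $1.00 on GPT-OSS 20B versus about $45.00 on DeepSeek R1 at these prices.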
Together vs Fireworks vs OpenRouter: Together's main pitch is raw inference speed, backed by its in-house inference research; Fireworks has a tighter fine-tune-to-deploy pipeline; OpenRouter is the best fit when you want to route across providers to find the cheapest live rate. All three expose OpenAI-compatible APIs, so switching mostly means changing the base URL, API key, and model ID. Speed varies by model and load, so run your first 1M tokens on each and compare latency for your model and use case.
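A latency comparison like the one suggested above can start from a small timing harness. This is a sketch, not a rigorous benchmark: `measure_latency` is a hypothetical helper that times any zero-argument callable, so the provider-specific request is supplied by the caller.

```python
import statistics
import time
from typing import Callable


def measure_latency(call: Callable[[], object], runs: int = 5) -> dict:
    """Time `call` several times and summarize wall-clock latency in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }
```

To compare providers, wrap the same request for each client in a lambda, for example `measure_latency(lambda: client.chat.completions.create(model=..., messages=...))`, and keep the prompt and output length fixed so the numbers are comparable.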
Sources: Together AI Quickstart · Serverless Inference docs · Model catalog · Debugger Cafe walkthrough