One API Key, 200+ Open Models: Run Serverless Inference with Together AI

Chris Harper

Jul 3, 2026 · 20:02 UTC

Tutorial

Self-Hosting

LLM

TL;DR: Together AI is pay-per-token serverless inference for 200+ open models. It is OpenAI SDK-compatible with no infrastructure to manage, and the price anchors below range from $0.05 per 1M input tokens (GPT-OSS 20B) to $7.00 per 1M output tokens (DeepSeek R1).

What you'll be able to do after this:

Hit Llama 4, DeepSeek V3.1, Qwen 3.6, Kimi K2, and the rest of the 200+ open-model catalog through one REST endpoint
Swap from the OpenAI SDK to Together in seconds: change the base URL, API key, and model name, keep everything else
Pick the right model for your cost/quality tradeoff using a live pricing catalog

The "run & serve" problem for open models: you want production inference without managing GPUs. Together AI solves this with shared, serverless compute — you call an API, tokens run on their hardware, you pay per token. No provisioning, no replicas to size, no minimum spend.

Install and set your key

pip install together
export TOGETHER_API_KEY=your_key_here   # get yours at api.together.ai

Make your first request — Together SDK

from together import Together

client = Together()
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Explain HNSW indexing in 2 sentences."}]
)
print(response.choices[0].message.content)
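
The SDK call above is a wrapper over a plain HTTPS request, so the same request can be sketched with only the standard library. This is a minimal sketch, assuming the chat endpoint sits at /chat/completions under the https://api.together.xyz/v1 base URL and accepts bearer-token auth; the helper names build_chat_request and send are hypothetical and not part of any SDK.

```python
import json
import urllib.request

# Assumption: OpenAI-style chat endpoint under Together's v1 base URL.
CHAT_URL = "https://api.together.xyz/v1/chat/completions"


def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(req: urllib.request.Request) -> str:
    """Send the request and return the first choice's text (needs a valid key and network)."""
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

To try it, pass os.environ["TOGETHER_API_KEY"], a model ID, and a prompt to build_chat_request, then hand the result to send.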

Already using the OpenAI SDK? Swap one line

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],   # your Together key, not your OpenAI key
    base_url="https://api.together.xyz/v1",   # ← point the SDK at Together
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",   # use the exact ID shown in the model catalog
    messages=[{"role": "user", "content": "What is LoRA?"}]
)
print(response.choices[0].message.content)

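The chat-completions features most code relies on (streaming, function/tool calls, structured output) go through the same SDK calls, though tool-call and structured-output support can vary by model, so check the model's catalog page before depending on them. Any code that calls OpenAI today can point at Together by swapping the base URL and API key, typically two environment variables, and choosing a Together model ID.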
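
One way to make that swap a deployment setting rather than a code change is to resolve the client arguments from the environment. The sketch below is an illustration, not a Together or OpenAI feature: LLM_PROVIDER, PROVIDERS, and client_kwargs are hypothetical names chosen here, and https://api.openai.com/v1 is the OpenAI SDK's default base URL.

```python
import os

# Base URL and key variable per provider. Together's URL is the one used above.
PROVIDERS = {
    "together": {"base_url": "https://api.together.xyz/v1", "key_env": "TOGETHER_API_KEY"},
    "openai": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
}


def client_kwargs(env=None) -> dict:
    """Return keyword arguments for openai.OpenAI(...) based on LLM_PROVIDER.

    LLM_PROVIDER is a hypothetical variable name; it defaults to "together".
    """
    env = os.environ if env is None else env
    name = env.get("LLM_PROVIDER", "together")
    if name not in PROVIDERS:
        raise ValueError(f"unknown provider: {name!r}")
    cfg = PROVIDERS[name]
    key = env.get(cfg["key_env"])
    if not key:
        raise RuntimeError(f"{cfg['key_env']} is not set")
    return {"api_key": key, "base_url": cfg["base_url"]}
```

With this in place, client = OpenAI(**client_kwargs()) picks the provider at startup. Model IDs still differ between providers, so the model name needs its own setting.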

Streaming

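# Reuses the `client` from the previous example.
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Name 3 open-model inference providers."}],
    stream=True,
)
for chunk in stream:
    # Skip chunks that carry no choices (e.g. a trailing usage-only chunk).
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()  # end the line once the stream finishes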
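
If you need the full text after streaming (for logging or caching), the deltas can be accumulated instead of printed. This is a minimal sketch, assuming chunks have the choices[0].delta.content shape used in the loop above; collect_stream and fake_chunk are hypothetical helpers, and the fake chunks exist only so the function can be exercised without a network call.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Concatenate the text deltas from an iterable of chat-completion chunks."""
    parts = []
    for chunk in stream:
        # Some chunks may carry no choices (for example a trailing usage chunk).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # content can be None or "" on role-only or final chunks
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for an SDK chunk object, used only to test the helper offline."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Hel"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("lo")]
print(collect_stream(demo))  # prints: Hello
```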

Model picker — 2026 price anchors

The full catalog is at together.ai/models. A few anchors:

Model	Best for	Input / Output per 1M tokens
GPT-OSS 20B	Cheap, fast general tasks	$0.05 / $0.20
Llama 4 Scout 17B	Balanced quality/cost	~$0.18 / $0.59
DeepSeek V3.1	Flagship open-weight quality	$0.60 / $1.70
Qwen 3.6 Plus	Multilingual / code	$0.50 / $3.00
DeepSeek R1	Deep reasoning	$3.00 / $7.00
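
To turn those anchors into a budget, multiply your expected input and output token counts by the two prices separately. The sketch below copies the prices from the table above (they will drift, so treat them as placeholders), and the 2M-input / 500K-output workload is a made-up example.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "GPT-OSS 20B": (0.05, 0.20),
    "Llama 4 Scout 17B": (0.18, 0.59),  # listed as approximate
    "DeepSeek V3.1": (0.60, 1.70),
    "Qwen 3.6 Plus": (0.50, 3.00),
    "DeepSeek R1": (3.00, 7.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost: input and output tokens are billed at different rates."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
for name in PRICES:
    print(f"{name:<18} ${estimate_cost(name, 2_000_000, 500_000):.2f}")
```

Output-heavy workloads (long generations, reasoning traces) shift the comparison toward models with cheaper output pricing, so it is worth estimating with your own input/output ratio.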

Together vs Fireworks vs OpenRouter: Together's pitch is raw inference speed, backed by its own inference research; Fireworks has a tighter fine-tune-to-deploy pipeline; OpenRouter is best when you want to route across providers to find the cheapest live rate. All three expose OpenAI-compatible APIs, so switching means changing the base URL, API key, and model ID rather than rewriting code. Speed depends on the model and on load, so run your first 1M tokens on each and compare latency for your model and use case.
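
A latency comparison like that can start as a small timing harness. This is a rough sketch rather than a benchmark methodology: measure_latency and compare are hypothetical helpers, each provider is represented by a zero-argument callable you supply, and it times whole requests only (not time to first token).

```python
import statistics
import time


def measure_latency(call, runs=5):
    """Time `call()` `runs` times and return (median_seconds, all_samples)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples


def compare(providers, runs=5):
    """Return (label, median_seconds) pairs sorted fastest first.

    `providers` maps a label to a zero-argument callable that makes one request.
    """
    results = {label: measure_latency(call, runs)[0] for label, call in providers.items()}
    return sorted(results.items(), key=lambda item: item[1])
```

In practice each callable would wrap one client.chat.completions.create(...) call with the same prompt and the same output-length cap, one client per provider, so the numbers are comparable.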

Sources: Together AI Quickstart · Serverless Inference docs · Model catalog · Debugger Cafe walkthrough

CloudCodeTree
