One API Key, 200+ Open Models: Run Serverless Inference with Together AI

Chris Harper

3 min read

Jul 3, 2026 · 20:02 UTC

AI
Tutorial
Self-Hosting
LLM

TL;DR: Together AI is pay-per-token serverless inference for 200+ open models. It is OpenAI SDK-compatible, needs no infrastructure on your side, and in the price anchors below ranges from $0.05 per 1M input tokens (GPT-OSS 20B) to $7.00 per 1M output tokens (DeepSeek R1).

What you'll be able to do after this:

  • Hit Llama 4, DeepSeek V3.1, Qwen 3.6, Kimi K2, and 200+ open models through one REST endpoint
  • Swap from the OpenAI SDK to Together in seconds: change the base URL, API key, and model ID, and keep the rest of your code
  • Pick the right model for your cost/quality tradeoff using a live pricing catalog

The "run & serve" problem for open models: you want production inference without managing GPUs. Together AI solves this with shared, serverless compute — you call an API, tokens run on their hardware, you pay per token. No provisioning, no replicas to size, no minimum spend.

Install and set your key

pip install together
export TOGETHER_API_KEY=your_key_here   # get yours at api.together.ai
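
Before the first request, it can help to fail fast when the key is missing instead of waiting for an authentication error from the API. A minimal sketch, assuming the key lives in TOGETHER_API_KEY as exported above; `require_api_key` is a hypothetical helper name for this tutorial, not part of the Together SDK.

```python
import os


def require_api_key(env=None, name="TOGETHER_API_KEY"):
    """Return the API key from the environment, or raise a readable error.

    Hypothetical helper for this tutorial; it is not an SDK function.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating a client")
    return key
```

The result can be passed explicitly, for example `Together(api_key=require_api_key())`, so a missing key surfaces at startup rather than on the first call.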

Make your first request — Together SDK

from together import Together

client = Together()
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Explain HNSW indexing in 2 sentences."}]
)
print(response.choices[0].message.content)

Already using the OpenAI SDK? Swap one line

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",   # ← Together's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",  # confirm the exact ID in the model catalog
    messages=[{"role": "user", "content": "What is LoRA?"}]
)
print(response.choices[0].message.content)

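The core chat-completions features of the OpenAI SDK (streaming, function/tool calls, structured output) go through the same client, although support for tool calls and structured output can vary by model, so check the model's catalog entry. Code that calls OpenAI today can usually point at Together by changing three settings: the base URL, the API key, and the model ID.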
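
One way to keep that switch in configuration rather than in code is a small lookup of per-provider settings. The sketch below is an illustration, not a pattern from the Together docs: the `PROVIDERS` table, `client_kwargs`, and `make_client` are hypothetical names, and the OpenAI entry assumes the standard `OPENAI_API_KEY` variable and the default OpenAI base URL.

```python
import os

# Hypothetical provider table for illustration.
# The Together base URL is the one used in the snippet above.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
    "together": {"base_url": "https://api.together.xyz/v1", "key_env": "TOGETHER_API_KEY"},
}


def client_kwargs(provider, env=None):
    """Build the keyword arguments for openai.OpenAI(...) for a named provider."""
    env = os.environ if env is None else env
    try:
        cfg = PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return {"api_key": env[cfg["key_env"]], "base_url": cfg["base_url"]}


def make_client(provider, env=None):
    """Create an OpenAI SDK client; the import is lazy so the table works without the SDK."""
    from openai import OpenAI

    return OpenAI(**client_kwargs(provider, env))
```

Model IDs differ between providers, so the model name is a third setting that has to travel with the provider choice.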

Streaming

stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Name 3 open-model inference providers."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:  # skip any chunk that carries no choices (e.g. usage-only)
        print(chunk.choices[0].delta.content or "", end="", flush=True)

Model picker — 2026 price anchors

The full catalog is at together.ai/models. A few anchors:

Model | Best for | Input / Output (USD per 1M tokens)
GPT-OSS 20B | Cheap, fast general tasks | $0.05 / $0.20
Llama 4 Scout 17B | Balanced quality/cost | ~$0.18 / $0.59
DeepSeek V3.1 | Flagship open-weight quality | $0.60 / $1.70
Qwen 3.6 Plus | Multilingual / code | $0.50 / $3.00
DeepSeek R1 | Deep reasoning | $3.00 / $7.00
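
Because input and output tokens are billed at different rates, the cheapest model depends on the shape of the workload. The sketch below turns the anchors above into a rough estimate; the prices are copied from this article's table (not fetched live), the keys are display names rather than API model IDs, and `estimate_cost` and `cheapest` are hypothetical helper names.

```python
# Prices in USD per 1M tokens as (input, output), copied from the anchors table.
# Keys are display names from the table, not API model IDs.
PRICES = {
    "GPT-OSS 20B": (0.05, 0.20),
    "Llama 4 Scout 17B": (0.18, 0.59),  # listed as approximate in the table
    "DeepSeek V3.1": (0.60, 1.70),
    "Qwen 3.6 Plus": (0.50, 3.00),
    "DeepSeek R1": (3.00, 7.00),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimated cost in USD for a given number of input and output tokens."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def cheapest(input_tokens, output_tokens):
    """Name of the lowest-cost model in PRICES for this token mix."""
    return min(PRICES, key=lambda m: estimate_cost(m, input_tokens, output_tokens))
```

For example, 2M input tokens plus 0.5M output tokens comes to about $0.20 on GPT-OSS 20B and about $9.50 on DeepSeek R1 at these rates. Check the live catalog before budgeting, since the anchors can change.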

Together vs Fireworks vs OpenRouter: Together's pitch is raw inference speed, backed by its in-house inference research; Fireworks has a tighter fine-tune-to-deploy pipeline; OpenRouter is best when you want to route across providers to find the cheapest live rate. All three expose OpenAI-compatible APIs, so switching mostly means changing the base URL, API key, and model ID. Speed claims depend on the model and the load, so run your first 1M tokens on each and compare latency for your own model and use case.
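
The comparison suggested above can be sketched as a small timing harness. It is provider-agnostic by design: each entry is a zero-argument callable you supply, so no endpoint or SDK call is assumed here. `time_calls` and `compare` are hypothetical names, and wall-clock time per call is only one possible latency measure.

```python
import statistics
import time


def time_calls(call, runs=5):
    """Return wall-clock latencies in seconds for `runs` invocations of `call`."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def compare(providers, runs=5):
    """Map each provider name to its median latency in seconds, fastest first."""
    medians = {
        name: statistics.median(time_calls(call, runs))
        for name, call in providers.items()
    }
    return dict(sorted(medians.items(), key=lambda item: item[1]))
```

In practice each callable would wrap one chat-completions request with the same prompt and a fixed output length, so the numbers reflect the provider rather than differences in reply size.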

Sources: Together AI Quickstart · Serverless Inference docs · Model catalog · Debugger Cafe walkthrough