
Serverless AI at the Edge: Run Any Open-Weight Model Globally With Cloudflare Workers AI
Chris Harper
Jul 5, 2026 · 12:04 UTC
TL;DR: Cloudflare Workers AI puts 70+ open models on Cloudflare's global edge network — 10,000 free Neurons per day, one env.AI.run() call from a Worker, zero GPUs to manage.
What you'll be able to do after this:
- Wire AI inference into a Cloudflare Worker with one binding — no external API keys inside the Worker
- Swap between 70+ models (Llama 3, Mistral, Gemma, Flux, Whisper, and more) by changing one string
- Stay within the free tier (10,000 Neurons/day ≈ hundreds of chat turns) for development and small-scale production
Why Workers AI is different from the other run & serve options
You've seen Ollama (local), vLLM (your own GPU), and OpenRouter/Fireworks/Together (hosted APIs). Workers AI is the edge-native option: Cloudflare operates GPUs inside its global network, which spans ~300 cities, and your Worker runs close to each user. You don't rent a GPU; you call a binding. There is no GPU instance to spin up and no idle charge when nothing is running.
Step-by-step: from zero to edge inference
1. Create a new Worker project
npm create cloudflare@latest hello-ai -- --type hello-world
cd hello-ai
2. Add the AI binding to wrangler.toml
[ai]
binding = "AI"
3. Call the model in src/index.ts
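export default {
  // Env is the bindings type; `npx wrangler types` generates it from wrangler.toml.
  async fetch(request: Request, env: Env): Promise<Response> {
    // Run a chat-style text-generation model through the AI binding.
    const response = await env.AI.run(
      "@cf/meta/llama-3.1-8b-instruct",
      { messages: [{ role: "user", content: "Why is edge AI useful?" }] }
    );
    // Return the model output as JSON with a matching Content-Type header.
    return Response.json(response);
  },
};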
env.AI.run(modelName, input) is the entire API. The model identifier comes from the Workers AI catalog; the input schema varies by model type (text gen, image, embeddings, speech-to-text).
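Because the input shape depends on the model type, it can help to wrap each kind of call in a small typed helper. The sketch below is illustrative, not the official API surface: `AiLike`, `chat`, and `embed` are hypothetical names, the `{ response }` and `{ data }` result shapes are assumptions about the text-generation and embedding schemas, and `@cf/baai/bge-m3` is assumed to be the catalog id for BGE-M3. Check each model's catalog page before relying on them.

```typescript
// Minimal structural type for the binding (hypothetical), so the helpers can
// be exercised with a stub outside a Worker.
interface AiLike {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Text generation: assumed input { messages }, assumed output { response }.
async function chat(ai: AiLike, model: string, prompt: string): Promise<string> {
  const result = (await ai.run(model, {
    messages: [{ role: "user", content: prompt }],
  })) as { response?: string };
  return result.response ?? "";
}

// Embeddings: assumed input { text: string[] }, assumed output { data: number[][] }.
async function embed(ai: AiLike, texts: string[]): Promise<number[][]> {
  const result = (await ai.run("@cf/baai/bge-m3", { text: texts })) as {
    data?: number[][];
  };
  return result.data ?? [];
}
```

Inside a Worker you would pass `env.AI` as the first argument; depending on how your generated `Env` types are declared, a cast to `AiLike` may be needed.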
4. Run locally
npx wrangler dev
Open http://localhost:8787. Note: local dev still routes through Cloudflare and uses real Neurons from your account.
5. Deploy globally
npx wrangler deploy
Your Worker is live on ~300 Cloudflare edge locations worldwide.
Model catalog and pricing
The catalog covers text generation (Llama 3.1-8B/70B, Mistral 7B, Gemma 2 2B, GLM-5.2 with function calling), image generation (Flux 1 Schnell, FLUX.2 [klein] 9B), embeddings (BGE-M3), speech-to-text (Whisper), and translation.
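Since switching models only means changing the identifier string, one way to keep that switch in a single place is a small lookup table. This is a sketch under assumptions: `MODELS`, `Task`, and `modelFor` are hypothetical names, and every id except the Llama one used earlier is an assumed catalog id that should be verified against the Workers AI model catalog.

```typescript
// Task-to-model table. Only the Llama id appears in this article; the other
// ids are assumptions to verify against the catalog.
const MODELS = {
  chat: "@cf/meta/llama-3.1-8b-instruct",
  embed: "@cf/baai/bge-m3",
  transcribe: "@cf/openai/whisper",
  image: "@cf/black-forest-labs/flux-1-schnell",
} as const;

type Task = keyof typeof MODELS;

// Return the model id for a task, letting callers override the default
// (for example from an environment variable) without touching call sites.
function modelFor(task: Task, overrides: Partial<Record<Task, string>> = {}): string {
  return overrides[task] ?? MODELS[task];
}
```

A call site would then read `env.AI.run(modelFor("chat"), input)`, and trying a different chat model becomes a one-line change to the table or an override such as `modelFor("chat", { chat: "@cf/mistral/mistral-7b-instruct-v0.1" })` (again, an assumed id).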
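Pricing: 10,000 Neurons per day are free; usage beyond that is billed at $0.011 per 1,000 Neurons (60–90% cheaper than comparable OpenAI pricing). A short chat turn on a small model costs roughly 1–5 Neurons; longer prompts and larger models cost more, so check the per-model rates before budgeting.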
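The pricing arithmetic can be sketched as two small functions. The constants come from the figures above; the function names are hypothetical, and the sketch assumes the free allocation is subtracted from each day's usage before billing and that you supply your own measured Neurons-per-turn figure.

```typescript
const FREE_NEURONS_PER_DAY = 10_000;
const USD_PER_1000_NEURONS = 0.011;

// Estimated charge for one day, assuming only usage above the free
// allocation is billed.
function estimateDailyCostUsd(neuronsUsed: number): number {
  const billable = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1000_NEURONS;
}

// How many turns fit in the free tier for a given per-turn Neuron cost.
function freeTurnsPerDay(neuronsPerTurn: number): number {
  if (neuronsPerTurn <= 0) {
    throw new RangeError("neuronsPerTurn must be positive");
  }
  return Math.floor(FREE_NEURONS_PER_DAY / neuronsPerTurn);
}
```

For example, `freeTurnsPerDay(5)` gives 2000 and `freeTurnsPerDay(30)` gives 333, while `estimateDailyCostUsd(50_000)` comes to about $0.44.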
When to reach for Workers AI vs. alternatives
| Scenario | Best fit |
|---|---|
| App lives in a Cloudflare Worker | Workers AI (no extra network hop to an external API) |
| Global users, low-latency requirement | Workers AI |
| Specific fine-tuned model | vLLM or Fireworks |
| Newest frontier model | OpenRouter / direct API |
| Zero-cloud, full local | Ollama |
The Workers AI model catalog trails the frontier by a few months. It is not the place for the very latest models, but it excels at serving well-established open-weight models in a serverless, globally distributed context.
Anchor resource: The official Cloudflare Workers AI get-started guide walks from npm create cloudflare through deploy in about 10 minutes, with the full wrangler.toml reference, model catalog link, and billing details.
Sources: Cloudflare Workers AI Docs — Get started with Wrangler | Workers AI model catalog | Workers AI overview | Cloudflare Workers AI: Run AI Models at the Edge (2026 Guide) — mecanik.dev