Serverless AI at the Edge: Run Any Open-Weight Model Globally With Cloudflare Workers AI

Chris Harper

Jul 5, 2026 · 12:04 UTC

Tutorial

Self-Hosting

LLM

TL;DR: Cloudflare Workers AI puts 70+ open models on Cloudflare's global edge network — 10,000 free Neurons per day, one env.AI.run() call from a Worker, zero GPUs to manage.

What you'll be able to do after this:

Wire AI inference into a Cloudflare Worker with one binding — no external API keys inside the Worker
Swap between 70+ models (Llama 3, Mistral, Gemma, Flux, Whisper, and more) by changing one string
Stay within the free tier (10,000 Neurons/day ≈ hundreds of chat turns) for development and small-scale production

Why Workers AI is different from the other run & serve options

You've seen Ollama (local), vLLM (your GPU), and OpenRouter/Fireworks/Together (hosted, single-region). Workers AI is the edge-native option: Cloudflare's network spans ~300 cities with inference GPUs deployed across it, and your Worker runs close to each user. You don't rent a GPU; you call a binding. There is no GPU instance to spin up and no idle charge, because you pay only for the inference you run.

Step-by-step: from zero to edge inference

1. Create a new Worker project

npm create cloudflare@latest hello-ai -- --type hello-world
cd hello-ai

2. Add the AI binding to wrangler.toml

[ai]
binding = "AI"

3. Call the model in src/index.ts

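// Bindings declared in wrangler.toml. `Ai` is a global type from the Workers
// runtime types (for example, generated by `wrangler types`).
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Run a text-generation model through the AI binding.
    const response = await env.AI.run(
      "@cf/meta/llama-3.1-8b-instruct",
      { messages: [{ role: "user", content: "Why is edge AI useful?" }] }
    );
    // Serialize the result and set a JSON Content-Type header.
    return Response.json(response);
  },
} satisfies ExportedHandler<Env>;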

env.AI.run(modelName, input) is the core of the API. The model identifier comes from the Workers AI catalog, and the input schema varies by model type (text generation, image generation, embeddings, speech-to-text), so check the catalog entry for the fields each model expects.
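As a rough illustration of how the input shape changes with the task, the sketch below builds the model string and input for three model types. The AiBinding interface is a hand-written stand-in for the real binding type, Task and buildCall are hypothetical names, and the model identifiers and input fields (messages, text, prompt) are assumptions to confirm against each model's catalog entry.

```typescript
// Minimal structural stand-in for the Workers AI binding (an assumption for
// this sketch; a real Worker would use the runtime's `Ai` type instead).
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Hypothetical task description used only by this example.
type Task =
  | { kind: "chat"; prompt: string }
  | { kind: "embed"; texts: string[] }
  | { kind: "image"; prompt: string };

// Model ids and input fields below are assumed; confirm them in the catalog.
function buildCall(task: Task): { model: string; input: Record<string, unknown> } {
  switch (task.kind) {
    case "chat":
      return {
        model: "@cf/meta/llama-3.1-8b-instruct",
        input: { messages: [{ role: "user", content: task.prompt }] },
      };
    case "embed":
      return { model: "@cf/baai/bge-m3", input: { text: task.texts } };
    case "image":
      return {
        model: "@cf/black-forest-labs/flux-1-schnell",
        input: { prompt: task.prompt },
      };
  }
}

// The run() call is identical for every task; only the model string and
// the input object differ.
async function runTask(ai: AiBinding, task: Task): Promise<unknown> {
  const { model, input } = buildCall(task);
  return ai.run(model, input);
}
```

Inside a Worker, passing env.AI as the first argument to runTask would be the natural way to use this, assuming the binding's run() accepts these inputs.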

4. Run locally

npx wrangler dev

Open http://localhost:8787. Note: local dev still routes through Cloudflare and uses real Neurons from your account.
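Because every local request consumes real Neurons, it can help to keep the handler logic testable without the binding. A minimal sketch, assuming a hand-rolled stub: AiBinding, answer, and FakeAi are hypothetical names for this example, not part of Wrangler or Workers AI, and the { response: string } result shape is assumed for text-generation models.

```typescript
// Structural stand-in for the binding; hypothetical, for tests only.
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Handler logic factored out so it accepts any object with a run() method.
async function answer(ai: AiBinding, question: string): Promise<string> {
  const result = await ai.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [{ role: "user", content: question }],
  });
  // Text-generation models are assumed here to return { response: string }.
  const text = (result as { response?: unknown }).response;
  return typeof text === "string" ? text : JSON.stringify(result);
}

// Stub that records calls and returns a canned reply; it never touches the
// network, so it uses no Neurons.
class FakeAi implements AiBinding {
  calls: Array<{ model: string; input: Record<string, unknown> }> = [];
  reply: string;

  constructor(reply: string) {
    this.reply = reply;
  }

  async run(model: string, input: Record<string, unknown>): Promise<unknown> {
    this.calls.push({ model, input });
    return { response: this.reply };
  }
}
```

In the Worker itself, the fetch handler would call answer(env.AI, question); in a unit test, a FakeAi instance takes the binding's place.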

5. Deploy globally

npx wrangler deploy

Your Worker is live on ~300 Cloudflare edge locations worldwide.

Model catalog and pricing

The catalog covers text generation (Llama 3.1-8B/70B, Mistral 7B, Gemma 2 2B, GLM-5.2 with function calling), image generation (Flux 1 Schnell, FLUX.2 [klein] 9B), embeddings (BGE-M3), speech-to-text (Whisper), and translation.
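Since each model is addressed by a string identifier, switching models can be reduced to a lookup. The sketch below maps short aliases to catalog ids and falls back to a default for anything unrecognized. The alias names, the resolveModel helper, and the exact ids are assumptions for illustration, so copy the real identifiers from the catalog.

```typescript
// Default model id (assumed; verify against the catalog).
const DEFAULT_MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Alias -> catalog id. Ids are assumed; verify each against the catalog.
const MODELS = new Map<string, string>([
  ["llama", DEFAULT_MODEL],
  ["mistral", "@cf/mistral/mistral-7b-instruct-v0.1"],
]);

// Resolve a caller-supplied alias. Anything outside the allowlist falls back
// to the default, so arbitrary strings from a request never reach run().
function resolveModel(alias: string | null): string {
  if (alias === null) return DEFAULT_MODEL;
  return MODELS.get(alias.toLowerCase()) ?? DEFAULT_MODEL;
}
```

In a Worker, the alias could come from a query parameter, for example resolveModel(new URL(request.url).searchParams.get("model")), with the result passed as the first argument to env.AI.run().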

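Pricing: usage is metered in Neurons, Cloudflare's unit of AI compute. 10,000 Neurons/day are free; beyond that the paid rate is $0.011 per 1,000 Neurons (an estimated 60–90% cheaper than comparable OpenAI pricing). The Neuron cost of a chat turn depends on the model and on input and output token counts; a typical turn on a small model lands in the tens of Neurons, which is where the "hundreds of chat turns" per day figure comes from.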
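The arithmetic behind these numbers can be sketched as a small helper. It assumes that only usage above the daily free allocation is billed, at the flat rate quoted above; dailyCostUsd is a hypothetical name, and the constants should be checked against the current pricing page.

```typescript
const FREE_NEURONS_PER_DAY = 10000; // free allocation quoted above
const USD_PER_1000_NEURONS = 0.011; // paid rate quoted above

// Estimated cost in USD for one day's Neuron usage. Usage inside the free
// allocation costs nothing; only the excess is billed (an assumption here).
function dailyCostUsd(neuronsUsed: number): number {
  const billable = Math.max(0, neuronsUsed - FREE_NEURONS_PER_DAY);
  return (billable / 1000) * USD_PER_1000_NEURONS;
}
```

For example, a day that uses 25,000 Neurons has 15,000 billable Neurons, which comes to roughly $0.165.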

When to reach for Workers AI vs. alternatives

Scenario	Best fit
App lives in a Cloudflare Worker	Workers AI (no extra network hop to an external API)
Global users, low-latency requirement	Workers AI
Specific fine-tuned model	vLLM or Fireworks
Newest frontier model	OpenRouter / direct API
Zero-cloud, full local	Ollama

The Workers AI model catalog trails the frontier by a few months. It's not the place for the very latest models, but it excels at serving well-established open-weight models in a serverless, globally distributed context.

Anchor resource: The official Cloudflare Workers AI get-started guide walks from npm create cloudflare through deploy in about 10 minutes, with the full wrangler.toml reference, model catalog link, and billing details.

Sources: Cloudflare Workers AI Docs — Get started with Wrangler | Workers AI model catalog | Workers AI overview | Cloudflare Workers AI: Run AI Models at the Edge (2026 Guide) — mecanik.dev

CloudCodeTree
