Cloudflare's Rust-Built Infire Engine Cuts LLM Latency 3x at the Edge

Krasa AI

2026-05-31

This week Cloudflare published the engineering details behind Infire, a Rust-written inference engine the company built to run large language models across its global network. The headline numbers: a 3x reduction in inter-token latency from splitting model prefill and decode onto separate machines, plus up to 20% more throughput from the same GPUs.

It's the clearest signal yet that the cost structure of running frontier models — not the models themselves — is becoming the central battleground for AI infrastructure providers in 2026.

What Cloudflare actually built

Infire is a custom LLM inference engine written in Rust. Cloudflare designed it specifically for its edge network, which weighs memory efficiency, network I/O, and GPU utilization differently than engines such as vLLM or TensorRT-LLM, which are typically deployed in centralized data centers. The engine extracts up to 20% higher tokens-per-second throughput from the same hardware, according to Cloudflare's published benchmarks.

The deeper change is architectural. Cloudflare split LLM inference into two stages that run on separately optimized machines. Prefill, the stage where the model reads and processes your input prompt, is compute-bound and benefits from raw FLOPs. Decode, the stage where the model generates tokens one at a time, is bound by memory bandwidth and benefits from fast access to model weights and the cached attention state (the KV cache).
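A rough way to see why the two stages stress hardware differently is to compare how many FLOPs each pass performs per byte of weights it reads. The sketch below is a back-of-the-envelope estimate, not anything from Cloudflare's posts. It assumes a dense model that costs about 2 FLOPs per parameter per token, 16-bit weights, and a hypothetical GPU with 1 PFLOP/s of compute and 3 TB/s of memory bandwidth. It ignores attention and KV-cache traffic entirely.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that the weights are
    read from GPU memory once per pass (a simplification).
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def classify(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
             bytes_per_param: float = 2.0) -> str:
    """Label a pass compute-bound or memory-bound using a roofline-style test."""
    # Ridge point: the intensity at which compute and memory take equal time.
    ridge = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical hardware figures, chosen only for illustration.
PEAK_FLOPS = 1.0e15      # 1 PFLOP/s
MEM_BANDWIDTH = 3.0e12   # 3 TB/s

prefill_regime = classify(2048, PEAK_FLOPS, MEM_BANDWIDTH)  # 2048-token prompt in one pass
decode_regime = classify(1, PEAK_FLOPS, MEM_BANDWIDTH)      # one new token per pass
print(prefill_regime, decode_regime)
```

Under these assumptions the prompt pass lands well above the ridge point and the single-token pass well below it. Batching the decode steps of many requests raises `tokens_per_pass`, which is why serving engines batch aggressively.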

By running these on separate hardware tuned for each workload, Cloudflare cut inter-token latency by roughly 3x compared to running both stages on the same machine. That's the gap between an AI response that streams smoothly and one that stutters as it generates.
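The stutter is easy to reproduce in a toy model. The sketch below is a simplified illustration, not Infire's scheduler: it follows one streaming request and assumes that, on a shared GPU, another request's prefill occasionally runs between two of its decode steps. The step time, prefill time, and arrival points are all made-up numbers, and the model ignores batching, chunked prefill, and the cost of moving the KV cache between machines.

```python
def inter_token_gaps(num_tokens: int, decode_step_ms: float,
                     prefill_before_token: set, prefill_ms: float) -> list:
    """Gaps in milliseconds between consecutive tokens of one streaming request.

    prefill_before_token holds the token indices before which another
    request's prefill occupies the same GPU and delays this request.
    """
    gaps = []
    for i in range(num_tokens):
        gap = decode_step_ms
        if i in prefill_before_token:
            gap += prefill_ms  # the decode step waits for the prefill to finish
        gaps.append(gap)
    return gaps


# Hypothetical workload: 100 tokens, 20 ms per decode step, 200 ms per prefill.
colocated = inter_token_gaps(100, 20.0, {10, 40, 70}, 200.0)
disaggregated = inter_token_gaps(100, 20.0, set(), 200.0)  # prefill runs elsewhere

for name, gaps in (("colocated", colocated), ("disaggregated", disaggregated)):
    print(name, "mean:", sum(gaps) / len(gaps), "worst:", max(gaps))
```

With these inputs the colocated stream averages 26 ms per token but has three 220 ms stalls, while the disaggregated stream stays at a flat 20 ms. The numbers say nothing about Cloudflare's measured 3x figure. They only show the mechanism: interference from prefill shows up in the tail of the inter-token gap.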

Cloudflare also rolled out Unweight, a compression system the company says shrinks LLM weights by 15–22% without measurable accuracy loss. Smaller weights mean less data moves between GPU memory and compute units during inference, which compounds Infire's latency wins.
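To get a feel for how weight compression feeds into decode latency, consider a simple lower bound: if every generated token requires streaming the full set of weights from GPU memory, the time per token cannot be less than weight bytes divided by memory bandwidth. The sketch below applies that bound with hypothetical figures (an 8-billion-parameter model, 16-bit weights, 3 TB/s of bandwidth). It assumes the compressed form is what moves across the memory bus and that decompression is free, and it is not a description of how Unweight works internally.

```python
def decode_floor_ms(num_params: float, bytes_per_param: float,
                    mem_bandwidth: float, compression: float = 0.0) -> float:
    """Lower bound on per-token decode time (ms) at batch size 1.

    compression is the fraction of weight bytes removed (0.15 means 15% smaller).
    Ignores KV-cache reads, compute time, and any decompression overhead.
    """
    weight_bytes = num_params * bytes_per_param * (1.0 - compression)
    return weight_bytes / mem_bandwidth * 1000.0


# Hypothetical model and hardware figures, for illustration only.
PARAMS = 8e9
BYTES_PER_PARAM = 2.0
MEM_BANDWIDTH = 3.0e12

baseline = decode_floor_ms(PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH)
low_end = decode_floor_ms(PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH, compression=0.15)
high_end = decode_floor_ms(PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH, compression=0.22)
print(round(baseline, 2), round(low_end, 2), round(high_end, 2))
```

Because the bound is linear in weight bytes, a 15–22% reduction in weight size lowers this floor by the same 15–22% in the memory-bound regime. Real gains depend on how much of each decode step is actually spent waiting on weights.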

Why this matters

Inference economics are now the dominant cost center for every AI company that runs a production product. Training a frontier model is a fixed cost. Serving it to millions of users is a recurring one, and it scales with usage rather than amortizing over time.

Cloudflare's pitch is that running models at the edge (closer to users, on smaller fleets of GPUs, with more efficient software) beats the centralized hyperscaler approach for a growing class of workloads. The 3x latency cut matters because, in interactive products, perceived quality depends heavily on how fast tokens stream. A model that's smarter on paper but slower on the wire can feel worse in production.

The other audience for this announcement is enterprises evaluating where to host their AI workloads. Cloudflare Workers AI is now a credible alternative to AWS Bedrock, Azure AI Foundry, and Google Vertex AI for the inference-heavy half of any AI product. The Infire and Unweight technical posts are effectively a sales pitch dressed as engineering content.

Industry impact

The inference market has been quietly restructuring all year. Groq, the chip company that pivoted into the inference cloud business, is raising $650 million to build Groq 2.0 as a hardware-free neocloud. Cerebras went public in early 2026 with inference services as part of its growth story. AWS just disclosed that its custom silicon — Trainium, Graviton, Nitro — is running at a $20 billion annual revenue rate, much of it driven by Anthropic's commitment to up to 5 gigawatts of Trainium capacity.

What Cloudflare brings that the others don't is geographic distribution. The company has data centers in over 300 cities, which means a Workers AI inference call can land within a few milliseconds of almost any user on the planet. For latency-sensitive AI products — voice assistants, real-time translation, agentic workflows that need to react quickly — that's a meaningful edge over a centralized AWS deployment.

The move also aligns Cloudflare with the agent platforms being announced this spring. Cloudflare's recent AI Platform refresh positioned its inference layer specifically for agents — workloads that make many small, latency-sensitive model calls rather than a few large ones. Infire's prefill/decode split is exactly the optimization an agent runtime benefits from.

Expert perspectives

InfoQ's analysis of the architecture noted that Cloudflare is "separating the model's input processing and output generation onto different optimized systems," which it described as one of the more novel deployment patterns in production inference today. The technique, called disaggregated prefill/decode, has been discussed in academic papers but rarely shipped at scale.

The Rust choice is also notable. Most inference engines in production are written largely in C++ (TensorRT-LLM) or in Python on top of C++ and CUDA kernels (vLLM). Cloudflare's commitment to Rust across its infrastructure stack now extends into the AI layer, a move the company has framed as a memory-safety and concurrency story.

What's next

Workers AI now runs large models on Infire, starting with Kimi K2.5 from Moonshot AI. That's a deliberate choice: Kimi K2.5 is one of the strongest open-weight coding models from the recent wave of Chinese frontier labs. Pricing it competitively on Cloudflare's edge undercuts Western API costs and erases any latency disadvantage Chinese open models had against US-hosted closed models.

The other thing to watch is whether Cloudflare extends the prefill/decode split to bring-your-own-model deployments. If customers can bring custom fine-tunes and still get the disaggregated inference architecture, Workers AI becomes a more direct AWS Bedrock competitor for enterprise AI.

The bottom line

Infire and Unweight are the kind of infrastructure work that doesn't make front-page AI news but determines which providers stay competitive on the AI products people actually use. A 3x latency cut and a throughput gain of up to 20% on the same hardware translate directly into faster AI products at lower cost. For anyone building an AI app in 2026, where you run inference is now as important as which model you call.
