
Google Unveils TPU 8t and 8i: Two Chips for the Agentic Era

Krasa AI

2026-04-23

5 minute read

Google used the opening keynote of Cloud Next 2026 in Las Vegas on Wednesday to announce its eighth-generation TPUs — and for the first time, the company is splitting training and inference into two purpose-built chips instead of one. TPU 8t is designed for model training. TPU 8i is designed for serving models at scale. Both ship later this year.

The performance numbers Google published are aggressive. TPU 8t delivers 2.8x the per-pod compute of Ironwood, Google's seventh-generation TPU announced in November. TPU 8i delivers 80% better inference performance than Ironwood at the same price point. Both chips hit up to 2x better performance-per-watt than the previous generation.

Context: Why Split the Chip in Two

For most of the TPU's history, Google has built one chip per generation to handle both training and inference. That made sense when the bottleneck was just "more FLOPs." It doesn't anymore.

Training a frontier model is a sustained, bandwidth-heavy workload that runs for weeks across thousands of accelerators. Inference — especially for agent workloads — is burstier, more memory-bound, and more sensitive to latency and cost per token. Trying to serve both with one design means compromising on both.
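
A quick roofline-style sketch makes the split concrete. All the numbers below are illustrative placeholders rather than published TPU specs; the point is that big batched training matmuls reuse each weight across many tokens, while batch-one agent decoding does only a couple of FLOPs per byte it fetches:

```python
# Back-of-envelope roofline model: why training and inference stress an
# accelerator differently. All figures are illustrative assumptions,
# not published TPU specs.

PEAK_FLOPS = 1000e12  # hypothetical peak compute: 1,000 TFLOPs
HBM_BW = 3e12         # hypothetical HBM bandwidth: 3 TB/s

# Ridge point: the arithmetic intensity (FLOPs per byte moved) at which
# a workload stops being memory-bound and becomes compute-bound.
ridge = PEAK_FLOPS / HBM_BW  # ~333 FLOPs/byte

workloads = {
    "training (batched matmuls)": 600,  # high reuse per byte fetched
    "agent decode (batch 1)": 2,        # ~2 FLOPs per weight byte read
}

for name, intensity in workloads.items():
    achievable = min(PEAK_FLOPS, intensity * HBM_BW)
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    print(f"{name}: ~{achievable / 1e12:.0f} TFLOPs achievable, {bound}")
```

On those assumed numbers, batch-one decode leaves the vast majority of peak compute idle, which is exactly the gap a dedicated inference design goes after.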

Nvidia has been moving in the same direction. The H100/H200 lineage was nominally general-purpose, but Blackwell and its successors are increasingly segmented by use case. Google's TPU 8t/8i split mirrors that logic, and the company framed the announcement as an explicit challenge to Nvidia.

What's Actually New

TPU 8t — the training chip — is built around a "reduce the frontier model development cycle from months to weeks" pitch. The 2.8x compute-per-pod gain over Ironwood is the headline, and pod scale matters here: Google connects thousands of chips into a single pod with high-bandwidth interconnect, and training throughput scales with pod-level efficiency more than with raw per-chip numbers.
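
As a rough sketch of the "months to weeks" arithmetic, assuming a hypothetical 12-week baseline run that is compute-limited and scales linearly with per-pod compute:

```python
# Rough arithmetic behind the "months to weeks" pitch. The baseline run
# length is a hypothetical; the 2.8x figure is Google's published
# per-pod compute gain over Ironwood.

baseline_weeks = 12  # assumed length of a frontier training run today
speedup = 2.8        # published per-pod compute gain, TPU 8t vs Ironwood

# If the run is compute-limited and pod efficiency holds at scale,
# wall-clock time shrinks roughly in proportion to per-pod compute.
new_weeks = baseline_weeks / speedup
print(f"{baseline_weeks}-week run -> ~{new_weeks:.1f} weeks")  # ~4.3 weeks
```

Real runs rarely scale that cleanly, but the sketch shows why Google frames the pitch in calendar time rather than FLOPs.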

TPU 8i — the inference chip — has 384 MB of on-chip SRAM, triple what Ironwood had. SRAM capacity matters for inference because keeping more data on-chip avoids the latency tax of going to HBM (high-bandwidth memory). For agent workloads that make lots of small, latency-sensitive calls, that translates directly to lower cost per token. A single TPU 8i pod connects 1,152 chips.
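
A toy latency model shows why the SRAM bump is the spec to watch. The access latencies and hit rates below are assumptions chosen for illustration, not measured TPU figures:

```python
# Toy model of average memory-access latency for agent-style inference.
# SRAM/HBM latencies and hit rates are illustrative assumptions, not
# measured TPU 8i figures.

SRAM_NS = 2    # assumed on-chip SRAM access latency, nanoseconds
HBM_NS = 100   # assumed off-chip HBM access latency, nanoseconds

def avg_access_ns(sram_hit_rate: float) -> float:
    """Average access latency when some fraction of reads is served
    from on-chip SRAM instead of going out to HBM."""
    return sram_hit_rate * SRAM_NS + (1 - sram_hit_rate) * HBM_NS

# More on-chip capacity lets more of the working set (KV cache,
# activations, hot weights) stay resident, raising the hit rate.
for hit_rate in (0.3, 0.6, 0.9):
    print(f"SRAM hit rate {hit_rate:.0%}: "
          f"avg access ~{avg_access_ns(hit_rate):.0f} ns")
```

For a workload made of many short calls, that average latency feeds almost directly into time per token, and therefore cost per token.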

Both chips switch to Google's Arm-based Axion CPU as the host processor, replacing the x86 hosts used in prior TPU generations. That's a strategic shift: Google is moving toward end-to-end vertical integration on its AI infrastructure stack, with Google-designed silicon from the accelerator down through the CPU. Arm-based hosts also tend to be more power-efficient than their x86 counterparts, which compounds the 2x perf-per-watt story.

Industry Impact

The immediate audience for TPU 8t is the small list of organizations actually training frontier models — Google DeepMind internally, Anthropic (which uses both TPUs and AWS Trainium), Thinking Machines Lab (which signed a multi-billion-dollar Google Cloud deal on Wednesday), and a handful of others. For that audience, "months to weeks" is not a marketing phrase — it's the difference between iterating on a frontier model three times a year and iterating ten times a year. The company that can iterate faster wins.

TPU 8i is the bigger commercial story. Inference is where the real money gets made — inference spend across the industry is on track to dwarf training spend by the end of 2026 — and an 80% perf-per-dollar improvement at that scale matters for every customer running production AI workloads on Google Cloud. If TPU 8i delivers on its spec sheet, it gives Google Cloud a meaningful pricing advantage over AWS Trainium and Nvidia-based competitors on serving workloads.
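
Back-of-envelope, and using a hypothetical baseline price, here is what an 80% perf-per-dollar gain does to serving cost:

```python
# What "80% better inference performance at the same price point" means
# for serving cost. The baseline price is a hypothetical placeholder.

baseline_cost_per_m = 1.00  # assumed $/1M tokens served on Ironwood
improvement = 1.80          # 80% more inference per dollar on TPU 8i

new_cost_per_m = baseline_cost_per_m / improvement
print(f"${baseline_cost_per_m:.2f}/M tokens -> ${new_cost_per_m:.2f}/M tokens")
# ~$0.56 per million tokens: roughly a 44% cut in serving cost
```

Whether that cut survives contact with real workloads depends on the pricing disclosure discussed below.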

For Nvidia, the read is nuanced. Google still announced an Nvidia partnership at Cloud Next and continues to offer GPUs to customers who want them. But the trajectory is clear: Google wants its customers running on Google silicon. Every TPU shipped is a GPU not shipped.

Expert Perspectives

ServeTheHome's analysis noted that the split into training and inference chips represents "the clearest sign yet that the industry is stratifying." Reviewers at The Register flagged the 384 MB SRAM on TPU 8i as "the spec that matters most" for real-world agent workloads, since agents tend to make many short inference calls rather than a few long ones.

Google Cloud CEO Thomas Kurian, speaking at the keynote, positioned the announcement around the enterprise agent push rather than the raw chip specs — the TPU story is, in Google's framing, the foundation for the Gemini Enterprise Agent Platform that was announced alongside it.

What's Next

Availability is "later this year," which likely means Q3 or Q4 2026 for general availability. Early access and large customer deployments almost always come first — expect Anthropic and a few other anchor customers to be running on TPU 8t before external cloud availability opens up.

Watch for benchmark leaks from independent labs. Google's headline numbers are always pod-level and always versus its own prior generation; what matters for the industry is how TPU 8i performs against H200 and Trainium3 on specific production workloads. Those comparisons will show up in the next two to three months.

Watch the pricing disclosure, which hasn't been published yet. Cloud Next announcements are usually followed by price-per-hour and reserved-instance pricing within a few weeks. That's when the "80% better inference" claim gets stress-tested against the actual total cost of ownership for Cloud customers.

Bottom Line

If you're running a large inference workload on Google Cloud, TPU 8i is worth watching closely — the SRAM bump and the 80% perf-per-dollar claim are both real advantages for agent-heavy workloads. If you're training frontier models, TPU 8t is only relevant if you're at the scale of Anthropic or Thinking Machines. For everyone else, the more interesting takeaway is what the split says about where the industry is headed: training and inference are diverging into different workloads with different hardware, and one chip no longer fits both.

#ai #google #tpu #hardware #nvidia
