AI accelerators, compared

Comparing AI chips is mostly a fight over definitions. Peak FLOPS aren't delivered FLOPS, "efficiency" splits three ways, and half of these you can't actually buy. This page compares fourteen accelerators from NVIDIA, Google, AWS, AMD, Huawei and Groq, normalized to FP8 where the chip supports it and stripped to the numbers that decide a deployment.

What is FP8

FP8 is an 8-bit floating-point number format: a low-precision way to represent numbers that roughly doubles compute throughput and halves memory use versus 16-bit FP16/BF16. It has become a default precision for much of modern AI training and inference, which makes it the fairest common axis for comparing chips. Every compute figure here is dense FP8 (no sparsity speedup), peak theoretical; real delivered throughput typically runs ~30–50% of peak. Chips that can't do FP8 (Huawei 910C, Groq) are shown at FP16 or hatched, never faked.
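As a rough illustration of the two rules of thumb above, the sketch below computes weight memory at 16-bit versus 8-bit and the delivered-throughput band implied by 30–50% of peak. The function names, the 70-billion-parameter model and the 1,000 TFLOPS peak are hypothetical placeholders, not figures for any chip on this page.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def delivered_tflops_range(peak_tflops: float, low: float = 0.30, high: float = 0.50):
    """Delivered throughput band, assuming utilization between `low` and `high`."""
    return peak_tflops * low, peak_tflops * high


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
params = 70e9          # 70B-parameter model (assumption)
peak = 1000.0          # peak dense FP8 TFLOPS (assumption)

fp16_gb = weight_memory_gb(params, 16)
fp8_gb = weight_memory_gb(params, 8)
low_tflops, high_tflops = delivered_tflops_range(peak)

print(f"weights: {fp16_gb:.0f} GB at 16-bit, {fp8_gb:.0f} GB at FP8")
print(f"delivered: {low_tflops:.0f} to {high_tflops:.0f} TFLOPS of {peak:.0f} peak")
```

The memory figure covers weights only; activations, optimizer state and caches come on top of it.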

Legend: color indicates company.

The efficiency frontier

Chart: FP8 compute per watt plotted against memory bandwidth. Bubble size is memory capacity; top-right is better.

Only the six chips with both a published FP8 rate and a usable power figure can be placed here. Trainium3 (power undisclosed), Ascend 910C (no native FP8) and Groq (per-chip metrics don't apply) sit in the table instead.
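The placement rule above (a chip needs both a published FP8 rate and a power figure) can be sketched as a simple filter. The `Chip` record, its field names and the three sample entries are all hypothetical; the numbers are invented for illustration and do not describe any real accelerator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chip:
    name: str
    fp8_tflops: Optional[float]  # None if no native FP8 rate is published
    power_w: Optional[float]     # None if power is undisclosed
    mem_bw_tbs: float            # memory bandwidth, TB/s
    mem_gb: float                # memory capacity, GB (bubble size)


def placeable(chip: Chip) -> bool:
    """A chip can be plotted only if both FP8 rate and power are known."""
    return chip.fp8_tflops is not None and chip.power_w is not None


def tflops_per_watt(chip: Chip) -> float:
    """FP8 compute per watt; call only on placeable chips."""
    return chip.fp8_tflops / chip.power_w


# Invented sample data, not real specifications.
chips = [
    Chip("chip-a", 2000.0, 1000.0, 8.0, 192.0),
    Chip("chip-b", 1000.0, None, 4.0, 128.0),  # power undisclosed
    Chip("chip-c", None, 500.0, 3.0, 96.0),    # no native FP8
]

# Each point: (name, memory bandwidth, FP8 TFLOPS per watt, capacity).
points = [(c.name, c.mem_bw_tbs, tflops_per_watt(c), c.mem_gb)
          for c in chips if placeable(c)]
table_only = [c.name for c in chips if not placeable(c)]

print(points)
print(table_only)
```

Keeping missing values as `None` rather than zero avoids plotting a chip at a fake coordinate, which matches the "never faked" rule stated earlier.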

Full spec table

Per-chip figures; rack-scale systems pool these differently.

What the numbers don't show

Spec sheets are ceilings. Every FP8 figure here is peak theoretical. Delivered utilization (model FLOPs utilization, MFU) is commonly 30–50% and depends on model shape, batch size and software maturity more than the headline number.
Software is the real moat. CUDA vs. ROCm vs. JAX/XLA vs. AWS Neuron decides whether you can actually use the silicon. It can't be charted, and it often outweighs every spec on this page.
Most of these you can't buy. TPU Ironwood and Trainium3 are rent-only on Google Cloud and AWS. Ascend 910C is China-only under export controls. Groq sells tokens, not chips. Only NVIDIA and AMD ship merchant silicon you can rack yourself.
Groq plays a different game. The LPU has 230 MB of on-chip SRAM and no HBM. On-chip bandwidth is ~80 TB/s, but the tiny capacity means a single model has to be spread across many chips. Per-chip specs understate its system-level latency advantage, which is the entire point.
The unit is the rack now. NVIDIA's GB300 NVL72, an Ironwood 9,216-chip pod and Huawei's CloudMatrix 384 are the real products. Interconnect — NVLink, ICI, NeuronLink, Infinity Fabric, UnifiedBus — decides how these per-chip numbers actually scale.
This is not a buy recommendation. It's a snapshot for orientation. Prices move weekly and the roadmap rows (Rubin, Trainium4, MI400, TPU 8t/8i) land through 2027.
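To put a number on the Groq point above, here is a back-of-the-envelope sketch of how many 230 MB chips it takes just to hold a model's weights. The 70-billion-parameter model and 16-bit weights are hypothetical inputs, and the calculation assumes all SRAM is available for weights and ignores activations and other on-chip state, so the result is a lower bound rather than a real deployment size.

```python
SRAM_BYTES_PER_CHIP = 230_000_000  # 230 MB per LPU, as quoted above


def min_chips_for_weights(num_params: int, bits_per_param: int,
                          sram_bytes: int = SRAM_BYTES_PER_CHIP) -> int:
    """Lower bound on chips needed to hold the weights alone."""
    weight_bytes = num_params * bits_per_param // 8
    # Ceiling division with integers, so there is no float rounding.
    return -(-weight_bytes // sram_bytes)


# Hypothetical model: 70B parameters stored at 16 bits each.
chips_needed = min_chips_for_weights(70_000_000_000, 16)
print(chips_needed)  # 609
```

The same model fits in the HBM of a handful of conventional accelerators, which is why per-chip capacity comparisons say little about how a Groq system is actually built.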