Maia 200 is Microsoft's new in-house AI accelerator designed for inference in Azure datacenters. It targets the cost of token generation for large language models and other reasoning workloads by combining narrow-precision compute, a dense on-chip memory hierarchy, and an Ethernet-based scale-up fabric.
Why did Microsoft build a dedicated inference chip?
Training and inference stress hardware in different ways. Training needs very large all-to-all communication and long-running jobs. Inference cares about tokens per second, latency, and tokens per dollar. Microsoft positions Maia 200 as its most efficient inference system, with about 30% better performance per dollar than the latest hardware in its fleet.
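To make the performance-per-dollar framing concrete, the sketch below works through the tokens-per-dollar arithmetic. The baseline throughput and hourly price are hypothetical placeholders; only the roughly 30% improvement comes from Microsoft's positioning.

```python
# Hypothetical illustration of the tokens-per-dollar metric the article refers to.
# The baseline figures are made-up placeholders, not published Maia 200 numbers;
# only the ~30% performance-per-dollar improvement comes from Microsoft's claim.

def tokens_per_dollar(tokens_per_second: float, dollars_per_hour: float) -> float:
    """Tokens generated per dollar of accelerator time."""
    tokens_per_hour = tokens_per_second * 3600
    return tokens_per_hour / dollars_per_hour

baseline = tokens_per_dollar(tokens_per_second=10_000, dollars_per_hour=12.0)  # hypothetical baseline system
maia_200 = baseline * 1.30                                                     # ~30% better perf per dollar

print(f"baseline: {baseline:,.0f} tokens/$,  Maia 200 (claimed): {maia_200:,.0f} tokens/$")
```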
Maia 200 is part of a heterogeneous Azure stack. It will serve multiple models, including the latest GPT-5.2 models from OpenAI, and will power workloads in Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team will use the chip for synthetic data generation and reinforcement learning to improve in-house models.
Core silicon and numeric specs
Each Maia 200 die is fabricated on TSMC's 3-nanometer process. The chip integrates more than 140 billion transistors.
The compute pipeline is built around native FP8 and FP4 tensor cores. A single chip delivers more than 10 petaFLOPS in FP4 and more than 5 petaFLOPS in FP8, within a 750 W SoC TDP envelope.
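Dividing the published peak figures by the TDP gives a rough upper bound on compute efficiency. This is peak arithmetic only, not a sustained or measured number:

```python
# Rough efficiency arithmetic from the published peak figures (peak, not sustained).
FP4_PFLOPS = 10.0   # > 10 petaFLOPS FP4 per chip
FP8_PFLOPS = 5.0    # > 5 petaFLOPS FP8 per chip
TDP_W = 750.0       # SoC TDP envelope

# Peak TFLOPS per watt at each precision.
print(f"FP4: {FP4_PFLOPS * 1000 / TDP_W:.1f} TFLOPS/W (peak)")
print(f"FP8: {FP8_PFLOPS * 1000 / TDP_W:.1f} TFLOPS/W (peak)")
```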
Memory is split between stacked HBM and on-die SRAM. Maia 200 provides 216 GB of HBM3e with about 7 TB per second of bandwidth, plus 272 MB of on-die SRAM. The SRAM is organized into tile-level SRAM and cluster-level SRAM and is fully software-managed. Compilers and runtimes can place working sets explicitly to keep attention and GEMM kernels close to compute.
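Those bandwidth and compute figures imply a roofline crossover that explains why keeping working sets in SRAM matters. The calculation below uses only the published peak numbers; it is a back-of-envelope estimate, not a measured characterization:

```python
# Back-of-envelope roofline from the published peak figures (not measured values).
FP4_FLOPS = 10e15          # > 10 petaFLOPS FP4 per chip
HBM_BYTES_PER_S = 7e12     # ~7 TB/s HBM3e bandwidth
SRAM_BYTES = 272e6         # 272 MB software-managed on-die SRAM

# A kernel needs roughly this many FLOPs per byte of HBM traffic to be compute-bound;
# below that it is bandwidth-bound unless its working set is held in on-die SRAM.
crossover = FP4_FLOPS / HBM_BYTES_PER_S
print(f"compute-bound above ~{crossover:,.0f} FLOPs per HBM byte")
print(f"SRAM budget for pinned working sets: {SRAM_BYTES / 1e6:.0f} MB")
```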
Tile-based microarchitecture and memory hierarchy
The Maia 200 microarchitecture is hierarchical. The base unit is the tile, the smallest autonomous compute and storage unit on the chip. Each tile includes a Tile Tensor Unit for high-throughput matrix operations and a Tile Vector Processor that serves as a programmable SIMD engine. Tile SRAM feeds both units, and tile DMA engines move data in and out of SRAM without stalling compute. A Tile Control Processor orchestrates the sequence of tensor and DMA work.
Multiple tiles form a cluster. Each cluster exposes a larger, multi-banked Cluster SRAM that is shared across the tiles in that cluster. Cluster-level DMA engines move data between Cluster SRAM and the co-packaged HBM stacks. A cluster core coordinates multi-tile execution and uses redundancy schemes for tiles and SRAM to improve yield while keeping the programming model unchanged.
This hierarchy lets the software stack pin different parts of the model in different tiers. For example, attention kernels can keep Q, K, and V tensors in tile SRAM, while collective communication kernels can stage payloads in cluster SRAM to reduce HBM pressure. The design goal is sustained high utilization as models grow in size and sequence length.
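A minimal sketch of what such explicit, software-managed placement could look like follows. The Tier enum, the capacities, and the greedy place() helper are hypothetical illustrations; the article only states that compilers and runtimes place working sets explicitly across tile SRAM, cluster SRAM, and HBM.

```python
# Hypothetical sketch of software-managed placement across the SRAM tiers.
# Names, capacities, and the greedy policy are assumptions for illustration only.
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    TILE_SRAM = "tile_sram"        # closest to the Tile Tensor Unit
    CLUSTER_SRAM = "cluster_sram"  # shared across tiles, staging for collectives
    HBM = "hbm"                    # 216 GB HBM3e backing store

TIER_BYTES = {Tier.TILE_SRAM: 4 * 2**20, Tier.CLUSTER_SRAM: 32 * 2**20}  # hypothetical capacities

@dataclass
class Tensor:
    name: str
    nbytes: int

def place(tensors: list[Tensor], preferred: Tier) -> dict[str, Tier]:
    """Greedy placement: keep tensors in the preferred tier until it fills, then spill to HBM."""
    placement, used = {}, 0
    for t in tensors:
        if used + t.nbytes <= TIER_BYTES.get(preferred, float("inf")):
            placement[t.name], used = preferred, used + t.nbytes
        else:
            placement[t.name] = Tier.HBM   # spill anything that does not fit
    return placement

# Attention working set pinned to tile SRAM; collective payloads would stage in cluster SRAM.
attn = [Tensor("Q", 1 << 20), Tensor("K", 1 << 20), Tensor("V", 1 << 20)]
print(place(attn, Tier.TILE_SRAM))
```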
On-chip data movement and the Ethernet scale-up fabric
Inference is often limited by data movement, not peak compute. Maia 200 uses a custom Network-on-Chip together with a hierarchy of DMA engines. The Network-on-Chip spans tiles, clusters, memory controllers, and I/O units. It has separate planes for large tensor traffic and for small control messages. This separation keeps synchronization and small outputs from being blocked behind large transfers.
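The value of separate planes is essentially avoiding head-of-line blocking. The toy model below, using a hypothetical link rate, contrasts a small synchronization message queued behind a bulk tensor transfer with one sent on its own control plane:

```python
# Toy head-of-line-blocking model for separate NoC planes. The link rate and
# message sizes are hypothetical; only the bulk/control split comes from the article.

LINK_BYTES_PER_US = 100_000          # hypothetical per-plane link rate (bytes per microsecond)

def drain_time_us(queue: list[int]) -> list[float]:
    """Completion time of each message if they share one FIFO link."""
    t, out = 0.0, []
    for nbytes in queue:
        t += nbytes / LINK_BYTES_PER_US
        out.append(t)
    return out

bulk_tensor = 50_000_000   # 50 MB activation transfer
sync_flag = 64             # tiny synchronization message

shared = drain_time_us([bulk_tensor, sync_flag])[-1]   # sync waits behind the tensor
split = drain_time_us([sync_flag])[-1]                 # dedicated control plane: near-immediate
print(f"sync latency, shared plane: {shared:.1f} us; dedicated control plane: {split:.4f} us")
```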
Beyond the chip boundary, Maia 200 integrates its own NIC and an Ethernet-based scale-up network that runs the AI Transport Layer protocol. The on-die NIC exposes about 1.4 TB per second in each direction, or 2.8 TB per second of bidirectional bandwidth, and scales to 6,144 accelerators in a two-tier domain.
Within each tray, four Maia accelerators form a Fully Connected Quad. These four devices have direct, non-switched links to one another. Most tensor-parallel traffic stays inside this group, while only lighter collective traffic goes out to the switches. This improves latency and reduces switch port count for typical inference collectives.
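One way to picture the topology is a rank-to-quad mapping in which the tensor-parallel group coincides with the four directly linked devices in a tray. The mapping below is an illustrative assumption, not Microsoft's published scheduler:

```python
# Illustrative rank-to-quad mapping, assuming the tensor-parallel (TP) group is the
# Fully Connected Quad. Only the quad size and the 6,144-accelerator domain come
# from the article; the mapping itself is a hypothetical sketch.

QUAD_SIZE = 4            # four Maia accelerators per tray form a Fully Connected Quad
FABRIC_SIZE = 6144       # two-tier Ethernet scale-up domain

def placement(global_rank: int) -> dict[str, int]:
    """TP peers share a quad; traffic to other quads crosses the Ethernet switches."""
    return {
        "quad": global_rank // QUAD_SIZE,     # which tray / quad
        "tp_rank": global_rank % QUAD_SIZE,   # rank within the non-switched quad
    }

for r in (0, 1, 5, 6143):
    print(r, placement(r))
assert FABRIC_SIZE % QUAD_SIZE == 0 and FABRIC_SIZE // QUAD_SIZE == 1536  # 1,536 quads per domain
```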
Azure system integration and cooling
At the system level, Maia 200 follows the same rack, power, and mechanical standards as Azure GPU servers. It supports air-cooled and liquid-cooled configurations and uses a second-generation closed-loop liquid-cooling Heat Exchanger Unit for high-density racks. This allows mixed deployments of GPUs and Maia accelerators in the same datacenter footprint.
The accelerator integrates with the Azure control plane. Firmware management, health monitoring, and telemetry use the same workflows as other Azure compute services. This enables fleet-wide rollouts and maintenance without disrupting running AI workloads.
Key Takeaways
Here are the key technical takeaways:
- Inference-first design: Maia 200 is Microsoft's first silicon and system platform built specifically for AI inference, optimized for large-scale token generation in modern reasoning models and large language models.
- Numeric specs and memory hierarchy: The chip is fabricated on TSMC's 3 nm process, integrates about 140 billion transistors, and delivers more than 10 PFLOPS of FP4 and more than 5 PFLOPS of FP8, with 216 GB of HBM3e at about 7 TB per second and 272 MB of on-chip SRAM split into tile SRAM and cluster SRAM, both managed in software.
- Performance versus other cloud accelerators: Microsoft reports about 30% better performance per dollar than the latest Azure inference systems, and claims 3 times the FP4 performance of third-generation Amazon Trainium and higher FP8 performance than Google TPU v7 at the accelerator level.
- Tile-based architecture and Ethernet fabric: Maia 200 organizes compute into tiles and clusters with local SRAM, DMA engines, and a Network-on-Chip, and exposes an integrated NIC with about 1.4 TB per second per direction of Ethernet bandwidth that scales to 6,144 accelerators, with Fully Connected Quad groups serving as the local tensor-parallel domain.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

