NVIDIA engineers built the Vera CPU around 88 custom cores they call Olympus. The cores are based on the Arm v9.2 architecture with NVIDIA-specific modifications. Rather than spreading everything across numerous smaller chiplets, they put it all on a single piece of silicon, reducing the signal delays that come from shuttling data across multiple dies.
Spatial multithreading lets each Olympus core run two full-fledged tasks at the same time, rather than switching back and forth in small time slices. The pipelines, caches, and registers are physically partitioned so that each thread has its own dedicated resources and the second thread runs alongside without interference. That combination yields 176 threads from 88 cores. The cores also decode instructions in batches of ten, include a neural branch predictor that resolves two conditional branches per cycle, and carry a prefetch unit optimized for graph data, so the core can pull in exactly what the code needs before the code even knows it's asking for it.
On the memory side, you're looking at 1.5 terabytes of LPDDR5X. Under full load, each core averages 13.6 gigabytes per second, for an aggregate per-chip bandwidth of 1.2 terabytes per second. Compared with a conventional data-center CPU, this configuration moves twice as much data while consuming half the power in the memory subsystem. All of it is connected through a second-generation coherency fabric, so there are no NUMA zones to worry about: every core always sees the same view of memory.
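The per-chip bandwidth follows directly from the per-core figure. A quick sanity check (using only the numbers quoted above):

```python
# Aggregate memory bandwidth implied by the per-core average.
cores = 88
per_core_gbps = 13.6  # GB/s per core under full load, as quoted

aggregate_gbps = cores * per_core_gbps
print(f"{aggregate_gbps:.1f} GB/s ≈ {aggregate_gbps / 1000:.1f} TB/s")
# prints "1196.8 GB/s ≈ 1.2 TB/s" — matching the 1.2 TB/s figure
```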
For connecting to GPUs, NVLink-C2C links carry 1.8 terabytes per second of coherent traffic. That's more than seven times faster than the latest PCIe generation, while standard PCIe 6.0 and CXL 3.1 interfaces remain available for other devices. Vera is also the first CPU with native support for FP8 math, which speeds up AI calculations without sacrificing accuracy.
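The "more than seven times" comparison checks out if we assume the baseline is a PCIe 6.0 x16 link at roughly 128 GB/s per direction (~256 GB/s bidirectional); that baseline is an assumption, not stated in the article:

```python
# Ratio of NVLink-C2C bandwidth to an assumed PCIe 6.0 x16 baseline.
nvlink_c2c_gbps = 1800   # 1.8 TB/s coherent, as quoted
pcie6_x16_gbps = 256     # assumed: PCIe 6.0 x16, bidirectional

print(f"{nvlink_c2c_gbps / pcie6_x16_gbps:.2f}x")  # prints "7.03x"
```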
Rack-level scaling takes this to another level entirely. With 256 liquid-cooled Vera processors packed into a single cabinet, plus networking and storage accelerators, total memory across the rack exceeds 400 terabytes. With that many processors, aggregate memory throughput reaches roughly 300 terabytes per second, and the rack supports a colossal number of threads (45,056, to be precise) all running concurrently. That means more than 22,500 separate environments can run at the same time, each at full speed and without interference. Real-world tests showed scripting, compilation, data processing, and other workloads running 1.8 to 2.2 times faster than NVIDIA's previous Grace processors in the same rack, with certain scenarios seeing a six-fold increase in CPU throughput over prior designs.
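The rack-level figures can be derived from the per-chip numbers; the sketch below assumes per-chip bandwidth scales linearly across the cabinet, which the article implies but does not state:

```python
# Rack-level aggregates implied by the per-chip figures.
chips_per_rack = 256
threads_per_chip = 176     # 88 cores x 2 threads each
chip_bandwidth_tbps = 1.2  # TB/s per chip (assumed to scale linearly)

threads = chips_per_rack * threads_per_chip
bandwidth_tbps = chips_per_rack * chip_bandwidth_tbps

print(threads)                      # prints "45056"
print(threads // 2)                 # prints "22528" two-thread environments
print(f"~{bandwidth_tbps:.0f} TB/s")  # prints "~307 TB/s"
```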
Manufacturing is currently running at full capacity, and partners plan to begin shipping Vera-based systems in the second half of 2026. Early deployments are aimed at hyperscale clouds, enterprise analytics clusters, and high-performance computing facilities, all of which need a dependable CPU to keep up with the demands of large GPU arrays.

