NVIDIA AI Brings Nemotron-3-Nano-30B to NVFP4 with Quantization Aware Distillation (QAD) for Efficient Reasoning Inference

By NextTech · February 2, 2026


NVIDIA has released Nemotron-Nano-3-30B-A3B-NVFP4, a production checkpoint that runs a 30B-parameter reasoning model in 4-bit NVFP4 format while keeping accuracy close to its BF16 baseline. The model combines a hybrid Mamba2-Transformer Mixture-of-Experts architecture with a Quantization Aware Distillation (QAD) recipe designed specifically for NVFP4 deployment. Overall, it is an ultra-efficient NVFP4-precision version of Nemotron-3-Nano that delivers up to 4x higher throughput on Blackwell B200.

Model weights: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

What’s Nemotron-Nano-3-30B-A3B-NVFP4?

Nemotron-Nano-3-30B-A3B-NVFP4 is a quantized version of Nemotron-3-Nano-30B-A3B-BF16, trained from scratch by NVIDIA as a unified reasoning and chat model. It is built as a hybrid Mamba2-Transformer MoE network:

  • 30B parameters in total
  • 52 layers deep
  • 23 Mamba2 layers and 23 MoE layers
  • 6 grouped query attention layers with 2 groups
  • Each MoE layer has 128 routed experts and 1 shared expert
  • 6 experts are active per token, which gives about 3.5B active parameters per token (see the sketch after this list)
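
To make those numbers concrete, here is a minimal configuration sketch in Python. The field names are invented for illustration (they are not the checkpoint's actual config keys); the 23 + 23 + 6 = 52 layer split follows from the bullets above.

```python
from dataclasses import dataclass

@dataclass
class NemotronNano3Arch:
    # Headline figures from the model card; field names are illustrative.
    total_params: int = 30_000_000_000
    depth: int = 52                       # total layers
    mamba2_layers: int = 23
    moe_layers: int = 23
    gqa_layers: int = 6                   # grouped query attention
    gqa_groups: int = 2
    routed_experts: int = 128             # per MoE layer
    shared_experts: int = 1               # always active
    active_routed_experts: int = 6        # selected per token
    active_params: int = 3_500_000_000    # approximate

cfg = NemotronNano3Arch()
# The layer counts are consistent with the stated depth:
assert cfg.mamba2_layers + cfg.moe_layers + cfg.gqa_layers == cfg.depth
# Each token touches only 6 of 128 routed experts (plus the shared one),
# which is why only ~3.5B of the 30B parameters are active per token:
print(f"{cfg.active_routed_experts / cfg.routed_experts:.1%} of routed experts active")
```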

The model is pretrained on 25T tokens using a Warmup-Stable-Decay learning rate schedule with a batch size of 3072, a peak learning rate of 1e-3 and a minimum learning rate of 1e-5.
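
The report quotes only the schedule family and its two endpoints, so the sketch below is a generic Warmup-Stable-Decay curve using the stated peak and floor; the warmup and decay fractions are assumptions for illustration.

```python
def wsd_lr(step: int, total_steps: int,
           peak_lr: float = 1e-3, min_lr: float = 1e-5,
           warmup_frac: float = 0.01, decay_frac: float = 0.2) -> float:
    """Warmup-Stable-Decay learning rate.

    peak_lr and min_lr are the values quoted for Nemotron-3-Nano
    pretraining; warmup_frac and decay_frac are illustrative guesses
    (the report does not disclose them)."""
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_steps = max(int(total_steps * decay_frac), 1)
    stable_end = total_steps - decay_steps

    if step < warmup_steps:            # linear warmup up to the peak
        return peak_lr * step / warmup_steps
    if step < stable_end:              # long flat stretch at the peak
        return peak_lr
    # linear decay from the peak down to the floor
    progress = min((step - stable_end) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress

print(wsd_lr(0, 1_000_000), wsd_lr(500_000, 1_000_000), wsd_lr(1_000_000, 1_000_000))
# 0.0 (start of warmup), 1e-3 (stable phase), 1e-5 (fully decayed)
```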

Post-training follows a 3-stage pipeline:

  1. Supervised fine-tuning on synthetic and curated data for code, math, science, tool calling, instruction following and structured outputs.
  2. Reinforcement learning with synchronous GRPO across multi-step tool use, multi-turn chat and structured environments, plus RLHF with a generative reward model.
  3. Post-training quantization to NVFP4 with an FP8 KV cache and a selective high-precision architecture, followed by QAD.

The NVFP4 checkpoint keeps the attention layers, and the Mamba layers that feed into them, in BF16; it quantizes the remaining layers to NVFP4 and uses FP8 for the KV cache.
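
As a toy illustration of that selective-precision rule, the routing function below assigns a format per layer. The layer-name patterns are hypothetical; the real checkpoint encodes this in its quantization config.

```python
def precision_for(layer_name: str) -> str:
    """Assign a numeric format per layer, mirroring the stated recipe:
    attention layers and the Mamba2 layers feeding them stay in BF16,
    the KV cache is FP8, and everything else is quantized to NVFP4.
    The name patterns here are invented for illustration."""
    if "kv_cache" in layer_name:
        return "FP8"
    if "attention" in layer_name or "mamba_pre_attention" in layer_name:
        return "BF16"
    return "NVFP4"

for name in ["layers.3.mamba_pre_attention", "layers.4.attention",
             "layers.4.kv_cache", "layers.7.moe.experts.12"]:
    print(f"{name:30s} -> {precision_for(name)}")
```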

The NVFP4 format and why it matters

NVFP4 is a 4-bit floating point format designed for both training and inference on current NVIDIA GPUs. Its main properties:

  • Compared with FP8, NVFP4 delivers 2 to 3 times higher arithmetic throughput.
  • It reduces memory usage by about 1.8 times for weights and activations (the storage math below shows where this figure comes from).
  • It extends MXFP4 by shrinking the block size from 32 to 16 and introducing two-level scaling.
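
The roughly 1.8x memory figure is consistent with simple storage arithmetic, assuming each 16-element block carries one 8-bit E4M3 scale:

```python
# Effective storage per value in NVFP4: 16 four-bit elements share one
# 8-bit E4M3 block scale (the single FP32 per-tensor scale is negligible).
BLOCK = 16
nvfp4_bits = (BLOCK * 4 + 8) / BLOCK   # 4.5 bits per value
fp8_bits = 8.0
print(f"NVFP4 uses {nvfp4_bits} bits/value, "
      f"{fp8_bits / nvfp4_bits:.2f}x less memory than FP8")  # ~1.78x
```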

The two-level scaling uses E4M3-FP8 scales per block and an FP32 scale per tensor. The smaller block size lets the quantizer adapt to local statistics, and the dual scaling increases dynamic range while keeping quantization error low.
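
A numpy sketch of that two-level structure follows. It is a simplified fake-quantizer: elements snap to the E2M1 grid that 4-bit NVFP4 values use, the block scales stay in float rather than true E4M3, and none of NVIDIA's actual kernel details are reproduced.

```python
import numpy as np

# Magnitudes representable by the E2M1 4-bit element format used in NVFP4.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def snap_to_grid(x, grid):
    """Round each value to the nearest grid magnitude, keeping the sign."""
    idx = np.abs(np.abs(x)[..., None] - grid).argmin(-1)
    return np.sign(x) * grid[idx]

def nvfp4_quant_dequant(x: np.ndarray, block: int = 16):
    """Two-level-scaled fake quantization in the spirit of NVFP4.

    One FP32 per-tensor scale, one per-block scale (real NVFP4 stores it
    in E4M3 FP8, max 448; here it stays float for clarity) and E2M1
    elements. A sketch of the structure, not NVIDIA's implementation."""
    xb = x.reshape(-1, block)
    tensor_scale = max(float(np.abs(xb).max()) / (448.0 * 6.0), 1e-12)
    block_scale = np.abs(xb).max(axis=1, keepdims=True) / (6.0 * tensor_scale)
    block_scale = np.maximum(block_scale, 1e-12)   # guard all-zero blocks
    elems = snap_to_grid(xb / (tensor_scale * block_scale), E2M1)
    return (elems * block_scale * tensor_scale).reshape(x.shape)

x = np.random.randn(4, 16).astype(np.float32)
err = np.abs(x - nvfp4_quant_dequant(x)).mean()
print(f"mean abs quantization error: {err:.4f}")
```

The smaller 16-element block means each scale only has to cover 16 values, so one outlier distorts fewer neighbors than with MXFP4's 32-element blocks.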

For very large LLMs, simple post-training quantization (PTQ) to NVFP4 already gives reasonable accuracy across benchmarks. For smaller models, especially heavily post-trained ones, the research team notes that PTQ causes non-negligible accuracy drops, which motivates a training-based recovery strategy.

From QAT to QAD

Standard Quantization Aware Training (QAT) inserts a pseudo-quantization step into the forward pass and reuses the original task loss, such as next-token cross entropy. This works well for convolutional networks, but the research team lists 2 main issues for modern LLMs:

  • Complex multi-stage post-training pipelines with SFT, RL and model merging are hard to reproduce.
  • Original training data for open models is often unavailable in public form.

Quantization Aware Distillation (QAD) changes the objective instead of the full pipeline. A frozen BF16 model acts as the teacher and the NVFP4 model is the student. Training minimizes the KL divergence between their output token distributions, not the original supervised or RL objective.
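
In code, the objective reduces to a standard distillation loss. Here is a minimal PyTorch sketch, assuming the student runs with fake-quantized NVFP4 weights; the report specifies the KL objective against the frozen BF16 teacher, while the training-loop details below are illustrative.

```python
import torch
import torch.nn.functional as F

def qad_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over next-token distributions.

    student_logits: NVFP4 fake-quantized model, trainable.
    teacher_logits: frozen BF16 model, no gradients.
    Both shaped [batch, seq_len, vocab]."""
    teacher_logp = F.log_softmax(teacher_logits.detach().float(), dim=-1)
    student_logp = F.log_softmax(student_logits.float(), dim=-1)
    return F.kl_div(student_logp, teacher_logp,
                    log_target=True, reduction="batchmean")

# Sketch of a training step: only input text is required, no labels,
# no reward model, no replay of the SFT/RL pipeline.
#
#   with torch.no_grad():
#       t_logits = teacher(input_ids)      # frozen BF16 teacher
#   s_logits = student(input_ids)          # NVFP4 student
#   loss = qad_loss(s_logits, t_logits)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```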

The research team highlights 3 properties of QAD:

  1. It aligns the quantized model with the high-precision teacher more accurately than QAT.
  2. It stays stable even when the teacher has already gone through multiple stages, such as supervised fine-tuning, reinforcement learning and model merging, because QAD only tries to match the final teacher behavior.
  3. It works with partial, synthetic or filtered data, because it only needs input text to query the teacher and student, not the original labels or reward models.

Benchmarks on Nemotron-3-Nano-30B

Nemotron-3-Nano-30B-A3B is one of the RL-heavy models in the QAD evaluation. The table below shows accuracy on AA-LCR, AIME25, GPQA-D, LiveCodeBench-v5 and SciCode for BF16, NVFP4-PTQ, NVFP4-QAT and NVFP4-QAD.

Source: https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf

Key Takeaways

  • Nemotron-3-Nano-30B-A3B-NVFP4 is a 30B-parameter hybrid Mamba2-Transformer MoE model that runs in 4-bit NVFP4 with an FP8 KV cache and a small set of BF16 layers preserved for stability, while keeping about 3.5B active parameters per token and supporting context windows up to 1M tokens.
  • NVFP4 is a 4-bit floating point format with block size 16 and two-level scaling, using E4M3-FP8 per-block scales and an FP32 per-tensor scale, which gives about 2 to 3 times higher arithmetic throughput and about 1.8 times lower memory cost than FP8 for weights and activations.
  • Quantization Aware Distillation (QAD) replaces the original task loss with a KL divergence to a frozen BF16 teacher, so the NVFP4 student directly matches the teacher's output distribution without replaying the full SFT, RL and model-merge pipeline or needing the original reward models.
  • Using the new Quantization Aware Distillation method, the NVFP4 version achieves up to 99.4% of BF16 accuracy.
  • On AA-LCR, AIME25, GPQA-D, LiveCodeBench and SciCode, NVFP4-PTQ shows noticeable accuracy loss and NVFP4-QAT degrades further, while NVFP4-QAD recovers performance to near-BF16 levels, reducing the gap to just a few points across these reasoning and coding benchmarks.

Check out the Paper and Model Weights for more details.

