Close Menu
  • Home
  • Opinion
  • Region
    • Africa
    • Asia
    • Europe
    • Middle East
    • North America
    • Oceania
    • South America
  • AI & Machine Learning
  • Robotics & Automation
  • Space & Deep Tech
  • Web3 & Digital Economies
  • Climate & Sustainability Tech
  • Biotech & Future Health
  • Mobility & Smart Cities
  • Global Tech Pulse
  • Cybersecurity & Digital Rights
  • Future of Work & Education
  • Trend Radar & Startup Watch
  • Creator Economy & Culture
What's Hot

Indian business interiors startup Officebanao raises $7.7M

March 17, 2026

MTN to soak up 2,762 IHS Towers employees in $2.2 billion acquisition

March 17, 2026

Lawsuit claims NCAR adjustments pose ‘direct menace’ to US safety

March 17, 2026
Facebook X (Twitter) Instagram LinkedIn RSS
NextTech NewsNextTech News
Facebook X (Twitter) Instagram LinkedIn RSS
  • Home
  • Africa
  • Asia
  • Europe
  • Middle East
  • North America
  • Oceania
  • South America
  • Opinion
Trending
  • Indian business interiors startup Officebanao raises $7.7M
  • MTN to soak up 2,762 IHS Towers employees in $2.2 billion acquisition
  • Lawsuit claims NCAR adjustments pose ‘direct menace’ to US safety
  • Freedom Cellular’s 250GB deal is sticking round with a barely larger value
  • Why Bodily AI wants higher {hardware}, not simply higher fashions
  • NASA Checks Expertise for Refueling Landers
  • Specs, costs, and launch date
  • Apple acquires video enhancing software program firm MotionVFX
Tuesday, March 17
NextTech NewsNextTech News
Home - AI & Machine Learning - Nanbeige4-3B-Pondering: How a 23T Token Pipeline Pushes 3B Fashions Previous 30B Class Reasoning
AI & Machine Learning

Nanbeige4-3B-Pondering: How a 23T Token Pipeline Pushes 3B Fashions Previous 30B Class Reasoning

NextTechBy NextTechDecember 13, 2025No Comments6 Mins Read
Share Facebook Twitter Pinterest LinkedIn Tumblr Telegram Email Copy Link
Follow Us
Google News Flipboard
Nanbeige4-3B-Pondering: How a 23T Token Pipeline Pushes 3B Fashions Previous 30B Class Reasoning
Share
Facebook Twitter LinkedIn Pinterest Email


Can a 3B mannequin ship 30B class reasoning by fixing the coaching recipe as a substitute of scaling parameters? Nanbeige LLM Lab at Boss Zhipin has launched Nanbeige4-3B, a 3B parameter small language mannequin household educated with an unusually heavy emphasis on knowledge high quality, curriculum scheduling, distillation, and reinforcement studying.

The analysis workforce ships 2 major checkpoints, Nanbeige4-3B-Base and Nanbeige4-3B-Pondering, and evaluates the reasoning tuned mannequin in opposition to Qwen3 checkpoints from 4B as much as 32B parameters.

Screenshot 2025 12 12 at 9.34.56 PM 1
https://arxiv.org/pdf/2512.06266

Benchmark outcomes

On AIME 2024, Nanbeige4-3B-2511 studies 90.4, whereas Qwen3-32B-2504 studies 81.4. On GPQA-Diamond, Nanbeige4-3B-2511 studies 82.2, whereas Qwen3-14B-2504 studies 64.0 and Qwen3-32B-2504 studies 68.7. These are the two benchmarks the place the analysis’s “3B beats 10× bigger” framing is straight supported.

The analysis workforce additionally showcase robust software use positive aspects on BFCL-V4, Nanbeige4-3B studies 53.8 versus 47.9 for Qwen3-32B and 48.6 for Qwen3-30B-A3B. On Area-Exhausting V2, Nanbeige4-3B studies 60.0, matching the very best rating listed in that comparability desk contained in the analysis paper. On the identical time, the mannequin is just not greatest throughout each class, on Fullstack-Bench it studies 48.0, beneath Qwen3-14B at 55.7 and Qwen3-32B at 58.2, and on SuperGPQA it studies 53.2, barely beneath Qwen3-32B at 54.1.

Screenshot 2025 12 12 at 9.38.15 PM 1Screenshot 2025 12 12 at 9.38.15 PM 1
https://arxiv.org/pdf/2512.06266

The coaching recipe, the components that transfer a 3B mannequin

Hybrid Information Filtering, then resampling at scale

For pretraining, the analysis workforce mix multi dimensional tagging with similarity based mostly scoring. They cut back their labeling house to twenty dimensions and report 2 key findings, content material associated labels are extra predictive than format labels, and a superb grained 0 to 9 scoring scheme outperforms binary labeling. For similarity based mostly scoring, they construct a retrieval database with lots of of billions of entries supporting hybrid textual content and vector retrieval.

They filter to 12.5T tokens of top of the range knowledge, then choose a 6.5T greater high quality subset and upsample it for two or extra epochs, producing a last 23T token coaching corpus. That is the primary place the place the report diverges from typical small mannequin coaching, the pipeline isn’t just “clear knowledge”, it’s scored, retrieved, and resampled with specific utility assumptions.

FG-WSD, a knowledge utility scheduler as a substitute of uniform sampling

Most comparable analysis tasks deal with warmup steady decay as a studying price schedule solely. Nanbeige4-3B provides a knowledge curriculum contained in the steady section through FG-WSD, Tremendous-Grained Warmup-Steady-Decay. As a substitute of sampling a set combination all through steady coaching, they progressively focus greater high quality knowledge later in coaching.

Screenshot 2025 12 12 at 9.41.48 PM 1Screenshot 2025 12 12 at 9.41.48 PM 1
https://arxiv.org/pdf/2512.06266

In a 1B ablation educated on 1T tokens, the above Desk reveals GSM8K enhancing from 27.1 underneath vanilla WSD to 34.3 underneath FG-WSD, with positive aspects throughout CMATH, BBH, MMLU, CMMLU, and MMLU-Professional. Within the full 3B run, the analysis workforce splits coaching into Warmup, Variety-Enriched Steady, Excessive-High quality Steady, and Decay, and makes use of ABF within the decay stage to increase context size to 64K.

Screenshot 2025 12 12 at 9.42.29 PM 1Screenshot 2025 12 12 at 9.42.29 PM 1
https://arxiv.org/pdf/2512.06266

Multi-stage SFT, then repair the supervision traces

Put up coaching begins with chilly begin SFT, then general SFT. The chilly begin stage makes use of about 30M QA samples targeted on math, science, and code, with 32K context size, and a reported mixture of about 50% math reasoning, 30% scientific reasoning, and 20% code duties. The analysis workforce additionally declare that scaling chilly begin SFT directions from 0.5M to 35M retains enhancing AIME 2025 and GPQA-Diamond, with no early saturation of their experiments.

Screenshot 2025 12 12 at 9.43.36 PM 1Screenshot 2025 12 12 at 9.43.36 PM 1
https://arxiv.org/pdf/2512.06266

General SFT shifts to a 64K context size combine together with basic dialog and writing, agent type software use and planning, tougher reasoning that targets weaknesses, and coding duties. This stage introduces Answer refinement plus Chain-of-Thought reconstruction. The system runs iterative generate, critique, revise cycles guided by a dynamic guidelines, then makes use of a series completion mannequin to reconstruct a coherent CoT that’s per the ultimate refined resolution. That is meant to keep away from coaching on damaged reasoning traces after heavy modifying.

Screenshot 2025 12 12 at 9.44.05 PM 1Screenshot 2025 12 12 at 9.44.05 PM 1
https://arxiv.org/pdf/2512.06266

DPD distillation, then multi stage RL with verifiers

Distillation makes use of Twin-Stage Desire Distillation, DPD. The scholar learns token degree distributions from the trainer mannequin, whereas a sequence degree DPO goal maximizes the margin between optimistic and unfavorable responses. Positives come from sampling the trainer Nanbeige3.5-Professional, negatives are sampled from the 3B pupil, and distillation is utilized on each pattern varieties to scale back assured errors and enhance options.

Reinforcement studying is staged by area, and every stage makes use of on coverage GRPO. The analysis workforce describes on coverage knowledge filtering utilizing avg@16 go price and retaining samples strictly between 10% and 90% to keep away from trivial or inconceivable objects. STEM RL makes use of an agentic verifier that calls a Python interpreter to examine equivalence past string matching. Coding RL makes use of artificial check features, validated through sandbox execution, and makes use of go fail rewards from these exams. Human desire alignment RL makes use of a pairwise reward mannequin designed to provide preferences in a number of tokens and cut back reward hacking threat in comparison with basic language mannequin rewarders.

Screenshot 2025 12 12 at 9.45.41 PM 1Screenshot 2025 12 12 at 9.45.41 PM 1
https://arxiv.org/pdf/2512.06266

Comparability Desk

Benchmark, metric Qwen3-14B-2504 Qwen3-32B-2504 Nanbeige4-3B-2511
AIME2024, avg@8 79.3 81.4 90.4
AIME2025, avg@8 70.4 72.9 85.6
GPQA-Diamond, avg@3 64.0 68.7 82.2
SuperGPQA, avg@3 46.8 54.1 53.2
BFCL-V4, avg@3 45.4 47.9 53.8
Fullstack Bench, avg@3 55.7 58.2 48.0
ArenaHard-V2, avg@3 39.9 48.4 60.0

Key Takeaways

  • 3B can lead a lot bigger open fashions on reasoning, underneath the paper’s averaged sampling setup. Nanbeige4-3B-Pondering studies AIME 2024 avg@8 90.4 vs Qwen3-32B 81.4, and GPQA-Diamond avg@3 82.2 vs Qwen3-14B 64.0.
  • The analysis workforce is cautious about analysis, these are avg@okay outcomes with particular decoding, not single shot accuracy. AIME is avg@8, most others are avg@3, with temperature 0.6, high p 0.95, and lengthy max technology.
  • Pretraining positive aspects are tied to knowledge curriculum, not simply extra tokens. Tremendous-Grained WSD schedules greater high quality mixtures later, and the 1B ablation reveals GSM8K shifting from 27.1 to 34.3 versus vanilla scheduling.
  • Put up-training focuses on supervision high quality, then desire conscious distillation. The pipeline makes use of deliberative resolution refinement plus chain-of-thought reconstruction, then Twin Desire Distillation that mixes token distribution matching with sequence degree desire optimization.

Try the Paper and Mannequin Weights. Be happy to take a look at our GitHub Web page for Tutorials, Codes and Notebooks. Additionally, be happy to comply with us on Twitter and don’t neglect to affix our 100k+ ML SubReddit and Subscribe to our E-newsletter. Wait! are you on telegram? now you’ll be able to be a part of us on telegram as properly.


a professional linkedin headshot photogr 0jcmb0R9Sv6nW5XK zkPHw uARV5VW1ST6osLNlunoVWg

Michal Sutter is a knowledge science skilled with a Grasp of Science in Information Science from the College of Padova. With a strong basis in statistical evaluation, machine studying, and knowledge engineering, Michal excels at remodeling advanced datasets into actionable insights.

NVIDIA 1

Elevate your perspective with NextTech Information, the place innovation meets perception.
Uncover the most recent breakthroughs, get unique updates, and join with a worldwide community of future-focused thinkers.
Unlock tomorrow’s traits at this time: learn extra, subscribe to our publication, and develop into a part of the NextTech neighborhood at NextTech-news.com

Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
NextTech
  • Website

Related Posts

Google AI Releases WAXAL: A Multilingual African Speech Dataset for Coaching Automated Speech Recognition and Textual content-to-Speech Fashions

March 17, 2026

The way to Construct Excessive-Efficiency GPU-Accelerated Simulations and Differentiable Physics Workflows Utilizing NVIDIA Warp Kernels

March 17, 2026

Mistral AI Releases Mistral Small 4: A 119B-Parameter MoE Mannequin that Unifies Instruct, Reasoning, and Multimodal Workloads

March 17, 2026
Add A Comment
Leave A Reply Cancel Reply

Economy News

Indian business interiors startup Officebanao raises $7.7M

By NextTechMarch 17, 2026

Business interiors startup Officebanao has raised $7.7 million in a brand new funding spherical led…

MTN to soak up 2,762 IHS Towers employees in $2.2 billion acquisition

March 17, 2026

Lawsuit claims NCAR adjustments pose ‘direct menace’ to US safety

March 17, 2026
Top Trending

Indian business interiors startup Officebanao raises $7.7M

By NextTechMarch 17, 2026

Business interiors startup Officebanao has raised $7.7 million in a brand new…

MTN to soak up 2,762 IHS Towers employees in $2.2 billion acquisition

By NextTechMarch 17, 2026

MTN Group, Africa’s greatest telco, will take in 2,762 IHS workers as…

Lawsuit claims NCAR adjustments pose ‘direct menace’ to US safety

By NextTechMarch 17, 2026

Restructuring the atmospheric analysis middle would disrupt climate and local weather knowledge…

Subscribe to News

Get the latest sports news from NewsSite about world, sports and politics.

NEXTTECH-LOGO
Facebook X (Twitter) Instagram YouTube

AI & Machine Learning

Robotics & Automation

Space & Deep Tech

Web3 & Digital Economies

Climate & Sustainability Tech

Biotech & Future Health

Mobility & Smart Cities

Global Tech Pulse

Cybersecurity & Digital Rights

Future of Work & Education

Creator Economy & Culture

Trend Radar & Startup Watch

News By Region

Africa

Asia

Europe

Middle East

North America

Oceania

South America

2025 © NextTech-News. All Rights Reserved
  • About Us
  • Contact Us
  • Privacy Policy
  • Terms Of Service
  • Advertise With Us
  • Write For Us
  • Submit Article & Press Release

Type above and press Enter to search. Press Esc to cancel.

Subscribe For Latest Updates

Sign up to best of Tech news, informed analysis and opinions on what matters to you.

Invalid email address
 We respect your inbox and never send spam. You can unsubscribe from our newsletter at any time.     
Thanks for subscribing!