Large language models (LLMs) have recently demonstrated remarkable progress in multi-step reasoning, establishing mathematical problem-solving as a rigorous benchmark for assessing advanced capabilities. While proprietary models like GPT-4o and Claude Sonnet 4 lead in performance, their closed-source nature impedes transparency and reproducibility. Addressing these gaps, MiroMind AI released the MiroMind-M1 series, a fully open-source pipeline (datasets, models, training code, and evaluation scripts) that sets a new standard for openness and state-of-the-art mathematical reasoning within the Qwen-2.5 model ecosystem.
Architectural Foundation and Motivation
MiroMind-M1 is built on the robust Qwen-2.5 backbone, with enhancements geared explicitly toward mathematical reasoning. The team adopts a two-stage training protocol:
- Supervised Fine-Tuning (SFT): The model is fine-tuned on 719K carefully curated and verified mathematical problems, equipping it with strong step-by-step reasoning abilities.
- Reinforcement Learning with Verifiable Rewards (RLVR): The model then undergoes RL on 62K challenging, rigorously verifiable math problems, leveraging reward signals from a robust external verifier.
This approach is motivated both by the need for strong mathematical logic and by lessons learned from leading RLMs: imitating chain-of-thought exemplars improves general reasoning, while reinforcement learning, guided by precise rewards, further refines accuracy and efficiency.
Data Transparency and Quality
A hallmark of the MiroMind-M1 project is the full openness and cleanliness of its training data:
- SFT corpus composition: Draws from OpenR1, OpenThoughts, Light-R1, and Synthetic-1, ensuring problems have verified solutions and rich, multi-step reasoning traces.
- Stringent deduplication and decontamination: Employs N-gram overlap filtering to eliminate duplicates and data leakage against evaluation sets (e.g., AIME24, AIME25, MATH500); a minimal filtering sketch follows this list.
- Preference for long trajectories: Experiments show that training on samples with longer reasoning traces consistently yields higher benchmark scores, highlighting the importance of deep semantic content in the reasoning signal.
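The exact filter settings are not given above, so the sketch below is only a minimal illustration of n-gram-based decontamination; the n-gram size and `overlap_threshold` values are assumptions, not the project's actual configuration.

```python
from typing import Iterable, List, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Word-level n-grams of a text (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample: str, benchmark_ngrams: Set[Tuple[str, ...]],
                    n: int = 13, overlap_threshold: float = 0.1) -> bool:
    """Flag a training sample whose n-grams overlap an evaluation set too heavily."""
    sample_ngrams = ngrams(sample, n)
    if not sample_ngrams:
        return False
    overlap = len(sample_ngrams & benchmark_ngrams) / len(sample_ngrams)
    return overlap >= overlap_threshold

def decontaminate(corpus: Iterable[str], benchmark_problems: Iterable[str]) -> List[str]:
    """Drop training samples that leak benchmark content (e.g., AIME24/25, MATH500)."""
    benchmark_ngrams: Set[Tuple[str, ...]] = set()
    for problem in benchmark_problems:
        benchmark_ngrams |= ngrams(problem)
    return [s for s in corpus if not is_contaminated(s, benchmark_ngrams)]
```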
The resulting dataset provides 719K verified training traces, significantly advancing open, reproducible research over prior efforts.
Supervised Fine-Tuning: Empirical Excellence
For SFT, MiroMind-SFT-7B is initialized from Qwen2.5-Math-7B and trained with a large context window (up to 32,768 tokens) and a no-packing strategy to avoid cross-sample attention contamination. Its performance on key math benchmarks outpaces peer open models:
| Model | AIME24 | AIME25 | MATH500 |
|---|---|---|---|
| DeepSeek-R1-Distill | 55.5 | 40.4 | 92.8 |
| MiMo-7B-SFT | 58.7 | 44.3 | 93.0 |
| MiroMind-SFT-7B | 60.4 | 45.0 | 94.6 |
These results validate the efficacy of the data curation and training design: richer, deeper samples and the no-packing strategy lead to consistently superior performance.
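The no-packing strategy keeps each example in its own padded sequence rather than concatenating several examples into one long sequence, so attention never crosses sample boundaries. A minimal PyTorch-style collator under that assumption (the field names and padding id are illustrative, not the project's released code) might look like:

```python
from typing import Dict, List
import torch

def no_packing_collate(batch: List[Dict[str, List[int]]],
                       pad_token_id: int = 0,
                       max_len: int = 32768) -> Dict[str, torch.Tensor]:
    """Pad each sample individually instead of packing samples together,
    so no token ever attends to a neighboring example."""
    longest = min(max(len(x["input_ids"]) for x in batch), max_len)
    input_ids, attention_mask, labels = [], [], []
    for x in batch:
        ids = x["input_ids"][:longest]
        pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        labels.append(ids + [-100] * pad)  # -100: ignore padded positions in the loss
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```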
CAMPO: Context-Aware Multi-Stage Policy Optimization
A key innovation in MiroMind-M1’s RLVR phase is the CAMPO algorithm. CAMPO addresses two critical RL challenges, training instability and token inefficiency, through:
- Multi-stage training with expanding context limits: Training begins with constrained output lengths (e.g., 16K tokens), then gradually raises the limit to allow deeper reasoning, balancing efficiency and thoroughness.
- Dynamic repetition penalty: A dedicated repetition critic penalizes outputs exhibiting early or excessive repetition, preventing utility collapse and enforcing output diversity (a reward-shaping sketch appears at the end of this section).
- Accurate external verifier: The reward feedback system is substantially improved to robustly score math answers (including tricky cases with units, π, and percentages), ensuring training signals are tightly aligned with true correctness; a minimal equivalence-checking sketch follows this list.
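MiroMind’s actual verifier is not reproduced here; the sketch below only illustrates the kind of symbolic equivalence check such a verifier performs, using SymPy and a few hypothetical normalization rules for percent signs and π.

```python
import sympy
from sympy.parsing.sympy_parser import parse_expr

def normalize(ans: str) -> str:
    """Strip formatting that should not affect correctness (illustrative rules only)."""
    ans = ans.strip().replace(" ", "")
    return ans.replace("π", "pi").replace("%", "/100").replace("^", "**")

def answers_match(predicted: str, reference: str) -> bool:
    """True if two answers are symbolically equivalent, e.g. '50%' and '1/2'."""
    try:
        diff = parse_expr(normalize(predicted)) - parse_expr(normalize(reference))
        return sympy.simplify(diff) == 0
    except (sympy.SympifyError, SyntaxError, TypeError):
        # Fall back to exact string comparison when parsing fails.
        return normalize(predicted) == normalize(reference)

# Binary reward: 1.0 only when the verifier accepts the model's final answer.
reward = 1.0 if answers_match("50%", "1/2") else 0.0
```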
CAMPO not only stabilizes RL dynamics but also yields models that solve problems with fewer, more relevant tokens, accelerating inference and reducing cost without sacrificing accuracy.
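How the repetition critic and the staged context limits combine into a reward signal is not specified above; the following sketch shows one plausible shaping scheme under stated assumptions (the n-gram window, penalty weight, and stage schedule are hypothetical, not CAMPO’s published settings).

```python
from collections import Counter
from typing import List

# Hypothetical multi-stage schedule: maximum generation length per RL stage.
STAGE_MAX_TOKENS = {1: 16_384, 2: 24_576, 3: 32_768}

def repetition_score(tokens: List[str], n: int = 4) -> float:
    """Fraction of n-grams that are repeats; 0.0 means no repetition."""
    if len(tokens) < n:
        return 0.0
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    repeated = sum(count - 1 for count in Counter(grams).values())
    return repeated / len(grams)

def shaped_reward(tokens: List[str], is_correct: bool, stage: int,
                  repetition_weight: float = 0.5) -> float:
    """Combine verifier correctness with a repetition penalty under the stage's length cap."""
    if len(tokens) > STAGE_MAX_TOKENS[stage]:
        return 0.0  # Over-length generations earn nothing at this stage.
    base = 1.0 if is_correct else 0.0
    return base - repetition_weight * repetition_score(tokens)
```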
Benchmark Performance: State-of-the-Art Efficiency
MiroMind’s open models achieve highly competitive or state-of-the-art results among open Qwen-2.5-based math models (7B/32B parameters):
| Model | AIME24 | AIME25 | MATH500 |
|---|---|---|---|
| DeepSeek-R1-7B | 55.5 | 39.2 | – |
| MiMo-7B-RL | 68.2 | 55.4 | 95.8 |
| Skywork-OR1-7B | 72.2 | 54.6 | – |
| MiroMind-RL-7B | 73.4 | 57.8 | 96.7 |
| Skywork-OR1-32B | 77.1 | 68.2 | 97.5 |
| MiroMind-RL-32B | 77.5 | 65.6 | 96.4 |
Notably, the MiroMind-M1-RL models not only match or exceed peer accuracy, but do so with greater token efficiency: the 32B model produces shorter, more concise solutions without loss of correctness, thanks to CAMPO’s training.
Full Stack and Reproducibility
Every component of the MiroMind-M1 stack is openly released:
- Model weights (SFT and RL checkpoints at both 7B and 32B scales)
- Datasets (the full 719K SFT corpus and the 62K RLVR problem set)
- Training scripts (supporting multi-node distributed training on Ray)
- Evaluation code (standardized scripts and benchmark configs)
Researchers can replicate, audit, and extend MiroMind-M1 from raw data to trained models, advancing reproducibility and accelerating new open research.
Conclusion
MiroMind-M1 demonstrates that with careful data curation, an innovative RL algorithm (CAMPO), and radical transparency, open-source language models can rival proprietary systems in advanced mathematical reasoning. The project sets a new bar for reproducibility and collaborative advancement in reasoning LLMs, providing both a high-quality resource and a robust platform for future innovation.
Check out the Paper, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.

