Large Math Models: Curriculum Growing Math-Native Transformers with Synthetic Chain-of-Thought

Feng Ye

Butterfly - Machine Learning Capabilities

View Code on GitHub


What Our Model Can Do

Input: (((36002--20+-6062327--15)--87)-8027)

Step-by-Step Solution:

= ( ( 3 6 0 0 2 - - 2 0 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 ) - 8 0 2 7

= ( 3 6 0 0 2 - - 2 0 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 - 8 0 2 7

= ( 3 6 0 2 2 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 - 8 0 2 7

= ( - 6 0 2 6 3 0 5 - - 1 5 ) - - 8 7 - 8 0 2 7

= - 6 0 2 6 3 0 5 - - 1 5 - - 8 7 - 8 0 2 7

= - 6 0 2 6 2 9 0 - - 8 7 - 8 0 2 7

= - 6 0 2 6 2 0 3 - 8 0 2 7

= - 6 0 3 4 2 3 0 <eos>

Metrics: Loss: 0.0068 | Accuracy: 100.00% (195/195) | Confidence: 99.52%


Motivation

Current AI models struggle with:

  • Mathematical reasoning beyond pattern matching
  • Generalization to novel problems
  • Structured thinking and proof generation
  • Understanding vs. computation

We need models that think mathematically, not just compute.


Key Insights

  1. Mathematical reasoning requires structural understanding
  2. Chain-of-Thought reasoning makes thinking explicit
  3. Curriculum learning enables mastery of complex concepts
  4. Transformers can be math-native from the ground up
  5. Structured reasoning principles generalize across domains

The system evolved from a prototype to 3000+ lines of production code in ~48 hours

99% of the code was generated by Gemini-2.5-Pro using Cursor


Math-Native Architecture

Structure-Aware Design

  • Standard decoder-only transformer
  • Mathematical tokenizer
  • Sinusoidal encoding
  • Place-value encoding
  • Hierarchical operations

Key Differences from LLMs

  • Not initialized from language models
  • Mathematical structure directly encoded
  • Optimized for step-by-step reasoning
  • Patterns transferable to other domains
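
To make the structure-aware input concrete, here is a minimal sketch of how token identity, place value, and structural level could be combined via sinusoidal encodings, as the next slide's example suggests. The names (`MathEmbedding`, `sinusoid`) and the summation rule are assumptions for illustration, not the project's actual code:

import torch
import torch.nn as nn

def sinusoid(values: torch.Tensor, d_model: int) -> torch.Tensor:
    # Standard transformer sinusoidal formula, applied to arbitrary
    # integer features (position, place value, structural level).
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = values.unsqueeze(-1).float() / (10000 ** (i / d_model))
    enc = torch.zeros(*values.shape, d_model)
    enc[..., 0::2] = torch.sin(angle)
    enc[..., 1::2] = torch.cos(angle)
    return enc

class MathEmbedding(nn.Module):
    # Sketch of a structure-aware input layer: learned token embeddings
    # plus sinusoidal encodings of position, place value, and level.
    # Summing the four signals is an assumption.
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, token_ids, place_values, level_ids):
        positions = torch.arange(token_ids.size(-1))
        return (self.tok(token_ids)
                + sinusoid(positions, self.d_model)
                + sinusoid(place_values, self.d_model)
                + sinusoid(level_ids, self.d_model))

emb = MathEmbedding(vocab_size=20, d_model=64)
x = emb(torch.tensor([[2, 3, 14]]),    # token IDs
        torch.tensor([[1, 0, -1]]),    # place values
        torch.tensor([[4, 4, 3]]))     # structural level IDs
print(x.shape)  # torch.Size([1, 3, 64])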

Encoding Process

Example expression: 23 - (5 + 7)

# Tokenization:  ['2', '3', '-', '(', '5', '+', '7', ')', '\n']
# Token IDs:     [2, 3, 14, 16, 5, 10, 7, 17, 12]
# Place Values:  [1, 0, -1, -1, 0, -1, 0, -1, -1]
# Level IDs:     [4, 4, 3, 1, 5, 2, 5, 1, 0]
# Sinusoidal encoding converts these to dense vectors

Note: Structural Level ID assignments evolve as our parser improves.
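
A minimal sketch of the character-level tokenizer and the place-value pass that could reproduce the numbers above. The helper names are hypothetical, and token IDs depend on the project's actual vocabulary ordering, so they are omitted here:

def tokenize(expr: str):
    # Character-level tokens, whitespace dropped, newline as end marker
    return [c for c in expr if c != " "] + ["\n"]

def place_values(tokens):
    # Digits get their decimal place (ones = 0, tens = 1, ...);
    # all non-digit tokens get -1.
    values = [-1] * len(tokens)
    i = 0
    while i < len(tokens):
        if tokens[i].isdigit():
            j = i
            while j < len(tokens) and tokens[j].isdigit():
                j += 1
            for k in range(i, j):
                values[k] = j - 1 - k  # distance from the ones digit
            i = j
        else:
            i += 1
    return values

tokens = tokenize("23 - (5 + 7)")
print(tokens)                # ['2', '3', '-', '(', '5', '+', '7', ')', '\n']
print(place_values(tokens))  # [1, 0, -1, -1, 0, -1, 0, -1, -1]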


Synthetic Chain-of-Thought

Input: (((36002--20+-6062327--15)--87)-8027)

CoT Steps:

= ( ( 3 6 0 0 2 - - 2 0 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 ) - 8 0 2 7

= ( 3 6 0 0 2 - - 2 0 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 - 8 0 2 7

= ( 3 6 0 2 2 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 - 8 0 2 7

= ( - 6 0 2 6 3 0 5 - - 1 5 ) - - 8 7 - 8 0 2 7

= - 6 0 2 6 3 0 5 - - 1 5 - - 8 7 - 8 0 2 7

= - 6 0 2 6 2 9 0 - - 8 7 - 8 0 2 7

= - 6 0 2 6 2 0 3 - 8 0 2 7

= - 6 0 3 4 2 3 0 <eos>

Generated CoT enables:

  • Explicit intermediate steps
  • Dense supervision signals
  • Automated verification
  • Human-like "showing work"
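
For illustration, a small reducer along these lines can emit such traces: repeatedly unwrap a parenthesized number or evaluate the leftmost addition/subtraction, recording the expression after every step. This is a sketch with a simplified reduction order, so it will not reproduce the trace above step for step:

import re

NUM = r"-?\d+"
BINOP = re.compile(rf"({NUM})([+-])({NUM})")    # leftmost "a op b"
WRAPPED = re.compile(rf"\(({NUM})\)")           # "(n)" -> "n"
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}

def cot_steps(expr: str):
    # Yield the expression after each single reduction step.
    expr = expr.replace(" ", "")
    yield expr
    while not re.fullmatch(NUM, expr):
        m = WRAPPED.search(expr)
        if m is not None:
            value = m.group(1)                  # drop redundant parens
        else:
            m = BINOP.search(expr)
            a, op, b = m.groups()
            value = str(OPS[op](int(a), int(b)))
        expr = expr[:m.start()] + value + expr[m.end():]
        yield expr

for step in cot_steps("(((36002--20+-6062327--15)--87)-8027)"):
    print("=", step)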

Curriculum Learning Strategy

Stage    Complexity              Example
1        Single-digit addition   2+3
4        Negative numbers        -25+17
15       Parentheses             (45-12)+(82-3)
51       Complex expressions     (((36002--20+-6062327--15)--87)-8027)
Future   Physics equations       F = m·a, E = mc²

Current: Linear progression

Future: Graph-based curriculum with model-chosen paths
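
As a concrete illustration of stage-conditioned data generation, here is a hypothetical sampler whose difficulty thresholds mirror the table above; the function name and the exact knob formulas are guesses, not the project's actual curriculum:

import random

def sample_expression(stage: int) -> str:
    # Difficulty knobs grow with stage; thresholds follow the table.
    max_digits = min(1 + stage // 5, 7)   # longer operands later
    n_terms = min(2 + stage // 10, 6)     # more operands later
    allow_neg = stage >= 4                # negatives from stage 4
    allow_parens = stage >= 15            # parentheses from stage 15

    def operand() -> str:
        n = random.randint(0, 10 ** max_digits - 1)
        sign = "-" if allow_neg and random.random() < 0.5 else ""
        return sign + str(n)

    expr = operand()
    for _ in range(n_terms - 1):
        expr += random.choice("+-") + operand()
        if allow_parens and random.random() < 0.3:
            expr = "(" + expr + ")"
    return expr

random.seed(0)
print(sample_expression(1))   # e.g. a single-digit sum
print(sample_expression(51))  # e.g. a nested multi-term expression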


Dynamic Model Growth

  • Parameters expand as task complexity increases
  • Architecture adapts to mastery metrics
  • Weight transfer preserves learned capabilities
if val_accuracy > MASTERY_THRESHOLD:
    next_stage = current_stage + 1
else:
    # Grow model capacity
    new_d_model = min(d_model * 2, MAX_D_MODEL)
    new_layers = min(num_layers + 1, MAX_LAYERS)
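
The weight-transfer step might look like the following sketch (assuming PyTorch nn.Linear layers; the project's actual scheme may differ): old parameters are copied into the top-left block of each grown layer, so prior capabilities carry over while the added rows and columns keep their fresh initialization.

import torch
import torch.nn as nn

def transfer_linear(old: nn.Linear, new: nn.Linear) -> None:
    # Copy the old weights into the top-left block of the larger layer;
    # newly added rows/columns keep their fresh initialization.
    with torch.no_grad():
        out_f, in_f = old.weight.shape
        new.weight[:out_f, :in_f].copy_(old.weight)
        if old.bias is not None and new.bias is not None:
            new.bias[:out_f].copy_(old.bias)

# Example: growing d_model from 256 to 512
old_proj = nn.Linear(256, 256)
new_proj = nn.Linear(512, 512)
transfer_linear(old_proj, new_proj)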

Future: Graph-based curriculum with self-directed learning paths.


Results at Stage 51

Model Architecture

  • Parameters: ~13-15 million (62 MB)
  • d_model: 512 (64→128→256→512)
  • Layers: 4 (2→3→4)
  • Attention heads: 8

Performance

  • 100% accuracy (195/195) on complex expressions
  • Loss: 0.0068
  • Confidence: 99.52%
  • Fully deterministic: the same input always produces the same output
  • ~10x smaller than GPT-2 (124M parameters)

Key insight: a domain-specific architecture can deliver deterministic, reliable mathematical reasoning at a fraction of the size of general-purpose models.
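
Determinism of this kind typically comes from pure greedy (argmax) decoding with no sampling; a minimal sketch, assuming only that the model maps token IDs to next-token logits:

import torch

@torch.no_grad()
def greedy_decode(model, input_ids, eos_id, max_len=256):
    # Always take the argmax token: no temperature, no sampling, so the
    # same input always yields the same output.
    for _ in range(max_len):
        logits = model(input_ids)              # (batch=1, seq, vocab)
        next_id = logits[0, -1].argmax().item()
        next_tok = torch.tensor([[next_id]], dtype=input_ids.dtype)
        input_ids = torch.cat([input_ids, next_tok], dim=1)
        if next_id == eos_id:
            break
    return input_ids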


What's Next

Ready Now

  • 3000+ lines of code
  • Full test coverage
  • Addition & subtraction
  • Parentheses support
  • Deterministic reasoning

Next Steps

  1. Test on GSM8K word problems
  2. Explore hybrid parsing approaches
  3. Multiplication and division (×, ÷)
  4. Algebra: 2x = 10, x = ?
  5. Symbols: π, e, ∑, ∫

Investigating whether specialized architectures can compete with general models


Conclusion

We've demonstrated that:

  • Structure-aware architectures enable genuine mathematical understanding
  • Curriculum learning creates robust reasoning
  • Models can grow dynamically to match complexity
  • Mathematical cognition provides foundation for broader understanding

Mathematics as the foundation of genuine AI understanding.

Our approach builds systematic mathematical cognition from first principles.


See It In Action

Live Training Demonstration

Watch the model learn complex mathematical reasoning in real-time

Current Model:

Stage 51+ | 13-15M Parameters | d_model=512 | 4 Layers | Fully Deterministic

Dynamic curriculum learning with self-adapting architecture


Questions?

Feng Ye

Butterfly - Machine Learning Capabilities

Pretraining on math & physics, post-training on human languages

Try the Code on GitHub!

(Warning: May be slow on your Mac 😅)

View Slides Online

🤔
I feel math is hard
So I'm training a model to learn math for me instead
🤖📚
- Modern Problems Require Modern Solutions