Large Math Models: Curriculum Growing Math-Native Transformers with Synthetic Chain-of-Thought

Feng Ye

Butterfly - Machine Learning Capabilities

View Code on GitHub


What Our Model Can Do

Input: (((36002--20+-6062327--15)--87)-8027)

Step-by-Step Solution:

= ( ( 3 6 0 0 2 - - 2 0 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 ) - 8 0 2 7

= ( 3 6 0 0 2 - - 2 0 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 - 8 0 2 7

= ( 3 6 0 2 2 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 - 8 0 2 7

= ( - 6 0 2 6 3 0 5 - - 1 5 ) - - 8 7 - 8 0 2 7

= - 6 0 2 6 3 0 5 - - 1 5 - - 8 7 - 8 0 2 7

= - 6 0 2 6 2 9 0 - - 8 7 - 8 0 2 7

= - 6 0 2 6 2 0 3 - 8 0 2 7

= - 6 0 3 4 2 3 0 <eos>

Metrics: Loss: 0.0068 | Accuracy: 100.00% (195/195) | Confidence: 99.52%


Motivation

Current AI models struggle with:

  • Mathematical reasoning beyond pattern matching
  • Generalization to novel problems
  • Structured thinking and proof generation
  • Understanding vs. computation

We need models that think mathematically, not just compute.


Key Insights

  1. Mathematical reasoning requires structural understanding
  2. Chain-of-Thought reasoning makes thinking explicit
  3. Curriculum learning enables mastery of complex concepts
  4. Transformers can be math-native from the ground up
  5. Structured reasoning principles generalize across domains

The system evolved from a prototype to 3000+ lines of production code in ~48 hours

99% of the code was generated by Gemini-2.5-Pro using Cursor


Math-Native Architecture

Structure-Aware Design

  • Standard decoder-only transformer
  • Mathematical tokenizer
  • Sinusoidal encoding
  • Place-value encoding
  • Hierarchical operations

Key Differences from LLMs

  • Not initialized from language models
  • Mathematical structure directly encoded
  • Optimized for step-by-step reasoning
  • Patterns transferable to other domains
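
To make the structure-aware input concrete, here is a minimal sketch of how token identity, place value, and structural level could be combined via sinusoidal encodings, as the next slide's example suggests. The names (`MathEmbedding`, `sinusoid`) and the summation rule are assumptions for illustration, not the project's actual code:

import torch
import torch.nn as nn

def sinusoid(values: torch.Tensor, d_model: int) -> torch.Tensor:
    # Standard transformer sinusoidal formula, applied to arbitrary
    # integer features (position, place value, structural level).
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = values.unsqueeze(-1).float() / (10000 ** (i / d_model))
    enc = torch.zeros(*values.shape, d_model)
    enc[..., 0::2] = torch.sin(angle)
    enc[..., 1::2] = torch.cos(angle)
    return enc

class MathEmbedding(nn.Module):
    # Sketch of a structure-aware input layer: learned token embeddings
    # plus sinusoidal encodings of position, place value, and level.
    # Summing the four signals is an assumption.
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, token_ids, place_values, level_ids):
        positions = torch.arange(token_ids.size(-1))
        return (self.tok(token_ids)
                + sinusoid(positions, self.d_model)
                + sinusoid(place_values, self.d_model)
                + sinusoid(level_ids, self.d_model))

emb = MathEmbedding(vocab_size=20, d_model=64)
x = emb(torch.tensor([[2, 3, 14]]),    # token IDs
        torch.tensor([[1, 0, -1]]),    # place values
        torch.tensor([[4, 4, 3]]))     # structural level IDs
print(x.shape)  # torch.Size([1, 3, 64])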

Encoding Process

Example expression: 23 - (5 + 7)

# Tokenization:  ['2', '3', '-', '(', '5', '+', '7', ')', '\n']
# Token IDs:     [2, 3, 14, 16, 5, 10, 7, 17, 12]
# Place Values:  [1, 0, -1, -1, 0, -1, 0, -1, -1]
# Level IDs:     [4, 4, 3, 1, 5, 2, 5, 1, 0]
# Sinusoidal encoding converts these to dense vectors

Note: Structural Level ID assignments evolve as our parser improves.
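
A minimal sketch of the character-level tokenizer and the place-value pass that could reproduce the numbers above. The helper names are hypothetical, and token IDs depend on the project's actual vocabulary ordering, so they are omitted here:

def tokenize(expr: str):
    # Character-level tokens, whitespace dropped, newline as end marker
    return [c for c in expr if c != " "] + ["\n"]

def place_values(tokens):
    # Digits get their decimal place (ones = 0, tens = 1, ...);
    # all non-digit tokens get -1.
    values = [-1] * len(tokens)
    i = 0
    while i < len(tokens):
        if tokens[i].isdigit():
            j = i
            while j < len(tokens) and tokens[j].isdigit():
                j += 1
            for k in range(i, j):
                values[k] = j - 1 - k  # distance from the ones digit
            i = j
        else:
            i += 1
    return values

tokens = tokenize("23 - (5 + 7)")
print(tokens)                # ['2', '3', '-', '(', '5', '+', '7', ')', '\n']
print(place_values(tokens))  # [1, 0, -1, -1, 0, -1, 0, -1, -1]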


Synthetic Chain-of-Thought

Input: (((36002--20+-6062327--15)--87)-8027)

CoT Steps:

= ( ( 3 6 0 0 2 - - 2 0 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 ) - 8 0 2 7

= ( 3 6 0 0 2 - - 2 0 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 - 8 0 2 7

= ( 3 6 0 2 2 + - 6 0 6 2 3 2 7 - - 1 5 ) - - 8 7 - 8 0 2 7

= ( - 6 0 2 6 3 0 5 - - 1 5 ) - - 8 7 - 8 0 2 7

= - 6 0 2 6 3 0 5 - - 1 5 - - 8 7 - 8 0 2 7

= - 6 0 2 6 2 9 0 - - 8 7 - 8 0 2 7

= - 6 0 2 6 2 0 3 - 8 0 2 7

= - 6 0 3 4 2 3 0 <eos>

Generated CoT enables:

  • Explicit intermediate steps
  • Dense supervision signals
  • Automated verification
  • Human-like "showing work"
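
For illustration, a small reducer along these lines can emit such traces: repeatedly unwrap a parenthesized number or evaluate the leftmost addition/subtraction, recording the expression after every step. This is a sketch with a simplified reduction order, so it will not reproduce the trace above step for step:

import re

NUM = r"-?\d+"
BINOP = re.compile(rf"({NUM})([+-])({NUM})")    # leftmost "a op b"
WRAPPED = re.compile(rf"\(({NUM})\)")           # "(n)" -> "n"
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}

def cot_steps(expr: str):
    # Yield the expression after each single reduction step.
    expr = expr.replace(" ", "")
    yield expr
    while not re.fullmatch(NUM, expr):
        m = WRAPPED.search(expr)
        if m is not None:
            value = m.group(1)                  # drop redundant parens
        else:
            m = BINOP.search(expr)
            a, op, b = m.groups()
            value = str(OPS[op](int(a), int(b)))
        expr = expr[:m.start()] + value + expr[m.end():]
        yield expr

for step in cot_steps("(((36002--20+-6062327--15)--87)-8027)"):
    print("=", step)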

Curriculum Learning Strategy

Stage    Complexity              Example
1        Single-digit addition   2+3
4        Negative numbers        -25+17
15       Parentheses             (45-12)+(82-3)
51       Complex expressions     (((36002--20+-6062327--15)--87)-8027)
Future   Physics equations       F = m·a, E = mc²

Current: Linear progression

Future: Graph-based curriculum with model-chosen paths
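
As a concrete illustration of stage-conditioned data generation, here is a hypothetical sampler whose difficulty thresholds mirror the table above; the function name and the exact knob formulas are guesses, not the project's actual curriculum:

import random

def sample_expression(stage: int) -> str:
    # Difficulty knobs grow with stage; thresholds follow the table.
    max_digits = min(1 + stage // 5, 7)   # longer operands later
    n_terms = min(2 + stage // 10, 6)     # more operands later
    allow_neg = stage >= 4                # negatives from stage 4
    allow_parens = stage >= 15            # parentheses from stage 15

    def operand() -> str:
        n = random.randint(0, 10 ** max_digits - 1)
        sign = "-" if allow_neg and random.random() < 0.5 else ""
        return sign + str(n)

    expr = operand()
    for _ in range(n_terms - 1):
        expr += random.choice("+-") + operand()
        if allow_parens and random.random() < 0.3:
            expr = "(" + expr + ")"
    return expr

random.seed(0)
print(sample_expression(1))   # e.g. a single-digit sum
print(sample_expression(51))  # e.g. a nested multi-term expression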


Dynamic Model Growth

  • Parameters expand as task complexity increases
  • Architecture adapts to mastery metrics
  • Weight transfer preserves learned capabilities
if val_accuracy > MASTERY_THRESHOLD:
    next_stage = current_stage + 1
else:
    # Grow model capacity
    new_d_model = min(d_model * 2, MAX_D_MODEL)
    new_layers = min(num_layers + 1, MAX_LAYERS)
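
The weight-transfer step might look like the following sketch (assuming PyTorch nn.Linear layers; the project's actual scheme may differ): old parameters are copied into the top-left block of each grown layer, so prior capabilities carry over while the added rows and columns keep their fresh initialization.

import torch
import torch.nn as nn

def transfer_linear(old: nn.Linear, new: nn.Linear) -> None:
    # Copy the old weights into the top-left block of the larger layer;
    # newly added rows/columns keep their fresh initialization.
    with torch.no_grad():
        out_f, in_f = old.weight.shape
        new.weight[:out_f, :in_f].copy_(old.weight)
        if old.bias is not None and new.bias is not None:
            new.bias[:out_f].copy_(old.bias)

# Example: growing d_model from 256 to 512
old_proj = nn.Linear(256, 256)
new_proj = nn.Linear(512, 512)
transfer_linear(old_proj, new_proj)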

Future: Graph-based curriculum with self-directed learning paths.


Results at Stage 51

Model Architecture

  • Parameters: ~13-15 million (62 MB)
  • d_model: 512 (64→128→256→512)
  • Layers: 4 (2→3→4)
  • Attention heads: 8

Performance

  • 100% accuracy (195/195) on complex expressions
  • Loss: 0.0068
  • Confidence: 99.52%
  • Fully deterministic: the same input always produces the same output
  • ~10x smaller than GPT-2 (124M parameters)

Key insight: a domain-specific architecture can deliver deterministic, reliable mathematical reasoning at a fraction of the size of general-purpose models.
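
Determinism of this kind typically comes from pure greedy (argmax) decoding with no sampling; a minimal sketch, assuming only that the model maps token IDs to next-token logits:

import torch

@torch.no_grad()
def greedy_decode(model, input_ids, eos_id, max_len=256):
    # Always take the argmax token: no temperature, no sampling, so the
    # same input always yields the same output.
    for _ in range(max_len):
        logits = model(input_ids)              # (batch=1, seq, vocab)
        next_id = logits[0, -1].argmax().item()
        next_tok = torch.tensor([[next_id]], dtype=input_ids.dtype)
        input_ids = torch.cat([input_ids, next_tok], dim=1)
        if next_id == eos_id:
            break
    return input_ids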


What's Next

Ready Now

  • 3000+ lines of code
  • Full test coverage
  • Addition & subtraction
  • Parentheses support
  • Deterministic reasoning

Next Steps

  1. Test on GSM8K word problems
  2. Explore hybrid parsing approaches
  3. Multiplication and division (×, ÷)
  4. Algebra: 2x = 10, x = ?
  5. Symbols: π, e, ∑, ∫

Investigating whether specialized architectures can compete with general models


Conclusion

We've demonstrated that:

  • Structure-aware architectures enable genuine mathematical understanding
  • Curriculum learning creates robust reasoning
  • Models can grow dynamically to match complexity
  • Mathematical cognition provides foundation for broader understanding

Mathematics as the foundation of genuine AI understanding.

Our approach builds systematic mathematical cognition from first principles.


See It In Action

Live Training Demonstration

Watch the model learn complex mathematical reasoning in real-time

Current Model:

Stage 51+ | 13-15M Parameters | d_model=512 | 4 Layers | Fully Deterministic

Dynamic curriculum learning with self-adapting architecture


Questions?

Feng Ye

Butterfly - Machine Learning Capabilities

Pretraining on math & physics, post-training on human languages

Try the Code on GitHub!

(Warning: May be slow on your Mac 😅)

View Slides Online

🤔
I feel math is hard
So I'm training a model to learn math for me instead
🤖📚
- Modern Problems Require Modern Solutions