# Imu1-Midtrain
First small language model trained on consumer GPUs with competitive performance.
Trained on 2B tokens of publicly available post-training datasets using the NorMuon optimizer with Cautious Weight Decay, Polar Express Newton-Schulz coefficients, and a WSD learning-rate scheduler.
Custom library for training: https://github.com/thepowerfuldeez/sample_efficient_gpt
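Below is a minimal sketch of a warmup-stable-decay (WSD) learning-rate schedule with an inverse square-root decay stage, matching the stages described on this card. The step counts, peak learning rate, and decay constant are illustrative placeholders, not the values used for Imu1-Midtrain.

```python
import math

def wsd_lr(step: int,
           peak_lr: float = 3e-3,     # placeholder, not the actual run's value
           warmup_steps: int = 500,
           stable_steps: int = 4000,
           decay_steps: int = 1500) -> float:
    """Learning rate at a given optimizer step under a WSD schedule."""
    if step < warmup_steps:
        # Linear warmup from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable stage: hold the peak learning rate constant.
        return peak_lr
    # Decay stage: one common form of inverse square-root decay.
    t = step - (warmup_steps + stable_steps) + 1
    return peak_lr / math.sqrt(1.0 + 10.0 * t / decay_steps)
```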
## Training phase

Stable stage: batch size 50, context length 1024, gradient accumulation 8, i.e. 50 × 1024 × 8 ≈ 409k tokens per update step.
Decay stage: batch size 40, context length 1024, gradient accumulation 10 (same ~409k tokens per step), with inverse square-root learning-rate decay.
EMA is applied during the cooldown; it is computed post hoc (due to memory limits) by increasing the checkpoint frequency.

Total micro steps: 50,000
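A rough sketch of the post-hoc EMA mentioned above: rather than keeping a live EMA copy in memory during training, the densely saved cooldown checkpoints are averaged after the fact. The checkpoint layout (a `"model"` state-dict key) and the decay value are assumptions for illustration.

```python
import torch

def post_hoc_ema(checkpoint_paths, decay=0.95):
    """Exponential moving average over a chronological list of checkpoints."""
    ema_state = None
    for path in checkpoint_paths:  # ordered oldest -> newest
        # Assumes each checkpoint stores its weights under a "model" key.
        state = torch.load(path, map_location="cpu")["model"]
        if ema_state is None:
            ema_state = {k: v.float().clone() for k, v in state.items()}
            continue
        for k, v in state.items():
            # ema <- decay * ema + (1 - decay) * current
            ema_state[k].mul_(decay).add_(v.float(), alpha=1.0 - decay)
    return ema_state
```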
## Evals

| Benchmark | Score |
| --- | --- |
| ARC-Easy | 0.4402 |
| ARC-Challenge | 0.3490 |
| MMLU | 0.3350 |
| GSM8K | 0.0129 |
| HumanEval | 0.1037 |
| ChatCORE metric | 0.1231 |
## Inference

A custom fork of transformers is required:
```bash
uv pip install "git+https://github.com/thepowerfuldeez/transformers.git@imu1"
```
A custom fork of vLLM is required:
```bash
uv pip install "git+https://github.com/thepowerfuldeez/vllm.git@imu1"
```
The chat template uses custom tokens such as `<bos>`, `<user_start>`, `<user_end>`, `<assistant_start>`, and `<assistant_end>`.
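A minimal usage sketch with the forked transformers build, assuming the model is published under a repo id like `thepowerfuldeez/imu1-midtrain` (placeholder); the chat template inserts the custom tokens for you.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "thepowerfuldeez/imu1-midtrain"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# apply_chat_template wraps the conversation with the custom tokens
# (<bos>, <user_start>, <user_end>, <assistant_start>, <assistant_end>).
messages = [{"role": "user", "content": "What is the capital of France?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```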