# Imu1-Midtrain
First small language model trained on consumer GPUs with competitive performance.
Trained on 2B tokens of publicly available post-training datasets using the NorMuon optimizer with Cautious Weight Decay, Polar Express Newton-Schulz coefficients, and a WSD learning-rate scheduler.
Custom library for training: https://github.com/thepowerfuldeez/sample_efficient_gpt
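Below is a minimal sketch of a warmup-stable-decay (WSD) learning-rate schedule with an inverse square-root decay stage, matching the stages described on this card. The step counts, peak learning rate, and decay constant are illustrative placeholders, not the values used for Imu1-Midtrain.

```python
import math

def wsd_lr(step: int,
           peak_lr: float = 3e-3,     # placeholder, not the actual run's value
           warmup_steps: int = 500,
           stable_steps: int = 4000,
           decay_steps: int = 1500) -> float:
    """Learning rate at a given optimizer step under a WSD schedule."""
    if step < warmup_steps:
        # Linear warmup from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Stable stage: hold the peak learning rate constant.
        return peak_lr
    # Decay stage: one common form of inverse square-root decay.
    t = step - (warmup_steps + stable_steps) + 1
    return peak_lr / math.sqrt(1.0 + 10.0 * t / decay_steps)
```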
## Training phase

Stable stage: batch size 50, context length 1024, gradient accumulation 8, i.e. 50 × 1024 × 8 ≈ 409k tokens per update step.
Decay stage: batch size 40, context length 1024, gradient accumulation 10 (same ~409k tokens per step), with inverse square-root learning-rate decay.
EMA is applied during the cooldown; it is computed post hoc (due to memory limits) by increasing the checkpoint frequency.

Total micro steps: 50,000
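A rough sketch of the post-hoc EMA mentioned above: rather than keeping a live EMA copy in memory during training, the densely saved cooldown checkpoints are averaged after the fact. The checkpoint layout (a `"model"` state-dict key) and the decay value are assumptions for illustration.

```python
import torch

def post_hoc_ema(checkpoint_paths, decay=0.95):
    """Exponential moving average over a chronological list of checkpoints."""
    ema_state = None
    for path in checkpoint_paths:  # ordered oldest -> newest
        # Assumes each checkpoint stores its weights under a "model" key.
        state = torch.load(path, map_location="cpu")["model"]
        if ema_state is None:
            ema_state = {k: v.float().clone() for k, v in state.items()}
            continue
        for k, v in state.items():
            # ema <- decay * ema + (1 - decay) * current
            ema_state[k].mul_(decay).add_(v.float(), alpha=1.0 - decay)
    return ema_state
```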
## Evals

| Benchmark | Score |
| --- | --- |
| ARC-Easy | 0.4402 |
| ARC-Challenge | 0.3490 |
| MMLU | 0.3350 |
| GSM8K | 0.0129 |
| HumanEval | 0.1037 |
| ChatCORE metric | 0.1231 |
## Inference

A custom fork of transformers is required:
```bash
uv pip install "git+https://github.com/thepowerfuldeez/transformers.git@imu1"
```
A custom fork of vLLM is required:
```bash
uv pip install "git+https://github.com/thepowerfuldeez/vllm.git@imu1"
```
The chat template uses custom tokens such as `<bos>`, `<user_start>`, `<user_end>`, `<assistant_start>`, and `<assistant_end>`.
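A minimal usage sketch with the forked transformers build, assuming the model is published under a repo id like `thepowerfuldeez/imu1-midtrain` (placeholder); the chat template inserts the custom tokens for you.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "thepowerfuldeez/imu1-midtrain"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# apply_chat_template wraps the conversation with the custom tokens
# (<bos>, <user_start>, <user_end>, <assistant_start>, <assistant_end>).
messages = [{"role": "user", "content": "What is the capital of France?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```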