Qwen2.5-7B-Instruct-GRPO-Math
This model is a fine-tuned version of Qwen/Qwen2.5-7B-Instruct using GRPO (Group Relative Policy Optimization) on mathematical reasoning tasks.
Model Description
- Base Model: Qwen2.5-7B-Instruct
- Training Method: GRPO (reinforcement learning; see the brief overview below)
- Training Framework: ms-swift
- Training Data: AI-MO/NuminaMath-TIR (500 samples)
- Hardware: 1x NVIDIA H100 PCIe (80GB)
- Training Time: ~2.5 hours
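GRPO in brief (standard formulation, not specific to this run): for each prompt a small group of completions is sampled (here `--num_generations 2`), and each completion's advantage is its reward normalized within that group, which removes the need for a separate value model:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$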
Training Details
Training Configuration
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-7B-Instruct \
--reward_funcs accuracy format \
--train_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--torch_dtype bfloat16 \
--dataset 'AI-MO/NuminaMath-TIR#500' \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--learning_rate 5e-5 \
--num_generations 2
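The `accuracy` and `format` reward functions above are ms-swift built-ins. The sketch below only illustrates what rule-based rewards of this kind typically check; the function names and matching logic are simplified assumptions, not the actual ms-swift implementation.

import re

def format_reward(completion: str) -> float:
    # 1.0 if the completion presents a final answer in \boxed{...}, else 0.0
    return 1.0 if re.search(r"\\boxed\{.+?\}", completion) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    # 1.0 if the boxed answer matches the reference answer exactly, else 0.0
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0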
Training Metrics
- Final Loss: 0.00011567
- Math Accuracy: 70%
- Reward: 0.7
- Training Steps: 500
Usage
Using with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-7B-Instruct",
torch_dtype="auto",
device_map="auto"
)
# Load LoRA adapter
model = PeftModel.from_pretrained(
base_model,
"FutureMa/Qwen2.5-7B-Instruct-GRPO-Math"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Generate
messages = [
{"role": "user", "content": "Solve for x: 2x^2 - 3x + 1 = 0"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
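If you prefer a standalone checkpoint without loading a PEFT adapter at inference time, the LoRA weights can be merged into the base model using the standard peft API (the output path below is just an example):

# Merge the adapter loaded above into the base weights and save a standalone copy
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen2.5-7b-instruct-grpo-math-merged")  # example path
tokenizer.save_pretrained("qwen2.5-7b-instruct-grpo-math-merged")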
Using with ms-swift
# Inference
swift infer \
--ckpt_dir FutureMa/Qwen2.5-7B-Instruct-GRPO-Math \
--eval_human false
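The adapter can also be merged with ms-swift before serving. The command below follows the ms-swift export documentation; argument names have changed between releases, so verify the flags against your installed version:

swift export \
  --ckpt_dir FutureMa/Qwen2.5-7B-Instruct-GRPO-Math \
  --merge_lora true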
Intended Use
This model is optimized for:
- ✅ Mathematical reasoning and problem-solving
- ✅ Step-by-step solution generation
- ✅ Algebraic equation solving (a quick answer-verification sketch follows this list)
- ✅ Arithmetic calculations
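For example, the quadratic from the usage section can be checked symbolically against the model's final answer. This is an illustrative snippet using sympy, which is not a dependency of the model itself:

# Verify a model answer for 2x^2 - 3x + 1 = 0 symbolically
from sympy import Eq, Rational, solve, symbols

x = symbols("x")
roots = solve(Eq(2 * x**2 - 3 * x + 1, 0), x)
print(roots)  # [1/2, 1]
assert set(roots) == {Rational(1, 2), 1}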
Limitations
- Trained on a relatively small dataset (500 samples)
- May not generalize well to very complex mathematical problems
- LoRA fine-tuning may have limited capacity compared to full fine-tuning
Citation
@misc{qwen2.5-grpo-math,
author = {FutureMa},
title = {Qwen2.5-7B-Instruct Fine-tuned with GRPO on Math Tasks},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/FutureMa/Qwen2.5-7B-Instruct-GRPO-Math}}
}
Acknowledgments
- Base model: Qwen Team
- Training framework: ms-swift
- Dataset: AI-MO/NuminaMath-TIR