Qwen2.5-7B-Instruct-GRPO-Math
This model is a fine-tuned version of Qwen/Qwen2.5-7B-Instruct using GRPO (Group Relative Policy Optimization) on mathematical reasoning tasks.
Model Description
- Base Model: Qwen2.5-7B-Instruct
- Training Method: GRPO (reinforcement learning; see the brief overview below)
- Training Framework: ms-swift
- Training Data: AI-MO/NuminaMath-TIR (500 samples)
- Hardware: 1x NVIDIA H100 PCIe (80GB)
- Training Time: ~2.5 hours
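GRPO in brief (standard formulation, not specific to this run): for each prompt a small group of completions is sampled (here `--num_generations 2`), and each completion's advantage is its reward normalized within that group, which removes the need for a separate value model:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}$$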
Training Details
Training Configuration
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type grpo \
--model Qwen/Qwen2.5-7B-Instruct \
--reward_funcs accuracy format \
--train_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--torch_dtype bfloat16 \
--dataset 'AI-MO/NuminaMath-TIR#500' \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--learning_rate 5e-5 \
--num_generations 2
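The `accuracy` and `format` reward functions above are ms-swift built-ins. The sketch below only illustrates what rule-based rewards of this kind typically check; the function names and matching logic are simplified assumptions, not the actual ms-swift implementation.

import re

def format_reward(completion: str) -> float:
    # 1.0 if the completion presents a final answer in \boxed{...}, else 0.0
    return 1.0 if re.search(r"\\boxed\{.+?\}", completion) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    # 1.0 if the boxed answer matches the reference answer exactly, else 0.0
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    return 1.0 if match and match.group(1).strip() == reference_answer.strip() else 0.0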
Training Metrics
- Final Loss: 0.00011567
- Math Accuracy: 70%
- Reward: 0.7
- Training Steps: 500
Usage
Using with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen2.5-7B-Instruct",
torch_dtype="auto",
device_map="auto"
)
# Load LoRA adapter
model = PeftModel.from_pretrained(
base_model,
"FutureMa/Qwen2.5-7B-Instruct-GRPO-Math"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Generate
messages = [
{"role": "user", "content": "Solve for x: 2x^2 - 3x + 1 = 0"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
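If you prefer a standalone checkpoint without loading a PEFT adapter at inference time, the LoRA weights can be merged into the base model using the standard peft API (the output path below is just an example):

# Merge the adapter loaded above into the base weights and save a standalone copy
merged_model = model.merge_and_unload()
merged_model.save_pretrained("qwen2.5-7b-instruct-grpo-math-merged")  # example path
tokenizer.save_pretrained("qwen2.5-7b-instruct-grpo-math-merged")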
Using with ms-swift
# Inference
swift infer \
--ckpt_dir FutureMa/Qwen2.5-7B-Instruct-GRPO-Math \
--eval_human false
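The adapter can also be merged with ms-swift before serving. The command below follows the ms-swift export documentation; argument names have changed between releases, so verify the flags against your installed version:

swift export \
  --ckpt_dir FutureMa/Qwen2.5-7B-Instruct-GRPO-Math \
  --merge_lora true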
Intended Use
This model is optimized for:
- ✅ Mathematical reasoning and problem-solving
- ✅ Step-by-step solution generation
- ✅ Algebraic equation solving (a quick answer-verification sketch follows this list)
- ✅ Arithmetic calculations
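For example, the quadratic from the usage section can be checked symbolically against the model's final answer. This is an illustrative snippet using sympy, which is not a dependency of the model itself:

# Verify a model answer for 2x^2 - 3x + 1 = 0 symbolically
from sympy import Eq, Rational, solve, symbols

x = symbols("x")
roots = solve(Eq(2 * x**2 - 3 * x + 1, 0), x)
print(roots)  # [1/2, 1]
assert set(roots) == {Rational(1, 2), 1}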
Limitations
- Trained on a relatively small dataset (500 samples)
- May not generalize well to very complex mathematical problems
- LoRA fine-tuning may have limited capacity compared to full fine-tuning
Citation
@misc{qwen2.5-grpo-math,
author = {FutureMa},
title = {Qwen2.5-7B-Instruct Fine-tuned with GRPO on Math Tasks},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/FutureMa/Qwen2.5-7B-Instruct-GRPO-Math}}
}
Acknowledgments
- Base model: Qwen Team
- Training framework: ms-swift
- Dataset: AI-MO/NuminaMath-TIR