---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- qwen2.5
- grpo
- rlhf
- math
- reasoning
- ms-swift
datasets:
- AI-MO/NuminaMath-TIR
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# Qwen2.5-7B-Instruct-GRPO-Math

This model is a fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), trained with **GRPO (Group Relative Policy Optimization)** on mathematical reasoning tasks.

## Model Description

- **Base Model**: Qwen2.5-7B-Instruct
- **Training Method**: GRPO (reinforcement learning)
- **Training Framework**: [ms-swift](https://github.com/modelscope/ms-swift)
- **Training Data**: [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) (500 samples)
- **Hardware**: 1x NVIDIA H100 PCIe (80GB)
- **Training Time**: ~2.5 hours

## Training Details

### Training Configuration

```bash
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --reward_funcs accuracy format \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --torch_dtype bfloat16 \
    --dataset 'AI-MO/NuminaMath-TIR#500' \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-5 \
    --num_generations 2
```

The `accuracy` and `format` reward functions score each sampled completion on answer correctness and on output structure; an illustrative sketch of such reward functions is included at the end of this card.

### Training Metrics

- **Final Loss**: 0.00011567
- **Math Accuracy**: 70%
- **Reward**: 0.7
- **Training Steps**: 500

## Usage

### Using with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "FutureMa/Qwen2.5-7B-Instruct-GRPO-Math"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Generate
messages = [
    {"role": "user", "content": "Solve for x: 2x^2 - 3x + 1 = 0"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This repository contains the LoRA adapter, so the base model is loaded first and the adapter applied on top, as shown above. To serve the model without a runtime `peft` dependency, the adapter can be merged into the base weights; see the sketch at the end of this card.

### Using with ms-swift

```bash
# Inference
swift infer \
    --ckpt_dir FutureMa/Qwen2.5-7B-Instruct-GRPO-Math \
    --eval_human false
```

## Intended Use

This model is optimized for:

- ✅ Mathematical reasoning and problem-solving
- ✅ Step-by-step solution generation
- ✅ Algebraic equation solving
- ✅ Arithmetic calculations

## Limitations

- Trained on a relatively small dataset (500 samples)
- May not generalize well to very complex mathematical problems
- LoRA fine-tuning may have limited capacity compared to full fine-tuning

## Citation

```bibtex
@misc{qwen2.5-grpo-math,
  author       = {FutureMa},
  title        = {Qwen2.5-7B-Instruct Fine-tuned with GRPO on Math Tasks},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FutureMa/Qwen2.5-7B-Instruct-GRPO-Math}}
}
```

## Acknowledgments

- Base model: [Qwen Team](https://huggingface.co/Qwen)
- Training framework: [ms-swift](https://github.com/modelscope/ms-swift)
- Dataset: [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR)
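
## Appendix: Reward Functions (Illustrative)

The training command uses `--reward_funcs accuracy format`, meaning GRPO scores every sampled completion on whether the final answer is correct and whether the response follows the expected structure. The snippet below is only a minimal sketch of what such reward functions typically look like; it is **not** ms-swift's implementation, and the `<think>/<answer>` tag convention is an assumption for illustration.

```python
import re

def format_reward(completion: str) -> float:
    # Reward 1.0 if the completion follows a <think>...</think><answer>...</answer>
    # layout (an assumed convention), 0.0 otherwise.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    # Reward 1.0 if the text inside <answer>...</answer> matches the reference after
    # whitespace normalization; a real implementation would compare parsed math
    # expressions rather than raw strings.
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    predicted = match.group(1).strip() if match else completion.strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Example: a well-formatted, correct completion gets both rewards
completion = "<think>2x^2 - 3x + 1 = (2x - 1)(x - 1)</think><answer>x = 1/2, x = 1</answer>"
print(format_reward(completion))                      # 1.0
print(accuracy_reward(completion, "x = 1/2, x = 1"))  # 1.0
```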
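
## Appendix: Merging the LoRA Adapter

To deploy the model without a runtime `peft` dependency (for example with plain `transformers` or a serving engine), the adapter can be folded into the base weights. This is a minimal sketch using `peft`'s `merge_and_unload()`; the output directory name is arbitrary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Attach the GRPO LoRA adapter, then fold its weights into the base model
model = PeftModel.from_pretrained(base_model, "FutureMa/Qwen2.5-7B-Instruct-GRPO-Math")
merged_model = model.merge_and_unload()

# Save the standalone model (directory name is just an example)
merged_model.save_pretrained("qwen2.5-7b-grpo-math-merged")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer.save_pretrained("qwen2.5-7b-grpo-math-merged")
```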