---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- qwen2.5
- grpo
- rlhf
- math
- reasoning
- ms-swift
datasets:
- AI-MO/NuminaMath-TIR
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# Qwen2.5-7B-Instruct-GRPO-Math

This model is a fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), trained with **GRPO (Group Relative Policy Optimization)** on mathematical reasoning tasks.

## Model Description

- **Base Model**: Qwen2.5-7B-Instruct
- **Training Method**: GRPO (reinforcement learning)
- **Training Framework**: [ms-swift](https://github.com/modelscope/ms-swift)
- **Training Data**: [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) (500 samples)
- **Hardware**: 1x NVIDIA H100 PCIe (80GB)
- **Training Time**: ~2.5 hours

## Training Details

### Training Configuration

```bash
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --reward_funcs accuracy format \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --torch_dtype bfloat16 \
    --dataset 'AI-MO/NuminaMath-TIR#500' \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-5 \
    --num_generations 2
```

The `accuracy` and `format` reward functions score each sampled completion on answer correctness and on output structure; an illustrative sketch of such reward functions is included at the end of this card.

### Training Metrics

- **Final Loss**: 0.00011567
- **Math Accuracy**: 70%
- **Reward**: 0.7
- **Training Steps**: 500

## Usage

### Using with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "FutureMa/Qwen2.5-7B-Instruct-GRPO-Math"
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Generate
messages = [
    {"role": "user", "content": "Solve for x: 2x^2 - 3x + 1 = 0"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This repository contains the LoRA adapter, so the base model is loaded first and the adapter applied on top, as shown above. To serve the model without a runtime `peft` dependency, the adapter can be merged into the base weights; see the sketch at the end of this card.

### Using with ms-swift

```bash
# Inference
swift infer \
    --ckpt_dir FutureMa/Qwen2.5-7B-Instruct-GRPO-Math \
    --eval_human false
```

## Intended Use

This model is optimized for:

- ✅ Mathematical reasoning and problem-solving
- ✅ Step-by-step solution generation
- ✅ Algebraic equation solving
- ✅ Arithmetic calculations

## Limitations

- Trained on a relatively small dataset (500 samples)
- May not generalize well to very complex mathematical problems
- LoRA fine-tuning may have limited capacity compared to full fine-tuning

## Citation

```bibtex
@misc{qwen2.5-grpo-math,
  author       = {FutureMa},
  title        = {Qwen2.5-7B-Instruct Fine-tuned with GRPO on Math Tasks},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FutureMa/Qwen2.5-7B-Instruct-GRPO-Math}}
}
```

## Acknowledgments

- Base model: [Qwen Team](https://huggingface.co/Qwen)
- Training framework: [ms-swift](https://github.com/modelscope/ms-swift)
- Dataset: [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR)
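
## Appendix: Reward Functions (Illustrative)

The training command uses `--reward_funcs accuracy format`, meaning GRPO scores every sampled completion on whether the final answer is correct and whether the response follows the expected structure. The snippet below is only a minimal sketch of what such reward functions typically look like; it is **not** ms-swift's implementation, and the `<think>/<answer>` tag convention is an assumption for illustration.

```python
import re

def format_reward(completion: str) -> float:
    # Reward 1.0 if the completion follows a <think>...</think><answer>...</answer>
    # layout (an assumed convention), 0.0 otherwise.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    # Reward 1.0 if the text inside <answer>...</answer> matches the reference after
    # whitespace normalization; a real implementation would compare parsed math
    # expressions rather than raw strings.
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    predicted = match.group(1).strip() if match else completion.strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Example: a well-formatted, correct completion gets both rewards
completion = "<think>2x^2 - 3x + 1 = (2x - 1)(x - 1)</think><answer>x = 1/2, x = 1</answer>"
print(format_reward(completion))                      # 1.0
print(accuracy_reward(completion, "x = 1/2, x = 1"))  # 1.0
```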
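
## Appendix: Merging the LoRA Adapter

To deploy the model without a runtime `peft` dependency (for example with plain `transformers` or a serving engine), the adapter can be folded into the base weights. This is a minimal sketch using `peft`'s `merge_and_unload()`; the output directory name is arbitrary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)

# Attach the GRPO LoRA adapter, then fold its weights into the base model
model = PeftModel.from_pretrained(base_model, "FutureMa/Qwen2.5-7B-Instruct-GRPO-Math")
merged_model = model.merge_and_unload()

# Save the standalone model (directory name is just an example)
merged_model.save_pretrained("qwen2.5-7b-grpo-math-merged")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer.save_pretrained("qwen2.5-7b-grpo-math-merged")
```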