Upload GRPO fine-tuned Qwen2.5-7B-Instruct model

Files changed (10)

README.md +134 -0
adapter_config.json +46 -0
adapter_model.safetensors +3 -0
additional_config.json +1 -0
args.json +475 -0
optimizer.pt +3 -0
rng_state.pth +3 -0
scheduler.pt +3 -0
trainer_state.json +2458 -0
training_args.bin +3 -0

README.md ADDED

	@@ -0,0 +1,134 @@

+---
+license: apache-2.0
+base_model: Qwen/Qwen2.5-7B-Instruct
+tags:
+  - qwen2.5
+  - grpo
+  - rlhf
+  - math
+  - reasoning
+  - ms-swift
+datasets:
+  - AI-MO/NuminaMath-TIR
+language:
+  - en
+library_name: peft
+pipeline_tag: text-generation
+---
+# Qwen2.5-7B-Instruct-GRPO-Math
+This repository contains a LoRA adapter for [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), trained with **GRPO (Group Relative Policy Optimization)** on mathematical reasoning problems.
+## Model Description
+- **Base Model**: Qwen2.5-7B-Instruct
+- **Training Method**: GRPO (reinforcement learning) with LoRA (rank 8, alpha 32, all linear layers)
+- **Training Framework**: [ms-swift](https://github.com/modelscope/ms-swift)
+- **Training Data**: [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR) (500 samples)
+- **Hardware**: 1x NVIDIA H100 PCIe (80GB)
+- **Training Time**: ~2.5 hours
+## Training Details
+### Training Configuration
+```bash
+CUDA_VISIBLE_DEVICES=0 \
+swift rlhf \
+    --rlhf_type grpo \
+    --model Qwen/Qwen2.5-7B-Instruct \
+    --reward_funcs accuracy format \
+    --train_type lora \
+    --lora_rank 8 \
+    --lora_alpha 32 \
+    --target_modules all-linear \
+    --torch_dtype bfloat16 \
+    --dataset 'AI-MO/NuminaMath-TIR#500' \
+    --num_train_epochs 1 \
+    --per_device_train_batch_size 2 \
+    --learning_rate 5e-5 \
+    --num_generations 2
+```
+### Training Metrics
+- **Final Loss**: 0.00011567
+- **Math Accuracy**: 70% (training-time accuracy reward; the run had no held-out evaluation set)
+- **Reward**: 0.7
+- **Training Steps**: 500
+## Usage
+### Using with Transformers
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel
+# Load base model
+base_model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen2.5-7B-Instruct",
+    torch_dtype="auto",
+    device_map="auto"
+)
+# Load LoRA adapter
+model = PeftModel.from_pretrained(
+    base_model,
+    "FutureMa/Qwen2.5-7B-Instruct-GRPO-Math"
+)
+tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
+# Generate
+messages = [
+    {"role": "user", "content": "Solve for x: 2x^2 - 3x + 1 = 0"}
+]
+text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer([text], return_tensors="pt").to(model.device)
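+outputs = model.generate(**inputs, max_new_tokens=512)
+# Decode only the newly generated tokens, not the echoed prompt
+new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+print(tokenizer.decode(new_tokens, skip_special_tokens=True))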
+```
+### Using with ms-swift
+```bash
+# Inference (ms-swift 3.x style: load the adapter from the Hugging Face Hub)
+swift infer \
+    --adapters FutureMa/Qwen2.5-7B-Instruct-GRPO-Math \
+    --use_hf true
+```
+## Intended Use
+This adapter is intended for:
+- ✅ Mathematical reasoning and problem-solving
+- ✅ Step-by-step solution generation
+- ✅ Algebraic equation solving
+- ✅ Arithmetic calculations
+## Limitations
+- Trained on a relatively small dataset (500 samples)
+- May not generalize well to very complex mathematical problems
+- LoRA fine-tuning may have limited capacity compared to full fine-tuning
+## Citation
+```bibtex
+@misc{qwen2.5-grpo-math,
+  author = {FutureMa},
+  title = {Qwen2.5-7B-Instruct Fine-tuned with GRPO on Math Tasks},
+  year = {2025},
+  publisher = {HuggingFace},
+  howpublished = {\url{https://huggingface.co/FutureMa/Qwen2.5-7B-Instruct-GRPO-Math}}
+}
+```
+## Acknowledgments
+- Base model: [Qwen Team](https://huggingface.co/Qwen)
+- Training framework: [ms-swift](https://github.com/modelscope/ms-swift)
+- Dataset: [AI-MO/NuminaMath-TIR](https://huggingface.co/datasets/AI-MO/NuminaMath-TIR)
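Note on the training configuration above: the command uses `--num_generations 2`, and args.json records `"scale_rewards": "group"`. The sketch below illustrates what group-relative reward normalization means for groups of two completions. It is a minimal illustration, not ms-swift's implementation: the function name `group_relative_advantages`, the `eps` value, and the choice of population standard deviation are assumptions made here for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of all completions sampled for one prompt.

    Hypothetical helper: each completion's advantage is its reward minus the
    group mean, divided by the group standard deviation (plus a small eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; a sample std only rescales
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, two completions: one earns both the accuracy and format rewards,
# the other earns only the format reward.
print(group_relative_advantages([2.0, 1.0]))

# Both completions earn the same total reward: every advantage is zero,
# so this prompt contributes no policy-gradient signal.
print(group_relative_advantages([1.0, 1.0]))
```

With only two completions per prompt, any prompt whose two completions receive identical rewards yields zero advantages, which is one reason larger group sizes are often preferred when memory allows.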

adapter_config.json ADDED

	@@ -0,0 +1,46 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "/home/ubuntu/.cache/modelscope/hub/models/Qwen/Qwen2___5-7B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": [],
+  "peft_type": "LORA",
+  "peft_version": "0.18.0",
+  "qalora_group_size": 16,
+  "r": 8,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "gate_proj",
+    "down_proj",
+    "v_proj",
+    "up_proj",
+    "q_proj",
+    "k_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}

adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d2730d736399f1c5a46fc879d33a5540d8cf3d6c3f0a797e7e089e7922d259ed
+size 80792096
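Note on the adapter size: the LFS pointer above reports 80,792,096 bytes, and adapter_config.json lists `r: 8`, `lora_alpha: 32`, and seven target modules. The sketch below checks whether those numbers are consistent. The layer shapes are assumptions taken from the publicly documented Qwen2.5-7B architecture (hidden size 3584, MLP size 18944, 28 layers, 4 key/value heads of dimension 128); they do not appear in this commit.

```python
# Assumed Qwen2.5-7B shapes (from the public model config, not from this commit).
HIDDEN = 3584
INTERMEDIATE = 18944
KV_DIM = 4 * 128  # 4 key/value heads x head dimension 128
NUM_LAYERS = 28

# Values taken from adapter_config.json.
RANK = 8
LORA_ALPHA = 32

# (input features, output features) for each targeted linear layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Each adapted layer adds A (rank x in) and B (out x rank) matrices."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


params = lora_param_count(SHAPES, RANK, NUM_LAYERS)
scaling = LORA_ALPHA / RANK  # LoRA update is scaled by alpha / r
fp32_bytes = params * 4
file_bytes = 80792096  # size from the LFS pointer

print(f"trainable LoRA parameters: {params:,}")
print(f"bytes at 4 bytes/param:    {fp32_bytes:,}")
print(f"remaining bytes in file:   {file_bytes - fp32_bytes:,}")
print(f"LoRA scaling factor:       {scaling}")
```

Under these assumptions the parameter count at 4 bytes each accounts for all but roughly 50 KB of the file, which is plausible for the safetensors header. That suggests, but does not prove, that the adapter weights are stored in 32-bit floats.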

additional_config.json ADDED

	@@ -0,0 +1 @@


+{"lora_dtype": null, "lorap_lr_ratio": null, "lorap_emb_lr": 1e-06}
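Note on the schedule recorded in the args.json diff that follows: it lists `per_device_train_batch_size: 2`, `gradient_accumulation_steps: 1`, `num_generations: 2`, `learning_rate: 5e-05`, `lr_scheduler_type: cosine`, and `warmup_ratio: 0.05`, on 500 sampled prompts. The sketch below shows how those values could yield the 500 training steps reported in the README and what the learning-rate curve would look like. It assumes the batch size counts completions rather than prompts and uses the common linear-warmup plus cosine-decay formula; neither assumption is confirmed by this commit, and `cosine_lr` is a hypothetical helper.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_ratio):
    """Linear warmup followed by cosine decay to zero (a common schedule shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values from args.json and the README.
prompts = 500
num_generations = 2
per_device_batch = 2  # assumption: counted in completions, not prompts
grad_accum = 1
world_size = 1

# Completions produced in one epoch, divided by completions consumed per step.
total_steps = prompts * num_generations // (per_device_batch * grad_accum * world_size)
print("optimizer steps:", total_steps)

for step in (0, 25, 250, 500):
    print(step, cosine_lr(step, total_steps, base_lr=5e-5, warmup_ratio=0.05))
```

If the batch size instead counted prompts, the same settings would give 250 steps, so the README's figure of 500 is consistent with the completion-counting interpretation.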

args.json ADDED

	@@ -0,0 +1,475 @@

+{
+  "output_dir": "/home/ubuntu/ms-swift/output/grpo_qwen2.5_7b/v1-20251128-020354",
+  "overwrite_output_dir": false,
+  "do_train": false,
+  "do_eval": false,
+  "do_predict": false,
+  "eval_strategy": "no",
+  "prediction_loss_only": false,
+  "per_device_train_batch_size": 2,
+  "per_device_eval_batch_size": 2,
+  "per_gpu_train_batch_size": null,
+  "per_gpu_eval_batch_size": null,
+  "gradient_accumulation_steps": 1,
+  "eval_accumulation_steps": null,
+  "eval_delay": 0,
+  "torch_empty_cache_steps": null,
+  "learning_rate": 5e-05,
+  "weight_decay": 0.1,
+  "adam_beta1": 0.9,
+  "adam_beta2": 0.95,
+  "adam_epsilon": 1e-08,
+  "max_grad_norm": 1.0,
+  "num_train_epochs": 1.0,
+  "max_steps": -1,
+  "lr_scheduler_type": "cosine",
+  "lr_scheduler_kwargs": null,
+  "warmup_ratio": 0.05,
+  "warmup_steps": 0,
+  "log_level": "passive",
+  "log_level_replica": "warning",
+  "log_on_each_node": true,
+  "logging_dir": "/home/ubuntu/ms-swift/output/grpo_qwen2.5_7b/v1-20251128-020354/runs",
+  "logging_strategy": "steps",
+  "logging_first_step": true,
+  "logging_steps": 5,
+  "logging_nan_inf_filter": true,
+  "save_strategy": "steps",
+  "save_steps": 50.0,
+  "save_total_limit": 2,
+  "save_safetensors": true,
+  "save_on_each_node": false,
+  "save_only_model": false,
+  "restore_callback_states_from_checkpoint": false,
+  "no_cuda": false,
+  "use_cpu": false,
+  "use_mps_device": false,
+  "seed": 42,
+  "data_seed": 42,
+  "jit_mode_eval": false,
+  "bf16": true,
+  "fp16": false,
+  "fp16_opt_level": "O1",
+  "half_precision_backend": "auto",
+  "bf16_full_eval": false,
+  "fp16_full_eval": false,
+  "tf32": null,
+  "local_rank": -1,
+  "ddp_backend": null,
+  "tpu_num_cores": null,
+  "tpu_metrics_debug": false,
+  "debug": null,
+  "dataloader_drop_last": false,
+  "eval_steps": 50.0,
+  "dataloader_num_workers": 4,
+  "dataloader_prefetch_factor": null,
+  "past_index": -1,
+  "run_name": "/home/ubuntu/ms-swift/output/grpo_qwen2.5_7b/v1-20251128-020354",
+  "disable_tqdm": null,
+  "remove_unused_columns": false,
+  "label_names": null,
+  "load_best_model_at_end": false,
+  "metric_for_best_model": "loss",
+  "greater_is_better": false,
+  "ignore_data_skip": false,
+  "fsdp": null,
+  "fsdp_min_num_params": 0,
+  "fsdp_config": null,
+  "fsdp_transformer_layer_cls_to_wrap": null,
+  "accelerator_config": {
+    "dispatch_batches": false
+  },
+  "parallelism_config": null,
+  "deepspeed": null,
+  "label_smoothing_factor": 0.0,
+  "optim": "adamw_torch_fused",
+  "optim_args": null,
+  "adafactor": false,
+  "group_by_length": false,
+  "length_column_name": "length",
+  "report_to": [
+    "tensorboard"
+  ],
+  "project": "huggingface",
+  "trackio_space_id": "trackio",
+  "ddp_find_unused_parameters": null,
+  "ddp_bucket_cap_mb": null,
+  "ddp_broadcast_buffers": null,
+  "dataloader_pin_memory": true,
+  "dataloader_persistent_workers": false,
+  "skip_memory_metrics": true,
+  "use_legacy_prediction_loop": false,
+  "push_to_hub": false,
+  "resume_from_checkpoint": null,
+  "hub_model_id": null,
+  "hub_strategy": "every_save",
+  "hub_token": null,
+  "hub_private_repo": null,
+  "hub_always_push": false,
+  "hub_revision": null,
+  "gradient_checkpointing": true,
+  "gradient_checkpointing_kwargs": null,
+  "include_inputs_for_metrics": false,
+  "include_for_metrics": [],
+  "eval_do_concat_batches": true,
+  "fp16_backend": "auto",
+  "push_to_hub_model_id": null,
+  "push_to_hub_organization": null,
+  "push_to_hub_token": null,
+  "mp_parameters": "",
+  "auto_find_batch_size": false,
+  "full_determinism": false,
+  "torchdynamo": null,
+  "ray_scope": "last",
+  "ddp_timeout": 18000000,
+  "torch_compile": false,
+  "torch_compile_backend": null,
+  "torch_compile_mode": null,
+  "include_tokens_per_second": false,
+  "include_num_input_tokens_seen": false,
+  "neftune_noise_alpha": null,
+  "optim_target_modules": null,
+  "batch_eval_metrics": false,
+  "eval_on_start": false,
+  "use_liger_kernel": false,
+  "liger_kernel_config": null,
+  "eval_use_gather_object": false,
+  "average_tokens_across_devices": true,
+  "sortish_sampler": false,
+  "predict_with_generate": false,
+  "generation_max_length": null,
+  "generation_num_beams": null,
+  "generation_config": null,
+  "tuner_backend": "peft",
+  "vit_gradient_checkpointing": null,
+  "router_aux_loss_coef": 0.0,
+  "enable_dft_loss": false,
+  "enable_channel_loss": false,
+  "check_model": true,
+  "acc_strategy": "token",
+  "train_dataloader_shuffle": true,
+  "max_epochs": null,
+  "aligner_lr": null,
+  "vit_lr": null,
+  "use_logits_to_keep": null,
+  "ds3_gather_for_generation": true,
+  "resume_only_model": false,
+  "optimizer": null,
+  "loss_type": "grpo",
+  "metric": null,
+  "eval_use_evalscope": false,
+  "eval_dataset": [],
+  "eval_dataset_args": null,
+  "eval_limit": null,
+  "eval_generation_config": null,
+  "extra_eval_args": null,
+  "use_flash_ckpt": false,
+  "use_ray": false,
+  "ray_exp_name": null,
+  "device_groups": null,
+  "model": "Qwen/Qwen2.5-7B-Instruct",
+  "model_type": "qwen2_5",
+  "model_revision": null,
+  "task_type": "causal_lm",
+  "torch_dtype": "bfloat16",
+  "attn_impl": null,
+  "new_special_tokens": [],
+  "num_labels": null,
+  "problem_type": null,
+  "rope_scaling": null,
+  "device_map": null,
+  "max_memory": {},
+  "max_model_len": null,
+  "local_repo_path": null,
+  "init_strategy": null,
+  "template": "qwen2_5",
+  "system": null,
+  "max_length": 2048,
+  "truncation_strategy": "left",
+  "max_pixels": null,
+  "agent_template": null,
+  "norm_bbox": null,
+  "use_chat_template": true,
+  "padding_free": false,
+  "padding_side": "right",
+  "loss_scale": "last_round",
+  "sequence_parallel_size": 1,
+  "response_prefix": null,
+  "template_backend": "swift",
+  "dataset": [
+    "AI-MO/NuminaMath-TIR#500"
+  ],
+  "val_dataset": [],
+  "cached_dataset": [],
+  "split_dataset_ratio": 0.0,
+  "dataset_num_proc": 4,
+  "load_from_cache_file": true,
+  "dataset_shuffle": true,
+  "val_dataset_shuffle": false,
+  "streaming": false,
+  "interleave_prob": null,
+  "stopping_strategy": "first_exhausted",
+  "shuffle_buffer_size": 1000,
+  "download_mode": "reuse_dataset_if_exists",
+  "columns": {},
+  "strict": false,
+  "model_name": null,
+  "model_author": null,
+  "custom_dataset_info": [],
+  "quant_method": null,
+  "quant_bits": null,
+  "hqq_axis": null,
+  "bnb_4bit_compute_dtype": "bfloat16",
+  "bnb_4bit_quant_type": "nf4",
+  "bnb_4bit_use_double_quant": true,
+  "bnb_4bit_quant_storage": null,
+  "max_new_tokens": 1024,
+  "temperature": 0.9,
+  "top_k": 50,
+  "top_p": 0.9,
+  "repetition_penalty": 1.0,
+  "num_beams": 1,
+  "stream": false,
+  "stop_words": [],
+  "logprobs": false,
+  "top_logprobs": null,
+  "ckpt_dir": null,
+  "lora_modules": [],
+  "train_type": "lora",
+  "adapters": [],
+  "external_plugins": [],
+  "model_kwargs": {},
+  "load_args": false,
+  "load_data_args": false,
+  "packing": false,
+  "packing_length": null,
+  "packing_num_proc": 1,
+  "lazy_tokenize": false,
+  "custom_register_path": [],
+  "use_hf": false,
+  "ignore_args_error": false,
+  "use_swift_lora": false,
+  "freeze_parameters": [],
+  "freeze_parameters_regex": null,
+  "freeze_parameters_ratio": 0.0,
+  "trainable_parameters": [],
+  "trainable_parameters_regex": null,
+  "freeze_llm": false,
+  "freeze_vit": true,
+  "freeze_aligner": true,
+  "target_modules": [
+    "all-linear"
+  ],
+  "target_regex": null,
+  "target_parameters": null,
+  "modules_to_save": [],
+  "lora_rank": 8,
+  "lora_alpha": 32,
+  "lora_dropout": 0.05,
+  "lora_bias": "none",
+  "lora_dtype": null,
+  "lorap_lr_ratio": null,
+  "use_rslora": false,
+  "use_dora": false,
+  "lora_ga_batch_size": 2,
+  "lora_ga_iters": 2,
+  "lora_ga_max_length": 1024,
+  "lora_ga_direction": "ArB2r",
+  "lora_ga_scale": "stable",
+  "lora_ga_stable_gamma": 16,
+  "init_weights": true,
+  "fourier_n_frequency": 2000,
+  "fourier_scaling": 300.0,
+  "boft_block_size": 4,
+  "boft_block_num": 0,
+  "boft_n_butterfly_factor": 1,
+  "boft_dropout": 0.0,
+  "vera_rank": 256,
+  "vera_projection_prng_key": 0,
+  "vera_dropout": 0.0,
+  "vera_d_initial": 0.1,
+  "adapter_act": "gelu",
+  "adapter_length": 128,
+  "use_galore": false,
+  "galore_target_modules": null,
+  "galore_rank": 128,
+  "galore_update_proj_gap": 50,
+  "galore_scale": 1.0,
+  "galore_proj_type": "std",
+  "galore_optim_per_parameter": false,
+  "galore_with_embedding": false,
+  "galore_quantization": false,
+  "galore_proj_quant": false,
+  "galore_proj_bits": 4,
+  "galore_proj_group_size": 256,
+  "galore_cos_threshold": 0.4,
+  "galore_gamma_proj": 2,
+  "galore_queue_size": 5,
+  "adalora_target_r": 8,
+  "adalora_init_r": 12,
+  "adalora_tinit": 0,
+  "adalora_tfinal": 0,
+  "adalora_deltaT": 1,
+  "adalora_beta1": 0.85,
+  "adalora_beta2": 0.85,
+  "adalora_orth_reg_weight": 0.5,
+  "llamapro_num_new_blocks": 4,
+  "llamapro_num_groups": null,
+  "lisa_activated_layers": 0,
+  "lisa_step_interval": 20,
+  "reft_layer_key": null,
+  "reft_layers": null,
+  "reft_rank": 4,
+  "reft_intervention_type": "LoreftIntervention",
+  "reft_args": null,
+  "swanlab_token": null,
+  "swanlab_project": null,
+  "swanlab_workspace": null,
+  "swanlab_exp_name": null,
+  "swanlab_lark_webhook_url": null,
+  "swanlab_lark_secret": null,
+  "swanlab_mode": "cloud",
+  "add_version": true,
+  "create_checkpoint_symlink": false,
+  "zero_hpz_partition_size": null,
+  "deepspeed_autotp_size": null,
+  "early_stop_interval": null,
+  "sft_alpha": 0,
+  "chord_sft_dataset": [],
+  "chord_sft_per_device_train_batch_size": null,
+  "chord_enable_phi_function": false,
+  "chord_mu_warmup_steps": null,
+  "chord_mu_decay_steps": null,
+  "chord_mu_peak": null,
+  "chord_mu_valley": null,
+  "reward_model": null,
+  "reward_adapters": [],
+  "reward_model_type": null,
+  "reward_model_revision": null,
+  "num_ppo_epochs": 4,
+  "whiten_rewards": false,
+  "kl_coef": 0.05,
+  "cliprange": 0.2,
+  "vf_coef": 0.1,
+  "cliprange_value": 0.2,
+  "gamma": 1.0,
+  "lam": 0.95,
+  "num_mini_batches": 1,
+  "local_rollout_forward_batch_size": 64,
+  "num_sample_generations": 10,
+  "response_length": 1024,
+  "missing_eos_penalty": null,
+  "vllm_gpu_memory_utilization": 0.9,
+  "vllm_tensor_parallel_size": 1,
+  "vllm_pipeline_parallel_size": 1,
+  "vllm_enable_expert_parallel": false,
+  "vllm_max_num_seqs": 256,
+  "vllm_max_model_len": null,
+  "vllm_disable_custom_all_reduce": true,
+  "vllm_enforce_eager": false,
+  "vllm_limit_mm_per_prompt": null,
+  "vllm_max_lora_rank": 16,
+  "vllm_enable_prefix_caching": true,
+  "vllm_use_async_engine": false,
+  "vllm_quantization": null,
+  "vllm_reasoning_parser": null,
+  "vllm_disable_cascade_attn": false,
+  "vllm_mm_processor_cache_gb": null,
+  "vllm_speculative_config": null,
+  "vllm_engine_kwargs": {},
+  "vllm_data_parallel_size": 1,
+  "use_vllm": false,
+  "vllm_mode": "colocate",
+  "vllm_enable_lora": false,
+  "vllm_server_base_url": null,
+  "vllm_server_host": null,
+  "vllm_server_port": [
+    8000
+  ],
+  "vllm_server_timeout": 240.0,
+  "async_generate": false,
+  "sleep_level": 0,
+  "move_model_batches": null,
+  "offload_optimizer": false,
+  "offload_model": false,
+  "wandb_log_unique_prompts": null,
+  "epsilon": 0.2,
+  "epsilon_high": null,
+  "delta": null,
+  "cosine_min_len_value_wrong": -0.5,
+  "cosine_max_len_value_wrong": 0.0,
+  "cosine_min_len_value_correct": 1.0,
+  "cosine_max_len_value_correct": 0.5,
+  "cosine_max_len": null,
+  "repetition_n_grams": 3,
+  "repetition_max_penalty": -1.0,
+  "reward_model_plugin": null,
+  "sync_ref_model": false,
+  "ref_model_sync_steps": 512,
+  "ref_model_mixup_alpha": 0.6,
+  "multi_turn_scheduler": null,
+  "max_turns": null,
+  "completion_length_limit_scope": "per_round",
+  "vllm_server_pass_dataset": false,
+  "dynamic_sample": false,
+  "max_resample_times": 3,
+  "overlong_filter": false,
+  "soft_max_length": null,
+  "soft_cache_length": null,
+  "scale_rewards": "group",
+  "log_entropy": false,
+  "top_entropy_quantile": 1.0,
+  "importance_sampling_level": "token",
+  "tau_pos": 1.0,
+  "tau_neg": 1.05,
+  "advantage_estimator": "grpo",
+  "kl_in_reward": false,
+  "generation_batch_size": null,
+  "steps_per_generation": null,
+  "rollout_importance_sampling_mode": null,
+  "rollout_importance_sampling_threshold": 2.0,
+  "num_generations": 2,
+  "reward_funcs": [
+    "accuracy",
+    "format"
+  ],
+  "reward_weights": null,
+  "log_completions": true,
+  "num_iterations": 1,
+  "teacher_model": null,
+  "teacher_adapters": [],
+  "teacher_model_type": null,
+  "teacher_model_revision": null,
+  "teacher_deepspeed": null,
+  "rlhf_type": "grpo",
+  "ref_model": null,
+  "ref_adapters": [],
+  "ref_model_type": null,
+  "ref_model_revision": null,
+  "beta": 0.04,
+  "label_smoothing": 0,
+  "max_completion_length": 1024,
+  "rpo_alpha": null,
+  "ld_alpha": null,
+  "discopop_tau": 0.05,
+  "loss_weights": null,
+  "cpo_alpha": 1.0,
+  "simpo_gamma": 1,
+  "desirable_weight": 1.0,
+  "undesirable_weight": 1.0,
+  "center_rewards_coefficient": null,
+  "lmbda": 0.5,
+  "seq_kd": false,
+  "offload_teacher_model": false,
+  "rank": -1,
+  "global_world_size": 1,
+  "local_world_size": 1,
+  "model_suffix": "Qwen2.5-7B-Instruct",
+  "model_info": "ModelInfo(model_type='qwen2_5', model_dir='/home/ubuntu/.cache/modelscope/hub/models/Qwen/Qwen2___5-7B-Instruct', torch_dtype=torch.bfloat16, max_model_len=32768, quant_method=None, quant_bits=None, rope_scaling=None, is_moe_model=False, config=None, task_type='causal_lm', num_labels=None)",
+  "model_meta": "ModelMeta(model_type='qwen2_5', model_groups=[ModelGroup(models=[Model(ms_model_id='Qwen/Qwen2.5-0.5B-Instruct', hf_model_id='Qwen/Qwen2.5-0.5B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-1.5B-Instruct', hf_model_id='Qwen/Qwen2.5-1.5B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-3B-Instruct', hf_model_id='Qwen/Qwen2.5-3B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-7B-Instruct', hf_model_id='Qwen/Qwen2.5-7B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-14B-Instruct', hf_model_id='Qwen/Qwen2.5-14B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-32B-Instruct', hf_model_id='Qwen/Qwen2.5-32B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-72B-Instruct', hf_model_id='Qwen/Qwen2.5-72B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-0.5B', hf_model_id='Qwen/Qwen2.5-0.5B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-1.5B', hf_model_id='Qwen/Qwen2.5-1.5B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-3B', hf_model_id='Qwen/Qwen2.5-3B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-7B', hf_model_id='Qwen/Qwen2.5-7B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-14B', hf_model_id='Qwen/Qwen2.5-14B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-32B', hf_model_id='Qwen/Qwen2.5-32B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-72B', hf_model_id='Qwen/Qwen2.5-72B', model_path=None, ms_revision=None, hf_revision=None), 
Model(ms_model_id='Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-14B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-32B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-3B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-3B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-7B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-7B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-14B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), 
Model(ms_model_id='Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-32B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-72B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-72B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-0.5B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-0.5B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-1.5B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-1.5B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-3B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-3B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-7B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-7B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-14B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-14B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-32B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-32B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-72B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-72B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None)], ignore_patterns=None, requires=None, tags=[]), ModelGroup(models=[Model(ms_model_id='Qwen/Qwen2.5-Coder-0.5B-Instruct', hf_model_id='Qwen/Qwen2.5-Coder-0.5B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-1.5B-Instruct', hf_model_id='Qwen/Qwen2.5-Coder-1.5B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-3B-Instruct', hf_model_id='Qwen/Qwen2.5-Coder-3B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-7B-Instruct', 
hf_model_id='Qwen/Qwen2.5-Coder-7B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-14B-Instruct', hf_model_id='Qwen/Qwen2.5-Coder-14B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-32B-Instruct', hf_model_id='Qwen/Qwen2.5-Coder-32B-Instruct', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-0.5B', hf_model_id='Qwen/Qwen2.5-Coder-0.5B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-1.5B', hf_model_id='Qwen/Qwen2.5-Coder-1.5B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-3B', hf_model_id='Qwen/Qwen2.5-Coder-3B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-7B', hf_model_id='Qwen/Qwen2.5-Coder-7B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-14B', hf_model_id='Qwen/Qwen2.5-Coder-14B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-32B', hf_model_id='Qwen/Qwen2.5-Coder-32B', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-0.5B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-Coder-0.5B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-1.5B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-Coder-1.5B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-3B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-Coder-3B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-7B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-Coder-7B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-14B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-Coder-14B-Instruct-AWQ', model_path=None, 
ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-32B-Instruct-AWQ', hf_model_id='Qwen/Qwen2.5-Coder-32B-Instruct-AWQ', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-0.5B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-Coder-0.5B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-0.5B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-Coder-0.5B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-3B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-Coder-3B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-3B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-Coder-3B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-7B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-Coder-7B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-7B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-Coder-7B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-14B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-Coder-14B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-14B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-Coder-14B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None), 
Model(ms_model_id='Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4', hf_model_id='Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int4', model_path=None, ms_revision=None, hf_revision=None), Model(ms_model_id='Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int8', hf_model_id='Qwen/Qwen2.5-Coder-32B-Instruct-GPTQ-Int8', model_path=None, ms_revision=None, hf_revision=None)], ignore_patterns=None, requires=None, tags=['coding']), ModelGroup(models=[Model(ms_model_id='moonshotai/Kimi-Dev-72B', hf_model_id='moonshotai/Kimi-Dev-72B', model_path=None, ms_revision=None, hf_revision=None)], ignore_patterns=None, requires=None, tags=[])], template='qwen2_5', get_function=<function get_model_tokenizer_with_flash_attn at 0x770ad2b563b0>, model_arch=ModelKeys(arch_name='llama', embedding='model.embed_tokens', module_list='model.layers', lm_head='lm_head', q_proj='model.layers.{}.self_attn.q_proj', k_proj='model.layers.{}.self_attn.k_proj', v_proj='model.layers.{}.self_attn.v_proj', o_proj='model.layers.{}.self_attn.o_proj', attention='model.layers.{}.self_attn', mlp='model.layers.{}.mlp', down_proj='model.layers.{}.mlp.down_proj', qkv_proj=None, qk_proj=None, qa_proj=None, qb_proj=None, kv_proj=None, kva_proj=None, kvb_proj=None), architectures=['Qwen2ForCausalLM'], additional_saved_files=[], torch_dtype=None, is_multimodal=False, is_reward=False, is_reranker=False, task_type=None, ignore_patterns=None, requires=['transformers>=4.37'], tags=[])",
+  "model_dir": "/home/ubuntu/.cache/modelscope/hub/models/Qwen/Qwen2___5-7B-Instruct",
+  "_val_dataset_exists": [],
+  "hub": "<class 'swift.hub.hub.MSHub'>",
+  "evaluation_strategy": "steps",
+  "training_args": "GRPOConfig(output_dir='/home/ubuntu/ms-swift/output/grpo_qwen2.5_7b/v1-20251128-020354', overwrite_output_dir=False, do_train=False, do_eval=False, do_predict=False, eval_strategy=<IntervalStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=2, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, eval_delay=0, torch_empty_cache_steps=None, learning_rate=5e-05, weight_decay=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.COSINE: 'cosine'>, lr_scheduler_kwargs=None, warmup_ratio=0.05, warmup_steps=0, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir='/home/ubuntu/ms-swift/output/grpo_qwen2.5_7b/v1-20251128-020354/runs', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=True, logging_steps=5, logging_nan_inf_filter=True, save_strategy=<SaveStrategy.STEPS: 'steps'>, save_steps=50, save_total_limit=2, save_safetensors=True, save_on_each_node=False, save_only_model=False, restore_callback_states_from_checkpoint=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=42, jit_mode_eval=False, bf16=True, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=None, local_rank=0, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=True, eval_steps=50.0, dataloader_num_workers=4, dataloader_prefetch_factor=10, past_index=-1, run_name='/home/ubuntu/ms-swift/output/grpo_qwen2.5_7b/v1-20251128-020354', disable_tqdm=False, remove_unused_columns=False, label_names=None, load_best_model_at_end=False, metric_for_best_model='loss', greater_is_better=False, ignore_data_skip=False, fsdp=[], fsdp_min_num_params=0, fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': 
False, 'xla_fsdp_grad_ckpt': False}, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=AcceleratorConfig(split_batches=False, dispatch_batches=False, even_batches=True, use_seedable_sampler=True, non_blocking=False, gradient_accumulation_kwargs=None, use_configured_state=False), parallelism_config=None, deepspeed=None, label_smoothing_factor=0.0, optim=<OptimizerNames.ADAMW_TORCH_FUSED: 'adamw_torch_fused'>, optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=['tensorboard'], project='huggingface', trackio_space_id='trackio', ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy=<HubStrategy.EVERY_SAVE: 'every_save'>, hub_token=None, hub_private_repo=None, hub_always_push=False, hub_revision=None, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, include_for_metrics=[], eval_do_concat_batches=True, fp16_backend='auto', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=18000000, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, include_tokens_per_second=None, include_num_input_tokens_seen=None, neftune_noise_alpha=None, optim_target_modules=None, batch_eval_metrics=False, eval_on_start=False, use_liger_kernel=False, liger_kernel_config=None, eval_use_gather_object=False, average_tokens_across_devices=None, model_init_kwargs=None, disable_dropout=False, max_prompt_length=512, num_generations=2, max_completion_length=1024, ds3_gather_for_generation=True, shuffle_dataset=True, generation_batch_size=2, steps_per_generation=1, temperature=0.9, top_p=0.9, top_k=50, 
min_p=None, generation_kwargs=None, repetition_penalty=1.0, use_transformers_paged=False, cache_implementation=None, use_vllm=False, vllm_mode='colocate', vllm_model_impl='vllm', vllm_enable_sleep_mode=False, vllm_guided_decoding_regex=None, vllm_server_base_url=None, vllm_server_host=None, vllm_server_port=[8000], vllm_server_timeout=240.0, vllm_gpu_memory_utilization=0.9, vllm_tensor_parallel_size=1, beta=0.04, num_iterations=1, epsilon=0.2, delta=None, epsilon_high=None, importance_sampling_level='token', reward_weights=None, scale_rewards='group', loss_type='grpo', mask_truncated_completions=False, sync_ref_model=False, ref_model_mixup_alpha=0.6, ref_model_sync_steps=512, top_entropy_quantile=1.0, use_liger_loss=False, vllm_importance_sampling_correction=True, vllm_importance_sampling_cap=2.0, log_completions=True, num_completions_to_print=None, wandb_log_unique_prompts=None, tuner_backend='peft', vit_gradient_checkpointing=True, router_aux_loss_coef=0.0, enable_dft_loss=False, enable_channel_loss=False, check_model=True, acc_strategy='token', train_dataloader_shuffle=True, max_epochs=None, aligner_lr=None, vit_lr=None, use_logits_to_keep=None, resume_only_model=False, optimizer=None, metric=None, eval_use_evalscope=False, eval_dataset=[], eval_dataset_args=None, eval_limit=None, eval_generation_config=None, extra_eval_args=None, use_flash_ckpt=False, sft_alpha=0, chord_sft_dataset=[], chord_sft_per_device_train_batch_size=None, chord_enable_phi_function=False, chord_mu_warmup_steps=None, chord_mu_decay_steps=None, chord_mu_peak=None, chord_mu_valley=None, train_type='lora', local_repo_path=None, galore_config=None, padding_side='right', padding_free=False, task_type='causal_lm', problem_type=None, vllm_pipeline_parallel_size=1, vllm_enable_expert_parallel=False, vllm_max_num_seqs=256, vllm_max_model_len=None, vllm_disable_custom_all_reduce=True, vllm_enforce_eager=False, vllm_limit_mm_per_prompt=None, vllm_max_lora_rank=16, vllm_enable_prefix_caching=True, 
vllm_use_async_engine=False, vllm_quantization=None, vllm_reasoning_parser=None, vllm_disable_cascade_attn=False, vllm_mm_processor_cache_gb=None, vllm_speculative_config=None, vllm_engine_kwargs={}, vllm_data_parallel_size=1, stop_words=[], vllm_enable_lora=False, lora_rank=8, async_generate=False, sleep_level=0, move_model_batches=None, offload_optimizer=False, offload_model=False, cosine_min_len_value_wrong=-0.5, cosine_max_len_value_wrong=0.0, cosine_min_len_value_correct=1.0, cosine_max_len_value_correct=0.5, cosine_max_len=1024, repetition_n_grams=3, repetition_max_penalty=-1.0, reward_model=None, reward_model_plugin=None, multi_turn_scheduler=None, max_turns=None, completion_length_limit_scope='per_round', vllm_server_pass_dataset=False, dynamic_sample=False, max_resample_times=3, overlong_filter=False, soft_max_length=None, soft_cache_length=None, log_entropy=False, tau_pos=1.0, tau_neg=1.05, advantage_estimator='grpo', kl_in_reward=False, dataset_shuffle=True, rollout_importance_sampling_mode=None, rollout_importance_sampling_threshold=2.0)"
+}

optimizer.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:431f676d4b452a02254776f63e46224aa4ce1cd7ad37a999764d2568e58b211f
+size 161816187

rng_state.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e18e2d2755a4eceddb4a0fedea406eb9ca81a6bd3330e768838765a9ff0fd6c6
+size 14645

scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3105ebe8471f9890c3eb1f20cc0f2520fa5fdb0128474bbc87e607b2ec7c53dc
+size 1465

trainer_state.json ADDED

	@@ -0,0 +1,2458 @@

+{
+  "best_global_step": null,
+  "best_metric": null,
+  "best_model_checkpoint": null,
+  "epoch": 1.0,
+  "eval_steps": 50.0,
+  "global_step": 500,
+  "is_hyper_param_search": false,
+  "is_local_process_zero": true,
+  "is_world_process_zero": true,
+  "log_history": [
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 389.0,
+      "completions/mean_length": 377.5,
+      "completions/min_length": 366.0,
+      "epoch": 0.002,
+      "frac_reward_zero_std": 0.0,
+      "grad_norm": 0.2745305299758911,
+      "kl": 0.0,
+      "learning_rate": 2.0000000000000003e-06,
+      "loss": 0.0,
+      "reward": 0.5,
+      "reward_std": 0.7071067690849304,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.7071067690849304,
+      "step": 1
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.25,
+      "completions/max_length": 712.75,
+      "completions/mean_length": 689.375,
+      "completions/min_length": 666.0,
+      "epoch": 0.01,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0006804907461628318,
+      "kl": 9.946800855686888e-05,
+      "learning_rate": 1e-05,
+      "loss": 4.000145054305904e-06,
+      "reward": 0.25,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.25,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 5
+    },
+    {
+      "clip_ratio/high_max": 0.0024464832618832587,
+      "clip_ratio/high_mean": 0.0024464832618832587,
+      "clip_ratio/low_mean": 0.0001640689093619585,
+      "clip_ratio/low_min": 0.0001640689093619585,
+      "clip_ratio/region_mean": 0.002610552171245217,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 497.2,
+      "completions/mean_length": 485.6,
+      "completions/min_length": 474.0,
+      "epoch": 0.02,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.22628173232078552,
+      "kl": 0.0002987155457958579,
+      "learning_rate": 2e-05,
+      "loss": -0.0003628176636993885,
+      "reward": 0.6,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 10
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 421.6,
+      "completions/mean_length": 390.2,
+      "completions/min_length": 358.8,
+      "epoch": 0.03,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0014656687853857875,
+      "kl": 0.00022319573326967657,
+      "learning_rate": 3e-05,
+      "loss": -1.4230319357011467e-05,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 15
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 486.0,
+      "completions/mean_length": 444.9,
+      "completions/min_length": 403.8,
+      "epoch": 0.04,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0009142484632320702,
+      "kl": 0.0004241452901624143,
+      "learning_rate": 4e-05,
+      "loss": 1.7114212096203118e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 20
+    },
+    {
+      "clip_ratio/high_max": 0.0005540780373848974,
+      "clip_ratio/high_mean": 0.0005540780373848974,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0005540780373848974,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 787.4,
+      "completions/mean_length": 738.5,
+      "completions/min_length": 689.6,
+      "epoch": 0.05,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.0029386563692241907,
+      "kl": 0.0017149186198366806,
+      "learning_rate": 5e-05,
+      "loss": 8.679315214976668e-05,
+      "reward": 0.4,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 25
+    },
+    {
+      "clip_ratio/high_max": 0.0006403414998203516,
+      "clip_ratio/high_mean": 0.0006403414998203516,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0006403414998203516,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 647.6,
+      "completions/mean_length": 566.2,
+      "completions/min_length": 484.8,
+      "epoch": 0.06,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0006255562184378505,
+      "kl": 0.0003066264180233702,
+      "learning_rate": 4.9986331433523156e-05,
+      "loss": 1.4322872448246927e-05,
+      "reward": 0.3,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.3,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 30
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.3,
+      "completions/max_length": 686.4,
+      "completions/mean_length": 664.7,
+      "completions/min_length": 643.0,
+      "epoch": 0.07,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0014924455899745226,
+      "kl": 0.00036709415726363657,
+      "learning_rate": 4.994534068046937e-05,
+      "loss": 1.4717187150381506e-05,
+      "reward": 0.4,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 35
+    },
+    {
+      "clip_ratio/high_max": 0.00040349699556827543,
+      "clip_ratio/high_mean": 0.00040349699556827543,
+      "clip_ratio/low_mean": 0.00040349699556827543,
+      "clip_ratio/low_min": 0.00040349699556827543,
+      "clip_ratio/region_mean": 0.0008069939911365509,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 681.6,
+      "completions/mean_length": 587.4,
+      "completions/min_length": 493.2,
+      "epoch": 0.08,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0019065124215558171,
+      "kl": 0.0004894518526270986,
+      "learning_rate": 4.9877072563625285e-05,
+      "loss": -0.00015065595507621766,
+      "reward": 0.3,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.3,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 40
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 677.2,
+      "completions/mean_length": 661.8,
+      "completions/min_length": 646.4,
+      "epoch": 0.09,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.000817921943962574,
+      "kl": 0.0003075484826695174,
+      "learning_rate": 4.978160173317438e-05,
+      "loss": 1.24339887406677e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 45
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 656.8,
+      "completions/mean_length": 621.3,
+      "completions/min_length": 585.8,
+      "epoch": 0.1,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.15634514391422272,
+      "kl": 0.000459137320285663,
+      "learning_rate": 4.965903258506806e-05,
+      "loss": 2.121384022757411e-06,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 50
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.4,
+      "completions/max_length": 677.0,
+      "completions/mean_length": 671.2,
+      "completions/min_length": 665.4,
+      "epoch": 0.11,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0011203172616660595,
+      "kl": 0.0012636209139600396,
+      "learning_rate": 4.9509499146870236e-05,
+      "loss": 4.886850947514176e-05,
+      "reward": 0.2,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.2,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 55
+    },
+    {
+      "clip_ratio/high_max": 0.00022246940061450006,
+      "clip_ratio/high_mean": 0.00022246940061450006,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00022246940061450006,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 418.2,
+      "completions/mean_length": 400.8,
+      "completions/min_length": 383.4,
+      "epoch": 0.12,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0010890079429373145,
+      "kl": 0.0009534806886222214,
+      "learning_rate": 4.933316493120015e-05,
+      "loss": 2.114146773237735e-05,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 60
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.3,
+      "completions/max_length": 681.6,
+      "completions/mean_length": 648.9,
+      "completions/min_length": 616.2,
+      "epoch": 0.13,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0018795138457790017,
+      "kl": 0.0005977108958177269,
+      "learning_rate": 4.913022275693372e-05,
+      "loss": 2.4121845490299167e-05,
+      "reward": 0.4,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 65
+    },
+    {
+      "clip_ratio/high_max": 0.0009542598738335073,
+      "clip_ratio/high_mean": 0.0009542598738335073,
+      "clip_ratio/low_mean": 0.00020920501556247472,
+      "clip_ratio/low_min": 0.00020920501556247472,
+      "clip_ratio/region_mean": 0.0011634649126790464,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 435.8,
+      "completions/mean_length": 412.4,
+      "completions/min_length": 389.0,
+      "epoch": 0.14,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.25056585669517517,
+      "kl": 0.0006330947682727129,
+      "learning_rate": 4.8900894538358944e-05,
+      "loss": -0.00019633164629340172,
+      "reward": 0.4,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 70
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 611.0,
+      "completions/mean_length": 575.0,
+      "completions/min_length": 539.0,
+      "epoch": 0.15,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.002435472793877125,
+      "kl": 0.0007646598271094263,
+      "learning_rate": 4.864543104251587e-05,
+      "loss": 3.094758721999824e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 75
+    },
+    {
+      "clip_ratio/high_max": 0.00010515246540307999,
+      "clip_ratio/high_mean": 0.00010515246540307999,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00010515246540307999,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 668.6,
+      "completions/mean_length": 606.7,
+      "completions/min_length": 544.8,
+      "epoch": 0.16,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.11095567792654037,
+      "kl": 0.0007162548077758402,
+      "learning_rate": 4.8364111614986527e-05,
+      "loss": 7.679397240281104e-05,
+      "reward": 0.3,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.3,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 80
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 568.8,
+      "completions/mean_length": 527.3,
+      "completions/min_length": 485.8,
+      "epoch": 0.17,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0033143432810902596,
+      "kl": 0.0006754565751180053,
+      "learning_rate": 4.805724387443462e-05,
+      "loss": 4.5427383156493306e-05,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 85
+    },
+    {
+      "clip_ratio/high_max": 0.0003456221194937825,
+      "clip_ratio/high_mean": 0.0003456221194937825,
+      "clip_ratio/low_mean": 0.00011520737316459418,
+      "clip_ratio/low_min": 0.00011520737316459418,
+      "clip_ratio/region_mean": 0.0004608294926583767,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 831.6,
+      "completions/mean_length": 796.0,
+      "completions/min_length": 760.4,
+      "epoch": 0.18,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0037328507751226425,
+      "kl": 0.0010223451943602413,
+      "learning_rate": 4.7725163376229064e-05,
+      "loss": 4.158227238804102e-05,
+      "reward": 0.3,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.3,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 90
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 497.4,
+      "completions/mean_length": 441.2,
+      "completions/min_length": 385.0,
+      "epoch": 0.19,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.002750825835391879,
+      "kl": 0.0019137584429699927,
+      "learning_rate": 4.736823324551909e-05,
+      "loss": 0.00014175053220242262,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 95
+    },
+    {
+      "clip_ratio/high_max": 0.0005050505045801401,
+      "clip_ratio/high_mean": 0.0005050505045801401,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0005050505045801401,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 508.2,
+      "completions/mean_length": 493.9,
+      "completions/min_length": 479.6,
+      "epoch": 0.2,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.4193066656589508,
+      "kl": 0.0017093931266572327,
+      "learning_rate": 4.698684378016222e-05,
+      "loss": 0.00012627228861674666,
+      "reward": 0.8,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 100
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 360.8,
+      "completions/mean_length": 341.7,
+      "completions/min_length": 322.6,
+      "epoch": 0.21,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.007469469215720892,
+      "kl": 0.003066345490515232,
+      "learning_rate": 4.6581412023939354e-05,
+      "loss": 0.00012305844575166702,
+      "reward": 1.0,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 1.0,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 105
+    },
+    {
+      "clip_ratio/high_max": 0.0001224739709869027,
+      "clip_ratio/high_mean": 0.0001224739709869027,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0001224739709869027,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 611.2,
+      "completions/mean_length": 572.1,
+      "completions/min_length": 533.0,
+      "epoch": 0.22,
+      "frac_reward_zero_std": 0.4,
+      "grad_norm": 0.0033601748291403055,
+      "kl": 0.003409948293119669,
+      "learning_rate": 4.6152381310523387e-05,
+      "loss": 0.00023221683222800492,
+      "reward": 0.5,
+      "reward_std": 0.42426406145095824,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.42426406145095824,
+      "step": 110
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 734.0,
+      "completions/mean_length": 692.2,
+      "completions/min_length": 650.4,
+      "epoch": 0.23,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0015698346542194486,
+      "kl": 0.0008670258859638125,
+      "learning_rate": 4.5700220778700504e-05,
+      "loss": 0.0001534310751594603,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 115
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 723.8,
+      "completions/mean_length": 682.2,
+      "completions/min_length": 640.6,
+      "epoch": 0.24,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.010230440646409988,
+      "kl": 0.005100146430777386,
+      "learning_rate": 4.522542485937369e-05,
+      "loss": 0.00020869788713753223,
+      "reward": 0.4,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 120
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.4,
+      "completions/max_length": 922.4,
+      "completions/mean_length": 872.0,
+      "completions/min_length": 821.6,
+      "epoch": 0.25,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.002747906604781747,
+      "kl": 0.0014180985395796596,
+      "learning_rate": 4.4728512734909844e-05,
+      "loss": 5.677485605701804e-05,
+      "reward": 0.0,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.0,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 125
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 405.8,
+      "completions/mean_length": 388.4,
+      "completions/min_length": 371.0,
+      "epoch": 0.26,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0014852522872388363,
+      "kl": 0.001983049605041742,
+      "learning_rate": 4.421002777142148e-05,
+      "loss": 7.918149349279701e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 130
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0001583531266078353,
+      "clip_ratio/low_min": 0.0001583531266078353,
+      "clip_ratio/region_mean": 0.0001583531266078353,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 458.4,
+      "completions/mean_length": 429.9,
+      "completions/min_length": 401.4,
+      "epoch": 0.27,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.01503363810479641,
+      "kl": 0.005801378504838794,
+      "learning_rate": 4.367053692460385e-05,
+      "loss": 0.0004215865395963192,
+      "reward": 0.6,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 135
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 645.2,
+      "completions/mean_length": 537.3,
+      "completions/min_length": 429.4,
+      "epoch": 0.28,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.002116028917953372,
+      "kl": 0.005066322290804237,
+      "learning_rate": 4.311063011977723e-05,
+      "loss": 0.00020899884402751922,
+      "reward": 0.4,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 140
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 375.8,
+      "completions/mean_length": 354.6,
+      "completions/min_length": 333.4,
+      "epoch": 0.29,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.006887642201036215,
+      "kl": 0.009827147470787168,
+      "learning_rate": 4.2530919606812216e-05,
+      "loss": 0.0003938500303775072,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 145
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 416.4,
+      "completions/mean_length": 366.7,
+      "completions/min_length": 317.0,
+      "epoch": 0.3,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.002672386122867465,
+      "kl": 0.008298561931587756,
+      "learning_rate": 4.193203929064353e-05,
+      "loss": 0.00047482880763709544,
+      "reward": 0.9,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.9,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 150
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 640.8,
+      "completions/mean_length": 594.7,
+      "completions/min_length": 548.6,
+      "epoch": 0.31,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0023163340520113707,
+      "kl": 0.0016974479891359805,
+      "learning_rate": 4.131464403810422e-05,
+      "loss": 6.728884764015675e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 155
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 381.2,
+      "completions/mean_length": 353.1,
+      "completions/min_length": 325.0,
+      "epoch": 0.32,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0056184823624789715,
+      "kl": 0.006532504153437912,
+      "learning_rate": 4.067940896183843e-05,
+      "loss": 0.0001474126009270549,
+      "reward": 0.9,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.9,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 160
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 499.4,
+      "completions/mean_length": 479.6,
+      "completions/min_length": 459.8,
+      "epoch": 0.33,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.007484205532819033,
+      "kl": 0.0038784807082265617,
+      "learning_rate": 4.002702868207563e-05,
+      "loss": 0.0001543789985589683,
+      "reward": 0.4,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 165
+    },
+    {
+      "clip_ratio/high_max": 0.0006509372964501381,
+      "clip_ratio/high_mean": 0.0006509372964501381,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0006509372964501381,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 658.2,
+      "completions/mean_length": 576.8,
+      "completions/min_length": 495.4,
+      "epoch": 0.34,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.004718282260000706,
+      "kl": 0.007649715105071664,
+      "learning_rate": 3.935821656707359e-05,
+      "loss": 0.0005094979424029589,
+      "reward": 0.8,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 170
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 587.4,
+      "completions/mean_length": 544.4,
+      "completions/min_length": 501.4,
+      "epoch": 0.35,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0019858903251588345,
+      "kl": 0.0035309843719005586,
+      "learning_rate": 3.867370395306068e-05,
+      "loss": 6.014038226567209e-05,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 175
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 639.6,
+      "completions/mean_length": 594.5,
+      "completions/min_length": 549.4,
+      "epoch": 0.36,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.015034261159598827,
+      "kl": 0.003863858920522034,
+      "learning_rate": 3.797423934453038e-05,
+      "loss": 0.0001626830198802054,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 180
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.3,
+      "completions/max_length": 805.0,
+      "completions/mean_length": 710.5,
+      "completions/min_length": 616.0,
+      "epoch": 0.37,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0022431982215493917,
+      "kl": 0.0025493323453702034,
+      "learning_rate": 3.726058759576271e-05,
+      "loss": 0.0001080367248505354,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 185
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.00020140986889600755,
+      "clip_ratio/low_min": 0.00020140986889600755,
+      "clip_ratio/region_mean": 0.00020140986889600755,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 384.8,
+      "completions/mean_length": 366.2,
+      "completions/min_length": 347.6,
+      "epoch": 0.38,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.019761426374316216,
+      "kl": 0.011293478566221893,
+      "learning_rate": 3.65335290744672e-05,
+      "loss": 0.0005690994672477246,
+      "reward": 0.9,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.9,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 190
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.00012698412174358965,
+      "clip_ratio/low_min": 0.00012698412174358965,
+      "clip_ratio/region_mean": 0.00012698412174358965,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 703.6,
+      "completions/mean_length": 662.9,
+      "completions/min_length": 622.2,
+      "epoch": 0.39,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.002572572324424982,
+      "kl": 0.0015339702018536626,
+      "learning_rate": 3.579385880846232e-05,
+      "loss": 2.8236012440174817e-06,
+      "reward": 0.3,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.3,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 195
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 643.2,
+      "completions/mean_length": 612.7,
+      "completions/min_length": 582.2,
+      "epoch": 0.4,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0031194768380373716,
+      "kl": 0.002493513422086835,
+      "learning_rate": 3.504238561632424e-05,
+      "loss": 9.978280868381262e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 200
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.3,
+      "completions/max_length": 674.2,
+      "completions/mean_length": 638.7,
+      "completions/min_length": 603.2,
+      "epoch": 0.41,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0036480328999459743,
+      "kl": 0.0019606892135925593,
+      "learning_rate": 3.427993122295552e-05,
+      "loss": 7.743847672827541e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 205
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 632.4,
+      "completions/mean_length": 570.5,
+      "completions/min_length": 508.6,
+      "epoch": 0.42,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.00799116026610136,
+      "kl": 0.003256951330695301,
+      "learning_rate": 3.350732936104108e-05,
+      "loss": 0.00010458981851115822,
+      "reward": 0.3,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.3,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 210
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 315.6,
+      "completions/mean_length": 300.6,
+      "completions/min_length": 285.6,
+      "epoch": 0.43,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.005798510741442442,
+      "kl": 0.011395945539698004,
+      "learning_rate": 3.272542485937369e-05,
+      "loss": 0.00045509766787290575,
+      "reward": 0.8,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 215
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.4,
+      "completions/max_length": 777.6,
+      "completions/mean_length": 756.6,
+      "completions/min_length": 735.6,
+      "epoch": 0.44,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.00223241513594985,
+      "kl": 0.0014477839809842407,
+      "learning_rate": 3.1935072719046115e-05,
+      "loss": 5.752563010901213e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 220
+    },
+    {
+      "clip_ratio/high_max": 0.00031645570416003463,
+      "clip_ratio/high_mean": 0.00031645570416003463,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00031645570416003463,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 443.8,
+      "completions/mean_length": 398.4,
+      "completions/min_length": 353.0,
+      "epoch": 0.45,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.0020589185878634453,
+      "kl": 0.0025043860776349904,
+      "learning_rate": 3.1137137178519985e-05,
+      "loss": 0.0001688675722107291,
+      "reward": 0.4,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 225
+    },
+    {
+      "clip_ratio/high_max": 9.955201530829073e-05,
+      "clip_ratio/high_mean": 9.955201530829073e-05,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 9.955201530829073e-05,
+      "completions/clipped_ratio": 0.3,
+      "completions/max_length": 747.4,
+      "completions/mean_length": 715.4,
+      "completions/min_length": 683.4,
+      "epoch": 0.46,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.002850248944014311,
+      "kl": 0.0052430763142183425,
+      "learning_rate": 3.0332490768593675e-05,
+      "loss": 0.00025521754287183285,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 230
+    },
+    {
+      "clip_ratio/high_max": 0.0004385964944958687,
+      "clip_ratio/high_mean": 0.0004385964944958687,
+      "clip_ratio/low_mean": 0.00021929824724793435,
+      "clip_ratio/low_min": 0.00021929824724793435,
+      "clip_ratio/region_mean": 0.000657894741743803,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 625.8,
+      "completions/mean_length": 551.9,
+      "completions/min_length": 478.0,
+      "epoch": 0.47,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.005838941317051649,
+      "kl": 0.002799734321888536,
+      "learning_rate": 2.952201335830275e-05,
+      "loss": 7.562300888821482e-05,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 235
+    },
+    {
+      "clip_ratio/high_max": 0.000272479560226202,
+      "clip_ratio/high_mean": 0.000272479560226202,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.000272479560226202,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 530.6,
+      "completions/mean_length": 480.4,
+      "completions/min_length": 430.2,
+      "epoch": 0.48,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.003879491239786148,
+      "kl": 0.0028692058520391585,
+      "learning_rate": 2.870659119279605e-05,
+      "loss": 0.0001448941882699728,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 240
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 482.6,
+      "completions/mean_length": 463.6,
+      "completions/min_length": 444.6,
+      "epoch": 0.49,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.008210963569581509,
+      "kl": 0.0075855673989281055,
+      "learning_rate": 2.788711592423966e-05,
+      "loss": 0.0003023243509232998,
+      "reward": 0.8,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 245
+    },
+    {
+      "clip_ratio/high_max": 0.0002461538417264819,
+      "clip_ratio/high_mean": 0.0002461538417264819,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0002461538417264819,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 775.4,
+      "completions/mean_length": 716.8,
+      "completions/min_length": 658.2,
+      "epoch": 0.5,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.006750498432666063,
+      "kl": 0.0025351812597364186,
+      "learning_rate": 2.7064483636808313e-05,
+      "loss": 0.00016423141350969673,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 250
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 619.8,
+      "completions/mean_length": 591.4,
+      "completions/min_length": 563.0,
+      "epoch": 0.51,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.006483216769993305,
+      "kl": 0.0026113510597497226,
+      "learning_rate": 2.623959386683056e-05,
+      "loss": 0.00010410962859168649,
+      "reward": 0.8,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 255
+    },
+    {
+      "clip_ratio/high_max": 0.00032043447718024256,
+      "clip_ratio/high_mean": 0.00032043447718024256,
+      "clip_ratio/low_mean": 0.00020130849443376065,
+      "clip_ratio/low_min": 0.00020130849443376065,
+      "clip_ratio/region_mean": 0.0005217429948970676,
+      "completions/clipped_ratio": 0.3,
+      "completions/max_length": 782.6,
+      "completions/mean_length": 706.5,
+      "completions/min_length": 630.4,
+      "epoch": 0.52,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.003508440451696515,
+      "kl": 0.001330986130051315,
+      "learning_rate": 2.5413348619158967e-05,
+      "loss": -4.660175181925297e-05,
+      "reward": 0.6,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 260
+    },
+    {
+      "clip_ratio/high_max": 0.00013029315741732716,
+      "clip_ratio/high_mean": 0.00013029315741732716,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00013029315741732716,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 604.4,
+      "completions/mean_length": 536.1,
+      "completions/min_length": 467.8,
+      "epoch": 0.53,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.1873970776796341,
+      "kl": 0.009700851677916945,
+      "learning_rate": 2.458665138084104e-05,
+      "loss": 0.00033162124454975126,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 265
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 544.8,
+      "completions/mean_length": 517.5,
+      "completions/min_length": 490.2,
+      "epoch": 0.54,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0017767059616744518,
+      "kl": 0.002807480387855321,
+      "learning_rate": 2.3760406133169443e-05,
+      "loss": 0.00011274998541921377,
+      "reward": 0.8,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 270
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.00019474197179079056,
+      "clip_ratio/low_min": 0.00019474197179079056,
+      "clip_ratio/region_mean": 0.00019474197179079056,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 641.0,
+      "completions/mean_length": 620.8,
+      "completions/min_length": 600.6,
+      "epoch": 0.55,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.08176784217357635,
+      "kl": 0.0031962784822098913,
+      "learning_rate": 2.2935516363191693e-05,
+      "loss": 0.0002264779293909669,
+      "reward": 0.6,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 275
+    },
+    {
+      "clip_ratio/high_max": 0.0006584362126886845,
+      "clip_ratio/high_mean": 0.0006584362126886845,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0006584362126886845,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 638.0,
+      "completions/mean_length": 562.6,
+      "completions/min_length": 487.2,
+      "epoch": 0.56,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.007495217490941286,
+      "kl": 0.004752782918512821,
+      "learning_rate": 2.2112884075760347e-05,
+      "loss": 8.390162838622928e-05,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 280
+    },
+    {
+      "clip_ratio/high_max": 0.00011280316393822432,
+      "clip_ratio/high_mean": 0.00011280316393822432,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00011280316393822432,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 717.4,
+      "completions/mean_length": 650.2,
+      "completions/min_length": 583.0,
+      "epoch": 0.57,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.0028083010111004114,
+      "kl": 0.005990609969012439,
+      "learning_rate": 2.1293408807203947e-05,
+      "loss": 0.000368604133836925,
+      "reward": 0.4,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 285
+    },
+    {
+      "clip_ratio/high_max": 0.0001277955248951912,
+      "clip_ratio/high_mean": 0.0001277955248951912,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0001277955248951912,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 503.0,
+      "completions/mean_length": 472.3,
+      "completions/min_length": 441.6,
+      "epoch": 0.58,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.17317219078540802,
+      "kl": 0.003116553882136941,
+      "learning_rate": 2.047798664169726e-05,
+      "loss": 2.7030031196773054e-05,
+      "reward": 0.9,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.9,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 290
+    },
+    {
+      "clip_ratio/high_max": 0.00011068069143220783,
+      "clip_ratio/high_mean": 0.00011068069143220783,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00011068069143220783,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 701.6,
+      "completions/mean_length": 638.9,
+      "completions/min_length": 576.2,
+      "epoch": 0.59,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0019411866087466478,
+      "kl": 0.0012191154062747955,
+      "learning_rate": 1.9667509231406334e-05,
+      "loss": -5.089085607323795e-06,
+      "reward": 0.3,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.3,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 295
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 588.6,
+      "completions/mean_length": 567.2,
+      "completions/min_length": 545.8,
+      "epoch": 0.6,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0071156201884150505,
+      "kl": 0.003254280146211386,
+      "learning_rate": 1.8862862821480025e-05,
+      "loss": 0.00012772842310369015,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 300
+    },
+    {
+      "clip_ratio/high_max": 0.00033927056938409805,
+      "clip_ratio/high_mean": 0.00033927056938409805,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00033927056938409805,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 649.0,
+      "completions/mean_length": 625.9,
+      "completions/min_length": 602.8,
+      "epoch": 0.61,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.003073514671996236,
+      "kl": 0.002048709220252931,
+      "learning_rate": 1.806492728095389e-05,
+      "loss": 0.00016608801670372486,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 305
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 726.2,
+      "completions/mean_length": 684.3,
+      "completions/min_length": 642.4,
+      "epoch": 0.62,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.002632194198668003,
+      "kl": 0.004164765309542418,
+      "learning_rate": 1.7274575140626318e-05,
+      "loss": 0.0001662806374952197,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 310
+    },
+    {
+      "clip_ratio/high_max": 0.00016849199309945106,
+      "clip_ratio/high_mean": 0.00016849199309945106,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00016849199309945106,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 581.6,
+      "completions/mean_length": 551.1,
+      "completions/min_length": 520.6,
+      "epoch": 0.63,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.004793096799403429,
+      "kl": 0.006928782898467034,
+      "learning_rate": 1.6492670638958924e-05,
+      "loss": 0.0002944141859188676,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 315
+    },
+    {
+      "clip_ratio/high_max": 0.0006410256493836642,
+      "clip_ratio/high_mean": 0.0006410256493836642,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0006410256493836642,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 583.8,
+      "completions/mean_length": 559.5,
+      "completions/min_length": 535.2,
+      "epoch": 0.64,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.009927814826369286,
+      "kl": 0.005139771406538785,
+      "learning_rate": 1.5720068777044476e-05,
+      "loss": 2.336390898562968e-05,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 320
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 473.4,
+      "completions/mean_length": 433.2,
+      "completions/min_length": 393.0,
+      "epoch": 0.65,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.009365738369524479,
+      "kl": 0.010789648251375183,
+      "learning_rate": 1.495761438367577e-05,
+      "loss": 0.0004745126701891422,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 325
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.4,
+      "completions/max_length": 802.6,
+      "completions/mean_length": 755.4,
+      "completions/min_length": 708.2,
+      "epoch": 0.66,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.004060312174260616,
+      "kl": 0.0012661131797358394,
+      "learning_rate": 1.4206141191537682e-05,
+      "loss": 5.287175299599767e-05,
+      "reward": 0.2,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.2,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 330
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 697.0,
+      "completions/mean_length": 654.7,
+      "completions/min_length": 612.4,
+      "epoch": 0.67,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.003583623794838786,
+      "kl": 0.0017941199708729982,
+      "learning_rate": 1.346647092553281e-05,
+      "loss": 0.00011949921026825905,
+      "reward": 0.9,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.9,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 335
+    },
+    {
+      "clip_ratio/high_max": 0.00012666244292631746,
+      "clip_ratio/high_mean": 0.00012666244292631746,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00012666244292631746,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 601.2,
+      "completions/mean_length": 529.5,
+      "completions/min_length": 457.8,
+      "epoch": 0.68,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0036033187061548233,
+      "kl": 0.001624487293884158,
+      "learning_rate": 1.2739412404237306e-05,
+      "loss": -2.4389610916841776e-05,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 340
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 498.8,
+      "completions/mean_length": 485.2,
+      "completions/min_length": 471.6,
+      "epoch": 0.69,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.006147034000605345,
+      "kl": 0.0014556913753040134,
+      "learning_rate": 1.202576065546963e-05,
+      "loss": 5.921515985392034e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 345
+    },
+    {
+      "clip_ratio/high_max": 0.0003992015961557627,
+      "clip_ratio/high_mean": 0.0003992015961557627,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0003992015961557627,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 436.4,
+      "completions/mean_length": 421.5,
+      "completions/min_length": 406.6,
+      "epoch": 0.7,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.006787001620978117,
+      "kl": 0.0027022200636565687,
+      "learning_rate": 1.1326296046939333e-05,
+      "loss": 0.0002472905209288001,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 350
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 427.6,
+      "completions/mean_length": 404.0,
+      "completions/min_length": 380.4,
+      "epoch": 0.71,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0022182271350175142,
+      "kl": 0.002075655141379684,
+      "learning_rate": 1.064178343292641e-05,
+      "loss": 8.377792546525598e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 355
+    },
+    {
+      "clip_ratio/high_max": 0.0003294892841950059,
+      "clip_ratio/high_mean": 0.0003294892841950059,
+      "clip_ratio/low_mean": 0.00022050717379897832,
+      "clip_ratio/low_min": 0.00022050717379897832,
+      "clip_ratio/region_mean": 0.0005499964579939842,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 527.0,
+      "completions/mean_length": 431.4,
+      "completions/min_length": 335.8,
+      "epoch": 0.72,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.005938891787081957,
+      "kl": 0.003542816312983632,
+      "learning_rate": 9.972971317924374e-06,
+      "loss": -4.2559945723041895e-05,
+      "reward": 0.6,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 360
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 753.0,
+      "completions/mean_length": 721.9,
+      "completions/min_length": 690.8,
+      "epoch": 0.73,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0016718930564820766,
+      "kl": 0.0013213358353823424,
+      "learning_rate": 9.320591038161574e-06,
+      "loss": 5.248577799648047e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 365
+    },
+    {
+      "clip_ratio/high_max": 0.0006450645858421921,
+      "clip_ratio/high_mean": 0.0006450645858421921,
+      "clip_ratio/low_mean": 0.0001163467182777822,
+      "clip_ratio/low_min": 0.0001163467182777822,
+      "clip_ratio/region_mean": 0.0007614112924784422,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 653.6,
+      "completions/mean_length": 601.9,
+      "completions/min_length": 550.2,
+      "epoch": 0.74,
+      "frac_reward_zero_std": 0.4,
+      "grad_norm": 0.10866400599479675,
+      "kl": 0.001310768094845116,
+      "learning_rate": 8.685355961895784e-06,
+      "loss": -0.0001609708764590323,
+      "reward": 0.5,
+      "reward_std": 0.42426406145095824,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.42426406145095824,
+      "step": 370
+    },
+    {
+      "clip_ratio/high_max": 0.0003552397945895791,
+      "clip_ratio/high_mean": 0.0003552397945895791,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0003552397945895791,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 585.6,
+      "completions/mean_length": 557.0,
+      "completions/min_length": 528.4,
+      "epoch": 0.75,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.007570336107164621,
+      "kl": 0.0019031562842428684,
+      "learning_rate": 8.067960709356478e-06,
+      "loss": 0.00010980634251609445,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 375
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 827.2,
+      "completions/mean_length": 776.1,
+      "completions/min_length": 725.0,
+      "epoch": 0.76,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0032681154552847147,
+      "kl": 0.0015106519451364875,
+      "learning_rate": 7.469080393187786e-06,
+      "loss": 6.0344923986122013e-05,
+      "reward": 0.2,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.2,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 380
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 567.6,
+      "completions/mean_length": 527.8,
+      "completions/min_length": 488.0,
+      "epoch": 0.77,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0029238348361104727,
+      "kl": 0.001659035962074995,
+      "learning_rate": 6.889369880222776e-06,
+      "loss": 6.735894712619483e-05,
+      "reward": 0.8,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 385
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 665.2,
+      "completions/mean_length": 576.4,
+      "completions/min_length": 487.6,
+      "epoch": 0.78,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0015165195800364017,
+      "kl": 0.0015694351401180028,
+      "learning_rate": 6.329463075396161e-06,
+      "loss": 6.140409386716783e-05,
+      "reward": 0.4,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 390
+    },
+    {
+      "clip_ratio/high_max": 0.00020020019728690387,
+      "clip_ratio/high_mean": 0.00020020019728690387,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00020020019728690387,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 624.0,
+      "completions/mean_length": 603.8,
+      "completions/min_length": 583.6,
+      "epoch": 0.79,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.00223752879537642,
+      "kl": 0.0014426506008021534,
+      "learning_rate": 5.78997222857853e-06,
+      "loss": 9.092516265809535e-05,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 395
+    },
+    {
+      "clip_ratio/high_max": 0.0008895917097106576,
+      "clip_ratio/high_mean": 0.0008895917097106576,
+      "clip_ratio/low_mean": 0.0002871500328183174,
+      "clip_ratio/low_min": 0.0002871500328183174,
+      "clip_ratio/region_mean": 0.001176741742528975,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 674.0,
+      "completions/mean_length": 605.1,
+      "completions/min_length": 536.2,
+      "epoch": 0.8,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.006541391368955374,
+      "kl": 0.002322370233014226,
+      "learning_rate": 5.271487265090163e-06,
+      "loss": 0.00022997541818767787,
+      "reward": 0.4,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 400
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 405.2,
+      "completions/mean_length": 385.8,
+      "completions/min_length": 366.4,
+      "epoch": 0.81,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.006315944250673056,
+      "kl": 0.002596192993223667,
+      "learning_rate": 4.7745751406263165e-06,
+      "loss": 0.00014588373014703392,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 405
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 631.6,
+      "completions/mean_length": 595.8,
+      "completions/min_length": 560.0,
+      "epoch": 0.82,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0020425012335181236,
+      "kl": 0.0014671742217615246,
+      "learning_rate": 4.299779221299499e-06,
+      "loss": 5.95603312831372e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 410
+    },
+    {
+      "clip_ratio/high_max": 0.00021141648758202792,
+      "clip_ratio/high_mean": 0.00021141648758202792,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00021141648758202792,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 426.2,
+      "completions/mean_length": 407.7,
+      "completions/min_length": 389.2,
+      "epoch": 0.83,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.003026613499969244,
+      "kl": 0.0029249578481540086,
+      "learning_rate": 3.847618689476612e-06,
+      "loss": -3.8415665039792656e-05,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 415
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 556.2,
+      "completions/mean_length": 511.3,
+      "completions/min_length": 466.4,
+      "epoch": 0.84,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0027313604950904846,
+      "kl": 0.00170407232362777,
+      "learning_rate": 3.418587976060653e-06,
+      "loss": 6.951598916202784e-05,
+      "reward": 0.2,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.2,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 420
+    },
+    {
+      "clip_ratio/high_max": 0.00044247787445783615,
+      "clip_ratio/high_mean": 0.00044247787445783615,
+      "clip_ratio/low_mean": 0.00044247787445783615,
+      "clip_ratio/low_min": 0.00044247787445783615,
+      "clip_ratio/region_mean": 0.0008849557489156723,
+      "completions/clipped_ratio": 0.3,
+      "completions/max_length": 719.6,
+      "completions/mean_length": 680.4,
+      "completions/min_length": 641.2,
+      "epoch": 0.85,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0022066642995923758,
+      "kl": 0.004893560777418315,
+      "learning_rate": 3.013156219837776e-06,
+      "loss": 0.00014771391870453953,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 425
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 684.0,
+      "completions/mean_length": 615.6,
+      "completions/min_length": 547.2,
+      "epoch": 0.86,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.004402415361255407,
+      "kl": 0.0030957374721765516,
+      "learning_rate": 2.6317667544809134e-06,
+      "loss": 0.00012539406307041646,
+      "reward": 0.8,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 430
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 491.6,
+      "completions/mean_length": 429.4,
+      "completions/min_length": 367.2,
+      "epoch": 0.87,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0023908629082143307,
+      "kl": 0.005757934390567243,
+      "learning_rate": 2.2748366237709374e-06,
+      "loss": 0.00023677514400333166,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 435
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.4,
+      "completions/max_length": 941.0,
+      "completions/mean_length": 845.1,
+      "completions/min_length": 749.2,
+      "epoch": 0.88,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0024610680993646383,
+      "kl": 0.0023990799207240345,
+      "learning_rate": 1.9427561255653816e-06,
+      "loss": 8.56145576108247e-05,
+      "reward": 0.1,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.1,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 440
+    },
+    {
+      "clip_ratio/high_max": 0.0003554502269253135,
+      "clip_ratio/high_mean": 0.0003554502269253135,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0003554502269253135,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 717.8,
+      "completions/mean_length": 628.7,
+      "completions/min_length": 539.6,
+      "epoch": 0.89,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0033430650364607573,
+      "kl": 0.00213278106530197,
+      "learning_rate": 1.6358883850134816e-06,
+      "loss": 7.869623950682581e-05,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 445
+    },
+    {
+      "clip_ratio/high_max": 0.00014947683084756135,
+      "clip_ratio/high_mean": 0.00014947683084756135,
+      "clip_ratio/low_mean": 0.00014947683084756135,
+      "clip_ratio/low_min": 0.00014947683084756135,
+      "clip_ratio/region_mean": 0.0002989536616951227,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 561.2,
+      "completions/mean_length": 541.5,
+      "completions/min_length": 521.8,
+      "epoch": 0.9,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.00285865506157279,
+      "kl": 0.0031095960177481173,
+      "learning_rate": 1.3545689574841342e-06,
+      "loss": 0.00018892092630267142,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 450
+    },
+    {
+      "clip_ratio/high_max": 0.00017857142956927418,
+      "clip_ratio/high_mean": 0.00017857142956927418,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.00017857142956927418,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 673.4,
+      "completions/mean_length": 612.5,
+      "completions/min_length": 551.6,
+      "epoch": 0.91,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.004231106955558062,
+      "kl": 0.001816297578625381,
+      "learning_rate": 1.0991054616410589e-06,
+      "loss": 2.4761457461863755e-05,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 455
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 774.4,
+      "completions/mean_length": 719.6,
+      "completions/min_length": 664.8,
+      "epoch": 0.92,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.0025418533477932215,
+      "kl": 0.001939354185014963,
+      "learning_rate": 8.697772430662859e-07,
+      "loss": 7.737103151157498e-05,
+      "reward": 0.4,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 460
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 511.0,
+      "completions/mean_length": 471.9,
+      "completions/min_length": 432.8,
+      "epoch": 0.93,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.009396975859999657,
+      "kl": 0.0025742474826984107,
+      "learning_rate": 6.668350687998565e-07,
+      "loss": 0.00010292576625943184,
+      "reward": 0.8,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 465
+    },
+    {
+      "clip_ratio/high_max": 0.000551610765978694,
+      "clip_ratio/high_mean": 0.000551610765978694,
+      "clip_ratio/low_mean": 9.881423320621253e-05,
+      "clip_ratio/low_min": 9.881423320621253e-05,
+      "clip_ratio/region_mean": 0.0006504249759018421,
+      "completions/clipped_ratio": 0.2,
+      "completions/max_length": 705.0,
+      "completions/mean_length": 675.6,
+      "completions/min_length": 646.2,
+      "epoch": 0.94,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.1849575787782669,
+      "kl": 0.0012461108970455825,
+      "learning_rate": 4.905008531297661e-07,
+      "loss": 3.4519674954935907e-06,
+      "reward": 0.6,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 470
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0001932367100380361,
+      "clip_ratio/low_min": 0.0001932367100380361,
+      "clip_ratio/region_mean": 0.0001932367100380361,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 704.2,
+      "completions/mean_length": 656.1,
+      "completions/min_length": 608.0,
+      "epoch": 0.95,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.17005588114261627,
+      "kl": 0.0020033617503941057,
+      "learning_rate": 3.4096741493194197e-07,
+      "loss": 0.00018819719552993775,
+      "reward": 0.4,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.4,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 475
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 394.0,
+      "completions/mean_length": 366.8,
+      "completions/min_length": 339.6,
+      "epoch": 0.96,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.0033463104628026485,
+      "kl": 0.0017193612293340266,
+      "learning_rate": 2.1839826682562015e-07,
+      "loss": 0.00022480501793324946,
+      "reward": 0.5,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.5,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 480
+    },
+    {
+      "clip_ratio/high_max": 0.0005300353281199932,
+      "clip_ratio/high_mean": 0.0005300353281199932,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0005300353281199932,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 701.2,
+      "completions/mean_length": 642.3,
+      "completions/min_length": 583.4,
+      "epoch": 0.97,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.17250776290893555,
+      "kl": 0.0015719524584710599,
+      "learning_rate": 1.229274363747146e-07,
+      "loss": 1.5123013872653246e-05,
+      "reward": 0.8,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 485
+    },
+    {
+      "clip_ratio/high_max": 0.00020439447835087776,
+      "clip_ratio/high_mean": 0.00020439447835087776,
+      "clip_ratio/low_mean": 0.00025062656495720146,
+      "clip_ratio/low_min": 0.00025062656495720146,
+      "clip_ratio/region_mean": 0.00045502104330807924,
+      "completions/clipped_ratio": 0.1,
+      "completions/max_length": 558.2,
+      "completions/mean_length": 510.3,
+      "completions/min_length": 462.4,
+      "epoch": 0.98,
+      "frac_reward_zero_std": 0.6,
+      "grad_norm": 0.3301871418952942,
+      "kl": 0.0023324352921918036,
+      "learning_rate": 5.4659319530636633e-08,
+      "loss": 0.00013293407391756773,
+      "reward": 0.8,
+      "reward_std": 0.2828427076339722,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.8,
+      "rewards/MathAccuracy/std": 0.2828427076339722,
+      "step": 490
+    },
+    {
+      "clip_ratio/high_max": 0.0,
+      "clip_ratio/high_mean": 0.0,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 636.6,
+      "completions/mean_length": 614.3,
+      "completions/min_length": 592.0,
+      "epoch": 0.99,
+      "frac_reward_zero_std": 1.0,
+      "grad_norm": 0.005240611266344786,
+      "kl": 0.002130005625076592,
+      "learning_rate": 1.3668566476848777e-08,
+      "loss": 8.560363785363734e-05,
+      "reward": 0.6,
+      "reward_std": 0.0,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.6,
+      "rewards/MathAccuracy/std": 0.0,
+      "step": 495
+    },
+    {
+      "clip_ratio/high_max": 0.0001826483989134431,
+      "clip_ratio/high_mean": 0.0001826483989134431,
+      "clip_ratio/low_mean": 0.0,
+      "clip_ratio/low_min": 0.0,
+      "clip_ratio/region_mean": 0.0001826483989134431,
+      "completions/clipped_ratio": 0.0,
+      "completions/max_length": 640.0,
+      "completions/mean_length": 551.5,
+      "completions/min_length": 463.0,
+      "epoch": 1.0,
+      "frac_reward_zero_std": 0.8,
+      "grad_norm": 0.15998604893684387,
+      "kl": 0.002341361262369901,
+      "learning_rate": 0.0,
+      "loss": 0.00012646716786548495,
+      "reward": 0.7,
+      "reward_std": 0.1414213538169861,
+      "rewards/Format/mean": 0.0,
+      "rewards/Format/std": 0.0,
+      "rewards/MathAccuracy/mean": 0.7,
+      "rewards/MathAccuracy/std": 0.1414213538169861,
+      "step": 500
+    }
+  ],
+  "logging_steps": 5,
+  "max_steps": 500,
+  "num_input_tokens_seen": 0,
+  "num_train_epochs": 1,
+  "save_steps": 50,
+  "stateful_callbacks": {
+    "TrainerControl": {
+      "args": {
+        "should_epoch_stop": false,
+        "should_evaluate": false,
+        "should_log": false,
+        "should_save": true,
+        "should_training_stop": true
+      },
+      "attributes": {}
+    }
+  },
+  "total_flos": 0.0,
+  "train_batch_size": 2,
+  "trial_name": null,
+  "trial_params": null
+}

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fe8cd3b554a99b402d9df5338b8b754a0bc0bd19dac781acfe9af54c1140038f
+size 10001