The purpose of OvO-R1 is to explore how end-to-end reinforcement learning and various reward functions influence the reasoning capabilities of different base models (Qwen2.5-1.5B / Qwen2.5-1.5B-Math / Qwen2.5-1.5B-Instruct).
- RL training of models at the Qwen2.5-1.5B / Qwen2.5-1.5B-Math / Qwen2.5-1.5B-Instruct scale
- We use a 0.75k-sample dataset for a fast training loop; more experiments on larger datasets are coming soon
- We release wandb logs for comparing different base models trained with GRPO
- We are exploring the impact of various reward functions on these models; a sketch of typical reward functions is shown below
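
Below is a minimal sketch of two R1-style reward functions (a format reward and an accuracy reward) of the kind typically combined in GRPO training. The function names, the `<think>/<answer>` template, and the `answer` dataset column are illustrative assumptions, not necessarily the exact rewards implemented in `src/ovo_r1/grpo.py`.

```python
# Illustrative sketch of two R1-style reward functions for GRPO.
# Assumptions: completions arrive as plain strings and the dataset provides
# an "answer" column with the reference solution.
import re

def format_reward(completions, **kwargs):
    """1.0 if the completion follows <think>...</think><answer>...</answer>, else 0.0."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]

def accuracy_reward(completions, answer, **kwargs):
    """1.0 if the text inside <answer>...</answer> matches the reference answer, else 0.0."""
    rewards = []
    for completion, reference in zip(completions, answer):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        rewards.append(1.0 if predicted == str(reference).strip() else 0.0)
    return rewards
```

Swapping or re-weighting such functions is how the impact of different reward designs can be compared across the three base models.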
```bash
conda create -n ovo_r1 python=3.11
conda activate ovo_r1
```

and

```bash
pip install -r requirements.txt
```
| Model | OvO-R1 | OvO-R1-Math | OvO-R1-Instruct |
|---|---|---|---|
| Base Model | Qwen2.5-1.5B | Qwen2.5-1.5B-Math | Qwen2.5-1.5B-Instruct |
| Dataset_mini | X-R1-750 | X-R1-750 | X-R1-750 |
| Dataset_middle | - | - | - |
| Dataset_large | - | - | - |
| Config: recipes | OvO_R1_config.yaml | OvO_R1_math_config.yaml | OvO_R1_instruct_config.yaml |
| num_generations | 8 | 8 | 8 |
| max_completion_length | 1024 | 1024 | 1024 |
| num_train_epochs | 3 | 3 | 3 |
To train the three models, run the corresponding command:
```bash
# OvO-R1 (Qwen2.5-1.5B base)
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/zero3.yaml --num_processes=3 \
    src/ovo_r1/grpo.py --config recipes/OvO_R1_config.yaml > ./output/ovo_r1.log

# OvO-R1-Math (Qwen2.5-1.5B-Math base)
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/zero3.yaml --num_processes=3 \
    src/ovo_r1/grpo.py --config recipes/OvO_R1_math_config.yaml > ./output/ovo_r1_math.log

# OvO-R1-Instruct (Qwen2.5-1.5B-Instruct base)
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/zero3.yaml --num_processes=3 \
    src/ovo_r1/grpo.py --config recipes/OvO_R1_instruct_config.yaml > ./output/ovo_r1_instruct.log
```
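
For reference, a minimal, hedged sketch of a GRPO training loop with trl's `GRPOTrainer` is shown below, using the hyperparameters from the table above. The dataset path, output directory, and reward functions (taken from the sketch earlier in this README) are illustrative assumptions; the actual training logic lives in `src/ovo_r1/grpo.py` and is driven by the YAML recipes.

```python
# Hedged sketch of a minimal GRPO training loop with trl (illustrative only;
# the repo's grpo.py reads these settings from the recipes/*.yaml files).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset id: substitute the actual X-R1-750 dataset path.
dataset = load_dataset("path/to/X-R1-750", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B",                     # or the Math / Instruct variants
    reward_funcs=[format_reward, accuracy_reward], # e.g. the reward functions sketched above
    args=GRPOConfig(
        output_dir="output/ovo_r1",                # assumption: actual path set by the recipe
        num_generations=8,                         # completions sampled per prompt (group size)
        max_completion_length=1024,                # max tokens generated per completion
        num_train_epochs=3,
    ),
    train_dataset=dataset,
)
trainer.train()
```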
Our emails are xuzhaoli2001@gmail.com and xuchenli1030@gmail.com.
Any discussions and suggestions are welcome!