Added new Post-training an LLM using GRPO with TRL recipe 🧑‍🍳 #278
Conversation
Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
Ready to be reviewed 😄
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Nice! A few remarks:
- You can now use trl==0.14 (latest, released today) for GRPO. Plus, datasets, accelerate and transformers are trl deps:
- !pip install -U -q transformers trl datasets peft accelerate
+ !pip install -U -q trl peft
- You don't need to pass the tokenizer. GRPO will load it for you
- You write:
In the case of the DeepSeek-R1 training, they use an accuracy-based reward model to evaluate whether the response is correct, along with a format-based reward that ensures the model places its reasoning process between tags. You can find more details here.
Why don't you use a similar reward function as well? If you remove remove_unused_columns=True, you'll get access to the "solution" column of the dataset in the reward function.
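For illustration, an accuracy-based reward along those lines could look roughly like the sketch below. The function name, the `<answer>...</answer>` convention and the plain string comparison are assumptions for the sketch, not the notebook's final code; with the extra dataset columns kept, TRL passes them (here `solution`) to the reward function as keyword arguments.

```python
import re

def accuracy_reward(completions, solution, **kwargs):
    """Sketch: reward 1.0 when the text inside <answer>...</answer> matches the
    dataset's `solution` column, 0.0 otherwise. A real setup would likely use a
    math verifier rather than an exact string comparison."""
    rewards = []
    for completion, sol in zip(completions, solution):
        content = completion[0]["content"]  # conversational format: list of message dicts
        match = re.search(r"<answer>(.*?)</answer>", content, re.DOTALL)
        answer = match.group(1).strip() if match else ""
        rewards.append(1.0 if answer == str(sol).strip() else 0.0)
    return rewards
```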
OpenAI o1-o3 models >> OpenAI o1 and o3 models
exclusively employs pure RL >> exclusively employs RL | employs pure RL
to handle more complex and nuanced tasks >> to handle complex and nuanced tasks
Maybe link to the diagram in the TRL [docs](https://huggingface.co/docs/trl/main/en/grpo_trainer)
Line #2. # Tested with transformers==4.48.1, trl==0.14.0.dev0, datasets==3.2.0, peft==0.14.0, accelerate==1.3.0
I think trl was just released so you can drop the dev release.
Line #1. !pip install git+https://github.com/huggingface/trl.git@main
As above, I think that means we can skip installing from main.
Line #1. print(train_dataset[0])
If you wanted, you could render this math with IPython:
from IPython.display import display, Math
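A minimal sketch of what that could look like (the `problem` column name is an assumption about how the example stores its LaTeX statement):

```python
from IPython.display import Math, display

# Render the LaTeX problem statement of the first training example
# (the "problem" key is an assumption for this sketch).
display(Math(train_dataset[0]["problem"]))
```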
I think this should say what the 'baseline model' is in relation to the figure above.
I would move this paragraph up and then introduce this notebook's implementation, by saying something like "We will simplify this slight..."
In the case of the DeepSeek-R1 training, they use an >> For training, the DeepSeek-R1 authors used an
Looks really good. I left some small readability nits. On the title, have you thought about something that relates a bit more to the task, i.e. "Post training an LLM for reasoning with GRPO in TRL"?
Can you add the notebook to https://huggingface.co/docs/trl/en/community_tutorials when it's merged?
I would consider leaving only the most important text you want to highlight in bold, so as not to make it too distracting. For example, I think text like Group Relative Policy Optimization can be bold, but it's not necessary to bold Large Language Model.
To begin, we'll load the baseline model >> To begin, we'll load Qwen/Qwen2-0.5B-Instruct as the baseline model. With only 0.5 billion parameters, it is lightweight and fits within the available resources.
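A hedged sketch of that loading step (the dtype and device placement arguments are assumptions, not the notebook's exact cell):

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2-0.5B-Instruct"

# Load the 0.5B baseline model; bfloat16 and device_map="auto" are assumptions
# meant to keep memory usage low on modest hardware.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```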
I think it'd be easier to follow if we do something like:
In this case, we will use two reward functions. The first reward function assigns higher scores to longer completions.
<code for length_reward function here>
The second reward function ensures the generation follows a specific format, using ...
<code for format_reward function here>
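To make that structure concrete, here are hedged sketches of what the two functions might look like (the names, the character-length heuristic and the `<think>/<answer>` tags are assumptions, not the notebook's exact code):

```python
import re

def length_reward(completions, **kwargs):
    """Sketch: assign higher scores to longer completions, using character count
    as a crude proxy for length."""
    contents = [completion[0]["content"] for completion in completions]
    return [len(content) / 1000.0 for content in contents]

def format_reward(completions, **kwargs):
    """Sketch: 1.0 if the completion follows <think>...</think><answer>...</answer>,
    0.0 otherwise."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, content, re.DOTALL) else 0.0 for content in contents]
```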
we pass a list of reward functions to the trainer that we previously defined >> we pass the two reward functions we previously defined to the trainer
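Roughly, and assuming the GRPOTrainer API from trl 0.14 (the config values are placeholders, and `train_dataset` plus the two reward functions are assumed to be defined in earlier cells):

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",  # placeholder output path
    logging_steps=10,              # placeholder logging frequency
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",             # a model id string also works here
    reward_funcs=[length_reward, format_reward],  # the reward functions defined earlier
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```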
Great job!
Thanks a lot for the feedback @qgallouedec @burtenshaw @stevhliu!! Really interesting suggestions that I believe improve the overall quality a lot 😄 I've incorporated your feedback and updated the recipe accordingly. Following @qgallouedec's suggestion, I introduced a third reward function for accuracy, though the results and conclusions remain largely similar. I've also restructured some sections and expanded the final part with observations that I believe will be relevant to readers. Let me know your thoughts! 😊
Just to be clear, even if in the R1 paper the completion length increases, there is no incentive, i.e. no explicit reward, for this.
I think you should remove this length reward (the completion length is logged anyway if you want to monitor it).
Thanks for the feedback @qgallouedec! 😄
LGTM, thanks again! 👏
What does this PR do?
Draft! Still in progress...
Fixes #277
Who can review?
@merveenoyan and @stevhliu