
Added new Post-training an LLM using GRPO with TRL recipe 🧑‍🍳️ #278

Merged
17 commits merged into huggingface:main from llm-grpo-trl on Feb 5, 2025

Conversation

sergiopaniego
Contributor

What does this PR do?

Draft! Still in progress...

Fixes #277

Who can review?

@merveenoyan and @stevhliu

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@sergiopaniego sergiopaniego marked this pull request as ready for review January 29, 2025 14:47
@sergiopaniego
Contributor Author

Ready to be reviewed 😄

@merveenoyan @stevhliu @qgallouedec

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@qgallouedec qgallouedec left a comment


Nice! A few remarks:

  • You can now use trl==0.14 (the latest release, published today) for GRPO. Plus, datasets, accelerate and transformers are trl deps:
- !pip install  -U -q transformers trl datasets peft accelerate
+ !pip install  -U -q trl peft
  • You don't need to pass the tokenizer; GRPOTrainer will load it for you.
  • You write:

In the case of the DeepSeek-R1 training, they use an accuracy-based reward model to evaluate whether the response is correct, along with a format-based reward that ensures the model places its reasoning process between tags. You can find more details here.

Why don't you use a similar reward function as well? If you remove remove_unused_columns=True, you'll get access to the "solution" column of the dataset in the reward function.
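
A minimal sketch of what such a reward function could look like (illustrative only, not the PR's code; the conversational completion format and the exact-match comparison are assumptions):

def accuracy_reward(completions, solution, **kwargs):
    # "solution" is the dataset column made available when unused columns are kept
    rewards = []
    for completion, reference in zip(completions, solution):
        content = completion[0]["content"]  # assumes conversational-format completions
        # naive containment check standing in for a proper math verifier
        rewards.append(1.0 if reference.strip() in content else 0.0)
    return rewards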


@burtenshaw burtenshaw Jan 30, 2025


OpenAI o1-o3 models >> OpenAI o1 and o3 models

exclusively employs pure RL >> exclusively employs RL | employs pure RL

to handle more complex and nuanced tasks >> to handle complex and nuanced tasks

Maybe link to the diagram in the TRL [docs](https://huggingface.co/docs/trl/main/en/grpo_trainer)



@burtenshaw burtenshaw Jan 30, 2025


Line #2.    # Tested with transformers==4.48.1, trl==0.14.0.dev0, datasets==3.2.0, peft==0.14.0, accelerate==1.3.0

I think trl was just released so you can drop the dev release.



@burtenshaw burtenshaw Jan 30, 2025


Line #1.    !pip install git+https://github.com/huggingface/trl.git@main

As above, I think that means we can skip installing from main.



@burtenshaw burtenshaw Jan 30, 2025


Line #1.    print(train_dataset[0])

If you wanted, you could render this math with IPython:

from IPython.display import display, Math
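
For instance (illustrative only; train_dataset comes from the notebook and the "problem" column name is an assumption about the dataset schema):

from IPython.display import display, Math

sample = train_dataset[0]
# assumes the column holds a LaTeX expression; adjust the key to the real schema
display(Math(sample["problem"]))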


@burtenshaw burtenshaw Jan 30, 2025


I think this should say what the 'baseline model' is in relation to the figure above.



@burtenshaw burtenshaw Jan 30, 2025


I would move this paragraph up and then introduce this notebook's implementation, by saying something like "We will simplify this slight...".

In the case of the DeepSeek-R1 training, they use an >> For training, the DeepSeek-R1 authors used an



@burtenshaw

Looks really good. I left some small readability nits.

On the title, have you thought about something that relates a bit more to the task, e.g. "Post training an LLM for reasoning with GRPO in TRL"?

@qgallouedec
Member

Can you add the notebook to https://huggingface.co/docs/trl/en/community_tutorials when it's merged?

Member

@stevhliu stevhliu Jan 30, 2025


I would consider leaving only the most important text you want to highlight in bold, so as not to make it too distracting. For example, I think text like Group Relative Policy Optimization can be bold, but it's not necessary to bold Large Language Model.


Member

@stevhliu stevhliu Jan 30, 2025


Let's see how and example looks like >> Let's take a look at an example


Member

@stevhliu stevhliu Jan 30, 2025


To begin, we'll load the baseline model >> To begin, we'll load Qwen/Qwen2-0.5B-Instruct as the baseline model. With only 0.5 billion parameters, it is lightweight and fits within the available resources.
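
For reference, a minimal sketch of loading that baseline model with transformers (not the notebook's exact cell; dtype and device placement are assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# let transformers pick the dtype and place the model on the available device
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")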


Member

@stevhliu stevhliu Jan 30, 2025


I think it'd be easier to follow if we do something like:

In this case, we will use two reward functions. The first reward function assigns higher scores to longer completions.

<code for length_reward function here>

The second reward function ensures the generation follows a specific format, using ...

<code for format_reward function here>
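
Hypothetical sketches of those two functions (not the recipe's actual code; the conversational completion format and the <think>/<answer> tag names are assumptions):

import re

def length_reward(completions, **kwargs):
    # assign higher scores to longer completions
    return [float(len(completion[0]["content"])) for completion in completions]

def format_reward(completions, **kwargs):
    # 1.0 if the completion matches a <think>...</think><answer>...</answer> layout, else 0.0
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in contents]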


Member

@stevhliu stevhliu Jan 30, 2025


we pass a list of reward functions to the trainer that we previously defined >>> we pass the two reward functions we previously defined to the trainer
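
A hedged sketch of how that wiring could look with GRPOTrainer (argument values are placeholders, not the recipe's actual configuration):

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")  # placeholder output directory
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[length_reward, format_reward],  # the functions sketched above
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()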


Member

@stevhliu stevhliu Jan 30, 2025


>>> Time to train the model!


Member

@stevhliu stevhliu Jan 30, 2025


anser >>> answer


Member

@stevhliu stevhliu left a comment


Great job!

@sergiopaniego
Contributor Author

Thanks a lot for the feedback @qgallouedec @burtenshaw @stevhliu!! Really interesting suggestions that I believe improve the overall quality a lot 😄

I've incorporated your feedback and updated the recipe accordingly. Following @qgallouedec's suggestion, I introduced a third reward function for accuracy, though the results and conclusions remain largely similar. I've also restructured some sections and expanded the final part with observations that I believe will be relevant to readers.

Let me know your thoughts! 😊

Member

@qgallouedec qgallouedec Jan 31, 2025


Just to be clear, even though the completion length increases in the R1 paper, there is no incentive, i.e. no explicit reward, for this.

I think you should remove this length reward (the completion length is logged anyway if you want to monitor it).



@sergiopaniego
Contributor Author

Thanks for the feedback @qgallouedec! 😄
I've removed the length reward function to better align with R1. Initially, I included slightly different reward functions to demonstrate that the trainer could handle them. However, since we've now added the two "official" reward functions, I agree that simplifying and sticking to just those makes more sense.

Member

@stevhliu stevhliu left a comment


LGTM, thanks again! 👏

@stevhliu stevhliu merged commit 063ca3e into huggingface:main Feb 5, 2025
1 check passed
@sergiopaniego sergiopaniego deleted the llm-grpo-trl branch February 5, 2025 21:47