
Added new Post-training an LLM using GRPO with TRL recipe 🧑‍🍳️ #278

Merged
17 commits merged into huggingface:main from llm-grpo-trl on Feb 5, 2025

Conversation

sergiopaniego
Contributor

What does this PR do?

Draft! Still in progress...

Fixes #277

Who can review?

@merveenoyan and @stevhliu

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@sergiopaniego sergiopaniego marked this pull request as ready for review January 29, 2025 14:47
@sergiopaniego
Contributor Author

Ready to be reviewed 😄

@merveenoyan @stevhliu @qgallouedec

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@qgallouedec qgallouedec left a comment


Nice! A few remarks:

  • You can now use trl==0.14 (the latest release, published today) for GRPO. Plus, datasets, accelerate and transformers are trl deps:
- !pip install  -U -q transformers trl datasets peft accelerate
+ !pip install  -U -q trl peft
  • You don't need to pass the tokenizer; GRPOTrainer will load it for you.
  • You write:

In the case of the DeepSeek-R1 training, they use an accuracy-based reward model to evaluate whether the response is correct, along with a format-based reward that ensures the model places its reasoning process between tags. You can find more details here.

Why don't you use a similar reward function as well? If you remove remove_unused_columns=True, you'll get access to the "solution" column of the dataset in the reward function.
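
A minimal sketch of what such a reward function could look like (illustrative only, not the PR's code; the conversational completion format and the exact-match comparison are assumptions):

def accuracy_reward(completions, solution, **kwargs):
    # "solution" is the dataset column made available when unused columns are kept
    rewards = []
    for completion, reference in zip(completions, solution):
        content = completion[0]["content"]  # assumes conversational-format completions
        # naive containment check standing in for a proper math verifier
        rewards.append(1.0 if reference.strip() in content else 0.0)
    return rewards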


@burtenshaw burtenshaw Jan 30, 2025


OpenAI o1-o3 models >> OpenAI o1 and o3 models

exclusively employs pure RL >> exclusively employs RL | employs pure RL

to handle more complex and nuanced tasks >> to handle complex and nuanced tasks

Maybe link to the diagram in the TRL [docs](https://huggingface.co/docs/trl/main/en/grpo_trainer)



@burtenshaw burtenshaw Jan 30, 2025


Line #2.    # Tested with transformers==4.48.1, trl==0.14.0.dev0, datasets==3.2.0, peft==0.14.0, accelerate==1.3.0

I think trl was just released so you can drop the dev release.



@burtenshaw burtenshaw Jan 30, 2025


Line #1.    !pip install git+https://github.com/huggingface/trl.git@main

As above, I think that means we can skip installing from main.



@burtenshaw burtenshaw Jan 30, 2025


Line #1.    print(train_dataset[0])

If you wanted, you could render this math with IPython:

from IPython.display import display, Math
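
For instance (illustrative only; train_dataset comes from the notebook and the "problem" column name is an assumption about the dataset schema):

from IPython.display import display, Math

sample = train_dataset[0]
# assumes the column holds a LaTeX expression; adjust the key to the real schema
display(Math(sample["problem"]))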


@burtenshaw burtenshaw Jan 30, 2025


I think this should say what the 'baseline model' is in relation to the figure above.



@burtenshaw burtenshaw Jan 30, 2025


I would move this paragraph up and then introduce this notebook's implementation, by saying something like "We will simplify this slight...".

In the case of the DeepSeek-R1 training, they use an >> For training, the DeepSeek-R1 authors used an



@burtenshaw

Looks really good. I left some small readability nits.

On the title, have you thought about something that relates a bit more to the task, e.g. "Post training an LLM for reasoning with GRPO in TRL"?

@qgallouedec
Member

Can you add the notebook to https://huggingface.co/docs/trl/en/community_tutorials when it's merged?

Member

@stevhliu stevhliu Jan 30, 2025


I would consider leaving only the most important text you want to highlight in bold, so as not to make it too distracting. For example, I think text like Group Relative Policy Optimization can be bold, but it's not necessary to bold Large Language Model.


Member

@stevhliu stevhliu Jan 30, 2025


Let's see how and example looks like >> Let's take a look at an example


Member

@stevhliu stevhliu Jan 30, 2025


To begin, we'll load the baseline model >> To begin, we'll load Qwen/Qwen2-0.5B-Instruct as the baseline model. With only 0.5 billion parameters, it is lightweight and fits within the available resources.
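
For reference, a minimal sketch of loading that baseline model with transformers (not the notebook's exact cell; dtype and device placement are assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# let transformers pick the dtype and place the model on the available device
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")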


Member

@stevhliu stevhliu Jan 30, 2025


I think it'd be easier to follow if we do something like:

In this case, we will use two reward functions. The first reward function assigns higher scores to longer completions.

<code for length_reward function here>

The second reward function ensures the generation follows a specific format, using ...

<code for format_reward function here>
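
Hypothetical sketches of those two functions (not the recipe's actual code; the conversational completion format and the <think>/<answer> tag names are assumptions):

import re

def length_reward(completions, **kwargs):
    # assign higher scores to longer completions
    return [float(len(completion[0]["content"])) for completion in completions]

def format_reward(completions, **kwargs):
    # 1.0 if the completion matches a <think>...</think><answer>...</answer> layout, else 0.0
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, c, re.DOTALL) else 0.0 for c in contents]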


Member

@stevhliu stevhliu Jan 30, 2025


we pass a list of reward functions to the trainer that we previously defined >>> we pass the two reward functions we previously defined to the trainer
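
A hedged sketch of how that wiring could look with GRPOTrainer (argument values are placeholders, not the recipe's actual configuration):

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")  # placeholder output directory
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[length_reward, format_reward],  # the functions sketched above
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()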


Member

@stevhliu stevhliu Jan 30, 2025


>>> Time to train the model!


Member

@stevhliu stevhliu Jan 30, 2025


anser >>> answer


Member

@stevhliu stevhliu left a comment


Great job!

@sergiopaniego
Contributor Author

Thanks a lot for the feedback @qgallouedec @burtenshaw @stevhliu!! Really interesting suggestions that I believe improve the overall quality a lot 😄

I've incorporated your feedback and updated the recipe accordingly. Following @qgallouedec's suggestion, I introduced a third reward function for accuracy, though the results and conclusions remain largely similar. I've also restructured some sections and expanded the final part with observations that I believe will be relevant to readers.

Let me know your thoughts! 😊

Member

@qgallouedec qgallouedec Jan 31, 2025


Just to be clear, even though the completion length increases in the R1 paper, there is no incentive, i.e. no explicit reward, for this.

I think you should remove this length reward (the completion length is logged anyway if you want to monitor it).



@sergiopaniego
Contributor Author

Thanks for the feedback @qgallouedec! 😄
I've removed the length reward function to better align with R1. Initially, I included slightly different reward functions to demonstrate that the trainer could handle them. However, since we've now added the two "official" reward functions, I agree that simplifying and sticking to just those makes more sense.

Member

@stevhliu stevhliu left a comment


LGTM, thanks again! 👏

@stevhliu stevhliu merged commit 063ca3e into huggingface:main Feb 5, 2025
1 check passed
@sergiopaniego sergiopaniego deleted the llm-grpo-trl branch February 5, 2025 21:47