
Make Trainer compatible with sharded checkpoints #17053

Merged
merged 2 commits into main from resume_from_sharded_checkpoint on May 3, 2022

Conversation

sgugger
Collaborator

@sgugger sgugger commented May 2, 2022

What does this PR do?

The Trainer is currently incompatible with the new sharded checkpoint feature in two places:

  • resuming from a checkpoint
  • loading the best model at the end of training

In both cases, the model state dict is loaded back into the model, but there is no single model save file when the model is above the default sharding size, which results in errors (as pointed out in #16976).

This PR addresses this by:

  1. Creating a new function load_sharded_checkpoint that does the same thing as model.load_state_dict for regular model files, but loads a sharded checkpoint (and errors on missing/unexpected keys when strict=True).
  2. Using that function inside the Trainer in the two places mentioned above.

A test is added to make sure resuming works from a sharded checkpoint.

Fixes #16976
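The resume logic the PR fixes hinges on first detecting which kind of checkpoint a folder holds. Below is a minimal sketch of that check, assuming the filename conventions transformers used at the time (pytorch_model.bin for a single-file checkpoint, pytorch_model.bin.index.json for a sharded one); checkpoint_kind is a hypothetical helper for illustration, not a function from this PR:

```python
import os

# Filenames follow the transformers convention at the time of this PR;
# treat them as assumptions rather than a pinned API.
WEIGHTS_NAME = "pytorch_model.bin"
WEIGHTS_INDEX_NAME = "pytorch_model.bin.index.json"

def checkpoint_kind(folder):
    """Classify a checkpoint folder (hypothetical helper, not from the PR).

    Returns "full" when a single-file checkpoint is present, "sharded" when
    only the shard index is present, and raises otherwise -- mirroring the
    decision the Trainer must make before loading weights on resume.
    """
    if os.path.isfile(os.path.join(folder, WEIGHTS_NAME)):
        return "full"
    if os.path.isfile(os.path.join(folder, WEIGHTS_INDEX_NAME)):
        return "sharded"
    raise ValueError(f"No checkpoint found in {folder}")
```

Before this PR, the Trainer only handled the "full" branch, which is why resuming from a large, automatically sharded checkpoint failed.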

@sgugger sgugger requested a review from LysandreJik May 2, 2022 18:13
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 2, 2022

The documentation is not available anymore as the PR was closed or merged.

Member

@LysandreJik LysandreJik left a comment


Looks great! I only added a few comments regarding docs, as I don't think we've put enough emphasis on what this is exactly and how it should be used, so that new users can pick it up easily.

"""
This is the same as
[`torch.nn.Module.load_state_dict`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html?highlight=load_state_dict#torch.nn.Module.load_state_dict)
but for a sharded checkpoint.
Member


I think it would be nice to have documentation for sharded checkpoints so that users understand what they are. For example, "sharded checkpoint" here could link to a small blurb covering things like the following:

  • What is it?
    • Weight files that are split across multiple checkpoint files
    • An index showing how the weights are linked to those files
  • Why is it important?
    • Working with smaller files is better for memory
    • Simpler to push to the Hub
  • How to work with it?
    • Showing how to use from_pretrained and save_pretrained with sharding
    • Pushing to the Hub
    • And now the Trainer

Let me know if that's something that already exists; if not, I'm happy to help contribute it (or to contribute it entirely).
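To make the "what is it" bullets concrete: a sharded save writes several weight files plus a JSON index mapping every weight name to the shard file that stores it. The toy sketch below splits a state dict into shards under a size budget and builds such an index, operating on byte sizes instead of real tensors; shard_state_dict and the exact filename pattern are illustrative assumptions, not the save_pretrained implementation:

```python
def shard_state_dict(sizes, max_shard_size):
    """Split a {param_name: byte_size} mapping into shards no larger than
    max_shard_size, and build the weight->file index that ties them together.
    Illustrative only: the real save path shards actual tensors.
    """
    shards, current, current_size = [], {}, 0
    for name, size in sizes.items():
        # Start a new shard once the current one would exceed the budget.
        if current and current_size + size > max_shard_size:
            shards.append(current)
            current, current_size = {}, 0
        current[name] = size
        current_size += size
    if current:
        shards.append(current)
    # The index maps every weight name to the shard file holding it.
    index = {
        "metadata": {"total_size": sum(sizes.values())},
        "weight_map": {
            name: f"pytorch_model-{i + 1:05d}-of-{len(shards):05d}.bin"
            for i, shard in enumerate(shards)
            for name in shard
        },
    }
    return shards, index
```

The index file is what makes resuming possible without one giant weights file: a loader can walk weight_map shard by shard instead of loading everything at once.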

@@ -327,6 +327,63 @@ def get_checkpoint_shard_files(
return cached_filenames, sharded_metadata


def load_sharded_checkpoint(model, folder, strict=True):
Member


Should this be in the docs somewhere?
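For intuition about what a function with this signature has to do: read the shard index, load each shard file in turn into the model, then enforce strictness over the union of loaded keys, like load_state_dict does. The sketch below is a toy analogue using plain dicts and JSON files in place of modules and torch.load; load_sharded_state is hypothetical and not the PR's code:

```python
import json
import os

def load_sharded_state(model_state, folder, strict=True):
    """Toy analogue of a sharded-checkpoint loader: model_state and the
    shards are plain dicts rather than nn.Modules and tensors.
    """
    index_file = os.path.join(folder, "pytorch_model.bin.index.json")
    with open(index_file) as f:
        index = json.load(f)
    # Each weight maps to one shard file; load each shard exactly once.
    shard_files = sorted(set(index["weight_map"].values()))
    loaded_keys = set()
    for shard_file in shard_files:
        with open(os.path.join(folder, shard_file)) as f:
            shard = json.load(f)  # the real loader would use torch.load here
        model_state.update({k: v for k, v in shard.items() if k in model_state})
        loaded_keys.update(shard)
    # Strictness check over the union of all shards, like load_state_dict.
    missing = set(model_state) - loaded_keys
    unexpected = loaded_keys - set(model_state)
    if strict and (missing or unexpected):
        raise RuntimeError(f"missing={sorted(missing)} unexpected={sorted(unexpected)}")
    return missing, unexpected
```

The key difference from load_state_dict is that missing/unexpected keys can only be judged after all shards have been visited, since no single shard contains the full state dict.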

Member

@LysandreJik LysandreJik left a comment


LGTM!

@sgugger sgugger merged commit a8fa2f9 into main May 3, 2022
@sgugger sgugger deleted the resume_from_sharded_checkpoint branch May 3, 2022 13:55
stevhliu pushed a commit to stevhliu/transformers that referenced this pull request May 3, 2022
* Make Trainer compatible with sharded checkpoints

* Add doc
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
* Make Trainer compatible with sharded checkpoints

* Add doc
Successfully merging this pull request may close these issues.

Bug: Finetuning large models resume checkpoint error