fixed crash when deleting older checkpoint and files with name f"{checkpoint_prefix}-*" exist #16686
What does this PR do?
I archive older checkpoints during training, so the output directory ends up containing files named `f"{checkpoint_prefix}-*.zip"` or `f"{checkpoint_prefix}-*.tar"`.
Previously, `glob(f"{checkpoint_prefix}-*")` collected every file and folder whose name starts with the checkpoint prefix, and `shutil.rmtree(checkpoint)` was later called on the entries selected for deletion.
`shutil.rmtree` expects a folder, so as soon as it is handed a zip file it fails and training crashes.
Adding `if os.path.isdir(x)` keeps only folders in `glob_checkpoints`.
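
A minimal, standalone sketch of the idea described above, not the actual `Trainer` code: the function names `sorted_checkpoint_dirs` and `rotate_checkpoints`, the step numbers, and the `save_limit` argument are illustrative assumptions. It shows that filtering the glob results with `os.path.isdir` leaves an archive file untouched while the oldest checkpoint folder is removed.

```python
import os
import re
import shutil
import tempfile
from pathlib import Path


def sorted_checkpoint_dirs(output_dir, checkpoint_prefix="checkpoint"):
    """Return checkpoint folders sorted by step number, oldest first (hypothetical helper)."""
    # Keep only directories: archives such as checkpoint-1000.zip also match the glob.
    candidates = [
        str(p)
        for p in Path(output_dir).glob(f"{checkpoint_prefix}-*")
        if os.path.isdir(p)
    ]
    numbered = []
    for path in candidates:
        match = re.match(rf".*{re.escape(checkpoint_prefix)}-([0-9]+)$", path)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered)]


def rotate_checkpoints(output_dir, save_limit, checkpoint_prefix="checkpoint"):
    """Delete the oldest checkpoint folders so that at most save_limit remain."""
    checkpoints = sorted_checkpoint_dirs(output_dir, checkpoint_prefix)
    to_delete = checkpoints[: max(0, len(checkpoints) - save_limit)]
    for checkpoint in to_delete:
        # Safe: every entry is a directory because of the isdir filter above.
        shutil.rmtree(checkpoint)
    return to_delete


# Illustrative run: three checkpoint folders plus one archive file.
with tempfile.TemporaryDirectory() as output_dir:
    for step in (1000, 2000, 3000):
        os.makedirs(os.path.join(output_dir, f"checkpoint-{step}"))
    Path(output_dir, "checkpoint-1000.zip").touch()

    deleted = [os.path.basename(p) for p in rotate_checkpoints(output_dir, save_limit=2)]
    remaining = sorted(os.listdir(output_dir))
```

Without the `os.path.isdir` filter, the archive would be among the glob matches and could be handed to `shutil.rmtree`, which raises an error for a path that is not a directory.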

Let's say the output folder contains checkpoint folders alongside an archive file (with `save_limit=5`). The code then attempts to remove the oldest checkpoint.
Because one of the matches is a file (checkpoint-33000.zip), passing it to `shutil.rmtree(checkpoint)` fails.
Keeping files out of `glob_checkpoints` fixes this: every entry is checked to be a folder, since checkpoints are folders, not single files.

Before submitting:
Who can review?
@sgugger