Fix RemBertTokenizerFast #16933
Conversation
I could provide the full code sample to reproduce the issue without this PR if necessary.
The documentation is not available anymore as the PR was closed or merged.
Thanks a lot for the fix! 🤗
This case emphasizes the fact that we're really missing the test file for RemBert's tokenizer (cf #16627); otherwise I think this behavior would have been caught by this common test:
transformers/tests/test_tokenization_common.py
Lines 3733 to 3751 in 32adbb2
def test_saving_tokenizer_trainer(self):
    for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
        with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
            with tempfile.TemporaryDirectory() as tmp_dir:
                # Save the fast tokenizer files in a temporary directory
                tokenizer_old = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs, use_fast=True)
                tokenizer_old.save_pretrained(tmp_dir, legacy_format=False)  # save only fast version
                # Initialize toy model for the trainer
                model = nn.Module()
                # Load tokenizer from a folder without legacy files
                tokenizer = self.rust_tokenizer_class.from_pretrained(tmp_dir)
                training_args = TrainingArguments(output_dir=tmp_dir, do_train=True, no_cuda=True)
                trainer = Trainer(model=model, args=training_args, tokenizer=tokenizer)
                # Should not raise an error
                trainer.save_model(os.path.join(tmp_dir, "checkpoint"))
                self.assertIn("tokenizer.json", os.listdir(os.path.join(tmp_dir, "checkpoint")))
Great to know the test is there (in common) 😄
Thanks for fixing!
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?

RemBertTokenizer(Fast) are similar to AlbertTokenizer(Fast): the slow versions are based on SentencePiece. Unlike AlbertTokenizerFast, however, the fast tokenizer RemBertTokenizerFast doesn't have …

And I got an error when I tried to call save_pretrained() after doing something like … (while working on the task of creating tiny random models/processors).
Error message without this PR