
Fix RemBertTokenizerFast #16933

Merged
1 commit merged into huggingface:main on Apr 25, 2022
Conversation

ydshieh (Collaborator) commented Apr 25, 2022

What does this PR do?

RemBertTokenizer(Fast) is similar to AlbertTokenizer(Fast): the slow versions of both are based on SentencePiece.

Unlike AlbertTokenizerFast, the fast tokenizer RemBertTokenizerFast doesn't have

self.can_save_slow_tokenizer = False if not self.vocab_file else True

As a result, I got an error when calling save_pretrained() after doing something like

tokenizer_fast.train_new_from_iterator(training_ds["text"], 1024)

(while working on the task of creating tiny random models/processors)

Error message without this PR:

  File "/home/yih_dar_huggingface_co/transformers/create_dummy_models.py", line 457, in convert_processors
    p.save_pretrained(output_folder)
  File "/home/yih_dar_huggingface_co/transformers/src/transformers/tokenization_utils_base.py", line 2101, in save_pretrained
    save_files = self._save_pretrained(
  File "/home/yih_dar_huggingface_co/transformers/src/transformers/tokenization_utils_fast.py", line 591, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
  File "/home/yih_dar_huggingface_co/transformers/src/transformers/models/rembert/tokenization_rembert_fast.py", line 237, in save_vocabulary
    if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
  File "/home/yih_dar_huggingface_co/miniconda3/envs/py-3-9/lib/python3.9/posixpath.py", line 375, in abspath
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not NoneType
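
For illustration, here is a minimal, hypothetical sketch of why the missing attribute leads to this TypeError and of the Albert-style guard (FastTokenizerSketch and its methods are made up for this example; they are not the library code, and the actual diff is in the commit):

import os

class FastTokenizerSketch:
    def __init__(self, vocab_file=None):
        self.vocab_file = vocab_file
        # The Albert-style flag that RemBertTokenizerFast was missing: a tokenizer
        # trained from an iterator has no SentencePiece file, so vocab_file is None.
        self.can_save_slow_tokenizer = False if not self.vocab_file else True

    def save_vocabulary(self, save_directory, filename_prefix=None):
        if not self.can_save_slow_tokenizer:
            raise ValueError("This fast tokenizer has no vocab_file, so it cannot save a slow vocabulary.")
        out_vocab_file = os.path.join(save_directory, (filename_prefix or "") + "spiece.model")
        # Without the guard above, self.vocab_file is None here and
        # os.path.abspath(None) raises the TypeError shown in the traceback.
        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
            pass  # copy the SentencePiece model file
        return (out_vocab_file,)

tok = FastTokenizerSketch(vocab_file=None)
try:
    tok.save_vocabulary("/tmp")
except ValueError as err:
    print(err)  # a clear error instead of TypeError from os.path.abspath(None)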

ydshieh requested review from SaulLu and sgugger on April 25, 2022 at 17:02
ydshieh (Collaborator, Author) commented Apr 25, 2022

I could provide the full code sample to reproduce the issue without this PR if necessary.
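
In the meantime, a rough sketch of such a reproduction (the checkpoint name, corpus, and vocab size below are illustrative placeholders, not the actual create_dummy_models.py code):

import tempfile

from transformers import RemBertTokenizerFast

tokenizer_fast = RemBertTokenizerFast.from_pretrained("google/rembert")  # assumed checkpoint
texts = ["hello world", "a tiny corpus for a tiny tokenizer"]  # stand-in for training_ds["text"]

new_tokenizer = tokenizer_fast.train_new_from_iterator(texts, 1024)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Before this PR, the next line raised:
    # TypeError: expected str, bytes or os.PathLike object, not NoneType
    new_tokenizer.save_pretrained(tmp_dir)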

HuggingFaceDocBuilderDev commented Apr 25, 2022

The documentation is not available anymore as the PR was closed or merged.

SaulLu (Contributor) left a comment

Thanks a lot for the fix! 🤗

This case emphasizes that we're really missing the test file for RemBert's tokenizer (cf. #16627); otherwise, I think this behavior would have been caught by this common test:

def test_saving_tokenizer_trainer(self):
    for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
        with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
            with tempfile.TemporaryDirectory() as tmp_dir:
                # Save the fast tokenizer files in a temporary directory
                tokenizer_old = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs, use_fast=True)
                tokenizer_old.save_pretrained(tmp_dir, legacy_format=False)  # save only fast version
                # Initialize toy model for the trainer
                model = nn.Module()
                # Load tokenizer from a folder without legacy files
                tokenizer = self.rust_tokenizer_class.from_pretrained(tmp_dir)
                training_args = TrainingArguments(output_dir=tmp_dir, do_train=True, no_cuda=True)
                trainer = Trainer(model=model, args=training_args, tokenizer=tokenizer)
                # Should not raise an error
                trainer.save_model(os.path.join(tmp_dir, "checkpoint"))
                self.assertIn("tokenizer.json", os.listdir(os.path.join(tmp_dir, "checkpoint")))

ydshieh (Collaborator, Author) commented Apr 25, 2022

Great to know the test is there (in the common tests) 😄

sgugger (Collaborator) left a comment

Thanks for fixing!

ydshieh merged commit f6210c4 into huggingface:main on Apr 25, 2022
ydshieh deleted the fix_rembert_tokenizer branch on April 25, 2022 at 17:51
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>