
Fix RemBertTokenizerFast #16933

Merged
1 commit merged into huggingface:main on Apr 25, 2022
Conversation

ydshieh (Collaborator) commented Apr 25, 2022

What does this PR do?

RemBertTokenizer(Fast) is similar to AlbertTokenizer(Fast): the slow versions of both are based on SentencePiece.

Unlike AlbertTokenizerFast, the fast tokenizer RemBertTokenizerFast doesn't have

self.can_save_slow_tokenizer = False if not self.vocab_file else True

As a result, I got an error when calling save_pretrained() after doing something like

tokenizer_fast.train_new_from_iterator(training_ds["text"], 1024)

(while working on the task of creating tiny random models/processors)

Error message without this PR:

  File "/home/yih_dar_huggingface_co/transformers/create_dummy_models.py", line 457, in convert_processors
    p.save_pretrained(output_folder)
  File "/home/yih_dar_huggingface_co/transformers/src/transformers/tokenization_utils_base.py", line 2101, in save_pretrained
    save_files = self._save_pretrained(
  File "/home/yih_dar_huggingface_co/transformers/src/transformers/tokenization_utils_fast.py", line 591, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
  File "/home/yih_dar_huggingface_co/transformers/src/transformers/models/rembert/tokenization_rembert_fast.py", line 237, in save_vocabulary
    if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
  File "/home/yih_dar_huggingface_co/miniconda3/envs/py-3-9/lib/python3.9/posixpath.py", line 375, in abspath
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not NoneType
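
For illustration, here is a minimal, hypothetical sketch of why the missing attribute leads to this TypeError and of the Albert-style guard (FastTokenizerSketch and its methods are made up for this example; they are not the library code, and the actual diff is in the commit):

import os

class FastTokenizerSketch:
    def __init__(self, vocab_file=None):
        self.vocab_file = vocab_file
        # The Albert-style flag that RemBertTokenizerFast was missing: a tokenizer
        # trained from an iterator has no SentencePiece file, so vocab_file is None.
        self.can_save_slow_tokenizer = False if not self.vocab_file else True

    def save_vocabulary(self, save_directory, filename_prefix=None):
        if not self.can_save_slow_tokenizer:
            raise ValueError("This fast tokenizer has no vocab_file, so it cannot save a slow vocabulary.")
        out_vocab_file = os.path.join(save_directory, (filename_prefix or "") + "spiece.model")
        # Without the guard above, self.vocab_file is None here and
        # os.path.abspath(None) raises the TypeError shown in the traceback.
        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
            pass  # copy the SentencePiece model file
        return (out_vocab_file,)

tok = FastTokenizerSketch(vocab_file=None)
try:
    tok.save_vocabulary("/tmp")
except ValueError as err:
    print(err)  # a clear error instead of TypeError from os.path.abspath(None)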

ydshieh requested review from SaulLu and sgugger on April 25, 2022 at 17:02
ydshieh (Collaborator, Author) commented Apr 25, 2022

I could provide the full code sample to reproduce the issue without this PR if necessary.
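
In the meantime, a rough sketch of such a reproduction (the checkpoint name, corpus, and vocab size below are illustrative placeholders, not the actual create_dummy_models.py code):

import tempfile

from transformers import RemBertTokenizerFast

tokenizer_fast = RemBertTokenizerFast.from_pretrained("google/rembert")  # assumed checkpoint
texts = ["hello world", "a tiny corpus for a tiny tokenizer"]  # stand-in for training_ds["text"]

new_tokenizer = tokenizer_fast.train_new_from_iterator(texts, 1024)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Before this PR, the next line raised:
    # TypeError: expected str, bytes or os.PathLike object, not NoneType
    new_tokenizer.save_pretrained(tmp_dir)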

HuggingFaceDocBuilderDev commented Apr 25, 2022

The documentation is not available anymore as the PR was closed or merged.

SaulLu (Contributor) left a comment

Thanks a lot for the fix! 🤗

This case emphasizes that we're really missing the test file for RemBert's tokenizer (cf. #16627); otherwise, I think this behavior would have been caught by this common test:

def test_saving_tokenizer_trainer(self):
    for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
        with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
            with tempfile.TemporaryDirectory() as tmp_dir:
                # Save the fast tokenizer files in a temporary directory
                tokenizer_old = self.rust_tokenizer_class.from_pretrained(pretrained_name, **kwargs, use_fast=True)
                tokenizer_old.save_pretrained(tmp_dir, legacy_format=False)  # save only fast version
                # Initialize toy model for the trainer
                model = nn.Module()
                # Load tokenizer from a folder without legacy files
                tokenizer = self.rust_tokenizer_class.from_pretrained(tmp_dir)
                training_args = TrainingArguments(output_dir=tmp_dir, do_train=True, no_cuda=True)
                trainer = Trainer(model=model, args=training_args, tokenizer=tokenizer)
                # Should not raise an error
                trainer.save_model(os.path.join(tmp_dir, "checkpoint"))
                self.assertIn("tokenizer.json", os.listdir(os.path.join(tmp_dir, "checkpoint")))

ydshieh (Collaborator, Author) commented Apr 25, 2022

Great to know the test is there (in the common tests) 😄

sgugger (Collaborator) left a comment

Thanks for fixing!

ydshieh merged commit f6210c4 into huggingface:main on Apr 25, 2022
ydshieh deleted the fix_rembert_tokenizer branch on April 25, 2022 at 17:51
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>