add a warning in SpmConverter for sentencepiece's model using the byte fallback feature #16629
Conversation
@@ -429,6 +430,12 @@ def __init__(self, *args):
        m.ParseFromString(f.read())
        self.proto = m

        if self.proto.trainer_spec.byte_fallback:
Without the src/transformers/utils/sentencepiece_model_pb2.py file update, the self.proto.trainer_spec.byte_fallback call would have raised an AttributeError with the message "byte_fallback".
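For illustration, here is a minimal sketch of the guard this diff adds, run outside SpmConverter (it assumes the updated pb2 module is importable as transformers.utils.sentencepiece_model_pb2, and "spiece.model" is a hypothetical local sentencepiece model file):

import warnings
from transformers.utils import sentencepiece_model_pb2 as model_pb2

m = model_pb2.ModelProto()
with open("spiece.model", "rb") as f:  # hypothetical path to a sentencepiece model
    m.ParseFromString(f.read())

# Without the pb2 update, trainer_spec has no byte_fallback field and this line raises AttributeError.
if m.trainer_spec.byte_fallback:
    warnings.warn(
        "The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"
        " which is not implemented in the fast tokenizers."
    )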
@@ -2,7 +2,7 @@
# Generated by the protocol buffer compiler. DO NOT EDIT!
This file is a copy of a file in the sentencepiece library. We need it to load the sentencepiece model into a Python object for the conversion of sentencepiece tokenizers into fast ones.
The original file was updated in December 2020 and then in January 2021, as the sentencepiece model was augmented with several new features including byte fallback.
The changes you link to are within the compiled proto file. I looked at the *.proto files and found this huge commit, which contains more values to check against, maybe:
split_digits
required_chars
byte_fallback
vocabulary_output_piece_score
train_extremely_large_corpus
I don't think they are all used at runtime, but they may be worth a look too (split_digits looks like a good candidate).
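As a side note, a quick way to inspect these fields on an existing model file is to parse it with the pb2 module; a sketch, assuming a recent sentencepiece package and a hypothetical local "spiece.model" file:

import sentencepiece.sentencepiece_model_pb2 as model_pb2

m = model_pb2.ModelProto()
with open("spiece.model", "rb") as f:  # hypothetical path
    m.ParseFromString(f.read())

# Print the TrainerSpec fields listed above to see how the model was trained.
for field in ("split_digits", "required_chars", "byte_fallback",
              "vocabulary_output_piece_score", "train_extremely_large_corpus"):
    print(field, getattr(m.trainer_spec, field))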
You are absolutely right!
I should have mentioned it more clearly in my PR. I didn't have time to dig deeper into what all these attributes mean for sentencepiece, or to see what the fix (I'm thinking of split_digits) would be on the Transformers side so that it stays backward compatible.
@@ -2,7 +2,7 @@
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: sentencepiece_model.proto
Even though protobuf is normally designed with object updates in mind, I still wanted to run some tests to check that this change would not disturb the current behaviour.
To do this I tokenized the XNLI dataset with 2 tokenizers - albert-base-v2 and microsoft/deberta-v3-large - by loading the fast version of these tokenizers from the slow version. I did this step twice, with the 2 different versions of the src/transformers/utils/sentencepiece_model_pb2.py file. At the end I compared the tokenizations produced by these 2 versions, and for both tokenizers no differences were found. I tested with albert, a tokenizer added in Nov 2019 and thus before Dec 2020, and with deberta-v3 because this tokenizer is known to use the byte fallback feature and thus was produced with a recent version of sentencepiece.
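The comparison described above could look roughly like the sketch below (not the exact script used; the model name, dataset slice and the use of convert_slow_tokenizer are assumptions):

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Build the fast (tokenizers) backend from the slow sentencepiece tokenizer.
slow = AutoTokenizer.from_pretrained("albert-base-v2", use_fast=False)
fast_backend = convert_slow_tokenizer(slow)

# Tokenize some XNLI sentences and record the ids.
premises = load_dataset("xnli", "en", split="validation[:200]")["premise"]
encodings = [fast_backend.encode(text).ids for text in premises]

# Save `encodings`, swap in the other version of sentencepiece_model_pb2.py, re-run, and diff the outputs.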
LGTM,
I took the opportunity to check for other variables which might be incorrectly used or cause issues.
Doing the necessary work for ByteFallback will not be small; the question is: will that be enough so that we can support all new tokenizers easily?
I definitely hope so, and offsets seem to be the most demanded feature of tokenizers when the Fast tokenizer is not there, so always being able to have a Fast tokenizer is definitely a desirable trait.
Thanks for taking care of this!
Not approving, to make sure we have a core maintainer's opinion on this.
LGTM, but the warning could be a bit more informative I believe. Thanks for working on this!
"The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option" | ||
" which is not implemented in the fast tokenizers." |
Could the warning explain a bit more to the user what to expect? If I read that message with no context, I have no idea what it means 😬
Also, should we maybe use the logger here instead:
from .utils import logging
...
logger = logging.get_logger(__name__)
...
logger.warning(...)
@sgugger (@patrickvonplaten ) That's a good point! I just added an explanation in my last commit. Does this seem sufficient? 🙂
@patrickvonplaten That's actually how I did it initially, but then I saw that the Python warnings module was also used elsewhere in the transformers codebase, so I changed it back. It seemed to me that warnings.warn produces a warning that is more likely to be seen by the user.
I have no preference between the two, but I'm interested to know what makes you prefer the second option! 😄
From the official docs (https://docs.python.org/3.8/howto/logging.html#when-to-use-logging):
Task: Issue a warning regarding a particular runtime event
- warnings.warn() in library code if the issue is avoidable and the client application should be modified to eliminate the warning
- logging.warning() if there is nothing the client application can do about the situation, but the event should still be noted
I think this is more of a warnings case, where the user SHOULD change their code as it's likely to cause issues for them later on.
Not raising an Exception here is mainly due to the fact that an alerted user might KNOW they are not going to trigger byte fallback, and they want the offsets.
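In code, the contrast being discussed looks roughly like this (an illustrative sketch, not the code in the PR):

import warnings
from transformers.utils import logging

logger = logging.get_logger(__name__)

# warnings.warn: the caller can act on it (e.g. keep using the slow tokenizer), so the
# warning is attributed to their code and shown once per location by default.
warnings.warn("The sentencepiece model uses byte fallback, which the fast tokenizer does not implement.")

# logger.warning: for events the caller cannot do anything about, routed through the
# transformers logging verbosity settings.
logger.warning("Byte fallback is enabled; slow and fast tokenizers may not be equivalent.")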
Yes, it's definitely not an instance where we want to use the logger as, from what I understood, the user should always see this and adapt their code.
add a warning in SpmConverter for sentencepiece's model using the byte fallback feature (huggingface#16629)
* update proto sentencepiece model
* Revert "update proto sentencepiece model" (reverts commit b07f671)
* add check
* add test
* Revert "Revert "update proto sentencepiece model"" (reverts commit 4610825)
* test for log level
* test for log level 2
* warning at the warning level
* clean
* format
* add explanation in docstring
What does this PR do?
This PR proposes 2 changes:
- update the src/transformers/utils/sentencepiece_model_pb2.py file so that the new attributes of sentencepiece models (such as byte_fallback) can be read;
- add a warning in SpmConverter when the sentencepiece model being converted to a fast tokenizer uses the byte fallback feature - which is not yet implemented in HF tokenizers, cf. this issue.
For context, this PR is motivated by PR #15529 for adding DebertaV2TokenizerFast, because the deberta-v3 tokenizer uses the byte fallback feature and we need a way to warn the user that the fast tokenizer will not be equivalent to the slow sentencepiece tokenizer. Personally, I think this warning is beneficial in a more general case than deberta-v3, because the byte fallback feature was introduced a long time ago in the sentencepiece library.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Would love to have your feedback @LysandreJik, @sgugger, @patrickvonplaten, @patil-suraj and/or @Narsil 🤗