
add a warning in SpmConverter for sentencepiece's model using the byte fallback feature #16629

Merged - 14 commits into huggingface:main - Apr 11, 2022

Conversation

@SaulLu SaulLu (Contributor) commented Apr 6, 2022

What does this PR do?

This PR proposes 2 changes:

  1. Updating the protobuf model definition used for the sentencepiece models
  2. Raising a warning when a sentencepiece tokenizer that is being converted by SpmConverter to a fast tokenizer uses the byte fallback feature - which is not yet implemented in HF tokenizers cf this issue (see the sketch below).

For context, this PR is motivated by the PR #15529 for adding DebertaV2TokenizerFast because the deberta-v3 tokenizer uses the byte fallback feature and we need a way to warn the user that the fast tokenizer will not be equivalent to the slow sentencepiece tokenizer. Personally, I think this warning is beneficial in a more general case than deberta-v3 because the byte fallback feature was introduced a long time ago in the sentencepiece library.
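
For illustration, change 2 essentially boils down to the check below. This is a standalone sketch, not the PR's exact code (the real check lives inside SpmConverter.__init__, see the diff further down), and "spiece.model" is a placeholder path:

import warnings

from transformers.utils import sentencepiece_model_pb2 as model_pb2

# load the sentencepiece model file into the protobuf object (requires the updated pb2 file from change 1)
m = model_pb2.ModelProto()
with open("spiece.model", "rb") as f:  # placeholder path to a sentencepiece model
    m.ParseFromString(f.read())

# warn if the model was trained with byte fallback, which HF tokenizers does not implement yet
if m.trainer_spec.byte_fallback:
    warnings.warn(
        "The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"
        " which is not implemented in the fast tokenizers."
    )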

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Would love to have your feedback @LysandreJik , @sgugger , @patrickvonplaten , @patil-suraj and/or @Narsil 🤗

@@ -429,6 +430,12 @@ def __init__(self, *args):
m.ParseFromString(f.read())
self.proto = m

if self.proto.trainer_spec.byte_fallback:
SaulLu (Contributor Author):

Without the src/transformers/utils/sentencepiece_model_pb2.py file update, the self.proto.trainer_spec.byte_fallback call would have raised an AttributeError with the message "byte_fallback".
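
Purely for illustration (and not what this PR does - the PR updates the compiled proto file instead), a defensive lookup would avoid that AttributeError on the old module, at the cost of silently treating the field as absent:

# hypothetical defensive access against an outdated sentencepiece_model_pb2.py
byte_fallback = getattr(self.proto.trainer_spec, "byte_fallback", False)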

@@ -2,7 +2,7 @@
# Generated by the protocol buffer compiler. DO NOT EDIT!
SaulLu (Contributor Author):

This file was a copy of a file in the sentencepiece library. We need it to load the sentencepiece model into a Python object when converting sentencepiece tokenizers into fast ones.

This file was updated in December 2020 and then in January 2021 as the sentencepiece model was augmented with several new features including byte fallback.

Narsil (Contributor):

The changes you link to are within the compiled proto file; I looked at the *.proto files and found this huge commit:

google/sentencepiece@329383b#diff-31916168ddf65a6b69364c6c819c13b2024f148d3f84f64f89bc7b3d1e8ddf74R146

Which contains more values we might want to check against:

split_digits
required_chars
byte_fallback
vocabulary_output_piece_score
train_extremely_large_corpus

I don't think they are all used at runtime, but they may be worth a look too (split_digits looks like a good candidate; see the sketch below).
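
A quick way to inspect these flags for a given sentencepiece model could look like the sketch below ("spiece.model" is a placeholder path; the field list is the one quoted above):

from transformers.utils import sentencepiece_model_pb2 as model_pb2

m = model_pb2.ModelProto()
with open("spiece.model", "rb") as f:  # placeholder path
    m.ParseFromString(f.read())

# trainer_spec fields added in recent sentencepiece versions
for field in (
    "split_digits",
    "required_chars",
    "byte_fallback",
    "vocabulary_output_piece_score",
    "train_extremely_large_corpus",
):
    print(field, "=", getattr(m.trainer_spec, field))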

SaulLu (Contributor Author):

You are absolutely right!

I should have mentioned it more clearly in my PR. I didn't have the time to dig deeper into what all these attributes mean for sentencepiece, or to see what the fix (I'm thinking about split_digits) would look like on the Transformers side so that it wouldn't break backward compatibility.

@HuggingFaceDocBuilderDev commented Apr 6, 2022

The documentation is not available anymore as the PR was closed or merged.

@@ -2,7 +2,7 @@
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: sentencepiece_model.proto
SaulLu (Contributor Author):

Even though protobuf is normally designed to handle schema updates gracefully, I still wanted to run some tests to check that this change would not disturb the current behaviour.

To do this, I tokenized the XNLI dataset with 2 tokenizers - albert-base-v2 and microsoft/deberta-v3-large - loading the fast version of each tokenizer from the slow version. I did this step twice, once with each of the 2 versions of the src/transformers/utils/sentencepiece_model_pb2.py file, and then compared the tokenizations produced by the 2 versions: for both tokenizers, no differences were found. I tested with albert because it was added in Nov 2019, i.e. before the Dec 2020 update, and with deberta-v3 because this tokenizer is known to use the byte fallback feature and was therefore produced with a recent version of sentencepiece.
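
Roughly, each of the two runs (one per version of sentencepiece_model_pb2.py) could look like the sketch below, with the dumps from the two runs diffed afterwards. This is an assumed reconstruction rather than the exact script that was used, and the text list is a placeholder for the full XNLI dataset:

import json

from transformers import AutoTokenizer

texts = ["Some sentences from the XNLI dataset ..."]  # placeholder

results = {}
for checkpoint in ("albert-base-v2", "microsoft/deberta-v3-large"):
    # from_slow=True forces the fast tokenizer to be rebuilt from the slow one via SpmConverter
    tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast=True, from_slow=True)
    results[checkpoint] = [tokenizer.encode(text) for text in texts]

with open("tokenization_dump.json", "w") as f:
    json.dump(results, f)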

@SaulLu SaulLu changed the title [WIP] add a warning in SpmConverter for sentencepiece's model using the byte fallback feature add a warning in SpmConverter for sentencepiece's model using the byte fallback feature Apr 6, 2022
@Narsil Narsil (Contributor) left a comment

LGTM,

I took the opportunity to check for other variables which might be incorrectly used/cause issues.

The work needed for ByteFallback will not be small; the question is whether it will be enough to let us support all new tokenizers easily.
I definitely hope so, and offsets seem to be the most requested feature when a Fast tokenizer is not available, so always being able to provide a Fast tokenizer is definitely a desirable trait.

Thanks for taking care of this!

Not approving to make sure we have a core maintainer's opinion on this.

@sgugger sgugger (Collaborator) left a comment

LGTM, but the warning could be a bit more informative I believe. Thanks for working on this!

Comment on lines 435 to 436
"The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"
" which is not implemented in the fast tokenizers."
sgugger (Collaborator):

Could the warning explain a bit more to the user what to expect? If I read that message with no context, I have no idea what it means 😬

patrickvonplaten (Contributor):

Also should we maybe use the logger here instead:

from .utils import logging

logger = logging.get_logger(__name__)

...

logger.warning(...)

SaulLu (Contributor Author):

@sgugger (@patrickvonplaten ) That's a good point! I just added an explanation in my last commit. Does this seem sufficient? 🙂
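
For instance, a more informative message could spell out the practical consequence of the missing feature, along these lines (an illustrative sketch, not a verbatim quote of the commit):

warnings.warn(
    "The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"
    " which is not implemented in the fast tokenizers. In practice this means that the fast version of the"
    " tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these"
    " unknown tokens into a sequence of byte tokens matching the original piece of text."
)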

SaulLu (Contributor Author):

@patrickvonplaten That's actually how I did it initially, but then I saw that the Python warnings module was also used elsewhere in the transformers codebase so I changed it back. It seemed to me that a warnings.warn warning would probably be more likely to be read by the user.

I have no preference between the two, but I'm interested to know what makes you prefer the second option! 😄

@Narsil Narsil (Contributor) commented Apr 7, 2022

From the official docs:

Issue a warning regarding a particular runtime event:
- warnings.warn() in library code if the issue is avoidable and the client application should be modified to eliminate the warning
- logging.warning() if there is nothing the client application can do about the situation, but the event should still be noted

https://docs.python.org/3.8/howto/logging.html#when-to-use-logging

I think this is more of a warnings case, where the user SHOULD change their code as it's likely to cause issues for them later on.
Not raising an Exception here is mainly because an alerted user might KNOW they're not going to trigger byte fallback, and want the offsets.
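
In code, the two options under discussion look roughly like this (a sketch with a shortened message; transformers exposes its logger via transformers.utils.logging):

import warnings

from transformers.utils import logging

logger = logging.get_logger(__name__)

# warnings.warn: the caller can act on it, e.g. by sticking with the slow tokenizer
warnings.warn("The converted fast tokenizer does not implement byte fallback.")

# logger.warning: for events the caller cannot really do anything about
logger.warning("The converted fast tokenizer does not implement byte fallback.")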

sgugger (Collaborator):

Yes, it's definitely not an instance where we want to use the logger since, from what I understood, the user should always see this and adapt their code.

@SaulLu SaulLu merged commit 1025a9b into huggingface:main Apr 11, 2022
@SaulLu SaulLu mentioned this pull request Apr 14, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…yte fallback feature (huggingface#16629)

* update proto sentencepiece model

* Revert "update proto sentencepiece model"

This reverts commit b07f671.

* add check

* add test

* Revert "Revert "update proto sentencepiece model""

This reverts commit 4610825.

* test for log level

* test for log level 2

* warning at the warning level

* clean

* format

* add explanation in docstring