add a warning in SpmConverter for sentencepiece's model using the byte fallback feature #16629
Conversation
@@ -429,6 +430,12 @@ def __init__(self, *args):
        m.ParseFromString(f.read())
        self.proto = m

        if self.proto.trainer_spec.byte_fallback:
Without the src/transformers/utils/sentencepiece_model_pb2.py file update, the self.proto.trainer_spec.byte_fallback call would have raised an AttributeError with the message "byte_fallback".
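For illustration, here is a minimal sketch of the guard this diff adds, run outside SpmConverter (it assumes the updated pb2 module is importable as transformers.utils.sentencepiece_model_pb2, and "spiece.model" is a hypothetical local sentencepiece model file):

import warnings
from transformers.utils import sentencepiece_model_pb2 as model_pb2

m = model_pb2.ModelProto()
with open("spiece.model", "rb") as f:  # hypothetical path to a sentencepiece model
    m.ParseFromString(f.read())

# Without the pb2 update, trainer_spec has no byte_fallback field and this line raises AttributeError.
if m.trainer_spec.byte_fallback:
    warnings.warn(
        "The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"
        " which is not implemented in the fast tokenizers."
    )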
@@ -2,7 +2,7 @@
# Generated by the protocol buffer compiler. DO NOT EDIT!
This file is a copy of a file in the sentencepiece library. We need it to load the sentencepiece model into a Python object for the conversion of sentencepiece tokenizers into fast ones.
The original file was updated in December 2020 and then in January 2021, as the sentencepiece model was augmented with several new features including byte fallback.
The changes you link to are within the compiled proto file. I looked at the *.proto files and found this huge commit, which contains more values to check against, maybe:
split_digits
required_chars
byte_fallback
vocabulary_output_piece_score
train_extremely_large_corpus
I don't think they are all used at runtime, but they may be worth a look too (split_digits looks like a good candidate).
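As a side note, a quick way to inspect these fields on an existing model file is to parse it with the pb2 module; a sketch, assuming a recent sentencepiece package and a hypothetical local "spiece.model" file:

import sentencepiece.sentencepiece_model_pb2 as model_pb2

m = model_pb2.ModelProto()
with open("spiece.model", "rb") as f:  # hypothetical path
    m.ParseFromString(f.read())

# Print the TrainerSpec fields listed above to see how the model was trained.
for field in ("split_digits", "required_chars", "byte_fallback",
              "vocabulary_output_piece_score", "train_extremely_large_corpus"):
    print(field, getattr(m.trainer_spec, field))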
You are absolutely right!
I should have mentioned it more clearly in my PR. I didn't have time to dig deeper into what all these attributes mean for sentencepiece, or to see what the fix (I'm thinking of split_digits) would be on the Transformers side so that it stays backward compatible.
@@ -2,7 +2,7 @@
# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: sentencepiece_model.proto
Even though protobuf is normally designed with object updates in mind, I still wanted to run some tests to check that this change would not disturb the current behaviour.
To do this I tokenized the XNLI dataset with 2 tokenizers - albert-base-v2 and microsoft/deberta-v3-large - by loading the fast version of these tokenizers from the slow version. I did this step twice, with the 2 different versions of the src/transformers/utils/sentencepiece_model_pb2.py file. At the end I compared the tokenizations produced by these 2 versions, and for both tokenizers no differences were found. I tested with albert, a tokenizer added in Nov 2019 and thus before Dec 2020, and with deberta-v3 because this tokenizer is known to use the byte fallback feature and thus was produced with a recent version of sentencepiece.
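The comparison described above could look roughly like the sketch below (not the exact script used; the model name, dataset slice and the use of convert_slow_tokenizer are assumptions):

from datasets import load_dataset
from transformers import AutoTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

# Build the fast (tokenizers) backend from the slow sentencepiece tokenizer.
slow = AutoTokenizer.from_pretrained("albert-base-v2", use_fast=False)
fast_backend = convert_slow_tokenizer(slow)

# Tokenize some XNLI sentences and record the ids.
premises = load_dataset("xnli", "en", split="validation[:200]")["premise"]
encodings = [fast_backend.encode(text).ids for text in premises]

# Save `encodings`, swap in the other version of sentencepiece_model_pb2.py, re-run, and diff the outputs.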
LGTM,
I took the opportunity to check for other variables which might be incorrectly used or cause issues.
Doing the necessary work for ByteFallback will not be small; the question is: will that be enough so that we can support all new tokenizers easily?
I definitely hope so, and offsets seem to be the most demanded feature of tokenizers when the Fast tokenizer is not there, so always being able to have a Fast tokenizer is definitely a desirable trait.
Thanks for taking care of this!
Not approving, to make sure we have a core maintainer's opinion on this.
LGTM, but the warning could be a bit more informative I believe. Thanks for working on this!
"The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option" | ||
" which is not implemented in the fast tokenizers." |
Could the warning explain a bit more to the user what to expect? If I read that message with no context, I have no idea what it means 😬
Also, should we maybe use the logger here instead:
from .utils import logging
...
logger = logging.get_logger(__name__)
...
logger.warning(...)
@sgugger (@patrickvonplaten ) That's a good point! I just added an explanation in my last commit. Does this seem sufficient? 🙂
@patrickvonplaten That's actually how I did it initially, but then I saw that the Python warnings module was also used elsewhere in the transformers codebase, so I changed it back. It seemed to me that warnings.warn produces a warning that is more likely to be seen by the user.
I have no preference between the two, but I'm interested to know what makes you prefer the second option! 😄
From the official docs (https://docs.python.org/3.8/howto/logging.html#when-to-use-logging):
Task: Issue a warning regarding a particular runtime event
- warnings.warn() in library code if the issue is avoidable and the client application should be modified to eliminate the warning
- logging.warning() if there is nothing the client application can do about the situation, but the event should still be noted
I think this is more of a warnings case, where the user SHOULD change their code as it's likely to cause issues for them later on.
Not raising an Exception here is mainly due to the fact that an alerted user might KNOW they are not going to trigger byte fallback, and they want the offsets.
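In code, the contrast being discussed looks roughly like this (an illustrative sketch, not the code in the PR):

import warnings
from transformers.utils import logging

logger = logging.get_logger(__name__)

# warnings.warn: the caller can act on it (e.g. keep using the slow tokenizer), so the
# warning is attributed to their code and shown once per location by default.
warnings.warn("The sentencepiece model uses byte fallback, which the fast tokenizer does not implement.")

# logger.warning: for events the caller cannot do anything about, routed through the
# transformers logging verbosity settings.
logger.warning("Byte fallback is enabled; slow and fast tokenizers may not be equivalent.")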
Yes, it's definitely not an instance where we want to use the logger as, from what I understood, the user should always see this and adapt their code.
add a warning in SpmConverter for sentencepiece's model using the byte fallback feature (huggingface#16629)
* update proto sentencepiece model
* Revert "update proto sentencepiece model" (reverts commit b07f671)
* add check
* add test
* Revert "Revert "update proto sentencepiece model"" (reverts commit 4610825)
* test for log level
* test for log level 2
* warning at the warning level
* clean
* format
* add explanation in docstring
What does this PR do?
This PR proposes 2 changes:
- update the src/transformers/utils/sentencepiece_model_pb2.py file so that the new attributes of sentencepiece models (such as byte_fallback) can be read;
- add a warning in SpmConverter when the sentencepiece model being converted to a fast tokenizer uses the byte fallback feature - which is not yet implemented in HF tokenizers, cf. this issue.
For context, this PR is motivated by PR #15529 for adding DebertaV2TokenizerFast, because the deberta-v3 tokenizer uses the byte fallback feature and we need a way to warn the user that the fast tokenizer will not be equivalent to the slow sentencepiece tokenizer. Personally, I think this warning is beneficial in a more general case than deberta-v3, because the byte fallback feature was introduced a long time ago in the sentencepiece library.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Would love to have your feedback @LysandreJik, @sgugger, @patrickvonplaten, @patil-suraj and/or @Narsil 🤗