Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a warning in SpmConverter for sentencepiece's model using the byte fallback feature #16629

Merged
merged 14 commits into from
Apr 11, 2022
7 changes: 7 additions & 0 deletions src/transformers/convert_slow_tokenizer.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
allow to make our dependency on SentencePiece optional.
"""

import warnings
from typing import Dict, List, Tuple

from tokenizers import Regex, Tokenizer, decoders, normalizers, pre_tokenizers, processors
Expand Down Expand Up @@ -429,6 +430,12 @@ def __init__(self, *args):
m.ParseFromString(f.read())
self.proto = m

if self.proto.trainer_spec.byte_fallback:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the src/transformers/utils/sentencepiece_model_pb2.py file update, the self.proto.trainer_spec.byte_fallback call would have returned a AttributeError with the message "byte_fallback"

warnings.warn(
"The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option"
" which is not implemented in the fast tokenizers."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the warning explain a bit more to the user what to expect? If I read that message with no context, I have no idea what it means 😬

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should we maybe use the logger here instead:

from .utils import logger
...

logger = logging.get_logger(__name__)

....

logger.warn(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgugger (@patrickvonplaten ) That's a good point! I just added an explanation in my last commit. Does this seem sufficient? 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@patrickvonplaten That's actually how I did it initially but then I saw that the python warnings object was also used elsewhere in the transformers codebase so I changed it back. It seemed to me that warnings.warn was a warning that would probably be more read by the user.

I have no preference between the two, but I'm interested to know what makes you prefer the second option! 😄

Copy link
Contributor

@Narsil Narsil Apr 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the official docs:

Issue a warning regarding a particular runtime event | warnings.warn() in library code if the issue is avoidable 
and the client application should be modified to eliminate the warning logging.warning() if there is nothing the
client application can do about the situation, but the event should still be noted


Issue a warning regarding a particular runtime event
	

warnings.warn() in library code if the issue is avoidable and the client application should be modified to eliminate the warning

if there is nothing the client application can do about the situation, but the event should still be noted

https://docs.python.org/3.8/howto/logging.html#when-to-use-logging

I think this is more of a warnings case, where user SHOULD change his code as it's likely to cause issues to him later on.
Not raising Exception here is mainly due to the fact that a alerted user might KNOW he's not going to trigger byte fallback, and wants the offsets.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's definitely not instance where we want to use the logger as, from what I understood, the user should always see this and adapt the code.

)

def vocab(self, proto):
return [(piece.piece, piece.score) for piece in proto.pieces]

Expand Down
Loading