
Commit e1386d1

HF tokenizers: initial base tokenizer support (pytorch#2350)

1 parent 0792bcf

File tree

9 files changed: +12808 -0 lines changed

docs/source/api_ref_modules.rst
+1

@@ -50,6 +50,7 @@ model specific tokenizers.
 
     transforms.tokenizers.SentencePieceBaseTokenizer
     transforms.tokenizers.TikTokenBaseTokenizer
+    transforms.tokenizers.HuggingFaceBaseTokenizer
     transforms.tokenizers.ModelTokenizer
     transforms.tokenizers.BaseTokenizer
 

docs/source/basics/tokenizers.rst
+24

@@ -222,6 +222,30 @@ to do the actual encoding and decoding.
     print(sp_tokenizer.encode(text))
     # [1, 6312, 28709, 1526, 2]
 
+.. _hf_tokenizers:
+
+Using Hugging Face tokenizers
+-----------------------------
+
+Sometimes tokenizers hosted on Hugging Face do not contain files compatible with any of torchtune's
+existing tokenizer classes. In this case, we provide :class:`~torchtune.modules.transforms.tokenizers.HuggingFaceBaseTokenizer`
+to parse the Hugging Face ``tokenizer.json`` file and define the correct ``encode`` and ``decode`` methods to
+match torchtune's other :class:`~torchtune.modules.transforms.tokenizers.BaseTokenizer` classes. You should also pass the path to
+either ``tokenizer_config.json`` or ``generation_config.json``, which allows torchtune to infer the BOS and EOS tokens.
+Continuing with the Mistral example:
+
+.. code-block:: python
+
+    hf_tokenizer = HuggingFaceBaseTokenizer(
+        tokenizer_json_path="/tmp/Mistral-7B-v0.1/tokenizer.json",
+        tokenizer_config_json_path="/tmp/Mistral-7B-v0.1/tokenizer_config.json",
+    )
+
+    text = "hello world"
+
+    print(hf_tokenizer.encode(text))
+    # [1, 6312, 28709, 1526, 2]
+
 .. _model_tokenizers:
 
 Model tokenizers
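The ``encode``/``decode`` contract that the new class shares with torchtune's other base tokenizers can be illustrated with a toy, self-contained sketch. The class name, vocabulary, and token ids below are invented for illustration and are not torchtune's actual implementation:

```python
# Illustrative sketch only: a toy tokenizer exposing the same
# encode/decode shape that torchtune's BaseTokenizer classes share.
# The vocabulary and ids are made up; real tokenizers use subword units.
class ToyTokenizer:
    def __init__(self, vocab, bos_id, eos_id):
        self.vocab = vocab                           # token -> id
        self.inv = {i: t for t, i in vocab.items()}  # id -> token
        self.bos_id = bos_id
        self.eos_id = eos_id

    def encode(self, text, add_bos=True, add_eos=True):
        ids = [self.vocab[tok] for tok in text.split()]
        if add_bos:
            ids.insert(0, self.bos_id)
        if add_eos:
            ids.append(self.eos_id)
        return ids

    def decode(self, ids):
        # Skip special ids (BOS/EOS) that have no surface form.
        return " ".join(self.inv[i] for i in ids if i in self.inv)


tok = ToyTokenizer({"hello": 6312, "world": 1526}, bos_id=1, eos_id=2)
ids = tok.encode("hello world")
print(ids)              # [1, 6312, 1526, 2]
print(tok.decode(ids))  # hello world
```

A real subword tokenizer would split "hello world" into more than two pieces (as in the doc example's five ids), but the interface shape is the same.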

pyproject.toml
+1

@@ -23,6 +23,7 @@ dependencies = [
     "sentencepiece",
     "tiktoken",
     "blobfile>=2",
+    "tokenizers",
 
     # Miscellaneous
     "numpy",
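The new ``tokenizers`` dependency is Hugging Face's fast-tokenizer library, which is what parses ``tokenizer.json``. A minimal in-memory sketch of what it provides (a real setup would load a pretrained file with ``Tokenizer.from_file``; the toy word-level vocabulary here is an assumption so the example is self-contained):

```python
# Sketch of the Hugging Face `tokenizers` library added as a dependency.
# Instead of loading a pretrained tokenizer.json from disk, build a tiny
# word-level tokenizer in memory; the vocab below is invented.
from tokenizers import Tokenizer, models, pre_tokenizers

vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tok = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()

enc = tok.encode("hello world")
print(enc.ids)     # [1, 2]
print(enc.tokens)  # ['hello', 'world']
```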

tests/assets/generation_config.json
+1

@@ -0,0 +1 @@
+{"bos_token_id": 0, "eos_token_id": -1}
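The test asset above hints at how BOS/EOS ids can be read from a ``generation_config.json``-style file. A stdlib-only sketch of that parsing (the helper name and the treatment of the ``-1`` sentinel as "unset" are illustrative assumptions, not torchtune's actual code):

```python
# Sketch: read BOS/EOS token ids from a generation_config.json-style file.
# The helper name and the idea of treating negative ids as "unset" are
# illustrative assumptions, not torchtune's real logic.
import json
import tempfile


def read_special_ids(path):
    with open(path) as f:
        cfg = json.load(f)
    bos = cfg.get("bos_token_id")
    eos = cfg.get("eos_token_id")
    # Treat negative sentinels (like the -1 in the test asset) as unset.
    if eos is not None and eos < 0:
        eos = None
    return bos, eos


# Recreate the test asset's contents in a temp file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"bos_token_id": 0, "eos_token_id": -1}, f)
    path = f.name

print(read_special_ids(path))  # (0, None)
```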
