This documentation describes the usage of the `tipa.py` script. The script processes a HuggingFace tokenizer's vocabulary, generates TIPA (Token Internal Position Awareness) mappings for each token, and writes the results to a JSONL file. Each record in the JSONL file contains:
- `token_id`: The ID of the token in the tokenizer's vocabulary.
- `token`: The string representation of the token.
- `tipa_forward`: Forward character position mapping (left-to-right).
- `tipa_reverse`: Reverse character position mapping (right-to-left).
Before running the script, ensure the required libraries are installed:

```bash
pip install numpy transformers
```
- Place `tipa.py` in your working directory.
- Run the script with the following command in your terminal:

  ```bash
  python tipa.py
  ```

- Check the output file: the script generates a JSONL file named `tipa_tokens.jsonl` (the default name) in the working directory. Each line in the file contains a JSON object with TIPA mappings for a single valid token.
- Customize the parameters: the HuggingFace tokenizer and output file name can be changed by editing the script:

  ```python
  tokenizer_name = "Qwen/Qwen2.5-7B-Instruct"
  output_filename = "all_tipa/qwen2.5_tipa_tokens.jsonl"
  ```
Each line in the `tipa_tokens.jsonl` file represents a token's TIPA information in JSON format. Example:

```json
{"token_id": 123, "token": "你好", "tipa_forward": {"1": "你", "2": "好"}, "tipa_reverse": {"2": "好", "1": "你"}}
{"token_id": 124, "token": "世界", "tipa_forward": {"1": "世", "2": "界"}, "tipa_reverse": {"2": "界", "1": "世"}}
{"token_id": 125, "token": "Hi", "tipa_forward": {"1": "H", "2": "i"}, "tipa_reverse": {"2": "i", "1": "H"}}
```
TIPA Function: generates character position mappings for strings in forward and reverse order.
- Forward: 1-based positions counting from the beginning of the string.
- Reverse: the same 1-based positions, enumerated from the end of the string (matching the example records above, e.g. `{"2": "好", "1": "你"}`).
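A minimal sketch of such a function, consistent with the example records above (the actual implementation in `tipa.py` may differ in details):

```python
def tipa(token: str):
    """Return (forward, reverse) character position mappings for a token."""
    chars = list(token)
    # Forward: 1-based positions, listed left-to-right.
    forward = {str(i + 1): ch for i, ch in enumerate(chars)}
    # Reverse: same 1-based positions, listed right-to-left,
    # e.g. "你好" -> {"2": "好", "1": "你"}.
    reverse = {str(i + 1): chars[i] for i in range(len(chars) - 1, -1, -1)}
    return forward, reverse
```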
Valid Token Extraction: loads the tokenizer vocabulary and filters out invalid UTF-8 tokens.
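One plausible way to do this with the `transformers` API (a hedged sketch; the exact filtering rule in `tipa.py` may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

valid_tokens = {}
for token, token_id in tokenizer.get_vocab().items():
    text = tokenizer.decode([token_id])
    # Byte-level BPE pieces that are not valid UTF-8 on their own decode
    # to the Unicode replacement character; skip those.
    if "\ufffd" in text:
        continue
    valid_tokens[token_id] = text
```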
JSONL Output: saves each token's ID, string, and TIPA mappings into a JSONL file line by line.
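Continuing the sketches above (reusing the hypothetical `tipa` and `valid_tokens`), writing the records could look like this:

```python
import json

with open("tipa_tokens.jsonl", "w", encoding="utf-8") as f:
    for token_id, text in valid_tokens.items():
        forward, reverse = tipa(text)
        record = {
            "token_id": token_id,
            "token": text,
            "tipa_forward": forward,
            "tipa_reverse": reverse,
        }
        # ensure_ascii=False keeps non-ASCII characters (e.g. Chinese) readable.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```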
To use a different tokenizer model, change the tokenizer name:

```python
tokenizer_name = "bert-base-uncased"
```

To change the output file name:

```python
output_filename = "custom_output.jsonl"
```

Run the script again to process the new configuration.
To load the generated JSONL file and process the records in Python:

```python
import json

with open("all_tipa/qwen2.5_tipa_tokens.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record)
```
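As a small sanity check (assuming the record schema shown above), the forward mapping can rebuild the token string:

```python
# Positions are 1-based string keys, so iterate 1..len(mapping).
rebuilt = "".join(record["tipa_forward"][str(i)]
                  for i in range(1, len(record["tipa_forward"]) + 1))
assert rebuilt == record["token"]
```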
- Token Analysis: Analyze character-level structure of tokens for NLP tasks.
- Preprocessing: Use TIPA mappings to enhance token-level models.
- Dataset Preparation: Generate structured token information for downstream tasks.
If your domain is limited to a specific language, such as Chinese, and your dataset has extensive coverage (sufficient to include all tokens of the target language), you can prune the tokenizer's vocabulary using the set of tokens obtained from your dataset.
By filtering the vocabulary down to the tokens that actually appear in your dataset, you can create a pruned TIPA dataset. Training the model on this pruned TIPA data can significantly accelerate the training process without compromising performance.
Steps for Optimization (sketched in code after this list):
- Extract all tokens from your dataset through segmentation (e.g., Chinese word segmentation).
- Use a `set` data structure to remove duplicate tokens.
- Filter the tokenizer's vocabulary to include only the tokens in the pruned set.
- Generate the TIPA mappings (forward and reverse) only for the pruned tokens.
- Train the model using the resulting TIPA data.
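A sketch of the first four steps, assuming a plain-text corpus file `corpus.txt` and `jieba` as the example Chinese segmenter (both are placeholders), and reusing `tipa` and `valid_tokens` from the snippets above:

```python
import json
import jieba  # example segmenter; use whatever fits your language

# Extract and deduplicate tokens from the corpus.
dataset_tokens = set()
with open("corpus.txt", "r", encoding="utf-8") as f:
    for line in f:
        dataset_tokens.update(jieba.cut(line.strip()))

# Keep only vocabulary entries that occur in the dataset.
pruned = {tid: text for tid, text in valid_tokens.items()
          if text in dataset_tokens}

# Generate TIPA mappings for the pruned tokens only.
with open("pruned_tipa_tokens.jsonl", "w", encoding="utf-8") as f:
    for tid, text in pruned.items():
        forward, reverse = tipa(text)
        f.write(json.dumps({"token_id": tid, "token": text,
                            "tipa_forward": forward, "tipa_reverse": reverse},
                           ensure_ascii=False) + "\n")
```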
This approach reduces the size of the TIPA dataset while ensuring it remains comprehensive for the specific language, resulting in faster and more efficient model training.
This method is especially effective for large datasets in Chinese or similar languages where the token space is finite and well-defined. 🚀
The `tipa.py` script is a powerful tool for analyzing and processing tokenizer vocabularies. It generates detailed TIPA mappings and stores them efficiently in JSONL format for further use.
For any additional questions or support, refer to the HuggingFace documentation for tokenizers. 🚀
Email: s231231076@stu.cqupt.edu.cn
Huggingface Models: MTIPA-E1 | TIPA-E2
Cite:

```bibtex
@misc{xu2024enhancing,
      title={Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning},
      author={Zhu Xu and Zhiqiang Zhao and Zihan Zhang and Yuchi Liu and Quanwei Shen and Fei Liu and Yu Kuang and Jian He and Conglin Liu},
      year={2024},
      eprint={2411.17679},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```