
TIPA

Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning


TIPA Usage Documentation


Overview

This documentation describes the usage of the tipa.py script. The script processes a HuggingFace tokenizer's vocabulary, generates TIPA (Token Internal Position Awareness) mappings for each token, and outputs the results in a JSONL file. Each record in the JSONL file contains:

  • token_id: The ID of the token in the tokenizer's vocabulary.
  • token: The string representation of the token.
  • tipa_forward: Forward character position mapping (left-to-right).
  • tipa_reverse: Reverse character position mapping (right-to-left). A sketch of how both mappings are built follows this list.
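
The mapping format is easiest to see in code. Below is a minimal sketch of how both mappings can be built for a single token; the function name tipa_mappings is illustrative and not necessarily the one used in tipa.py:

def tipa_mappings(token):
    """Build forward and reverse character-position maps for one token."""
    # Forward: position 1 is the first character, counted left to right.
    forward = {str(i + 1): ch for i, ch in enumerate(token)}
    # Reverse: the same positions, emitted from the last character back to
    # the first (dicts preserve insertion order in Python 3.7+).
    reverse = {str(i + 1): ch for i, ch in reversed(list(enumerate(token)))}
    return forward, reverse

forward, reverse = tipa_mappings("Hi")
print(forward)  # {'1': 'H', '2': 'i'}
print(reverse)  # {'2': 'i', '1': 'H'}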

Dependencies

Before running the script, ensure the required libraries are installed:

pip install numpy transformers

How to Use

  1. Place tipa.py in Your Working Directory.

  2. Run the Script.

    Run the following command in your terminal:

    python tipa.py
  3. Output File.

    The script generates a JSONL file named tipa_tokens.jsonl (default name) in the working directory.

    • Each line in the file contains a JSON object with TIPA mappings for a single valid token.
  4. Customize the Parameters.

    You can customize the HuggingFace tokenizer and output file name by editing the script:

    tokenizer_name = "Qwen/Qwen2.5-7B-Instruct"
    output_filename = "all_tipa/qwen2.5_tipa_tokens.jsonl"

Example JSONL Output

Each line in the tipa_tokens.jsonl file represents a token's TIPA information in JSON format. Example:

{"token_id": 123, "token": "你好", "tipa_forward": {"1": "", "2": ""}, "tipa_reverse": {"2": "", "1": ""}}
{"token_id": 124, "token": "世界", "tipa_forward": {"1": "", "2": ""}, "tipa_reverse": {"2": "", "1": ""}}
{"token_id": 125, "token": "Hi", "tipa_forward": {"1": "H", "2": "i"}, "tipa_reverse": {"2": "i", "1": "H"}}

Code Description

  1. TIPA Function
    Generates character position mappings for strings in forward and reverse order.

    • Forward: Positions start from the beginning of the string.
    • Reverse: The same positions, enumerated from the last character back to the first (as in the example output above).
  2. Valid Token Extraction

    • Loads the tokenizer vocabulary.
    • Filters out invalid UTF-8 tokens.
  3. JSONL Output
    Saves each token's ID, string, and TIPA mappings into a JSONL file line by line. A condensed sketch of the whole pipeline follows.
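
For orientation, here is a condensed sketch of that pipeline, reusing the illustrative tipa_mappings helper from the Overview. It approximates what tipa.py does rather than reproducing it; in particular, checking the decoded text for the U+FFFD replacement character is one plausible way to filter out invalid UTF-8 tokens:

import json
from transformers import AutoTokenizer

tokenizer_name = "Qwen/Qwen2.5-7B-Instruct"
output_filename = "tipa_tokens.jsonl"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

with open(output_filename, "w", encoding="utf-8") as f:
    for token_id in sorted(tokenizer.get_vocab().values()):
        # Decode the single token back to text; byte-level BPE pieces that
        # are not valid UTF-8 on their own decode to U+FFFD.
        text = tokenizer.decode([token_id])
        if not text or "\ufffd" in text:
            continue  # skip empty or invalid tokens
        forward, reverse = tipa_mappings(text)
        record = {"token_id": token_id, "token": text,
                  "tipa_forward": forward, "tipa_reverse": reverse}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")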


Advanced Usage

To use a different tokenizer model, change the tokenizer name:

tokenizer_name = "bert-base-uncased"

To change the output file name:

output_filename = "custom_output.jsonl"

Run the script again to apply the new configuration.


File Loading Example

To load the generated JSONL file and process the records in Python:

import json

with open("all_tipa/qwen2.5_tipa_tokens.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record)
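
If you need random access by token ID rather than a sequential scan, a natural follow-up is to index the records into a dictionary:

import json

# Index the records by token_id for O(1) lookup of a token's TIPA mappings.
tipa_by_id = {}
with open("all_tipa/qwen2.5_tipa_tokens.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        tipa_by_id[record["token_id"]] = record

# With the example records shown earlier, token 125 is "Hi":
print(tipa_by_id[125]["tipa_reverse"])  # {'2': 'i', '1': 'H'}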

Use Cases

  1. Token Analysis: Analyze character-level structure of tokens for NLP tasks.
  2. Preprocessing: Use TIPA mappings as auxiliary training data so models learn the internal character structure of tokens.
  3. Dataset Preparation: Generate structured token information for downstream tasks (a sketch follows this list).
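
As an illustration of use case 3, each TIPA record can be turned into a supervised (prompt, completion) pair. The prompt template below is a made-up example for illustration, not the exact template used in the paper:

import json

def record_to_example(record):
    """Turn one TIPA record into a (prompt, completion) training pair."""
    prompt = (f"List each character of the token '{record['token']}' "
              "with its position, from last to first.")
    completion = json.dumps(record["tipa_reverse"], ensure_ascii=False)
    return {"prompt": prompt, "completion": completion}

example = record_to_example({
    "token_id": 125, "token": "Hi",
    "tipa_forward": {"1": "H", "2": "i"},
    "tipa_reverse": {"2": "i", "1": "H"},
})
print(example["completion"])  # {"2": "i", "1": "H"}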

Advanced Usage: Language-Specific Optimization

If your domain is limited to a specific language, such as Chinese, and your dataset has extensive coverage (sufficient to include all tokens of the target language), you can prune the tokenizer's vocabulary using a set of tokens obtained from your dataset.

By filtering the vocabulary against the set of tokens that actually appear in your dataset, you can create a pruned TIPA dataset. Training the model on this pruned data can significantly accelerate training without compromising performance.

Steps for Optimization:

  1. Extract all tokens from your dataset through segmentation (e.g., Chinese word segmentation).
  2. Use a set data structure to remove duplicate tokens.
  3. Filter the tokenizer's vocabulary to include only the tokens in the pruned set.
  4. Generate the TIPA mappings (forward and reverse) only for the pruned tokens.
  5. Train the model using the resulting TIPA data.

This approach reduces the size of the TIPA dataset while ensuring it remains comprehensive for the specific language, resulting in faster and more efficient model training. A sketch of these steps follows.
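
A minimal sketch of these steps, reusing the tipa_mappings helper and the UTF-8 filter from above. Here the HuggingFace tokenizer itself serves as the segmenter, which guarantees the extracted tokens match vocabulary entries; the two-line corpus is a placeholder for your own dataset:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
corpus = ["你好，世界", "今天天气很好"]  # placeholder: replace with your dataset

# Steps 1-2: collect the deduplicated set of token IDs seen in the corpus.
seen_ids = set()
for text in corpus:
    seen_ids.update(tokenizer(text, add_special_tokens=False)["input_ids"])

# Steps 3-4: generate TIPA mappings only for the pruned token set.
pruned_records = []
for token_id in sorted(seen_ids):
    text = tokenizer.decode([token_id])
    if not text or "\ufffd" in text:
        continue  # same invalid-UTF-8 filter as before
    forward, reverse = tipa_mappings(text)
    pruned_records.append({"token_id": token_id, "token": text,
                           "tipa_forward": forward, "tipa_reverse": reverse})
# Step 5: write pruned_records to JSONL and train on it, as in tipa.py.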


This method is especially effective for large datasets in Chinese or similar languages where the token space is finite and well-defined. 🚀

Conclusion

The tipa.py script is a powerful tool for analyzing and processing tokenizer vocabularies. It generates detailed TIPA mappings and stores them efficiently in a JSONL format for further use.

For any additional questions or support, refer to the HuggingFace documentation for tokenizers. 🚀

Email: s231231076@stu.cqupt.edu.cn

HuggingFace Models: MTIPA-E1 | TIPA-E2

[Paper]

Cite:

@misc{xu2024enhancing,
    title={Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning},
    author={Zhu Xu and Zhiqiang Zhao and Zihan Zhang and Yuchi Liu and Quanwei Shen and Fei Liu and Yu Kuang and Jian He and Conglin Liu},
    year={2024},
    eprint={2411.17679},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
