Add Byte Pair Encoding (BPE) class for subword tokenization #3056

Merged: 7 commits, Mar 23, 2025

@Cydral commented Feb 15, 2025

Description:

This PR introduces a new bpe_tokenizer class to Dlib, implementing the Byte Pair Encoding (BPE) algorithm for subword tokenization. BPE is a widely used technique in natural language processing (NLP): it handles out-of-vocabulary words by splitting them into known subword units, which keeps the vocabulary small while still allowing arbitrary text to be represented.

Key Features:

  • BPE Algorithm: Implements the BPE algorithm as described in Sennrich et al., 2016.
  • Special Tokens: Supports predefined special tokens (e.g., <text>, <url>, <image>) for marking specific elements in the text.
  • Training and Encoding: Provides methods for training the tokenizer on a text corpus and encoding/decoding text into subword tokens.
  • Serialization: Supports saving and loading the tokenizer model and vocabulary for reuse.
  • Multi-Threaded Training: Uses multiple threads to compute frequency statistics efficiently during training.
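
For readers unfamiliar with the merge loop behind BPE, here is a minimal, single-threaded sketch of the general byte-level technique. It is not the code from this PR: `toy_bpe` and all of its members are hypothetical names used only for illustration, it has no special-token handling or serialization, and its tie-breaking rule (the smallest pair wins when counts are equal) is an assumption made to keep the example deterministic.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy type for illustration only; this is NOT dlib::bpe_tokenizer.
struct toy_bpe {
    // merges[i] is the pair of token ids that was replaced by token id 256 + i.
    std::vector<std::pair<int, int>> merges;

    // Base vocabulary: one token id (0..255) per byte.
    static std::vector<int> to_bytes(const std::string& s) {
        std::vector<int> ids;
        for (unsigned char c : s) ids.push_back(c);
        return ids;
    }

    // Replace every non-overlapping occurrence of pair p (left to right) with new_id.
    static std::vector<int> merge_pair(const std::vector<int>& ids,
                                       std::pair<int, int> p, int new_id) {
        std::vector<int> out;
        for (std::size_t i = 0; i < ids.size();) {
            if (i + 1 < ids.size() && ids[i] == p.first && ids[i + 1] == p.second) {
                out.push_back(new_id);
                i += 2;
            } else {
                out.push_back(ids[i]);
                ++i;
            }
        }
        return out;
    }

    // Learn merges until the vocabulary reaches vocab_size or no pair repeats.
    void train(const std::string& corpus, int vocab_size) {
        merges.clear();
        std::vector<int> ids = to_bytes(corpus);
        while (256 + static_cast<int>(merges.size()) < vocab_size) {
            // Count adjacent pairs (overlapping occurrences are counted too).
            std::map<std::pair<int, int>, int> counts;
            for (std::size_t i = 0; i + 1 < ids.size(); ++i)
                ++counts[std::make_pair(ids[i], ids[i + 1])];

            // Pick the most frequent pair; it must occur at least twice.
            std::pair<int, int> best(0, 0);
            int best_count = 1;
            for (const auto& kv : counts) {
                if (kv.second > best_count) {
                    best = kv.first;
                    best_count = kv.second;
                }
            }
            if (best_count < 2) break;

            const int new_id = 256 + static_cast<int>(merges.size());
            ids = merge_pair(ids, best, new_id);
            merges.push_back(best);
        }
    }

    // Encode by replaying the learned merges in the order they were learned.
    std::vector<int> encode(const std::string& text) const {
        std::vector<int> ids = to_bytes(text);
        for (std::size_t i = 0; i < merges.size(); ++i)
            ids = merge_pair(ids, merges[i], 256 + static_cast<int>(i));
        return ids;
    }

    // Recursively expand a token id back into its bytes.
    void expand(int id, std::string& out) const {
        if (id < 256) {
            out.push_back(static_cast<char>(id));
            return;
        }
        const std::pair<int, int>& p = merges[id - 256];
        expand(p.first, out);
        expand(p.second, out);
    }

    std::string decode(const std::vector<int>& ids) const {
        std::string out;
        for (int id : ids) expand(id, out);
        return out;
    }
};
```

Training this sketch on "aaabdaaabac" with a target vocabulary of 259 learns three merges, and the same string then encodes to five tokens instead of eleven bytes. Because the base vocabulary covers every byte, text never seen during training still round-trips through encode and decode.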

Usage:

dlib::bpe_tokenizer tokenizer;
tokenizer.train(corpus_text, target_vocab_size, true); // Train on a text corpus
std::vector<int> tokens = tokenizer.encode("Sample text to tokenize."); // Encode text
std::string decoded_text = tokenizer.decode(tokens); // Decode tokens back to text

- Implement BPE (Byte Pair Encoding) tokenization
- Add training and encoding methods
- Include unit tests
@Cydral Cydral changed the title Add Byte Pair Encoding Class for Subword Tokenization Add Byte Pair Encoding (BPE) class for subword tokenization Feb 15, 2025
@davisking
Owner

Nice, this is great. Sorry it took so long for me to get back to this.

@davisking davisking merged commit 1cd0634 into davisking:master Mar 23, 2025
10 checks passed
Repository owner deleted a comment from dlib-issue-bot Mar 24, 2025
@Cydral
Contributor Author

Cydral commented Mar 24, 2025

No problem. I think this is another great new feature for our library.

@davisking
Owner

> No problem. I think this is another great new feature for our library.

Indeed 😁
