Tokenizer

This is a pure Go port of OpenAI's tiktoken tokenizer.

Usage

package main

import (
    "fmt"
    "github.com/tiktoken-go/tokenizer"
)

func main() {
    enc, err := tokenizer.Get(tokenizer.Cl100kBase)
    if err != nil {
        panic("oh oh")
    }

    // this should print a list of token ids
    ids, _, _ := enc.Encode("supercalifragilistic")
    fmt.Println(ids)

    // this should print the original string back
    text, _ := enc.Decode(ids)
    fmt.Println(text)
}
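
If what you need is a token count rather than the ids themselves, the length of the encoded slice gives you that. The sketch below is a minimal example of this; it also looks the tokenizer up by model name, assuming the ForModel helper and the GPT4 model constant are available in your version of the library, so treat those names as illustrative rather than guaranteed.

package main

import (
    "fmt"

    "github.com/tiktoken-go/tokenizer"
)

func main() {
    // Look up the encoder by model name instead of by encoding name.
    // ForModel and the GPT4 constant are assumed to exist in this version.
    enc, err := tokenizer.ForModel(tokenizer.GPT4)
    if err != nil {
        panic(err)
    }

    // The token count is simply the length of the encoded id slice.
    ids, _, _ := enc.Encode("supercalifragilistic")
    fmt.Printf("token count: %d\n", len(ids))
}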

Alternatively, you can use the included command-line tool:

> tokenizer -h

Usage of tokenizer:
  -decode string
        tokens to decode
  -encode string
        text to encode
  -token string
        text to calculate token

> tokenizer -encode supercalifragilistic

Todo

  • ✅ port code
  • ✅ o200k_base encoding
  • ✅ cl100k_base encoding
  • ✅ r50k_base encoding
  • ✅ p50k_base encoding
  • ✅ p50k_edit encoding
  • ✅ tests
  • ❌ handle special tokens
  • ❌ gpt-2 model

Caveats

This library embeds OpenAI's vocabularies, which are not small (~4 MB), as Go maps. This differs from the Python version of tiktoken, which downloads the dictionaries at runtime and stores them in a cache folder.

However, since the dictionaries are compiled in during the Go build process, start-up times and performance should be better than downloading and loading them at runtime.
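
To make that trade-off concrete, embedding a vocabulary as a Go map amounts to generating a source file along the lines of the sketch below. The package name, variable name, and entries are purely illustrative and are not the library's actual generated code; the point is that a map literal is compiled into the binary, so no download or file I/O happens at start-up.

// Hypothetical sketch of an embedded vocabulary; names and values are
// illustrative only and do not reflect the library's real generated files.
package vocab

// cl100kBaseVocab maps token byte sequences to token ids. A real
// vocabulary would hold roughly 100k entries and be produced by a
// code-generation step rather than written by hand.
var cl100kBaseVocab = map[string]uint{
    "hello":  0, // placeholder id
    " world": 1, // placeholder id
}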

Alternatives

Here is a list of other libraries that do something similar.