A quick-and-dirty command line tool (written in Python) that tokenizes Japanese text and inserts spaces between words. It's designed to work on eBooks (.txt files) taken from the Archive.org or Aozora Bunko repositories. "But why add spaces when Japanese isn't usually written with spaces between words?" you may ask. The only reason I wanted spaces is so that my Kindle can separate words and let me look them up while I'm reading; for just about every other use it's admittedly not very useful! I've only tested it on a few books from each of the above sources, so if you find it useful but not quite effective for some books, let me know. A minimal sketch of the core idea follows the usage notes below.
- python3 archive_tokenize.py 'path_to_archive_org_txt_file'
- python3 aozora_tokenize.py 'path_to_aozora_bunko_txt_file'
- It will save the tokenized text in the same directory as the script, naming the output file with metadata from the original file
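
For the curious, the core idea is simply: tokenize each line, then rejoin the tokens with spaces. Below is a minimal sketch of that step, assuming the Janome tokenizer (`pip install janome`); the actual scripts may use a different tokenizer and do source-specific cleanup (ruby annotations, Archive.org OCR artifacts, metadata-based output naming) that is omitted or stubbed out here. The `_tokenized` output suffix is a hypothetical stand-in for the real naming scheme.

```python
import sys
from pathlib import Path

from janome.tokenizer import Tokenizer


def add_spaces(text: str, tokenizer: Tokenizer) -> str:
    """Tokenize each line and rejoin the tokens with single spaces."""
    spaced_lines = []
    for line in text.splitlines():
        # wakati=True yields plain surface strings instead of Token objects
        spaced_lines.append(" ".join(tokenizer.tokenize(line, wakati=True)))
    return "\n".join(spaced_lines)


if __name__ == "__main__":
    src = Path(sys.argv[1])
    # Assumes UTF-8 input; note that Aozora Bunko files are often Shift_JIS,
    # in which case encoding="shift_jis" would be needed instead.
    text = src.read_text(encoding="utf-8")
    # Hypothetical naming scheme; the real scripts build the name from
    # the original file's metadata.
    out = Path(src.stem + "_tokenized.txt")
    out.write_text(add_spaces(text, Tokenizer()), encoding="utf-8")
```

Joining with single spaces keeps the text readable on the Kindle while still giving its dictionary lookup clean word boundaries to work with.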