Add spaces between Japanese words in eBooks to work with Kindle WordWise

ryancahildebrandt/ebook_tokenizer

Japanese eBook Tokenizer for Kindle WordWise



Purpose

A quick and dirty command-line tool, written in Python, that tokenizes Japanese text and inserts spaces between words. It's designed to work on eBooks (.txt files) taken from the Archive.org or Aozora Bunko repositories. "But why add spaces when Japanese isn't usually written with spaces between words?" you may ask. The only reason I wanted them is so that my Kindle can separate words and let me look them up while I'm reading. For just about every other use, it is admittedly not very useful! I've only tried it on a few books from each of the above sources, so if you find it useful but not quite effective for some books, let me know.
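To make the idea concrete, here is a minimal sketch of "tokenize, then join with spaces". It does not reproduce the scripts in this repository: the tokenizer they use isn't described here, so this stand-in uses a greedy longest-match segmenter over a tiny, hypothetical vocabulary (`TOY_VOCAB`). The function names are also made up for illustration. A real run would swap in a proper Japanese morphological analyzer with a full dictionary.

```python
def longest_match_tokenize(text, vocab, max_len=None):
    """Greedy longest-match segmentation against a known vocabulary.

    Stand-in for a real Japanese tokenizer (an assumption for this sketch).
    """
    if max_len is None:
        max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # so unknown text is still consumed one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def insert_spaces(text, vocab):
    """Tokenize each line separately so the book's line breaks survive."""
    return "\n".join(
        " ".join(longest_match_tokenize(line, vocab))
        for line in text.split("\n")
    )


# Hypothetical toy vocabulary; a real tokenizer ships its own dictionary.
TOY_VOCAB = {"吾輩", "は", "猫", "で", "ある"}

print(insert_spaces("吾輩は猫である", TOY_VOCAB))  # 吾輩 は 猫 で ある
```

Longest-match is crude and will mis-segment plenty of real text, but it shows the shape of the transformation: the characters stay the same, only word boundaries become visible to the Kindle.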


Usage

  • python3 archive_tokenize.py 'path_to_archive_org_txt_file'
  • python3 aozora_tokenize.py 'path_to_aozora_bunko_txt_file'
  • Each script saves the new text in the same directory as the script and names the output file using meta info from the original file
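As a rough illustration of the last point, the sketch below builds an output filename from metadata. Everything specific in it is an assumption: the helper name `output_path`, the choice of title, author, and source filename as the "meta info", and the `_tokenized.txt` suffix are all hypothetical, and the actual scripts may name files differently.

```python
import re
from pathlib import Path


def output_path(source_file, title, author, out_dir):
    """Build an output path from metadata (hypothetical naming scheme).

    out_dir is where the file should go; inside a script, the script's own
    directory can be obtained with Path(__file__).resolve().parent.
    """
    stem = Path(source_file).stem
    # Keep only the non-empty pieces of metadata.
    parts = [p.strip() for p in (title, author, stem) if p and p.strip()]
    # Replace whitespace and characters that are unsafe in filenames.
    safe = re.sub(r'[\\/:*?"<>|\s]+', "_", "_".join(parts))
    return Path(out_dir) / f"{safe}_tokenized.txt"


path = output_path("books/neko.txt", "吾輩は猫である", "夏目漱石", Path("out"))
print(path.name)  # 吾輩は猫である_夏目漱石_neko_tokenized.txt
```

Passing the directory in explicitly keeps the helper easy to test; the caller decides whether that directory is the script's location or somewhere else.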
