My repositories here are tutorials (with Python code) for pre-processing Japanese text datasets for use with common analysis software:
- Aozora Corpus Builder for Aozora Bunko HTML files
- Taiyō Corpus Tools for NINJAL's early-1900s Taiyō magazine XML corpus
The writeups are much longer than the code itself! I created them as a resource for getting started with the niche technical issues you'll often encounter when trying to use Japanese data sources. (Non-Unicode legacy encodings and the lack of word boundaries are the main challenges.)
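A minimal sketch of both hurdles using only the Python standard library (the sample sentence is hypothetical; real segmentation needs a morphological analyzer such as MeCab, which the tutorials cover):

```python
# Legacy encodings: Aozora Bunko files are commonly Shift-JIS, not UTF-8.
raw = "吾輩は猫である。".encode("shift_jis")  # simulate reading a Shift-JIS file

# Decoding with the wrong codec fails (or silently garbles text as "mojibake"):
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8 -- decode as Shift-JIS instead")

text = raw.decode("shift_jis")  # correct decode

# No word boundaries: Japanese uses no spaces, so naive splitting finds
# the entire sentence as a single "word".
print(text.split())  # one token, not a word list
```

This is why the pre-processing pipelines in these repositories spend most of their effort on encoding conversion and tokenization before any analysis tool ever sees the text.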
I no longer work in this area, so I'm sharing these as-is. Please freely reuse, fork, adapt, and/or steal for your own purposes -- that's why they're here!
Each of the projects above has its own dataset-specific resource section, but you might also be interested in the resources listed at my East Asian Digital Humanities page (external link): a semester-long course syllabus, weekend workshop materials, and previous blog posts about the Aozora project. That page is no longer actively updated, so be aware that nothing there is more recent than late 2019.
UPenn's annual Dream Lab digital humanities workshop series has included East Asian Digital Humanities for several years (co-taught by Paula Curtis and Paul Vierthaler). Paula has extensively taught Japanese text mining and digital methods in various workshops, and you can find more information on her website.
Digital Humanities Japan also maintains a wiki and mailing list to support resource-sharing and collaboration on Japanese-language digital projects and tech issues.