mollydesjardin/README.md

Hi there ✌️ ようこそ

My repositories here are tutorials (with Python code) for pre-processing Japanese text datasets for use with common analysis software:

The writeups are much longer than the code itself! I created them as a resource for getting started with the niche technical issues you'll often encounter when trying to use Japanese data sources. (Non-Unicode encodings and the lack of word boundaries are the main challenges.)
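To make those two challenges concrete, here is a minimal sketch using only the Python standard library. It is not taken from the repositories themselves. The helper name `decode_legacy`, the sample string, and the choice of Shift_JIS as the legacy encoding are all assumptions for illustration; a real dataset may use a different encoding, and you should check its documentation.

```python
def decode_legacy(raw: bytes, encoding: str = "shift_jis") -> str:
    """Decode bytes from a legacy Japanese encoding into a Unicode str.

    The default encoding is an assumption; pass "euc_jp", "cp932", etc.
    if that is what your source actually uses.
    """
    return raw.decode(encoding)


# Simulate a non-Unicode source by encoding a sample string ourselves.
sample = "日本語のテキスト"
legacy_bytes = sample.encode("shift_jis")

# Challenge 1: convert the legacy bytes to Unicode, then to UTF-8 for output.
text = decode_legacy(legacy_bytes)
utf8_bytes = text.encode("utf-8")

# Challenge 2: there are no spaces between words, so whitespace splitting
# returns the whole string as a single "token".
tokens = text.split()
```

Because `tokens` holds just one item here, whitespace-based tools cannot count or compare words in this text until a separate segmentation step has inserted word boundaries.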

I no longer work in this area, so I'm sharing these as-is. Please freely reuse, fork, adapt, and/or steal for your own purposes -- that's why it's here!

Other resources

Each of the projects above has its own dataset-specific resource section, but you might be interested in other resources listed at my East Asian Digital Humanities page (external link): a semester-long course syllabus, weekend workshop materials, and previous blog posts about the Aozora project. That page is not being actively updated, so be aware that nothing on it is more recent than late 2019.

UPenn's annual Dream Lab digital humanities workshop series has included East Asian Digital Humanities for several years (co-taught by Paula Curtis and Paul Vierthaler). Paula has taught Japanese text mining and digital methods extensively in various workshops, and you can find more information on her website.

Digital Humanities Japan also maintains a wiki and mailing list to support resource-sharing and collaboration on Japanese-language digital projects and tech issues.

Popular repositories

  1. taiyo-corpus-tools Public

    Python 1

  2. aozora Public

    Aozora Corpus Builder

    Python

  3. mollydesjardin Public

  4. skrub Public

    Forked from skrub-data/skrub

    Prepping tables for machine learning

    Python

  5. narwhals Public

    Forked from narwhals-dev/narwhals

    Lightweight and extensible compatibility layer between dataframe libraries!

    Python