This repository contains scripts and test data used for the development of a topic modeling pipeline in the context of the MiMoText project.
The pipeline is based on a set of scripts by Christof Schöch and is constantly being revised and developed.
- Extracting metadata
- Splitting texts
- Preprocessing: lemmatizing, POS-tagging, filtering by POS, stopword list and minimum word length
- Modeling with mallet (using the python wrapper of the gensim library)
- Postprocessing: statistics (different lists and matrices)
- Visualizing via pyLDAvis
- Generating heatmaps
- Generating wordclouds
Please install the following:
Python 3
Some additional libraries (with their respective dependencies):
"numpy", see:
"pandas", see:
"treetaggerwrapper", see:
"gensim", version 3.8.3, see:
- important note: Gensim 3.8.3 is the last release to include the LDA mallet wrapper, which is essential for the pipeline. This exact gensim version is therefore required to run the pipeline.
"pyLDAvis", see:
"sklearn", see:
"seaborn", see:
"wordcloud", see: (Note: Trying to install wordcloud on Windows often leads to difficulties. It might help to install and run the library with Python version 3.7)
TreeTagger, see
Please note: Follow the installation instructions given here; consider the differences between the different operating systems. It isn't necessary to download any language parameter files. They are already included in this folder.
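Because the pipeline depends on one specific gensim release, it can be useful to check the installed version before running anything. The following is only a minimal sketch of such a check; the helper names `gensim_version_ok` and `describe_gensim` are hypothetical and not part of this repository.

```python
# Version named in the requirements above.
REQUIRED_GENSIM = "3.8.3"


def gensim_version_ok(installed: str, required: str = REQUIRED_GENSIM) -> bool:
    """Return True if the installed version string matches the required one."""
    return installed.strip() == required


def describe_gensim(installed: str) -> str:
    """Build a short human-readable message about the installed version."""
    if gensim_version_ok(installed):
        return f"gensim {installed} found: OK"
    return f"gensim {installed} found, but {REQUIRED_GENSIM} is required"


if __name__ == "__main__":
    try:
        import gensim  # only needed for the actual check
    except ImportError:
        print("gensim is not installed")
    else:
        print(describe_gensim(gensim.__version__))
```

If the check reports a different version, installing the pinned release (for example with `pip install gensim==3.8.3`) should resolve it.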
For the modeling you have to install the mallet implementation first:
- mallet: download here: (a helpful installation guide can be found here: )
- important: In order to run the scripts it is necessary to specify the path where you stored the mallet binary on your computer (see "mallet_path" in
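A wrong mallet path is an easy mistake to make, so a quick check before starting the modeling step may save time. This is a sketch only: the function names are hypothetical, and the example path in the usage part is an assumption that has to be replaced by the location of the mallet binary on your own computer.

```python
import os


def mallet_path_exists(mallet_path: str) -> bool:
    """Return True if the given path points to an existing file."""
    expanded = os.path.abspath(os.path.expanduser(mallet_path))
    return os.path.isfile(expanded)


def resolve_mallet_path(mallet_path: str) -> str:
    """Return the absolute path to the mallet binary or raise a clear error."""
    if not mallet_path_exists(mallet_path):
        raise FileNotFoundError(f"no mallet binary found at: {mallet_path}")
    return os.path.abspath(os.path.expanduser(mallet_path))


if __name__ == "__main__":
    # Hypothetical location; adjust to your own installation.
    candidate = os.path.join("~", "mallet", "bin", "mallet")
    print(candidate, "->", "found" if mallet_path_exists(candidate) else "missing")
```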
Please make sure you have installed Python 3, TreeTagger, mallet and the required libraries.
Download and save this repository.
Save your text files (TXT) in datasets/[name-of-your-dataset]/full.
Now you can run the scripts.
Set your parameters in
- It calls all required scripts in the correct order.
- You can change the following parameters:
  - chunksize: size of the text parts (number of tokens) into which the novels are split
  - lang: language parameter that selects the model for POS-tagging; choose "fr" for modern French and "presto" for French of the 16th/17th century
  - numtopics: number of topics created by the modeling
  - passes: number of iterations
  - modeling: whether the modeling is performed with gensim or mallet
  - optimize_interval (only if mallet is chosen): optimization of the topic model every "[chosen value]" iterations
  - cats: category for which the most distinctive topics are visualized in the heatmap
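The parameters listed above can be pictured as one small configuration object, and `chunksize` in particular as a plain slicing operation over a token list. The sketch below is illustrative only: the class `PipelineParams`, the function `split_into_chunks`, all default values and the type of `cats` are assumptions, and the actual scripts may treat the last, shorter chunk differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PipelineParams:
    # All defaults are placeholders, not the values used in the repository.
    chunksize: int = 1000                    # tokens per text part
    lang: str = "fr"                         # "fr" or "presto"
    numtopics: int = 30                      # number of topics
    passes: int = 100                        # number of iterations
    modeling: str = "mallet"                 # "gensim" or "mallet"
    optimize_interval: Optional[int] = 100   # only relevant for mallet
    cats: str = "author"                     # hypothetical category name

    def problems(self) -> List[str]:
        """Return a list of human-readable problems; empty if all is fine."""
        found = []
        if self.chunksize <= 0:
            found.append("chunksize must be positive")
        if self.lang not in ("fr", "presto"):
            found.append('lang must be "fr" or "presto"')
        if self.numtopics <= 0:
            found.append("numtopics must be positive")
        if self.passes <= 0:
            found.append("passes must be positive")
        if self.modeling not in ("gensim", "mallet"):
            found.append('modeling must be "gensim" or "mallet"')
        return found


def split_into_chunks(tokens: List[str], chunksize: int) -> List[List[str]]:
    """Cut a token list into consecutive parts of at most `chunksize` tokens."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    return [tokens[i:i + chunksize] for i in range(0, len(tokens), chunksize)]


if __name__ == "__main__":
    params = PipelineParams(chunksize=3)
    print(params.problems())
    print(split_into_chunks("le chat dort sur le vieux tapis".split(), params.chunksize))
```

Collecting problems in a list rather than raising on the first one makes it possible to report every misconfigured parameter at once.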
The split texts are saved in datasets/[name of dataset]/txt.
The preprocessed texts are saved as lists of lemmas in results/[name of dataset]/pickles.
The gensim model is saved in results/[name of dataset]/model.
In results/[name of dataset]/ you will also find statistical files, a file "visualization.html" and the heatmap visualizations.
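To see at a glance where the input and output of a run should end up, the folder layout described above can be written down as a small helper. This is a sketch under the assumption that the paths are relative to the repository root; the function `output_locations` and the dataset name `demo` are hypothetical.

```python
from pathlib import Path
from typing import Dict


def output_locations(dataset: str, root: str = ".") -> Dict[str, Path]:
    """Map a short label to each location named in this README."""
    base = Path(root)
    results = base / "results" / dataset
    return {
        "full_texts": base / "datasets" / dataset / "full",
        "split_texts": base / "datasets" / dataset / "txt",
        "pickles": results / "pickles",
        "model": results / "model",
        "visualization": results / "visualization.html",
    }


if __name__ == "__main__":
    # Report which of the expected locations already exist.
    for label, path in output_locations("demo").items():
        status = "exists" if path.exists() else "missing"
        print(f"{label:15} {path.as_posix():40} {status}")
```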
Files and script for preparing topic statements to feed into Wikibase.