Linguistically annotated corpora of modern French (16-18th c.) with Pie models
«Sisyphe portant CornMol» (Titian, Prado Museum, Madrid, Spain, Source: Wikipedia).
We provide:
- Several authority lists, two of them derived from LGeRM:
  - One list contains only proper nouns (`proper`), with the latest additions at the end.
  - One list contains all the other lemmas (`authority`), with the latest additions at the end.
  - One list contains all the foreign words (`foreign`), with the latest additions at the end.
  - Each file has a `_processed` version with all the entries in alphabetical order, after checking that no entry appears twice.
  - In addition to these three files, `numbers` contains Latin and Arabic numbers and `alphabet` contains single Latin letters.
- CornMol is a gold corpus to be published
- FranText is a corpus taken from the open data of FranText and aligned with our lemmatisation standards.
- presto_gold is a gold corpus used by the Presto project to train their TreeTagger model, converted to CATTEX and lightly corrected to match our authority lists.
- presto_max contains all the modern (16th-18th c.) texts of the Presto project, with heavily corrected lemmas. Each round of annotation/correction is numbered (`v2`, `v3`…).
- Out-of-domain testing data for 16th, 17th, 18th, 19th and 20th c. French:
  - For historical reasons, the data are split into theatrical and non-theatrical texts.
  - The same data exist in two versions, normalised and original (the 19th and 20th c. data remain the same; only the 16th, 17th and 18th c. data change).
- The Models folder contains all the models produced with our data.
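
The `_processed` authority files described in the list above are sorted alphabetically and free of duplicates. A minimal sketch of how such a file could be derived is given below. It assumes one entry per line, and the function names `sort_key` and `process_entries` are hypothetical; this is not necessarily how the files in this repository were produced.

```python
import unicodedata


def sort_key(entry):
    """Sort ignoring case and diacritics, with the raw string as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", entry)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (stripped.casefold(), entry)


def process_entries(lines):
    """Return unique, non-empty entries in alphabetical order.

    Assumption: each line holds exactly one entry.
    """
    unique = set()
    for line in lines:
        # Normalise to NFC so that composed and decomposed forms are merged.
        entry = unicodedata.normalize("NFC", line.strip())
        if entry:
            unique.add(entry)
    return sorted(unique, key=sort_key)


if __name__ == "__main__":
    raw = ["être", "avoir", "estre", "avoir", "", "Dieu"]
    print(process_entries(raw))
```

Sorting on an accent-free, case-folded key keeps accented and capitalised entries next to their plain counterparts; a locale-aware collation could be substituted if stricter French ordering is needed.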
|-Authority_list
  |-authority_processed
  |-authority
  |-propres_processed
  |-propres
  |-foreign
|-Data
  |-CornMol_gold
  |-FranText
  |-presto_max
  |-presto_gold
|-Data_outOfDomain
  |-Data_outOfDomain_normalised
    |-theatre_normalised
    |-varia_normalised
  |-Data_outOfDomain_original
    |-theatre_original
    |-varia_original
|-Models
  |-train_1
  |-train_2
  |-Models
    |-lemma.tar
    |-pos.tar
To use the model:
- Create a virtual environment (`virtualenv env`) and activate it (`source env/bin/activate`).
- Install Pie-extended: `pip install pie-extended`
- Download the freem model: `pie-extended download freem`
- Use the `freem` model: `pie-extended tag freem your_file.txt`
Do note that pie-extended includes a tokeniser dedicated to (early-)modern French.
The morphology is provided but has not been carefully proofread.
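
The tagging step can also be driven from a Python script. The sketch below wraps the `pie-extended tag freem your_file.txt` command and reads annotations back into dictionaries. It assumes that the tagger output is tab-separated with a header row; the column names `token`, `lemma` and `POS`, the function names `tag_file` and `read_annotations`, and the sample rows are illustrative assumptions and should be checked against the files actually produced.

```python
import csv
import io
import subprocess


def tag_file(path, model="freem"):
    """Run the command-line tagger: pie-extended tag <model> <path>."""
    subprocess.run(["pie-extended", "tag", model, path], check=True)


def read_annotations(tsv_text):
    """Parse tab-separated annotations (header row assumed) into dicts."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # keep quotation marks in tokens as-is
    )
    return list(reader)


# Invented sample, only to illustrate the assumed layout.
SAMPLE = "token\tlemma\tPOS\nLes\tle\tDETdef\nroys\troi\tNOMcom\n"

if __name__ == "__main__":
    for row in read_annotations(SAMPLE):
        print(row["token"], row["lemma"], row["POS"])
```

`tag_file` is only defined here, not called, since it requires Pie-extended and the downloaded model to be available in the active environment.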
Our work is licensed under a Creative Commons Attribution 4.0 International Licence.
Presto and LGeRM data are licensed under a Creative Commons Attribution 4.0 International Licence.
If you want to contribute, you can do so by cloning the repository and sending us a pull request, or by sending an email to simon.gabay[at]unige.ch.
Simon Gabay, Thibault Clérice, Matthias Gille-Levenson, Jean-Baptiste Camps, Jean-Baptiste Tanguy, LEM17: data and models for modern French (16-18th c.), Neuchâtel: Université de Neuchâtel, 2020, https://github.com/e-ditiones/LEM17.
Please keep me posted if you use this data! simon.gabay[at]unige.ch