If you would like to contribute to Lingua, I encourage you to do so. Do you have ideas for improving the API? Are there specific languages that you would like to see supported early? Or have you found any bugs? Feel free to open an issue or send a pull request. It's very much appreciated.
For pull requests, please make sure that all unit tests pass and that the code is formatted according to the official Kotlin style guide. You can check this by running the Kotlin linter ktlint using `./gradlew ktlintCheck`. Most issues which the linter identifies can be fixed by running `./gradlew ktlintFormat`. All other issues, especially lines which are longer than 120 characters, cannot be fixed automatically. In this case, please format the respective lines by hand. You will notice that the build will fail if the formatting is not correct.
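Since overlong lines have to be fixed by hand, it can help to list them up front. The following is a minimal sketch and not part of Lingua or ktlint: `longLineNumbers` is a hypothetical helper, and the `src` directory and the `.kt` extension filter are assumptions about where the sources live. It counts characters the way Kotlin's `String.length` does, which may differ slightly from ktlint's own measurement.

```kotlin
import java.io.File

// Hypothetical helper for illustration; it is not part of Lingua or ktlint.
// Returns the 1-based numbers of all lines that exceed maxLength characters.
fun longLineNumbers(lines: List<String>, maxLength: Int = 120): List<Int> =
    lines.mapIndexedNotNull { index, line ->
        if (line.length > maxLength) index + 1 else null
    }

fun main() {
    // Assumption: the program is started from the repository root.
    File("src").walkTopDown()
        .filter { it.isFile && it.extension == "kt" }
        .forEach { file ->
            longLineNumbers(file.readLines()).forEach { lineNumber ->
                println("${file.path}:$lineNumber exceeds 120 characters")
            }
        }
}
```

`./gradlew ktlintCheck` remains the authoritative check; the sketch only narrows down where manual reformatting is likely needed.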
All kinds of pull requests are welcome. The pull requests I favor the most are new language additions. If you want to contribute new languages to Lingua, the following guide explains in detail how to do that.
Thank you very much in advance for all contributions, however small they may be.
An important notice beforehand:
In order to execute the steps below, you will need Java 8 or greater. Even though the library itself runs on Java >= 6, the `FilesWriter` classes make use of parts of the `java.nio` API that require Java 8.
- Clone Lingua's repository to your own computer as described in README's section 8.
- Open enums `IsoCode639_1` and `IsoCode639_3` and add the language's ISO codes. Among other sites, Wikipedia provides a comprehensive list.
- Open enum `Language` and add a new entry for your language. If the language is written with a script that is not yet supported by Lingua's `Alphabet` enum, then add a new entry for it there as well.
- If your language's script contains characters that are completely unique to it, then add them to the respective entry in the `Language` enum. However, if the characters occur in more than one language but not in all languages, then add them to the `CHARS_TO_LANGUAGES_MAPPING` constant in class `LanguageDetector` instead.
- Use `LanguageModelFilesWriter` to create the language model files. The training data file used for n-gram probability estimation is not required to have a specific format other than to be a valid txt file.
- Create a new subdirectory in `/src/main/resources/language-models` and put the generated language model files in there. Do not rename the language model files. The name of the subdirectory must be the language's ISO 639-1 code, completely lowercased.
- Use `TestDataFilesWriter` to create the test data files used for accuracy report generation. The input file from which to create the test data should have each sentence on a separate line.
- Put the generated test data files in `/src/test/resources/language-testdata`. Do not rename the test data files.
- For accuracy report generation, create an abstract base class for the main logic in `/src/test/kotlin/com/github/pemistahl/lingua/report/config`. Look at the other languages' files in this directory to see what the class must look like. It should be pretty self-explanatory.
- Create a concrete test class in `/src/test/kotlin/com/github/pemistahl/lingua/report/lingua`. Look at the other languages' files in this directory to see what the class must look like. It should be pretty self-explanatory. If one of the other language detector libraries supports your language already, you can add test classes for those as well. Each library has its own directory for this purpose.
- Fix the existing unit tests by adding your new language.
- Add your new language to property `linguaSupportedLanguages` in `/gradle.properties`.
- Run `./gradlew writeAccuracyReports` and add the updated accuracy reports to your pull request.
- Run `./gradlew drawAccuracyPlots` and add the updated plots to your pull request.
- Run `./gradlew writeAccuracyTable` and add the updated accuracy table to your pull request.
- Be happy! :-) You have successfully contributed a new language and have thereby significantly widened this library's fields of application.
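
The test data step above expects an input file with each sentence on a separate line. If your raw text is not in that shape yet, the following minimal sketch shows one way to get there. It assumes that the JDK's `java.text.BreakIterator` segments your language reasonably well, which is not guaranteed for every script. `splitIntoSentences` is a hypothetical helper and not part of Lingua.

```kotlin
import java.text.BreakIterator
import java.util.Locale

// Hypothetical helper for illustration; it is not part of Lingua.
// Collapses all whitespace (including line breaks) to single spaces and then
// splits the text into trimmed, non-empty sentences using the JDK's BreakIterator.
fun splitIntoSentences(text: String, locale: Locale): List<String> {
    val normalized = text.replace(Regex("\\s+"), " ").trim()
    val iterator = BreakIterator.getSentenceInstance(locale)
    iterator.setText(normalized)

    val sentences = mutableListOf<String>()
    var start = iterator.first()
    var end = iterator.next()
    while (end != BreakIterator.DONE) {
        val sentence = normalized.substring(start, end).trim()
        if (sentence.isNotEmpty()) sentences.add(sentence)
        start = end
        end = iterator.next()
    }
    return sentences
}

fun main() {
    val raw = "Hello world. How are you?"
    // Prints one sentence per line; the result can be saved as a txt file.
    println(splitIntoSentences(raw, Locale.ENGLISH).joinToString("\n"))
}
```

It is worth skimming the result by hand before passing it to `TestDataFilesWriter`, because abbreviations and unusual punctuation can produce wrong sentence boundaries.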