Add Thai word list from Volubilis dictionary #870

Merged: 10 commits, Dec 1, 2023

12 changes: 7 additions & 5 deletions pythainlp/corpus/__init__.py
@@ -20,31 +20,32 @@
 """

 __all__ = [
-    "corpus_path",
     "corpus_db_path",
     "corpus_db_url",
+    "corpus_path",
     "countries",
     "download",
     "get_corpus",
     "get_corpus_db",
     "get_corpus_db_detail",
     "get_corpus_default_db",
     "get_corpus_path",
+    "get_path_folder_corpus",
+    "path_pythainlp_corpus",
     "provinces",
     "remove",
     "thai_dict",
     "thai_family_names",
     "thai_female_names",
     "thai_male_names",
     "thai_negations",
-    "thai_synonym",
+    "thai_orst_words",
     "thai_stopwords",
     "thai_syllables",
+    "thai_synonym",
     "thai_words",
     "thai_wsd_dict",
-    "thai_orst_words",
-    "path_pythainlp_corpus",
-    "get_path_folder_corpus",
+    "volubilis",
 ]

 import os
@@ -119,3 +120,4 @@ def corpus_db_path() -> str:
     thai_dict,
     thai_wsd_dict
 )
+from pythainlp.corpus.volubilis import volubilis
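
The `__all__` entries in the hunk above appear to be reordered alphabetically, with `volubilis` appended as the last name. As a rough illustration only, the sketch below checks a list of exported names for duplicates and ordering problems. The helper `find_ordering_problems` is hypothetical and is not part of PyThaiNLP; `NEW_ALL` copies the post-change list by hand.

```python
from typing import List


def find_ordering_problems(names: List[str]) -> List[str]:
    """Hypothetical helper: report duplicate and out-of-order names."""
    problems = []
    seen = set()
    for name in names:
        if name in seen:
            problems.append(f"duplicate: {name}")
        seen.add(name)
    # Compare each name with its predecessor using plain string ordering.
    for prev, cur in zip(names, names[1:]):
        if prev > cur:
            problems.append(f"out of order: {cur} after {prev}")
    return problems


# The export list as it reads after this change (copied by hand).
NEW_ALL = [
    "corpus_db_path", "corpus_db_url", "corpus_path", "countries",
    "download", "get_corpus", "get_corpus_db", "get_corpus_db_detail",
    "get_corpus_default_db", "get_corpus_path", "get_path_folder_corpus",
    "path_pythainlp_corpus", "provinces", "remove", "thai_dict",
    "thai_family_names", "thai_female_names", "thai_male_names",
    "thai_negations", "thai_orst_words", "thai_stopwords",
    "thai_syllables", "thai_synonym", "thai_words", "thai_wsd_dict",
    "volubilis",
]

print(find_ordering_problems(NEW_ALL))
```

A check like this could run in a test suite so that later additions to `__all__` keep the same order; whether the project wants such a check is an open question.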
75 changes: 48 additions & 27 deletions pythainlp/corpus/corpus_license.md
@@ -10,31 +10,30 @@ The following word lists are created by the PyThaiNLP project and released under
 **Creative Commons Zero 1.0 Universal Public Domain Dedication License**
 https://creativecommons.org/publicdomain/zero/1.0/

-Filename | Description
----------|------------
-countries_th.txt | List of countries in Thai
-etcc.txt | List of Enhanced Thai Character Clusters
-negations_th.txt | Negation word list
-stopwords_th.txt | Stop word list
-syllables_th.txt | List of Thai syllables
-thailand_provinces_th.csv | List of Thailand provinces in Thai
-tnc_freq.txt | Words and their frequencies, from Thai National Corpus
-ttc_freq.txt | Words and their frequencies, from Thai Textbook Corpus
-words_th.txt | List of Thai words
-words_th_thai2fit_201810.txt | List of Thai words (frozen for thai2fit)
+| Filename | Description |
+| ---------------------------- | ------------------------------------------------------ |
+| countries_th.txt | List of countries in Thai |
+| etcc.txt | List of Enhanced Thai Character Clusters |
+| negations_th.txt | Negation word list |
+| stopwords_th.txt | Stop word list |
+| syllables_th.txt | List of Thai syllables |
+| thailand_provinces_th.csv | List of Thailand provinces in Thai |
+| tnc_freq.txt | Words and their frequencies, from Thai National Corpus |
+| ttc_freq.txt | Words and their frequencies, from Thai Textbook Corpus |
+| words_th.txt | List of Thai words |
+| words_th_thai2fit_201810.txt | List of Thai words (frozen for thai2fit) |

 The following word lists are from **Thai Male and Female Names Corpus**
 https://github.com/korkeatw/thai-names-corpus/ by Korkeat Wannapat
 and released under their original licenses which are
 **Creative Commons Attribution-ShareAlike 4.0 International Public License**
 https://creativecommons.org/licenses/by-sa/4.0/

-Filename | Description
----------|------------
-family_names_th.txt | List of family names in Thailand
-person_names_female_th.txt | List of female names in Thailand
-person_names_male_th.txt | List of male names in Thailand
-
+| Filename | Description |
+| -------------------------- | -------------------------------- |
+| family_names_th.txt | List of family names in Thailand |
+| person_names_female_th.txt | List of female names in Thailand |
+| person_names_male_th.txt | List of male names in Thailand |

 ## Models

@@ -43,14 +42,13 @@ and released under
 **Creative Commons Attribution 4.0 International Public License**
 https://creativecommons.org/licenses/by/4.0/

-Filename | Description
----------|------------
-pos_orchid_perceptron.pkl | Part-of-speech tagging model, trained from ORCHID data, using perceptron
-pos_orchid_unigram.json | Part-of-speech tagging model, trained from ORCHID data, using unigram
-pos_ud_perceptron.pkl | Part-of-speech tagging model, trained from Parallel Universal Dependencies treebank, using perceptron
-pos_ud_unigram.json | Part-of-speech tagging model, trained from Parallel Universal Dependencies treebank, using unigram
-sentenceseg_crfcut.model | Sentence segmentation model, trained from TED subtitles, using CRF
-
+| Filename | Description |
+| ------------------------- | ----------------------------------------------------------------------------------------------------- |
+| pos_orchid_perceptron.pkl | Part-of-speech tagging model, trained from ORCHID data, using perceptron |
+| pos_orchid_unigram.json | Part-of-speech tagging model, trained from ORCHID data, using unigram |
+| pos_ud_perceptron.pkl | Part-of-speech tagging model, trained from Parallel Universal Dependencies treebank, using perceptron |
+| pos_ud_unigram.json | Part-of-speech tagging model, trained from Parallel Universal Dependencies treebank, using unigram |
+| sentenceseg_crfcut.model | Sentence segmentation model, trained from TED subtitles, using CRF |

 ## Thai WordNet

@@ -100,4 +98,27 @@ For more information about Thai WordNet, see
 S. Thoongsup et al., ‘Thai WordNet construction’,
 in Proceedings of the 7th Workshop on Asian Language Resources,
 Suntec, Singapore, Aug. 2009, pp. 139–144.
-https://www.aclweb.org/anthology/W09-3420.pdf
+https://www.aclweb.org/anthology/W09-3420.pdf
+
+## Volubilis
+
+A list of Thai words from the Volubilis dictionary (volubilis.txt), processed by konbraphat51 (https://github.com/konbraphat51/Thai_Dictionary_Cleaner/tree/main).
+
+The original data is the VOLUBILIS 23.1 (Mar. 2023) database from [Volubilis](https://belisan-volubilis.blogspot.com/), created by Francis Bastien.
+
+```
+VOLUBILIS MULTILINGUAL THAI DICT. & DATABASE by Francis Bastien (Belisan) is licensed under CC BY-SA 4.0
+
+This is a human-readable summary of (and not a substitute for) the license below.
+You are free:
+to Share: copy and redistribute the material in any medium or format
+to Adapt: remix, transform, and build upon the material
+for any purpose, even commercially.
+The licensor cannot revoke these freedoms as long as you follow the license terms.
+Under the following terms:
+Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
+Share Alike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
+No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
+Notices:
+You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation. No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.
+```
43 changes: 43 additions & 0 deletions pythainlp/corpus/volubilis.py
@@ -0,0 +1,43 @@
+# -*- coding: utf-8 -*-
+# Copyright (C) 2016-2023 PyThaiNLP Project
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Provides an optional word list from the Volubilis dictionary.
+"""
+from typing import FrozenSet
+
+from pythainlp.corpus.common import get_corpus
+
+_VOLUBILIS = None
+_VOLUBILIS_FILENAME = "volubilis_modified.txt"
+
+
+def volubilis() -> FrozenSet[str]:
+    """
+    Return a frozenset of words from the Volubilis dictionary.
+
+    The data is at pythainlp/corpus/volubilis_modified.txt
+    The word list has been prepared by the code at:
+    https://github.com/konbraphat51/Thai_Dictionary_Cleaner
+    Based on Volubilis dictionary 23.1 (March 2023):
+    https://belisan-volubilis.blogspot.com/
+
+    :return: :class:`frozenset` containing words in the Volubilis dictionary.
+    :rtype: :class:`frozenset`
+    """
+    global _VOLUBILIS
+    if not _VOLUBILIS:
+        _VOLUBILIS = get_corpus(_VOLUBILIS_FILENAME)
+
+    return _VOLUBILIS
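
The new module keeps the word list in a module-level variable and reads the corpus file only on the first call. The sketch below illustrates that lazy-loading pattern using only the standard library. It is not the PyThaiNLP implementation: `load_word_list` is a hypothetical stand-in for `get_corpus`, the one-word-per-line file format is an assumption, and the sample words are arbitrary.

```python
import os
import tempfile
from typing import FrozenSet, Optional

_WORD_CACHE: Optional[FrozenSet[str]] = None  # filled on the first call


def load_word_list(path: str) -> FrozenSet[str]:
    """Hypothetical loader: read one word per line, cache the result."""
    global _WORD_CACHE
    if _WORD_CACHE is None:
        with open(path, encoding="utf-8") as handle:
            # Skip blank lines; duplicates collapse inside the frozenset.
            _WORD_CACHE = frozenset(
                line.strip() for line in handle if line.strip()
            )
    return _WORD_CACHE


# Demo with a throwaway file containing a blank line and a duplicate.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_words.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("แมว\nหมา\n\nแมว\n")

    first = load_word_list(sample_path)
    second = load_word_list(sample_path)  # served from the cache

print(sorted(first))
print(first is second)
```

The sketch tests `is None` rather than truthiness, so an empty word list would also be cached; the module in this change uses `if not _VOLUBILIS`, which behaves the same for any non-empty list. Because the result is a `frozenset`, callers can share the cached object without risk of accidental mutation.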