feat(eKoNLPy): this is a BREAKING CHANGE. #8

Merged (9 commits) on Jul 24, 2023
11 changes: 11 additions & 0 deletions .envrc
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring
export PROJECT_NAME="eKoNLPy"
export PROJECT_ROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
VIRTUAL_ENV="${WORKON_HOME}/${PROJECT_NAME}"
if [ -e "${VIRTUAL_ENV}/bin/activate" ]; then
    source "${VIRTUAL_ENV}/bin/activate"
else
    python3 -m venv "${VIRTUAL_ENV}"
    source "${VIRTUAL_ENV}/bin/activate"
    pip install --upgrade pip setuptools wheel
fi
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -14,7 +14,7 @@ repos:
rev: v1.10.0
hooks:
- id: python-check-blanket-noqa
- id: python-check-blanket-type-ignore
# - id: python-check-blanket-type-ignore
- id: python-check-mock-methods
- id: python-no-eval
- id: python-no-log-warn
17 changes: 12 additions & 5 deletions README.md
@@ -22,7 +22,6 @@
[codecov-url]: https://codecov.io/gh/entelecheia/eKoNLPy
[zenodo-image]: https://zenodo.org/badge/DOI/10.5281/zenodo.7809447.svg
[zenodo-url]: https://doi.org/10.5281/zenodo.7809447

[repo-url]: https://github.com/entelecheia/eKoNLPy
[pypi-url]: https://pypi.org/project/ekonlpy
[docs-url]: https://ekonlpy.entelecheia.ai
@@ -34,7 +33,7 @@
`eKoNLPy` is a Korean Natural Language Processing (NLP) Python library specifically designed for economic analysis. It extends the functionality of the `MeCab` tagger from KoNLPy to improve the handling of economic terms, financial institutions, and company names, classifying them as single nouns. Additionally, it incorporates sentiment analysis features to determine the tone of monetary policy statements, such as Hawkish or Dovish.

> **Note**
>
>
> eKoNLPy is built on the [fugashi](https://github.com/polm/fugashi) and [mecab-ko-dic](https://github.com/LuminosoInsight/mecab-ko-dic) libraries. For more information on using the `Mecab` tagger, please refer to the [fugashi documentation](https://github.com/polm/fugashi). Because eKoNLPy no longer relies on the [KoNLPy](https://konlpy.org) library, Java is not required, so eKoNLPy runs on Windows, Linux, and macOS as well as on Google Colab.

If you wish to tokenize general Korean text with eKoNLPy, you do not need to install the `KoNLPy` library. Instead, utilize `ekonlpy.mecab.MeCab` as a replacement for `ekonlpy.tag.Mecab`.
@@ -56,20 +55,28 @@ pip install ekonlpy
To use the part-of-speech tagging feature, call `Mecab.pos(phrase)` just as you would in KoNLPy. First, the input is processed by the Mecab morpheme analyzer. Then, if a combination of consecutive tokens matches a term in the user dictionary, those tokens are merged into a single compound noun.

```python
from ekonlpy.tag import Mecab
from ekonlpy import Mecab

mecab = Mecab()
mecab.pos('금통위는 따라서 물가안정과 병행, 경기상황에 유의하는 금리정책을 펼쳐나가기로 했다고 밝혔다.')
```

> [('금', 'MAJ'), ('통', 'MAG'), ('위', 'NNG'), ('는', 'JX'), ('따라서', 'MAJ'), ('물가', 'NNG'), ('안정', 'NNG'), ('과', 'JC'), ('병행', 'NNG'), (',', 'SC'), ('경기', 'NNG'), ('상황', 'NNG'), ('에', 'JKB'), ('유의', 'NNG'), ('하', 'XSV'), ('는', 'ETM'), ('금리', 'NNG'), ('정책', 'NNG'), ('을', 'JKO'), ('펼쳐', 'VV+EC'), ('나가', 'VX'), ('기', 'ETN'), ('로', 'JKB'), ('했', 'VV+EP'), ('다고', 'EC'), ('밝혔', 'VV+EP'), ('다', 'EF'), ('.', 'SF')]

You can also use the Command Line Interface (CLI) to perform part of speech tagging:

```bash
ekonlpy --input "안녕하세요"
```

> [('안녕', 'NNG'), ('하', 'XSV'), ('세요', 'EP')]

### cf. MeCab POS Tagging (fugashi)

```python
from ekonlpy.mecab import MeCab # Be careful! `C` is capital.
from ekonlpy import Mecab

mecab = MeCab()
mecab = Mecab(use_original_tagger=True) # set use_original_tagger=True
mecab.pos('금통위는 따라서 물가안정과 병행, 경기상황에 유의하는 금리정책을 펼쳐나가기로 했다고 밝혔다.')
```

514 changes: 276 additions & 238 deletions poetry.lock

Large diffs are not rendered by default.

10 changes: 7 additions & 3 deletions pyproject.toml
@@ -17,6 +17,9 @@ include = [
"src/ekonlpy/data/model/*",
]

[tool.poetry.scripts]
ekonlpy = "ekonlpy.__cli__:main"

[tool.poe]
include = [".tasks.toml", ".tasks-extra.toml"]

@@ -27,11 +30,12 @@ mecab-ko-dic = "^1.0.0"
nltk = "^3.8.1"
scipy = "^1.10.1"
pandas = "^1.5.3"
click = "^8.1.6"

[tool.poetry.group.dev.dependencies]
python-semantic-release = "^7.33.1"
isort = "^5.12.0"
black = "^23.1.0"
black = ">=23.0.0,<=23.3.0"
flake8 = "^6.0.0"
mypy = "^1.0.0"
flake8-pyproject = "^1.2.2"
@@ -134,7 +138,7 @@ branch = "master"
version_toml = "pyproject.toml:tool.poetry.version"
version_variable = "src/ekonlpy/_version.py:__version__"
version_source = "tag"
commit_version_number = true # required for version_source = "tag"
commit_version_number = true # required for version_source = "tag"
commit_subject = "chore(release): :rocket: {version} [skip ci]"
prerelease_tag = "rc"
major_on_zero = true
@@ -143,7 +147,7 @@ changelog_file = "CHANGELOG.md"
upload_to_repository = true
upload_to_release = true
build_command = "poetry build --no-cache"
hvcs = "github" # hosting version control system, gitlab is also supported
hvcs = "github" # hosting version control system, gitlab is also supported

[build-system]
requires = ["poetry-core>=1.0.0"]
45 changes: 45 additions & 0 deletions src/ekonlpy/__cli__.py
@@ -0,0 +1,45 @@
"""Command line interface for eKoNLPy"""

# Importing the libraries

import click

from ._version import __version__

CONTEXT_SETTINGS = dict(help_option_names=["-h", "--help"])


@click.command(context_settings=CONTEXT_SETTINGS)
@click.version_option(__version__)
@click.option(
    "--tagger",
    "-t",
    show_default=True,
    default="ekonlpy",
    help="The tagger to use. [ekonlpy|mecab]",
)
@click.option("--input", "-i", help="The input text to tag.")
@click.pass_context
def main(ctx, tagger, input):
    """This is the command line interface for eKoNLPy.

    It is used to tag Korean text with a Korean morphological analyzer.
    """
    # Tag the input text and print the result.
    if input:
        click.echo(tag(tagger, input))
    else:
        # Print usage message to the user.
        click.echo(ctx.get_help())


def tag(tagger: str, text: str) -> list:
    from ekonlpy import Mecab

    mecab = Mecab(use_original_tagger=tagger == "mecab")
    return mecab.pos(text)


# main function for the main module
if __name__ == "__main__":
    main()
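
The `--tagger` option reduces to a single boolean passed to `Mecab`. Below is a minimal sketch of that dispatch. `StubMecab` is a hypothetical stand-in for `ekonlpy.Mecab` so the snippet runs without the library or its dictionaries installed, and the tags it returns are placeholders rather than real tagger output.

```python
from typing import List, Tuple


class StubMecab:
    """Hypothetical stand-in for ekonlpy.Mecab; records which mode was requested."""

    def __init__(self, use_original_tagger: bool = False):
        self.use_original_tagger = use_original_tagger

    def pos(self, text: str) -> List[Tuple[str, str]]:
        # Placeholder tag naming the selected mode, not a real POS tag.
        mode = "ORIGINAL" if self.use_original_tagger else "EXTENDED"
        return [(text, mode)]


def tag(tagger: str, text: str) -> List[Tuple[str, str]]:
    # Same expression as in __cli__.py: only the exact string "mecab"
    # selects the original tagger; every other value uses the extended one.
    mecab = StubMecab(use_original_tagger=tagger == "mecab")
    return mecab.pos(text)


default_result = tag("ekonlpy", "안녕하세요")
original_result = tag("mecab", "안녕하세요")
typo_result = tag("Mecab", "안녕하세요")
```

Because the comparison is an exact string match, an unrecognized value such as `Mecab` silently falls back to the extended tagger. Restricting the option with `click.Choice(["ekonlpy", "mecab"])` would be one way to reject such input up front.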
14 changes: 12 additions & 2 deletions src/ekonlpy/__init__.py
@@ -1,9 +1,19 @@
from . import etag, tag
from ._version import __version__
from .mecab import MecabDicConfig
from .tag import Mecab
from .utils.dictionary import TermDictionary
from .utils.io import installpath, load_dictionary, load_txt
from .utils.io import installpath


def get_version() -> str:
    """This function returns the version of ekonlpy."""
    return __version__


__all__ = [
    "Mecab",
    "MecabDicConfig",
    "TermDictionary",
    "installpath",
    "get_version",
]
3 changes: 3 additions & 0 deletions src/ekonlpy/base/__init__.py
@@ -0,0 +1,3 @@
from .base import BaseMecab

__all__ = ["BaseMecab"]
46 changes: 46 additions & 0 deletions src/ekonlpy/base/base.py
@@ -0,0 +1,46 @@
import logging
from typing import List, Optional, Tuple

logger = logging.getLogger(__name__)


class BaseMecab:
    """Abstract class for MeCab tagger"""

    def parse(
        self,
        text: str,
    ) -> List[Tuple[str, str]]:
        raise NotImplementedError

    def pos(
        self,
        text: str,
    ) -> List[Tuple[str, str]]:
        return self.parse(text)

    def tokenize(
        self,
        text: str,
        strip_pos: bool = False,
        postag_delim: str = "/",
    ) -> List[str]:
        tokens = self.parse(text)

        return [
            token_pos[0] if strip_pos else f"{token_pos[0]}{postag_delim}{token_pos[1]}"
            for token_pos in tokens
        ]

    def morphs(self, text: str) -> List[str]:
        return self.tokenize(text, strip_pos=True)

    def nouns(
        self,
        text: str,
        flatten: bool = True,
        noun_pos: Optional[List[str]] = None,
    ) -> List[str]:
        if not noun_pos:
            noun_pos = []
        return [surface for surface, pos in self.pos(text) if pos in noun_pos]
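
A minimal sketch of how a concrete tagger might plug into `BaseMecab`: only `parse` needs overriding, and `pos`, `tokenize`, and `morphs` follow from it. The snippet repeats a trimmed copy of the base class so it runs on its own. `WhitespaceTagger` and its tagging rule are hypothetical and not part of eKoNLPy.

```python
from typing import List, Optional, Tuple


class BaseMecab:
    """Trimmed copy of the base class above, repeated so the sketch is self-contained."""

    def parse(self, text: str) -> List[Tuple[str, str]]:
        raise NotImplementedError

    def pos(self, text: str) -> List[Tuple[str, str]]:
        return self.parse(text)

    def tokenize(
        self, text: str, strip_pos: bool = False, postag_delim: str = "/"
    ) -> List[str]:
        return [
            surface if strip_pos else f"{surface}{postag_delim}{tag}"
            for surface, tag in self.parse(text)
        ]

    def morphs(self, text: str) -> List[str]:
        return self.tokenize(text, strip_pos=True)

    def nouns(
        self, text: str, flatten: bool = True, noun_pos: Optional[List[str]] = None
    ) -> List[str]:
        if not noun_pos:
            noun_pos = []
        return [surface for surface, pos in self.pos(text) if pos in noun_pos]


class WhitespaceTagger(BaseMecab):
    """Hypothetical tagger: splits on whitespace, tags digits SN and the rest NNG."""

    def parse(self, text: str) -> List[Tuple[str, str]]:
        return [(tok, "SN" if tok.isdigit() else "NNG") for tok in text.split()]


tagger = WhitespaceTagger()
pos_result = tagger.pos("금리 2 인상")
tokens = tagger.tokenize("금리 2 인상")
morphs = tagger.morphs("금리 2 인상")
default_nouns = tagger.nouns("금리 2 인상")
explicit_nouns = tagger.nouns("금리 2 인상", noun_pos=["NNG"])
```

Note that with the default `noun_pos=None`, `nouns` as written returns an empty list, so subclasses presumably pass a noun tag list or override the method.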
2 changes: 2 additions & 0 deletions src/ekonlpy/etag/__init__.py
@@ -1 +1,3 @@
from ._template import ExtTagger

__all__ = ["ExtTagger"]
20 changes: 11 additions & 9 deletions src/ekonlpy/etag/_template.py
@@ -1,4 +1,4 @@
from ..data.tagset import nouns_tags, pass_tags, skip_chk_tags, skip_tags
from ekonlpy.data.tagset import nouns_tags, pass_tags, skip_chk_tags, skip_tags


class ExtTagger:
@@ -20,13 +20,13 @@ def add_skip_tags(self, tags):

def pos(self, tokens):
def ctagger(
ctokens,
max_ngram,
cnouns_tags,
cpass_tags,
cskip_chk_tags,
cskip_tags,
cdictionary,
ctokens,
max_ngram,
cnouns_tags,
cpass_tags,
cskip_chk_tags,
cskip_tags,
cdictionary,
):
tokens_org = ctokens
num_tokens = len(ctokens)
@@ -65,7 +65,9 @@ def ctagger(
for j in range(ngram):
if tmp_tags[j] not in cskip_tags:
new_word += tokens_org[ipos + j][0]
num_word += "n" if tmp_tags[j] == "SN" else tokens_org[ipos + j][0]
num_word += (
"n" if tmp_tags[j] == "SN" else tokens_org[ipos + j][0]
)
dict_tag = cdictionary.get_tags(num_word.lower())
if dict_tag:
new_word = num_word
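
Only part of `ctagger` is visible in this hunk, so the following is a simplified sketch of the general idea rather than the author's implementation: consecutive tokens are joined, digits tagged `SN` are replaced by `"n"` to build a lookup key, and a dictionary hit merges the window into one token. The function name `merge_ngrams`, the plain-`dict` dictionary, the greedy longest-match order, and the sample entries are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]


def merge_ngrams(
    tokens: List[Token], dictionary: Dict[str, str], max_ngram: int = 3
) -> List[Token]:
    """Greedy longest-match merge of consecutive tokens found in `dictionary`."""
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest window first, shrinking down to bigrams.
        for n in range(min(max_ngram, len(tokens) - i), 1, -1):
            window = tokens[i : i + n]
            # Build the lookup key, replacing number tokens (tag SN) with "n".
            key = "".join("n" if tag == "SN" else surface for surface, tag in window)
            dict_tag = dictionary.get(key.lower())
            if dict_tag:
                # Assumption: keep the original surface form in the merged token.
                merged.append(("".join(surface for surface, _ in window), dict_tag))
                i += n
                matched = True
                break
        if not matched:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical dictionary entries and tokens, for illustration only.
user_dictionary = {"물가안정": "NNG", "금리정책": "NNG", "n분기": "NNG"}
sample = [("물가", "NNG"), ("안정", "NNG"), ("과", "JC"), ("금리", "NNG"), ("정책", "NNG")]
merged_sample = merge_ngrams(sample, user_dictionary)
merged_quarter = merge_ngrams([("3", "SN"), ("분기", "NNG")], user_dictionary)
```

Trying the longest window first means a trigram entry takes precedence over any bigram it contains. Whether the real tagger resolves overlaps the same way cannot be confirmed from the lines shown here.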
4 changes: 3 additions & 1 deletion src/ekonlpy/mecab/__init__.py
@@ -1,2 +1,4 @@
from ._mecab import MeCab
from ._mecab import Mecab
from ._userdic import MecabDicConfig

__all__ = ["Mecab", "MecabDicConfig"]