This repo contains demos of open-source NLP tools.
If you'd like to run the demos locally, you'll need to download the dataset first (see the next section) and run the Sentence Transformers notebook to generate the embeddings.
The dataset used is the Recipe Box dataset, collected by Ryan Lee. For the demos, I've kept only recipes with more than 20 words and cleaned the data lightly (see the `preprocess_data.py` script).
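For reference, the word-count filter boils down to something like the sketch below. This is an illustration rather than the actual `preprocess_data.py`; the use of pandas and the `instructions` column name are assumptions.

```python
import pandas as pd

# Hypothetical illustration of the "more than 20 words" filter;
# see preprocess_data.py for the real cleaning logic.
def filter_short_recipes(df: pd.DataFrame, text_col: str = "instructions",
                         min_words: int = 20) -> pd.DataFrame:
    word_counts = df[text_col].fillna("").str.split().str.len()
    return df[word_counts > min_words].reset_index(drop=True)
```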
Run the following to download the recipes locally:
```bash
chmod +x dataload.sh
./dataload.sh
```
The repo can also run locally in a virtual environment, but I recommend using the Dev Container setup.
For a Dev Container setup in VS Code, you'd need:

- The Python and Dev Containers VS Code extensions. Once installed, check that you see a new icon at the bottom-left of the screen; it should look like `><`, with the right bracket a bit higher than the left one.
- A `.env` file in the repo root.
- To open the repo in the container.
The next thing to do is to run the Docker container specified in the `Dockerfile` (with Python) and open this repository in that container. To do this, click on the `><` icon at the bottom-left of the screen and select "Reopen in Container". Once all requirements defined in `requirements.txt` are installed, the environment is set and you can start coding.
If you prefer to work in a virtual environment, you can follow your usual routine, for example:
```bash
python3 -m venv nlp_tools
source nlp_tools/bin/activate
pip install -r requirements.txt
```
In all cases, run the command below once either the Dev Container or the virtual environment is activated, so that the imports work (I hate Python imports):
```bash
sudo python setup.py develop
```
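If you'd rather not call `setup.py` directly, an editable install should achieve the same result (assuming the repo's packaging metadata supports it):

```bash
pip install -e .
```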
- SentenceTransformers - document embeddings creation (see the sketch after this list)
- Brief TF-IDF cameo - document embeddings creation
- Bulk - data viz, EDA & initial labelling
- BERTopic - topic modelling & initial labelling
- Pigeon - simple annotation in Jupyter
- Langchain & Chroma - many NLP goodies, but in this case indexing, vector storing & search
- Simsity - lightweight indexing, storing and search
- SetFit - few-shot learning and classification
- Streamlit - simple deployment and data apps
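The notebooks walk through each tool in detail. As a flavour of the embeddings step, here is a minimal Sentence Transformers sketch; the model name is an assumption and the notebook may use a different checkpoint:

```python
from sentence_transformers import SentenceTransformer

# Encode a few recipe texts into dense vectors.
# "all-MiniLM-L6-v2" is an assumed model choice, not necessarily the one in the notebook.
model = SentenceTransformer("all-MiniLM-L6-v2")
recipes = [
    "Preheat the oven to 180C, season the chicken and roast for 40 minutes.",
    "Whisk the eggs with sugar until pale, then fold in the flour.",
]
embeddings = model.encode(recipes, show_progress_bar=True)
print(embeddings.shape)  # (2, 384) for this model
```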
| Topic | Tool | Notebook Link |
|---|---|---|
| Embeddings | Sentence Transformers | here |
| Embeddings | TF-IDF | here |
| Embeddings visualization | Bulk | here + `python -m bulk text data/bulk_st.csv` in the terminal |
| Topic modelling | BERTopic | here |
| Simple annotations | Pigeon | here |
| Few-shot classification | SetFit | here |
| Simple indexing and search | Simsity | here |
| Indexing & search with ChromaDB & Langchain | ChromaDB & Langchain | here |
| Streamlit search app | Streamlit | here, then `streamlit run streamlit_app/search_app.py` |
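To give a sense of what the indexing & search demos do with the precomputed embeddings, here is a minimal brute-force search sketch. It assumes `embeddings/st_embeddings.joblib` holds a 2D array of recipe embeddings aligned with the recipe texts, and the model name is an assumption; the Simsity and ChromaDB notebooks use proper indexes instead of this manual approach.

```python
import joblib
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed layout: an (n_recipes, dim) array saved by the Sentence Transformers notebook.
embeddings = joblib.load("embeddings/st_embeddings.joblib")
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name

query = "easy tomato pasta"
query_vec = model.encode([query])
scores = cosine_similarity(query_vec, embeddings)[0]
top_idx = np.argsort(scores)[::-1][:5]  # indices of the 5 closest recipes
print(top_idx, scores[top_idx])
```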
- A Huggingface model - https://huggingface.co/krumeto/setfit-recipe-classifer/blob/main/README.md
- The Streamlit app needs small changes to be deployable:
  - Files `data/recipes_raw.zip` and `embeddings/st_embeddings.joblib` need to be read from a data storage or handled via Git LFS. Might do it at some point in time.
- Files