This repo contains demos of open-source NLP tools.
If you'd like to run the demos locally, you'll need to download the dataset first (see the next section) and run the Sentence Transformers notebook to generate the embeddings.
The dataset used is the Recipe Box dataset, collected by Ryan Lee. For the demos, I've kept only recipes with more than 20 words and cleaned the data lightly (see the `preprocess_data.py` script).
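For reference, the word-count filter boils down to something like the sketch below. This is an illustration rather than the actual `preprocess_data.py`; the use of pandas and the `instructions` column name are assumptions.

```python
import pandas as pd

# Hypothetical illustration of the "more than 20 words" filter;
# see preprocess_data.py for the real cleaning logic.
def filter_short_recipes(df: pd.DataFrame, text_col: str = "instructions",
                         min_words: int = 20) -> pd.DataFrame:
    word_counts = df[text_col].fillna("").str.split().str.len()
    return df[word_counts > min_words].reset_index(drop=True)
```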
Run the following to download the recipes locally:
```bash
chmod +x dataload.sh
./dataload.sh
```
The repo can also run locally in a virtual environment, but I recommend using the Dev Container setup.
For a Dev Container setup in VS Code, you'd need:

- The Python and Dev Containers VS Code extensions. Once installed, check that you see a new icon at the bottom-left of the screen; it should look like `><`, with the right bracket a bit higher than the left one.
- A `.env` file in the repo root.
- To open the repo in the container.
The next thing to do is to run the Docker container specified in the `Dockerfile` (with Python) and open this repository in that container. To do this, click on the `><` icon at the bottom-left of the screen and select "Reopen in Container". Once all requirements defined in `requirements.txt` are installed, the environment is set and you can start coding.
If you prefer to work in a virtual environment, you can follow your usual routine, for example:
```bash
python3 -m venv nlp_tools
source nlp_tools/bin/activate
pip install -r requirements.txt
```
In all cases, run the command below once either the Dev Container or the virtual environment is activated, so that the imports work (I hate Python imports):
```bash
sudo python setup.py develop
```
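If you'd rather not call `setup.py` directly, an editable install should achieve the same result (assuming the repo's packaging metadata supports it):

```bash
pip install -e .
```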
- SentenceTransformers - document embeddings creation (see the sketch after this list)
- Brief TF-IDF cameo - document embeddings creation
- Bulk - data viz, EDA & initial labelling
- BERTopic - topic modelling & initial labelling
- Pigeon - simple annotation in Jupyter
- Langchain & Chroma - many NLP goodies, but in this case indexing, vector storing & search
- Simsity - lightweight indexing, storing and search
- SetFit - few-shot learning and classification
- Streamlit - simple deployment and data apps
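The notebooks walk through each tool in detail. As a flavour of the embeddings step, here is a minimal Sentence Transformers sketch; the model name is an assumption and the notebook may use a different checkpoint:

```python
from sentence_transformers import SentenceTransformer

# Encode a few recipe texts into dense vectors.
# "all-MiniLM-L6-v2" is an assumed model choice, not necessarily the one in the notebook.
model = SentenceTransformer("all-MiniLM-L6-v2")
recipes = [
    "Preheat the oven to 180C, season the chicken and roast for 40 minutes.",
    "Whisk the eggs with sugar until pale, then fold in the flour.",
]
embeddings = model.encode(recipes, show_progress_bar=True)
print(embeddings.shape)  # (2, 384) for this model
```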
| Topic | Tool | Notebook Link |
|---|---|---|
| Embeddings | Sentence Transformers | here |
| Embeddings | TF-IDF | here |
| Embeddings visualization | Bulk | here + `python -m bulk text data/bulk_st.csv` in the terminal |
| Topic modelling | BERTopic | here |
| Simple annotations | Pigeon | here |
| Few-shot classification | SetFit | here |
| Simple indexing and search | Simsity | here |
| Indexing & search with ChromaDB & Langchain | ChromaDB & Langchain | here |
| Streamlit search app | Streamlit | here, then `streamlit run streamlit_app/search_app.py` |
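To give a sense of what the indexing & search demos do with the precomputed embeddings, here is a minimal brute-force search sketch. It assumes `embeddings/st_embeddings.joblib` holds a 2D array of recipe embeddings aligned with the recipe texts, and the model name is an assumption; the Simsity and ChromaDB notebooks use proper indexes instead of this manual approach.

```python
import joblib
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Assumed layout: an (n_recipes, dim) array saved by the Sentence Transformers notebook.
embeddings = joblib.load("embeddings/st_embeddings.joblib")
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model name

query = "easy tomato pasta"
query_vec = model.encode([query])
scores = cosine_similarity(query_vec, embeddings)[0]
top_idx = np.argsort(scores)[::-1][:5]  # indices of the 5 closest recipes
print(top_idx, scores[top_idx])
```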
- A Huggingface model - https://huggingface.co/krumeto/setfit-recipe-classifer/blob/main/README.md
- The Streamlit app needs small changes to be deployable:
  - Files `data/recipes_raw.zip` and `embeddings/st_embeddings.joblib` need to be read from a data storage or handled via Git LFS. Might do it at some point in time.
- Files