Skip to content

Commit

Permalink
johnsnowlabs embeddings support (#11271)
Browse files Browse the repository at this point in the history
- **Description:** Introducing the
[JohnSnowLabsEmbeddings](https://www.johnsnowlabs.com/)
  - **Dependencies:** johnsnowlabs
  - **Tag maintainer:** @C-K-Loan
- **Twitter handle:** https://twitter.com/JohnSnowLabs
https://twitter.com/ChristianKasimL

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
  • Loading branch information
C-K-Loan and baskaryan authored Oct 27, 2023
1 parent c08b622 commit a35445c
Show file tree
Hide file tree
Showing 5 changed files with 415 additions and 0 deletions.
117 changes: 117 additions & 0 deletions docs/docs/integrations/providers/johnsnowlabs.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,117 @@
# Johnsnowlabs

Gain access to the [johnsnowlabs](https://www.johnsnowlabs.com/) ecosystem of enterprise NLP libraries
with over 21.000 enterprise NLP models in over 200 languages with the open source `johnsnowlabs` library.
For all 24.000+ models, see the [John Snow Labs Model Models Hub](https://nlp.johnsnowlabs.com/models)

## Installation and Setup


```bash
pip install johnsnowlabs
```

To [install enterprise features](https://nlp.johnsnowlabs.com/docs/en/jsl/install_licensed_quick, run:
```python
# for more details see https://nlp.johnsnowlabs.com/docs/en/jsl/install_licensed_quick
nlp.install()
```


You can embed your queries and documents with either `gpu`,`cpu`,`apple_silicon`,`aarch` based optimized binaries.
By default cpu binaries are used.
Once a session is started, you must restart your notebook to switch between GPU or CPU, or changes will not take effect.

## Embed Query with CPU:
```python
document = "foo bar"
embedding = JohnSnowLabsEmbeddings('embed_sentence.bert')
output = embedding.embed_query(document)
```


## Embed Query with GPU:


```python
document = "foo bar"
embedding = JohnSnowLabsEmbeddings('embed_sentence.bert','gpu')
output = embedding.embed_query(document)
```




## Embed Query with Apple Silicon (M1,M2,etc..):

```python
documents = ["foo bar", 'bar foo']
embedding = JohnSnowLabsEmbeddings('embed_sentence.bert','apple_silicon')
output = embedding.embed_query(document)
```



## Embed Query with AARCH:

```python
documents = ["foo bar", 'bar foo']
embedding = JohnSnowLabsEmbeddings('embed_sentence.bert','aarch')
output = embedding.embed_query(document)
```






## Embed Document with CPU:
```python
documents = ["foo bar", 'bar foo']
embedding = JohnSnowLabsEmbeddings('embed_sentence.bert','gpu')
output = embedding.embed_documents(documents)
```



## Embed Document with GPU:

```python
documents = ["foo bar", 'bar foo']
embedding = JohnSnowLabsEmbeddings('embed_sentence.bert','gpu')
output = embedding.embed_documents(documents)
```





## Embed Document with Apple Silicon (M1,M2,etc..):

```python

```python
documents = ["foo bar", 'bar foo']
embedding = JohnSnowLabsEmbeddings('embed_sentence.bert','apple_silicon')
output = embedding.embed_documents(documents)
```



## Embed Document with AARCH:

```python

```python
documents = ["foo bar", 'bar foo']
embedding = JohnSnowLabsEmbeddings('embed_sentence.bert','aarch')
output = embedding.embed_documents(documents)
```




Models are loaded with [nlp.load](https://nlp.johnsnowlabs.com/docs/en/jsl/load_api) and spark session is started with [nlp.start()](https://nlp.johnsnowlabs.com/docs/en/jsl/start-a-sparksession) under the hood.



184 changes: 184 additions & 0 deletions docs/docs/integrations/text_embedding/johnsnowlabs_embedding.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,184 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Johnsnowlabs Embedding\n",
"\n",
"### Loading the Johnsnowlabs embedding class to generate and query embeddings\n",
"\n",
"Models are loaded with [nlp.load](https://nlp.johnsnowlabs.com/docs/en/jsl/load_api) and spark session is started with [nlp.start()](https://nlp.johnsnowlabs.com/docs/en/jsl/start-a-sparksession) under the hood.\n",
"For all 24.000+ models, see the [John Snow Labs Model Models Hub](https://nlp.johnsnowlabs.com/models)\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"! pip install johnsnowlabs\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# If you have a enterprise license, you can run this to install enterprise features\n",
"# from johnsnowlabs import nlp\n",
"# nlp.install()"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"source": [
"#### Import the necessary classes"
],
"metadata": {
"collapsed": false
},
"execution_count": 1,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Found existing installation: langchain 0.0.189\n",
"Uninstalling langchain-0.0.189:\n",
" Successfully uninstalled langchain-0.0.189\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from langchain.embeddings.johnsnowlabs import JohnSnowLabsEmbeddings"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"#### Initialize Johnsnowlabs Embeddings and Spark Session"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"embedder = JohnSnowLabsEmbeddings('en.embed_sentence.biobert.clinical_base_cased')"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"#### Define some example texts . These could be any documents that you want to analyze - for example, news articles, social media posts, or product reviews."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"texts = [\"Cancer is caused by smoking\", \"Antibiotics aren't painkiller\"]"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"#### Generate and print embeddings for the texts . The JohnSnowLabsEmbeddings class generates an embedding for each document, which is a numerical representation of the document's content. These embeddings can be used for various natural language processing tasks, such as document similarity comparison or text classification."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"embeddings = embedder.embed_documents(texts)\n",
"for i, embedding in enumerate(embeddings):\n",
" print(f\"Embedding for document {i+1}: {embedding}\")"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"#### Generate and print an embedding for a single piece of text. You can also generate an embedding for a single piece of text, such as a search query. This can be useful for tasks like information retrieval, where you want to find documents that are similar to a given query."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"query = \"Cancer is caused by smoking\"\n",
"query_embedding = embedder.embed_query(query)\n",
"print(f\"Embedding for query: {query_embedding}\")"
],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
2 changes: 2 additions & 0 deletions libs/langchain/langchain/embeddings/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@
from langchain.embeddings.huggingface_hub import HuggingFaceHubEmbeddings
from langchain.embeddings.javelin_ai_gateway import JavelinAIGatewayEmbeddings
from langchain.embeddings.jina import JinaEmbeddings
from langchain.embeddings.johnsnowlabs import JohnSnowLabsEmbeddings
from langchain.embeddings.llamacpp import LlamaCppEmbeddings
from langchain.embeddings.localai import LocalAIEmbeddings
from langchain.embeddings.minimax import MiniMaxEmbeddings
Expand Down Expand Up @@ -113,6 +114,7 @@
"JavelinAIGatewayEmbeddings",
"OllamaEmbeddings",
"QianfanEmbeddingsEndpoint",
"JohnSnowLabsEmbeddings",
]


Expand Down
Loading

0 comments on commit a35445c

Please sign in to comment.