feat: Added jina-embeddings-v2-base-code #301

Merged: 5 commits, Jul 18, 2024
73 changes: 41 additions & 32 deletions docs/examples/Supported_Models.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2024-05-31T18:13:25.863008Z",
Expand Down Expand Up @@ -106,16 +106,16 @@
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>sentence-transformers/all-MiniLM-L6-v2</td>\n",
" <td>snowflake/snowflake-arctic-embed-xs</td>\n",
" <td>384</td>\n",
" <td>Sentence Transformer model, MiniLM-L6-v2</td>\n",
" <td>Based on all-MiniLM-L6-v2 model with only 22m ...</td>\n",
" <td>0.090</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>snowflake/snowflake-arctic-embed-xs</td>\n",
" <td>sentence-transformers/all-MiniLM-L6-v2</td>\n",
" <td>384</td>\n",
" <td>Based on all-MiniLM-L6-v2 model with only 22m ...</td>\n",
" <td>Sentence Transformer model, MiniLM-L6-v2</td>\n",
" <td>0.090</td>\n",
" </tr>\n",
" <tr>\n",
Expand Down Expand Up @@ -190,16 +190,16 @@
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>jinaai/jina-embeddings-v2-base-en</td>\n",
" <td>nomic-ai/nomic-embed-text-v1.5</td>\n",
" <td>768</td>\n",
" <td>English embedding model supporting 8192 sequen...</td>\n",
" <td>8192 context length english model</td>\n",
" <td>0.520</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>nomic-ai/nomic-embed-text-v1.5</td>\n",
" <td>jinaai/jina-embeddings-v2-base-en</td>\n",
" <td>768</td>\n",
" <td>8192 context length english model</td>\n",
" <td>English embedding model supporting 8192 sequen...</td>\n",
" <td>0.520</td>\n",
" </tr>\n",
" <tr>\n",
Expand All @@ -225,34 +225,41 @@
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>jinaai/jina-embeddings-v2-base-code</td>\n",
" <td>768</td>\n",
" <td>Source code embedding model supporting 8192 se...</td>\n",
" <td>0.640</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>sentence-transformers/paraphrase-multilingual-...</td>\n",
" <td>768</td>\n",
" <td>Sentence-transformers model for tasks like clu...</td>\n",
" <td>1.000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <th>21</th>\n",
" <td>snowflake/snowflake-arctic-embed-l</td>\n",
" <td>1024</td>\n",
" <td>Based on intfloat/e5-large-unsupervised, large...</td>\n",
" <td>1.020</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <th>22</th>\n",
" <td>thenlper/gte-large</td>\n",
" <td>1024</td>\n",
" <td>Large general text embeddings model</td>\n",
" <td>1.200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <th>23</th>\n",
" <td>BAAI/bge-large-en-v1.5</td>\n",
" <td>1024</td>\n",
" <td>Large English model, v1.5</td>\n",
" <td>1.200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <th>24</th>\n",
" <td>intfloat/multilingual-e5-large</td>\n",
" <td>1024</td>\n",
" <td>Multilingual model, e5-large. Recommend using ...</td>\n",
Expand All @@ -266,8 +273,8 @@
" model dim \\\n",
"0 BAAI/bge-small-en-v1.5 384 \n",
"1 BAAI/bge-small-zh-v1.5 512 \n",
"2 sentence-transformers/all-MiniLM-L6-v2 384 \n",
"3 snowflake/snowflake-arctic-embed-xs 384 \n",
"2 snowflake/snowflake-arctic-embed-xs 384 \n",
"3 sentence-transformers/all-MiniLM-L6-v2 384 \n",
"4 jinaai/jina-embeddings-v2-small-en 512 \n",
"5 BAAI/bge-small-en 384 \n",
"6 snowflake/snowflake-arctic-embed-s 384 \n",
Expand All @@ -278,22 +285,23 @@
"11 jinaai/jina-embeddings-v2-base-de 768 \n",
"12 BAAI/bge-base-en 768 \n",
"13 snowflake/snowflake-arctic-embed-m 768 \n",
"14 jinaai/jina-embeddings-v2-base-en 768 \n",
"15 nomic-ai/nomic-embed-text-v1.5 768 \n",
"14 nomic-ai/nomic-embed-text-v1.5 768 \n",
"15 jinaai/jina-embeddings-v2-base-en 768 \n",
"16 nomic-ai/nomic-embed-text-v1 768 \n",
"17 snowflake/snowflake-arctic-embed-m-long 768 \n",
"18 mixedbread-ai/mxbai-embed-large-v1 1024 \n",
"19 sentence-transformers/paraphrase-multilingual-... 768 \n",
"20 snowflake/snowflake-arctic-embed-l 1024 \n",
"21 thenlper/gte-large 1024 \n",
"22 BAAI/bge-large-en-v1.5 1024 \n",
"23 intfloat/multilingual-e5-large 1024 \n",
"19 jinaai/jina-embeddings-v2-base-code 768 \n",
"20 sentence-transformers/paraphrase-multilingual-... 768 \n",
"21 snowflake/snowflake-arctic-embed-l 1024 \n",
"22 thenlper/gte-large 1024 \n",
"23 BAAI/bge-large-en-v1.5 1024 \n",
"24 intfloat/multilingual-e5-large 1024 \n",
"\n",
" description size_in_GB \n",
"0 Fast and Default English model 0.067 \n",
"1 Fast and recommended Chinese model 0.090 \n",
"2 Sentence Transformer model, MiniLM-L6-v2 0.090 \n",
"3 Based on all-MiniLM-L6-v2 model with only 22m ... 0.090 \n",
"2 Based on all-MiniLM-L6-v2 model with only 22m ... 0.090 \n",
"3 Sentence Transformer model, MiniLM-L6-v2 0.090 \n",
"4 English embedding model supporting 8192 sequen... 0.120 \n",
"5 Fast English model 0.130 \n",
"6 Based on infloat/e5-small-unsupervised, does n... 0.130 \n",
Expand All @@ -304,19 +312,20 @@
"11 German embedding model supporting 8192 sequenc... 0.320 \n",
"12 Base English model 0.420 \n",
"13 Based on intfloat/e5-base-unsupervised model, ... 0.430 \n",
"14 English embedding model supporting 8192 sequen... 0.520 \n",
"15 8192 context length english model 0.520 \n",
"14 8192 context length english model 0.520 \n",
"15 English embedding model supporting 8192 sequen... 0.520 \n",
"16 8192 context length english model 0.520 \n",
"17 Based on nomic-ai/nomic-embed-text-v1-unsuperv... 0.540 \n",
"18 MixedBread Base sentence embedding model, does... 0.640 \n",
"19 Sentence-transformers model for tasks like clu... 1.000 \n",
"20 Based on intfloat/e5-large-unsupervised, large... 1.020 \n",
"21 Large general text embeddings model 1.200 \n",
"22 Large English model, v1.5 1.200 \n",
"23 Multilingual model, e5-large. Recommend using ... 2.240 "
"19 Source code embedding model supporting 8192 se... 0.640 \n",
"20 Sentence-transformers model for tasks like clu... 1.000 \n",
"21 Based on intfloat/e5-large-unsupervised, large... 1.020 \n",
"22 Large general text embeddings model 1.200 \n",
"23 Large English model, v1.5 1.200 \n",
"24 Multilingual model, e5-large. Recommend using ... 2.240 "
]
},
"execution_count": 3,
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
8 changes: 8 additions & 0 deletions fastembed/text/pooled_normalized_embedding.py
Expand Up @@ -44,6 +44,14 @@
"sources": {"hf": "jinaai/jina-embeddings-v2-base-de"},
"model_file": "onnx/model_fp16.onnx",
},
{
"model": "jinaai/jina-embeddings-v2-base-code",
"dim": 768,
"description": "Source code embedding model supporting 8192 sequence length",
"size_in_GB": 0.64,
"sources": {"hf": "jinaai/jina-embeddings-v2-base-code"},
"model_file": "onnx/model.onnx",
},
]
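
The entry added above follows the same shape as its neighbours, and the notebook table in this PR appears to be ordered by size_in_GB, which would explain why the new model shows up at row 19, right after the other 0.64 GB model. A minimal sketch of that lookup and ordering, assuming a trimmed-down registry: the list name `supported_models` and the helper `find_model` are hypothetical, and only a few entries and fields visible in this diff are reproduced.

```python
# Hypothetical, trimmed-down registry; field values are copied from this diff.
supported_models = [
    {"model": "jinaai/jina-embeddings-v2-base-de", "dim": 768, "size_in_GB": 0.32},
    {"model": "mixedbread-ai/mxbai-embed-large-v1", "dim": 1024, "size_in_GB": 0.64},
    {"model": "jinaai/jina-embeddings-v2-base-code", "dim": 768, "size_in_GB": 0.64},
    {"model": "BAAI/bge-small-en-v1.5", "dim": 384, "size_in_GB": 0.067},
]


def find_model(models, name):
    """Return the registry entry whose "model" field equals `name`."""
    for entry in models:
        if entry["model"] == name:
            return entry
    raise ValueError(f"Model {name!r} is not in the registry")


# sorted() is stable, so models with equal size keep their registry order.
by_size = sorted(supported_models, key=lambda entry: entry["size_in_GB"])

if __name__ == "__main__":
    code_model = find_model(supported_models, "jinaai/jina-embeddings-v2-base-code")
    print(code_model["dim"], code_model["size_in_GB"])
    for row, entry in enumerate(by_size):
        print(row, entry["model"], entry["size_in_GB"])
```

Because the sort is stable, the code model lands after mixedbread-ai/mxbai-embed-large-v1 even though both report 0.64 GB, which matches the row order in the notebook output.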


3 changes: 3 additions & 0 deletions tests/test_text_onnx_embeddings.py
Expand Up @@ -47,6 +47,9 @@
"jinaai/jina-embeddings-v2-base-de": np.array(
[-0.0085, 0.0417, 0.0342, 0.0309, -0.0149]
),
"jinaai/jina-embeddings-v2-base-code": np.array(
[0.0145, -0.0164, 0.0136, -0.0170, 0.0734]
),
"nomic-ai/nomic-embed-text-v1": np.array(
[0.3708 , 0.2031, -0.3406, -0.2114, -0.3230]
),
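
The five values added for jinaai/jina-embeddings-v2-base-code look like the leading components of a reference embedding, which a test can compare against the first few dimensions of a freshly computed vector. A rough sketch of such a check in plain Python, assuming an absolute tolerance of 1e-3: the helper `matches_canonical`, the tolerance, and the sample embedding are illustrative assumptions, not the project's actual test code.

```python
import math

# Reference values copied from the diff above (first five dimensions only).
CANONICAL_PREFIX = [0.0145, -0.0164, 0.0136, -0.0170, 0.0734]


def matches_canonical(embedding, expected, atol=1e-3):
    """Check that the leading components of `embedding` are within `atol` of `expected`."""
    head = list(embedding[: len(expected)])
    if len(head) != len(expected):
        return False
    return all(
        math.isclose(actual, wanted, rel_tol=0.0, abs_tol=atol)
        for actual, wanted in zip(head, expected)
    )


if __name__ == "__main__":
    # Hypothetical 768-dimensional output: a slightly noisy prefix, zeros elsewhere.
    sample = [0.01452, -0.01638, 0.01361, -0.01702, 0.07339] + [0.0] * 763
    print(matches_canonical(sample, CANONICAL_PREFIX))
```

Comparing only a short prefix keeps the fixture small while still catching a wrong model file or a changed pooling step, since either would almost certainly shift the leading components.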