[Bug]: [benchmark][cluster] Search a released partition, expecting error partition or collection not loaded, but instead getting error channel not found #40077

Open
1 task done
wangting0128 opened this issue Feb 21, 2025 · 2 comments
Assignees
Labels
kind/bug Issues or changes related a bug test/benchmark benchmark test triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@wangting0128
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:2.5-20250220-b7c631f0-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):2.5.0rc124
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: fouramf-4nl7c

server:

NAME                                                          READY   STATUS      RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
verify-neo-7d-3-etcd-0                                        1/1     Running     0                22h     10.104.34.224   4am-node37   <none>           <none>
verify-neo-7d-3-etcd-1                                        1/1     Running     0                22h     10.104.33.9     4am-node36   <none>           <none>
verify-neo-7d-3-etcd-2                                        1/1     Running     0                22h     10.104.26.250   4am-node32   <none>           <none>
verify-neo-7d-3-milvus-datanode-587d4745dc-5csp9              1/1     Running     3 (22h ago)      22h     10.104.20.167   4am-node22   <none>           <none>
verify-neo-7d-3-milvus-indexnode-77fc78fb44-79mhl             1/1     Running     3 (22h ago)      22h     10.104.23.203   4am-node27   <none>           <none>
verify-neo-7d-3-milvus-mixcoord-69644b4bf-jlnkq               1/1     Running     3 (22h ago)      22h     10.104.16.37    4am-node21   <none>           <none>
verify-neo-7d-3-milvus-proxy-864f5997c7-qqcxz                 1/1     Running     3 (22h ago)      22h     10.104.30.247   4am-node38   <none>           <none>
verify-neo-7d-3-milvus-querynode-5d5996889d-djx62             1/1     Running     3 (22h ago)      22h     10.104.21.12    4am-node24   <none>           <none>
verify-neo-7d-3-minio-0                                       1/1     Running     0                22h     10.104.33.12    4am-node36   <none>           <none>
verify-neo-7d-3-minio-1                                       1/1     Running     0                22h     10.104.30.2     4am-node38   <none>           <none>
verify-neo-7d-3-minio-2                                       1/1     Running     0                22h     10.104.34.226   4am-node37   <none>           <none>
verify-neo-7d-3-minio-3                                       1/1     Running     0                22h     10.104.26.253   4am-node32   <none>           <none>
verify-neo-7d-3-pulsarv3-bookie-0                             1/1     Running     0                22h     10.104.27.254   4am-node31   <none>           <none>
verify-neo-7d-3-pulsarv3-bookie-1                             1/1     Running     0                22h     10.104.16.41    4am-node21   <none>           <none>
verify-neo-7d-3-pulsarv3-bookie-2                             1/1     Running     0                22h     10.104.33.14    4am-node36   <none>           <none>
verify-neo-7d-3-pulsarv3-bookie-init-jzw26                    0/1     Completed   0                22h     10.104.30.246   4am-node38   <none>           <none>
verify-neo-7d-3-pulsarv3-broker-0                             1/1     Running     0                22h     10.104.27.248   4am-node31   <none>           <none>
verify-neo-7d-3-pulsarv3-broker-1                             1/1     Running     0                22h     10.104.9.47     4am-node14   <none>           <none>
verify-neo-7d-3-pulsarv3-proxy-0                              1/1     Running     0                22h     10.104.16.38    4am-node21   <none>           <none>
verify-neo-7d-3-pulsarv3-proxy-1                              1/1     Running     0                22h     10.104.30.252   4am-node38   <none>           <none>
verify-neo-7d-3-pulsarv3-pulsar-init-xfv7p                    0/1     Completed   0                22h     10.104.34.221   4am-node37   <none>           <none>
verify-neo-7d-3-pulsarv3-recovery-0                           1/1     Running     0                22h     10.104.30.248   4am-node38   <none>           <none>
verify-neo-7d-3-pulsarv3-zookeeper-0                          1/1     Running     0                22h     10.104.33.11    4am-node36   <none>           <none>
verify-neo-7d-3-pulsarv3-zookeeper-1                          1/1     Running     0                22h     10.104.16.44    4am-node21   <none>           <none>
verify-neo-7d-3-pulsarv3-zookeeper-2                          1/1     Running     0                22h     10.104.26.2     4am-node32   <none>           <none>

search_partition_scene_test_partition_88whTTM6.log


client logs:

, 'scene_test_partition_88whTTM6', ''], kwargs: {}, [requestId: 553bb28a-efba-11ef-be72-928475caf2ca] (api_request.py:77)
[2025-02-20 18:41:44,481 - DEBUG - fouram]: (api_response) : [Partition] {"name":"scene_test_partition_88whTTM6","collection_name":"fouram_SDgSxSjd","description":""}, [requestId: 553bb28a-efba-11ef-be72-928475caf2ca] (api_request.py:44)
[2025-02-20 18:41:44,481 - DEBUG - fouram]: [Base] Create partition scene_test_partition_88whTTM6 of collection(fouram_SDgSxSjd) (base.py:821)
[2025-02-20 18:41:44,660 - DEBUG - fouram]: (api_request)  : [Collection.insert] args: <Collection.insert fields: 15, length: 3000, content: [ [ `type<class 'int'>, dtype<>` 0 ... ], [ `type<class 'list'>, dtype<>` [0.11158542115767522, 0. ... ], [ `type<class 'list'>, dtype<>` [0.6957680272428064, 0.8 ... ], [ `type<class 'int'>, dtype<>` 0 ... ], [ `type<class 'int'>, dtype<>` 0 ... ], [ `type<class 'int'>, dtype<>` 0 ... ], [ `type<class 'int'>, dtype<>` 0 ... ], [ `type<class 'str'>, dtype<>` 0 ... ], [ `type<class 'bool'>, dtype<>` False ... ], [ `type<class 'list'>, dtype<>` [0, 0, 0, 0, 0, 0, 0, 0, ... ], [ `type<class 'list'>, dtype<>` [0, 0, 0, 0, 0, 0, 0, 0, ... ], [ `type<class 'list'>, dtype<>` [0, 0, 0, 0, 0, 0, 0, 0, ... ], [ `type<class 'list'>, dtype<>` [0, 0, 0, 0, 0, 0, 0, 0, ... ], [ `type<class 'list'>, dtype<>` ['0', '0', '0', '0', '0' ... ], [ `type<class 'list'>, dtype<>` [False, False, False, Fa ... ] ]>, ['scene_test_partition_88whTTM6'], kwargs: {'timeout': 600}, [requestId: 5558995e-efba-11ef-be72-928475caf2ca] (api_request.py:77)
[2025-02-20 18:41:44,957 - DEBUG - fouram]: [Base] Start flush partition scene_test_partition_88whTTM6, kwargs: {} (base.py:836)
[2025-02-20 18:42:15,171 - DEBUG - fouram]: [Base] Partition scene_test_partition_88whTTM6 num entities: (3000) (base.py:832)
[2025-02-20 18:52:23,311 - DEBUG - fouram]: [Base] Start load partition scene_test_partition_88whTTM6, replica_number:1, kwargs:{} (base.py:842)
[2025-02-20 18:52:25,423 - DEBUG - fouram]: [Base] Params of partition:scene_test_partition_88whTTM6 search: nq:1, anns_field:float_vector, param:{'nprobe': 64}, limit:1, expr:"None", kwargs:{'check_task': 'check_response', 'output_fields': ['*'], 'guarantee_timestamp': None} (base.py:860)
[2025-02-20 18:52:25,852 - DEBUG - fouram]: [Base] Start release partition scene_test_partition_88whTTM6 (base.py:848)
[2025-02-20 18:52:29,105 - DEBUG - fouram]: [Base] Params of partition:scene_test_partition_88whTTM6 search: nq:1, anns_field:float_vector, param:{'nprobe': 64}, limit:1, expr:"None", kwargs:{'check_task': 'check_error_response', 'check_items': {'code': 65535, 'message': 'not loaded'}, 'output_fields': ['*'], 'guarantee_timestamp': None} (base.py:860)
[2025-02-20 18:52:29,708 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=500, message=fail to search on QueryNode 1: channel not found[channel=by-dev-rootcoord-dml_2_456140379615920502v2])>, <Time:{'RPC start': '2025-02-20 18:52:29.105593', 'RPC error': '2025-02-20 18:52:29.708741'}> (decorators.py:140)
pymilvus.exceptions.MilvusException: <MilvusException: (code=500, message=fail to search on QueryNode 1: channel not found[channel=by-dev-rootcoord-dml_2_456140379615920502v2])>
[2025-02-20 18:52:29,709 - ERROR - fouram]: (api_response) : [Partition.search] <MilvusException: (code=500, message=fail to search on QueryNode 1: channel not found[channel=by-dev-rootcoord-dml_2_456140379615920502v2])>, [requestId: d5772aaa-efbb-11ef-be72-928475caf2ca] (api_request.py:57)
ValueError: [CheckFunc] Check `search` response error failed: (65535 == 500 or not loaded in fail to search on QueryNode 1: channel not found[channel=by-dev-rootcoord-dml_2_456140379615920502v2])
[2025-02-20 18:52:29,709 - ERROR - fouram]: [func_time_catch] : [CheckFunc] Check `search` response error failed: (65535 == 500 or not loaded in fail to search on QueryNode 1: channel not found[channel=by-dev-rootcoord-dml_2_456140379615920502v2]) (api_request.py:127)
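
The failed check in the last log lines can be restated as a small comparison. This is only a sketch: `check_error_response` here is a hypothetical helper written for illustration, not the fouram implementation, and it simply reports the two conditions named in the log message (error code and message fragment) separately.

```python
def check_error_response(actual_code, actual_message, expected_code, expected_substring):
    """Compare an error response against the expected code and message fragment.

    Hypothetical helper for illustration; the real fouram check may combine
    the two conditions differently.
    """
    return {
        "code_matches": actual_code == expected_code,
        "message_matches": expected_substring in actual_message,
    }


# Values copied from the client log above.
observed = check_error_response(
    actual_code=500,
    actual_message=(
        "fail to search on QueryNode 1: channel not found"
        "[channel=by-dev-rootcoord-dml_2_456140379615920502v2]"
    ),
    expected_code=65535,
    expected_substring="not loaded",
)
print(observed)  # {'code_matches': False, 'message_matches': False}
```

Neither condition holds for the returned error, so the check fails whichever way the two conditions are combined.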

Expected Behavior

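Searching a released partition should fail with a "partition or collection not loaded" error (the test checks for code 65535 and a message containing 'not loaded'), not with "channel not found".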

Steps To Reproduce

1. create a collection with fields: 'id', 'float_vector' (128 dim), 'float_vector_1' (128 dim), 'int8_1', 'int16_1', 'int32_1', 'int64_1', 'varchar_1', 'bool_1', 'array_int8_1', 'array_int16_1', 'array_int32_1', 'array_int64_1', 'array_varchar_1', 'array_bool_1'
2. build index
   - IVF_SQ8: float_vector
   - HNSW: float_vector_1
3. insert 5m data into 10 partitions
4. flush
5. rebuild index
6. load collection
7. concurrent requests:
   - scene_insert_partition
     (partition: create->insert->flush->release->drop)
   - scene_test_partition
     (partition: create->insert->flush->index again->load->search->release->search failed->drop)  <- search raises an unexpected error
   - scene_test_partition_hybrid_search
     (partition: create->insert->flush->index again->load->hybrid_search->release->hybrid_search failed->drop)
   - release_partitions
   - upsert
   - scene_test
     (collection: create->insert->flush->index->drop)
   - scene_search_test
     (collection: create->insert->flush->index->load->search->drop)
   - scene_hybrid_search_test
     (collection: create->insert->flush->index->load->hybrid_search->drop)
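
The load->search->release->search part of `scene_test_partition` can be sketched as below. This is a minimal illustration, not the benchmark code: `search_after_release` is a hypothetical helper, and it is written against any object exposing `load`, `search` and `release` with keyword arguments like those of a pymilvus `Partition` (an assumption about how the benchmark calls the SDK). The search values mirror the client log (nq 1, `float_vector`, nprobe 64, limit 1).

```python
def search_after_release(partition, query_vectors, anns_field="float_vector"):
    """Run load -> search -> release -> search and report the second search's outcome.

    `partition` is assumed to behave like a pymilvus Partition; any object
    with compatible load/search/release methods works.
    """
    search_kwargs = dict(
        data=query_vectors,
        anns_field=anns_field,
        param={"metric_type": "L2", "params": {"nprobe": 64}},
        limit=1,
        output_fields=["*"],
    )
    partition.load(replica_number=1)
    partition.search(**search_kwargs)  # expected to succeed while loaded
    partition.release()
    try:
        partition.search(**search_kwargs)  # expected to fail after release
    except Exception as exc:
        return {"raised": True, "message": str(exc)}
    return {"raised": False, "message": ""}
```

Against the cluster in this report, the returned message would contain "channel not found" rather than "not loaded". Because the issue appears under concurrent load, a single sequential run like this may not reproduce it.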

Milvus Log

No response

Anything else?

server config:

{
     "queryNode": {
          "resources": {
               "limits": {
                    "cpu": "16.0",
                    "memory": "32Gi"
               },
               "requests": {
                    "cpu": "9.0",
                    "memory": "20Gi"
               }
          }
     },
     "indexNode": {
          "resources": {
               "limits": {
                    "cpu": "8.0",
                    "memory": "16Gi"
               },
               "requests": {
                    "cpu": "5.0",
                    "memory": "9Gi"
               }
          }
     },
     "dataNode": {
          "resources": {
               "limits": {
                    "cpu": "8.0",
                    "memory": "16Gi"
               },
               "requests": {
                    "cpu": "5.0",
                    "memory": "9Gi"
               }
          }
     },
     "cluster": {
          "enabled": true
     },
     "pulsarv3": {},
     "kafka": {},
     "minio": {
          "metrics": {
               "podMonitor": {
                    "enabled": true
               }
          }
     },
     "etcd": {
          "metrics": {
               "enabled": true,
               "podMonitor": {
                    "enabled": true
               }
          },
          "image": {
               "registry": "harbor.milvus.io",
               "repository": "milvus/etcd",
               "tag": "v3.5.18-r1"
          }
     },
     "metrics": {
          "serviceMonitor": {
               "enabled": true
          }
     },
     "log": {
          "level": "debug"
     },
     "image": {
          "all": {
               "repository": "harbor.milvus.io/milvus/milvus",
               "tag": "2.5-20250220-b7c631f0-amd64"
          }
     }
}

client config:

{
     "dataset_params": {
          "metric_type": "L2",
          "dim": 128,
          "max_length": 100,
          "vectors_index": {
               "float_vector_1": {
                    "index_type": "HNSW",
                    "index_param": {
                         "M": 8,
                         "efConstruction": 200
                    },
                    "metric_type": "L2"
               }
          },
          "extra_partitions": {
               "partitions": [
                    "_default",
                    "partition_1",
                    "partition_2",
                    "partition_3",
                    "partition_4",
                    "partition_5",
                    "partition_6",
                    "partition_7",
                    "partition_8",
                    "partition_9"
               ],
               "data_repeated": false
          },
          "dataset_name": "sift",
          "dataset_size": "5m",
          "ni_per": 5000
     },
     "collection_params": {
          "other_fields": [
               "float_vector_1",
               "int8_1",
               "int16_1",
               "int32_1",
               "int64_1",
               "varchar_1",
               "bool_1",
               "array_int8_1",
               "array_int16_1",
               "array_int32_1",
               "array_int64_1",
               "array_varchar_1",
               "array_bool_1"
          ],
          "shards_num": 16
     },
     "index_params": {
          "index_type": "IVF_SQ8",
          "index_param": {
               "nlist": 1024
          }
     },
     "concurrent_params": {
          "concurrent_number": 20,
          "during_time": "7d",
          "interval": 20
     },
     "concurrent_tasks": [
          {
               "type": "scene_insert_partition",
               "weight": 1,
               "params": {
                    "data_size": 3000,
                    "ni": 1000,
                    "with_flush": true,
                    "timeout": 600
               }
          },
          {
               "type": "scene_test_partition",
               "weight": 1,
               "params": {
                    "data_size": 3000,
                    "ni": 3000,
                    "nq": 1,
                    "search_param": {
                         "nprobe": 64
                    },
                    "limit": 1,
                    "output_fields": [
                         "*"
                    ],
                    "timeout": 600,
                    "search_counts": 1
               }
          },
          {
               "type": "scene_test_partition_hybrid_search",
               "weight": 1,
               "params": {
                    "nq": 1,
                    "top_k": 1,
                    "reqs": [
                         {
                              "search_param": {
                                   "nprobe": 128
                              },
                              "anns_field": "float_vector",
                              "top_k": 100
                         },
                         {
                              "search_param": {
                                   "ef": 64
                              },
                              "anns_field": "float_vector_1",
                              "top_k": 10
                         }
                    ],
                    "rerank": {
                         "RRFRanker": []
                    },
                    "output_fields": [
                         "*"
                    ],
                    "timeout": 600,
                    "random_data": true,
                    "hybrid_search_counts": 1,
                    "data_size": 3000,
                    "ni": 3000
               }
          },
          {
               "type": "release_partitions",
               "weight": 1,
               "params": {
                    "partitions": [
                         "_default",
                         "partition_1",
                         "partition_2",
                         "partition_3",
                         "partition_4",
                         "partition_5",
                         "partition_6",
                         "partition_7",
                         "partition_8",
                         "partition_9"
                    ],
                    "timeout": 180,
                    "check_task": "check_response",
                    "check_items": null
               }
          },
          {
               "type": "upsert",
               "weight": 1,
               "params": {
                    "nb": 1,
                    "timeout": 30,
                    "random_id": true,
                    "random_vector": true,
                    "check_task": "check_response",
                    "check_items": null
               }
          },
          {
               "type": "scene_test",
               "weight": 1,
               "params": {
                    "dim": 128,
                    "data_size": 3000,
                    "nb": 3000,
                    "index_type": "IVF_SQ8",
                    "index_param": {
                         "nlist": 2048
                    },
                    "metric_type": "L2"
               }
          },
          {
               "type": "scene_search_test",
               "weight": 1,
               "params": {
                    "dataset": "local",
                    "dim": 128,
                    "shards_num": 2,
                    "data_size": 3000,
                    "nb": 3000,
                    "index_type": "IVF_SQ8",
                    "index_param": {
                         "nlist": 2048
                    },
                    "metric_type": "L2",
                    "other_fields": [
                         "array_int64_1",
                         "array_bool_1",
                         "array_varchar_1"
                    ],
                    "replica_number": 1,
                    "nq": 1,
                    "top_k": 10,
                    "search_param": {
                         "nprobe": 16
                    },
                    "search_counts": 10,
                    "scalars_index": {
                         "array_int64_1": {
                              "index_type": "BITMAP"
                         },
                         "array_bool_1": {
                              "index_type": "BITMAP"
                         },
                         "array_varchar_1": {
                              "index_type": "BITMAP"
                         }
                    }
               }
          },
          {
               "type": "scene_hybrid_search_test",
               "weight": 1,
               "params": {
                    "nq": 1,
                    "top_k": 1,
                    "reqs": [
                         {
                              "search_param": {
                                   "nprobe": 128
                              },
                              "anns_field": "float_vector",
                              "expr": "bool_1 == True",
                              "top_k": 100
                         },
                         {
                              "search_param": {
                                   "nprobe": 32
                              },
                              "anns_field": "binary_vector_scene_hybrid_search_test_1",
                              "expr": "bool_1 != True",
                              "top_k": 10
                         },
                         {
                              "search_param": {
                                   "search_list": 30
                              },
                              "anns_field": "float16_vector_scene_hybrid_search_test_2",
                              "expr": "int64_1 >= 1500",
                              "top_k": 5
                         },
                         {
                              "search_param": {
                                   "drop_ratio_search": 0.1
                              },
                              "anns_field": "sparse_float_vector_scene_hybrid_search_test_3",
                              "expr": "varchar_1 like \"1%\"",
                              "top_k": 10
                         }
                    ],
                    "rerank": {
                         "RRFRanker": []
                    },
                    "output_fields": [
                         "*"
                    ],
                    "timeout": 600,
                    "random_data": true,
                    "dataset": "local",
                    "dim": 128,
                    "shards_num": 2,
                    "data_size": 3000,
                    "nb": 3000,
                    "index_type": "IVF_SQ8",
                    "index_param": {
                         "nlist": 2048
                    },
                    "metric_type": "L2",
                    "other_fields": [
                         "binary_vector_scene_hybrid_search_test_1",
                         "float16_vector_scene_hybrid_search_test_2",
                         "sparse_float_vector_scene_hybrid_search_test_3",
                         "int64_1",
                         "bool_1",
                         "varchar_1"
                    ],
                    "replica_number": 1,
                    "scalars_params": {
                         "binary_vector_scene_hybrid_search_test_1": {
                              "params": {
                                   "dim": 512
                              },
                              "other_params": {
                                   "dataset": "binary"
                              }
                         },
                         "float16_vector_scene_hybrid_search_test_2": {
                              "params": {
                                   "dim": 64
                              }
                         }
                    },
                    "scalars_index": {
                         "int64_1": {},
                         "bool_1": {
                              "index_type": "BITMAP"
                         },
                         "varchar_1": {
                              "index_type": "BITMAP"
                         }
                    },
                    "vectors_index": {
                         "binary_vector_scene_hybrid_search_test_1": {
                              "index_type": "BIN_IVF_FLAT",
                              "index_param": {
                                   "nlist": 2048
                              },
                              "metric_type": "JACCARD"
                         },
                         "float16_vector_scene_hybrid_search_test_2": {
                              "index_type": "DISKANN",
                              "index_param": {},
                              "metric_type": "IP"
                         },
                         "sparse_float_vector_scene_hybrid_search_test_3": {
                              "index_type": "SPARSE_WAND",
                              "index_param": {
                                   "drop_ratio_build": 0.2
                              },
                              "metric_type": "IP"
                         }
                    },
                    "hybrid_search_counts": 10
               }
          }
     ]
}
@wangting0128 wangting0128 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. test/benchmark benchmark test labels Feb 21, 2025
@wangting0128 wangting0128 added this to the 2.5.5 milestone Feb 21, 2025

dosubot bot commented Feb 21, 2025

I found two similar closed issues related to the 'channel not found' error in Milvus:

  1. No solution found in the comments [1].
  2. No solution found in the comments [2].

The 'channel not found' error can occur due to a bug in loading primary field operations during the upgrade process, which has been fixed in Milvus PR #39922. Additionally, this error might be caused by an improper shutdown sequence during the upgrade process, where Milvus and etcd are stopped simultaneously. To resolve this, ensure that you first stop the Milvus standalone instance and then stop the etcd cluster. This sequence allows Milvus to properly persist its state to etcd before shutdown, preventing the collection from entering a recovery state after the upgrade. Upgrading to the latest Milvus image that includes the fix is also recommended [3].


@weiliu1031
Contributor

After the collection has been released, the shard location cache is not cleaned up. A search on the released collection is then still sent to the query node, where it fails with "channel not found".
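
The effect described in this comment can be illustrated with a toy model. This is not Milvus code (the real proxy and query node are written in Go); every class, method and channel name below is hypothetical. It only shows why a stale routing cache changes which error the caller sees.

```python
class QueryNode:
    """Toy query node that only knows which channels it currently serves."""

    def __init__(self):
        self.channels = set()

    def search(self, channel):
        if channel not in self.channels:
            raise RuntimeError(f"channel not found[channel={channel}]")
        return "ok"


class Proxy:
    """Toy proxy with a shard location cache (collection name -> channel)."""

    def __init__(self, node, invalidate_on_release):
        self.node = node
        self.invalidate_on_release = invalidate_on_release
        self.shard_cache = {}

    def load(self, collection, channel):
        self.node.channels.add(channel)
        self.shard_cache[collection] = channel

    def release(self, collection):
        # The query node drops the channel in both variants.
        self.node.channels.discard(self.shard_cache.get(collection))
        if self.invalidate_on_release:
            self.shard_cache.pop(collection, None)

    def search(self, collection):
        channel = self.shard_cache.get(collection)
        if channel is None:
            raise RuntimeError("collection not loaded")
        # A stale cache entry still routes the request to the query node.
        return self.node.search(channel)


def error_after_release(invalidate_on_release):
    """Return the error message seen when searching after a release."""
    proxy = Proxy(QueryNode(), invalidate_on_release)
    proxy.load("demo_collection", "demo_channel_v2")
    proxy.search("demo_collection")
    proxy.release("demo_collection")
    try:
        proxy.search("demo_collection")
    except RuntimeError as exc:
        return str(exc)
    return None


print(error_after_release(invalidate_on_release=False))
print(error_after_release(invalidate_on_release=True))
```

With the stale cache the toy proxy forwards the request and surfaces the query node's "channel not found" error; with invalidation it rejects the request itself with "collection not loaded", which is the kind of error the test expects.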

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 21, 2025
@yanliang567 yanliang567 removed their assignment Feb 21, 2025