Example raft-kv-memstore hangs after printing change-membership #550
Did some debugging. As soon as node-2 gets added (via …), … If I change the config in this example (see below) to the same as the rocksdb example, things work fine. I still see …
Why does it go wrong with the default config of this example?
The logs will be helpful. Please attach them and let me look at what's happening. Thanks for letting me know about this issue :D
Attached logs.
Well... I did not see anything wrong in the logs, except a lot of timeouts:
It looks like just a simple timeout issue. Were you running this test on a slow computer? 🤔 As far as I know, extending the heartbeat interval will solve this issue, as you mentioned. :)
Heartbeat interval affects the timeout: an append-entries RPC should not take longer than the interval between two heartbeats.
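For reference, a minimal sketch of the kind of config change being discussed, assuming the openraft `Config` struct with the fields `heartbeat_interval`, `election_timeout_min`, and `election_timeout_max`; the values below are illustrative, not the exact rocksdb-example settings:

```rust
// Illustrative only: the values are assumptions, not the exact rocksdb-example
// settings. The point is to keep the heartbeat interval well below the election
// timeouts, since an append-entries RPC is expected to finish within one
// heartbeat interval.
use openraft::Config;

fn relaxed_config() -> Config {
    Config {
        heartbeat_interval: 500,    // ms between leader heartbeats
        election_timeout_min: 1500, // ms; must be larger than heartbeat_interval
        election_timeout_max: 3000, // ms
        ..Default::default()
    }
}
```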
No. I'm running on an i7-12700H with 14 cores.
I did not yet find what was going on from the log. I updated the main branch to let the examples output a more detailed log. Could you rerun the test with the latest main branch (ac48309)? Then let's see what's going on. Also, what are your OS and rust-toolchain? And could you attach the …
Added all the information in the attached tar.
Do you have a proxy installed on your machine? 🤔
The first log is printed by openraft at … The second log is printed by … You can see that every log with … I computed the time delta between consecutive matched log lines with:
cat n1.log | grep 'starting new connection' -B1 | grep -v '^\-\-' | awk '{v=$1; gsub("2022-09-13T..:..:", "", v); gsub("Z", "", v); print (v-a)*1000 "ms "; a=v}'
Output: …
No.
It looks like it is not an issue with openraft but an issue at the network layer. I'm not an expert on … Can you do single-step debugging of it on your machine? Running this test does almost the same as … Another concern is whether a normal curl will be delayed by 35ms on your machine.
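If it helps, here is a minimal, self-contained sketch for timing a plain local HTTP request in Rust, analogous to checking whether a normal curl is delayed; the address 127.0.0.1:21001 and the /metrics path are assumptions about how the example node is run and may need adjusting:

```rust
// Sketch only: times one raw HTTP GET against a locally running node.
// 127.0.0.1:21001 and the /metrics path are assumptions; point this at whatever
// address and endpoint your node actually serves.
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let start = Instant::now();
    let mut stream = TcpStream::connect("127.0.0.1:21001")?;
    stream.write_all(
        b"GET /metrics HTTP/1.1\r\nHost: 127.0.0.1\r\nConnection: close\r\n\r\n",
    )?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    println!("round trip took {:?}", start.elapsed());
    Ok(())
}
```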
For now, I ran with … Based on the results I see that …
When adding a learner, the leader blocks until the new learner catches up with the logs, so it takes longer; this is expected behavior.
Could not find much with additional debugging. |
Can I do something else to help? Is this problem specific to my environment?
The problem happens after calling change-membership. As I mentioned in a previous comment, I still can not reproduce the 35ms delay on my laptop or in CI. Thank you in advance!
I am experiencing the same issue: the test just hangs after showing the membership-change message. There is 0ms of delay, as I have no proxies or anything configured on this computer. As with @vishwaspai, if I increase the heartbeat interval the problem goes away. Attached are the logs of the three nodes plus my Cargo.lock, in case they can be of any help.
There is definitely something wrong with the environment. I ran it in another environment with zero issues; these are my details. Legend: …
I will try on another computer, also Artix Linux with kernel 6.0.x and an AMD 3950X CPU, in a few minutes.
Third test environment:
No issues. It is probably something with Ubuntu; I will try to run on this last environment with a live distro or a VM if I find the time.
Thank you so much @DamnWidget!
It happened recently in a CI session:
Update: |
In all of my tests it happened if the heartbeat was configured to any value lower than 101 milliseconds. It also only happened on Pop!_OS when I tried; unfortunately I didn't find the time to try other Ubuntu- or Debian-based distributions. It definitely wasn't a problem on Artix.
This problem is caused by an issue with cargo that affects the openssl crate.
Finally, every time a … I created a mini demo showing the behavior. On Linux: …
On my M1 Mac: …
Related issues: …
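For anyone who wants to verify this in their own environment, here is a minimal sketch (plain std; the variable names come from the commit message below) that prints whether `cargo run` leaked `SSL_CERT_FILE`/`SSL_CERT_DIR` into the child process, and clears them:

```rust
// Sketch: run this (or equivalent code early in the example's main) once via
// `cargo run` and once by invoking the built binary directly; if the variables
// only show up under `cargo run`, cargo is the one leaking them.
use std::env;

fn main() {
    for key in ["SSL_CERT_FILE", "SSL_CERT_DIR"] {
        match env::var(key) {
            Ok(val) => println!("{key} is set: {val}"),
            Err(_) => println!("{key} is not set"),
        }
    }
    // Clearing them before any threads are spawned or any TLS client is built
    // avoids the repeated certificate loading described above.
    env::remove_var("SSL_CERT_FILE");
    env::remove_var("SSL_CERT_DIR");
}
```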
On Linux: the command `cargo run` pollutes environment variables: it leaks `SSL_CERT_FILE` and `SSL_CERT_DIR` to the test subprocess it runs, which causes `reqwest` to spend ~50 ms loading the certificates for every RPC. We just extend the RPC timeout to work around it. - Fix: databendlabs#550
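A sketch of the client-side variant of that workaround, assuming the example's network layer builds its `reqwest` client in one place; the 500 ms value is an assumption, not the exact number used in the fix:

```rust
// Sketch only: give each RPC a generous timeout so a ~50 ms certificate reload
// per request no longer exceeds a tight default derived from the heartbeat.
use std::time::Duration;

fn build_rpc_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .timeout(Duration::from_millis(500))
        .build()
}
```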
Nice catch. It probably doesn't happen on Artix because it uses a more modern version of OpenSSL. I will try to check as soon as I can get to the computer.
Took the latest checkout of main (commit hash: 347aca1) and ran
raft-kv-memstore$ ./test-cluster.sh
But it hangs after reaching this place…
I see that 3 processes are running; the first process is using 100% CPU and the other 2 are almost idle. I could not access the Discord channel (for some reason it does not open), hence reporting it here. If required, I can share any specific logs.
P.S.: The raft-kv-rocksdb example works fine.