bug: deterministic recovery test failure in main cron #7561
Comments
For issue 1, the scaling test also hit the same problem on 2022-02-08 and 2022-02-09. https://buildkite.com/risingwavelabs/main-cron/builds/337#018633cd-2be8-4f49-a0bc-e195340a9625 |
According to @huangjw806, it may be caused by "spot instance being actively recycled", which subsequently causes Buildkite to lose the agent. Can we confirm it somehow? |
What is meant by "spot instance being actively recycled"? Thanks @zwang28 for clarifying: it is https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html |
To save money, we now use spot instances to run CI. If AWS runs short of capacity, it reclaims the spot instance, which causes the agent to be lost. We've added retries, but only up to 2 retries. |
Yes, +1 for @zwang28's suggestion to confirm whether a spot instance is being recycled. I think something like this will help: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html |
In the performance test and the longevity test, we use on-demand instances, and no agent has been lost so far, so the spot-instance explanation seems credible. |
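To make the termination-notice idea concrete, here is a minimal sketch of how a job on the CI machine could check for a pending spot interruption. It assumes the instance metadata service is reachable with IMDSv2 tokens and reads the `spot/instance-action` path described in the AWS page linked above. The function names are hypothetical and this is not something our CI currently runs.

```python
import json
import urllib.error
import urllib.request

# Link-local address of the EC2 instance metadata service (IMDS).
IMDS = "http://169.254.169.254/latest"


def parse_instance_action(status, body):
    """Return the pending interruption as a dict, or None if there is none.

    IMDS answers 404 on spot/instance-action until AWS schedules an
    interruption; after that it returns a small JSON document.
    """
    if status == 404:
        return None
    if status != 200:
        raise RuntimeError(f"unexpected IMDS status {status}")
    notice = json.loads(body)
    return {"action": notice["action"], "time": notice["time"]}


def fetch_instance_action(timeout=2.0):
    """Query IMDS once (hypothetical helper; only works on an EC2 instance)."""
    # IMDSv2 requires a short-lived session token obtained via PUT.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.status, resp.read().decode())
    except urllib.error.HTTPError as err:
        # A 404 here is the normal "no interruption scheduled" case.
        return parse_instance_action(err.code, "")
```

Calling `fetch_instance_action()` every few seconds from a background step and printing any non-`None` result to the job log would let us correlate an `agent.lost` event with a spot interruption, assuming the log line is uploaded before the instance goes away.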
More tests failed today. https://buildkite.com/risingwavelabs/main-cron/builds/344#0186485b-9e8d-417b-bdba-f779cdaba886 |
What's the spec/model of the EC2 instances for running tests? I am also suspecting OOM. |
https://buildkite.com/risingwavelabs/main-cron/builds/360 Two runs are on the same machine, so the EC2 is alive, but the agent is lost. So
sounds reasonable? |
Does anyone have insights into the information above? 🤔 You can go to CloudWatch logs/metrics to get more information. |
That matches my guess:
In this case, memory is running out, so the system starts to frequently swap pages of the code segment in and out. That's why I suggested using a larger machine. Is it possible to get the memory usage? |
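On getting the memory usage: if CloudWatch does not already collect it, one option is to sample `/proc/meminfo` on the test machine while the job runs. The following is a rough sketch assuming a Linux host; the function names and the idea of logging the summary from a sidecar step are assumptions, not something already in our pipeline.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: value}, values in KiB where a unit is given."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def memory_summary(text):
    """Summarise memory pressure from /proc/meminfo content."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    # Swap fields may be absent or zero on machines without swap.
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "used_pct": round(100.0 * (total - available) / total, 1),
        "available_kib": available,
        "swap_used_kib": swap_used,
    }


def read_memory_summary(path="/proc/meminfo"):
    """Read the live values (Linux only)."""
    with open(path) as f:
        return memory_summary(f.read())
```

Printing `read_memory_summary()` every few seconds during the test would show whether `used_pct` climbs toward 100 right before the agent stops responding, which is what the OOM hypothesis predicts.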
Wait, the failure above is scaling test, not recovery test... And it seems scaling tests run in serial. |
Should be resolved by #9168. Will reopen if there are any other issues. |
Describe the bug
Two issues found in https://buildkite.com/risingwavelabs/main-cron/builds/310#0185bd0f-6cd7-454d-b511-bc5cf7f3c074.
agent.lost: An agent has been marked as lost. This happens when Buildkite stops receiving pings from the agent; see https://buildkite.com/docs/apis/webhooks/agent-events for more details. @zwang28 guessed that the pings were lost because the recovery test was draining the CPU.
To Reproduce
No response
Expected behavior
No response
Additional context
No response