
bug: deterministic recovery test failure in main cron #7561

Status: Closed
yezizp2012 opened this issue Jan 28, 2023 · 20 comments
Labels: type/bug (Something isn't working)

@yezizp2012 (Member) commented Jan 28, 2023
Describe the bug

Two issues found in https://buildkite.com/risingwavelabs/main-cron/builds/310#0185bd0f-6cd7-454d-b511-bc5cf7f3c074.

  1. The deterministic recovery test seems flaky. The test exited with status -1 (agent lost) in Buildkite:
     "agent.lost: An agent has been marked as lost. This happens when Buildkite stops receiving pings from the agent, see https://buildkite.com/docs/apis/webhooks/agent-events for more details."
     @zwang28 guessed that the ping was lost because the recovery test was draining CPU.
  2. cannot allocate memory:
fatal error: runtime: cannot allocate memory
runtime stack:
runtime.throw({0xd0c93c, 0x2030000})
	/usr/local/go/src/runtime/panic.go:1198 +0x71
runtime.persistentalloc1(0x4000, 0xc000700800, 0x14e61a8)
	/usr/local/go/src/runtime/malloc.go:1417 +0x24f
runtime.persistentalloc.func1()
	/usr/local/go/src/runtime/malloc.go:1371 +0x2e
runtime.persistentalloc(0x14cafc8, 0xc000180000, 0x40)
	/usr/local/go/src/runtime/malloc.go:1370 +0x6f
runtime.(*fixalloc).alloc(0x14e1808)
	/usr/local/go/src/runtime/mfixalloc.go:80 +0x85
runtime.(*mheap).allocMSpanLocked(0x14cafc0)
	/usr/local/go/src/runtime/mheap.go:1078 +0xa5
runtime.(*mheap).allocSpan(0x14cafc0, 0x4, 0x1, 0x0)
	/usr/local/go/src/runtime/mheap.go:1192 +0x1b7
runtime.(*mheap).allocManual(0x0, 0x0, 0x0)
	/usr/local/go/src/runtime/mheap.go:949 +0x1f
runtime.stackalloc(0x8000)
	/usr/local/go/src/runtime/stack.go:409 +0x151
runtime.malg.func1()
	/usr/local/go/src/runtime/proc.go:4224 +0x25
runtime.persistentalloc.func1()

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@yezizp2012 yezizp2012 added the type/bug Something isn't working label Jan 28, 2023
@github-actions github-actions bot added this to the release-0.1.16 milestone Jan 28, 2023
@yezizp2012 (Member, author) commented

Cc @wangrunji0408 @shanicky

@kwannoel (Contributor) commented Feb 9, 2023

For issue 1: the scaling test also hit the same problem on 2023-02-08 and 2023-02-09.

https://buildkite.com/risingwavelabs/main-cron/builds/337#018633cd-2be8-4f49-a0bc-e195340a9625

@kwannoel (Contributor) commented Feb 9, 2023

Maybe we can increase the agent timeout (assuming the agent was not killed):

[screenshot]

https://buildkite.com/docs/agent/v3/configuration

@zwang28 (Contributor) commented Feb 9, 2023

> assuming agent was not killed

According to @huangjw806, it may be caused by the spot instance being actively recycled, which subsequently causes Buildkite to lose the agent. Can we confirm this somehow?

@kwannoel (Contributor) commented Feb 9, 2023

> assuming agent was not killed
>
> According to @huangjw806, it may be caused by the spot instance being actively recycled, which subsequently causes Buildkite to lose the agent. Can we confirm this somehow?

What is meant by "spot instance being actively recycled"?

Thanks @zwang28 for clarifying: it is https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html

@huangjw806 (Contributor) commented Feb 9, 2023

> What is meant by "spot instance being actively recycled"?

To save money, we now use spot instances to run CI. If resources are insufficient, the spot instance will be reclaimed by AWS, which leads to the agent being lost. We've added retries, but only up to 2 retries.

@kwannoel (Contributor) commented Feb 9, 2023

Yes, +1 to @zwang28's suggestion to confirm whether it is spot instance recycling. I think something like this will help: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
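For reference, a minimal polling sketch against the documented spot instance-action metadata endpoint (this assumes IMDSv1 is reachable from the agent; the polling interval and what we do on detection are up to us):

```python
import time
import urllib.error
import urllib.request

# Documented spot interruption endpoint: returns 404 until an interruption
# is scheduled, then 200 with a small JSON payload (action + time).
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def spot_interruption_pending() -> bool:
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=2) as resp:
            print("spot interruption scheduled:", resp.read().decode())
            return True
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False  # no interruption scheduled yet
        raise
    except urllib.error.URLError:
        return False  # metadata service unreachable (not running on EC2)

if __name__ == "__main__":
    # AWS gives roughly a two-minute warning, so polling every few seconds
    # leaves enough time to log "this agent is about to be reclaimed".
    while not spot_interruption_pending():
        time.sleep(5)
```

Running something like this next to the test would let us tell an "agent lost" caused by spot reclamation apart from one caused by the machine becoming unresponsive.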

@huangjw806 (Contributor) commented

In the performance test and the longevity test, we use on-demand instances, and there has been no agent lost there so far, so this explanation seems credible.

@fuyufjh (Member) commented Feb 13, 2023

More tests failed today. https://buildkite.com/risingwavelabs/main-cron/builds/344#0186485b-9e8d-417b-bdba-f779cdaba886

@fuyufjh (Member) commented Feb 13, 2023

What's the spec/model of the EC2 instances used for running these tests? I also suspect OOM.

@yezizp2012 yezizp2012 removed this from the release-0.1.17 milestone Feb 27, 2023
@xxchan (Member) commented Mar 2, 2023

https://buildkite.com/risingwavelabs/main-cron/builds/360

The two runs are on the same machine, so the EC2 instance is alive, but the agent is lost.

So

> @zwang28 guessed that the ping was lost because the recovery test was draining CPU.

sounds reasonable?

[screenshot]

@xxchan (Member) commented Mar 2, 2023

i-0ad7ac59a80c3acc8 CloudWatch metrics:

CPU looks OK:

[CPU utilization graph]

EBS read looks strange?

[EBS read graph]
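If anyone wants to pull the raw numbers instead of reading screenshots, a rough boto3 sketch (the region and time window are placeholders to adjust to the failing build; CPUUtilization is a standard AWS/EC2 metric, and the EBS read metrics should also be available for this instance type):

```python
from datetime import datetime, timedelta, timezone

import boto3

# Region and time window are placeholders; adjust to the failing build.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",  # or an EBS read metric for the disk question
    Dimensions=[{"Name": "InstanceId", "Value": "i-0ad7ac59a80c3acc8"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
    Period=300,  # 5-minute datapoints
    Statistics=["Average", "Maximum"],
)

# Datapoints come back unordered, so sort by timestamp before printing.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```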

@xxchan (Member) commented Mar 2, 2023

CloudWatch Buildkite agent logs:

The first job was stopped 25 minutes later 🤔

[agent log screenshots]

@xxchan (Member) commented Mar 2, 2023

Does anyone have insights into the information above? 🤔 You can check the CloudWatch logs/metrics for more details.

@fuyufjh (Member) commented Mar 2, 2023

> EBS read looks strange?

That matches my guess:

> ...which reminds me that when I test RW manually, it sometimes runs out of memory and the EC2 instance becomes unresponsive.

In this case, memory is running out, so the system starts to frequently swap pages of the code segment in and out. That's why I suggested using a larger machine.

Is it possible to get the memory usage?
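For example, a small sampler that appends /proc/meminfo readings to a log could answer that (a minimal sketch, assuming the CI host is a standard Linux box; the output path and interval are arbitrary):

```python
import time
from datetime import datetime, timezone

# Fields of interest from /proc/meminfo (values are in kB).
FIELDS = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")

def read_meminfo() -> dict:
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.strip().split()[0])
    return values

if __name__ == "__main__":
    # Run alongside the recovery test and inspect the tail of the log
    # after an "agent lost" failure to see whether memory/swap ran out.
    with open("/tmp/meminfo.log", "a") as log:
        while True:
            sample = read_meminfo()
            log.write(f"{datetime.now(timezone.utc).isoformat()} {sample}\n")
            log.flush()
            time.sleep(10)
```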

@xxchan (Member) commented Mar 2, 2023

Wait, the failure above is the scaling test, not the recovery test... And it seems the scaling tests run serially.

@xxchan (Member) commented Mar 6, 2023

The scaling test no longer hits "agent lost" (but it times out); the recovery test still hits "agent lost" (not every day, only once in these 5 days).

[screenshot]

@xxchan (Member) commented Mar 6, 2023

The latter part is the recovery test (after 21:10), and the former part is the scaling test 🤔😇

[screenshot]

@BugenZhao (Member) commented

> The scaling test no longer hits "agent lost" (but it times out)

See #8374.

@yezizp2012 (Member, author) commented

This should be resolved by #9168; I will reopen it if there are any other issues.
