
bug: deterministic recovery test failure in main cron #7561

Status: Closed
yezizp2012 opened this issue Jan 28, 2023 · 20 comments
Labels: type/bug (Something isn't working)

@yezizp2012 (Member) commented Jan 28, 2023
Describe the bug

Two issues found in https://buildkite.com/risingwavelabs/main-cron/builds/310#0185bd0f-6cd7-454d-b511-bc5cf7f3c074.

  1. The deterministic recovery test seems flaky. The test exited with status -1 (agent lost) in Buildkite:
     "agent.lost: An agent has been marked as lost. This happens when Buildkite stops receiving pings from the agent, see https://buildkite.com/docs/apis/webhooks/agent-events for more details."
     @zwang28 guessed that the ping was lost because the recovery test was draining CPU.
  2. cannot allocate memory:
fatal error: runtime: cannot allocate memory
runtime stack:
runtime.throw({0xd0c93c, 0x2030000})
	/usr/local/go/src/runtime/panic.go:1198 +0x71
runtime.persistentalloc1(0x4000, 0xc000700800, 0x14e61a8)
	/usr/local/go/src/runtime/malloc.go:1417 +0x24f
runtime.persistentalloc.func1()
	/usr/local/go/src/runtime/malloc.go:1371 +0x2e
runtime.persistentalloc(0x14cafc8, 0xc000180000, 0x40)
	/usr/local/go/src/runtime/malloc.go:1370 +0x6f
runtime.(*fixalloc).alloc(0x14e1808)
	/usr/local/go/src/runtime/mfixalloc.go:80 +0x85
runtime.(*mheap).allocMSpanLocked(0x14cafc0)
	/usr/local/go/src/runtime/mheap.go:1078 +0xa5
runtime.(*mheap).allocSpan(0x14cafc0, 0x4, 0x1, 0x0)
	/usr/local/go/src/runtime/mheap.go:1192 +0x1b7
runtime.(*mheap).allocManual(0x0, 0x0, 0x0)
	/usr/local/go/src/runtime/mheap.go:949 +0x1f
runtime.stackalloc(0x8000)
	/usr/local/go/src/runtime/stack.go:409 +0x151
runtime.malg.func1()
	/usr/local/go/src/runtime/proc.go:4224 +0x25
runtime.persistentalloc.func1()

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@yezizp2012 yezizp2012 added the type/bug Something isn't working label Jan 28, 2023
@github-actions github-actions bot added this to the release-0.1.16 milestone Jan 28, 2023
@yezizp2012 (Member, author) commented

Cc @wangrunji0408 @shanicky

@kwannoel (Contributor) commented Feb 9, 2023

For issue 1: the scaling test also hit the same problem on 2023-02-08 and 2023-02-09.

https://buildkite.com/risingwavelabs/main-cron/builds/337#018633cd-2be8-4f49-a0bc-e195340a9625

@kwannoel (Contributor) commented Feb 9, 2023

Maybe we can increase the agent timeout (assuming the agent was not killed):

[screenshot]

https://buildkite.com/docs/agent/v3/configuration

@zwang28 (Contributor) commented Feb 9, 2023

> assuming agent was not killed

According to @huangjw806, it may be caused by the spot instance being actively recycled, which subsequently causes Buildkite to lose the agent. Can we confirm this somehow?

@kwannoel (Contributor) commented Feb 9, 2023

> assuming agent was not killed
>
> According to @huangjw806, it may be caused by the spot instance being actively recycled, which subsequently causes Buildkite to lose the agent. Can we confirm this somehow?

What is meant by "spot instance being actively recycled"?

Thanks @zwang28 for clarifying: it is https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html

@huangjw806 (Contributor) commented Feb 9, 2023

> What is meant by "spot instance being actively recycled"?

To save money, we now use spot instances to run CI. If resources are insufficient, the spot instance will be reclaimed by AWS, which leads to the agent being lost. We've added retries, but only up to 2 retries.

@kwannoel (Contributor) commented Feb 9, 2023

Yes, +1 to @zwang28's suggestion to confirm whether it is spot instance recycling. I think something like this will help: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html
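For reference, a minimal polling sketch against the documented spot instance-action metadata endpoint (this assumes IMDSv1 is reachable from the agent; the polling interval and what we do on detection are up to us):

```python
import time
import urllib.error
import urllib.request

# Documented spot interruption endpoint: returns 404 until an interruption
# is scheduled, then 200 with a small JSON payload (action + time).
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def spot_interruption_pending() -> bool:
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=2) as resp:
            print("spot interruption scheduled:", resp.read().decode())
            return True
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return False  # no interruption scheduled yet
        raise
    except urllib.error.URLError:
        return False  # metadata service unreachable (not running on EC2)

if __name__ == "__main__":
    # AWS gives roughly a two-minute warning, so polling every few seconds
    # leaves enough time to log "this agent is about to be reclaimed".
    while not spot_interruption_pending():
        time.sleep(5)
```

Running something like this next to the test would let us tell an "agent lost" caused by spot reclamation apart from one caused by the machine becoming unresponsive.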

@huangjw806 (Contributor) commented

In the performance test and the longevity test, we use on-demand instances, and there has been no agent lost there so far, so this explanation seems credible.

@fuyufjh (Member) commented Feb 13, 2023

More tests failed today. https://buildkite.com/risingwavelabs/main-cron/builds/344#0186485b-9e8d-417b-bdba-f779cdaba886

@fuyufjh (Member) commented Feb 13, 2023

What's the spec/model of the EC2 instances used for running these tests? I also suspect OOM.

@yezizp2012 yezizp2012 removed this from the release-0.1.17 milestone Feb 27, 2023
@xxchan (Member) commented Mar 2, 2023

https://buildkite.com/risingwavelabs/main-cron/builds/360

The two runs are on the same machine, so the EC2 instance is alive, but the agent is lost.

So

> @zwang28 guessed that the ping was lost because the recovery test was draining CPU.

sounds reasonable?

[screenshot]

@xxchan (Member) commented Mar 2, 2023

i-0ad7ac59a80c3acc8 CloudWatch metrics:

CPU looks OK:

[CPU utilization graph]

EBS read looks strange?

[EBS read graph]
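If anyone wants to pull the raw numbers instead of reading screenshots, a rough boto3 sketch (the region and time window are placeholders to adjust to the failing build; CPUUtilization is a standard AWS/EC2 metric, and the EBS read metrics should also be available for this instance type):

```python
from datetime import datetime, timedelta, timezone

import boto3

# Region and time window are placeholders; adjust to the failing build.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",  # or an EBS read metric for the disk question
    Dimensions=[{"Name": "InstanceId", "Value": "i-0ad7ac59a80c3acc8"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
    Period=300,  # 5-minute datapoints
    Statistics=["Average", "Maximum"],
)

# Datapoints come back unordered, so sort by timestamp before printing.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```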

@xxchan (Member) commented Mar 2, 2023

CloudWatch Buildkite agent logs:

The first job was stopped 25 minutes later 🤔

[agent log screenshots]

@xxchan (Member) commented Mar 2, 2023

Does anyone have insights into the information above? 🤔 You can check the CloudWatch logs/metrics for more details.

@fuyufjh (Member) commented Mar 2, 2023

> EBS read looks strange?

That matches my guess:

> ...which reminds me that when I test RW manually, it sometimes runs out of memory and the EC2 instance becomes unresponsive.

In this case, memory is running out, so the system starts to frequently swap pages of the code segment in and out. That's why I suggested using a larger machine.

Is it possible to get the memory usage?
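For example, a small sampler that appends /proc/meminfo readings to a log could answer that (a minimal sketch, assuming the CI host is a standard Linux box; the output path and interval are arbitrary):

```python
import time
from datetime import datetime, timezone

# Fields of interest from /proc/meminfo (values are in kB).
FIELDS = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")

def read_meminfo() -> dict:
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.strip().split()[0])
    return values

if __name__ == "__main__":
    # Run alongside the recovery test and inspect the tail of the log
    # after an "agent lost" failure to see whether memory/swap ran out.
    with open("/tmp/meminfo.log", "a") as log:
        while True:
            sample = read_meminfo()
            log.write(f"{datetime.now(timezone.utc).isoformat()} {sample}\n")
            log.flush()
            time.sleep(10)
```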

@xxchan (Member) commented Mar 2, 2023

Wait, the failure above is the scaling test, not the recovery test... And it seems the scaling tests run serially.

@xxchan (Member) commented Mar 6, 2023

The scaling test no longer hits "agent lost" (but it times out); the recovery test still hits "agent lost" (not every day, only once in these 5 days).

[screenshot]

@xxchan (Member) commented Mar 6, 2023

The latter part is the recovery test (after 21:10), and the former part is the scaling test 🤔😇

[screenshot]

@BugenZhao (Member) commented

> The scaling test no longer hits "agent lost" (but it times out)

See #8374.

@yezizp2012 (Member, author) commented

This should be resolved by #9168; I will reopen it if there are any other issues.
