Postponed a segment because no coordinator was reachable #103

Closed
RolandOtta opened this issue May 15, 2017 · 10 comments

Comments

@RolandOtta

Hi folks,

We sometimes get the error message "Postponed a segment because no coordinator was reachable" when using incremental repairs in our Cassandra 3.10 production cluster.

The repair does not recover from that point; we have to stop the incremental repair and start a new one. The new repair then normally works without any issues.

When this error occurs, we see the following in the Reaper log:

DEBUG [2017-05-15 07:34:05,770] [productioncluster:93:61445] c.s.r.c.JmxConnectionFactory - Unreachable host
com.spotify.reaper.ReaperException: Null host given to JmxProxy.connect()
at com.spotify.reaper.cassandra.JmxProxy.connect(JmxProxy.java:110) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.cassandra.JmxConnectionFactory.connect(JmxConnectionFactory.java:50) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:69) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:148) [creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:93) [creaper.jar:0.5.1-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
WARN [2017-05-15 07:34:05,770] [productioncluster:93:61445] c.s.r.s.SegmentRunner - Failed to connect to a coordinator node for segment 61445
com.spotify.reaper.ReaperException: no host could be reached through JMX
at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:75) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:148) [creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:93) [creaper.jar:0.5.1-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]

According to nodetool status, all cluster nodes are in state Up/Normal.

br,
roland

ostefano commented Jul 5, 2017

Seeing exactly the same thing (Cassandra 3.0.14). More details here: #92

ostefano commented Jul 7, 2017

Happened again; the repair_run table shows one coordinator being null:

92165cf0-631d-11e7-a844-dbfd17b7d833 | 9216d229-631d-11e7-a844-dbfd17b7d833 | no cause specified | cassandracluster | 2017-07-07 14:07:11+0000 | 2017-07-07 14:07:14+0000 |       0.9 | Postponed a segment because no coordinator was reachable | Stefano | 2017-07-07 14:07:14+0000 |           parallel | 9206a580-631d-11e7-a844-dbfd17b7d833 |            13 | 2017-07-07 14:07:14+0000 | RUNNING |             null | -6403394277111986699 |         91 | 2017-07-07 14:07:14+0000 |                     null |             0 | -6414212751690768970
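
For reference, something along these lines should surface the affected rows directly in cqlsh. The keyspace, table, and column names here are guesses based on the pasted row and a typical Reaper Cassandra backend, so adjust them to the actual schema:

-- assumed schema: reaper_db.repair_run keyed by run id, one row per segment
SELECT segment_id, segment_state, coordinator_host
FROM reaper_db.repair_run
WHERE id = <run id>;  -- replace <run id> with the run's UUID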

@ostefano

@adejanovski, I am trying to understand why that might happen in the context of incremental repairs.

During my tests, the coordinator is never null when the repair starts; it only becomes null after a number of steps have completed (and, necessarily, after a segment has been postponed at least once).

Based on what I see, SegmentRunner.postpone should never set the coordinator to null when postponing a segment, so I am a bit lost.

Ideas where to look further?

@adejanovski
Contributor

With incremental repairs we should indeed never set the coordinator to null, so if that happens there's still a code path that allows it to be nulled.

I'll inspect the code shortly and come up with a proper patch.

@ostefano

Cool! Let me know the branch and I will test it right away. Thx!

@ostefano

@adejanovski, did you manage to give it a look by any chance? Thx a lot!

@adejanovski
Contributor

Hi @ostefano,

Sorry for the time it took, but I was able to reproduce and fix the issue.
I've created PR #146 with the fix.

If a segment cannot be repaired within the timeout, abort() is called but is not given the RepairUnit: https://github.com/thelastpickle/cassandra-reaper/blob/master/src/main/java/com/spotify/reaper/service/SegmentRunner.java#L123

The PR passes the RepairUnit to abort(), which can then detect that it's an incremental repair and no longer voids coordinator_host.
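
In other words, the change is roughly shaped like this. This is a simplified, hypothetical sketch with stand-in types, not the actual Reaper code; see PR #146 for the real diff:

// Sketch only: illustrates passing the RepairUnit into abort(); names are illustrative.
final class AbortSketch {

    static final class RepairUnit {
        final boolean incrementalRepair;
        RepairUnit(boolean incremental) { this.incrementalRepair = incremental; }
    }

    static final class Segment {
        String coordinatorHost = "10.0.0.1";
        String state = "RUNNING";
    }

    // Before the fix, abort() had no RepairUnit and always voided the coordinator;
    // with the unit available, incremental repairs keep coordinator_host so the
    // retry goes back to the node the segment is pinned to.
    static void abort(Segment segment, RepairUnit unit) {
        segment.state = "NOT_STARTED"; // put the segment back in the queue for a retry
        if (!unit.incrementalRepair) {
            segment.coordinatorHost = null; // only full repairs may pick a new coordinator
        }
    }

    public static void main(String[] args) {
        Segment s = new Segment();
        abort(s, new RepairUnit(true));
        System.out.println("incremental abort -> coordinator = " + s.coordinatorHost); // still set
    }
}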

Could you test the branch and tell us if it works?

Thanks

@ostefano

Hi @adejanovski, thx a lot!

I have been testing ft-reaper-improvements-final in the meantime. Do you think I can just cherry-pick that commit and run ft-reaper-improvements-final + PR #146?

Thanks

@adejanovski
Contributor

Hi @ostefano,

Yes, totally, and we'll soon rebase ft-reaper-improvements-final over master anyway.
Also, I've recently added proper support for incremental repair in ft-reaper-improvements-final when running multiple Reaper instances.

@ostefano

Hi @adejanovski, I've been testing the patch and all seems good. Thx for fixing this!
