Postponed a segment because no coordinator was reachable #103
Seeing exactly the same thing (version 3.0.14). More details here: #92
Happened again, table
@adejanovski, I am trying to understand why that might happen in the context of incremental repairs. During my tests, the coordinator is never null when the repair starts; it only becomes null after a number of steps have completed (and necessarily after a segment has been postponed at least once). Based on what I see Ideas where to look further?
With incremental repairs, we should indeed never set the coordinator to null, so if that happens there is a code path that still allows it to be nulled. I'll inspect the code shortly and come up with a proper patch.
Cool! Let me know the branch and I will test it right away. Thx!
@adejanovski, did you manage to have a look at it by any chance? Thx a lot!
Hi @ostefano, sorry for the time it took, but I was able to reproduce and fix the issue. If a segment cannot be repaired within the timeout, abort() is called but is not given the RepairUnit: https://github.com/thelastpickle/cassandra-reaper/blob/master/src/main/java/com/spotify/reaper/service/SegmentRunner.java#L123 The PR provides the RepairUnit to abort(), which then detects it's an incremental repair and doesn't void Could you test the branch and tell us if it works? Thanks
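For readers following along, here is a rough sketch of the behaviour described in the comment above. It is not the actual Reaper code: `Segment`, `RepairUnit`, `abortWithoutUnit` and the field names are hypothetical stand-ins, and it assumes that what gets voided on abort is the segment's coordinator, which the thread suggests but does not state outright.

```java
class AbortSketch {

    /** Hypothetical stand-in for Reaper's RepairUnit; only the incremental flag matters here. */
    static final class RepairUnit {
        final boolean incrementalRepair;

        RepairUnit(boolean incrementalRepair) {
            this.incrementalRepair = incrementalRepair;
        }
    }

    /** Hypothetical stand-in for a repair segment pinned to a coordinator host. */
    static final class Segment {
        String coordinatorHost;
        String state = "RUNNING";

        Segment(String coordinatorHost) {
            this.coordinatorHost = coordinatorHost;
        }
    }

    /** Assumed pre-fix behaviour: no RepairUnit is available, so the coordinator is always cleared. */
    static void abortWithoutUnit(Segment segment) {
        segment.state = "NOT_STARTED";
        segment.coordinatorHost = null;
    }

    /** Assumed post-fix behaviour: the coordinator is kept when the repair is incremental. */
    static void abort(Segment segment, RepairUnit unit) {
        segment.state = "NOT_STARTED";
        if (!unit.incrementalRepair) {
            // Only non-incremental repairs may pick a new coordinator on retry.
            segment.coordinatorHost = null;
        }
    }

    public static void main(String[] args) {
        Segment before = new Segment("10.0.0.1");
        abortWithoutUnit(before);
        System.out.println("without RepairUnit: coordinator = " + before.coordinatorHost);

        Segment after = new Segment("10.0.0.1");
        abort(after, new RepairUnit(true));
        System.out.println("with RepairUnit (incremental): coordinator = " + after.coordinatorHost);
    }
}
```

If that assumption holds, a timed-out incremental segment keeps its coordinator and can be retried, instead of being postponed forever with a null host.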
Hi @adejanovski, thx a lot! I have been testing Thanks
Hi @ostefano, yes totally, and we'll soon rebase
Hi @adejanovski, been testing the patch and all seems good. Thx for fixing this!
Hi folks,
We sometimes get the error message "Postponed a segment because no coordinator was reachable" when using incremental repairs in our Cassandra 3.10 production cluster.
The repair does not recover from that point. We have to stop the incremental repair and start a new one; the new repair then normally works without any issues.
When this error occurs, we see the following in the creaper log:
DEBUG [2017-05-15 07:34:05,770] [productioncluster:93:61445] c.s.r.c.JmxConnectionFactory - Unreachable host
com.spotify.reaper.ReaperException: Null host given to JmxProxy.connect()
at com.spotify.reaper.cassandra.JmxProxy.connect(JmxProxy.java:110) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.cassandra.JmxConnectionFactory.connect(JmxConnectionFactory.java:50) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:69) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:148) [creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:93) [creaper.jar:0.5.1-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
WARN [2017-05-15 07:34:05,770] [productioncluster:93:61445] c.s.r.s.SegmentRunner - Failed to connect to a coordinator node for segment 61445
com.spotify.reaper.ReaperException: no host could be reached through JMX
at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:75) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:148) [creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:93) [creaper.jar:0.5.1-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
According to nodetool status, all cluster nodes are in state up/normal.
br,
roland
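One plausible reading of the two log entries above, sketched in Java: the only candidate host handed to `connectAny` is null, `connect` rejects it (the DEBUG entry), and with no candidates left `connectAny` gives up (the WARN entry). This is a simplified illustration, not the Reaper source: the class below is hypothetical, `connect` returns the host name instead of a real JMX proxy, and only the two exception messages are taken from the log.

```java
import java.util.Collections;
import java.util.List;

class ConnectAnySketch {

    /** Hypothetical stand-in for com.spotify.reaper.ReaperException. */
    static final class ReaperException extends Exception {
        ReaperException(String message) {
            super(message);
        }
    }

    /** Stand-in for JmxProxy.connect(): rejects a null host, otherwise "connects". */
    static String connect(String host) throws ReaperException {
        if (host == null) {
            throw new ReaperException("Null host given to JmxProxy.connect()");
        }
        return host; // a real implementation would return a JMX proxy here
    }

    /** Tries each candidate host in turn and fails only when none can be reached. */
    static String connectAny(List<String> hosts) throws ReaperException {
        for (String host : hosts) {
            try {
                return connect(host);
            } catch (ReaperException e) {
                // Corresponds to the DEBUG "Unreachable host" entry in the log.
                System.out.println("DEBUG Unreachable host: " + e.getMessage());
            }
        }
        throw new ReaperException("no host could be reached through JMX");
    }

    public static void main(String[] args) {
        try {
            // A segment whose coordinator was nulled yields a single null candidate.
            connectAny(Collections.singletonList((String) null));
        } catch (ReaperException e) {
            System.out.println("WARN Failed to connect to a coordinator node: " + e.getMessage());
        }
    }
}
```

Under this reading the nodes themselves can all be up, as nodetool status reports, because the failure comes from the null host rather than from the cluster.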