Postponed a segment because no coordinator was reachable #103

Closed
RolandOtta opened this issue May 15, 2017 · 10 comments

Comments

@RolandOtta

Hi folks,

We sometimes get the error message "Postponed a segment because no coordinator was reachable" when using incremental repairs in our Cassandra 3.10 production cluster.

The repair does not recover from that point; we have to stop the incremental repair and start a new one. The new repair then normally works without any issues.

When this error occurs, we see the following in the Reaper log:

DEBUG [2017-05-15 07:34:05,770] [productioncluster:93:61445] c.s.r.c.JmxConnectionFactory - Unreachable host
com.spotify.reaper.ReaperException: Null host given to JmxProxy.connect()
at com.spotify.reaper.cassandra.JmxProxy.connect(JmxProxy.java:110) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.cassandra.JmxConnectionFactory.connect(JmxConnectionFactory.java:50) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:69) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:148) [creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:93) [creaper.jar:0.5.1-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
WARN [2017-05-15 07:34:05,770] [productioncluster:93:61445] c.s.r.s.SegmentRunner - Failed to connect to a coordinator node for segment 61445
com.spotify.reaper.ReaperException: no host could be reached through JMX
at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:75) ~[creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:148) [creaper.jar:0.5.1-SNAPSHOT]
at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:93) [creaper.jar:0.5.1-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_77]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_77]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]

According to nodetool status, all cluster nodes are in state Up/Normal.

br,
roland

ostefano commented Jul 5, 2017

Seeing exactly the same thing (Cassandra 3.0.14). More details here: #92

ostefano commented Jul 7, 2017

Happened again; the repair_run table shows one coordinator being null:

92165cf0-631d-11e7-a844-dbfd17b7d833 | 9216d229-631d-11e7-a844-dbfd17b7d833 | no cause specified | cassandracluster | 2017-07-07 14:07:11+0000 | 2017-07-07 14:07:14+0000 |       0.9 | Postponed a segment because no coordinator was reachable | Stefano | 2017-07-07 14:07:14+0000 |           parallel | 9206a580-631d-11e7-a844-dbfd17b7d833 |            13 | 2017-07-07 14:07:14+0000 | RUNNING |             null | -6403394277111986699 |         91 | 2017-07-07 14:07:14+0000 |                     null |             0 | -6414212751690768970
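
For reference, something along these lines should surface the affected rows directly in cqlsh. The keyspace, table, and column names here are guesses based on the pasted row and a typical Reaper Cassandra backend, so adjust them to the actual schema:

-- assumed schema: reaper_db.repair_run keyed by run id, one row per segment
SELECT segment_id, segment_state, coordinator_host
FROM reaper_db.repair_run
WHERE id = <run id>;  -- replace <run id> with the run's UUID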

@ostefano

@adejanovski, I am trying to understand why that might happen in the context of incremental repairs.

During my tests, the coordinator is never null when the repair starts; it only becomes null after a number of steps have completed (and, necessarily, after a segment has been postponed at least once).

Based on what I see, SegmentRunner.postpone should never set the coordinator to null when postponing a segment, so I am a bit lost.

Ideas where to look further?

@adejanovski
Contributor

With incremental repairs we should indeed never set the coordinator to null, so if that happens there's still a code path that allows it to be nulled.

I'll inspect the code shortly and come up with a proper patch.

@ostefano

Cool! Let me know the branch and I will test it right away. Thx!

@ostefano

@adejanovski, did you manage to give it a look by any chance? Thx a lot!

@adejanovski
Contributor

Hi @ostefano,

Sorry for the time it took, but I was able to reproduce and fix the issue.
I've created PR #146 with the fix.

If a segment cannot be repaired within the timeout, abort() is called but is not given the RepairUnit: https://github.com/thelastpickle/cassandra-reaper/blob/master/src/main/java/com/spotify/reaper/service/SegmentRunner.java#L123

The PR passes the RepairUnit to abort(), which can then detect that it's an incremental repair and no longer voids coordinator_host.
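
In other words, the change is roughly shaped like this. This is a simplified, hypothetical sketch with stand-in types, not the actual Reaper code; see PR #146 for the real diff:

// Sketch only: illustrates passing the RepairUnit into abort(); names are illustrative.
final class AbortSketch {

    static final class RepairUnit {
        final boolean incrementalRepair;
        RepairUnit(boolean incremental) { this.incrementalRepair = incremental; }
    }

    static final class Segment {
        String coordinatorHost = "10.0.0.1";
        String state = "RUNNING";
    }

    // Before the fix, abort() had no RepairUnit and always voided the coordinator;
    // with the unit available, incremental repairs keep coordinator_host so the
    // retry goes back to the node the segment is pinned to.
    static void abort(Segment segment, RepairUnit unit) {
        segment.state = "NOT_STARTED"; // put the segment back in the queue for a retry
        if (!unit.incrementalRepair) {
            segment.coordinatorHost = null; // only full repairs may pick a new coordinator
        }
    }

    public static void main(String[] args) {
        Segment s = new Segment();
        abort(s, new RepairUnit(true));
        System.out.println("incremental abort -> coordinator = " + s.coordinatorHost); // still set
    }
}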

Could you test the branch and tell us if it works?

Thanks

@ostefano

Hi @adejanovski, thx a lot!

I have been testing ft-reaper-improvements-final in the meantime. Do you think I can just cherry-pick that commit and run ft-reaper-improvements-final + PR #146?

Thanks

@adejanovski
Contributor

Hi @ostefano,

Yes, totally, and we'll soon rebase ft-reaper-improvements-final over master anyway.
Also, I've recently added proper support for incremental repair in ft-reaper-improvements-final when running multiple Reaper instances.

@ostefano

Hi @adejanovski, I've been testing the patch and all seems good. Thx for fixing this!
