
Reaper won't repair last segment #92

Closed
RolandOtta opened this issue May 5, 2017 · 15 comments
@RolandOtta

RolandOtta commented May 5, 2017

hi there,

I have the following problem in my production cluster.

I started a full repair, which took days. Reaper has now done 4095 segments out of 4096, but it seems to refuse to do the last segment. I couldn't find anything suspicious in the Reaper or Cassandra logs, and there are no pending repairs in the cluster.

I paused the repair and reactivated it, but that did not solve the problem. Restarting Reaper did not help either.
The only thing Reaper currently writes to its log is the following line every 30 seconds:

INFO [2017-05-05 08:27:41,929] [productioncluster:36] c.s.r.s.RepairRunner - Repair amount done 4095

@adejanovski
Contributor

Hi @RolandOtta,

which version of Cassandra are you using?
Could you restart Reaper (if I'm not too late) with DEBUG as the log level (in your reaper yaml file) and share the logs?
It's odd that Reaper doesn't even try to process that last segment; it should output something about it at the very least.
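
For reference, a minimal sketch of the logging section in the Reaper yaml, assuming the usual Dropwizard-style logging block (exact keys and appender types may differ between Reaper versions):

    logging:
      level: DEBUG
      appenders:
        - type: console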

Thanks,

@RolandOtta
Author

Hi @adejanovski,

thank you for your response.
We are using Cassandra 3.7.

After switching the log level to DEBUG, I get dozens of these messages in between the previously mentioned "Repair amount done ..." messages:

DEBUG [2017-05-09 22:30:15,565] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9113347751948521685,9133161811851011073]
DEBUG [2017-05-09 22:30:15,565] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9133161811851011073,9157259674207723043]
DEBUG [2017-05-09 22:30:15,566] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9157259674207723043,9177171256206987672]
DEBUG [2017-05-09 22:30:15,566] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9177171256206987672,9203061361411792250]
DEBUG [2017-05-09 22:30:15,567] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9203061361411792250,-9216219198908204042]

thanks,
roland

@adejanovski
Contributor

Hi @RolandOtta,

for some reason (race condition?) it looks like all your segments were repaired, but Reaper still thinks it should repair the last one, and the whole job wasn't marked as DONE.
Are you using the Cassandra backend for Reaper?

Could you send me (alex[at]thelastpickle[dot]com), or link to, a dump of the repair_run and repair_segment tables?

That way I can check whether the two tables are in an inconsistent state.
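
If it helps, here is a quick way to dump both tables with cqlsh COPY (a sketch; the reaper_db keyspace name is an assumption and depends on your Reaper configuration):

    cqlsh -e "COPY reaper_db.repair_run TO 'repair_run.csv' WITH HEADER = true;"
    cqlsh -e "COPY reaper_db.repair_segment TO 'repair_segment.csv' WITH HEADER = true;"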

Thanks!

@adejanovski
Contributor

Hi @RolandOtta,

Sorry it took a while to address this issue.
Looking at the data you sent me, there is clearly one segment that Reaper leaves aside, and I've been able to reproduce the problem with the Cassandra backend.
It is fixed in the latest 0.6.0 release, along with huge performance improvements to Reaper when using this backend.

Can you try this release and tell me if it fixes your issue?

Thanks!

@RolandOtta
Author

hi @adejanovski,

it seems that the fix does not work for me.
I upgraded to the latest 0.6.0 and still get these messages in the log:

DEBUG [2017-06-06 13:37:06,437] [productioncluster:36] c.s.r.s.RepairRunner - run() called for repair run #36 with run state RUNNING
DEBUG [2017-06-06 13:37:06,438] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (-9216219198908204042,-9198855260881653671]
DEBUG [2017-06-06 13:37:06,439] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (-9198855260881653671,-9189681434865597821]
....
....
some more No repair segment available messages
...
...
DEBUG [2017-06-06 13:37:06,660] [productioncluster:36] c.s.r.s.RepairRunner - run() exiting for repair run #36

and it keeps trying to repair the same segment every 30 seconds.

br,
roland

@adejanovski
Contributor

Hi Roland,

I'm back on that issue after a little while.
Which backend are you using for Reaper?
v0.6.0 moved to UUIDs for repair run and segment ids, so the log above suggests that you're using Postgres or H2.

Could you confirm?

Thanks

@ostefano

ostefano commented Jul 5, 2017

Hi @adejanovski,

I might be stumbling on the same issue (do tell me if you want me to open a new one, though), but I am not sure. Again with the Cassandra backend (Cassandra 3.0.14): the last segment is continuously postponed. The UI was showing the following status:

Postponed a segment because no coordinator was reachable

After enabling the logs I get the following:

! com.spotify.reaper.ReaperException: Null host given to JmxProxy.connect()
! at com.spotify.reaper.cassandra.JmxProxy.connect(JmxProxy.java:121)
! at com.spotify.reaper.cassandra.JmxConnectionFactory.connect(JmxConnectionFactory.java:53)
! at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:72)
! at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:156)
! at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:101)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
! at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
! at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
! at java.lang.Thread.run(Thread.java:748)
WARN  [2017-07-05 10:32:32,254] com.spotify.reaper.service.SegmentRunner: Failed to connect to a coordinator node for segment cb465ee3-6164-11e7-9d07-735ff78ae49c
! com.spotify.reaper.ReaperException: no host could be reached through JMX
! at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:78)
! at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:156)
! at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:101)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
! at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
! at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
! at java.lang.Thread.run(Thread.java:748)

@ostefano

ostefano commented Jul 5, 2017

For some strange reason, the repair_run table had a null coordinator for one of the segments. Running it again...

Could it be that, because no coordinator was reachable, Reaper didn't replace it successfully (hence the null)? Note that the cluster I am repairing is under quite some load.

@beevee

beevee commented Jul 6, 2017

I'm having the same problem right now with Reaper version 0.6.1, Cassandra backend, Cassandra version 2.2.9. I'm using the same Cassandra cluster that I'm repairing as a backend.

Every 30 seconds Reaper writes "Repair amount done 1536" to the log, while the web UI shows that the cluster contains 1537 segments. I don't see any errors.

@pacorulo

Hi,

We have Cassandra v3.11 and had the same issue: the run was split into 1600 segments, and it stopped and hung on segment number 1599. It was solved without restarting Cassandra or Reaper: we only used jmxterm to execute forceTerminateAllRepairSessions on all nodes of the data center, and after that Reaper started the last repair by itself and it ended successfully. Hope it helps.
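
For reference, a sketch of that jmxterm call in non-interactive mode (the jar name/version and the default 7199 JMX port are assumptions; run it against each node in turn):

    echo "run -b org.apache.cassandra.db:type=StorageService forceTerminateAllRepairSessions" | \
      java -jar jmxterm-1.0.0-uber.jar -l <node_ip>:7199 -n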

@marshallking

Has this issue been fixed and verified?

@ostefano

@marshallking afaik it has been fixed. Never had the issue again.

@sergeyyegournov

sergeyyegournov commented Jan 24, 2018

I have still seen the same behavior, although the reason is likely different. I'm running the latest 1.1.0-SNAPSHOT as of this writing, with the Cassandra backend. In my case the repair did not proceed once 1333/1334 segments were repaired. Sadly I did not check the repair_run table and the segment_state column value for the problematic segment, but looking at the logs and the source code I think the reason was that the last segment was in the RUNNING state rather than in the NOT_STARTED state.

Relevant part of the log:
INFO [2018-01-22 09:26:18,936] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Running segment for range (-9221357256650545873,-4662614694803227943]
INFO [2018-01-22 09:26:18,961] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Running segment for range (-4662614694803227943,-153033555439306155]
INFO [2018-01-22 09:26:18,986] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Running segment for range (-153033555439306155,4340134915807037722]
INFO [2018-01-22 09:26:19,016] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Running segment for range (4340134915807037722,-9221357256650545873]
INFO [2018-01-22 09:26:19,097] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Repair amount done 1333

And then it starts over again, with the same entries being logged. Looking into the source, I believe that in the startNextSegment method the call to getNextFreeSegmentInRange did not return a segment because its state was not NOT_STARTED:

LOG.info("Running segment for range {}", parallelRanges.get(rangeIndex));
Optional<RepairSegment> nextRepairSegment =
    context.storage.getNextFreeSegmentInRange(repairRunId, Optional.of(parallelRanges.get(rangeIndex)));
On the other hand, the state of the segment in question could not be DONE, because in that case the following code would mark the repair as successful and would not re-schedule it anymore:

if (segmentsDone == segmentsTotal) {
  endRepairRun();
  scheduleRetry = false;
}

So likely the state of a segment was RUNNING.
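
For anyone debugging this later, a hedged CQL sketch to check the stuck segment's state in the Cassandra backend (segment_state is the column mentioned above; the reaper_db keyspace, the id partition key and the coordinator_host column are assumptions, so adjust them to your Reaper schema version):

    SELECT segment_id, segment_state, coordinator_host
    FROM reaper_db.repair_run
    WHERE id = 39fd5ed0-fddf-11e7-baa0-c70b9b7ef308;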

What is really interesting is that after a few restarts of Reaper (unrelated to this issue) the segment finally got repaired. Before that, the Cassandra backend was down for about a minute, and after Reaper re-established the connection the following happened:
INFO [2018-01-23 22:40:21,774] [pool-1-thread-1] i.c.s.SegmentRunner - Postponing segment 39fdfb22-fddf-11e7-baa0-c70b9b7ef308
INFO [2018-01-23 22:40:21,777] [pool-1-thread-1] i.c.s.SegmentRunner - Aborting repair on segment with id 39fdfb22-fddf-11e7-baa0-c70b9b7ef308 on coordinator 10.10.123.115
INFO [2018-01-23 22:40:23,638] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Next segment to run : 39fdfb22-fddf-11e7-baa0-c70b9b7ef308
INFO [2018-01-23 22:40:26,415] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308:39fdfb22-fddf-11e7-baa0-c70b9b7ef308] i.c.s.SegmentRunner - It is ok to repair segment '39fdfb22-fddf-11e7-baa0-c70b9b7ef308' on repair run with id '39fd5ed0-fddf-11e7-baa0-c70b9b7ef308'
INFO [2018-01-23 22:40:26,470] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308:39fdfb22-fddf-11e7-baa0-c70b9b7ef308] i.c.j.JmxProxy - Triggering repair of range (-1211415970840697202,-1184422912999070688] for keyspace "outbrain_document_features" on host 10.10.122.225, with repair parallelism dc_parallel, in cluster with Cassandra version '3.11.1' (can use DATACENTER_AWARE 'true'), for column families: []
INFO [2018-01-23 22:40:26,492] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308:39fdfb22-fddf-11e7-baa0-c70b9b7ef308] i.c.s.SegmentRunner - Repair for segment 39fdfb22-fddf-11e7-baa0-c70b9b7ef308 started, status wait will timeout in 1800000 millis
INFO [2018-01-23 22:49:39,205] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308:39fdfb22-fddf-11e7-baa0-c70b9b7ef308] i.c.s.SegmentRunner - Repair command 106 on segment 39fdfb22-fddf-11e7-baa0-c70b9b7ef308 returned with state DONE

So it seems the state of that segment somehow transitioned from RUNNING back to NOT_STARTED, and hence the startNextSegment method was able to pick it up.

UPDATE: I was able to reproduce this issue. In fact it has nothing to do with the segment, so I am going to open a new ticket for it.

@adejanovski
Contributor

Will be fixed by #321

@adejanovski
Contributor

That kind of issue has been fixed with the 1.1.0 release.
