
Reaper won't repair last segment #92

Closed
RolandOtta opened this issue May 5, 2017 · 15 comments
@RolandOtta

RolandOtta commented May 5, 2017

hi there,

I have the following problem in my production cluster.

I started a full repair, which took days. Reaper has now done 4095 segments out of 4096, but it seems to refuse to do the last segment. I couldn't find anything suspicious in the Reaper or Cassandra logs, and there are no pending repairs in the cluster.

I paused the repair and reactivated it, but that did not solve the problem. Restarting Reaper did not help either.
The only thing Reaper currently writes to its log is the following line every 30 seconds:

INFO [2017-05-05 08:27:41,929] [productioncluster:36] c.s.r.s.RepairRunner - Repair amount done 4095

@adejanovski
Contributor

Hi @RolandOtta,

which version of Cassandra are you using?
Could you restart Reaper (if I'm not too late) with DEBUG as the log level (in your reaper yaml file) and share the logs?
It's odd that Reaper doesn't even try to process that last segment; it should output something about it at the very least.
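
For reference, a minimal sketch of the logging section in the Reaper yaml, assuming the usual Dropwizard-style logging block (exact keys and appender types may differ between Reaper versions):

    logging:
      level: DEBUG
      appenders:
        - type: console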

Thanks,

@RolandOtta
Author

Hi @adejanovski,

thank you for your response.
We are using Cassandra 3.7.

After switching the log level to DEBUG, I get dozens of these messages in between the previously mentioned "Repair amount done ..." messages:

DEBUG [2017-05-09 22:30:15,565] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9113347751948521685,9133161811851011073]
DEBUG [2017-05-09 22:30:15,565] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9133161811851011073,9157259674207723043]
DEBUG [2017-05-09 22:30:15,566] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9157259674207723043,9177171256206987672]
DEBUG [2017-05-09 22:30:15,566] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9177171256206987672,9203061361411792250]
DEBUG [2017-05-09 22:30:15,567] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9203061361411792250,-9216219198908204042]

thanks,
roland

@adejanovski
Contributor

Hi @RolandOtta,

for some reason (race condition?) it looks like all your segments were repaired, but Reaper still thinks it should repair the last one, and the whole job wasn't marked as DONE.
Are you using the Cassandra backend for Reaper?

Could you send me (alex[at]thelastpickle[dot]com), or link to, a dump of the repair_run and repair_segment tables?

That way I can check whether the two tables are in an inconsistent state.
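
If it helps, here is a quick way to dump both tables with cqlsh COPY (a sketch; the reaper_db keyspace name is an assumption and depends on your Reaper configuration):

    cqlsh -e "COPY reaper_db.repair_run TO 'repair_run.csv' WITH HEADER = true;"
    cqlsh -e "COPY reaper_db.repair_segment TO 'repair_segment.csv' WITH HEADER = true;"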

Thanks!

@adejanovski
Contributor

Hi @RolandOtta,

Sorry it took a while to address this issue.
Looking at the data you sent me, there is clearly one segment that Reaper leaves aside, and I've been able to reproduce the problem with the Cassandra backend.
It is fixed in the latest 0.6.0 release, along with huge performance improvements to Reaper when using this backend.

Can you try this release and tell me if it fixes your issue?

Thanks!

@RolandOtta
Author

hi @adejanovski,

it seems that the fix does not work for me.
I upgraded to the latest 0.6.0 and still get these messages in the log:

DEBUG [2017-06-06 13:37:06,437] [productioncluster:36] c.s.r.s.RepairRunner - run() called for repair run #36 with run state RUNNING
DEBUG [2017-06-06 13:37:06,438] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (-9216219198908204042,-9198855260881653671]
DEBUG [2017-06-06 13:37:06,439] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (-9198855260881653671,-9189681434865597821]
....
....
some more No repair segment available messages
...
...
DEBUG [2017-06-06 13:37:06,660] [productioncluster:36] c.s.r.s.RepairRunner - run() exiting for repair run #36

and it keeps trying to repair the same segment every 30 seconds.

br,
roland

@adejanovski
Contributor

Hi Roland,

I'm back on that issue after a little while.
Which backend are you using for Reaper?
v0.6.0 moved to UUIDs for repair run and segment ids, so the log above suggests that you're using Postgres or H2.

Could you confirm?

Thanks

@ostefano

ostefano commented Jul 5, 2017

Hi @adejanovski,

I might be stumbling on the same issue (do tell me if you want me to open a new one, though), but I am not sure. Again with the Cassandra backend (Cassandra 3.0.14): the last segment is continuously postponed. The UI was showing the following status:

Postponed a segment because no coordinator was reachable

After enabling the logs I get the following:

! com.spotify.reaper.ReaperException: Null host given to JmxProxy.connect()
! at com.spotify.reaper.cassandra.JmxProxy.connect(JmxProxy.java:121)
! at com.spotify.reaper.cassandra.JmxConnectionFactory.connect(JmxConnectionFactory.java:53)
! at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:72)
! at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:156)
! at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:101)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
! at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
! at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
! at java.lang.Thread.run(Thread.java:748)
WARN  [2017-07-05 10:32:32,254] com.spotify.reaper.service.SegmentRunner: Failed to connect to a coordinator node for segment cb465ee3-6164-11e7-9d07-735ff78ae49c
! com.spotify.reaper.ReaperException: no host could be reached through JMX
! at com.spotify.reaper.cassandra.JmxConnectionFactory.connectAny(JmxConnectionFactory.java:78)
! at com.spotify.reaper.service.SegmentRunner.runRepair(SegmentRunner.java:156)
! at com.spotify.reaper.service.SegmentRunner.run(SegmentRunner.java:101)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
! at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
! at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
! at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
! at java.util.concurrent.FutureTask.run(FutureTask.java:266)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
! at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
! at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
! at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
! at java.lang.Thread.run(Thread.java:748)

@ostefano

ostefano commented Jul 5, 2017

For some strange reason, the repair_run table had a null coordinator for one of the segments. Running it again...

Could it be that, because no coordinator was reachable, Reaper didn't replace it successfully (hence the null)? Note that the cluster I am repairing is under quite some load.

@beevee

beevee commented Jul 6, 2017

I'm having the same problem right now with Reaper version 0.6.1, Cassandra backend, Cassandra version 2.2.9. I'm using the same Cassandra cluster that I'm repairing as a backend.

Every 30 seconds Reaper writes "Repair amount done 1536" to the log, while the web UI shows that the cluster contains 1537 segments. I don't see any errors.

@pacorulo

Hi,

We have Cassandra v3.11 and had the same issue: the run was split into 1600 segments, and it stopped and hung on segment number 1599. It was solved without restarting Cassandra or Reaper: we only used jmxterm to execute forceTerminateAllRepairSessions on all nodes of the data center, and after that Reaper started the last repair by itself and it ended successfully. Hope it helps.
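
For reference, a sketch of that jmxterm call in non-interactive mode (the jar name/version and the default 7199 JMX port are assumptions; run it against each node in turn):

    echo "run -b org.apache.cassandra.db:type=StorageService forceTerminateAllRepairSessions" | \
      java -jar jmxterm-1.0.0-uber.jar -l <node_ip>:7199 -n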

@marshallking

Has this issue been fixed and verified?

@ostefano

@marshallking afaik it has been fixed. Never had the issue again.

@sergeyyegournov

sergeyyegournov commented Jan 24, 2018

I have still seen the same behavior, although the reason is likely different. I'm running the latest 1.1.0-SNAPSHOT as of this writing, with the Cassandra backend. In my case the repair did not proceed once 1333/1334 segments were repaired. Sadly I did not check the repair_run table and the segment_state column value for the problematic segment, but looking at the logs and the source code I think the reason was that the last segment was in the RUNNING state rather than in the NOT_STARTED state.

Relevant part of the log:
INFO [2018-01-22 09:26:18,936] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Running segment for range (-9221357256650545873,-4662614694803227943]
INFO [2018-01-22 09:26:18,961] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Running segment for range (-4662614694803227943,-153033555439306155]
INFO [2018-01-22 09:26:18,986] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Running segment for range (-153033555439306155,4340134915807037722]
INFO [2018-01-22 09:26:19,016] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Running segment for range (4340134915807037722,-9221357256650545873]
INFO [2018-01-22 09:26:19,097] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Repair amount done 1333

And then it starts over again, with the same entries being logged. Looking into the source, I believe that in the startNextSegment method the call to getNextFreeSegmentInRange did not return a segment because its state was not NOT_STARTED:

LOG.info("Running segment for range {}", parallelRanges.get(rangeIndex));
Optional<RepairSegment> nextRepairSegment =
    context.storage.getNextFreeSegmentInRange(repairRunId, Optional.of(parallelRanges.get(rangeIndex)));
On the other hand, the state of the segment in question could not be DONE, because in that case the following code would mark the repair as successful and would not re-schedule it anymore:

if (segmentsDone == segmentsTotal) {
  endRepairRun();
  scheduleRetry = false;
}

So likely the state of a segment was RUNNING.
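
For anyone debugging this later, a hedged CQL sketch to check the stuck segment's state in the Cassandra backend (segment_state is the column mentioned above; the reaper_db keyspace, the id partition key and the coordinator_host column are assumptions, so adjust them to your Reaper schema version):

    SELECT segment_id, segment_state, coordinator_host
    FROM reaper_db.repair_run
    WHERE id = 39fd5ed0-fddf-11e7-baa0-c70b9b7ef308;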

What is really interesting is that after a few restarts of Reaper (unrelated to this issue) the segment finally got repaired. Before that, the Cassandra backend was down for about a minute, and after Reaper re-established the connection the following happened:
INFO [2018-01-23 22:40:21,774] [pool-1-thread-1] i.c.s.SegmentRunner - Postponing segment 39fdfb22-fddf-11e7-baa0-c70b9b7ef308
INFO [2018-01-23 22:40:21,777] [pool-1-thread-1] i.c.s.SegmentRunner - Aborting repair on segment with id 39fdfb22-fddf-11e7-baa0-c70b9b7ef308 on coordinator 10.10.123.115
INFO [2018-01-23 22:40:23,638] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308] i.c.s.RepairRunner - Next segment to run : 39fdfb22-fddf-11e7-baa0-c70b9b7ef308
INFO [2018-01-23 22:40:26,415] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308:39fdfb22-fddf-11e7-baa0-c70b9b7ef308] i.c.s.SegmentRunner - It is ok to repair segment '39fdfb22-fddf-11e7-baa0-c70b9b7ef308' on repair run with id '39fd5ed0-fddf-11e7-baa0-c70b9b7ef308'
INFO [2018-01-23 22:40:26,470] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308:39fdfb22-fddf-11e7-baa0-c70b9b7ef308] i.c.j.JmxProxy - Triggering repair of range (-1211415970840697202,-1184422912999070688] for keyspace "outbrain_document_features" on host 10.10.122.225, with repair parallelism dc_parallel, in cluster with Cassandra version '3.11.1' (can use DATACENTER_AWARE 'true'), for column families: []
INFO [2018-01-23 22:40:26,492] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308:39fdfb22-fddf-11e7-baa0-c70b9b7ef308] i.c.s.SegmentRunner - Repair for segment 39fdfb22-fddf-11e7-baa0-c70b9b7ef308 started, status wait will timeout in 1800000 millis
INFO [2018-01-23 22:49:39,205] [documentservicesonline:39fd5ed0-fddf-11e7-baa0-c70b9b7ef308:39fdfb22-fddf-11e7-baa0-c70b9b7ef308] i.c.s.SegmentRunner - Repair command 106 on segment 39fdfb22-fddf-11e7-baa0-c70b9b7ef308 returned with state DONE

So it seems the state of that segment somehow transitioned from RUNNING back to NOT_STARTED, and hence the startNextSegment method was able to pick it up.

UPDATE: I was able to reproduce this issue. In fact it has nothing to do with the segment, so I am going to open a new ticket for it.

@adejanovski
Contributor

Will be fixed by #321

@adejanovski
Contributor

That kind of issue has been fixed with the 1.1.0 release.
