Reaper won't repair last segment #92
Comments
Hi @RolandOtta, which version of Cassandra are you using? Thanks,
Hi @adejanovski, thank you for your response. After switching to log level DEBUG, I get dozens of these messages between the earlier mentioned "Repair amount done ..." messages:

DEBUG [2017-05-09 22:30:15,565] [productioncluster:36] c.s.r.s.RepairRunner - No repair segment available for range (9113347751948521685,9133161811851011073]

Thanks,
Hi @RolandOtta, for some reason (race condition?) it looks like all your segments were repaired, but Reaper still thinks it should repair the last one, and the whole job wasn't marked as DONE. Could you send me (alex[at]thelastpickle[dot]com) or link to a dump of the repair_run and repair_segment tables? This way I could check if the two tables are in an inconsistent state. Thanks!
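The inconsistency being hinted at can be sketched as follows. This is a minimal, hypothetical model of the check, not Reaper's actual schema or code: the run row and the segment rows are stand-in dicts, and the signature of the bug is a run still marked RUNNING while every segment reads DONE.

```python
# Hypothetical consistency check between a repair run's state and its
# segments. Table layouts are simplified stand-ins, not Reaper's schema.
def find_inconsistency(run_state, segments):
    """Return a description of the bug's signature, or None if consistent."""
    not_done = [s for s in segments if s["state"] != "DONE"]
    if run_state == "RUNNING" and not not_done:
        # Every segment is DONE, yet the run was never marked DONE.
        return "all segments DONE but run still RUNNING"
    return None

segments = [{"id": i, "state": "DONE"} for i in range(4096)]
print(find_inconsistency("RUNNING", segments))
```

Run against a dump of both tables, this kind of cross-check would confirm whether the two are out of sync.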
Hi @RolandOtta, sorry it took a while to address that issue. Can you try with this release and tell me if that fixes your issue? Thanks!
Hi @adejanovski, it seems that the fix does not work for me.

DEBUG [2017-06-06 13:37:06,437] [productioncluster:36] c.s.r.s.RepairRunner - run() called for repair run #36 with run state RUNNING

It is trying again and again to repair the same segment every 30 seconds. br,
Hi Roland, I'm back on that issue after a little while. Could you confirm? Thanks |
Hi @adejanovski, I might be stumbling on the same issue (do tell me if you want me to open a new one though), but I am not sure. Again, Cassandra backend (3.0.14), and I have the last segment continuously postponed. The UI was showing the following status: [screenshot elided]

After enabling the logs I get the following: [log output elided]

For some strange reason the table [elided]. Could it be that, because no coordinator was reachable, Reaper didn't replace it successfully (hence the [elided])?
I'm having the same problem right now with Reaper version 0.6.1, Cassandra backend, Cassandra version 2.2.9. I'm using the same Cassandra cluster that I'm repairing as the backend. Every 30 seconds Reaper writes "Repair amount done 1536" in the log, while the web UI shows that the cluster contains 1537 segments. I don't see any errors.
Hi, we have Cassandra v3.11 and we had the same issue: with the repair split into 1600 segments, it stopped and hung on segment number 1599. It was solved without restarting Cassandra or Reaper. We only used jmxterm to execute forceTerminateAllRepairSessions on all nodes of the data center, and after that Reaper started the last repair by itself and it ended successfully. Hope it helps.
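The workaround above can be sketched roughly as follows. This is a hypothetical helper, not a verified procedure: the node list, jmxterm jar path, and JMX port 7199 are assumptions to adapt; forceTerminateAllRepairSessions is an operation on Cassandra's StorageService MBean.

```python
# Hypothetical sketch: build a jmxterm invocation per node to terminate all
# active repair sessions. Adjust host list, jar path, and port for your DC.
JMX_OP = "run -b org.apache.cassandra.db:type=StorageService forceTerminateAllRepairSessions"

def jmxterm_cmd(host, port=7199):
    """jmxterm command line for one node (-n = non-interactive mode)."""
    return ["java", "-jar", "jmxterm.jar", "-l", f"{host}:{port}", "-n"]

for host in ["10.0.0.1", "10.0.0.2"]:  # hypothetical data-center node list
    cmd = jmxterm_cmd(host)
    print(" ".join(cmd))
    # For a live run, pipe JMX_OP into the process on each node, e.g.:
    # subprocess.run(cmd, input=JMX_OP, text=True)
```

Repeating this on every node of the data center matches what the commenter describes; Reaper should then pick up the last segment on its own.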
Has this issue been fixed and verified? |
@marshallking afaik it has been fixed. Never had the issue again. |
I have still seen the same behavior, although the reason is likely different. I'm running the latest 1.1.0-SNAPSHOT as of writing this; the backend is Cassandra. In my case the repair did not proceed when 1333/1334 segments were repaired. Sadly I did not check the repair_run table and the segment_state column value for the problematic segment, but looking into the logs and source code I think the reason was that, for some reason, the last segment was in the RUNNING state rather than NOT_STARTED. Relevant part of the log: [log elided] And then it starts over again with the same entries being logged. Looking into the source, I believe that in the startNextSegment method the call to getNextFreeSegmentInRange did not return a segment because its state was not NOT_STARTED:
So likely the state of the segment was RUNNING. What is really interesting is that after a few restarts of Reaper (unrelated to this issue) it finally got repaired. Before it got repaired, the Cassandra backend was down for about a minute, and after the connection was re-established by Reaper the following happened: [log elided] So it seems that the state of that segment somehow got transitioned from RUNNING to NOT_STARTED, and hence the startNextSegment method was able to pick it up. UPDATE: I was able to reproduce this issue. In fact it has nothing to do with the segment, so I am going to open a new ticket for this.
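The selection behavior described above can be sketched as follows. Names and shapes here are illustrative stand-ins, not Reaper's actual code: the point is that only NOT_STARTED segments are eligible, so a segment stuck in RUNNING is skipped forever and the runner just retries every 30 seconds.

```python
# Hypothetical sketch of getNextFreeSegmentInRange-style selection: a segment
# stuck in RUNNING is never returned, so the repair run can never finish.
NOT_STARTED, RUNNING, DONE = "NOT_STARTED", "RUNNING", "DONE"

def next_free_segment(segments):
    """Return the first segment eligible to start, or None if none is."""
    for seg in segments:
        if seg["state"] == NOT_STARTED:
            return seg
    return None  # nothing eligible -> the runner logs and retries in 30s

# 1333 segments done, the last one wrongly stuck in RUNNING:
segments = [{"id": i, "state": DONE} for i in range(1333)] + [{"id": 1333, "state": RUNNING}]
print(next_free_segment(segments))  # the stuck segment is never picked
```

This also matches the recovery observed: once the stuck segment's state flipped back to NOT_STARTED after the backend reconnect, the same selection immediately returned it.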
Will be fixed by #321 |
That kind of issue has been fixed with the 1.1.0 release. |
Hi there,

I have the following problem in my production cluster.

I started a full repair, which took days. Now Reaper has done 4095 segments out of 4096, but it seems it refuses to do the last segment. I couldn't find anything suspicious in the Reaper or Cassandra logs, and there are also no pending repairs in the cluster.

I paused the repair and reactivated it, but that did not solve the problem. Restarting Reaper did not help either.

The only thing Reaper is currently writing to its log is the following line every 30 seconds:
INFO [2017-05-05 08:27:41,929] [productioncluster:36] c.s.r.s.RepairRunner - Repair amount done 4095