Reaper still postponing segments #85
Hi @kamenik, I've seen this before indeed. It seems like something got stuck on your cluster, maybe a validation compaction, and that prevents Reaper from running any more repairs. Let me know how it goes after the rolling restart and whether the problem still shows up with the latest master. Thanks
I have deleted reaper_db and restarted the Cassandra cluster, but it is still the same. Good news: the negative timeout value bug disappeared :).
Could you try setting the logging level to DEBUG in Reaper and send me the output after a few minutes of running? Thanks
OK, this is after the restart and with a clean reaper_db. nodetool_tpstats.txt
According to what I see in your outputs, things are currently going OK. What we see now is very different from what you had previously, with Reaper claiming it saw a repair running on a node but having no trace of it in its own database and then trying to kill it. That was likely related to the negative timeout bug. So far, I'd say things look good. Please update the ticket with your progress. Thanks
Thank you for the help. I would also like to ask about Reaper performance. I ran one unscheduled repair with default values and, surprisingly, it took much longer than a full repair on every server and even consumed much more CPU. The first five peaks are full repairs; the data from 15:45 to 19:00 is the Reaper repair. What do you think about it?
There are several possibilities here. One is that there are performance problems with the Cassandra backend in Reaper.
You can try using another backend, H2 for example, and compare results. Does the first part of the graph (the 5 spikes) show the total time for the full repair to run on all nodes?
nodetool --host NODE_IP repair --full is called from one server against all nodes, one after another. We have Cassandra 3.10 on all servers. I will try to run it with a different backend and we will see. Trying H2 now; it has lots of these stacktraces in the log, but it is running.
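For reference, a minimal sketch of the sequential full-repair run described above (the hostnames are placeholders, and nodetool is assumed to be on the PATH of the machine running the script):

```python
# Sketch of a sequential full repair run from a single server: nodetool is
# invoked once per node, one after another, and each run is timed so it can
# be compared against a Reaper run.
import subprocess
import time

NODES = ["node1.example.com", "node2.example.com", "node3.example.com"]  # placeholder hostnames

for node in NODES:
    start = time.time()
    subprocess.run(["nodetool", "--host", node, "repair", "--full"], check=True)
    print(f"{node}: full repair took {time.time() - start:.0f}s")
```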
Yesterday's results: there was no traffic on the cluster, only repairing. All repairs were parallel, full, intensity 0.95; the only difference was the backend. The C* data (csstats.txt) is from roughly the middle of the first C* run. It seems there is no (or only a small) difference between the run with the UI and without it.
Hi @kamenik, thanks for the results! Thanks again for the time you spent on this!
No problem :). I ran it with Cassandra only today, and it seems there is some problem with the intensity settings too. You can see it on the graph: I set intensity to 0.95, 0.5, 0.25, 0.125, 0.0625 (only the beginning of the last run is shown). Also, it says all segments are repaired some time before the switch to state DONE (marked by the red lines); is there some DB cleanup at the end?
At this point the ticket has changed from being about postponed segments, now resolved, to being about Reaper performance with the Cassandra backend. Could we either close this ticket and move the comments into a new ticket, or rename this ticket?
…ing. Example of logging queries to a separate file found in cassandra-reaper-cassandra.yaml ref: #85
Following up on this in #94
@kamenik: I've created a branch that fixes the performance issues you've been experiencing with the Cassandra backend. Could you build and try the following branch? https://github.com/thelastpickle/cassandra-reaper/tree/alex/fix-parallel-repair-computation TL;DR: the number of parallel repairs was computed based on the number of tokens rather than the number of nodes. If you use vnodes, Reaper computes a high value and you end up with 15 threads competing to run repairs on your cluster. I've fixed this by using the number of nodes instead and added a local cache to lighten the load on C*. Thanks
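A toy illustration of the scale difference described above (the numbers below are hypothetical and this is not Reaper's actual computation):

```python
# With vnodes, counting token ranges instead of nodes hugely inflates any
# pool size derived from it; counting nodes keeps it small and bounded.
nodes = 3
vnodes_per_node = 256            # a common num_tokens setting (assumed here)

token_ranges = nodes * vnodes_per_node
print(f"token ranges in the cluster: {token_ranges}")  # 768 -> far too many parallel repairs if used directly
print(f"nodes in the cluster:        {nodes}")         # 3   -> a small, bounded number of runners
```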
@adejanovski: Thanks, it is much better now :). The big peak at the beginning is the full repair; the rest is Reaper with the C* backend. Interestingly, there is no difference between the runs with intensity 0.95 and 0.1. (Graphs: Intensity 0.95, Intensity 0.1.)
Great to see the improvements on your charts. Intensity probably makes no visible difference here because your segments are very fast to run (within seconds, I guess): if you spend 1 s repairing a segment, then intensity 0.1 will only wait (1/0.1 - 1) = 9 s before the next one. The upcoming merge will bring many more improvements to the data model and make good use of the Cassandra row cache for segments.
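A small sketch of that back-off, assuming (as the 1 s example above implies) that the pause scales linearly with how long the segment took to repair:

```python
# Intensity back-off: after a segment finishes, wait
# segment_duration * (1/intensity - 1) before starting the next segment.
def post_segment_delay(segment_duration_s: float, intensity: float) -> float:
    return segment_duration_s * (1.0 / intensity - 1.0)

print(post_segment_delay(1.0, 0.1))   # 9.0   -> a 9 s pause after a 1 s segment
print(post_segment_delay(1.0, 0.95))  # ~0.05 -> barely any pause, hence the near-identical charts
```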
Hi guys,
I am trying to use Reaper on a test cluster, but I am stuck with this issue. After a few test runs it starts to postpone segment repairs; it seems that it tries to start the same segment twice, the second run fails, and the segment is postponed. I tried deleting the reaper_db keyspace to reset it, but it did not help. Any ideas?
A few lines from the beginning of the log: