Integration tests capable of testing multiple running reaper instances #124
Conversation
SonarQube analysis reported 4 issues. Watch the comments in this conversation to review them.

3 extra issues
Note: The following issues were found on lines that were not modified in the pull request. Because these issues can't be reported as line comments, they are summarized here:
if (!fetchedUnit.isPresent()) {
  LOG.warn("RepairUnit with id {} not found", schedule.getRepairUnitId());
  return false;

private boolean manageSchedule(RepairSchedule schedule_) {
.runHistory(newRunHistory)
.build(schedule.getId()));

if (result && !repairRunAlreadyScheduled(schedule, repairUnit)) {
this `repairRunAlreadyScheduled(..)` needs to be removed.
@adejanovski currently the repair and incremental repair scenarios break (most of the time) when running ReaperCassandraIT due to valid concurrency problems¹. the problem is exacerbated as you add ccm nodes, increase RF, and increase concurrency in ReaperCassandraIT (see line 50). the concurrency problems do appear to be those i raised concerns about earlier in #49 (comment) and #49 (comment)
@@ -767,7 +767,7 @@ public void we_wait_for_at_least_segments_to_be_repaired(int nbSegmentsToBeRepai
      assertEquals(Response.Status.OK.getStatusCode(), response.getStatus());
      String responseData = response.readEntity(String.class);
      RepairRunStatus run = SimpleReaperClient.parseRepairRunStatusJSON(responseData);
-     return nbSegmentsToBeRepaired == run.getSegmentsRepaired();
+     return nbSegmentsToBeRepaired <= run.getSegmentsRepaired();
👍
rebased off
Running the SchedulingManager concurrently (in different processes) leads to parallel repair runs being spawned at the same time. This has been addressed by:
- making reads and writes to the repair_schedule table strictly consistent (quorum),
- updating (writes to storage) the repair_schedule incrementally in the SchedulingManager.manageSchedule(..) codepath, and
- multiple checks (reads from storage) through the same codepath.

ref: #124
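A minimal sketch of what quorum-consistent reads and writes on `repair_schedule` look like with the DataStax Java driver 3.x; the keyspace, column names and query text here are illustrative assumptions, not the actual Reaper schema or code.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import java.util.UUID;

public class QuorumScheduleAccess {

  public static void main(String[] args) {
    UUID scheduleId = UUID.randomUUID();
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("reaper_db")) {

      // read the schedule at QUORUM so concurrent SchedulingManager processes
      // observe each other's incremental updates
      SimpleStatement read = new SimpleStatement(
          "SELECT * FROM repair_schedule WHERE id = ?", scheduleId);
      read.setConsistencyLevel(ConsistencyLevel.QUORUM);
      session.execute(read);

      // write each incremental change back at QUORUM as well
      SimpleStatement write = new SimpleStatement(
          "UPDATE repair_schedule SET state = ? WHERE id = ?", "RUNNING", scheduleId);
      write.setConsistencyLevel(ConsistencyLevel.QUORUM);
      session.execute(write);
    }
  }
}
```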
…tances. Introducing a fault tolerant reaper (where multiple reaper instances could run and coordinate ongoing repairs on top of a shared Cassandra backend storage) requires extensive testing. While the UI is by default an eventually consistent experience, the code design is not. By extending the testing framework to support multiple (and unstable/flapping) reaper instances, and parallel and duplicate http requests to the RESTful interface, it becomes possible to test such coordination and the required 'partition tolerance' on the backend storage.

Changes made:
- a number of REST http status code responses were corrected, i.e. better usage of: METHOD_NOT_ALLOWED, NOT_MODIFIED, NOT_FOUND,
- a number of REST resource methods were strengthened, ensuring the correct error http status codes were returned,
- making the JMX connect timeout configurable,
- marking all c* requests as idempotent,
- in `CassandraStorage.getSegment(..)` return a random segment which hasn't yet been started, instead of that with the lowest failCount,
- in BasicSteps reaper instances can be added and removed concurrently, but synchronized by test method,
- in BasicSteps parallel stream requests through all reaper instances where appropriate, otherwise pick a random instance to send the request through,
- in BasicSteps accept a set of possible http status codes from the response, as multiple put/post requests mean all but one will fail in some manner,
- in BasicSteps append assertions after multiple possible http status codes have been checked, to ensure a resulting consistency in future http status response codes,
- ReaperCassandraIT is parameterised, via the system properties "grim.reaper.min" and "grim.reaper.max", for how many stable and flapping reaper instances are to be used (see the sketch after this list),
- in ReaperCassandraIT put a timeout on how long we'll keep retrying to drop the test keyspace,
- ReaperTestJettyRunner needed a little redesign to allow multiple instances per jvm,
- move TestUtils methods into BasicSteps,
- put a timeout on the mutex waits in RepairRunnerTest.

ref: #124
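A minimal sketch of how the "grim.reaper.min" / "grim.reaper.max" parameterisation can be read from JVM system properties; the class name and the default values are illustrative assumptions, not the actual ReaperCassandraIT code.

```java
public final class GrimReaperSettings {

  // Integer.getInteger(..) reads -Dgrim.reaper.min=N / -Dgrim.reaper.max=N from
  // the JVM system properties, falling back to the (assumed) defaults given here.
  static final int GRIM_REAPER_MIN = Integer.getInteger("grim.reaper.min", 1);
  static final int GRIM_REAPER_MAX = Integer.getInteger("grim.reaper.max", 2);

  public static void main(String[] args) {
    System.out.printf("grim.reaper.min=%d, grim.reaper.max=%d%n",
        GRIM_REAPER_MIN, GRIM_REAPER_MAX);
  }

  private GrimReaperSettings() {}
}
```

The property names come from the commit message; they would be set with `-Dgrim.reaper.min=…` / `-Dgrim.reaper.max=…` on the JVM running the integration tests.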
…om an elected leader. It's too easy for the code to evolve with reads and writes occurring outside the leader-election mechanism. This makes the code difficult to reason about and to test. By putting the asserts in place we enforce a constraint that simplifies the design. Taking the leader election has been moved up to SegmentRunner, as this is where all other leader-election actions are taken.

ref: #124
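A minimal sketch of the kind of assertion being enforced: storage writes for a segment are only allowed while this process holds the lead on it. The class, field and method names are hypothetical, not Reaper's actual leader-election API, and the in-memory set stands in for the real storage-backed election.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class LeaderGuardedWrites {

  // segments this process currently holds the lead on (stand-in for real election state)
  private final Set<UUID> ledSegments = ConcurrentHashMap.newKeySet();

  boolean takeLead(UUID segmentId) {
    // in the real system this would be a lock / lightweight transaction in storage
    return ledSegments.add(segmentId);
  }

  void releaseLead(UUID segmentId) {
    ledSegments.remove(segmentId);
  }

  void updateSegment(UUID segmentId /*, new state ... */) {
    // enforce the invariant: writes must only come from the elected leader
    assert ledSegments.contains(segmentId)
        : "segment " + segmentId + " updated without holding the lead";
    // ... write to storage ...
  }
}
```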
- configure travis to run separate jobs for multi-concurrency ReaperCassandraIT executions,
- reduce ccm node count and memory use,
- increase ccm request timeouts (tests can run very slowly),
- after a failure, display ccm errors,
- upgrade to dropwizard 1.0.8,
- upgrade jersey-client to 2.25.1, due to a bug where requests can be sent twice (still a problem but happens less),
- use `allow_failures` on the longer "smoke" tests.

ref: #124
…connection didn't fail in the past. Allow only a single Reaper instance to process an incremental repair segment, through leader election. Lower the host metrics TTL to 3 minutes.

ref: #124
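A minimal sketch of taking the lead on an incremental repair segment with a Cassandra lightweight transaction, so that only the single instance whose conditional insert is applied goes on to process the segment. The leader table name, its columns, and the TTL value are illustrative assumptions, not Reaper's actual schema.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import java.util.UUID;

public class SegmentLeadTaker {

  public static void main(String[] args) {
    UUID segmentId = UUID.randomUUID();
    UUID reaperInstanceId = UUID.randomUUID();

    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("reaper_db")) {

      // the LWT ("IF NOT EXISTS") is applied for exactly one contending instance;
      // the TTL lets a lead held by a crashed instance expire on its own
      SimpleStatement takeLead = new SimpleStatement(
          "INSERT INTO leader (segment_id, reaper_instance) VALUES (?, ?) IF NOT EXISTS USING TTL 180",
          segmentId,
          reaperInstanceId);
      takeLead.setConsistencyLevel(ConsistencyLevel.QUORUM);

      ResultSet result = session.execute(takeLead);
      System.out.println("took lead on segment " + segmentId + ": " + result.wasApplied());
    }
  }
}
```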
Work in progress still.
Make integration tests capable of testing multiple running reaper instances