
Integration tests capable of testing multiple running reaper instances #124

Merged
12 commits merged on Jul 25, 2017

Conversation

michaelsembwever
Member

Work in progress still.

Make integration tests capable of testing multiple running reaper instances

@adejanovski adejanovski force-pushed the ft-reaper-improvements-final branch from b543247 to 4992004 Compare June 26, 2017 06:27
@michaelsembwever michaelsembwever force-pushed the mck/distributed-testing branch from 8011e7f to 1e69d35 Compare June 27, 2017 08:55
michaelsembwever added a commit that referenced this pull request Jun 27, 2017
Make integration tests capable of testing multiple running reaper instances

ref: #124
@michaelsembwever michaelsembwever force-pushed the mck/distributed-testing branch from 1e69d35 to f7a205c Compare June 27, 2017 09:54
michaelsembwever added a commit that referenced this pull request Jun 27, 2017
Make integration tests capable of testing multiple running reaper instances

ref: #124
@michaelsembwever michaelsembwever changed the base branch from ft-reaper-improvements-final to mck/ft-reaper-improvements-final June 27, 2017 09:54
@michaelsembwever michaelsembwever force-pushed the mck/distributed-testing branch from f7a205c to 3c35f05 Compare June 27, 2017 10:34
michaelsembwever added a commit that referenced this pull request Jun 27, 2017
Make integration tests capable of testing multiple running reaper instances

ref: #124
@michaelsembwever michaelsembwever force-pushed the mck/distributed-testing branch from 3c35f05 to f7f54dc Compare June 27, 2017 12:07
michaelsembwever added a commit that referenced this pull request Jun 27, 2017
Make integration tests capable of testing multiple running reaper instances

ref: #124
@michaelsembwever michaelsembwever force-pushed the mck/distributed-testing branch from f7f54dc to 88a1f27 Compare June 28, 2017 07:25
michaelsembwever added a commit that referenced this pull request Jun 28, 2017
Make integration tests capable of testing multiple running reaper instances

ref: #124
@michaelsembwever michaelsembwever force-pushed the mck/distributed-testing branch from 88a1f27 to 1736fbc Compare June 28, 2017 12:01
michaelsembwever added a commit that referenced this pull request Jun 28, 2017
Make integration tests capable of testing multiple running reaper instances

ref: #124
@michaelsembwever michaelsembwever force-pushed the mck/distributed-testing branch from 1736fbc to 522f1aa Compare June 29, 2017 10:33
michaelsembwever added a commit that referenced this pull request Jun 29, 2017
Make integration tests capable of testing multiple running reaper instances

ref: #124
@tlpsonarqube
Collaborator

SonarQube analysis reported 4 issues

  • MAJOR 2 major
  • MINOR 2 minor

Watch the comments in this conversation to review them.

3 extra issues

Note: The following issues were found on lines that were not modified in the pull request. Because these issues can't be reported as line comments, they are summarized here:

  1. MAJOR CassandraStorage.java#L491: Merge this if statement with the enclosing one. rule
  2. MAJOR CassandraStorage.java#L796: Remove this "return" statement or make it conditional. rule
  3. MINOR SchedulingManager.java#L8: Remove this unused import 'com.spotify.reaper.core.Cluster'. rule

private boolean manageSchedule(RepairSchedule schedule_) {
  ...
  if (!fetchedUnit.isPresent()) {
    LOG.warn("RepairUnit with id {} not found", schedule.getRepairUnitId());
    return false;
Collaborator

MINOR Rename this local variable to match the regular expression '^[a-z][a-zA-Z0-9]*$'. rule

.runHistory(newRunHistory)
.build(schedule.getId()));

if (result && !repairRunAlreadyScheduled(schedule, repairUnit)) {
Member Author

This repairRunAlreadyScheduled(..) needs to be removed.

@michaelsembwever michaelsembwever force-pushed the mck/distributed-testing branch from 522f1aa to 3d8d415 Compare June 30, 2017 11:25
michaelsembwever added a commit that referenced this pull request Jun 30, 2017
Make integration tests capable of testing multiple running reaper instances

ref: #124
@michaelsembwever michaelsembwever force-pushed the mck/distributed-testing branch from 3d8d415 to 9325bb8 Compare July 2, 2017 08:40
michaelsembwever added a commit that referenced this pull request Jul 2, 2017
Make integration tests capable of testing multiple running reaper instances

ref: #124
@michaelsembwever
Member Author

@adejanovski
this is ready for further testing.

Currently the repair and incremental repair scenarios break (most of the time) when running ReaperCassandraIT, due to valid concurrency problems.
(When they fail, note that they can also cause other scenarios to fail.)

The problem is exacerbated as you add ccm nodes, increase RF, and increase concurrency in ReaperCassandraIT (see line 50).

The concurrency problems do appear to be those I raised concerns about earlier in #49 (comment) and #49 (comment).

@adejanovski adejanovski force-pushed the mck/distributed-testing branch from 00e48c7 to 8247da9 Compare July 4, 2017 07:56
@@ -767,7 +767,7 @@ public void we_wait_for_at_least_segments_to_be_repaired(int nbSegmentsToBeRepai
assertEquals(Response.Status.OK.getStatusCode(), response.getStatus());
String responseData = response.readEntity(String.class);
RepairRunStatus run = SimpleReaperClient.parseRepairRunStatusJSON(responseData);
- return nbSegmentsToBeRepaired == run.getSegmentsRepaired();
+ return nbSegmentsToBeRepaired <= run.getSegmentsRepaired();
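The switch from `==` to `<=` can be sketched as follows (a hedged illustration, not the project's code): with several reaper instances repairing in parallel, more than one segment can complete between two polls, so an exact-equality check can be stepped over and the test would spin forever, while tolerating overshoot keeps the wait robust.

```java
// Illustrative sketch: why the wait condition tolerates overshooting
// the target number of repaired segments.
public class SegmentWaitSketch {

    /** Equivalent in spirit to: nbSegmentsToBeRepaired <= run.getSegmentsRepaired() */
    public static boolean enoughRepaired(int target, int repairedSoFar) {
        return repairedSoFar >= target;
    }

    public static void main(String[] args) {
        // two instances each finished a segment between polls: 1 -> 3
        System.out.println(enoughRepaired(2, 3)); // true, even though 2 was never observed exactly
        System.out.println(enoughRepaired(2, 2)); // true, an exact match still passes
    }
}
```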
Member Author

👍

@adejanovski adejanovski force-pushed the mck/distributed-testing branch from 8247da9 to dfc3c68 Compare July 4, 2017 08:35
@michaelsembwever
Member Author

rebased off ft-reaper-improvements-final

@michaelsembwever michaelsembwever force-pushed the mck/distributed-testing branch from dfc3c68 to 8839605 Compare July 4, 2017 10:06
michaelsembwever added a commit that referenced this pull request Aug 8, 2017
Running the SchedulingManager concurrently (in different processes) leads to parallel repair runs being spawned at the same time.

This has been implemented by
 - making reads and writes to the repair_schedule table strictly consistent (quorum),
 - updating (writing to storage) the repair_schedule incrementally in the SchedulingManager.manageSchedule(..) codepath, and
 - multiple checks (reads from storage) through the same codepath.

 ref: #124
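The re-check-before-spawn guard described above can be sketched, storage-free, with a compare-and-set (a hedged illustration; `ScheduleGuard` and its method names are hypothetical, not the project's code): each pass through the scheduling codepath re-reads the schedule's next-activation marker, and only one concurrent updater wins, so two SchedulingManager processes cannot both spawn a run for the same activation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the "multiple checks + incremental update" pattern.
// In the real code the re-read and conditional write go against Cassandra at
// QUORUM; here an AtomicLong stands in for the shared storage.
public class ScheduleGuard {

    private final AtomicLong nextActivation;

    public ScheduleGuard(long firstActivation) {
        this.nextActivation = new AtomicLong(firstActivation);
    }

    /** Returns true for exactly one caller per activation time. */
    public boolean tryTakeActivation(long activation, long followingActivation) {
        // succeed only if the stored value is still the one we read,
        // mirroring an incremental, strictly consistent write
        return nextActivation.compareAndSet(activation, followingActivation);
    }
}
```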
michaelsembwever added a commit that referenced this pull request Aug 8, 2017
…tances.

Introducing a fault-tolerant reaper (where multiple reaper instances can run and coordinate ongoing repairs on top of a shared Cassandra backend storage) requires extensive testing.
While the UI is by default an eventually consistent experience, the code design is not. By extending the testing framework to support multiple (and unstable/flapping) reaper instances,
and parallel and duplicate HTTP requests to the RESTful interface, it becomes possible to test such coordination and the required 'partition tolerance' on the backend storage.

Changes made:
 - a number of REST HTTP status code responses were corrected, i.e. better usage of METHOD_NOT_ALLOWED, NOT_MODIFIED, NOT_FOUND,
 - a number of REST resource methods were strengthened, ensuring the correct error HTTP status codes are returned,
 - the JMX connect timeout was made configurable,
 - all c* requests are marked as idempotent,
 - in `CassandraStorage.getSegment(..)` return a random segment which hasn't yet been started, instead of the one with the lowest failCount,
 - in BasicSteps reaper instances can be added and removed concurrently, but synchronized by test method,
 - in BasicSteps parallel-stream requests through all reaper instances where appropriate; where not, pick a random instance to send the request through,
 - in BasicSteps accept a set of possible HTTP status codes from the response, as with multiple put/post requests all but one will fail in some manner,
 - in BasicSteps append assertions after the multiple possible HTTP status codes have been checked, to ensure consistency in future HTTP status response codes,
 - ReaperCassandraIT is parameterised, via the system properties "grim.reaper.min" and "grim.reaper.max", for how many stable and flapping reaper instances are to be used,
 - in ReaperCassandraIT put a timeout on how long we keep retrying to drop the test keyspace,
 - ReaperTestJettyRunner needed a little redesign to allow multiple instances per JVM,
 - move TestUtils methods into BasicSteps,
 - put a timeout on the mutex waits in RepairRunnerTest.

ref: #124
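The random-segment pick in the list above can be sketched like this (a hedged illustration with hypothetical types, not `CassandraStorage` itself): choosing a random not-yet-started segment, rather than always the lowest-failCount one, spreads concurrent reaper instances across different segments and reduces contention over the same segment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Random;

// Hypothetical sketch of picking a random not-yet-started segment.
public class SegmentPicker {

    private static final Random RAND = new Random();

    public static class Segment {
        public final int id;
        public final boolean started;

        public Segment(int id, boolean started) {
            this.id = id;
            this.started = started;
        }
    }

    /** Returns a random segment that hasn't been started yet, if any remain. */
    public static Optional<Segment> nextSegment(List<Segment> segments) {
        List<Segment> notStarted = new ArrayList<>();
        for (Segment s : segments) {
            if (!s.started) {
                notStarted.add(s);
            }
        }
        return notStarted.isEmpty()
            ? Optional.empty()
            : Optional.of(notStarted.get(RAND.nextInt(notStarted.size())));
    }
}
```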
michaelsembwever added a commit that referenced this pull request Aug 8, 2017
…om an elected leader.

It's too easy for the code to evolve with reads and writes occurring outside the leader-election mechanism.
This makes the code difficult to reason about and to test. By putting the asserts in place we enforce a constraint that simplifies the design.

The take on the leader election has been moved up to SegmentRunner, as this is where all other leader-election actions are taken.

ref: #124
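The assertion guard described above can be sketched as follows (a hedged illustration; `LeaderGuard` and its methods are hypothetical, not the project's code): every segment write first asserts that this process currently holds the leader lease for that segment, so reads and writes cannot drift outside the leader-election mechanism.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: asserting that repair-segment writes only come
// from the elected leader for that segment.
public class LeaderGuard {

    private final Set<UUID> heldLeases = ConcurrentHashMap.newKeySet();

    /** Take the lead on a segment; returns false if already held. */
    public boolean takeLead(UUID segmentId) {
        return heldLeases.add(segmentId);
    }

    public void releaseLead(UUID segmentId) {
        heldLeases.remove(segmentId);
    }

    /** Any storage write must pass this guard first. */
    public void updateSegment(UUID segmentId) {
        if (!heldLeases.contains(segmentId)) {
            throw new AssertionError("not leader for segment " + segmentId);
        }
        // ... perform the storage write here ...
    }
}
```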
michaelsembwever added a commit that referenced this pull request Aug 8, 2017
   - configure travis to run separate jobs for multi-concurrency ReaperCassandraIT executions.
   - reduce ccm count and memory use
   - increase ccm request timeouts (tests can run very slowly)
   - after a failure display ccm errors
   - upgrade to dropwizard 1.0.8
   - upgrade jersey-client to 2.25.1 due to a bug where requests can be sent twice (still a problem, but it happens less)
   - use `allow_failures` on the longer "smoke" tests

  ref: #124
michaelsembwever added a commit that referenced this pull request Aug 8, 2017
…sful requests they have little impact on design and usability.

ref:
 - #124
 - #129
michaelsembwever added a commit that referenced this pull request Aug 8, 2017
michaelsembwever pushed a commit that referenced this pull request Aug 8, 2017
…connection didn't fail in the past.

Allow only a single Reaper instance to process an incremental repair segment through leader election
Lower host metrics TTL to 3 minutes
 ref: #124
michaelsembwever added a commit that referenced this pull request Aug 8, 2017
michaelsembwever referenced this pull request again on Aug 23, 2017, with the same set of commit messages as above.