Integration tests capable of testing multiple running reaper instances #124
Conversation
SonarQube analysis reported 4 issues. Watch the comments in this conversation to review them.

3 extra issues
Note: The following issues were found on lines that were not modified in the pull request. Because these issues can't be reported as line comments, they are summarized here:
if (!fetchedUnit.isPresent()) {
  LOG.warn("RepairUnit with id {} not found", schedule.getRepairUnitId());
  return false;

private boolean manageSchedule(RepairSchedule schedule_) {
.runHistory(newRunHistory)
.build(schedule.getId()));

if (result && !repairRunAlreadyScheduled(schedule, repairUnit)) {
this `repairRunAlreadyScheduled(..)` needs to be removed.
@adejanovski currently the repair and incremental repair scenarios break (most of the time) when running ReaperCassandraIT due to valid concurrency problems¹. the problem is exacerbated as you add ccm nodes, increase RF, and increase concurrency in ReaperCassandraIT (see line 50). the concurrency problems do appear to be those i raised concerns about earlier in #49 (comment) and #49 (comment)
@@ -767,7 +767,7 @@ public void we_wait_for_at_least_segments_to_be_repaired(int nbSegmentsToBeRepai
      assertEquals(Response.Status.OK.getStatusCode(), response.getStatus());
      String responseData = response.readEntity(String.class);
      RepairRunStatus run = SimpleReaperClient.parseRepairRunStatusJSON(responseData);
-     return nbSegmentsToBeRepaired == run.getSegmentsRepaired();
+     return nbSegmentsToBeRepaired <= run.getSegmentsRepaired();
👍
rebased off
Running the SchedulingManager concurrently (in different processes) leads to parallel repair runs being spawned at the same time. This has been addressed by:
- making reads and writes to the repair_schedule table strictly consistent (quorum),
- updating (writes to storage) the repair_schedule incrementally in the SchedulingManager.manageSchedule(..) codepath, and
- multiple checks (reads from storage) through the same codepath.

ref: #124
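A minimal sketch of what quorum-consistent reads and writes on `repair_schedule` look like with the DataStax Java driver 3.x; the keyspace, column names and query text here are illustrative assumptions, not the actual Reaper schema or code.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import java.util.UUID;

public class QuorumScheduleAccess {

  public static void main(String[] args) {
    UUID scheduleId = UUID.randomUUID();
    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("reaper_db")) {

      // read the schedule at QUORUM so concurrent SchedulingManager processes
      // observe each other's incremental updates
      SimpleStatement read = new SimpleStatement(
          "SELECT * FROM repair_schedule WHERE id = ?", scheduleId);
      read.setConsistencyLevel(ConsistencyLevel.QUORUM);
      session.execute(read);

      // write each incremental change back at QUORUM as well
      SimpleStatement write = new SimpleStatement(
          "UPDATE repair_schedule SET state = ? WHERE id = ?", "RUNNING", scheduleId);
      write.setConsistencyLevel(ConsistencyLevel.QUORUM);
      session.execute(write);
    }
  }
}
```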
…tances. Introducing a fault tolerant reaper (where multiple reaper instances could run and coordinate ongoing repairs on top of a shared Cassandra backend storage) requires extensive testing. While the UI is by default an eventually consistent experience, the code design is not. By extending the testing framework to support multiple (and unstable/flapping) reaper instances, and parallel and duplicate http requests to the RESTful interface, it becomes possible to test such coordination and the required 'partition tolerance' on the backend storage.

Changes made:
- a number of REST http status code responses were corrected, i.e. better usage of: METHOD_NOT_ALLOWED, NOT_MODIFIED, NOT_FOUND,
- a number of REST resource methods were strengthened, ensuring the correct error http status codes were returned,
- making the JMX connect timeout configurable,
- marking all c* requests as idempotent,
- in `CassandraStorage.getSegment(..)` return a random segment which hasn't yet been started, instead of that with the lowest failCount,
- in BasicSteps reaper instances can be added and removed concurrently, but synchronized by test method,
- in BasicSteps parallel stream requests through all reaper instances where appropriate, otherwise pick a random instance to send the request through,
- in BasicSteps accept a set of possible http status codes from the response, as multiple put/post requests mean all but one will fail in some manner,
- in BasicSteps append assertions after multiple possible http status codes have been checked, to ensure a resulting consistency in future http status response codes,
- ReaperCassandraIT is parameterised, via the system properties "grim.reaper.min" and "grim.reaper.max", for how many stable and flapping reaper instances are to be used (see the sketch after this list),
- in ReaperCassandraIT put a timeout on how long we'll keep retrying to drop the test keyspace,
- ReaperTestJettyRunner needed a little redesign to allow multiple instances per jvm,
- move TestUtils methods into BasicSteps,
- put a timeout on the mutex waits in RepairRunnerTest.

ref: #124
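A minimal sketch of how the "grim.reaper.min" / "grim.reaper.max" parameterisation can be read from JVM system properties; the class name and the default values are illustrative assumptions, not the actual ReaperCassandraIT code.

```java
public final class GrimReaperSettings {

  // Integer.getInteger(..) reads -Dgrim.reaper.min=N / -Dgrim.reaper.max=N from
  // the JVM system properties, falling back to the (assumed) defaults given here.
  static final int GRIM_REAPER_MIN = Integer.getInteger("grim.reaper.min", 1);
  static final int GRIM_REAPER_MAX = Integer.getInteger("grim.reaper.max", 2);

  public static void main(String[] args) {
    System.out.printf("grim.reaper.min=%d, grim.reaper.max=%d%n",
        GRIM_REAPER_MIN, GRIM_REAPER_MAX);
  }

  private GrimReaperSettings() {}
}
```

The property names come from the commit message; they would be set with `-Dgrim.reaper.min=…` / `-Dgrim.reaper.max=…` on the JVM running the integration tests.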
…om an elected leader. It's too easy for the code to evolve with reads and writes occurring outside the leader-election mechanism. This makes the code difficult to reason about and to test. By putting the asserts in place we enforce a constraint that simplifies the design. Taking the leader election has been moved up to SegmentRunner, as this is where all other leader-election actions are taken.

ref: #124
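A minimal sketch of the kind of assertion being enforced: storage writes for a segment are only allowed while this process holds the lead on it. The class, field and method names are hypothetical, not Reaper's actual leader-election API, and the in-memory set stands in for the real storage-backed election.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class LeaderGuardedWrites {

  // segments this process currently holds the lead on (stand-in for real election state)
  private final Set<UUID> ledSegments = ConcurrentHashMap.newKeySet();

  boolean takeLead(UUID segmentId) {
    // in the real system this would be a lock / lightweight transaction in storage
    return ledSegments.add(segmentId);
  }

  void releaseLead(UUID segmentId) {
    ledSegments.remove(segmentId);
  }

  void updateSegment(UUID segmentId /*, new state ... */) {
    // enforce the invariant: writes must only come from the elected leader
    assert ledSegments.contains(segmentId)
        : "segment " + segmentId + " updated without holding the lead";
    // ... write to storage ...
  }
}
```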
- configure travis to run separate jobs for multi-concurrency ReaperCassandraIT executions,
- reduce ccm node count and memory use,
- increase ccm request timeouts (tests can run very slowly),
- after a failure, display ccm errors,
- upgrade to dropwizard 1.0.8,
- upgrade jersey-client to 2.25.1, due to a bug where requests can be sent twice (still a problem but happens less),
- use `allow_failures` on the longer "smoke" tests.

ref: #124
…connection didn't fail in the past. Allow only a single Reaper instance to process an incremental repair segment, through leader election. Lower the host metrics TTL to 3 minutes.

ref: #124
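A minimal sketch of taking the lead on an incremental repair segment with a Cassandra lightweight transaction, so that only the single instance whose conditional insert is applied goes on to process the segment. The leader table name, its columns, and the TTL value are illustrative assumptions, not Reaper's actual schema.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import java.util.UUID;

public class SegmentLeadTaker {

  public static void main(String[] args) {
    UUID segmentId = UUID.randomUUID();
    UUID reaperInstanceId = UUID.randomUUID();

    try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("reaper_db")) {

      // the LWT ("IF NOT EXISTS") is applied for exactly one contending instance;
      // the TTL lets a lead held by a crashed instance expire on its own
      SimpleStatement takeLead = new SimpleStatement(
          "INSERT INTO leader (segment_id, reaper_instance) VALUES (?, ?) IF NOT EXISTS USING TTL 180",
          segmentId,
          reaperInstanceId);
      takeLead.setConsistencyLevel(ConsistencyLevel.QUORUM);

      ResultSet result = session.execute(takeLead);
      System.out.println("took lead on segment " + segmentId + ": " + result.wasApplied());
    }
  }
}
```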
Work in progress still.
Make integration tests capable of testing multiple running reaper instances