Cassandra integration #66
Conversation
…y to track progress in all partitions
```
// This is for DFSJarStore
"${PROG_HOME}/lib/yarn/*"
// "${PROG_HOME}/lib/yarn/*"
```
Had to comment this out to avoid runtime issues. When I try to submit an example job to Gearpump I get `java.lang.NoSuchMethodError: com.google.common.util.concurrent.Futures.withFallback(`. I believe this happens because Gearpump pulls in com.google.guava:guava version 11.0.2 from its Hadoop dependencies, but the Cassandra Java driver I am using needs version 16.0.1. I still need to figure out a solution to this.
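One possible workaround (a sketch, not a tested fix for this PR) is to shade the newer Guava with sbt-assembly, so that the relocated copy used by the Cassandra driver can coexist with Hadoop's 11.0.2. The shaded package name below is arbitrary, and this assumes the sbt-assembly plugin is on the build:

```scala
// Hypothetical build.sbt fragment: relocate Guava classes so the Cassandra
// driver's Guava 16.0.1 does not clash with Hadoop's Guava 11.0.2.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1")
    .inLibrary("com.google.guava" % "guava" % "16.0.1")
    .inProject
)
```

Alternatively, a `dependencyOverrides` entry forcing a single Guava version might work if one version satisfies both Hadoop and the driver, but that would need testing.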
I would appreciate help here, as I may not fully understand what my changes could cause elsewhere.
I'll fix this once 0.8.1 is out. Sorry, we may need to hold this for a while.
@zapletal-martin Thanks for your contribution. I'll pull your branch and try playing with it.
… more classdefnot found version issues.
Cassandra database integration
Reuses some Spark-Cassandra connector files and follows its approach. The intent is to allow the connector to be reused once versions for other processing systems become available. The Source looks up the token ranges of the desired table, splits them into independent sets of partitions, and assigns those sets to the available source tasks, allowing very good parallelism. All data fetches except the first are asynchronous. The Sink can be trivially parallelised by the user by assigning different writes to different tasks.
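The range-to-task assignment described above could be sketched like this; `TokenRange` and `assign` are illustrative names for this sketch, not the PR's actual API:

```scala
// Simplified stand-in for the Cassandra driver's token range type.
case class TokenRange(start: Long, end: Long)

object RangeAssignment {
  // Distribute token ranges round-robin over `taskCount` source tasks, so each
  // task owns an independent set of partitions it can scan in parallel.
  def assign(ranges: Seq[TokenRange], taskCount: Int): Map[Int, Seq[TokenRange]] =
    ranges.zipWithIndex
      .groupBy { case (_, idx) => idx % taskCount }
      .map { case (task, pairs) => task -> pairs.map(_._1) }
}
```

With 10 ranges and 3 tasks, task 0 would own ranges 0, 3, 6 and 9, and no range is assigned to more than one task.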
The Source scans a current snapshot of the table and does not currently honour updates (so it is not a continuous stream). The Source is also not time-replayable. There are options for handling both of these, but they must be properly thought through. The test coverage is poor at the moment, but this first attempt will allow iterating on the code, improving it continuously, and adding features.