feat: Ability to import HBase Snapshot data into Cloud Bigtable using Dataflow #2755
Hi @kolea2 @vermas2012, this is ready for another round of review. Please take another look. We plan to keep tuning the performance, but it would be great to get this merged so we can start collaborating on the downstream processes (validation, syncTable).
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info.
Support import from HBase snapshot
  * New files for configuring HBaseSnapshotInputFormat
  * resolve version conflict and upgrade Beam version to 2.24.0
  * revert disk option change, not enough quota
  * Code reorg
  * code reduction
  * Refactor naming
  * Add integration config
  * Add unit test for HBaseSnapshotInputConfiguration
  * Set up skeleton for integration testing
  * Ship test data with code, integration tests pass
  * Clean up code for PR
  * Add HBase commands that generate our test snapshot
  * Addressing review comments
    1. revert pom file overrides for SkipITs
    2. Store SerializableConfiguration as member variable
    3. Revert log4j.properties
    4. Disable BIGTABLE_BULK_AUTOFLUSH_MS_KEY to prevent bulk mutation failures from failing the jobs.
Change the conf type for HBaseSnapshotInputConfiguration for Serialization support
Rename HBASE_ROOT_PATH to HBASE_EXPORT_ROOT_PATH in example doc for it to be more intuitive
Addressing the review comments:
  1. use guava.version instead of beam-guava.version
  2. fix typo
CLAs look good, thanks! ℹ️ Googlers: Go here for more info.
Minor code refactoring to reduce confusion
lgtm!
… Dataflow (googleapis#2755)
* Support import from HBase snapshot
  * New files for configuring HBaseSnapshotInputFormat
  * resolve version conflict and upgrade Beam version to 2.24.0
  * revert disk option change, not enough quota
  * Code reorg
  * code reduction
  * Refactor naming
  * Add integration config
  * Add unit test for HBaseSnapshotInputConfiguration
  * Set up skeleton for integration testing
  * Ship test data with code, integration tests pass
  * Clean up code for PR
  * Add HBase commands that generate our test snapshot
  * Addressing review comments
    1. revert pom file overrides for SkipITs
    2. Store SerializableConfiguration as member variable
    3. Revert log4j.properties
    4. Disable BIGTABLE_BULK_AUTOFLUSH_MS_KEY to prevent bulk mutation failures from failing the jobs.
* Update document
* Change the conf type for HBaseSnapshotInputConfiguration for Serialization support
* Rename HBASE_ROOT_PATH to HBASE_EXPORT_ROOT_PATH in example doc for it to be more intuitive
* Addressing the review comments:
  1. use guava.version instead of beam-guava.version
  2. fix typo
* Add the original Main.java under sequencefiles back
* gcs connector still requires non-android guava version
* switch HBasesnapshotConfiguration to a builder class
* use DataflowRunner instead of DirectRunner for integration tests
* revert pom file override
* Remove all ValueProvider for now
* Add gcsProject parameter and remove template related document
* recover the dependency missed in the rebase
* Update new files using latest header comment format and update year to 2021
* Remove workaround for BIGTABLE_BULK_AUTOFLUSH_MS_KEY
* Exclude hbase-shaded-client
* Clean up all transitive dependencies on hbase-shaded-client
* Add document for integration test generation instructions; remove unnecessary code
* Update document
* Fail the pipeline building when there is an exception configuring input; updated unit test; more comments
* renaming according to review comments
* update comments
* More document about hbase snapshot file structure
* System.out -> LOG
* throw exception instead of terminating JVM
* Remove outside visible parameter restoreDir, use a default dir instead; add cleanup phase
* use pattern without ending '/'
* use listObject instead of match since GcsUtil expand intentionally filters out directories
* Using a unique suffix for restore dir to avoid conflict
* Add dependency to pom.xml
* minimize accessibility for class
* Fix typo and add header comment for CleanupHBaseSnapshotRestoreFilesFn
* Adding more error messages for HBaseSnapshotInputConfigBuilder; minor code refactoring to reduce confusion
* Add document about how to handle temp files during job failures
… Dataflow (googleapis#2755): same commit message as above (cherry picked from commit 5b3ab2b)
…rt validation (#2958)
* feat: Ability to import HBase Snapshot data into Cloud Bigtable using Dataflow (#2755): same commit message as above (cherry picked from commit 5b3ab2b)
* feat: Add a new pipeline to validate data imported from HBase (#2828)
  * feat: add a new pipeline to validate data imported into Cloud Bigtable from HBase. (cherry picked from commit fa07a90)
A Dataflow pipeline template that imports data from an HBase snapshot into Cloud Bigtable.
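
Several of the commit notes above describe how the restore directory is handled: the user-visible `restoreDir` parameter was dropped in favor of a default directory, a unique suffix was added to avoid conflicts, the path pattern is used without an ending '/', and configuration errors throw an exception rather than terminating the JVM. The sketch below illustrates those ideas only; it is not the PR's code. The class name `RestoreDirNames`, the method `uniqueRestoreDir`, the `restore-` prefix, and the example bucket path are all hypothetical.

```java
import java.util.UUID;

/** Hypothetical helper for illustration; none of these names come from the PR. */
final class RestoreDirNames {
  private RestoreDirNames() {}

  /**
   * Builds a restore directory under the export root with a random suffix,
   * so that two jobs reading the same snapshot do not share temp files.
   */
  static String uniqueRestoreDir(String exportRoot, String jobName) {
    if (exportRoot == null || exportRoot.isEmpty()) {
      // Throw instead of calling System.exit so the caller can fail cleanly.
      throw new IllegalArgumentException("exportRoot must be set");
    }
    // Drop any trailing '/' so the resulting path has no doubled separators.
    String root = exportRoot.replaceAll("/+$", "");
    return root + "/restore-" + jobName + "-" + UUID.randomUUID();
  }

  public static void main(String[] args) {
    // Assumed example path; prints a different suffix on every run.
    System.out.println(uniqueRestoreDir("gs://my-bucket/hbase-export/", "import-job"));
  }
}
```

Because the suffix is random, a job that fails before its cleanup phase leaves behind a directory that has to be found and removed by prefix, which is presumably why the PR also documents how to handle temp files after job failures.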