feat: Ability to import HBase Snapshot data into Cloud Bigtable using Dataflow #2755

Merged (36 commits) on Feb 1, 2021

@lichng lichng commented Dec 11, 2020

Adds a Dataflow pipeline template that imports data from an HBase snapshot into Cloud Bigtable.

@product-auto-label product-auto-label bot added the api: bigtable Issues related to the googleapis/java-bigtable-hbase API. label Dec 11, 2020
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Dec 11, 2020
@lichng lichng changed the title Ability to import HBase Snapshot into Cloud Bigtable using Dataflow feat: Ability to import HBase Snapshot data into Cloud Bigtable using Dataflow Dec 11, 2020
@lichng lichng marked this pull request as ready for review December 11, 2020 17:10
@lichng lichng requested review from a team as code owners December 11, 2020 17:10
lichng commented Dec 14, 2020

Hi @kolea2 @vermas2012,

This is ready for another round of review. Please take another look.

We plan to keep tuning the performance, but it would be great if we could get this merged so we can start collaborating on the downstream processes (validation, syncTable).

google-cla bot commented Dec 29, 2020

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment "@googlebot I fixed it." If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: no This human has *not* signed the Contributor License Agreement. and removed cla: yes This human has signed the Contributor License Agreement. labels Dec 29, 2020

New files for configuring HBaseSnapshotInputFormat

resolve version conflict and upgrade Beam version to 2.24.0

revert disk option change, not enough quota

Code reorg

code reduction

Refactor naming
Add integration config

Add unit test for HBaseSnapshotInputConfiguration

Set up skeleton for integration testing

Ship test data with code, integration tests pass

Clean up code for PR

Add HBase commands that generate our test snapshot

Addressing review comments
1. revert pom file overrides for SkipITs
2. Store SerializableConfiguration as member variable
3. Revert log4j.properties
4. Disable BIGTABLE_BULK_AUTOFLUSH_MS_KEY to prevent bulk mutation
failures failing the jobs.
1. use guava.version instead of beam-guava.version
2. fix typo

google-cla bot commented Dec 29, 2020

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@google-cla google-cla bot added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels Dec 29, 2020
@lichng lichng requested a review from igorbernstein2 January 21, 2021 17:47

Minor code refactoring to reduce confusion

@igorbernstein2 igorbernstein2 left a comment

lgtm!

@igorbernstein2 igorbernstein2 merged commit 5b3ab2b into googleapis:master Feb 1, 2021
kolea2 pushed a commit to kolea2/cloud-bigtable-client that referenced this pull request Feb 23, 2021
… Dataflow (googleapis#2755)

* Support import from HBase snapshot

New files for configuring HBaseSnapshotInputFormat

resolve version conflict and upgrade Beam version to 2.24.0

revert disk option change, not enough quota

Code reorg

code reduction

Refactor naming
Add integration config

Add unit test for HBaseSnapshotInputConfiguration

Set up skeleton for integration testing

Ship test data with code, integration tests pass

Clean up code for PR

Add HBase commands that generate our test snapshot

Addressing review comments
1. revert pom file overrides for SkipITs
2. Store SerializableConfiguration as member variable
3. Revert log4j.properties
4. Disable BIGTABLE_BULK_AUTOFLUSH_MS_KEY to prevent bulk mutation
failures failing the jobs.

* Update document

* Change the conf type for HBaseSnapshotInputConfiguration for
Serialization support

* Rename HBASE_ROOT_PATH to HBASE_EXPORT_ROOT_PATH in example doc for
it to be more intuitive

* Addressing the review comments:
1. use guava.version instead of beam-guava.version
2. fix typo

* Add the original Main.java under sequencefiles back

* gcs connector still requires non-android guava version

* Support import from HBase snapshot

New files for configuring HBaseSnapshotInputFormat

resolve version conflict and upgrade Beam version to 2.24.0

revert disk option change, not enough quota

Code reorg

code reduction

Refactor naming
Add integration config

Add unit test for HBaseSnapshotInputConfiguration

Set up skeleton for integration testing

Ship test data with code, integration tests pass

Clean up code for PR

Add HBase commands that generate our test snapshot

Addressing review comments
1. revert pom file overrides for SkipITs
2. Store SerializableConfiguration as member variable
3. Revert log4j.properties
4. Disable BIGTABLE_BULK_AUTOFLUSH_MS_KEY to prevent bulk mutation
failures failing the jobs.

* Addressing the review comments:
1. use guava.version instead of beam-guava.version
2. fix typo

* switch HBasesnapshotConfiguration to a builder class

* use DataflowRunner instead of DirectRunner for integration tests

* revert pom file override

* Remove all ValueProvider for now

* Add gcsProject parameter and remove template related document

* recover the dependency missed in the rebase

* Update new files using latest header comment format and update year to
2021

* Remove workaround for BIGTABLE_BULK_AUTOFLUSH_MS_KEY

* Exclude hbase-shaded-client

* Clean up all transitive dependencies on hbase-shaded-client

* Add document for integration test generation instructions
remove unnecessary code

* Update document

* Fail the pipeline building when there is an exception configuring input
updated unit test
more comments

* renaming according to review comments

* update comments

* More document about hbase snapshot file structure

* System.out -> LOG

* throw out exception instead of terminating JVM

* Remove outside visible parameter restoreDir, use a default dir instead
Add cleanup phase

* use pattern without ending '/'

* use listObject instead of match since GcsUtil expand intentionally
filter out directories

* Using a unique suffix for restore dir to avoid conflict

* Add dependency to pom.xml

* minimize accessibility for class

* Fix typo and Add header comment for CleanupHBaseSnapshotRestoreFilesFn

* Adding more error messages for HBaseSnapshotInputConfigBuilder
Minor code refactoring to reduce confusion

* Add document about how to handle temp files during job failures
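
Several items in the commit list above (switching the configuration to a builder class, throwing an exception instead of terminating the JVM, using a pattern without a trailing '/', and adding a unique suffix to the restore dir) can be illustrated together. The following is only a minimal sketch under stated assumptions, not the code from this PR: the class `SnapshotImportConfigSketch`, its fields, the `restore-` prefix, and the use of a random UUID as the suffix are all hypothetical names and choices picked for illustration.

```java
import java.util.UUID;

/** Hypothetical, simplified stand-in for a snapshot import configuration; not the PR's class. */
final class SnapshotImportConfigSketch {
  final String exportRootPath; // normalized: no trailing '/'
  final String snapshotName;
  final String restoreDir; // derived internally, never supplied by the caller

  private SnapshotImportConfigSketch(String exportRootPath, String snapshotName, String restoreDir) {
    this.exportRootPath = exportRootPath;
    this.snapshotName = snapshotName;
    this.restoreDir = restoreDir;
  }

  static Builder builder() {
    return new Builder();
  }

  /** Drops trailing '/' characters so later path joins and listing patterns stay predictable. */
  static String stripTrailingSlashes(String path) {
    int end = path.length();
    while (end > 0 && path.charAt(end - 1) == '/') {
      end--;
    }
    return path.substring(0, end);
  }

  static final class Builder {
    private String exportRootPath;
    private String snapshotName;
    // Assumption: a random UUID is "unique enough" to keep concurrent jobs from colliding.
    private String restoreSuffix = UUID.randomUUID().toString();

    Builder setExportRootPath(String value) {
      this.exportRootPath = value;
      return this;
    }

    Builder setSnapshotName(String value) {
      this.snapshotName = value;
      return this;
    }

    /** Lets tests pin the suffix; normal callers would keep the random default. */
    Builder setRestoreSuffix(String value) {
      this.restoreSuffix = value;
      return this;
    }

    /** Fails with a descriptive exception rather than exiting the JVM on bad input. */
    SnapshotImportConfigSketch build() {
      if (exportRootPath == null || stripTrailingSlashes(exportRootPath).isEmpty()) {
        throw new IllegalArgumentException("exportRootPath is required");
      }
      if (snapshotName == null || snapshotName.isEmpty()) {
        throw new IllegalArgumentException("snapshotName is required");
      }
      if (restoreSuffix == null || restoreSuffix.isEmpty()) {
        throw new IllegalArgumentException("restoreSuffix must not be empty");
      }
      String root = stripTrailingSlashes(exportRootPath);
      return new SnapshotImportConfigSketch(root, snapshotName, root + "/restore-" + restoreSuffix);
    }
  }
}
```

Throwing from `build()` means a misconfigured job fails while the pipeline is being constructed, and the caller decides how to report it; the derived `restoreDir` is what a cleanup phase would later delete.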
vermas2012 added a commit to vermas2012/java-bigtable-hbase that referenced this pull request May 5, 2021
… Dataflow (googleapis#2755)

(cherry picked from commit 5b3ab2b)
vermas2012 added a commit to vermas2012/java-bigtable-hbase that referenced this pull request May 10, 2021
… Dataflow (googleapis#2755)

(cherry picked from commit 5b3ab2b)
vermas2012 added a commit that referenced this pull request May 11, 2021
…rt validation (#2958)

* feat: Ability to import HBase Snapshot data into Cloud Bigtable using Dataflow (#2755)

(cherry picked from commit 5b3ab2b)

* feat: Add a new pipeline to validate data imported from HBase (#2828)

* feat: add a new pipeline to validate data imported into cloud bigtable from HBase.

(cherry picked from commit fa07a90)
Labels
api: bigtable Issues related to the googleapis/java-bigtable-hbase API. cla: yes This human has signed the Contributor License Agreement.
7 participants