
Destination Snowflake: fixed duplicate rows on retries #9141

Merged

Conversation

@VitaliiMaltsev (Contributor) commented Dec 28, 2021

What

When running a sync, if the sync fails and has to retry, all rows that have already been put on the stage are appended to the stage again, resulting in duplicate rows.

How

The stage now has a per-sync folder in it, as described here: https://docs.snowflake.com/en/user-guide/data-load-local-file-system-stage.html
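
As a rough illustration of the approach (a hypothetical sketch, not the connector's actual code: StagingSqlOperations and StagingFlowSketch below are stand-ins with simplified signatures), each sync attempt stages its files under a fresh UUID-named subfolder and cleans that subfolder up if the copy fails:

import java.util.UUID;

// Illustrative stand-in for the connector's SQL operations; the real methods
// take more parameters (database handle, schema name, etc.).
interface StagingSqlOperations {
  void copyIntoTmpTableFromStage(String path, String tmpTable) throws Exception;
  void cleanUpStage(String path) throws Exception;
}

public class StagingFlowSketch {
  public static void copyWithCleanup(final StagingSqlOperations ops,
                                     final String stageName) throws Exception {
    // One random subdirectory per sync attempt, so a retried sync can never
    // re-append into a previous attempt's staged files.
    final String path = stageName + "/" + UUID.randomUUID();
    try {
      ops.copyIntoTmpTableFromStage(path, "_airbyte_tmp_users");
    } catch (final Exception e) {
      // On failure, drop only the files staged by this attempt before retrying.
      ops.cleanUpStage(path);
      throw e;
    }
  }
}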

Recommended reading order

  1. x.java

🚨 User Impact 🚨

There should be no visible impact on the user.

Pre-merge Checklist


Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • Credentials added to GitHub CI. Instructions.
  • /test connector=connectors/<name> command is passing.
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the new connector version is published, connector version bumped in the seed directory as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here


@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


vmaltsev seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions github-actions bot added the area/connectors Connector related issues label Dec 28, 2021
@github-actions github-actions bot added the area/documentation Improvements or additions to documentation label Dec 28, 2021
@VitaliiMaltsev (Author) commented Dec 28, 2021

/test connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1630097546
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1630097546
Python tests coverage:

	 ---------- coverage: platform linux, python 3.8.10-final-0 -----------
	 Name                                                              Stmts   Miss  Cover
	 -------------------------------------------------------------------------------------
	 main_dev_transform_catalog.py                                         3      3     0%
	 main_dev_transform_config.py                                          3      3     0%
	 normalization/__init__.py                                             4      0   100%
	 normalization/destination_type.py                                    13      0   100%
	 normalization/transform_catalog/__init__.py                           2      0   100%
	 normalization/transform_catalog/catalog_processor.py                143     77    46%
	 normalization/transform_catalog/destination_name_transformer.py     124      6    95%
	 normalization/transform_catalog/reserved_keywords.py                 13      0   100%
	 normalization/transform_catalog/stream_processor.py                 494    313    37%
	 normalization/transform_catalog/table_name_registry.py              174     34    80%
	 normalization/transform_catalog/transform.py                         45     26    42%
	 normalization/transform_catalog/utils.py                             33      7    79%
	 normalization/transform_config/__init__.py                            2      0   100%
	 normalization/transform_config/transform.py                         146     32    78%
	 -------------------------------------------------------------------------------------
	 TOTAL                                                              1199    501    58%

@@ -45,6 +46,7 @@
private static final Logger LOGGER = LoggerFactory.getLogger(SnowflakeInternalStagingConsumerFactory.class);

private static final long MAX_BATCH_SIZE_BYTES = 1024 * 1024 * 1024 / 4; // 256mb
private static final String currentSyncPath = UUID.randomUUID().toString();
Contributor:
Please use the Java convention for constants: uppercase with underscores.

@VitaliiMaltsev (Author):
renamed

path);
try {
sqlOperations.copyIntoTmpTableFromStage(database, path, srcTableName, schemaName);
}catch (Exception e){
Contributor:
Looks like a space is missing here before the catch; please reformat.

@VitaliiMaltsev (Author):
fixed

@alexandertsukanov (Contributor):

@VitaliiMaltsev Minor comments, otherwise LGTM.

@sherifnada sherifnada requested a review from edgao January 5, 2022 08:24
@sherifnada (Contributor) commented Jan 5, 2022

@edgao do you mind reviewing this PR? feel free to reassign to liren if underwater

@sherifnada sherifnada removed their request for review January 5, 2022 08:24
@edgao (Contributor) left a comment:

could you add some tests for this behavior? (i.e. that we're creating different subdirectories per run, and that we're deleting the staging data on failure)

@@ -45,6 +46,7 @@
private static final Logger LOGGER = LoggerFactory.getLogger(SnowflakeInternalStagingConsumerFactory.class);

private static final long MAX_BATCH_SIZE_BYTES = 1024 * 1024 * 1024 / 4; // 256mb
private static final String CURRENT_SYNC_PATH = UUID.randomUUID().toString();
@edgao (Contributor):
I'd prefer for this to be an instance field (i.e. non-static). Then SnowflakeInternalStagingDestination#getConsumer would need to call new SnowflakeInternalStagingConsumerFactory().create(...)

generally I avoid non-constant static fields, since they can be confusing + difficult to test.

@VitaliiMaltsev (Author):
The CURRENT_SYNC_PATH field is used in the static methods recordWriterFunction and onCloseFunction, and a non-static field can't be used in static methods.

@edgao (Contributor):
I think those methods could just be switched to non-static as well (they're only called from create, so that should be safe)

@VitaliiMaltsev (Author):
@edgao switched to non-static
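
For reference, a minimal sketch of the shape this refactor takes (names follow the diff above, but the signatures here are simplified and hypothetical, not the connector's real ones):

import java.util.UUID;

public class SnowflakeInternalStagingConsumerFactory {

  // Instance field: one random staging subdirectory per factory instance,
  // i.e. per sync attempt, instead of a shared static value.
  private final String currentSyncPath = UUID.randomUUID().toString();

  // Formerly static; as an instance method it can read currentSyncPath directly.
  private String stagingPath(final String schemaName, final String tableName) {
    return schemaName + "/" + tableName + "/" + currentSyncPath;
  }

  public static void main(final String[] args) {
    // Each new factory (i.e. each sync) gets its own folder, so retries can't collide.
    System.out.println(new SnowflakeInternalStagingConsumerFactory()
        .stagingPath("PUBLIC", "USERS"));
  }
}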

@VitaliiMaltsev (Author) commented Jan 10, 2022

> could you add some tests for this behavior? (i.e. that we're creating different subdirectories per run, and that we're deleting the staging data on failure)

@edgao While implementing this solution, I manually canceled the queries in the Snowflake console at exactly the moment when the data from the stage is inserted into the tables, in order to force a retry and a subsequent sync of the same connection in Airbyte. I'm not sure how to write a test that reproduces the same behavior.
During test execution the query would have to be canceled at a certain moment, but we don't know when, and we don't have access to the query id to cancel it: https://docs.snowflake.com/en/sql-reference/functions/system_cancel_query.html

@sherifnada (Contributor):

@VitaliiMaltsev can you create a separate issue for testing this behavior and work on it after merging this PR? I agree with Edward that we should have a test but would love to release this fix asap as it's a critical issue

@edgao (Contributor) commented Jan 10, 2022

I think writing a unit test would be sufficient - if you pass in a mocked SnowflakeStagingSqlOperations and initialize it with doThrow(...).when(mockedSqlOps).copyIntoTmpTableFromStage(...), then you can use verify(mockedSqlOps).cleanUpStage(...) to check that it handled the exception correctly
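
A rough sketch of what such a test could look like (JUnit 5 + Mockito, reusing the hypothetical StagingSqlOperations / StagingFlowSketch stand-ins from the sketch in the PR description above, not the connector's real classes):

import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.mockito.ArgumentMatchers.anyString;
import static org.mockito.Mockito.doThrow;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.junit.jupiter.api.Test;

class StagingCleanupTest {

  @Test
  void cleansUpStageWhenCopyFails() throws Exception {
    final StagingSqlOperations ops = mock(StagingSqlOperations.class);
    // Simulate a failed sync attempt: the copy step throws.
    doThrow(new RuntimeException("copy failed"))
        .when(ops).copyIntoTmpTableFromStage(anyString(), anyString());

    assertThrows(Exception.class,
        () -> StagingFlowSketch.copyWithCleanup(ops, "my_stage"));

    // Verify the failed attempt cleaned up its own staging folder.
    verify(ops).cleanUpStage(anyString());
  }
}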

vmaltsev added 2 commits January 10, 2022 19:00
…ate-rows

# Conflicts:
#	airbyte-integrations/connectors/destination-snowflake/src/main/java/io/airbyte/integrations/destination/snowflake/SnowflakeStagingSqlOperations.java
#	docs/integrations/destinations/snowflake.md
@edgao (Contributor) left a comment:
saw that you created #9389 - :shipit:

@VitaliiMaltsev (Author) commented Jan 10, 2022

/publish connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1678741625
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/1678741625

@@ -149,13 +151,13 @@ private static RecordWriter recordWriterFunction(final JdbcDatabase database,
final WriteConfig writeConfig = pairToWriteConfig.get(pair);
final String schemaName = writeConfig.getOutputSchemaName();
final String tableName = writeConfig.getOutputTableName();
final String stageName = namingResolver.getStageName(schemaName, tableName);
final String path = namingResolver.getStagingPath(schemaName, tableName, CURRENT_SYNC_PATH);
@ChristopheDuong (Contributor) commented Mar 4, 2022:
When there are multiple "flushes" on the same stream during the same sync (large streams), are all batches/files written to the same path / the same object in the stage area?

Is the stage like a folder, or like a file? (I guess it acts like a folder where we stage multiple files?)



Successfully merging this pull request may close these issues.

🐛 Destination Snowflake: duplicate rows on retries when using incremental staging
10 participants