Source Zendesk Chat: Process large amount of data in batches for incremental #13387
Conversation
"api_pagination_limit": { | ||
"type": "integer", | ||
"title": "Api Pagination Limit (0=No Limit)", | ||
"description": "Limit up to how many pages will be queried in the API to force the writing of data", | ||
"default": 0, | ||
"minimum": 0 | ||
}, |
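For context, a minimal sketch of how an option like this could be enforced in a CDK-style stream's pagination logic. The class and the `next_url` cursor field are illustrative assumptions, not the exact PR code:

```python
from typing import Any, Mapping, Optional

import requests


class LimitedPaginationStream:
    """Sketch: stop requesting new pages once a configured limit is hit."""

    def __init__(self, api_pagination_limit: int = 0):
        self._page_limit = api_pagination_limit  # 0 means "no limit"
        self._pages_read = 0

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        self._pages_read += 1
        # Returning None ends the sync early, forcing the records (and state)
        # accumulated so far to be written instead of paginating indefinitely.
        if self._page_limit and self._pages_read >= self._page_limit:
            return None
        next_url = response.json().get("next_url")  # assumed cursor field
        return {"next_url": next_url} if next_url else None
```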
@RobertoBonnet shouldn't this only take effect after 100k requests? Exposing this to users adds overhead to the process here.
Your point is that this limit should not be user-configurable? With 0, the limit doesn't exist. Yesterday we processed data with a limit of 15k pages: around 130 GB and 23M rows were processed in 24h. After that, the stream state wasn't updated; maybe the BigQuery destination has a problem with long runs. Another day we tried to process without a limit. It took 5 days and ended with an invalid-response error, so we lost 5 days of processing. What is your suggestion?
My question is: why not implement this #12591 (comment) in code instead of giving users a complex configuration?
- Limit the sync to 1000 pages and finish? Not the best option, because users will think the connector has a problem and ask why data is missing; maybe one solution is to add a warning in the documentation.
- After 1000 pages, reset the connection and wait until the server is OK to start getting data again?
- Implement the requests/min change in the connector after 1000 pages so users don't need to worry about it (see the sketch below).

@sherifnada can you give your opinion here? This is a problem that impacts users with large Zendesk accounts that get stuck because of the Zendesk side.
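A hedged sketch of that third option, throttling inside the connector after a page threshold so users never see a knob; the threshold and delay values are assumptions:

```python
import time
from typing import Callable, Iterator, Optional


def fetch_all_pages(
    fetch_page: Callable[[], Optional[dict]],
    threshold: int = 1000,
    slow_delay_seconds: float = 1.0,
) -> Iterator[dict]:
    """Yield pages at full speed, then back off once `threshold` pages are read."""
    pages_read = 0
    while True:
        if pages_read >= threshold:
            # Past the threshold, pace requests so the API stops throttling us.
            time.sleep(slow_delay_seconds)
        page = fetch_page()
        if page is None:  # no more pages
            return
        yield page
        pages_read += 1
```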
The changes I commented on in that issue do not affect Zendesk Chat. The objective of this PR is to get pieces of the data without having to wait for the whole process. In Zendesk Support, for example, it would take 14 days to process everything at once. On Zendesk Chat we were processing for 5 days and couldn't finish.
@RobertoBonnet btw, thanks for all the help with the Zendesk connectors! The discussion is about finding the best approach to solve the problem, aligned with CDK development.
@marcosmarxm sure, no problem. You have a good point. Here at Hurb we are using our Zendesk Chat connector (with the changes in this PR) and Zendesk Support (we had to make some changes to be able to process a lot of data; I haven't opened a PR yet, as I need to mature some things and analyze the changes you recently shipped). For Zendesk Support I have already managed to update the data: look at the amount.
airbyte-integrations/connectors/source-zendesk-chat/source_zendesk_chat/streams.py
@@ -104,11 +104,21 @@ def _field_to_datetime(value: Union[int, str]) -> pendulum.datetime:

class TimeIncrementalStream(BaseIncrementalStream, ABC):
    state_checkpoint_interval = 1000

    def __init__(self, start_date, **kwargs):
@RobertoBonnet I believe this is the only line necessary in the current PR. Since this connector wasn't checkpointing previously, a halfway failure causes all progress to be lost. But with this line, the connector will "save" its progress every 1000 records, so even if a halfway failure occurs, we don't lose data.
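For readers unfamiliar with the CDK attribute: with `state_checkpoint_interval` set, a STATE message is emitted every N records so the destination can commit partial progress. A simplified sketch of that behavior (illustrative only; the real loop lives inside the Airbyte CDK):

```python
from typing import Any, Dict, Iterable, Iterator


def read_with_checkpoints(
    records: Iterable[Dict[str, Any]],
    cursor_field: str,
    checkpoint_interval: int = 1000,
) -> Iterator[Dict[str, Any]]:
    """Emit a STATE message every `checkpoint_interval` records."""
    state: Dict[str, Any] = {}
    for count, record in enumerate(records, start=1):
        yield {"type": "RECORD", "record": record}
        state[cursor_field] = record[cursor_field]
        if count % checkpoint_interval == 0:
            # A failure after this point only loses records since this checkpoint.
            yield {"type": "STATE", "state": dict(state)}
    yield {"type": "STATE", "state": state}  # final state on a clean finish
```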
@sherifnada @marcosmarxm That makes sense. Increasing the number of records we process per request, along with the checkpoints, will help a lot. Could I do that?
Is it also possible to use this on the ID-based incremental streams? The docs don't mention whether those records are returned in ascending ID order as far as I can see, but it would be really surprising if they weren't.
If records are returned in ascending order of ID, we may be able to achieve the same effect by putting this line in BaseIncrementalStream.
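If the ascending-ID assumption holds, the suggested change would amount to something like this (a sketch; `BaseStream` here is a stand-in for the connector's actual base class):

```python
from abc import ABC


class BaseStream:  # stand-in for the connector's real base class
    pass


class BaseIncrementalStream(BaseStream, ABC):
    # Checkpoint every 1000 records for all incremental streams, time-based
    # and ID-based alike, assuming records arrive in ascending cursor order.
    state_checkpoint_interval = 1000
```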
@sherifnada These routes that I changed are based on the record's update date.
@marcosmarxm @sherifnada I'm trying to run a full refresh with this version.
/test connector=connectors/source-zendesk-chat

Build Failed. Test summary info:
/test connector=connectors/source-zendesk-chat

Build Failed. Test summary info:
@RobertoBonnet I merged this, with some changes to your contribution, in #14214. Thanks so much for this; hope to see more.
Closed here because this was merged in #14214.
@marcosmarxm OK. I was traveling and had no opportunity to follow what was happening. I'm going back to work today.
You can update to version 0.1.8 and see your changes there.
What
Make the connector able to process a large amount of data more smoothly, making data available faster. #13379
How
Make it possible to configure a limit on the maximum number of pages that will be processed per incremental sync, as sketched below.
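For example, a source config enabling the cap might look like this; every key except `api_pagination_limit` is an illustrative assumption:

```python
# Illustrative source configuration; 0 would disable the cap entirely.
config = {
    "start_date": "2021-01-01T00:00:00Z",  # assumed connector field
    "access_token": "<your_token>",        # assumed connector field
    "api_pagination_limit": 1000,          # stop after 1000 pages per sync
}
```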
then checking in your changesTests
Unit
Put your unit tests output here.
Integration
Put your integration tests output here.
Acceptance
Put your acceptance tests output here.