
Source Zendesk Chat: Process large amount of data in batches for incremental #13387

Conversation

RobertoBonnet (Contributor)

What

Enable processing a large amount of data more smoothly, making data available faster. #13379

How

Make it possible to configure a maximum number of pages that will be processed per incremental sync.
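For illustration, here is a minimal sketch of how such a page cap could be enforced during pagination. The `IncrementalChatStream` class, its constructor, and the `next_page` cursor field are hypothetical simplifications based on the `api_pagination_limit` spec option proposed in this PR, not the merged implementation:

```python
from typing import Any, Mapping, Optional

import requests


class IncrementalChatStream:  # hypothetical, simplified stream
    """Stops paginating once the configured page cap is reached.

    A cap of 0 (the spec default) means pagination is unlimited.
    """

    def __init__(self, api_pagination_limit: int = 0):
        self.api_pagination_limit = api_pagination_limit
        self.pages_read = 0

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        self.pages_read += 1
        if self.api_pagination_limit and self.pages_read >= self.api_pagination_limit:
            return None  # stop early; the next incremental sync resumes from saved state
        cursor = response.json().get("next_page")
        return {"cursor": cursor} if cursor else None
```

Because state is checkpointed for the pages that were read, stopping early makes partial data available quickly and the next sync picks up where this one left off.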

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of local success, e.g. a screenshot or copy-pasted unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow the instructions in the README. For Java connectors, run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/SUMMARY.md
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of local success, e.g. a screenshot or copy-pasted unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow the instructions in the README. For Java connectors, run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • Changelog updated in docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub and connector version bumped by running the /publish command described here

Connector Generator

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed

Tests

Unit

Put your unit tests output here.

Integration

Put your integration tests output here.

Acceptance

Put your acceptance tests output here.

github-actions bot added the area/connectors label Jun 1, 2022
Comment on lines 23 to 29
"api_pagination_limit": {
"type": "integer",
"title": "Api Pagination Limit (0=No Limit)",
"description": "Limit up to how many pages will be queried in the API to force the writing of data",
"default": 0,
"minimum": 0
},
Member

@RobertoBonnet shouldn't this be changed only after 100k requests? Exposing this setting to users creates extra overhead for them.

RobertoBonnet (Contributor, Author)

Your point is that this limit should not be user-configurable? With 0 the limit doesn't exist. Yesterday we processed the data with a limit of 15k: around 130 GB and 23M rows were processed in 24h, but after that the stream state wasn't updated; maybe the BigQuery destination has a problem with long runs. Another day we tried to process without the limit: it took 5 days and failed with an invalid-response error, so we lost 5 days of processing. What is your suggestion?

marcosmarxm (Member) commented Jun 2, 2022

My question is: why not implement this (#12591 (comment)) in code instead of giving users a complex configuration?

  • Limit the sync to 1000 pages and finish? Not the best option, because users will think the connector has a problem and ask why data is missing; maybe a warning in the documentation would help.
  • After 1000 pages, reset the connection and wait until the server is OK before starting to fetch data again?
  • Implement the request/min change after 1000 pages inside the connector, so users don't need to worry about it (see the sketch after this list).
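A rough sketch of that third option: the connector throttles itself after a page threshold, absorbing the API limit instead of pushing a setting onto the user. Every name here (`PAGE_THRESHOLD`, `SLOW_DELAY_SECONDS`, `fetch_page`) is hypothetical:

```python
import time
from typing import Any, Callable, Iterator, Optional, Tuple

PAGE_THRESHOLD = 1000      # pages fetched at full speed before throttling
SLOW_DELAY_SECONDS = 1.0   # extra delay per request once throttled


def paginate(fetch_page: Callable[[Optional[str]], Tuple[Any, Optional[str]]]) -> Iterator[Any]:
    """Yield pages, slowing the request rate once the threshold is crossed."""
    next_token: Optional[str] = None
    pages = 0
    while True:
        page, next_token = fetch_page(next_token)
        yield page
        pages += 1
        if next_token is None:
            break
        if pages >= PAGE_THRESHOLD:
            time.sleep(SLOW_DELAY_SECONDS)  # back off instead of failing the sync
```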

@sherifnada can you give your opinion here? This problem impacts users with large Zendesk accounts who get stuck because of the Zendesk side.

RobertoBonnet (Contributor, Author)

The changes I mentioned in that issue do not affect Zendesk Chat. The objective of this PR is to get pieces of the data without having to wait for the whole process. In Zendesk Support, for example, it would take 14 days to process everything at once; on Zendesk Chat we were processing for 5 days and couldn't finish.

marcosmarxm (Member)

@RobertoBonnet btw, thanks for all the help with the Zendesk connectors! The point of this discussion is to find the best approach to solve the problem in line with CDK development.

RobertoBonnet (Contributor, Author)

@marcosmarxm Sure, no problem; you have a good point. Here at Hurb we are using our Zendesk Chat connector (with the changes in this PR) and Zendesk Support (we had to make some changes to be able to process a lot of data; I haven't opened a PR for those yet, since I need to mature some things and analyze the changes you have recently shipped).

For Zendesk Support I have already managed to update the data; look at the amount:

[screenshot]

@@ -104,11 +104,21 @@ def _field_to_datetime(value: Union[int, str]) -> pendulum.datetime:


class TimeIncrementalStream(BaseIncrementalStream, ABC):
def __init__(self, start_date, **kwargs):
state_checkpoint_interval = 1000
Contributor

@RobertoBonnet I believe this is the only line necessary in the current PR. Since this connector wasn't checkpointing previously, a halfway failure caused all progress to be lost. With this line, the connector will "save" its progress every 1000 records, so even if a halfway failure occurs, we don't lose the data synced so far.
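For context, a minimal sketch of the read-loop behavior that `state_checkpoint_interval` enables; this is a simplified illustration of the CDK's incremental read logic, not the actual CDK source:

```python
from typing import Any, Iterator, Mapping, Tuple


def read_incremental(stream: Any, stream_state: Mapping[str, Any],
                     checkpoint_interval: int = 1000) -> Iterator[Tuple[str, Any]]:
    """Yield RECORD messages plus a STATE message every `checkpoint_interval` records."""
    for i, record in enumerate(stream.read_records(stream_state=stream_state), start=1):
        yield ("RECORD", record)
        stream_state = stream.get_updated_state(stream_state, record)
        if i % checkpoint_interval == 0:
            yield ("STATE", stream_state)  # checkpoint: a later failure resumes from here
    yield ("STATE", stream_state)  # final state once the sync completes
```

A destination that commits data up to each STATE message means a failed sync only repeats the work done since the last checkpoint.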

RobertoBonnet (Contributor, Author)

@sherifnada @marcosmarxm It makes sense. Increasing the number of records we process per request, together with the checkpoints, will help a lot. Could I do that?

Contributor

Is it also possible to use this on the ID-based incremental streams? As far as I can see, the docs don't mention whether those records are returned in ascending ID order, but it would be really surprising if they weren't.

If records are returned in ascending order of ID, we may be able to achieve the same effect by putting this line in BaseIncrementalStream.
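If that ascending-ID assumption holds, hoisting the attribute might look like the sketch below; `BaseIncrementalStream` and `TimeIncrementalStream` follow the names in the diff above, while `IdIncrementalStream` and the class bodies are illustrative:

```python
from abc import ABC


class BaseIncrementalStream(ABC):
    # Checkpoint every 1000 records for every incremental stream. This is only
    # safe if each stream emits records in ascending cursor order.
    state_checkpoint_interval = 1000


class TimeIncrementalStream(BaseIncrementalStream, ABC):
    ...  # cursor is the record update timestamp


class IdIncrementalStream(BaseIncrementalStream, ABC):
    ...  # cursor is the record ID; assumes the API returns ascending IDs
```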

RobertoBonnet (Contributor, Author)

@sherifnada These routes that I changed are based on the record update date


RobertoBonnet (Contributor, Author)

@marcosmarxm @sherifnada I'm trying to run a full refresh with this version.

[screenshot]

marcosmarxm (Member)

> @marcosmarxm @sherifnada I'm trying to run a full refresh with this version.

Did it work?

RobertoBonnet (Contributor, Author)

> @marcosmarxm @sherifnada I'm trying to run a full refresh with this version.
>
> Did it work?

Yes.

[screenshot]

marcosmarxm added the team/tse label and removed the team/extensibility label Jun 14, 2022
marcosmarxm (Member) commented Jun 14, 2022

/test connector=connectors/source-zendesk-chat

🕑 connectors/source-zendesk-chat https://github.com/airbytehq/airbyte/actions/runs/2497658218
❌ connectors/source-zendesk-chat https://github.com/airbytehq/airbyte/actions/runs/2497658218
🐛 https://gradle.com/s/cqbrz6wnn5kty

Build Failed

Test summary info:

=========================== short test summary info ============================
FAILED test_core.py::TestDiscovery::test_discover[inputs0] - docker.errors.Co...
FAILED test_core.py::TestDiscovery::test_discover[inputs1] - docker.errors.Co...
FAILED test_core.py::TestDiscovery::test_discover[inputs2] - docker.errors.Co...
ERROR test_core.py::TestDiscovery::test_defined_cursors_exist_in_schema[inputs0]
ERROR test_core.py::TestDiscovery::test_defined_cursors_exist_in_schema[inputs1]
ERROR test_core.py::TestDiscovery::test_defined_cursors_exist_in_schema[inputs2]
ERROR test_core.py::TestDiscovery::test_defined_refs_exist_in_schema[inputs0]
ERROR test_core.py::TestDiscovery::test_defined_refs_exist_in_schema[inputs1]
ERROR test_core.py::TestDiscovery::test_defined_refs_exist_in_schema[inputs2]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs0-allOf]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs0-not]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs1-allOf]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs1-not]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs2-allOf]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs2-not]
ERROR test_core.py::TestDiscovery::test_primary_keys_exist_in_schema[inputs0]
ERROR test_core.py::TestDiscovery::test_primary_keys_exist_in_schema[inputs1]
ERROR test_core.py::TestDiscovery::test_primary_keys_exist_in_schema[inputs2]
ERROR test_core.py::TestBasicRead::test_read[inputs0] - docker.errors.Contain...
ERROR test_core.py::TestBasicRead::test_read[inputs1] - docker.errors.Contain...
ERROR test_full_refresh.py::TestFullRefresh::test_sequential_reads[inputs0]
ERROR test_incremental.py::TestIncremental::test_two_sequential_reads[inputs0]
ERROR test_incremental.py::TestIncremental::test_read_sequential_slices[inputs0]
ERROR test_incremental.py::TestIncremental::test_state_with_abnormally_large_values[inputs0]
=================== 3 failed, 16 passed, 21 errors in 20.23s ===================

marcosmarxm (Member) commented Jun 23, 2022

/test connector=connectors/source-zendesk-chat

🕑 connectors/source-zendesk-chat https://github.com/airbytehq/airbyte/actions/runs/2551756464
❌ connectors/source-zendesk-chat https://github.com/airbytehq/airbyte/actions/runs/2551756464
🐛 https://gradle.com/s/sscsim57ecqyk

Build Failed

Test summary info:

=========================== short test summary info ============================
FAILED test_core.py::TestDiscovery::test_discover[inputs0] - docker.errors.Co...
FAILED test_core.py::TestDiscovery::test_discover[inputs1] - docker.errors.Co...
FAILED test_core.py::TestDiscovery::test_discover[inputs2] - docker.errors.Co...
ERROR test_core.py::TestDiscovery::test_defined_cursors_exist_in_schema[inputs0]
ERROR test_core.py::TestDiscovery::test_defined_cursors_exist_in_schema[inputs1]
ERROR test_core.py::TestDiscovery::test_defined_cursors_exist_in_schema[inputs2]
ERROR test_core.py::TestDiscovery::test_defined_refs_exist_in_schema[inputs0]
ERROR test_core.py::TestDiscovery::test_defined_refs_exist_in_schema[inputs1]
ERROR test_core.py::TestDiscovery::test_defined_refs_exist_in_schema[inputs2]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs0-allOf]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs0-not]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs1-allOf]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs1-not]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs2-allOf]
ERROR test_core.py::TestDiscovery::test_defined_keyword_exist_in_schema[inputs2-not]
ERROR test_core.py::TestDiscovery::test_primary_keys_exist_in_schema[inputs0]
ERROR test_core.py::TestDiscovery::test_primary_keys_exist_in_schema[inputs1]
ERROR test_core.py::TestDiscovery::test_primary_keys_exist_in_schema[inputs2]
ERROR test_core.py::TestBasicRead::test_read[inputs0] - docker.errors.Contain...
ERROR test_core.py::TestBasicRead::test_read[inputs1] - docker.errors.Contain...
ERROR test_full_refresh.py::TestFullRefresh::test_sequential_reads[inputs0]
ERROR test_incremental.py::TestIncremental::test_two_sequential_reads[inputs0]
ERROR test_incremental.py::TestIncremental::test_read_sequential_slices[inputs0]
ERROR test_incremental.py::TestIncremental::test_state_with_abnormally_large_values[inputs0]
=================== 3 failed, 16 passed, 21 errors in 19.87s ===================

marcosmarxm self-assigned this Jun 23, 2022
marcosmarxm (Member) left a comment

@RobertoBonnet I merged this, with some changes to your contribution, in #14214. Thanks so much for this; hope to see more!

marcosmarxm (Member)

Closed here because this was merged in #14214

RobertoBonnet (Contributor, Author)

@marcosmarxm OK. I was traveling and had no opportunity to follow what was happening. I'm going back to work today.

marcosmarxm (Member)

You can update to version 0.1.8 and see your changes there.
