
Investigate Deprecating API Type parameter in Source Salesforce #8419

Closed
augan-rymkhan opened this issue Dec 2, 2021 · 16 comments · Fixed by #9302
Assignees
Labels
area/connectors Connector related issues connectors/source/salesforce connectors/sources-api needs-scoping not-ready-for-work still need to gather info or make decisions before we start work type/bug Something isn't working

Comments

@augan-rymkhan
Contributor

augan-rymkhan commented Dec 2, 2021

Environment

  • Airbyte version:
  • OS Version / Instance:
  • Deployment: *
  • Source Connector and version: Salesforce
  • Destination Connector and version: *
  • Severity: Low
  • Step where error happened:

Current Behavior

Tell us what happens.

Expected Behavior

Deprecate "API Type" parameter in Source Salesforce. This is an implementation detail that we should remove eventually.

Acceptance Criteria

The output of this issue should be a proposal/document describing whether it is possible to deprecate this option, and what are the tradeoffs when doing so.

@sherifnada
Contributor

@augan-rymkhan actually this is one of the very few cases where it might not be an implementation detail. Salesforce actually has two different account types: sandbox and production. They operate on different domains (my.salesforce.com and test.salesforce.com). If you can find a way to reliably identify whether the user should be using a sandbox account, great, let's do it. But I suspect it may be difficult, especially with OAuth.

@sherifnada sherifnada added area/connectors Connector related issues and removed needs-triage labels Dec 3, 2021
@sherifnada
Contributor

@augan-rymkhan I just realized why you created this issue, because of my comment on the other PR.

That comment was directed at the "API Type" parameter not the sandbox, apologies for putting it on the wrong line.

@augan-rymkhan augan-rymkhan changed the title Deprecate is_sandbox option in Source Salesforce Deprecate API Type parameter in Source Salesforce Dec 3, 2021
@augan-rymkhan
Contributor Author

@sherifnada no worry, I have updated it.

@augan-rymkhan augan-rymkhan self-assigned this Dec 7, 2021
@sherifnada sherifnada changed the title Deprecate API Type parameter in Source Salesforce Investigate Deprecating API Type parameter in Source Salesforce Dec 10, 2021
@augan-rymkhan
Contributor Author

augan-rymkhan commented Dec 14, 2021

Scoping report

Pros and cons of BULK API version:

Pros:

  1. The BULK API returns a CSV file, which helps avoid rate limits.
  2. It also supports incremental sync.

Cons:

  1. The Salesforce BULK API currently does not support loading fields with data type base64 or compound data (nested objects). link
    If a stream's schema contains base64 or nested-object fields, we can use the REST version of the stream.

  2. Some streams do not support the BULK API.
    But we can use the REST API for those streams.

A possible solution is to use the BULK API by default wherever it is possible without data loss, and fall back to the REST version otherwise.
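The proposed fallback could be sketched as follows. This is a minimal illustration, not the connector's actual code: the helper name `choose_api`, the constant, and the schema shape are all hypothetical.

```python
# Hypothetical sketch: prefer the BULK API, but fall back to REST when a
# stream's JSON schema contains fields the BULK API cannot load (base64
# binaries or compound/nested objects), or when the stream itself does not
# support the BULK API at all.

UNSUPPORTED_BULK_TYPES = {"base64", "object"}  # binary and compound fields


def choose_api(stream_name: str, schema: dict, bulk_unsupported_streams: set) -> str:
    """Return 'BULK' when it is safe for this stream, otherwise 'REST'."""
    if stream_name in bulk_unsupported_streams:
        return "REST"
    for field in schema.get("properties", {}).values():
        field_types = field.get("type", [])
        if isinstance(field_types, str):
            field_types = [field_types]
        if UNSUPPORTED_BULK_TYPES & set(field_types):
            return "REST"
    return "BULK"
```

With this shape, a stream like Attachment (whose Body field is base64) would automatically be routed to REST while plain tabular streams stay on BULK.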

@sherifnada
Contributor

Some questions:

  1. What does base64 correspond to in Salesforce? I'm trying to understand how often a user will have a base64 object or a compound data field, which would impact this.
  2. What is the performance difference between the BULK and REST APIs?

@augan-rymkhan
Contributor Author

augan-rymkhan commented Dec 15, 2021

@sherifnada

  1. base64 corresponds to binary files.
    The Body or Binary field contains base64-encoded data in the Attachment, Document and Scontrol streams,
    so it depends on how many files the user has uploaded. For example, here is the list of objects (streams) which can have attachments:
    (screenshot: list of objects that can have attachments)

For more info you can check this.

Address and Geolocation are compound data fields.
The Address field type is used in the following streams:
Contact, Account, AccountBrand, Order, Employee, Organization, WorkOrder, and other streams.

  2. REST vs. BULK API performance comparison.

| Test | Streams synced | Start date | REST | BULK |
| --- | --- | --- | --- | --- |
| 1 | 8 (Account, ActiveFeatureLicenseMetric, ActivePermSetLicenseMetric, ActiveProfileMetric, AppDefinition, Asset, PermissionSetTabSetting, LeadHistory) | 2020-06-01T00:00:00Z | 23 s | 52 s |
| 2 | 40 | 2021-11-01T00:00:00Z | 1 min 22 s | 5 min 50 s |
| 3 | 80 | 2021-05-01T00:00:00Z | 3 min 30 s | 14 min 40 s |

The BULK API consists of the following steps:

  1. create a job
  2. poll the job's status in a loop, sleeping for an exponentially growing time interval
  3. once the job is ready, download the data
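The three steps above could look roughly like this, assuming Salesforce Bulk API 2.0 query jobs and that `instance_url` and `token` come from an earlier authentication step. This is an illustrative sketch, not the connector's actual implementation.

```python
# Hedged sketch of the BULK flow: create a query job, poll its state with
# exponentially growing sleeps, then download the CSV results.
import time

import requests

JOBS_PATH = "/services/data/v52.0/jobs/query"


def backoff_delays(initial=1.0, factor=2.0, cap=60.0):
    """Yield exponentially growing sleep intervals, capped at `cap` seconds."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)


def run_bulk_query(instance_url: str, token: str, soql: str) -> str:
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    # 1. create the job
    job = requests.post(instance_url + JOBS_PATH,
                        json={"operation": "query", "query": soql},
                        headers=headers).json()
    # 2. poll the job state, sleeping an exponentially growing interval
    for delay in backoff_delays():
        state = requests.get(f"{instance_url}{JOBS_PATH}/{job['id']}",
                             headers=headers).json()["state"]
        if state == "JobComplete":
            break
        if state in ("Aborted", "Failed"):
            raise RuntimeError(f"bulk job ended in state {state}")
        time.sleep(delay)
    # 3. download the results (returned as CSV text)
    return requests.get(f"{instance_url}{JOBS_PATH}/{job['id']}/results",
                        headers={"Authorization": f"Bearer {token}"}).text
```

The exponential backoff matters here because each status check is an API call that counts against the daily quota, and it is also why BULK loses on many small streams: even an empty stream pays at least one create/poll/download round trip.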

In the tests above the BULK API is slower because we spend time waiting on jobs for streams that have few records or no records at all.

If the streams contain a lot of data, the REST API might hit the rate limit; in that case, the BULK API could be preferable.

If a user does not reach the rate limit, the REST API is a good choice due to its better performance.

@augan-rymkhan augan-rymkhan self-assigned this Dec 16, 2021
@sherifnada
Contributor

@misteryeo do you feel comfortable making the call here?

@VasylLazebnyk VasylLazebnyk added the not-ready-for-work still need to gather info or make decisions before we start work label Dec 23, 2021
@sherifnada
Contributor

@augan-rymkhan can you do one more test with a very high volume of data? I suspect that for a sufficiently high volume of data in a single stream, BULK is faster than REST. It would be useful to know what that threshold is.

My guess is we may want to follow this strategy: for the initial sync, always use BULK; for incremental syncs, use REST. Just a guess though.

@augan-rymkhan
Contributor Author

augan-rymkhan commented Dec 27, 2021

@sherifnada

  • I tried to import 1 million records into the Note stream via the BULK API, but it created only about 10K records because we reached the storage limit of the Salesforce account.

  • Tested REST vs BULK performance.
    The BULK API run took 28 s and read 10,101 records from the Note stream.

There was an issue with pagination in the REST API version: it reads only 1,000 records from some streams. I fixed it locally and then ran the test. The REST API run took 14 s and read 10,101 records from the Note stream.
(I created an issue to resolve it and am waiting for the team's approval to complete and merge the fix.)

10K records is not enough to compare them, but we reached the limit. Is it possible to increase the storage limit?
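The pagination problem mentioned above fits the standard Salesforce REST query contract: each response carries `records`, a `done` flag, and, when more pages exist, a `nextRecordsUrl` to follow; a client that stops after the first response only sees the first page. A minimal sketch of the page-following logic (the `fetch_next` callback stands in for an authenticated HTTP GET against the instance URL; the function name is hypothetical):

```python
# Hedged sketch: accumulate records across REST query pages by following
# nextRecordsUrl until the response reports done=True.

def collect_pages(first_page: dict, fetch_next) -> list:
    """Return all records, following nextRecordsUrl until done is true."""
    records = list(first_page["records"])
    page = first_page
    while not page["done"]:
        page = fetch_next(page["nextRecordsUrl"])
        records.extend(page["records"])
    return records
```

Without the loop, streams larger than one page appear truncated, which matches the "only 1,000 records from some streams" symptom.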

@sherifnada
Contributor

@augan-rymkhan what storage limit are you referring to? is it on salesforce side? feel free to increase it as needed.

@augan-rymkhan
Contributor Author

@sherifnada
Yes, we need to increase data storage in Salesforce account. I've found this:
You can increase your data storage by buying additional data storage. It can be purchased on Your Account or by contacting your Account representative.
https://help.salesforce.com/s/articleView?id=000334051&type=1
I think we need help from your side here.

@sherifnada
Contributor

@augan-rymkhan feel free to use the connectors credit card to purchase more quota, find it in lastpass

@sherifnada
Contributor

also @augan-rymkhan I think I don't follow why we need to increase the storage limit. Didn't we already write 1 million records which we read via the batch API? why do we need to store things separately for the REST API?

@augan-rymkhan
Contributor Author

@sherifnada I don't think we have 1 million records for any one stream. The idea was to import more rows into one stream and then compare BULK vs REST performance, not to test only the REST API.

@sherifnada
Contributor

sorry I misread your earlier message about 10k records as 10M -- nvm me

@augan-rymkhan
Contributor Author

augan-rymkhan commented Jan 5, 2022

@sherifnada I have submitted a request to join the Partner program, because it allows us to use the Partner Developer Edition.

| Edition | API calls per day | Storage |
| --- | --- | --- |
| Developer Edition (our current account) | 15,000 | 20 MB |
| Partner Developer Edition | 50,000 | 250 MB |

For more info go here.

A Partner Developer Edition is a licensed version of the free Developer Edition that includes more storage, features, and licenses. Partner Developer Editions are free to enrolled Salesforce partners.

