Source FB Marketing: improve insights jobs reliability & runtime #8282

Closed
sherifnada opened this issue Nov 29, 2021 · 3 comments · Fixed by #9805

Comments

@sherifnada
Contributor

Tell us about the problem you're trying to solve

Placeholder for https://github.com/airbytehq/oncall/issues/32 and https://github.com/airbytehq/oncall/issues/31

The *_insights streams for FB provide a suboptimal user experience for a few reasons:

  1. They sometimes take a very long time to finish.
  2. If a job takes "too long" to finish (something like 3 hours as defined by the connector code) we assume it will not succeed and we fail the job/sync.
  3. It is unpredictable before creating a job whether it will succeed or fail. Holding the tuning parameters constant doesn't seem to be the answer as it seems different accounts work with different params depending on the data they have.

Right now, our approach for tuning jobs is: come up with a tuning configuration / selection of columns and cross our fingers that it will work. If it does, we call it case closed (but don't necessarily learn why it failed previously). If it fails or takes too long, there is no clear answer as to what went wrong. It seems really wrong that we're having these issues with one of the biggest ad platforms in the world. We really need to rethink how we're interacting with the FB API.

Some questions:

  1. Is there any way we can predict if jobs will fail / take too long?
  2. We currently assume that if a job takes too long, it is stuck and will fail. Is this a correct assumption? Is the right reaction to fail the job? Or should we just wait until it succeeds because that's the best we can do?
  3. If we notice certain jobs failing, is there anything we can do dynamically inside the connector to make them succeed? Maybe submit them with different parameters? maybe submit multiple jobs which each collect a subset of the fields, then merge them into one?
  4. Are there any other API users whose experience we can learn from? we must be reinventing the wheel here. Do the FB developer forums, stack overflow, or the internet contain any useful examples of how we can solve this problem?

A fantastic outcome of this investigation would be:

  1. We find a way to deterministically make jobs from FB Marketing succeed, no matter the account or the amount of data involved.
  2. After 1 is achieved, we find a way to make these jobs a lot faster. You have full creative freedom over how we can do this. Throw the whole CDK away and start over with assembly code if need be. (ok, maybe not assembly code :) )

Basically: make it work, then make it fast.

The linked issues at the beginning of the ticket contain more information about the Airbyte users facing this problem and where to find their instances.

@sherifnada sherifnada added type/enhancement New feature or request area/connectors Connector related issues labels Nov 29, 2021
@sherifnada sherifnada added this to the Connectors Dec 10 2021 milestone Nov 29, 2021
@sherifnada sherifnada changed the title Source FB Marketing: improve runtime Source FB Marketing: improve insights jobs reliability & runtime Nov 29, 2021
@avida avida self-assigned this Nov 30, 2021
@avida
Contributor

avida commented Nov 30, 2021

This is an investigation of the Facebook Marketing performance issue, based on logs provided by a customer.

Here is a plot built from the number of records produced over the course of the read operation:

[chart: total record count over time during the read operation]
The x axis is the time of the operation and the y axis is the total record count. Vertical dotted red lines represent failures (job stopped with "Job Failed" status) and vertical dotted green lines represent retries (job stuck in "Job Running" status for ~30 minutes). Each read operation covers a date range of 1 day.

Based on this plot and the logs we can draw some conclusions:

  1. User data is distributed evenly across the time range.
  2. The ads_insights stream is fairly fast compared to the other (ad_sets and ads) streams.
  3. Each run (in this case) with a 1-day range takes 2-5 minutes to complete, even when the same request had previously failed (see the 11:34 failure).
  4. Restarting a stuck request without waiting for it to complete makes things worse (at 13:35 we started a vicious loop of spawning jobs, so it could hit the throttle limit). This took 2 hours out of the 4-hour total run.

How Ads Insights is implemented now

We use an async call to schedule at most 10 async jobs with the minimal date range (1 day), so at any time we have jobs running for 10 days ahead. Results are read out sequentially, starting from the earliest day, and the next job is appended to the queue after the first job completes. If a job fails it is scheduled again, up to 5 times. If a job has not completed within ~30 minutes of being scheduled, it is considered failed and rescheduled. A rough sketch of this loop follows.
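
This is an illustrative sketch only (not the actual connector code); `start_async_job`, `get_job_status`, and `read_job_result` are hypothetical helpers standing in for the Facebook SDK calls:

```python
import time
from collections import deque
from datetime import date, timedelta

MAX_RUNNING_JOBS = 10   # sliding window of in-flight async jobs
MAX_ATTEMPTS = 5        # a failed job is rescheduled up to 5 times
JOB_TIMEOUT = 30 * 60   # after ~30 minutes a job is treated as stuck


def read_insights(start: date, end: date):
    """Yield insights day by day, keeping up to 10 one-day async jobs in flight."""
    days = deque(start + timedelta(n) for n in range((end - start).days + 1))
    running = deque()  # scheduled jobs, earliest day first

    def schedule(day, attempt=1):
        # start_async_job is a hypothetical wrapper around the FB async insights call
        running.append({"day": day, "job": start_async_job(day),
                        "started": time.time(), "attempt": attempt})

    while days and len(running) < MAX_RUNNING_JOBS:
        schedule(days.popleft())

    while running:
        head = running[0]  # results are read strictly in chronological order
        status = get_job_status(head["job"])
        if status == "Job Completed":
            running.popleft()
            yield from read_job_result(head["job"])
            if days:                      # top the window back up to 10 jobs
                schedule(days.popleft())
        elif status == "Job Failed" or time.time() - head["started"] > JOB_TIMEOUT:
            if head["attempt"] >= MAX_ATTEMPTS:
                raise RuntimeError(f"insights job for {head['day']} keeps failing")
            running.popleft()
            # the stuck job is never cancelled, so it keeps consuming API capacity
            running.appendleft({"day": head["day"],
                                "job": start_async_job(head["day"]),
                                "started": time.time(),
                                "attempt": head["attempt"] + 1})
        else:
            time.sleep(10)  # "Job Running" / "Job Started": poll again shortly
```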

Pros:

  • Simple implementation.
  • All records are processed sequentially.
  • We still have 10 async jobs running, which is better than one (as long as the rate limit is not reached).

Cons:

  • Even if a later job is already ready, we still have to wait until the first job is processed.
  • If a job's status is not updated for 30 minutes, it is considered failed and the job is restarted. But we have no way to cancel the stuck job, so it keeps consuming resources and causes the next jobs to be throttled, creating a vicious cycle of spawning new jobs.
  • We keep a constant number of 10 jobs running asynchronously without any feedback on how efficiently we are using the rate limit.
  • We restart stuck jobs, so eventually there can be more than 10 running jobs and the situation can run out of control.

Proposed solutions

Improve the existing approach

The main downside of the existing approach is that it waits for the current job to complete before proceeding to the next jobs, even though they may already have completed. I'm not sure whether this is a serious downside, because according to this there are limits on the number of rows in a response and on the number of data points required to compute the total, but no mention of a limit on the number of active jobs (besides the fact that a job id expires in 30 days).

To improve it we need:

  1. Do not give up on stuck jobs. The main reason the connector runs too long is that spawning extra jobs leads to throttling.
  2. Each time we update a job's status, do it for all jobs, and schedule a new job whenever one of the running jobs completes.
  3. Dynamically determine the number of simultaneously running jobs by reading the limit utilization from the x-fb-ads-insights-throttle header (it contains JSON with an app_id_util_pct field representing the total capacity consumed, from 0 to 100); see the sketch below.
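
A minimal sketch of point 3, assuming the connector can see the raw response headers; the 50/90 thresholds are made up for illustration:

```python
import json

MAX_JOBS = 10  # current fixed window size
MIN_JOBS = 1


def allowed_parallel_jobs(response_headers: dict) -> int:
    """Decide how many async jobs to keep in flight based on the
    x-fb-ads-insights-throttle header (app_id_util_pct ranges from 0 to 100)."""
    throttle = json.loads(response_headers.get("x-fb-ads-insights-throttle", "{}"))
    utilization = throttle.get("app_id_util_pct", 0)
    if utilization >= 90:     # almost out of capacity: fall back to a single job
        return MIN_JOBS
    if utilization >= 50:     # getting close: shrink the window
        return MAX_JOBS // 2
    return MAX_JOBS           # plenty of headroom: keep the full window
```

The window size would then be re-evaluated on every status poll, so the connector backs off in response to throttling instead of blindly keeping 10 jobs running.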

I'm not sure how this would work with small sets of data; it needs additional investigation.

Only one async job at a time

We could run only one async job at a time with an adjustable date range, based on the X-Business-Use-Case-Usage header, which contains information on the amount of resources used by the previous request. This is the easiest option to implement and should work fine for both large and small sets of data.
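
A minimal sketch of the adjustable date range, assuming a `usage_pct` value already extracted from that header; the thresholds and window bounds are illustrative:

```python
from datetime import timedelta

MIN_WINDOW = timedelta(days=1)
MAX_WINDOW = timedelta(days=30)


def next_window(current: timedelta, usage_pct: float) -> timedelta:
    """Grow the date range of the next (single) async job while API usage is low,
    and shrink it when the previous request reported heavy usage."""
    if usage_pct >= 75:                    # previous job was expensive
        return max(current // 2, MIN_WINDOW)
    if usage_pct <= 25:                    # previous job was cheap: cover more days
        return min(current * 2, MAX_WINDOW)
    return current                         # keep the current window size
```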

A lot of async jobs but no particular read order

This is similar to the existing approach, but instead of reading results sequentially, we process whichever job completes first and spawn the next one. It would require additional logic to keep the state consistent; a sketch follows.
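
A minimal sketch of this variant, reusing the hypothetical `start_async_job`, `get_job_status`, and `read_job_result` helpers from above; tracking state per slice is what makes completing days out of order safe:

```python
import time
from datetime import date, timedelta

MAX_RUNNING_JOBS = 10


def read_insights_unordered(start: date, end: date):
    """Yield results from whichever one-day job finishes first and immediately
    start the next pending day, instead of draining a queue in order."""
    pending = [start + timedelta(n) for n in range((end - start).days + 1)]
    running = {}  # day -> async job handle

    while pending or running:
        # keep the window full
        while pending and len(running) < MAX_RUNNING_JOBS:
            day = pending.pop(0)
            running[day] = start_async_job(day)

        for day, job in list(running.items()):
            if get_job_status(job) == "Job Completed":
                del running[day]
                # state has to be tracked per slice (per day),
                # since days can now complete in any order
                yield day, list(read_job_result(job))
        time.sleep(10)  # poll interval
```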

@keu
Contributor

keu commented Jan 20, 2022

I have tested the improved connector on short and long intervals (up to 1 year).
The new code shows a 10x+ improvement in speed. In terms of reliability, the new code succeeded in reading all intervals and in recovering from failures, thanks to the new state logic (state per slice).

There are still minor defects that need to be fixed:

@keu keu linked a pull request Jan 26, 2022 that will close this issue
@keu
Contributor

keu commented Jan 26, 2022

Created a new PR to have a better overview of the final changes: #9805
