Source FB Marketing: improve insights jobs reliability & runtime #8282

Closed
sherifnada opened this issue Nov 29, 2021 · 3 comments · Fixed by #9805

Comments

@sherifnada
Contributor

Tell us about the problem you're trying to solve

Placeholder for https://github.com/airbytehq/oncall/issues/32 and https://github.com/airbytehq/oncall/issues/31

The *_insights streams for FB provide a suboptimal user experience for a few reasons:

  1. They sometimes take a very long time to finish.
  2. If a job takes "too long" to finish (something like 3 hours as defined by the connector code) we assume it will not succeed and we fail the job/sync.
  3. It is unpredictable before creating a job whether it will succeed or fail. Holding the tuning parameters constant doesn't seem to be the answer as it seems different accounts work with different params depending on the data they have.

Right now, our approach for tuning jobs is: come up with a tuning configuration / selection of columns and cross our fingers that it will work. If it does, we call it case closed (but don't necessarily learn why it failed previously). If it fails or takes too long, there is no clear answer as to what went wrong. It seems really wrong that we're having these issues with one of the biggest ad platforms in the world. We really need to rethink how we're interacting with the FB API.

Some questions:

  1. Is there any way we can predict if jobs will fail / take too long?
  2. We currently assume that if a job takes too long, it is stuck and will fail. Is this a correct assumption? Is the right reaction to fail the job? Or should we just wait until it succeeds because that's the best we can do?
  3. If we notice certain jobs failing, is there anything we can do dynamically inside the connector to make them succeed? Maybe submit them with different parameters? maybe submit multiple jobs which each collect a subset of the fields, then merge them into one?
  4. Are there any other API users whose experience we can learn from? we must be reinventing the wheel here. Do the FB developer forums, stack overflow, or the internet contain any useful examples of how we can solve this problem?

A fantastic outcome of this investigation would be:

  1. We find a way to deterministically make jobs from FB Marketing succeed, no matter the account or the amount of data involved.
  2. After 1 is achieved, we find a way to make these jobs a lot faster. You have full creative freedom over how we can do this. Throw the whole CDK away and start over with assembly code if need be. (ok, maybe not assembly code :) )

Basically: make it work, then make it fast.

The linked issues at the beginning of the ticket contain more information about the Airbyte users facing this problem and where to find their instances.

@sherifnada sherifnada added type/enhancement New feature or request area/connectors Connector related issues labels Nov 29, 2021
@sherifnada sherifnada added this to the Connectors Dec 10 2021 milestone Nov 29, 2021
@sherifnada sherifnada changed the title Source FB Marketing: improve runtime Source FB Marketing: improve insights jobs reliability & runtime Nov 29, 2021
@avida avida self-assigned this Nov 30, 2021
@avida
Contributor

avida commented Nov 30, 2021

This is an investigation of the Facebook Marketing performance issue, based on logs provided by a customer.

Here is a plot built from the number of records produced over the course of the read operation:

[chart: total record count over time during the read operation]
The x axis is the time of the operation and the y axis is the total record count. Vertical dotted red lines represent failures (job stopped with "Job Failed" status) and vertical dotted green lines represent retries (job stuck in "Job Running" status for ~30 minutes). Each read operation covers a date range of 1 day.

Based on this plot and the logs we can draw some conclusions:

  1. User data is distributed evenly across the time range.
  2. The ads_insights stream is fairly fast compared to the other (ad_sets and ads) streams.
  3. Each run (in this case) with a 1-day range takes 2-5 minutes to complete, even when the same request had previously failed (see the 11:34 failure).
  4. Restarting a stuck request without waiting for it to complete makes things worse (at 13:35 we started a vicious loop of spawning jobs, so it could hit the throttle limit). This took 2 hours out of the 4-hour total run.

How Ads Insights is implemented now

We use an async call to schedule at most 10 async jobs with the minimal date range (1 day), so at any time we have jobs running for 10 days ahead. Results are read out sequentially, starting from the earliest day, and the next job is appended to the queue after the first job completes. If a job fails it is scheduled again, up to 5 times. If a job has not completed within ~30 minutes of being scheduled, it is considered failed and rescheduled. A rough sketch of this loop follows.
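
This is an illustrative sketch only (not the actual connector code); `start_async_job`, `get_job_status`, and `read_job_result` are hypothetical helpers standing in for the Facebook SDK calls:

```python
import time
from collections import deque
from datetime import date, timedelta

MAX_RUNNING_JOBS = 10   # sliding window of in-flight async jobs
MAX_ATTEMPTS = 5        # a failed job is rescheduled up to 5 times
JOB_TIMEOUT = 30 * 60   # after ~30 minutes a job is treated as stuck


def read_insights(start: date, end: date):
    """Yield insights day by day, keeping up to 10 one-day async jobs in flight."""
    days = deque(start + timedelta(n) for n in range((end - start).days + 1))
    running = deque()  # scheduled jobs, earliest day first

    def schedule(day, attempt=1):
        # start_async_job is a hypothetical wrapper around the FB async insights call
        running.append({"day": day, "job": start_async_job(day),
                        "started": time.time(), "attempt": attempt})

    while days and len(running) < MAX_RUNNING_JOBS:
        schedule(days.popleft())

    while running:
        head = running[0]  # results are read strictly in chronological order
        status = get_job_status(head["job"])
        if status == "Job Completed":
            running.popleft()
            yield from read_job_result(head["job"])
            if days:                      # top the window back up to 10 jobs
                schedule(days.popleft())
        elif status == "Job Failed" or time.time() - head["started"] > JOB_TIMEOUT:
            if head["attempt"] >= MAX_ATTEMPTS:
                raise RuntimeError(f"insights job for {head['day']} keeps failing")
            running.popleft()
            # the stuck job is never cancelled, so it keeps consuming API capacity
            running.appendleft({"day": head["day"],
                                "job": start_async_job(head["day"]),
                                "started": time.time(),
                                "attempt": head["attempt"] + 1})
        else:
            time.sleep(10)  # "Job Running" / "Job Started": poll again shortly
```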

Pros:

  • Simple implementation.
  • All records are processed sequentially.
  • We still have 10 async jobs running, which is better than one (as long as the rate limit is not reached).

Cons:

  • Even if a later job is already ready, we still have to wait until the first job is processed.
  • If a job's status is not updated for 30 minutes, it is considered failed and the job is restarted. But we have no way to cancel the stuck job, so it keeps consuming resources and causes the next jobs to be throttled, creating a vicious cycle of spawning new jobs.
  • We keep a constant number of 10 jobs running asynchronously without any feedback on how efficiently we are using the rate limit.
  • We restart stuck jobs, so eventually there can be more than 10 running jobs and the situation can run out of control.

Proposed solutions

Improve the existing approach

The main downside of the existing approach is that it waits for the current job to complete before proceeding to the next jobs, even though they may already have completed. I'm not sure whether this is a serious downside, because according to this there are limits on the number of rows in a response and on the number of data points required to compute the total, but no mention of a limit on the number of active jobs (besides the fact that a job id expires in 30 days).

To improve it we need:

  1. Do not give up on stuck jobs. The main reason the connector runs too long is that spawning extra jobs leads to throttling.
  2. Each time we update a job's status, do it for all jobs, and schedule a new job whenever one of the running jobs completes.
  3. Dynamically determine the number of simultaneously running jobs by reading the limit utilization from the x-fb-ads-insights-throttle header (it contains JSON with an app_id_util_pct field representing the total capacity consumed, from 0 to 100); see the sketch below.
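
A minimal sketch of point 3, assuming the connector can see the raw response headers; the 50/90 thresholds are made up for illustration:

```python
import json

MAX_JOBS = 10  # current fixed window size
MIN_JOBS = 1


def allowed_parallel_jobs(response_headers: dict) -> int:
    """Decide how many async jobs to keep in flight based on the
    x-fb-ads-insights-throttle header (app_id_util_pct ranges from 0 to 100)."""
    throttle = json.loads(response_headers.get("x-fb-ads-insights-throttle", "{}"))
    utilization = throttle.get("app_id_util_pct", 0)
    if utilization >= 90:     # almost out of capacity: fall back to a single job
        return MIN_JOBS
    if utilization >= 50:     # getting close: shrink the window
        return MAX_JOBS // 2
    return MAX_JOBS           # plenty of headroom: keep the full window
```

The window size would then be re-evaluated on every status poll, so the connector backs off in response to throttling instead of blindly keeping 10 jobs running.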

I'm not sure how this would work with small sets of data; it needs additional investigation.

Only one async job at a time

We could run only one async job at a time with an adjustable date range, based on the X-Business-Use-Case-Usage header, which contains information on the amount of resources used by the previous request. This is the easiest option to implement and should work fine for both large and small sets of data.
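
A minimal sketch of the adjustable date range, assuming a `usage_pct` value already extracted from that header; the thresholds and window bounds are illustrative:

```python
from datetime import timedelta

MIN_WINDOW = timedelta(days=1)
MAX_WINDOW = timedelta(days=30)


def next_window(current: timedelta, usage_pct: float) -> timedelta:
    """Grow the date range of the next (single) async job while API usage is low,
    and shrink it when the previous request reported heavy usage."""
    if usage_pct >= 75:                    # previous job was expensive
        return max(current // 2, MIN_WINDOW)
    if usage_pct <= 25:                    # previous job was cheap: cover more days
        return min(current * 2, MAX_WINDOW)
    return current                         # keep the current window size
```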

A lot of async jobs but no particular read order

This is similar to the existing approach, but instead of reading results sequentially, we process whichever job completes first and spawn the next one. It would require additional logic to keep the state consistent; a sketch follows.
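
A minimal sketch of this variant, reusing the hypothetical `start_async_job`, `get_job_status`, and `read_job_result` helpers from above; tracking state per slice is what makes completing days out of order safe:

```python
import time
from datetime import date, timedelta

MAX_RUNNING_JOBS = 10


def read_insights_unordered(start: date, end: date):
    """Yield results from whichever one-day job finishes first and immediately
    start the next pending day, instead of draining a queue in order."""
    pending = [start + timedelta(n) for n in range((end - start).days + 1)]
    running = {}  # day -> async job handle

    while pending or running:
        # keep the window full
        while pending and len(running) < MAX_RUNNING_JOBS:
            day = pending.pop(0)
            running[day] = start_async_job(day)

        for day, job in list(running.items()):
            if get_job_status(job) == "Job Completed":
                del running[day]
                # state has to be tracked per slice (per day),
                # since days can now complete in any order
                yield day, list(read_job_result(job))
        time.sleep(10)  # poll interval
```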

@keu
Contributor

keu commented Jan 20, 2022

I have tested the improved connector on short and long intervals (up to 1 year).
The new code shows a 10x+ improvement in speed. In terms of reliability, the new code succeeded in reading all intervals and in recovering from failures, thanks to the new state logic (state per slice).

There are still minor defects that need to be fixed:

@keu keu linked a pull request Jan 26, 2022 that will close this issue
@keu
Contributor

keu commented Jan 26, 2022

Created a new PR to have a better overview of the final changes: #9805
