Performance issues in new Mongo Source Connector #6544

Closed
JCWahoo opened this issue Sep 29, 2021 · 13 comments · Fixed by #7982 or #8161

@JCWahoo

JCWahoo commented Sep 29, 2021

Environment

  • Airbyte version: 0.29.21-alpha
  • OS Version / Instance: AWS EC2
  • Deployment: Docker
  • Source Connector and version: Mongodb-v2 0.1.1
  • Destination Connector and version: Redshift 0.3.14
  • Severity: High
  • Step where error happened: New Connector + Sync

Current Behavior

When setting up a new source with this connector, schema discovery takes close to 50 minutes and appears to scan entire collections in the source Mongo database.

When syncing records, an incremental load of one stream/collection with fewer than 10k records takes more than 1 hour. By comparison, the old Mongo connector refreshes 20 streams/collections in around 8 minutes. The new connector seems to scan the entire collection, in a very different manner from the old Ruby source.
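
For reference, a cursor-based incremental read is normally expected to translate into a range filter on the cursor field, which MongoDB can serve from an index instead of a collection scan. The sketch below only illustrates that expectation and is not the connector's actual code: the helper name `build_incremental_query` is hypothetical, and the field name and cursor value in the usage lines are taken from the incremental sync log in this issue. It builds plain dictionaries in MongoDB query syntax, so it runs without a database.

```python
from datetime import datetime, timezone


def build_incremental_query(cursor_field, last_cursor):
    """Return (filter, sort) for reading only documents newer than the saved cursor.

    Hypothetical helper: it illustrates the query shape, not the connector's code.
    """
    # Only documents whose cursor value is strictly greater than the saved one.
    query_filter = {cursor_field: {"$gt": last_cursor}}
    # Ascending order, so the last document read carries the new cursor value.
    sort = [(cursor_field, 1)]
    return query_filter, sort


# Cursor field and original cursor as reported in the incremental sync log.
last_cursor = datetime(2021, 9, 24, 17, 14, 22, tzinfo=timezone.utc)
query_filter, sort = build_incremental_query("_updated_at", last_cursor)
print(query_filter)
print(sort)
```

With pymongo, these values could be passed as `collection.find(query_filter).sort(sort)`. Whether such a read avoids a full scan depends on an index existing on `_updated_at`, which is an assumption here and not something confirmed in this issue.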

Expected Behavior

Performance comparable to the old connector, with no 50-minute delay in retrieving records.

Logs

Attaching logs from the initial full sync. Note the 50-minute gap before records are returned.

Also including logs from the next incremental sync, which show the same gap.

LOG from initial Full Sync

2021-09-24 15:09:24 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-24 15:09:24 INFO i.a.i.d.j.c.s.S3StreamCopier(<init>):142 - {} - S3 upload part size: 10 MB
2021-09-24 15:09:25 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-24 15:09:25 INFO c.m.d.l.SLF4JLogger(info):71 - {} - Opened connection [connectionId{localValue:3, serverValue:44254}] to sufferfestproduction-shard-00-03-naqnz.mongodb.net:27017
2021-09-24 15:09:25 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-24 15:09:25 INFO a.m.s.StreamTransferManager(getMultiPartOutputStreams):329 - {} - Initiated multipart upload to wahoo-rivery/8e240244-550f-4dfb-b8f1-e0fb3cafc251/parse/svl_Activity with full ID nd8i3Mpxr3A16tt81CwVaJyr_9vQSwG55mhvr3PaEYsAU90IrZZrsRYKhurCy5CiHuZSkjyE49BWFC92Y1pNI8k0GxFR_DtpbvLnO4ABkDN3olwES2URygz1KzaSqJqw
2021-09-24 15:09:25 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-24 15:09:25 INFO i.a.i.d.b.BufferedStreamConsumer(startTracked):143 - {} - class io.airbyte.integrations.destination.buffered_stream_consumer.BufferedStreamConsumer started.
2021-09-24 15:59:39 INFO () DefaultReplicationWorker(lambda$getReplicationRunnable$2):223 - Records read: 1000

LOG from incremental sync

2021-09-24 18:02:03 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-24 18:02:03 INFO i.a.i.d.b.BufferedStreamConsumer(startTracked):143 - {} - class io.airbyte.integrations.destination.buffered_stream_consumer.BufferedStreamConsumer started.
2021-09-24 18:52:01 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-24 18:52:01 INFO i.a.i.s.r.StateDecoratingIterator(computeNext):80 - {} - State Report: stream name: AirbyteStreamNameNamespacePair{name='Activity', namespace='parse'}, original cursor field: _updated_at, original cursor 2021-09-24T17:14:22Z, cursor field: _updated_at, new cursor: 2021-09-24T18:51:11Z
2021-09-24 18:52:01 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-24 18:52:01 INFO i.a.i.s.r.AbstractDbSource(lambda$read$2):141 - {} - Closing database connection pool.
2021-09-24 18:52:01 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-24 18:52:01 INFO i.a.i.s.r.AbstractDbSource(lambda$read$2):143 - {} - Closed database connection pool.
2021-09-24 18:52:01 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-24 18:52:01 INFO i.a.i.b.IntegrationRunner(run):153 - {} - Completed integration: io.airbyte.integrations.source.mongodb.MongoDbSource
2021-09-24 18:52:01 INFO () DefaultAirbyteStreamFactory(lambda$create$0):73 - 2021-09-24 18:52:01 INFO i.a.i.s.m.MongoDbSource(main):84 - {} - completed source: class io.airbyte.integrations.source.mongodb.MongoDbSource
2021-09-24 18:52:02 INFO () DefaultReplicationWorker(run):141 - Source thread complete.
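
The gap in both logs can be measured directly from the leading timestamps. The following is a minimal sketch, assuming each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp as in the excerpts above; the function name `largest_gap_seconds` is hypothetical, and the sample lines are abbreviated copies of the log lines on either side of each gap.

```python
import re
from datetime import datetime

# Matches the timestamp at the start of a log line.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def largest_gap_seconds(lines):
    """Return the largest gap, in seconds, between consecutive log line timestamps."""
    stamps = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:  # skip lines that do not start with a timestamp
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]
    return max(gaps, default=0.0)


# Abbreviated copies of the lines around the gap in each log.
full_sync = [
    "2021-09-24 15:09:25 INFO (abbreviated) BufferedStreamConsumer started.",
    "2021-09-24 15:59:39 INFO (abbreviated) Records read: 1000",
]
incremental_sync = [
    "2021-09-24 18:02:03 INFO (abbreviated) BufferedStreamConsumer started.",
    "2021-09-24 18:52:01 INFO (abbreviated) State Report",
]
print(largest_gap_seconds(full_sync) / 60)         # gap in minutes
print(largest_gap_seconds(incremental_sync) / 60)  # gap in minutes
```

Both values come out at roughly 50 minutes, matching the delay described in the report.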

Steps to Reproduce

  1. Set up a new MongoDB connection and connect a Redshift destination. Notice the delay in schema discovery and observe long-running full scans on the MongoDB instance.
  2. Execute a full or incremental load of any stream from the new Mongo connector to the destination.

Are you willing to submit a PR?

Unfortunately, I cannot at this time.

@JCWahoo JCWahoo added the type/bug Something isn't working label Sep 29, 2021
@sherifnada sherifnada added the area/connectors Connector related issues label Sep 29, 2021
@marcosmarxm marcosmarxm added the priority/high High priority label Sep 29, 2021
@marcosmarxm
Member

Hi @JCWahoo, I was able to sync the sample_training dataset in 8 min. Can you give more context about your dataset, such as row counts and size in MB?
image

@JCWahoo
Author

JCWahoo commented Oct 4, 2021 via email

@JCWahoo
Author

JCWahoo commented Oct 11, 2021

The old connector runs around 16 collections hourly in ~5 minutes, ~10k records. It doesn't have the same issues with schema discovery or the initial sync.

@JCWahoo
Author

JCWahoo commented Oct 18, 2021

Tried the latest version and the issue is still present.

@JCWahoo
Author

JCWahoo commented Oct 25, 2021

Are there any updates that can be shared?

@JCWahoo
Author

JCWahoo commented Nov 3, 2021

Same behavior in the latest version, 0.1.3.

@marcosmarxm
Member

@JCWahoo, sorry for not answering you sooner. I'll pass this on to the connector team.

@JCWahoo
Author

JCWahoo commented Nov 29, 2021

I upgraded both the Strict and the V2 connectors and see no change in performance.

@andriikorotkov
Contributor

@JCWahoo, could you please provide more details so that we can improve this for you? Our team tested the new changes on multiple MongoDB databases with multiple collections and different field types. We currently see a speedup in the step that retrieves collections and their fields. I am attaching a video that shows how fast it works. The test was carried out on a database with one collection that has many different fields and 10 thousand records.

mongodbtest.mp4

@JCWahoo
Author

JCWahoo commented Nov 30, 2021

Yes, I provided those stats previously. The database has around 40-50 collections, some of which have upwards of 100 million records. The old connector returns the schema in ~60-90 seconds. The new connector returns the schema in around 30 minutes, if at all. Looking at the Mongo side, the new connector appears to parse the entire collection for schema discovery, as opposed to just a 10,000-record sample. I recommend testing on larger collections and watching real-time monitoring on the Mongo side to see what I'm talking about.

I'm currently running a nightly incremental sync of 100k-150k records using the old connector in about 20 minutes. With the new connector, I cannot complete an incremental sync in less than an hour, even when testing a single stream and performing the incremental sync immediately after a full refresh.
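
To make the sampling approach concrete, here is a minimal sketch of schema discovery over a bounded sample instead of a full collection. It is not the old or new connector's implementation: `sample_pipeline` and `infer_schema` are hypothetical helper names, the 10,000 default mirrors the sample size mentioned above, and the documents in the usage lines are made-up stand-ins. The pipeline uses MongoDB's `$sample` aggregation stage; the inference step runs on plain dictionaries, so no database is needed to try it.

```python
def sample_pipeline(sample_size=10_000):
    """Aggregation pipeline that asks MongoDB for a random sample of documents."""
    return [{"$sample": {"size": sample_size}}]


def infer_schema(documents):
    """Map each top-level field to the sorted list of Python type names seen for it."""
    seen = {}
    for doc in documents:
        for field, value in doc.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Made-up stand-in for sampled documents; with pymongo these would come from
# collection.aggregate(sample_pipeline()).
sampled = [
    {"_id": 1, "name": "ride", "distance": 12.5},
    {"_id": 2, "name": "run", "distance": 5},
]
print(sample_pipeline())
print(infer_schema(sampled))
```

The trade-off of sampling is that fields appearing only in rare documents can be missed, so the sample size is a tuning choice and not a fixed rule.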

@andriikorotkov andriikorotkov reopened this Dec 1, 2021
@TSkrebe
Contributor

TSkrebe commented Dec 3, 2021

+1. I didn't have an opportunity to try the old connector, but the new one feels slow. It took around 10-15 minutes to return the schema for a database with 10 collections and around 13MM docs.

@alexandr-shegeda alexandr-shegeda moved this to Implementation in progress in GL Roadmap Dec 17, 2021
@andriikorotkov
Contributor

@JCWahoo, @TSkrebe - We released a new version, 0.1.19. We tested these changes on 20 collections containing from 110 to 250 thousand documents. Please try the newer version and report your results.

@JCWahoo
Author

JCWahoo commented Dec 17, 2021 via email

@andriikorotkov andriikorotkov moved this from Implementation in progress to On hold in GL Roadmap Dec 20, 2021
@alexandr-shegeda alexandr-shegeda moved this from On hold to Done in GL Roadmap Jan 11, 2022