Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2 minute pause at GATK startup due to NIO library #3491

Open
chandrans opened this issue Aug 23, 2017 · 18 comments
Open

2 minute pause at GATK startup due to NIO library #3491

chandrans opened this issue Aug 23, 2017 · 18 comments

Comments

@chandrans
Copy link
Contributor

chandrans commented Aug 23, 2017

User has reported longer runtimes in GATK4 beta3 release compared to GATK4 beta 2 release. It sounds like this is not expected. Her runtimes are below. The first post in the forum thread has her original report.

Tool 4.beta.2 4.beta.3
BaseRecalibrator 1m 3s 3m 3s
ApplyBQSR (scattered) 4m 48s 11m 51s
HaplotypeCaller (scattered) 23m 42s 29m 7s
GenotypeGVCFs (scattered) 4m 6s 9m 28s
VariantRecalibrator (for SNPs) 4m 7s 6m 38s
VariantRecalibrator (for INDELs) 2m 7s 4m 8s
ApplyVQSR (for SNPs) 37s 2m 36s
ApplyVQSR (for INDELs) 39s 2m 35s

This Issue was generated from your [forums]
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/comment/41669#Comment_41669

@droazen droazen added this to the Engine-4.0 milestone Aug 23, 2017
@droazen droazen changed the title GATK 4.beta.2 vs. 4.beta.3 performance Possible GATK 4.beta.2 vs. 4.beta.3 performance regression Aug 23, 2017
@lbergelson
Copy link
Member

A quick run off between 4.beta.2 BaseRecalibrator and 4.beta.3 doesn't show any difference to me. It must be either some difference in the user's command line, their data, or possibly a change in their environment over time. I'm going to request that they upload example commandline and test files that reproduce the problem.

@droazen
Copy link
Contributor

droazen commented Aug 23, 2017

It could just be natural variability in the user's runtime environment, but it's worth doing some longer-running tests to be sure.

@chandrans
Copy link
Contributor Author

Great. Thank you both. Louis, I can request the user upload data and get back to you guys on here. Thanks for looking into this.

@lbergelson
Copy link
Member

I just asked them.

@chandrans
Copy link
Contributor Author

Ah, I see that. Thanks Louis.

@chandrans
Copy link
Contributor Author

User came back and confirmed there is a 2 minute difference, but did not upload data. I just asked again with instructions to upload.

@lbergelson
Copy link
Member

User is reporting a nearly exact 2 minute pause at tool startup. Seems very suspicious, possibly some sort of gcs operation trying and timing out?

14:33:39.416 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/gatk-package-4.beta.3-local.jar!/com/intel/gkl/native/libgkl_compression.so
14:35:46.843 INFO  BaseRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 5
14:35:46.843 INFO  BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
14:35:46.843 INFO  BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
14:35:46.844 INFO  BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
14:35:46.844 INFO  BaseRecalibrator - Deflater: IntelDeflater
14:35:46.844 INFO  BaseRecalibrator - Inflater: IntelInflater
14:35:46.844 INFO  BaseRecalibrator - GCS max retries/reopens: 20
14:35:46.844 INFO  BaseRecalibrator - Using google-cloud-java patch 317951be3c2e898e3916a4b1abf5a9c220d84df8
14:35:46.844 INFO  BaseRecalibrator - Initializing engine

@chandrans
Copy link
Contributor Author

Hey Louis. Thanks for helping. She just came back and gave exact commands and some more information, basically it only occurs on the cloud and not on her local machine. I think she cannot upload her data

@lbergelson
Copy link
Member

With the help of @erniebrau we've reproduced this and gotten a useful stack trace:

"main" #1 prio=5 os_prio=0 tid=0x00007f0bc801f800 nid=0x226b runnable [0x00007f0bcdffd000]
   java.lang.Thread.State: RUNNABLE
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	- locked <0x0000000584a63348> (a java.net.SocksSocketImpl)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at java.net.Socket.connect(Socket.java:538)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	- locked <0x0000000584a62640> (a sun.net.www.http.HttpClient)
	at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
	at sun.net.www.http.HttpClient.New(HttpClient.java:308)
	at sun.net.www.http.HttpClient.New(HttpClient.java:326)
	at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1202)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
	at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:966)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1546)
	- locked <0x0000000584a60148> (a sun.net.www.protocol.http.HttpURLConnection)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
	- locked <0x0000000584a60148> (a sun.net.www.protocol.http.HttpURLConnection)
	at com.google.cloud.MetadataConfig.getAttribute(MetadataConfig.java:65)
	at com.google.cloud.MetadataConfig.getProjectId(MetadataConfig.java:41)
	at com.google.cloud.ServiceOptions.getGoogleCloudProjectId(ServiceOptions.java:376)
	at com.google.cloud.ServiceOptions.getDefaultProjectId(ServiceOptions.java:311)
	at com.google.cloud.ServiceOptions.getDefaultProject(ServiceOptions.java:284)
	at com.google.cloud.ServiceOptions.<init>(ServiceOptions.java:238)
	at com.google.cloud.storage.StorageOptions.<init>(StorageOptions.java:82)
	at com.google.cloud.storage.StorageOptions.<init>(StorageOptions.java:30)
	at com.google.cloud.storage.StorageOptions$Builder.build(StorageOptions.java:77)
	at org.broadinstitute.hellbender.utils.gcs.BucketUtils.setGlobalNIODefaultOptions(BucketUtils.java:361)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:155)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:195)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
	at org.broadinstitute.hellbender.Main.main(Main.java:233)

I'm still not sure what condition triggers it, but at least we know where it's happening now.

@lbergelson lbergelson changed the title Possible GATK 4.beta.2 vs. 4.beta.3 performance regression Confirmed 4.beta.2 vs. 4.beta.3 performance regression Sep 19, 2017
@droazen droazen assigned lbergelson and unassigned lbergelson Oct 17, 2017
@droazen droazen modified the milestones: Engine-4.0, Engine-4.1 Jan 16, 2018
@droazen
Copy link
Contributor

droazen commented Oct 4, 2018

Closing as obsolete.

@droazen droazen closed this as completed Oct 4, 2018
@lbergelson lbergelson reopened this Oct 4, 2018
@lbergelson
Copy link
Member

I don't think this is obsolete and we haven't fixed it yet.

@droazen
Copy link
Contributor

droazen commented Oct 5, 2018

@lbergelson Assigning to you, in that case

@droazen droazen removed their assignment Oct 5, 2018
@droazen droazen changed the title Confirmed 4.beta.2 vs. 4.beta.3 performance regression 2 minute pause at startup due to NIO library Oct 5, 2018
@droazen droazen changed the title 2 minute pause at startup due to NIO library 2 minute pause at GATK startup due to NIO library Oct 5, 2018
@droazen droazen removed this from the Engine-2Q2018 milestone Oct 5, 2018
@droazen
Copy link
Contributor

droazen commented Oct 5, 2018

Also renamed this ticket to be less scary and more precise, since we know it's a 2-minute pause in the NIO library. It clearly doesn't always happen, though, as I don't think I've ever seen it.

@jean-philippe-martin
Copy link
Contributor

This seems to happen in the cloud auth layers, which I don't control.

One potential workaround would be to add a command-line option to disable GCS support. This would only help the original reporter if they don't use GCS paths, of course. Is this something we think may be worth doing at all?

@droazen
Copy link
Contributor

droazen commented Oct 5, 2018

@jean-philippe-martin It's not clear to me how widespread this issue is, or what conditions trigger it -- @lbergelson care to comment?

@lbergelson
Copy link
Member

I think it triggers in certain situations where a firewall is blocking the connection. If the internet is simply unreachable it doesn't happen, so I don't know what the exact error case is. It happened consistently for people inside Intel's firewall or vpn.

An option to disable gcs support isn't a bad idea, it's kind of a hack though, it would be better if we could understand and avoid triggering the problem. If we could only initialize GCS support when we are sure that we actually are accessing files from google that could be a useful, but it doesn't seem like there's any single point we can plug into to detect that, it would have to be spread over everything that uses paths.

@kshakir
Copy link
Contributor

kshakir commented Dec 3, 2018

Here's a stack trace of the area I think the two minute wait may be occurring.

The below example fails-fast and prints out stack trace when there is no internet. I suspect that the slow-and-quiet alternative occurs when the connection to google is blocked vs. completely unavailable.

Dec 02, 2018 7:50:25 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
WARNING: Failed to detect whether we are running on Google Compute Engine.
java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
    at sun.net.www.http.HttpClient.New(HttpClient.java:339)
    at sun.net.www.http.HttpClient.New(HttpClient.java:357)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
    at shaded.cloud_nio.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:104)
    at shaded.cloud_nio.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
    at shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials.runningOnComputeEngine(ComputeEngineCredentials.java:210)
    at shaded.cloud_nio.com.google.auth.oauth2.DefaultCredentialsProvider.tryGetComputeCredentials(DefaultCredentialsProvider.java:290)
    at shaded.cloud_nio.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentialsUnsynchronized(DefaultCredentialsProvider.java:207)
    at shaded.cloud_nio.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentials(DefaultCredentialsProvider.java:124)
    at shaded.cloud_nio.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:127)
    at shaded.cloud_nio.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:100)
    at com.google.cloud.ServiceOptions.defaultCredentials(ServiceOptions.java:304)
    at com.google.cloud.ServiceOptions.<init>(ServiceOptions.java:278)
    at com.google.cloud.storage.StorageOptions.<init>(StorageOptions.java:83)
    at com.google.cloud.storage.StorageOptions.<init>(StorageOptions.java:31)
    at com.google.cloud.storage.StorageOptions$Builder.build(StorageOptions.java:78)
    at org.broadinstitute.hellbender.utils.gcs.BucketUtils.setGlobalNIODefaultOptions(BucketUtils.java:382)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:183)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

Produced by pulling the docker image, shutting off the internet connection, mounting helloHaplotypeCaller, and running:

docker run \
  --rm \
  -v /Users/kshakir/Downloads/helloHaplotypeCaller:/data \
  broadinstitute/gatk:4.0.11.0 \
  gatk \
  HaplotypeCaller \
  -R /data/ref/human_g1k_b37_20.fasta \
  -I /data/inputs/NA12878_wgs_20.bam \
  -O test.vcf

Adding in a GOOGLE_APPLICATION_CREDENTIALS environment variable short circuits the above stack trace.

docker run \
  -e GOOGLE_APPLICATION_CREDENTIALS=whatever
  --rm \
  -v /Users/kshakir/Downloads/helloHaplotypeCaller:/data \
  broadinstitute/gatk:4.0.11.0 \
  gatk \
  HaplotypeCaller \
  -R /data/ref/human_g1k_b37_20.fasta \
  -I /data/inputs/NA12878_wgs_20.bam \
  -O test.vcf

@jean-philippe-martin
Copy link
Contributor

Thank you @kshakir. What I see there is that the code sets the default NIO option, and as part of this is creates a google cloud StorageOptions object. Sadly for us, when this object is created it determines which Google credentials to use, and if nothing was specified by the user it will send some network messages to try to figure out whether it's running on a Google Compute Engine machine.

When we wrote the default-setting code we didn't realize that setting the number of retries was going to cause a network message to be sent, with the associated potential retries and delays.

We can't change the way Google Compute Engine works, or how the Google authentication works either. Ideally we'd want some way to only search for credentials when we know NIO is going to be used. The point of these defaults is that they're used for anything that uses NIO, including third-party library code. We can't fully replicate this behavior in a different way from the outside.

So I think the "correct" fix would be to go deep inside the Google NIO library and change it so that instead of providing a default configuration (that the user would have to put together, causing the problem you've seen), we can provide a callback that sets the configuration when the Google Cloud NIO provider is loaded. This is harder for future developers to wrap their heads around, but at least it would prevent this delay if NIO is not used.

I'd like to think about this some more before doing something quite this drastic, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants