2 minute pause at GATK startup due to NIO library #3491
Comments
A quick run-off between the 4.beta.2 and 4.beta.3 BaseRecalibrator doesn't show any difference to me. It must be either some difference in the user's command line, their data, or possibly a change in their environment over time. I'm going to request that they upload an example command line and test files that reproduce the problem. |
It could just be natural variability in the user's runtime environment, but it's worth doing some longer-running tests to be sure. |
Great. Thank you both. Louis, I can request the user upload data and get back to you guys on here. Thanks for looking into this. |
I just asked them. |
Ah, I see that. Thanks Louis. |
User came back and confirmed there is a 2 minute difference, but did not upload data. I just asked again with instructions to upload. |
User is reporting a nearly exact 2 minute pause at tool startup. Seems very suspicious, possibly some sort of GCS operation trying and timing out? |
Hey Louis. Thanks for helping. She just came back and gave exact commands and some more information. Basically, it only occurs on the cloud and not on her local machine. I think she cannot upload her data. |
With the help of @erniebrau we've reproduced this and gotten a useful stack trace:
I'm still not sure what condition triggers it, but at least we know where it's happening now. |
Closing as obsolete. |
I don't think this is obsolete and we haven't fixed it yet. |
@lbergelson Assigning to you, in that case |
Also renamed this ticket to be less scary and more precise, since we know it's a 2-minute pause in the NIO library. It clearly doesn't always happen, though, as I don't think I've ever seen it. |
This seems to happen in the cloud auth layers, which I don't control. One potential workaround would be to add a command-line option to disable GCS support. This would only help the original reporter if they don't use GCS paths, of course. Is this something we think may be worth doing at all? |
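To make the workaround above concrete, here is a minimal sketch of what gating GCS setup behind an opt-out could look like. Everything in it is an assumption for illustration: the `GcsStartupGate` class, the `example.disableGcs` property name, and the `gcsSetup` callback are hypothetical and are not existing GATK options or code.

```java
import java.util.Properties;

class GcsStartupGate {
    // Hypothetical property name, used only for this illustration.
    static final String DISABLE_GCS_PROPERTY = "example.disableGcs";

    // GCS support stays enabled unless the property is explicitly set to "true".
    static boolean gcsEnabled(Properties props) {
        return !Boolean.parseBoolean(props.getProperty(DISABLE_GCS_PROPERTY, "false"));
    }

    // Runs the (possibly slow) GCS setup only when it has not been disabled.
    // Returns true if the setup ran, false if it was skipped.
    static boolean initializeIfEnabled(Properties props, Runnable gcsSetup) {
        if (!gcsEnabled(props)) {
            return false;
        }
        gcsSetup.run();
        return true;
    }
}
```

In a sketch like this, the startup code would pass whatever currently sets the NIO defaults as `gcsSetup`. As noted above, this only helps users who never touch GCS paths.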
@jean-philippe-martin It's not clear to me how widespread this issue is, or what conditions trigger it -- @lbergelson care to comment? |
I think it triggers in certain situations where a firewall is blocking the connection. If the internet is simply unreachable it doesn't happen, so I don't know what the exact error case is. It happened consistently for people inside Intel's firewall or VPN. An option to disable GCS support isn't a bad idea, but it's kind of a hack; it would be better if we could understand and avoid triggering the problem. If we could initialize GCS support only when we are sure that we are actually accessing files from Google, that could be useful, but there doesn't seem to be any single point we can plug into to detect that. It would have to be spread over everything that uses paths. |
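For illustration only, the simplest form of "are we actually accessing files from Google" is a scan of the command-line arguments for the `gs://` scheme. This is a sketch under that assumption; `GcsPathScan` is a hypothetical helper, not GATK code, and it has exactly the limitation described above: it cannot see paths that are resolved later inside the tool or inside third-party libraries.

```java
import java.util.List;

class GcsPathScan {
    // True when the argument looks like a Google Cloud Storage URI.
    static boolean isGcsPath(String arg) {
        return arg != null && arg.startsWith("gs://");
    }

    // True when at least one command-line argument is a gs:// path.
    static boolean anyGcsPath(List<String> args) {
        return args.stream().anyMatch(GcsPathScan::isGcsPath);
    }
}
```

A check like this could decide whether to run the GCS setup at startup, but it would silently miss any GCS path that does not appear literally in the arguments (for example, one listed inside an input file).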
Here's a stack trace of the area where I think the two-minute wait may be occurring. The example below fails fast and prints a stack trace when there is no internet. I suspect that the slow-and-quiet alternative occurs when the connection to Google is blocked rather than completely unavailable.
Dec 02, 2018 7:50:25 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
WARNING: Failed to detect whether we are running on Google Compute Engine.
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at sun.net.www.http.HttpClient.New(HttpClient.java:339)
at sun.net.www.http.HttpClient.New(HttpClient.java:357)
at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1220)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:984)
at shaded.cloud_nio.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:104)
at shaded.cloud_nio.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials.runningOnComputeEngine(ComputeEngineCredentials.java:210)
at shaded.cloud_nio.com.google.auth.oauth2.DefaultCredentialsProvider.tryGetComputeCredentials(DefaultCredentialsProvider.java:290)
at shaded.cloud_nio.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentialsUnsynchronized(DefaultCredentialsProvider.java:207)
at shaded.cloud_nio.com.google.auth.oauth2.DefaultCredentialsProvider.getDefaultCredentials(DefaultCredentialsProvider.java:124)
at shaded.cloud_nio.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:127)
at shaded.cloud_nio.com.google.auth.oauth2.GoogleCredentials.getApplicationDefault(GoogleCredentials.java:100)
at com.google.cloud.ServiceOptions.defaultCredentials(ServiceOptions.java:304)
at com.google.cloud.ServiceOptions.<init>(ServiceOptions.java:278)
at com.google.cloud.storage.StorageOptions.<init>(StorageOptions.java:83)
at com.google.cloud.storage.StorageOptions.<init>(StorageOptions.java:31)
at com.google.cloud.storage.StorageOptions$Builder.build(StorageOptions.java:78)
at org.broadinstitute.hellbender.utils.gcs.BucketUtils.setGlobalNIODefaultOptions(BucketUtils.java:382)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:183)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Produced by pulling the docker image, shutting off the internet connection, mounting helloHaplotypeCaller, and running:
docker run \
--rm \
-v /Users/kshakir/Downloads/helloHaplotypeCaller:/data \
broadinstitute/gatk:4.0.11.0 \
gatk \
HaplotypeCaller \
-R /data/ref/human_g1k_b37_20.fasta \
-I /data/inputs/NA12878_wgs_20.bam \
-O test.vcf
Adding in a GOOGLE_APPLICATION_CREDENTIALS environment variable:
docker run \
-e GOOGLE_APPLICATION_CREDENTIALS=whatever \
--rm \
-v /Users/kshakir/Downloads/helloHaplotypeCaller:/data \
broadinstitute/gatk:4.0.11.0 \
gatk \
HaplotypeCaller \
-R /data/ref/human_g1k_b37_20.fasta \
-I /data/inputs/NA12878_wgs_20.bam \
-O test.vcf |
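The difference between "fails fast" and "slow and quiet" suspected above matches how TCP connects generally behave: a refused connection returns an error immediately, while a connection whose packets are silently dropped waits until some timeout expires. The sketch below shows a connect attempt with an explicit upper bound, using only standard `java.net` calls. `BoundedProbe` is a hypothetical helper for illustration; it is not how the Google auth library performs its check, and the timeout value is up to the caller.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class BoundedProbe {
    // Attempts a TCP connect that gives up after timeoutMillis.
    // Returns false both when the connection is refused (fast failure)
    // and when nothing answers before the timeout (blocked or dropped packets).
    static boolean reachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // ConnectException, SocketTimeoutException and UnknownHostException
            // are all IOExceptions, so every failure mode ends up here.
            return false;
        }
    }
}
```

With a bound like this, a blocked connection costs at most `timeoutMillis` instead of whatever the default connect and retry behaviour adds up to.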
Thank you @kshakir. What I see there is that the code sets the default NIO options, and as part of this it creates a Google Cloud StorageOptions object (StorageOptions$Builder.build in the trace), which in turn looks up the default credentials. When we wrote the default-setting code we didn't realize that setting the number of retries was going to cause a network message to be sent, with the associated potential retries and delays. We can't change the way Google Compute Engine works, or how the Google authentication works either. Ideally we'd want some way to only search for credentials when we know NIO is going to be used. The point of these defaults is that they're used for anything that uses NIO, including third-party library code. We can't fully replicate this behavior in a different way from the outside. So I think the "correct" fix would be to go deep inside the Google NIO library and change it so that instead of providing a default configuration (that the user would have to put together, causing the problem you've seen), we can provide a callback that sets the configuration when the Google Cloud NIO provider is loaded. This is harder for future developers to wrap their heads around, but at least it would prevent this delay if NIO is not used. I'd like to think about this some more before doing something quite this drastic, though. |
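The callback idea above amounts to deferring the construction of the configuration until something actually asks for it. Here is a minimal sketch of that pattern in plain Java. `LazyDefaults` is a hypothetical class for illustration; it is not part of the Google NIO library, and wiring it into the provider would require the library change described above.

```java
import java.util.function.Supplier;

final class LazyDefaults<T> implements Supplier<T> {
    private final Supplier<T> factory; // the callback that builds the configuration
    private T value;
    private boolean built;

    LazyDefaults(Supplier<T> factory) {
        this.factory = factory;
    }

    // Invokes the callback on first use only; later calls reuse the cached result.
    @Override
    public synchronized T get() {
        if (!built) {
            value = factory.get();
            built = true;
        }
        return value;
    }
}
```

If the expensive step (here, anything that triggers the credentials lookup) lives inside the callback, a run that never touches GCS never pays for it, while a run that does use GCS pays once on first access.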
User has reported longer runtimes in GATK4 beta3 release compared to GATK4 beta 2 release. It sounds like this is not expected. Her runtimes are below. The first post in the forum thread has her original report.
This Issue was generated from your [forums]
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/comment/41669#Comment_41669