Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support use of Path in ParsingUtils, SeekableStreamFactory, AbstractVCFCodec #724

Merged
merged 1 commit into from
Jan 20, 2017

Conversation

tomwhite
Copy link
Contributor

Description

Added support for java.util.nio.Path for some more code pathways.

This is to support loading of Features with Spark in GATK.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.009%) to 69.032% when pulling d3f3a4b on tomwhite:tw_use_path_in_parsing_utils into 88b6719 on samtools:master.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite Seems like very reasonable changes. Hard to test that they're actually working though. Especially the case with a weird classpath. I have a few concrete comments and then a vague suggestion that we need tests with non file: paths just to shake out any places where it might accidentally be falling back to files without us knowing it. I think maybe if we include a test dependency on an in memory filesytem we could make much stronger test for the paths without having to bundle anything with htsjdk on release.

public static Path getPath(String uriString) throws IOException {
URI uri = URI.create(uriString);
try {
return uri.getScheme() == null ? Paths.get(uriString) : Paths.get(uri);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment here explaining this line. It's going to be very confusing for anyone who doesn't know that Paths.get(String) and Paths.get(URI) have different mechanics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (cl == null) {
throw e;
}
return FileSystems.newFileSystem(uri, new HashMap<>(), cl).provider().getPath(uri);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bug here. I tested this code explicitly by including the gcs provider in the hellbender path and running the following tests:

    @Test
    public void testCompletePath() throws IOException {
        newFilesystem(URI.create("gs://hellbender/somefile.txt"));
    }

    @Test
    public void testBucketOnly() throws IOException {
        newFilesystem(URI.create("gs://hellbender"));
    }

    private void newFilesystem(URI uri) throws IOException {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        final FileSystem fileSystem = FileSystems.newFileSystem(uri, new HashMap<>(), cl);
    }

the first test fails with

java.lang.IllegalArgumentException: GCS FileSystem URIs mustn't have: port, userinfo, path, query, or fragment: gs://hellbender/somefile.txt

	at shaded.cloud-nio.com.google.common.base.Preconditions.checkArgument(Preconditions.java:146)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newFileSystem(CloudStorageFileSystemProvider.java:192)
	at com.google.cloud.storage.contrib.nio.CloudStorageFileSystemProvider.newFileSystem(CloudStorageFileSystemProvider.java:83)
	at java.nio.file.FileSystems.newFileSystem(FileSystems.java:326)
	at htsjdk.tribble.util.ParsingUtilsTest.newFilesystem(ParsingUtilsTest.java:179)
	at htsjdk.tribble.util.ParsingUtilsTest.testCompletePath(ParsingUtilsTest.java:169)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:85)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:639)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:816)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1124)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108)
	at org.testng.TestRunner.privateRun(TestRunner.java:774)
	at org.testng.TestRunner.run(TestRunner.java:624)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:359)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:354)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:312)
	at org.testng.SuiteRunner.run(SuiteRunner.java:261)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1215)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
	at org.testng.TestNG.run(TestNG.java:1048)
	at org.testng.IDEARemoteTestNG.run(IDEARemoteTestNG.java:72)
	at org.testng.RemoteTestNGStarter.main(RemoteTestNGStarter.java:127)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

The second passes. So we can't just use the URI that's passed in to open the new file system. I'm not sure if this is a bug in the implementation of the GCS fileSystemProvider or if it's going to be a consistent issue with multiple providers. The other providers that I know of don't offer consistent behavior here that I can tell. UnixFileSystem provider always throws because it doesn't want to be reinstantiated. The HDFS provider from hdfs.jsr203 seems to be more robust and just picks out the part of the URI that it needs.

@jean-philippe-martin Can you comment on this? It looks like a simple patch to the gcs provider would make it so that this approach would work.

@tomwhite Have you tested this running in spark? I don't see how it can work, but maybe I'm missing something important. I do see that our implementation of this method in gatk has a special case for gcs still.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try running this on HDFS with Spark a while ago and it worked. There's a fix for the GCS issue from @jean-philippe-martin here: googleapis/google-cloud-java#1470

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergerson: yes this was a bug, it's fixed now.

@tomwhite: Does this mean that on Spark after someone calls getPath on a gs:// URI, the fs provider will be loaded and Path.get will work correctly? Because that would be sweet.

@@ -143,6 +164,18 @@ private void tstExists(String path, boolean expectExists) throws IOException{
}

@Test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a way of testing non-file URI's. I'm not sure the best way to go about it. One possibility would be to include one of the alternative filesystem providers as a test dependency. Either that or maybe we could mock one. There is a [google in memory filesystem] that might make a good test for the non-file based file systems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droazen What do you think about including a new test dependency for providing alternative paths. It might help us avoid problems like the index one we saw before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I've added a test dependency on https://github.com/google/jimfs, and a few tests that use non-file URIs that exercise the NIO file path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* @return the resulting {@code Path}
* @throws IOException an I/O error occurs creating the file system
*/
public static Path getPath(String uriString) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you replace the use of Paths.get in SamInputResource.UrlInputResource with a call to this instead? Just in case...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@tomwhite tomwhite force-pushed the tw_use_path_in_parsing_utils branch from d3f3a4b to f15fbdd Compare January 17, 2017 12:24
@codecov-io
Copy link

Current coverage is 63.863% (diff: 50.000%)

Merging #724 into master will decrease coverage by 0.019%

@@             master       #724   diff @@
==========================================
  Files           523        523          
  Lines         31643      31660    +17   
  Methods           0          0          
  Messages          0          0          
  Branches       6766       6772     +6   
==========================================
+ Hits          20214      20219     +5   
- Misses         9293       9301     +8   
- Partials       2136       2140     +4   

Sunburst

Diff Coverage File Path
0% src/main/java/htsjdk/samtools/SamInputResource.java
0% ...k/samtools/seekablestream/SeekableStreamFactory.java
••••• 50% src/main/java/htsjdk/samtools/util/IOUtil.java
••••• 50% ...c/main/java/htsjdk/variant/vcf/AbstractVCFCodec.java
•••••••••• 100% src/main/java/htsjdk/tribble/util/ParsingUtils.java

Powered by Codecov. Last update 5a9d819...f15fbdd

@tomwhite
Copy link
Contributor Author

@lbergelson thanks for the review. I've addressed all your comments, so please take another look. (See broadinstitute/gatk#2344 for the companion GATK issue.)

@lbergelson
Copy link
Member

@tomwhite Changes look good. Tests with jimfs are a good addition. It would be good to have a test that reads a vcf with an index from jimfs so we make sure we avoid the issue we had with bam indexes. I want to get this in though, so I'm going to merge and we can add one after the fact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants