Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VSIZIP and VSISTDIN are not compatible #751

Closed
emilyselwood opened this issue Jul 13, 2018 · 5 comments
Closed

VSIZIP and VSISTDIN are not compatible #751

emilyselwood opened this issue Jul 13, 2018 · 5 comments

Comments

@emilyselwood
Copy link

emilyselwood commented Jul 13, 2018

The code for /VSISTDIN/ only kicks in if the entire path provided matches.

This means that if you try and do something like

gdalinfo /vsizip//vsis
tdin//S2B_MSIL1C_20180623T152639_N0206_R025_T18NWP_20180623T201322.SAFE/GRANULE/L1C_T18NWP_A006772_20180623T153006/IMG_D
ATA/T18NWP_20180623T152639_B01.jp2 < S2B_MSIL1C_20180623T152639_N0206_R025_T18NWP_20180623T201322.zip 

It does not work, like it would if we pulled the file down with /vsicurl/ or similar. Despite the documentation on chaining ( http://www.gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_chaining ) being contradicted by the section on /vsistdin/ ( http://www.gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsistdin )

While the example is easy enough to work around by using the standard file path. We are tying to use GDAL with Apache Nifi which passes files to stdin of sub-processes. This means we need to write the file to disk first which can be a bit tricky with clustered environments.

Expected behavior and actual behavior.

gdalinfo /vsizip//vsis
tdin//S2B_MSIL1C_20180623T152639_N0206_R025_T18NWP_20180623T201322.SAFE/GRANULE/L1C_T18NWP_A006772_20180623T153006/IMG_D
ATA/T18NWP_20180623T152639_B01.jp2 < S2B_MSIL1C_20180623T152639_N0206_R025_T18NWP_20180623T201322.zip 

Something like that prints out the info for the correct image in the zip file.
Instead we get:

ERROR 4: `/vsizip//vsistdin//S2B_MSIL1C_20180623T152639_N0206_R025_T18NWP_20180623T201322.SAFE/GRANULE/L1C_T18NWP_A006772_20180623T153006/IMG_DATA/T18NWP_20180623T152639_B01.jp2' does not exist in the file system, and is not recognized as a supported dataset name.
gdalinfo failed - unable to open '/vsizip//vsistdin//S2B_MSIL1C_20180623T152639_N0206_R025_T18NWP_20180623T201322.SAFE/GRANULE/L1C_T18NWP_A006772_20180623T153006/IMG_DATA/T18NWP_20180623T152639_B01.jp2'.

Operating system

Tested on Windows 10 and Ubuntu 16.04

GDAL version and provenance

PS C:\data> gdalinfo --version
GDAL 2.3.0, released 2018/05/04

Code checked against Master ( 448138d)

@rouault
Copy link
Member

rouault commented Jul 13, 2018

There is no way that /vsistdin/ and /vsizip/ can be combined together. Reading from a zip requires to read its end first. Whereas /vsistdin/ reading must be sequential from the beginning. So that "works" as expected.

@bayersglassey-zesty
Copy link

bayersglassey-zesty commented May 13, 2022

I'm running into this as well.
If the current implementation doesn't allow for it, wouldn't it be possible to change the implementation? From the outside, reading ZIP data from stdin seems reasonable -- most people would assume it would work.

Reading from a zip requires to read its end first.

Would it not be possible to read stdin into a buffer, and then you can read to the end of that buffer?
So /vsistdin/ would need to know that it's being used in a way which requires everything to be read into a buffer first.
But since there is already /vsemem/, perhaps this just means that a little code needs to be added to read into a buffer, and then "hand off" to /vsimem/?

I guess if I'm going to flippantly suggest that something might or might not be possible, I should probably take a quick peek at the code and see whether it looks imposing. :)
Looks like the source code for /vsistdin/ and /vsimem/ are here:

...and they work by implementing subclasses of VSIFilesystemHandler and VSIVirtualHandle, e.g.

class VSIStdinFilesystemHandler final : public VSIFilesystemHandler
class VSIStdinHandle final : public VSIVirtualHandle

class VSIMemFilesystemHandler final : public VSIFilesystemHandler
class VSIMemHandle final : public VSIVirtualHandle

...and the VSIVirtualHandle classes are basically "virtual files", with methods similar to the cstdlib functions for working with FILE *.
And in particular, I see that VSIStdinHandle::Seek can throw the following errors:

  • Seek(xx != 0, SEEK_END) unsupported on /vsistdin
  • Seek(SEEK_END) unsupported on /vsistdin when stdin > 1 MB
  • backward Seek() unsupported on /vsistdin above first MB

...so okay, it sounds like the first MB of stdin is indeed read into a buffer.
The reason seemingly being that the purpose of /vsistdin/ is to support "large files" (as the comment at top of file says: "Purpose: Implement VSI large file api for stdin").
And in fact, I see the hardcoded 1MB buffer size:

// We buffer the first 1MB of standard input to enable drivers
// to autodetect data. In the first MB, backward and forward seeking
// is allowed, after only forward seeking will work.
// TODO(schwehr): Make BUFFER_SIZE a static const.
#define BUFFER_SIZE (1024 * 1024)

static GByte* pabyBuffer = nullptr;
static GUInt32 nBufferLen = 0;
static GUIntBig nRealPos = 0;

/// ...and later, in VSIStdinInit:
pabyBuffer = static_cast<GByte *>(CPLMalloc(BUFFER_SIZE));

And for completeness' sake, the source for /vsizip/ is here:

...and I see that VSIGZipHandle::VSIGZipHandle takes a VSIVirtualHandle* poBaseHandle, and begins by calling VSIFSeekL and VSIFTellL on it in order to determine compressed_size. So that's presumably where it would fail when poBaseHandle is a VSIStdinHandle...

So in order to allow reading ZIP files from stdin, I guess one of the following would be necessary:

  • A version of /vsistdin/ (or some kind of flag for it?) which dynamically allocated a large enough buffer for whatever you threw at it
  • Some kind of "temporary file" VSI subclass?..

I suppose there is no great cry for those at the moment, and so no plan for them. Hmmmm.
It seems like these days, when RAM is somewhat cheaper, it would be nice to have some way to read all of stdin into a buffer. Maybe a /vsistdinmem/ option. 😅

@rouault
Copy link
Member

rouault commented May 14, 2022

well, one could potentially modify the code to replace the hard coded 1MB buffer by something configurable with a VSISTDIN_BUFFER_LIMIT configuration option that would default to 1MB and could be set to another value, or -1 to indicate no limit. Or perhaps accept /vsistdin?buffer_limit=.... (would require looking for the use of "/vsistdin/" in the code base, particularly drivers, to make it more flexible)

@rouault rouault reopened this May 14, 2022
@bayersglassey-zesty
Copy link

bayersglassey-zesty commented May 14, 2022

Or perhaps accept /vsistdin?buffer_limit=....

Wow, I didn't think the VSI syntax was so flexible. I guess it can be whatever you make it, though. :)
Edit: ah, I see vsicurl uses it, e.g.

ogr/ogrsf_frmts/plscenes/ogrplscenesdatav1dataset.cpp
574:        CPLString osTmpURL("/vsicurl?use_head=no&max_retry=3&empty_dir=yes&use_redirect_url_if_no_query_string_params=yes&url=");

I feel like specifying a higher, but still finite, limit isn't a feature which people will generally need.
In any case, what I was hoping for was just for things to magically work -- to be able to combine /vsistdin/ and /vsizip/.
If the current /vsistdin/ behaviour is still desired -- that is, to throw errors when someone tries to seek etc -- then I guess the new behaviour could be /vsistdin?realloc=true/ or /vsidynamicstdin/ or something.
But it seems like it would be ideal for /vsistdin/ to always allocate the initial 1MB buffer, then dynamically grow it if needed.
So for instance, the cases where it currently errors out -- like when you seek to the end -- would trigger a dynamic buffer resize.

Hmmm, but I guess that doesn't actually make sense, because of the following scenario:

  • We allocate initial 1MB and read into it
  • We read another 1MB without reallocating the buffer... so this data is lost after it is read
  • Now we are asked to seek back to an offset of 1MB from start of input, but we can't because we have already read and thrown away that data

...so I guess you would need two separate modes: the current, "buffer 1MB but then throw away data after you've read it" mode, and a separate, "keep a dynamically resized buffer containing all data ever read" mode.
And then, stdin + ZIP could be achieved, using that second mode.

So then yes, I think your suggestion makes a lot of sense: /vsistdin?buffer_limit=N/, where N can I guess be a number of bytes >= 1024 * 1024, or -1 for dynamic. 🤔

rouault added a commit that referenced this issue Jun 16, 2022
/vsistdin/: make size of buffered area configurable (fixes #751)
@bayersglassey-zesty
Copy link

Thanks @rouault!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants