
version 0.5.0 recursively copy remote folder fails #359

Closed
mgsnuno opened this issue Sep 8, 2020 · 7 comments · Fixed by #360

mgsnuno commented Sep 8, 2020

What happened:
Recursively copying a remote folder fails. It looks like the async _get_file does not handle directory paths properly.

new aio code (0.5.0):

async def _get_file(self, rpath, lpath, version_id=None):
    bucket, key, vers = self.split_path(rpath)
    resp = await self._call_s3(
        self.s3.get_object, Bucket=bucket, Key=key,
        **version_id_kw(version_id or vers),
    )
    with open(lpath, "wb") as f0:
        body = resp['Body']
        while True:
            chunk = await body.read(2**16)
            if not chunk:
                break
            f0.write(chunk)

old code (0.4.2):

def get_file(self, rpath, lpath, **kwargs):
    """Copy single remote file to local"""
    if self.isdir(rpath):
        os.makedirs(lpath, exist_ok=True)
    else:
        with self.open(rpath, "rb", **kwargs) as f1:
            with open(lpath, "wb") as f2:
                data = True
                while data:
                    data = f1.read(self.blocksize)
                    f2.write(data)

When recursively copying a remote folder with the latest aio code, rpath is not checked to see whether it is a directory, which causes FileNotFoundError: The specified key does not exist.

With version 0.4.2 this works fine.
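
(For reference, a minimal sketch of how the 0.4.2 directory check could be carried over to the async method; the async _isdir helper here is an assumption for illustration, not the actual fix:)

import os

async def _get_file(self, rpath, lpath, version_id=None):
    # Assumed guard, mirroring the 0.4.2 sync logic: if rpath is a
    # directory (key prefix), create the local directory and return
    # before GetObject is called, which would otherwise raise NoSuchKey.
    if await self._isdir(rpath):  # assumes an async _isdir coroutine
        os.makedirs(lpath, exist_ok=True)
        return
    bucket, key, vers = self.split_path(rpath)
    ...  # remainder unchanged from the 0.5.0 code above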

Folder structure

folder
    subfolder
        file1
        ...
    subfolder1
        file1
        ...
    ...

Environment:

  • Dask version: 2.25.0
  • Python version: 3.8.5
  • Operating System: manjaro
  • Install method (conda, pip, source): conda
martindurant (Member)

It would be very helpful if you could set up your example as a test function, so that we can fix it and know that the fix works. Also, a traceback showing where the FileNotFoundError is coming from - is this on the local system, or is s3fs trying to download files that don't exist?
In making a fix, it must be remembered that s3 does not have directories, and that it is possible to have a key with exactly the same name as a "prefix".

Note that I intend to implement fsspec/gcsfs#280 for s3fs, which should make this simpler.
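
(For illustration, a test along these lines would capture the failure; the s3 fixture and bucket name here are assumptions modeled loosely on the s3fs test suite, not an exact recipe:)

import os

def test_get_recursive(s3, tmpdir):
    # Sketch of a failing test: write a nested remote "folder", then
    # download it recursively. With the 0.5.0 async code, directory
    # prefixes are passed straight to GetObject and raise FileNotFoundError.
    for key in ["folder/subfolder/file1", "folder/subfolder1/file1"]:
        with s3.open("mybucket/" + key, "wb") as f:
            f.write(b"data")
    local = str(tmpdir.join("out"))
    s3.get("mybucket/folder", local, recursive=True)
    assert os.path.exists(os.path.join(local, "subfolder", "file1"))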

mgsnuno (Author) commented Sep 15, 2020

@martindurant the issue still exists with s3fs 0.5.1 (which I assume includes your changes to find).

The problem is, as you mentioned:

s3fs trying to download files that don't exist?

_get_file is not checking whether rpath is a directory (as it did previously), and s3.get_object(.., Key='path_directory') will of course fail.

Full traceback:
Unexpected error: FileNotFoundError('The specified key does not exist.')
Traceback (most recent call last):
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/s3fs/core.py", line 207, in _call_s3
    return await method(**additional_kwargs)
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/aiobotocore/client.py", line 151, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 822, in get_task_run_state
    value = timeout_handler(
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/prefect/utilities/executors.py", line 188, in timeout_handler
    return fn(*args, **kwargs)
  File "<ipython-input-5-00b18abfffb4>", line 25, in table_copy
  File "/home/work/paf/pipelines/pipelines/core/pipeline.py", line 399, in datadump_table_copy
    s3.get(path_s3, path_local, recursive=True)
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/fsspec/asyn.py", line 266, in get
    return sync(self.loop, self._get, rpaths, lpaths)
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/fsspec/asyn.py", line 67, in sync
    raise exc.with_traceback(tb)
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/fsspec/asyn.py", line 51, in f
    result[0] = await future
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/fsspec/asyn.py", line 251, in _get
    return await asyncio.gather(
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/s3fs/core.py", line 694, in _get_file
    resp = await self._call_s3(
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/s3fs/core.py", line 225, in _call_s3
    raise translate_boto_error(err) from err
FileNotFoundError: The specified key does not exist.

martindurant (Member)

I would suggest that, rather than checking each path as to whether it is a key or not (which may be expensive - the best way to check is to try the download), we implement "ignoring" of exceptions in the bulk operations. For example, rm does not error if a key it tries to remove is already gone.

Alternatively, the keys to be fetched could be sorted so that folders always come last - if we have already created a local folder, then we cannot overwrite it with the contents of a remote key of the same name, even if such a key exists.

What do you think?
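
(A minimal sketch of the first option, assuming per-path errors are simply swallowed inside the bulk download; the method names are illustrative:)

import asyncio

async def _get(self, rpaths, lpaths):
    # Illustrative bulk wrapper: attempt every download, but ignore a
    # FileNotFoundError on any individual path (e.g. a key that is really
    # a directory prefix), mirroring how rm tolerates already-deleted keys.
    async def _one(rpath, lpath):
        try:
            await self._get_file(rpath, lpath)
        except FileNotFoundError:
            pass  # key does not exist; most likely a directory prefix

    await asyncio.gather(*(_one(r, l) for r, l in zip(rpaths, lpaths)))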

mgsnuno (Author) commented Sep 16, 2020

we implement "ignoring" of exceptions in the bulk operations. For example, rm does not error if a key it tries to remove is already gone.

That seems more robust; I'd go for that one, even though it might be slower. try/except FileNotFoundError blocks seem good.

Is this an easy change to those functions I mentioned? Any more pointers on what should be done? I can give it a try.

martindurant (Member)

fsspec/filesystem_spec#409 does a similar thing for cat. There should be a reasonable way to avoid repeating the error-catching code across all the batch functions.

mgsnuno (Author) commented Sep 24, 2020

@martindurant https://github.com/dask/s3fs/pull/368/files#diff-1997c4b809971172b55a040ccbb82ea0R694 implements a fix similar to what I suggested when opening this issue, correct?

I'm not sure of the impact of checking if self.isdir(rpath): versus if os.path.isdir(lpath):

Do you still want to implement the following?

we implement "ignoring" or exceptions in the bulk operations. For example, rm does not error if a key it tries to remove is already gone.

martindurant (Member)

Yes, it is the same; the directories should all be created up front, and then an attempt to download a key with the same name as a directory will be ignored. Since isdir is ambiguous on s3, this seems like the better approach.

Do you still want to implement

Yes, this is reasonable. As I say, I would hope not to have to repeat the code, and this should be possible (although the default value might be "ignore" for rm and something else for the others).
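
(One possible shape for such a shared helper, with a per-operation default policy; purely illustrative, not the eventual fsspec API:)

import asyncio

async def _run_batch(coros, on_error="raise"):
    # Illustrative shared helper for batch operations: run all coroutines,
    # collecting exceptions instead of failing fast, then apply the policy.
    results = await asyncio.gather(*coros, return_exceptions=True)
    if on_error == "ignore":
        # e.g. the default for rm: silently drop the failed entries
        return [r for r in results if not isinstance(r, BaseException)]
    for r in results:
        if isinstance(r, BaseException):
            raise r
    return results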
