
version 0.5.0 recursively copy remote folder fails #359

Closed
mgsnuno opened this issue Sep 8, 2020 · 7 comments · Fixed by #360

mgsnuno commented Sep 8, 2020

What happened:
Recursively copying a remote folder fails. It looks like the async _get_file does not handle directory paths properly.

new aio code (0.5.0):

async def _get_file(self, rpath, lpath, version_id=None):
    bucket, key, vers = self.split_path(rpath)
    resp = await self._call_s3(
        self.s3.get_object, Bucket=bucket, Key=key,
        **version_id_kw(version_id or vers),
    )
    with open(lpath, "wb") as f0:
        body = resp['Body']
        while True:
            chunk = await body.read(2**16)
            if not chunk:
                break
            f0.write(chunk)

old code (0.4.2):

def get_file(self, rpath, lpath, **kwargs):
    """Copy single remote file to local"""
    if self.isdir(rpath):
        os.makedirs(lpath, exist_ok=True)
    else:
        with self.open(rpath, "rb", **kwargs) as f1:
            with open(lpath, "wb") as f2:
                data = True
                while data:
                    data = f1.read(self.blocksize)
                    f2.write(data)

When recursively copying a remote folder with the latest aio code, rpath is not checked to see whether it is a directory, which causes FileNotFoundError: The specified key does not exist.

With version 0.4.2 this works fine.
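
(For reference, a minimal sketch of how the 0.4.2 directory check could be carried over to the async method; the async _isdir helper here is an assumption for illustration, not the actual fix:)

import os

async def _get_file(self, rpath, lpath, version_id=None):
    # Assumed guard, mirroring the 0.4.2 sync logic: if rpath is a
    # directory (key prefix), create the local directory and return
    # before GetObject is called, which would otherwise raise NoSuchKey.
    if await self._isdir(rpath):  # assumes an async _isdir coroutine
        os.makedirs(lpath, exist_ok=True)
        return
    bucket, key, vers = self.split_path(rpath)
    ...  # remainder unchanged from the 0.5.0 code above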

Folder structure

folder
    subfolder
        file1
        ...
    subfolder1
        file1
        ...
    ...

Environment:

  • Dask version: 2.25.0
  • Python version: 3.8.5
  • Operating System: manjaro
  • Install method (conda, pip, source): conda
martindurant (Member)

It would be very helpful if you could set up your example as a test function, so that we can fix it and know that the fix works. Also, a traceback showing where the FileNotFoundError is coming from - is this on the local system, or is s3fs trying to download files that don't exist?
In making a fix, it must be remembered that s3 does not have directories, and that it is possible to have a key with exactly the same name as a "prefix".

Note that I intend to implement fsspec/gcsfs#280 for s3fs, which should make this simpler.
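
(For illustration, a test along these lines would capture the failure; the s3 fixture and bucket name here are assumptions modeled loosely on the s3fs test suite, not an exact recipe:)

import os

def test_get_recursive(s3, tmpdir):
    # Sketch of a failing test: write a nested remote "folder", then
    # download it recursively. With the 0.5.0 async code, directory
    # prefixes are passed straight to GetObject and raise FileNotFoundError.
    for key in ["folder/subfolder/file1", "folder/subfolder1/file1"]:
        with s3.open("mybucket/" + key, "wb") as f:
            f.write(b"data")
    local = str(tmpdir.join("out"))
    s3.get("mybucket/folder", local, recursive=True)
    assert os.path.exists(os.path.join(local, "subfolder", "file1"))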

mgsnuno (Author) commented Sep 15, 2020

@martindurant the issue still exists with s3fs 0.5.1 (which I assume includes your changes to find).

The problem is, as you mentioned:

s3fs trying to download files that don't exist?

_get_file is not checking whether rpath is a directory (as it did previously), and s3.get_object(.., Key='path_directory') will of course fail.

Full traceback:
Unexpected error: FileNotFoundError('The specified key does not exist.')
Traceback (most recent call last):
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/s3fs/core.py", line 207, in _call_s3
    return await method(**additional_kwargs)
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/aiobotocore/client.py", line 151, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 822, in get_task_run_state
    value = timeout_handler(
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/prefect/utilities/executors.py", line 188, in timeout_handler
    return fn(*args, **kwargs)
  File "<ipython-input-5-00b18abfffb4>", line 25, in table_copy
  File "/home/work/paf/pipelines/pipelines/core/pipeline.py", line 399, in datadump_table_copy
    s3.get(path_s3, path_local, recursive=True)
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/fsspec/asyn.py", line 266, in get
    return sync(self.loop, self._get, rpaths, lpaths)
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/fsspec/asyn.py", line 67, in sync
    raise exc.with_traceback(tb)
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/fsspec/asyn.py", line 51, in f
    result[0] = await future
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/fsspec/asyn.py", line 251, in _get
    return await asyncio.gather(
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/s3fs/core.py", line 694, in _get_file
    resp = await self._call_s3(
  File "/home/miniconda3/envs/pipelines/lib/python3.8/site-packages/s3fs/core.py", line 225, in _call_s3
    raise translate_boto_error(err) from err
FileNotFoundError: The specified key does not exist.

martindurant (Member)

I would suggest that, rather than checking each path as to whether it is a key or not (which may be expensive - the best way to check is to try the download), we implement "ignoring" of exceptions in the bulk operations. For example, rm does not error if a key it tries to remove is already gone.

Alternatively, the keys to be fetched could be sorted so that folders always come last - if we have already created a local folder, then we cannot overwrite it with the contents of a remote key of the same name, even if such a key exists.

What do you think?
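
(A minimal sketch of the first option, assuming per-path errors are simply swallowed inside the bulk download; the method names are illustrative:)

import asyncio

async def _get(self, rpaths, lpaths):
    # Illustrative bulk wrapper: attempt every download, but ignore a
    # FileNotFoundError on any individual path (e.g. a key that is really
    # a directory prefix), mirroring how rm tolerates already-deleted keys.
    async def _one(rpath, lpath):
        try:
            await self._get_file(rpath, lpath)
        except FileNotFoundError:
            pass  # key does not exist; most likely a directory prefix

    await asyncio.gather(*(_one(r, l) for r, l in zip(rpaths, lpaths)))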

mgsnuno (Author) commented Sep 16, 2020

we implement "ignoring" of exceptions in the bulk operations. For example, rm does not error if a key it tries to remove is already gone.

That seems more robust; I'd go for that one, even though it might be slower. try/except FileNotFoundError blocks seem good.

Is this an easy change to those functions I mentioned? Any more pointers on what should be done? I can give it a try.

martindurant (Member)

fsspec/filesystem_spec#409 does a similar thing for cat. There should be a reasonable way to avoid repeating the error-catching code across all the batch functions.

mgsnuno (Author) commented Sep 24, 2020

@martindurant https://github.com/dask/s3fs/pull/368/files#diff-1997c4b809971172b55a040ccbb82ea0R694 implements a fix similar to what I suggested when opening this issue, correct?

I'm not sure of the impact of checking if self.isdir(rpath): versus if os.path.isdir(lpath):

Do you still want to implement the following?

we implement "ignoring" or exceptions in the bulk operations. For example, rm does not error if a key it tries to remove is already gone.

martindurant (Member)

Yes, it is the same; the directories should all be created up front, and then an attempt to download a key with the same name as a directory will be ignored. Since isdir is ambiguous on s3, this seems like the better approach.

Do you still want to implement

Yes, this is reasonable. As I say, I would hope not to have to repeat the code, and this should be possible (although the default value might be "ignore" for rm and something else for the others).
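
(One possible shape for such a shared helper, with a per-operation default policy; purely illustrative, not the eventual fsspec API:)

import asyncio

async def _run_batch(coros, on_error="raise"):
    # Illustrative shared helper for batch operations: run all coroutines,
    # collecting exceptions instead of failing fast, then apply the policy.
    results = await asyncio.gather(*coros, return_exceptions=True)
    if on_error == "ignore":
        # e.g. the default for rm: silently drop the failed entries
        return [r for r in results if not isinstance(r, BaseException)]
    for r in results:
        if isinstance(r, BaseException):
            raise r
    return results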
