Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft for Issue 183 list collection #342

Closed

Conversation

christian-monch
Copy link
Contributor

A draft for issue #183

-[ ] add credential management
-[ ] add tests

mih and others added 30 commits February 24, 2023 09:08
For now just downloads via unauthenticated connections. The included
code enables downloads of selected archived content, without having to
download the entire archive first.

See datalad/datalad#373

For ZIP and TAR archives this is hooked into `AnyUrlOperations` and
thereby accessible via the `datalad download` command.

Demo:

```sh
❯ datalad download 'zip://datalad-datalad-cddbe22/requirements-devel.txt::https://zenodo.org/record/7497306/files/datalad/datalad-0.18.0.zip?download=1 -'
 # Theoretically we don't want -e here but ATM pip would puke if just .[full] is provided
 # Since we use requirements.txt ATM only for development IMHO it is ok but
 # we need to figure out/complaint to pip folks
 -e .[devel]

❯ datalad download 'tar://datalad-0.18.0/requirements-devel.txt::https://files.pythonhosted.org/packages/dd/5e/9be11886ef4c3c64e78a8cdc3f9ac3f27d2dac403a6337d5685cd5686770/datalad-0.18.0.tar.gz -'
 # Theoretically we don't want -e here but ATM pip would puke if just .[full] is provided
 # Since we use requirements.txt ATM only for development IMHO it is ok but
 # we need to figure out/complaint to pip folks
 -e .[devel]
```

As demo'ed in the code, dependening on the capabilities of the
particular filesystem abstraction it needs custom handling of the
actual download process after `open()` was called.
This includes, but is not limited to, being able to specify
whether or not anonymous access should be attempted (first).

This change paves the way for endpoint customizations and
anything else that FSSPEC exposes.

Moreover, when an explicit `credential` identifier is given,
the boto-based attempt to locate credentials or try
anonymous access first is skipped, and a credential is looked
up and provisioned immediately.
This facilitates exploitation of the numerous filesystem specific
features provided by FSSPEC.
This changes demos the utility of this feature for the FSSPEC
based access to S3 resources. By default anonymous access
is attempted, but additional handlers could change this behavior
(for particular S3 targets).
Requires particular permissions and comes with a potential
performance penalty.
They are mostly needed for the tests (and more were needed), and we want
to keep the actually core-dependencies small.
See docs inside. At present it is unclear how relevant this will be
in practice. However, different filesystem caching settings
impact performance substantially (empricial observation). If caching
is turned off entirely, this parameter is the only way to specify
chunk sizes (for reading). Hence I consider it a useful thing to have
in general -- and it is cheap.

This change also flips the default download method from iteration
over a file pointer to reading chunks of a specific size. Ths has been
found to be more performant in some cases.
Such an explicit option conflicts with processing of S3 URL that
specify a version explicitly.
When working with S3 URLs, s3fs employs internal URL parsing to detect
whether the S3 file has versioning enabled. Based on the outcome of this
evaluation, it sets the version_aware value itself.
If a user sets this keys' value to False in fs_kwargs of the URL handler,
but then supplies a versioned URL, s3fs's attempt to set this key to True
is sabotaged, and the download crashes:

datalad_next.url_operations.UrlOperationsRemoteError: UrlOperationsRemoteError for 's3://mslw-datalad-test0-versioned/3versions-allversioned.txt?versionId=Tro_UjqVFJfr32v5tuPfjwtOzeqYCxi2'

To leave a trace of this, even if only temporary, this commit adds a short
paragraph to the URL handler's docstring to warn about setting this key.
Mostly to minimize the diff for conflict avoidance, after a lot of typos
were fixed via datalad#315
Get the updates to -core for the CI
If the props variable does not get populated during _get_fs(), access to the URL has
failed for some reason. Further processing of props in download will lead to crashes.
This change adds error handling to raise early and more informative
This is an MVP (minimal viable product) which
will be extended in the next commits. It
currently works on fsspec-URLs, but does not
yet support credentials. The type overwrite
is also not implemented yet.
@christian-monch christian-monch requested a review from mih as a code owner May 8, 2023 07:10
@christian-monch christian-monch marked this pull request as draft May 8, 2023 07:11
@mih mih mentioned this pull request May 8, 2023
13 tasks
@mih
Copy link
Member

mih commented May 17, 2023

Given that #183 is resolved, I am closing this PR. Further work can focus on ls-file-collections.

Thx!

@mih mih closed this May 17, 2023
@christian-monch christian-monch deleted the issue-183-list-collection branch July 16, 2024 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants