Implement alternative to add-archive-content #183

Closed
mih opened this issue Dec 12, 2022 · 10 comments · Fixed by #343

Comments

@mih
Member

mih commented Dec 12, 2022

Note: Lots of thoughts here. The latest variant is #183 (comment)

This is one of the oldest datalad commands. Its core purpose is to register special URLs for the datalad-archives special remote that indicate that particular keys can be obtained by extracting a file from an archive that is itself registered as an annexed file in the dataset.

On top of that, this command is also a front-end to archive extraction and content re-layouting.

It would be good to disentangle these aspects and provide a more targeted (and likely simpler) alternative.

Concept

Parse an archive (already an annexed file), run git annex calckey on each contained file, and add a dl+archive URL to the computed key, if that key exists in the dataset.

This command would not interact with a dataset's worktree. Consequently, it needs no parameterization re archive content re-layouting. It should also work with bare repos.

From a user perspective, the composition of a dataset's content becomes a more independent process. A user could extract files from an existing archive any way they see fit, and rename, filter, or restructure them with any means and/or tools. Importantly, it does not matter whether an archive is the actual starting point of the operation, or whether an archive (with any structure) was just created from existing dataset content (think ORA dataset archive), for the purpose of depositing only this archive (rather than individual files) on some storage system.
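
A minimal sketch of this concept, expressed with plain git-annex commands (an illustration of the idea, not the proposed command itself): it assumes GNU tar/stat, that my.tar.gz is an annexed file whose content is locally present, and that the datalad-archives special remote is enabled so the registered dl+archive URLs are actually usable. Extraction happens only into a temporary directory to compute keys; the dataset worktree is not touched.

archive_key=$(git annex lookupkey my.tar.gz)   # key of the annexed archive itself
tmp=$(mktemp -d)
tar -xzf my.tar.gz -C "$tmp"                   # temporary extraction, only to compute keys
find "$tmp" -type f | while read -r f; do
    member=${f#"$tmp"/}                        # path of the member inside the archive
    size=$(stat -c %s "$f")
    key=$(git annex calckey "$f")              # key under the repo's configured backend
    # register a dl+archive URL for the key, pointing into the annexed archive
    git annex registerurl "$key" "dl+archive:${archive_key}#path=${member}&size=${size}"
done
rm -rf "$tmp"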

@mih
Member Author

mih commented Jan 9, 2023

More thoughts.... There are essentially two fundamental use cases:

  1. The starting point is a dataset. An archive is created from (some of) its content (this is what export-archive does), and it is deposited somewhere.
  2. A dataset is created/populated from an existing archive that is available somewhere.

Orthogonal to these fundamental scenarios is the question of whether the internal layout of an archive matches that of a dataset's (work)tree. If not, the necessary transformations could be a selection filter (include/exclude a subset) or a re-layouting.

This results in four main scenarios that could be captured in different ways (see below). In general, it does not seem meaningful to include the deposition of an archive in any of the procedures covered here -- there are countless ways to do that (by special remote, remote, URL, ...), and covering them all would be a mess.

Import from an archive

I am skipping this for now. Most, if not all features I can think of are covered by add-archive-content. See below for an update on this.

Export (work)tree into an archive

We have export-archive for such an operation already. It lacks any support for filtering what is to be exported (it can only do the current worktree), and produces detached archives.

However, adding the ability to specify alternative trees to export and then register generated archives back with the source dataset would be straightforward.

In addition, we have export-archive-ora, which actually does something very similar. The difference is that it fixes the layout of the generated archive to a specific format (7z and an annex object tree with hashdir=lower content organization), but it also adds flexibility re output filtering (annex wanted configuration support).

I believe the two commands could be unified into a single command that supports:

  • selection of what is to be exported (by (path in a) tree, (set of) keys)
  • selection of output formats (archive type, archive organization (hashtree-like or filename-based))

and this command would also gain the ability to register a generated archive back with a source dataset.

It may even be possible/better to split the actual archive generation/registration out into a dedicated low-level/plumbing tool. This tool would read JSON (as produced by status or diff or some other source) and produce a customizable archive via a layout path template that is instantiated from properties of individual records. This way, annex-object-tree-like layouts can be built, as long as hashdir_XXX is a provided property, etc. Likewise, annex wanted expressions can be evaluated by another tool.

Once available, this helper can be used inside export-archive and export-archive-ora.
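
As a rough illustration of such a plumbing helper (a sketch under assumptions, not a concrete design): assuming that datalad -f json status --annex basic emits records carrying path and key properties, and using a trivial "place content by key" layout template, the assembly step could look roughly like the following. A real tool would render a configurable path template per record (e.g. with hashdir components) instead of the key alone, and would not require annexed content to be locally present.

mkdir -p staging
datalad -f json status --annex basic \
    | jq -r 'select(.type == "file" and .key != null) | [.path, .key] | @tsv' \
    | while IFS=$'\t' read -r path key; do
        # the "layout template" here is just the annex key itself;
        # an object-tree layout would prepend hashdir properties instead
        cp "$path" "staging/$key"    # assumes the annexed content is present locally
    done
tar -czf export.tar.gz -C staging .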

@mih
Member Author

mih commented Jan 9, 2023

Analogous to the "archive assembler" described above (which reads content properties from a source, builds, and possibly registers an archive in a dataset), we could have a "dataset assembler" that reads properties from a source and registers the respective content in a dataset.

Such a command could understand the result records of download and, in conjunction with it, replace download-url. It could also have an archive-parser companion that pulls content from archives and hashes it, such that dl+archive URLs could be registered for new content in a dataset. Any intermediate filter (e.g. jq) could be used to perform arbitrary sub-selection or path manipulation.

It would also be interesting to explore this in the context of datalad-ebrains, where this is done inside a single command (generate dataset and file info from a metadata query, then record them in a dataset), and which may also benefit from code reuse and a less ad-hoc solution.

So maybe:

  • assemble-dataset
  • assemble-archive

as two new plumbing commands.

I wonder if that is an opportunity to reduce the complexity of save (in particular at the repo level). status could be the helper that feeds assemble-dataset.

@bpoldrack
Member

I have no immediate suggestion, but I want to register that the idea sounds pretty good at first look. :)

@mih
Member Author

mih commented May 3, 2023

Yet another thought-iteration:

The leanest concept I can come up with is a command that generates inputs for addurls. With addurls it would not matter whether the ultimate goal is to generate dl+archive URLs for the datalad-archives special remote, some other URLs pointing to extracted content of some archive somewhere, or something entirely different and unrelated to archives.

Such a command would not need to be able to assemble full URLs or full paths. It only needs to parse a collection (either already in an archive, or still in a dataset yet to be exported into an archive), and to report relevant properties of items in that collection, such as

  • (file)name and/or path inside the collection
  • checksum
  • size

A replacement for add-archive-content would then be a pipe between this property reporter and addurls -- both individually parameterized as desired. Generating dl+archive URLs would be nothing more than placing the respective annex key of an archive in the URL template given to addurls.

Brain storming

The command...

  • name could be list-collection
  • could have a --type argument that instructs what kind of collection to process (archive, worktree (or just directory), ref, annex) -- and each type can have its own parser implementation for straightforward extensibility
  • would have a location argument that identifies the collection to process
  • would yield a result record per collection item with all its properties, and thereby automatically yield compatible input for addurls (even though CLI usage will require the jq slurp trick from the addurls help; see the snippet below)
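
For reference, this is the "jq slurp trick" mentioned in the last point, with the JSON-lines input mocked via printf (the field names are placeholders): --slurp wraps a stream of per-item objects into the single JSON array that addurls expects on stdin.

printf '%s\n' '{"item":"a.txt","size":1}' '{"item":"b.txt","size":2}' \
    | jq --slurp .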

Out-of-scope features of add-archive-content

  • content layout mangling. However, content layout properties could be broken up into distinct items that enable such re-layouting via the addurls FILENAME-FORMAT declaration, i.e. report leading dirs instead of stripping them
  • deletion of the source/parsed collection
  • any form of content dropping or other dataset modifications
  • any form of requirement of a "work-dataset", other than a dataset being present when a dataset is to be parsed as a source collection

@mih
Member Author

mih commented May 6, 2023

Linking #323, which provides one of the candidate traversers for a list-collection command.

Other existing traversers that could move into list-collection are

  • gooey-lsdir (for a single directory)
  • tree (for a directory tree)

Given that we also have for-each-dataset -- which is focused on dataset collections -- we might consider drawing a line there, and not implement a list-collection command, but rather a list-file-collection command, and implement it such that it yields type=file results (possibly by yielding the filtered output of other (internal) implementations).

@mih
Member Author

mih commented May 7, 2023

I am starting to work on the list-collection aspect of this issue now.

@mih
Member Author

mih commented May 9, 2023

With the code that is coming in #343 we can effectively replace add-archive-content. Demo:

datalad -f json ls-file-collection tarfile my.tar.gz --hash md5 \
    | jq '. | select(.type == "file")' \
    | jq --slurp . \
    | datalad addurls --key 'et:MD5-s{size}--{hash-md5}' - 'dl+archive:SOMEKEY#path={item}&size={size}' '{item}'

Explanation

  • datalad ls-file-collection is reporting on the content of a tar file. Its result records have all the things one would expect, including size info. --hash md5 causes a file hash for each tar member to be computed. This is done storage-efficiently, without extracting the tar archive. The same command can report on other collection types too -- this kind of metadata generation is not limited to tar archives
  • the first jq command sub-selects file-type results (only needed because datalad's own facilities for that do not work; --report-type does not affect -f json_pp output? datalad#5581)
  • the second jq command turns JSON lines into a single JSON array, as is necessary for addurls (taken from its docs)
  • the final addurls command registers keys and URLs based on the metadata provided by ls-file-collection. The choice of annex backend is flexible. TAR layout mangling is mapped onto the FILENAME-FORMAT input (it is trivial to write into a subdirectory, or any other layout built from file properties; see the variant below). The choice of dl+archive URLs is arbitrary here (it replaces add-archive-content, but any other URLs could be set too -- say, for example, URLs to a hash-based object store that is exposed via HTTP and requires a sha256 of the file content -- possible with minimal changes)
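
For illustration, here is a variant of the demo above that changes only the FILENAME-FORMAT, placing all records under a (freely chosen) myarchive/ subdirectory instead of using the archive-internal paths alone:

datalad -f json ls-file-collection tarfile my.tar.gz --hash md5 \
    | jq '. | select(.type == "file")' \
    | jq --slurp . \
    | datalad addurls --key 'et:MD5-s{size}--{hash-md5}' - 'dl+archive:SOMEKEY#path={item}&size={size}' 'myarchive/{item}'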

add-archive-content features that are not directly supported

  • --add-archive-leading-dir -- possible via the FILENAME-FORMAT specification
  • --leading-dirs-depth, --leading-dirs-consider, --strip-leading-dirs, --rename -- require collection item name mangling/parsing that is not readily provided here
  • --use-current-dir -- nothing is extracted, hence not required; the rest is possible via the FILENAME-FORMAT specification
  • --delete -- there is no need for an archive to even be around; out of scope
  • --key -- only relevant for the dl+archive URL use case; not directly supported, but accessible via URL-FORMAT
  • --existing -- all conflict handling is left to addurls, which provides options for resolution
  • --copy, --drop-after, --delete-after -- there is only metadata processing going on; nothing needs to be moved or dropped as a consequence of this processing
  • --no-commit, --allow-dirty -- obsolete; this uses standard addurls behavior, which leaves a clean dataset clean

mih added a commit to mih/datalad-next that referenced this issue May 10, 2023
In comparison to the former, this is largely metadata driven, and
works without (local) extraction of a tarball. This saves storage
overhead, and makes it possible to run some parts of the ingestion
pipeline on a remote system.

Closes datalad#183
mih added a commit to mih/datalad-next that referenced this issue May 10, 2023
This is (also) an alternative approach to `add-archive-content`.

In comparison to the former, this is largely metadata driven, and
works without (local) extraction of a tarball. This saves storage
overhead, and makes it possible to run some parts of the ingestion
pipeline on a remote system.

Closes datalad#183
@mih mih closed this as completed in #343 May 11, 2023
@jsheunis
Member

One thing I am uncertain of: for this replacement of add-archive-content to work, should the tarball be part of the same datalad dataset on which the addurls command operates?

@mih
Member Author

mih commented May 12, 2023

It depends. There is a demo included in #343 that covers the case where the tarball is part of the dataset. This is because the datalad-archives special remote from -core requires that, and the demo documents that setup.

However, with the coming fspec features (#215) a different style of URLs could be assigned, and a different special remote could act on them (e.g. uncurl) and enable other use cases.

Third, datalad-archives could be replaced/changed entirely and support a hybrid of these extremes, where it can use the registered URL of a known tarball to provide fspec flexibility, but as an option, in addition to the traditional behavior. This is what #223 is about.

@jsheunis
Member

Thanks for the explanation
