Implement alternative to add-archive-content #183

Closed
mih opened this issue Dec 12, 2022 · 10 comments · Fixed by #343

Comments

@mih
Member

mih commented Dec 12, 2022

Note: Lots of thoughts here. The latest variant is #183 (comment)

This is one of the oldest datalad commands. Its core purpose is to register special URLs for the datalad-archives special remote that indicate that particular keys can be obtained by extracting a file from an archive that is itself registered as an annexed file in the dataset.

On top of that, this command is also a front-end to archive extraction and content re-layouting.

It would be good to disentangle these aspects and provide a more targeted (and likely simpler) alternative.

Concept

Parse an archive (already an annexed file), run git annex calckey on each contained file, and add a dl+archive URL to the computed key, if that key exists in the dataset.

This command would not interact with a dataset's worktree. Consequently, it needs no parameterization re archive content re-layouting. It should also work with bare repos.

From a user perspective, the composition of a dataset's content becomes a more independent process. A user could extract files from an existing archive any way they see fit, and rename, filter, or restructure them with any means and/or tools. Importantly, it does not matter whether an archive is the actual starting point of the operation, or whether an archive (with any structure) was just created from existing dataset content (think ORA dataset archive), for the purpose of depositing only this archive (rather than individual files) on some storage system.
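
A minimal sketch of this concept, expressed with plain git-annex commands (an illustration of the idea, not the proposed command itself): it assumes GNU tar/stat, that my.tar.gz is an annexed file whose content is locally present, and that the datalad-archives special remote is enabled so the registered dl+archive URLs are actually usable. Extraction happens only into a temporary directory to compute keys; the dataset worktree is not touched.

archive_key=$(git annex lookupkey my.tar.gz)   # key of the annexed archive itself
tmp=$(mktemp -d)
tar -xzf my.tar.gz -C "$tmp"                   # temporary extraction, only to compute keys
find "$tmp" -type f | while read -r f; do
    member=${f#"$tmp"/}                        # path of the member inside the archive
    size=$(stat -c %s "$f")
    key=$(git annex calckey "$f")              # key under the repo's configured backend
    # register a dl+archive URL for the key, pointing into the annexed archive
    git annex registerurl "$key" "dl+archive:${archive_key}#path=${member}&size=${size}"
done
rm -rf "$tmp"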

@mih
Member Author

mih commented Jan 9, 2023

More thoughts.... There are essentially two fundamental use cases:

  1. The starting point is a dataset. An archive is created from (some of) its content (this is what export-archive does), and it is deposited somewhere.
  2. A dataset is created/populated from an existing archive that is available somewhere.

Orthogonal to these fundamental scenarios is the question of whether the internal layout of an archive matches that of a dataset's (work)tree. If not, the necessary transformations could be a selection filter (include/exclude a subset) or a re-layouting.

This results in four main scenarios that could be captured in different ways (see below). In general, it does not seem meaningful to include the deposition of an archive in any of the procedures covered here -- there are countless ways to do that (by special remote, remote, URL, ...), and covering them all would be a mess.

Import from an archive

I am skipping this for now. Most, if not all features I can think of are covered by add-archive-content. See below for an update on this.

Export (work)tree into an archive

We have export-archive for such an operation already. It lacks any support for filtering what is to be exported (it can only do the current worktree), and produces detached archives.

However, adding the ability to specify alternative trees to export and then register generated archives back with the source dataset would be straightforward.

In addition, we have export-archive-ora, which actually does something very similar. The difference is that it fixes the layout of the generated archive to a specific format (7z and an annex object tree with hashdir=lower content organization), but it also adds flexibility re output filtering (annex wanted configuration support).

I believe the two commands could be unified into a single command that supports:

  • selection of what is to be exported (by (path in a) tree, (set of) keys)
  • selection of output formats (archive type, archive organization (hashtree-like or filename-based))

and this command would also gain the ability to register a generated archive back with a source dataset.

It may even be possible/better to split the actual archive generation/registration out into a dedicated low-level/plumbing tool. This tool would read JSON (as produced by status or diff or some other source) and produce a customizable archive via a layout path template that is instantiated from properties of individual records. This way, annex-object-tree-like layouts can be built, as long as hashdir_XXX is a provided property, etc. Likewise, annex wanted expressions can be evaluated by another tool.

Once available, this helper can be used inside export-archive and export-archive-ora.
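
As a rough illustration of such a plumbing helper (a sketch under assumptions, not a concrete design): assuming that datalad -f json status --annex basic emits records carrying path and key properties, and using a trivial "place content by key" layout template, the assembly step could look roughly like the following. A real tool would render a configurable path template per record (e.g. with hashdir components) instead of the key alone, and would not require annexed content to be locally present.

mkdir -p staging
datalad -f json status --annex basic \
    | jq -r 'select(.type == "file" and .key != null) | [.path, .key] | @tsv' \
    | while IFS=$'\t' read -r path key; do
        # the "layout template" here is just the annex key itself;
        # an object-tree layout would prepend hashdir properties instead
        cp "$path" "staging/$key"    # assumes the annexed content is present locally
    done
tar -czf export.tar.gz -C staging .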

@mih
Member Author

mih commented Jan 9, 2023

Analogous to the "archive assembler" described above (which reads content properties from a source, builds, and possibly registers an archive in a dataset), we could have a "dataset assembler" that reads properties from a source and registers the respective content in a dataset.

Such a command could understand the result records of download and, in conjunction with it, replace download-url. It could also have an archive-parser companion that pulls content from archives and hashes it, such that dl+archive URLs could be registered for new content in a dataset. Any intermediate filter (e.g. jq) could be used to perform arbitrary sub-selection or path manipulation.

It would also be interesting to explore this in the context of datalad-ebrains, where this is done inside a single command (generate dataset and file info from a metadata query, then record them in a dataset), and which may also benefit from code reuse and a less ad-hoc solution.

So maybe:

  • assemble-dataset
  • assemble-archive

as two new plumbing commands.

I wonder if that is an opportunity to reduce the complexity of save (in particular at the repo level). status could be the helper that feeds assemble-dataset.

@bpoldrack
Member

I have no immediate suggestion, but I want to register that the idea sounds pretty good at first look. :)

@mih
Member Author

mih commented May 3, 2023

Yet another thought-iteration:

The leanest concept I can come up with is a command that generates inputs for addurls. With addurls it would not matter whether the ultimate goal is to generate dl+archive URLs for the datalad-archives special remote, some other URLs pointing to extracted content of some archive somewhere, or something entirely different and unrelated to archives.

Such a command would not need to be able to assemble full URLs or full paths. It only needs to parse a collection (either already in an archive, or still in a dataset yet to be exported into an archive), and to report relevant properties of items in that collection, such as

  • (file)name and/or path inside the collection
  • checksum
  • size

A replacement for add-archive-content would then be a pipe between this property reporter and addurls -- both individually parameterized as desired. Generating dl+archive URLs would be nothing more than placing the respective annex key of an archive in the URL template given to addurls.

Brain storming

The command...

  • name could be list-collection
  • could have a --type argument that instructs what kind of collection to process (archive, worktree (or just directory), ref, annex) -- and each type can have its own parser implementation for straightforward extensibility
  • would have a location argument that identifies the collection to process
  • would yield a result record per collection item with all its properties, and thereby automatically yield compatible input for addurls (even though CLI usage will require the jq slurp trick from the addurls help; see the snippet below)
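
For reference, this is the "jq slurp trick" mentioned in the last point, with the JSON-lines input mocked via printf (the field names are placeholders): --slurp wraps a stream of per-item objects into the single JSON array that addurls expects on stdin.

printf '%s\n' '{"item":"a.txt","size":1}' '{"item":"b.txt","size":2}' \
    | jq --slurp .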

Out-of-scope features of add-archive-content

  • content layout mangling. However, content layout properties could be broken up into distinct items that enable such re-layouting via the addurls FILENAME-FORMAT declaration, i.e. report leading dirs instead of stripping them
  • deletion of the source/parsed collection
  • any form of content dropping or other dataset modifications
  • any form of requirement of a "work-dataset", other than a dataset being present when a dataset is to be parsed as a source collection

@mih
Member Author

mih commented May 6, 2023

Linking #323, which provides one of the candidate traversers for a list-collection command.

Other existing traversers that could move into list-collection are

  • gooey-lsdir (for a single directory)
  • tree (for a directory tree)

Given that we also have for-each-dataset -- which is focused on dataset collections -- we might consider drawing a line there, and not implement a list-collection command, but rather a list-file-collection command, and implement it such that it yields type=file results (possibly by yielding the filtered output of other (internal) implementations).

@mih
Member Author

mih commented May 7, 2023

I am starting to work on the list-collection aspect of this issue now.

@mih
Member Author

mih commented May 9, 2023

With the code that is coming in #343 we can effectively replace add-archive-content. Demo:

datalad -f json ls-file-collection tarfile my.tar.gz --hash md5 \
    | jq '. | select(.type == "file")' \
    | jq --slurp . \
    | datalad addurls --key 'et:MD5-s{size}--{hash-md5}' - 'dl+archive:SOMEKEY#path={item}&size={size}' '{item}'

Explanation

  • datalad ls-file-collection is reporting on the content of a tar file. Its result records have all the things one would expect, including size info. --hash md5 causes a file hash for each tar member to be computed. This is done storage-efficiently, without extracting the tar archive. The same command can report on other collection types too -- this kind of metadata generation is not limited to tar archives
  • the first jq command sub-selects file-type results (only needed because datalad's own facilities for that do not work; --report-type does not affect -f json_pp output? datalad#5581)
  • the second jq command turns JSON lines into a single JSON array, as is necessary for addurls (taken from its docs)
  • the final addurls command registers keys and URLs based on the metadata provided by ls-file-collection. The choice of annex backend is flexible. TAR layout mangling is mapped onto the FILENAME-FORMAT input (it is trivial to write into a subdirectory, or any other layout built from file properties; see the variant below). The choice of dl+archive URLs is arbitrary here (it replaces add-archive-content, but any other URLs could be set too -- say, for example, URLs to a hash-based object store that is exposed via HTTP and requires a sha256 of the file content -- possible with minimal changes)
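
For illustration, here is a variant of the demo above that changes only the FILENAME-FORMAT, placing all records under a (freely chosen) myarchive/ subdirectory instead of using the archive-internal paths alone:

datalad -f json ls-file-collection tarfile my.tar.gz --hash md5 \
    | jq '. | select(.type == "file")' \
    | jq --slurp . \
    | datalad addurls --key 'et:MD5-s{size}--{hash-md5}' - 'dl+archive:SOMEKEY#path={item}&size={size}' 'myarchive/{item}'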

add-archive-content features that are not directly supported

  • --add-archive-leading-dir -- possible via the FILENAME-FORMAT specification
  • --leading-dirs-depth, --leading-dirs-consider, --strip-leading-dirs, --rename -- require collection item name mangling/parsing that is not readily provided here
  • --use-current-dir -- nothing is extracted, hence not required; the rest is possible via the FILENAME-FORMAT specification
  • --delete -- there is no need for an archive to even be around; out of scope
  • --key -- only relevant for the dl+archive URL use case; not directly supported, but accessible via URL-FORMAT
  • --existing -- all conflict handling is left to addurls, which provides options for resolution
  • --copy, --drop-after, --delete-after -- there is only metadata processing going on; nothing needs to be moved or dropped as a consequence of this processing
  • --no-commit, --allow-dirty -- obsolete; this uses standard addurls behavior, which leaves a clean dataset clean

mih added a commit to mih/datalad-next that referenced this issue May 10, 2023
In comparison to the former, this is largely metadata driven, and
works without (local) extraction of a tarball. This saves storage
overhead, and makes it possible to run some parts of the ingestion
pipeline on a remote system.

Closes datalad#183
mih added a commit to mih/datalad-next that referenced this issue May 10, 2023
This is (also) an alternative approach to `add-archive-content`.

In comparison to the former, this is largely metadata driven, and
works without (local) extraction of a tarball. This saves storage
overhead, and makes it possible to run some parts of the ingestion
pipeline on a remote system.

Closes datalad#183
@mih mih closed this as completed in #343 May 11, 2023
@jsheunis
Member

One thing I am uncertain of: for this replacement of add-archive-content to work, should the tarball be part of the same datalad dataset on which the addurls command operates?

@mih
Member Author

mih commented May 12, 2023

It depends. There is a demo included in #343 that covers the case where the tarball is part of the dataset. This is because the datalad-archives special remote from -core requires that, and the demo documents that setup.

However, with the coming fspec features (#215) a different style of URLs could be assigned, and a different special remote could act on them (e.g. uncurl) and enable other use cases.

Third, datalad-archives could be replaced/changed entirely and support a hybrid of these extremes, where it can use the registered URL of a known tarball to provide fspec flexibility, but as an option, in addition to the traditional behavior. This is what #223 is about.

@jsheunis
Member

Thanks for the explanation
