Implement alternative to add-archive-content
#183
Comments
More thoughts... There are essentially two fundamental use cases: import from an archive, and export of a (work)tree into an archive (both detailed below).
Orthogonal to these fundamental scenarios is the question whether the internal layout of an archive matches that of a dataset's (work)tree, or not. If not, such transformations could be a selection filter (exclude/include a subset) or a re-layouting. This results in four main scenarios that could be captured in different ways (see below). In general it does not seem meaningful to include the deposition of an archive in any of the procedures covered here -- there are countless ways to do that (by special remote, remote, URL, ...), and covering them all would be a mess.

Import from an archive

I am skipping this for now. Most, if not all, features I can think of are covered elsewhere.

Export (work)tree into an archive

We have an export command already. However, adding the ability to specify alternative trees to export, and to then register generated archives back with the source dataset, would be straightforward. In addition, we have a second related command. I believe the two commands could be unified into a single command that supports:
and this command would also gain the ability to register a generated archive back with a source dataset. It may even be possible (or better) to split the actual archive generation/registration out into a dedicated low-level/plumbing tool. This tool would read JSON records describing the content to package (as produced by other commands). Once available, this helper can be used inside the higher-level commands.
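A minimal shell sketch of how such a plumbing helper could operate, assuming a hypothetical JSON-lines listing; the field names, file names, and the whole pipeline are made up for illustration, not an existing interface:

```bash
# Illustrative only: build an archive from a JSON-lines listing of worktree
# content ("path"/"size" are hypothetical field names), then register the
# resulting archive back in the source dataset as an ordinary annexed file.
cat > listing.jsonl <<'EOF'
{"path": "sub-01/anat/T1w.nii.gz", "size": 123456}
{"path": "sub-01/func/bold.nii.gz", "size": 7891011}
EOF

# feed the listed paths to tar to produce the archive (no re-layouting here)
jq -r '.path' listing.jsonl | tar -czf dataset-snapshot.tar.gz --files-from=-

# register the generated archive with the (source) dataset
datalad save -m "Add exported archive" dataset-snapshot.tar.gz
```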
Analogous to the "archive assembler" described above (reads content properties from a source, builds, and possibly registers an archive in a dataset), we could have a "dataset assembler" that reads properties from a source and registers the respective content in a dataset. Such a command could understand the result records of existing commands. It would also be interesting to explore this in the context of datalad-ebrains, where this is done inside a single command (generate dataset and file infos from a metadata query and then record them in a dataset), but which may also benefit from code-reuse and a less ad-hoc solution. So maybe:
as two new plumbing commands. I wonder if that is an opportunity to reduce the complexity of the existing implementation.
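For the "dataset assembler" side, the existing `datalad addurls` already covers a slice of this idea -- a minimal sketch, assuming a simple JSON table of per-file properties (all URLs and file names below are invented for illustration):

```bash
# Illustrative only: register dataset content purely from property records.
# The records follow the tabular/JSON input that `datalad addurls` accepts;
# every URL and file name here is made up.
cat > files.json <<'EOF'
[
  {"url": "https://example.com/data/file1.dat", "filename": "raw/file1.dat"},
  {"url": "https://example.com/data/file2.dat", "filename": "raw/file2.dat"}
]
EOF

# populate the dataset in the current directory from the records
datalad addurls -d . files.json '{url}' '{filename}'
```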
I have no immediate suggestion, but I want to register that the idea sounds pretty good at first look. :)
Yet another thought-iteration: the leanest concept I can come up with is a command that generates inputs for such downstream tooling. Such a command would not need to be able to assemble full URLs or full paths. It only needs to parse a collection (either already in an archive, or still in a dataset yet to be exported into an archive), and to report relevant properties of the items in that collection (see the sketch below for what such a report could look like).
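A minimal sketch of the kind of per-item report such a collection parser could emit, here for an archive and using only generic tools; the property names (path, size) are merely plausible examples, not a defined interface:

```bash
# Illustrative only: list the members of a tarball and emit one JSON record
# per item with a couple of plausible properties (path, size).
# Assumes GNU tar/awk and archive member names without whitespace.
tar -tvf dataset-snapshot.tar.gz \
  | awk '$1 !~ /^d/ {printf "{\"path\": \"%s\", \"size\": %s}\n", $NF, $3}'
```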
A replacement for `add-archive-content`

Brain storming

The command...
Out-of-scope features of `add-archive-content`
Linking #323, which provides one of the candidate traversers for such a command. There are other existing traversers that could move in as well.
Given that we also have ...
I am starting to work on this.
With the code that is coming in #343 we can effectively replace `add-archive-content`.
This is (also) an alternative approach to `add-archive-content`. In comparison to the former, this is largely metadata driven, and works without (local) extraction of a tarball. This saves storage overhead, and makes it possible to run some parts of the ingestion pipeline on a remote system. Closes datalad#183
One thing I am uncertain of: for this replacement of `add-archive-content`, does the archive itself still need to be included (as an annexed file) in the dataset?
This depends. There is a demo included in #343 that demos the case where the tarball is included; this is because of how the special remote used there resolves the registered URLs. However, with the coming fsspec features (#215) a different style of URLs could be assigned, and a different special remote could act on them (e.g. one accessing an archive at a remote location directly). Third, ...
Thanks for the explanation.
Note: Lots of thoughts here. The latest variant is #183 (comment)
This is one of the oldest DataLad commands. Its core purpose is to register special URLs for the `datalad-archive` special remote that indicate that particular keys can be obtained by extracting a file from an archive that is itself registered as an annexed file in the dataset. On top of that, this command is also a front-end to archive extraction and content re-layouting.
It would be good to disentangle these aspects and provide a more targeted (and likely simpler) alternative.
Concept
Parse an archive (already an annexed file), run `git annex calckey` on each contained file, and add a `datalad-archive` URL to the computed key, if it exists. This command would not interact with a dataset's worktree. Consequently, it needs no parameterization regarding archive content re-layouting. It should also work with bare repos.
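A rough shell sketch of this concept, purely for illustration; the archive name, the scratch-directory extraction, and the exact dl+archive: URL layout are assumptions, not the actual implementation:

```bash
# Sketch only: register archive-member availability for computed keys.
# Assumes the archive is itself an annexed file, and that the special remote
# understands URLs of the (assumed) form dl+archive:<archive-key>#path=<member>.
archive=dataset-snapshot.tar.gz
akey=$(git annex lookupkey "$archive")

# extract to a scratch location (not the worktree) so calckey can see content
scratch=$(mktemp -d)
tar -xzf "$archive" -C "$scratch"

(cd "$scratch" && find . -type f -printf '%P\n') | while read -r member; do
    fkey=$(git annex calckey "$scratch/$member")
    # a real implementation would only do this for keys already known to the repo
    git annex registerurl "$fkey" "dl+archive:${akey}#path=${member}"
done
rm -rf "$scratch"
```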
From a user perspective, the composition of a dataset's content becomes a more independent process. A user could extract files from an existing archive any way they see fit, and rename, filter, or restructure them with any means and/or tools. Importantly, it does not matter whether an archive is the actual starting point of the operation, or whether an archive (with any structure) was just created from existing dataset content (think ORA dataset archive), for the purpose of depositing only this archive (rather than individual files) on some storage system.