Rework export archive extraction #3242

CasperWA · 2019-08-05T15:52:43Z

Fixes #3199
Fixes #3193 (since Archive is now utilized)

The function extract_tree now returns a Folder object, pointing to
the archive folder, which should not automatically erase the folder
after import.

The difference between extract_zip, extract_tar and extract_tree is now as follows: extract_zip and extract_tar unpack the compressed files to a SandboxFolder under .aiida/repository/<AiiDA user>/sandbox (or similar), the files are used from there, and finally the extracted files (and folders) are erased again.

extract_tree will instead return a Folder object (parent class of SandboxFolder), which points to the "original" folder/tree - the source.
This can potentially create some problems, since one can use extract_tree to get a Folder object and then use the erase method to remove the only known instance of the data.
This may be remedied by copying the contents of the path given to extract_tree into a temporary SandboxFolder (as is essentially done when unpacking the zip or tar files) and then operating exclusively on the SandboxFolder. The problem with this, however, is the extra disk space it takes up for large archives/folders/trees. Therefore, I have chosen to point to the "original" folder/tree and make sure erase is not called upon __exit__. The caveat is that I could have missed a specific call to erase somewhere; however, this does not seem to be the case, according to the tests :)
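To illustrate the trade-off, here is a minimal sketch of the two behaviours. PlainFolder and TemporaryFolder are hypothetical stand-ins written for this example; they are not the actual Folder and SandboxFolder classes and only mimic the erase-on-exit difference described above.

```python
import shutil
import tempfile
from pathlib import Path


class PlainFolder:
    """Hypothetical stand-in for a wrapper that points at an existing source tree."""

    def __init__(self, path):
        self.path = Path(path)

    def erase(self):
        # Removes the underlying directory; on a source tree this destroys the only copy.
        shutil.rmtree(self.path, ignore_errors=True)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Deliberately does nothing: the source tree must survive the context.
        return False


class TemporaryFolder(PlainFolder):
    """Hypothetical stand-in for a sandbox: owns its directory and erases it on exit."""

    def __init__(self):
        super().__init__(tempfile.mkdtemp())

    def __exit__(self, exc_type, exc, tb):
        self.erase()
        return False


def demo():
    """Return (file visible inside context, source survived, sandbox removed)."""
    source = Path(tempfile.mkdtemp())
    (source / "data.json").write_text("{}")
    with PlainFolder(source) as folder:
        inside = (folder.path / "data.json").exists()
    survived = source.exists()
    with TemporaryFolder() as sandbox:
        sandbox_path = sandbox.path
    sandbox_gone = not sandbox_path.exists()
    shutil.rmtree(source)  # clean up the demo directory
    return inside, survived, sandbox_gone
```

The risk mentioned above remains visible here: nothing stops a caller from invoking erase() on a PlainFolder explicitly.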

Update
extract_tree is now get_tree and all extraction functions have had their uses cut down to a single place; the Archive class. This class will henceforth be used in import/export to deal with export archives. I believe this was also @sphuber's original intent for this class(?)

To do
~~- [ ] Thoroughly introduce Archive to the export function~~ No need.

- Update migrations to rely on Archive
- Move some code from verdi export migrate to the aiida.tools.importexport.migration module
- Update all tests to use either Archive or the extraction functions directly, where it may be needed
- Consider updating test utility functions to use Archive

CasperWA · 2019-08-07T10:05:47Z

Found an error with this.
When the import functions copy in the new repository subfolder for each Node, they create a SandboxFolder object for each subfolder, meaning the "source" subfolder will be deleted upon a successful transfer.
This should either be fixed in the import functions, when creating the subfolder objects, or in the extract_tree function, where the files should first be copied and then used.
Personally, I still prefer not to copy the tree, and will instead implement a "switch"/if-else in the import functions, when creating the subfolder objects.

Edit: It is not that a SandboxFolder object is created, instead it is that the move parameter is set to True when creating the Node's repository folder.
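A small sketch of why a move flag removes the source, using only the standard library. The transfer helper and its move argument are hypothetical names for this example and do not reproduce the actual import code.

```python
import shutil
import tempfile
from pathlib import Path


def transfer(source, destination, move=False):
    """Hypothetical helper: place `source` at `destination` by moving or copying."""
    if move:
        # After a move the source path no longer exists.
        shutil.move(str(source), str(destination))
    else:
        # A copy leaves the source tree untouched.
        shutil.copytree(str(source), str(destination))


def source_survives(move):
    """Build a throwaway tree, transfer it, and report whether the source is still there."""
    root = Path(tempfile.mkdtemp())
    source = root / "archive" / "node"
    source.mkdir(parents=True)
    (source / "file.txt").write_text("content")
    transfer(source, root / "repository_node", move=move)
    survived = source.exists()
    shutil.rmtree(root)  # clean up the demo directory
    return survived
```

For an archive given as a plain folder/tree, the copying branch is the one that keeps the user's data intact.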

CasperWA · 2019-08-29T20:37:56Z

I am currently rewriting this, in order to properly utilize the "new" Archive class made by @sphuber. The idea is that the extraction functions will be used solely by Archive, and the import and migration functions will use an Archive instance/object to load the files needed. This should reduce lines of code, but have no effect on performance (i.e., speed and stability should be roughly the same as before).

CasperWA · 2019-09-11T07:52:20Z

The Archive class has become a bit bloated - I have tried to keep it reasonable, but in order to make it "repackable", I had to introduce several minor functions and properties.
On the other hand, it is now repackable, i.e., one can use the repack() method (when in a context) to write the meta_data and data properties/dictionaries back to their respective JSON files in the Archive.folder.abspath and subsequently write it all automatically to a specified compressed tar or zip file. The format cannot be chosen manually; it is chosen automatically based on the original archive format (if the original is a folder/tree, the format will be zip).
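The repack flow could look roughly like the sketch below. This is not the actual Archive.repack() implementation: the function signature, the original_format argument, the JSON file names and the choice of gzip for tar are all assumptions made for illustration.

```python
import json
import shutil
import tempfile
import zipfile
from pathlib import Path


def repack(folder, data, meta_data, output_stem, original_format="tree"):
    """Hypothetical sketch: write both dicts to JSON, then compress the folder."""
    folder = Path(folder)
    # Assumed file names for the two JSON files.
    (folder / "data.json").write_text(json.dumps(data))
    (folder / "metadata.json").write_text(json.dumps(meta_data))
    # Assumption: tar archives are gzip-compressed; zips and plain trees become zip.
    archive_format = "gztar" if original_format == "tar" else "zip"
    # make_archive appends the extension and returns the final file name.
    return shutil.make_archive(str(output_stem), archive_format, root_dir=str(folder))


def demo():
    """Repack a throwaway folder twice and return the results for inspection."""
    work = Path(tempfile.mkdtemp())
    source = work / "source"
    source.mkdir()
    zip_path = repack(source, {"a": 1}, {"version": "x"}, work / "out_zip", "tree")
    tar_path = repack(source, {"a": 1}, {"version": "x"}, work / "out_tar", "tar")
    with zipfile.ZipFile(zip_path) as handle:
        names = handle.namelist()
    shutil.rmtree(work)  # clean up the demo directory
    return zip_path, tar_path, names
```

Note that output_stem should live outside folder, otherwise the archive being written could end up inside itself.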

This new repack() method makes it easy to use the Archive class for migrations. Hence, a new migration workflow has been introduced that utilizes this method and minimizes the amount of "backend" code in the click function for verdi export migrate.

sphuber

I agree with your suspicion that the class has become too bloated with these changes. I can see the advantage of leveraging it to write archives in addition to just reading them. Also, the solution you chose for the difference between the packed archives and the plain directory ones is not ideal. Both of these problems stem from the same concept: mutability. Maybe a better solution would be to keep the Archive base class read-only, as it used to be (note that you will have to remove the data and metadata properties to prevent people from changing their content from the outside). You can then make a subclass WritableArchive that adds the possibility of mutating the content and saving the changes, either to a new archive or by overwriting the original. The solution for the directory archive is then not to change the Folder type, but simply, when opening the context, to copy its contents to a SandboxFolder, the same as for the packed archives. This also solves a problem that your current solution does not tackle, namely how to revert changes to a directory archive. When you make it mutable but might also want to keep the original, you _in any case_ have to create a clone to work on.
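A rough sketch of that split, restricted to directory archives. Everything here is hypothetical: ReadOnlyArchive stands in for the base class, the method names read_data, update_data and save are invented for the example, and a temporary directory plays the role of the SandboxFolder.

```python
import copy
import json
import shutil
import tempfile
from pathlib import Path


class ReadOnlyArchive:
    """Hypothetical read-only archive: always works on a clone of the source."""

    def __init__(self, source):
        self._source = Path(source)
        self._workdir = None
        self._data = None

    def __enter__(self):
        # Clone the directory so the original is never touched while in the context.
        self._workdir = Path(tempfile.mkdtemp())
        shutil.copytree(self._source, self._workdir / "content")
        self._data = json.loads((self._workdir / "content" / "data.json").read_text())
        return self

    def __exit__(self, exc_type, exc, tb):
        # Dropping the clone discards any unsaved changes.
        shutil.rmtree(self._workdir, ignore_errors=True)
        self._workdir = None
        return False

    def read_data(self):
        # Hand out a deep copy so callers cannot mutate the internal state.
        return copy.deepcopy(self._data)


class WritableArchive(ReadOnlyArchive):
    """Hypothetical subclass that adds mutation and saving."""

    def update_data(self, **changes):
        self._data.update(changes)

    def save(self, destination):
        # Write the changes into the clone, then copy the clone to the destination.
        clone = self._workdir / "content"
        (clone / "data.json").write_text(json.dumps(self._data))
        destination = Path(destination)
        if destination.exists():
            shutil.rmtree(destination)  # overwriting, possibly the original itself
        shutil.copytree(clone, destination)
```

Reverting is then simply a matter of leaving the context without calling save().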

Happy to discuss this in more detail in person before you start implementing

sphuber · 2019-09-17T09:47:14Z