Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework export archive extraction #3242

Closed
wants to merge 4 commits into from

Conversation

CasperWA
Copy link
Contributor

@CasperWA CasperWA commented Aug 5, 2019

Fixes #3199
Fixes #3193 (since Archive is now utilized)

The function extract_tree now returns a Folder object, pointing to
the archive folder, which should not automatically erase the folder
after import.

The difference between the functions extract_zip, extract_tar and extract_tree now, is that extract_zip and extract_tar unpacks the compressed files to a SandboxFolder under .aiida/repository/<AiiDA user>/sandbox (or similar), and then uses the files from there to whatever purpose and finally erases the extracted files (and folders) again.

extract_tree will instead return a Folder object (parent class of SandboxFolder), which points to the "original" folder/tree - the source.
This can potentially create some problems, since one can use extract_tree to get a Folder object and then use the erase method to remove the only known instance of the data.
This may be remedied by copying the contents of the specified path for extract_tree to a temporary SandboxFolder (as is essentially done when unpacking the zip or tar files) and then exclusively operate on the SandboxFolder. The problem, however, with this is the extra space this will take up for large archives/folders/trees. Therefore, I have chosen the solution of pointing to the "original" folder/tree, making sure erase is not called upon __exit__. The caveat being that I could have missed a specific call to erase somewhere, however, this does not seem to be the case - according to the tests :)

Update
extract_tree is now get_tree and all extraction functions have had their uses cut down to a single place; the Archive class. This class will henceforth be used in import/export to deal with export archives. I believe this was also @sphuber's original intent for this class(?)

To do
- [ ] Thoroughly introduce Archive to the export function No need.

  • Update migrations to rely on Archive
  • Move some code from verdi export migrate to the aiida.tools.importexport.migration module
  • Update all tests to use either Archive or the extraction functions directly, where it may be needed
  • Consider updating test utility functions to use Archive

@CasperWA
Copy link
Contributor Author

CasperWA commented Aug 7, 2019

Found an error with this.
When the import functions copy in the new repository subfolder for each Node, it creates a SandboxFolder object for each subfolder, meaning the "source" subfolder will be deleted upon a successful transfer.
This should either be fixed in the import functions, when creating the subfolder objects, or in the extract_tree function, where the files should first be copied and then used.
Personally, I still prefer not to copy the tree, and will instead implement a "switch"/if-else in the import functions, when creating the subfolder objects.

Edit: It is not that a SandboxFolder object is created, instead it is that the move parameter is set to True when creating the Node's repository folder.

@CasperWA CasperWA changed the title Rewrite extract_tree for archive import [WIP] Rewrite extract_tree for archive import Aug 7, 2019
@CasperWA CasperWA changed the title [WIP] Rewrite extract_tree for archive import Rewrite extract_tree for archive import Aug 7, 2019
@CasperWA CasperWA force-pushed the fix_3199_extract_tree branch from aaad01d to 1b4dac0 Compare August 12, 2019 14:10
@CasperWA CasperWA changed the title Rewrite extract_tree for archive import [WIP] Rewrite extract_tree for archive import Aug 29, 2019
@CasperWA
Copy link
Contributor Author

I am currently rewriting this, in order to properly utilize the "new" Archive class made by @sphuber. The idea is that the extraction functions will be used solely by Archive, and the import and migration functions will use an Archive instance/object to load the files needed. This should reduce lines of code, but have no effect on performance (i.e., speed and stability should be roughly the same as before).

@CasperWA CasperWA changed the title [WIP] Rewrite extract_tree for archive import [WIP] Rework export archive extraction Aug 29, 2019
@CasperWA CasperWA force-pushed the fix_3199_extract_tree branch from 1b4dac0 to af3aed2 Compare August 29, 2019 20:43
@CasperWA CasperWA force-pushed the fix_3199_extract_tree branch 4 times, most recently from 6876b1a to 6f16fe6 Compare September 10, 2019 12:37
@CasperWA CasperWA changed the title [WIP] Rework export archive extraction Rework export archive extraction Sep 10, 2019
@CasperWA
Copy link
Contributor Author

The Archive class has bloated a bit - I have tried to keep it reasonable, but wanting to make it "repackable", I had to introduce several minor functions and properties.
On the other side, it is now repackable, i.e., one can use the repack() method (when in a context) to write the meta_data and data properties/dictionaries back to their respective JSON files in the Archive.folder.abspath and subsequently automatically write it all to a specified compressed tar or zip file. The format cannot be chosen manually, but will automatically be chosen based on the original archive format (if it is a folder/tree, the format will be zip).

This new repack() method makes it easy to use the Archive class for migrations. Hence, a new migration workflow has been introduced that utilizes this method and minimizes the amount of "backend" code in the click function for verdi export migrate.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your suspicion that the class has become too bloated with these changes. I can see the advantage of leveraging it to write archives in addition to just reading them. Also the solution you chose for the difference between the packed archives and the plain directory ones is not ideal. Both of these problems stem from the same concept: mutability. Maybe a better solution would be to keep the Archive base class as read only, as it used to be (note that will have to remove the data and metadata properties to prevent people from changing their content from the outside). You can then make a sub class WritableArchive that adds the possibility of mutating the content and saving the changes, either to a new archive or overwriting the original. The solution for the directory archive then is to not change the Folder type, but simply when opening the context, copying its contents to a SandboxFolder same as for the packed archives. This also solves the problem that your current solution does not tackle, as to how revert changes of a directory archive. When you make it mutable but also might want to keep the original you _in any case_have to create a clone to work on.

Happy to discuss this in more detail in person before you start implementing

"""Decorator to ensure that the instance is called within a context manager."""
if instance and not instance.folder:
raise InvalidOperation('the Archive class should be used within a context')
def _ensure_within_context(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you move the definition of the decorator outside of the class? It only makes sense to be applied to class methods. In any case, now you are no longer using this internal _ensure_within_context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True - it comes from the desire to have a class used only for tests that subclasses Archive, but otherwise does not need everything to be in a context. For this I needed the decorator to call an "overridable" class method.

self._folder = None
self._unpacked = False
self._keep = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why introduce another attribute keep to control the deletion of the folder if you also kind of encode this in what type of folder is used, i.e. SandboxFolder or normal Folder. If you only use this keep to distinguish between these two cases then one of the two is doubling up. If you need keep also in the SandboxFolder cases, i.e. for the zipped/tarred archives, then we should just get rid of the SandboxFolder and control deletion ourselves through keep. Maybe this makes more sense anyway since it was the usage of SandboxFolder that was giving you problems to begin with. Always use a Folder and for the non directory archives set _keep to False such that the unpacked is automatically cleaned

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very sensible idea!

However, this is somewhat coupled with the fact that I am still trying to get around having to copy all the files (if they are an extracted archive - a folder on the hdd) before repacking. But maybe there is no getting around this, if we wish to use this class for migrations and the archived node repository files need to be altered. If it was only the JSON files, it may have been possible. Unless the class could handle a selective copying and altering during migrations that the migration functions do not need to know about?

self._data = None
self._meta_data = None
self._archive_format = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are not using this consistently. For starters, I cannot find where you set this. But then you do use the value to switch on in unpack and repack. You are also sometimes using the attribute and sometimes the property

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set it in unpack, where it is set to zip, unless the archive was a tar file, in which case it is set to tar.

I will go through its uses. Also, I think this is an attribute that should be completely reworked from a design point of view, in order to make it possible to choose the archive you are outputting (if repacking).

with io.open(self.folder.get_abs_path(filename), 'wb') as fhandle:
json.dump(data, fhandle)

self._keep = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have to change _keep here? Is the idea that as soon as anything is written, it should always be kept?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly.

@CasperWA CasperWA force-pushed the fix_3199_extract_tree branch 3 times, most recently from 2f4db75 to aac1552 Compare September 17, 2019 14:09
@CasperWA CasperWA added this to the v1.1.0 milestone Oct 30, 2019
@CasperWA CasperWA force-pushed the fix_3199_extract_tree branch from aac1552 to e430588 Compare December 3, 2019 00:36
@CasperWA CasperWA force-pushed the fix_3199_extract_tree branch from e430588 to 2d36c93 Compare December 3, 2019 14:31
@CasperWA CasperWA force-pushed the fix_3199_extract_tree branch from 2d36c93 to 0f7f537 Compare December 3, 2019 23:46
The function `extract_tree` now returns a `Folder` object, pointing to
the archive folder, which should not automatically erase the folder
after import.

Correct import functions for extract_tree.
This was a problem when creating the new Node repository, since the
import functions would move the source directory (i.e. delete the source
directory after a successful copy).
The extraction of export archives is now handled by the `Archive` class,
which will point to a `SandboxFolder` if the archive is not a folder, in
which case it will point to the folder, making it a `Folder` object
instead - as to not erase the archive when done with it.

The use of the extraction functions now solely fall upon the `Archive`
class.
The Archive class has been introduced to migrations and tests alike,
with the purpose of reducing "noise" and making the focus of Archive
clear for imports and dealing with export archives.
This commit handles the more easily implemented suggestions and should
later be merged with other commits, where it makes sense.
@CasperWA CasperWA force-pushed the fix_3199_extract_tree branch from 0f7f537 to bda6519 Compare December 4, 2019 19:56
@sphuber sphuber removed this from the v1.1.0 milestone Feb 28, 2020
@CasperWA CasperWA mentioned this pull request Apr 3, 2020
2 tasks
@sphuber
Copy link
Contributor

sphuber commented May 7, 2020

Since this has been open for very long and we first need to find some time to discuss the redesign, is it ok to close this for now @CasperWA ?

@CasperWA
Copy link
Contributor Author

CasperWA commented May 7, 2020

Since this has been open for very long and we first need to find some time to discuss the redesign, is it ok to close this for now @CasperWA ?

Since this needs redesign. Yes, it can be closed (if you absolutely do not wish it to be open here).

@sphuber sphuber closed this May 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants