-
Notifications
You must be signed in to change notification settings - Fork 214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework export archive extraction #3242
Conversation
Found an error with this. Edit: It is not that a |
extract_tree
for archive importextract_tree
for archive import
extract_tree
for archive importextract_tree
for archive import
aaad01d
to
1b4dac0
Compare
extract_tree
for archive importextract_tree
for archive import
I am currently rewriting this, in order to properly utilize the "new" |
extract_tree
for archive import1b4dac0
to
af3aed2
Compare
6876b1a
to
6f16fe6
Compare
The This new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your suspicion that the class has become too bloated with these changes. I can see the advantage of leveraging it to write archives in addition to just reading them. Also the solution you chose for the difference between the packed archives and the plain directory ones is not ideal. Both of these problems stem from the same concept: mutability. Maybe a better solution would be to keep the Archive
base class as read only, as it used to be (note that will have to remove the data
and metadata
properties to prevent people from changing their content from the outside). You can then make a sub class WritableArchive
that adds the possibility of mutating the content and saving the changes, either to a new archive or overwriting the original. The solution for the directory archive then is to not change the Folder
type, but simply when opening the context, copying its contents to a SandboxFolder
same as for the packed archives. This also solves the problem that your current solution does not tackle, as to how revert changes of a directory archive. When you make it mutable but also might want to keep the original you _in any case_have to create a clone to work on.
Happy to discuss this in more detail in person before you start implementing
"""Decorator to ensure that the instance is called within a context manager.""" | ||
if instance and not instance.folder: | ||
raise InvalidOperation('the Archive class should be used within a context') | ||
def _ensure_within_context(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you move the definition of the decorator outside of the class? It only makes sense to be applied to class methods. In any case, now you are no longer using this internal _ensure_within_context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True - it comes from the desire to have a class used only for tests that subclasses Archive
, but otherwise does not need everything to be in a context. For this I needed the decorator to call an "overridable" class method.
self._folder = None | ||
self._unpacked = False | ||
self._keep = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why introduce another attribute keep
to control the deletion of the folder if you also kind of encode this in what type of folder is used, i.e. SandboxFolder
or normal Folder
. If you only use this keep
to distinguish between these two cases then one of the two is doubling up. If you need keep
also in the SandboxFolder
cases, i.e. for the zipped/tarred archives, then we should just get rid of the SandboxFolder
and control deletion ourselves through keep
. Maybe this makes more sense anyway since it was the usage of SandboxFolder
that was giving you problems to begin with. Always use a Folder
and for the non directory archives set _keep
to False
such that the unpacked is automatically cleaned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very sensible idea!
However, this is somewhat coupled with the fact that I am still trying to get around having to copy all the files (if they are an extracted archive - a folder on the hdd) before repacking. But maybe there is no getting around this, if we wish to use this class for migrations and the archived node repository files need to be altered. If it was only the JSON files, it may have been possible. Unless the class could handle a selective copying and altering during migrations that the migration functions do not need to know about?
self._data = None | ||
self._meta_data = None | ||
self._archive_format = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are not using this consistently. For starters, I cannot find where you set this. But then you do use the value to switch on in unpack
and repack
. You are also sometimes using the attribute and sometimes the property
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set it in unpack
, where it is set to zip
, unless the archive was a tar
file, in which case it is set to tar
.
I will go through its uses. Also, I think this is an attribute that should be completely reworked from a design point of view, in order to make it possible to choose the archive you are outputting (if repacking).
with io.open(self.folder.get_abs_path(filename), 'wb') as fhandle: | ||
json.dump(data, fhandle) | ||
|
||
self._keep = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have to change _keep
here? Is the idea that as soon as anything is written, it should always be kept?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly.
2f4db75
to
aac1552
Compare
aac1552
to
e430588
Compare
e430588
to
2d36c93
Compare
2d36c93
to
0f7f537
Compare
The function `extract_tree` now returns a `Folder` object, pointing to the archive folder, which should not automatically erase the folder after import. Correct import functions for extract_tree. This was a problem when creating the new Node repository, since the import functions would move the source directory (i.e. delete the source directory after a successful copy).
The extraction of export archives is now handled by the `Archive` class, which will point to a `SandboxFolder` if the archive is not a folder, in which case it will point to the folder, making it a `Folder` object instead - as to not erase the archive when done with it. The use of the extraction functions now solely fall upon the `Archive` class.
The Archive class has been introduced to migrations and tests alike, with the purpose of reducing "noise" and making the focus of Archive clear for imports and dealing with export archives.
This commit handles the more easily implemented suggestions and should later be merged with other commits, where it makes sense.
0f7f537
to
bda6519
Compare
Since this has been open for very long and we first need to find some time to discuss the redesign, is it ok to close this for now @CasperWA ? |
Since this needs redesign. Yes, it can be closed (if you absolutely do not wish it to be open here). |
Fixes #3199
Fixes #3193 (since
Archive
is now utilized)The function
extract_tree
now returns aFolder
object, pointing tothe archive folder, which should not automatically erase the folder
after import.
The difference between the functions
extract_zip
,extract_tar
andextract_tree
now, is thatextract_zip
andextract_tar
unpacks the compressed files to aSandboxFolder
under.aiida/repository/<AiiDA user>/sandbox
(or similar), and then uses the files from there to whatever purpose and finally erases the extracted files (and folders) again.extract_tree
will instead return aFolder
object (parent class ofSandboxFolder
), which points to the "original" folder/tree - the source.This can potentially create some problems, since one can use
extract_tree
to get aFolder
object and then use theerase
method to remove the only known instance of the data.This may be remedied by copying the contents of the specified path for
extract_tree
to a temporarySandboxFolder
(as is essentially done when unpacking the zip or tar files) and then exclusively operate on theSandboxFolder
. The problem, however, with this is the extra space this will take up for large archives/folders/trees. Therefore, I have chosen the solution of pointing to the "original" folder/tree, making sureerase
is not called upon__exit__
. The caveat being that I could have missed a specific call toerase
somewhere, however, this does not seem to be the case - according to the tests :)Update
extract_tree
is nowget_tree
and all extraction functions have had their uses cut down to a single place; theArchive
class. This class will henceforth be used in import/export to deal with export archives. I believe this was also @sphuber's original intent for this class(?)To do
- [ ] Thoroughly introduceNo need.Archive
to the export functionArchive
verdi export migrate
to theaiida.tools.importexport.migration
moduleArchive
or the extraction functions directly, where it may be neededArchive