Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inefficient order of checks in verdi import #2979

Closed
sphuber opened this issue Jun 8, 2019 · 6 comments · Fixed by #4510
Closed

Inefficient order of checks in verdi import #2979

sphuber opened this issue Jun 8, 2019 · 6 comments · Fixed by #4510

Comments

@sphuber
Copy link
Contributor

sphuber commented Jun 8, 2019

verdi import currently first extracts the node data before checking the metadata. When trying to import a big archive with an old version, one first has to wait for the data to be extracted before getting the message that the version is incompatible. Only then does it go automatically to verdi export migrate, but there the same problem happens. Version compatibility happens after all the data is extracted. The Archive class provides useful context managers to read just a single file. This should be used in both these commands to first read the metadata.json and report any errors if it is incompatible before doing anything else.

@ltalirz
Copy link
Member

ltalirz commented Jul 6, 2020

Other issues/feature requests:

  • verdi export inspect has a way of looking inside the metadata and we should simply be reusing this for verdi export migrate (I just checked and it doesn't seem to be terribly fast either; perhaps the same mistake is made there)

  • For archives that already are at the latest version, verdi export migrate currently fails with:

$ verdi export migrate aiida_tutorial_2020_07_perovskites_v0.9.aiida out.aiida
Critical: Your export file is already at the version 0.9

This is not the behavior you want e.g. when you try migrating files automatically - this should be considered a Success

  • I would like to see an -i/--in-place flag to migrate the archive in place

Since we're about to migrate a number of archives for Materials Cloud, I will have a brief look into this.

@ltalirz
Copy link
Member

ltalirz commented Jul 6, 2020

The issue seems to be here:

@ensure_within_context
@ensure_unpacked
def _read_json_file(self, filename):
"""Read the contents of a JSON file from the unpacked archive contents.
:param filename: the filename relative to the sandbox folder
:return: a dictionary with the loaded JSON content
"""
with open(self.folder.get_abs_path(filename), 'r', encoding='utf8') as fhandle:
return json.load(fhandle)

The @ensure_unpacked decorator extracts the whole archive although we would just like to read a single file.

The Archive class provides useful context managers to read just a single file.

@sphuber Could you please point me to this context manager?

@sphuber
Copy link
Contributor Author

sphuber commented Jul 6, 2020

That phrase might be a bit misleading. It does allow you to retrieve the contents of a single file, but it still requires unpacking the whole thing. Currently, AiiDA archives can be either .zip or .tar.gz archives, and I am not sure if it is possible to even extract just a single file out of those. I think for .zip files it is, but not sure for tarballs.

@ltalirz
Copy link
Member

ltalirz commented Jul 6, 2020

I am not sure if it is possible to even extract just a single file out of those.

It certainly is possible, both for .zip and for .tar.gz files.

Python syntax:

tarfile.open('tarfile.tar').extract('file/to/extract', path='dest')

for zipfile see https://stackoverflow.com/a/17729939/1069467

So it seems that we should generalize the extraction code a bit to allow extracting single files and then let this functionality bubble up here...

@ltalirz
Copy link
Member

ltalirz commented Aug 26, 2020

Just to mention that

I.e. we can keep discussions on this thread to speed issues in inspecting/loading archive files.

@ltalirz
Copy link
Member

ltalirz commented Oct 29, 2020

Thanks to #4510, the archive reader is now just reading (and caching) the files that are needed.

In particular, the version check will just read the metadata

def check_version(self):
"""Check the version compatibility of the archive.
:raises: `~aiida.tools.importexport.common.exceptions.IncompatibleArchiveVersionError`:
If the version is not compatible
"""
file_version = StrictVersion(self.export_version)
expected_version = StrictVersion(self.compatible_export_version)
try:
if file_version != expected_version:
msg = f'Archive file version is {file_version}, can read only version {expected_version}'
if file_version < expected_version:
msg += "\nUse 'verdi export migrate' to update this archive file."
else:
msg += '\nUpdate your AiiDA version in order to import this file.'
raise IncompatibleArchiveVersionError(msg)
except AttributeError:
msg = (
f'Archive file version is {self.export_version}, '
f'can read only version {self.compatible_export_version}'
)
raise IncompatibleArchiveVersionError(msg)

At the moment verdi export inspect defaults to a statistics overview that requires loading the data.json as well.
This is certainly useful but I propose to change the default to a mode that only requires reading the metadata.json (e.g. printing just the export version and the AiiDA version that was used to generate it.
Opened #4531 for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants