Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop PPDataProxy accessing the file when no data is needed. #3659

Merged
merged 12 commits into from
Feb 19, 2020

Conversation

pp-mo
Copy link
Member

@pp-mo pp-mo commented Feb 14, 2020

We just found that PP and Fieldsfile loading has been slow and heavy on memory, apparently since Dask 2.0.

In these cases, the PPDataProxy does not actually use the file data, so this fix allows it to return a suitable empty array object of the expected type and shape, without reading the file.


The reason appears to be that dask.array.from_array(arraylike) now extracts an empty slice of the argument, like arraylike[0:0, 0:0...].

Note, this is even worse than it sounds :

  1. the PP field will often need to be unpacked, which is time consuming
  2. since the 0-size slice is stored within the dask object, and it is a view referencing the full data array, hence all field data is loaded, and remains in memory !

See this line in meta_from_array, called from here in Array.__new__, as used in from_array.

Speculation : I think dask is doing this to capture the metadata of a slice from the argument, which it presumably believes can be different from that of the argument itself. I'm guessing this is needed for generalised array types e.g. dask, CUDA

@pp-mo pp-mo requested a review from trexfeathers February 14, 2020 19:02
Copy link
Contributor

@trexfeathers trexfeathers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pp-mo, I follow the logic of all this. There are a couple of changes to address 🙂

@pp-mo
Copy link
Member Author

pp-mo commented Feb 18, 2020

Hi @trexfeathers sorry for the delay on this -- I forgot to push it up !

Following offline discussion, I've now reduced the scope of the behaviour, in favour of simpler code that only supports the exact current need (e.g. it no longer catches "[1:1]" as an empty slice case)

However, I've also rather increased the complexity of the testing.
Again sorry, but I did spot some existing problems!
e.g. the previous code could not handle a single index key, because Python passes that as a bare object + not a 1-tuple.

As usual, it's tough to be comprehensive because numpy has so many indexing styles.
So, I'm taking it for granted that Dask will only use slices and integer indices
This excludes (e.g.) :

  • keys which are arrays or lists (fancy indexing)
  • ellipsis
  • None / newaxis
  • boolean arrays

I'm just not considering those as possibilities.
Does this seem sound ??

@pp-mo
Copy link
Member Author

pp-mo commented Feb 18, 2020

Note: when we have agreed on an implementation here, we should also adopt it in iris-grib.

Copy link
Contributor

@trexfeathers trexfeathers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some comments @pp-mo, although as we discussed you will be abstracting this code to be outside the PPDataProxy specifically, so that it can also serve for GRIB files too

@pp-mo
Copy link
Member Author

pp-mo commented Feb 19, 2020

Thanks @trexfeathers
I think I have it all done now. Review away !

@trexfeathers
Copy link
Contributor

All looks great, great idea with the abstraction into util 👏

@trexfeathers trexfeathers merged commit 9b8f9cb into SciTools:v2.4.x Feb 19, 2020
trexfeathers added a commit that referenced this pull request Feb 21, 2020
* Stop PPDataProxy accessing the file when no data is needed. (#3659)

* Add 2.4 whatsnew into full whatsnew list.

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>
trexfeathers added a commit that referenced this pull request Mar 18, 2020
* Fixed tests since Numpy 1.18 deprecation of non-int num arguments for linspace. (#3655)

* remove redundant tests (#3650)

* Remove obsolete test. (#3662)

* Remove grib-specific test. (#3663)

* Removed ununused skipIf. (#3632)

* Remove test_grib_save_rules.py which has been moved to iris-grib (#3666)

* 2v4 mergeback picks (#3668)

* Stop PPDataProxy accessing the file when no data is needed. (#3659)

* Add 2.4 whatsnew into full whatsnew list.

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

* Remove uri callback test which is moved to iris-grib (#3665)

* Remove test_grib2 integration tests (#3664)

* Remove test_grib_save.py (#3669)

* Remove cube iter (#3656)

* Fixed asv project name to 'scitools-iris'. (#3660)

* Removed grib-specific test to iris-grib. (#3671)

* Removed iris.tests.integration.test_grib_load and related CML files. (#3670)

* Remove TestGribMessage (#3672)

* Switched use of datetime.weekday() to datetime.dayofwk. (#3687)

* New image hashes for mpl 3x2 (#3682)

* New image hash for iris.test.test_plot.TestSymbols.test_cloud_cover with matplotlib 3.2.0.

* Further images changes for mpl3x2.

* Yet more updated image results.

* Correct and improve dev-guide section on fixing graphics-tests. (#3683)

* Correct and improve dev-guide section on fixing graphics-tests.

* Review changes + general rethink.

* Reduce duplication between 'graphics-tests' and general 'tests' page.

* Update docs/iris/src/developers_guide/graphics_tests.rst

Co-Authored-By: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

* Update docs/iris/src/developers_guide/graphics_tests.rst

Co-Authored-By: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

* Update docs/iris/src/developers_guide/graphics_tests.rst

Co-Authored-By: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

* Update docs/iris/src/developers_guide/graphics_tests.rst

Co-Authored-By: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

* Update docs/iris/src/developers_guide/graphics_tests.rst

Co-Authored-By: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>
Co-authored-by: Bill Little <bill.james.little@gmail.com>
Co-authored-by: lbdreyer <lbdreyer@users.noreply.github.com>
Co-authored-by: Jon Seddon <17068361+jonseddon@users.noreply.github.com>
@trexfeathers trexfeathers linked an issue May 18, 2020 that may be closed by this pull request
tkknight pushed a commit to tkknight/iris that referenced this pull request Jun 29, 2020
* Stop PPDataProxy accessing the file when no data is needed. (SciTools#3659)

* Add 2.4 whatsnew into full whatsnew list.

Co-authored-by: Martin Yeo <40734014+trexfeathers@users.noreply.github.com>
@pp-mo pp-mo deleted the pp_proxy_emptyslice branch March 18, 2022 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants