Slow when writing to Zarr #29

Closed
aolt opened this issue Nov 9, 2018 · 6 comments
Labels
question Further information is requested

aolt commented Nov 9, 2018

I am trying to convert an xarray dataset opened from multiple GRIB files into Zarr. Reading the files is relatively quick, but writing to Zarr is very slow. What I am doing:

import xarray as xr
chunks = {"time": 24, "latitude": 103, "longitude": 180}
ds = xr.open_mfdataset("{}/*/*/*_{}_*.grb".format(source, var), chunks=chunks, engine='cfgrib',
                       backend_kwargs={'indexpath': '/mnt/test/era5/grib_index/idx'}, concat_dim='time')
ds.to_zarr("mypath")

It writes about 1 MB every 10 seconds while using 100% CPU. There is plenty of spare performance on the disk, so I assume the bottleneck is in how the GRIB files are read.

>>> xarray.__version__
'0.11.0'

python -V
Python 3.7.0

pip list | grep cfgrib
cfgrib           0.9.4.1  

Each file is about 1.5 GB, and I have about 216 files to write as Zarr.
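
A minimal sketch to check where the time actually goes, assuming the dataset is dask-backed (Profiler and ProgressBar come from dask's bundled diagnostics):

from dask.diagnostics import Profiler, ProgressBar

# Record per-task timings while writing; if the cfgrib read tasks
# dominate the profile, the bottleneck is decoding GRIB, not writing Zarr.
with Profiler() as prof, ProgressBar():
    ds.to_zarr("mypath")

prof.visualize()  # renders a task timeline; requires bokeh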

alexamici (Contributor)

@aolt my first guess: 'indexpath': '/mnt/test/era5/grib_index/idx' writes all the indices to the same file, so you get no index persistence, and that typically kills performance.

You can use '{path}' in indexpath to give each file its own index in the open_mfdataset call. Something like:

>>> ds = xr.open_mfdataset("{}/*/*/*_{}_*.grb".format(source, var), chunks=chunks, engine='cfgrib',
...                        backend_kwargs={'indexpath': '/tmp/{path}.idx'}, concat_dim='time')
>>> ds.to_zarr("mypath")
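
With '{path}', cfgrib substitutes each GRIB file's own path into indexpath, so every file gets its own persistent index instead of all of them overwriting a single shared one.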

Also, chunking on latitude/longitude is not very useful with GRIB files in general; avoid it if you can, as in the sketch below.
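
For example, a sketch with chunking along time only, under the assumption that each GRIB message stores one complete 2D field, so slicing latitude/longitude forces the same message to be decoded once per spatial chunk:

chunks = {"time": 24}  # no latitude/longitude entries
ds = xr.open_mfdataset("{}/*/*/*_{}_*.grb".format(source, var), chunks=chunks, engine='cfgrib',
                       backend_kwargs={'indexpath': '/tmp/{path}.idx'}, concat_dim='time')
ds.to_zarr("mypath")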

alexamici added the question label Nov 9, 2018

aolt commented Nov 12, 2018

Thanks for the trick with 'indexpath'; it didn't help, though :(
Avoiding chunking on lat/lon does help, but I am still at 100% CPU load and about 20 MB/s write speed, with plenty of disk performance left.

alexamici (Contributor)

@aolt improving performance, especially when using dask, is on our to-do list, but not very high on it while we work on stabilising the API, sorry.

alexamici (Contributor)

I opened #33 for the general dask performance issue. I will leave this issue open for now, but my guess is that there is nothing specific to Zarr here.


aolt commented Nov 13, 2018

Thanks @alexamici, but I think this is specific to cfgrib: I get much higher performance reading NetCDF with xarray/scipy and writing to Zarr.
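
For reference, a sketch of the comparison (the NetCDF path here is hypothetical; the rest mirrors the GRIB pipeline above):

import xarray as xr

# Same pipeline, but reading NetCDF with the scipy engine instead of
# GRIB with cfgrib; this writes to Zarr far faster on the same disk.
ds = xr.open_mfdataset("/mnt/test/era5/netcdf/*.nc", engine='scipy',
                       chunks={"time": 24}, concat_dim='time')
ds.to_zarr("mypath_nc")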

alexamici commented Dec 5, 2019

Closing the issue, as it looks like a generic disk performance issue.

shahramn added a commit that referenced this issue Jul 5, 2024