Slow when writing to Zarr #29

Closed
aolt opened this issue Nov 9, 2018 · 6 comments
Labels
question Further information is requested

aolt commented Nov 9, 2018

I am trying to convert an xarray dataset opened from multiple GRIB files into Zarr. Reading the files is relatively quick, but writing to Zarr is very slow. What I am doing:

import xarray as xr
chunks = {"time": 24, "latitude": 103, "longitude": 180}
ds = xr.open_mfdataset("{}/*/*/*_{}_*.grb".format(source, var), chunks=chunks, engine='cfgrib',
                       backend_kwargs={'indexpath': '/mnt/test/era5/grib_index/idx'}, concat_dim='time')
ds.to_zarr("mypath")

It writes about 1 MB every 10 seconds while using 100% CPU. There is plenty of spare performance on the disk, so I assume the bottleneck is in how the GRIB files are read.

>>> xarray.__version__
'0.11.0'

python -V
Python 3.7.0

pip list | grep cfgrib
cfgrib           0.9.4.1  

Each file is about 1.5 GB, and I have about 216 files to write as Zarr.
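
A minimal sketch to check where the time actually goes, assuming the dataset is dask-backed (Profiler and ProgressBar come from dask's bundled diagnostics):

from dask.diagnostics import Profiler, ProgressBar

# Record per-task timings while writing; if the cfgrib read tasks
# dominate the profile, the bottleneck is decoding GRIB, not writing Zarr.
with Profiler() as prof, ProgressBar():
    ds.to_zarr("mypath")

prof.visualize()  # renders a task timeline; requires bokeh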

alexamici (Contributor)

@aolt my first guess: 'indexpath': '/mnt/test/era5/grib_index/idx' writes all the indices to the same file, so you get no index persistence, and that typically kills performance.

You can use '{path}' in indexpath to give each file its own index in the open_mfdataset call. Something like:

>>> ds = xr.open_mfdataset("{}/*/*/*_{}_*.grb".format(source, var), chunks=chunks, engine='cfgrib',
...                        backend_kwargs={'indexpath': '/tmp/{path}.idx'}, concat_dim='time')
>>> ds.to_zarr("mypath")
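
With '{path}', cfgrib substitutes each GRIB file's own path into indexpath, so every file gets its own persistent index instead of all of them overwriting a single shared one.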

Also, chunking on latitude/longitude is not very useful with GRIB files in general; avoid it if you can, as in the sketch below.
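
For example, a sketch with chunking along time only, under the assumption that each GRIB message stores one complete 2D field, so slicing latitude/longitude forces the same message to be decoded once per spatial chunk:

chunks = {"time": 24}  # no latitude/longitude entries
ds = xr.open_mfdataset("{}/*/*/*_{}_*.grb".format(source, var), chunks=chunks, engine='cfgrib',
                       backend_kwargs={'indexpath': '/tmp/{path}.idx'}, concat_dim='time')
ds.to_zarr("mypath")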

alexamici added the question label Nov 9, 2018

aolt commented Nov 12, 2018

Thanks for the trick with 'indexpath'; it didn't help, though :(
Avoiding chunking on lat/lon does help, but I am still at 100% CPU load and about 20 MB/s write speed, with plenty of disk performance left.

alexamici (Contributor)

@aolt improving performance, especially when using dask, is on our to-do list, but not very high on it while we work on stabilising the API, sorry.

alexamici (Contributor)

I opened #33 for the general dask performance issue. I will leave this issue open for now, but my guess is that there is nothing specific to Zarr here.


aolt commented Nov 13, 2018

Thanks @alexamici, but I think this is specific to cfgrib: I get much higher performance reading NetCDF with xarray/scipy and writing to Zarr.
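
For reference, a sketch of the comparison (the NetCDF path here is hypothetical; the rest mirrors the GRIB pipeline above):

import xarray as xr

# Same pipeline, but reading NetCDF with the scipy engine instead of
# GRIB with cfgrib; this writes to Zarr far faster on the same disk.
ds = xr.open_mfdataset("/mnt/test/era5/netcdf/*.nc", engine='scipy',
                       chunks={"time": 24}, concat_dim='time')
ds.to_zarr("mypath_nc")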

alexamici commented Dec 5, 2019

Closing the issue, as it looks like a generic disk performance issue.

shahramn added a commit that referenced this issue Jul 5, 2024