
Try using chunks to speed up data access #127

Closed · nicholas512 opened this issue Nov 1, 2022 · 5 comments · Fixed by #162
@nicholas512 (Contributor):
Reading data from the netCDF files takes forever and is probably what makes the interpolate step slow.

I believe the access pattern (a long time series at a single point) is extremely non-contiguous relative to the shape of the array (p. 11 of this report).

For ERA5 at least, the downloaded data appears to be in NETCDF3 format and is not chunked.
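One quick way to check this (a sketch only; the file name is hypothetical) is ncdump's -s option, which prints the hidden "special" attributes, including the file format and any per-variable chunk sizes:

```bash
ncdump -hs era5_pl_201011.nc | grep -E "_Format|_Storage|_ChunkSizes"
# A NETCDF3 file reports _Format = "classic" (or "64-bit offset") and has
# no _Storage/_ChunkSizes attributes, i.e. the data is stored contiguously.
```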

It may be possible to speed up access by using chunking:

Make copies of the data using nccopy, converting the files to netCDF4 and adding chunking. For ERA5, try chunk sizes:

time 1y, lon 1, lat 1, level (all)
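As a sketch, that suggestion maps onto nccopy's -c chunk specification roughly as follows (the 8760 hourly steps per year and 37 pressure levels are assumed values, not read from any actual file):

```bash
# netCDF-4 classic output (-k 4), one chunk per grid cell:
# a full year of times and all levels, 1x1 in lat/lon
nccopy -k 4 -c "time/8760,level/37,latitude/1,longitude/1" input.nc output.nc
```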
@nicholas512 (Contributor, Author):
  • ERA5 comes in one-month files (by default), so the time chunk is limited to one month (744 hourly steps)
  • use a big chunk cache; try 100 MB (see the sketch below)
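A minimal sketch of setting the cache from Python, assuming a recent netCDF4-python, which exposes the HDF5 chunk cache via set_chunk_cache / set_var_chunk_cache (the file and variable names here are hypothetical):

```python
import netCDF4

# Default chunk cache for files opened after this call: 100 MB
netCDF4.set_chunk_cache(100 * 1024 * 1024)

nc = netCDF4.Dataset("./era5_chunked/era5_pl_201011.nc")  # hypothetical file
var = nc.variables["t"]                                   # hypothetical variable
var.set_var_chunk_cache(size=100 * 1024 * 1024)           # per-variable cache
series = var[:, :, 20, 21]  # long time series at a single grid cell
nc.close()
```

nccopy's -h flag (used in the commands below) sets the same cache for the copy operation itself.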

@nicholas512 (Contributor, Author):
```bash
nccopy -k 4 -d 0 -w -h 100M -c "longitude/4,latitude/4,level/15,time/744" input.nc output.nc
```

This sped up access by roughly 10x to 45x when reading all levels and times from the pressure-level (pl) files, depending on how many lat/lon cells were included in the slice.

nicholas512 self-assigned this Nov 9, 2022
@nicholas512 (Contributor, Author):
Script for ERA5 pressure-level files:

```bash
#!/bin/bash
# Re-chunk each monthly ERA5 pressure-level file for fast time-series access.
for fullpath in ./era5/era5_pl_*; do

  f=$(basename "$fullpath")
  newpath="./era5_chunked/$f"

  if [[ -f "$newpath" ]]; then
    echo "$f already chunked"
  else
    # Read the time and level dimension sizes from the file header
    NTIME=$(ncdump -h "$fullpath" | grep "time = .*" | grep -oEi "[0-9]+")
    NLEV=$(ncdump -h "$fullpath" | grep "level = .*" | grep -oEi "[0-9]+")
    echo "Chunking $f with $NTIME times and $NLEV levels"
    # -k 4: netCDF-4 classic model; -d 0: no compression;
    # -w: build output diskless, writing on close; -h 100M: 100 MB chunk cache
    nccopy -k 4 -d 0 -w -h 100M \
      -c "longitude/4,latitude/4,level/$NLEV,time/$NTIME" \
      "$fullpath" "$newpath"
  fi

done
```

Speedup with MFDataset, reading [:, :, slice(20, 21), slice(20, 21)] (a single lat/lon cell):

  • original: 48.4 s
  • chunked: 0.79 s
  • ~60x faster

Speedup with MFDataset, reading [:, :, slice(0, 5), slice(0, 5)] (5x5 lat/lon cells):

  • original: 10.1 s
  • chunked: 1.55 s
  • ~6.5x faster

Speedup with MFDataset, reading [:, :, slice(5, 15), slice(5, 15)] (10x10 lat/lon cells):

  • original: 28.45 s
  • chunked: 4.30 s
  • ~6.6x faster
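For reference, a hedged sketch of how such a comparison can be timed (the paths, glob pattern, and variable name "t" are assumptions, not taken from the runs above); MFDataset aggregates the monthly files along the time dimension:

```python
import time
from netCDF4 import MFDataset

def time_read(pattern, idx, varname="t"):
    """Open a set of monthly files and time one multi-file read."""
    nc = MFDataset(pattern)
    t0 = time.perf_counter()
    _ = nc.variables[varname][idx]
    elapsed = time.perf_counter() - t0
    nc.close()
    return elapsed

idx = (slice(None), slice(None), slice(20, 21), slice(20, 21))  # one cell
orig = time_read("./era5/era5_pl_*.nc", idx)
fast = time_read("./era5_chunked/era5_pl_*.nc", idx)
print(f"original: {orig:.2f} s, chunked: {fast:.2f} s, {orig / fast:.1f}x faster")
```

Note that MFDataset only reads NETCDF3 and NETCDF4_CLASSIC files, which is why the copies above use nccopy -k 4 (netCDF-4 classic model).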

@nicholas512 (Contributor, Author):
A more generic script:

```bash
#!/bin/bash
DIRECTORY="era5"
LEV_VAR="level"
TIME_VAR="time"

# Process each .nc file found
for fullpath in ./"$DIRECTORY"/*.nc; do

  f=$(basename "$fullpath")
  newpath="./${DIRECTORY}_chunked/$f"

  if [[ -f "$newpath" ]]; then
    echo "$f already chunked"
  else
    # Read dimension sizes from the file header
    NTIME=$(ncdump -h "$fullpath" | grep "${TIME_VAR} = .*" | grep -oEi "[0-9]+")
    NLEV=$(ncdump -h "$fullpath" | grep "${LEV_VAR} = .*" | grep -oEi "[0-9]+")

    if [ -n "$NLEV" ]; then  # the file has a level dimension
      echo "Chunking $f with $NTIME times and $NLEV levels"
      nccopy -k 4 -d 0 -w -h 100M \
        -c "longitude/4,latitude/4,${LEV_VAR}/${NLEV},${TIME_VAR}/${NTIME}" \
        "$fullpath" "$newpath"
    else
      echo "Chunking $f with $NTIME times"
      nccopy -k 4 -d 0 -w -h 100M \
        -c "longitude/4,latitude/4,${TIME_VAR}/${NTIME}" \
        "$fullpath" "$newpath"
    fi
  fi

done
```
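To confirm the copies carry the requested chunking, something like this can be used (a sketch; the file name is hypothetical):

```python
from netCDF4 import Dataset

nc = Dataset("./era5_chunked/era5_pl_201011.nc")  # hypothetical file
for name, var in nc.variables.items():
    if var.ndim == 4:
        # Returns the chunk shape, e.g. [744, 15, 4, 4], or "contiguous"
        print(name, var.chunking())
nc.close()
```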

@nicholas512 (Contributor, Author):
#162
