
Try using chunks to speed up data access #127

Closed · nicholas512 opened this issue Nov 1, 2022 · 5 comments · Fixed by #162
@nicholas512 (Contributor):
Reading data from the netCDF files takes forever and is probably what makes the interpolate step slow.

I believe the access pattern (a long time series at a single point) is extremely non-contiguous relative to the shape of the array (p. 11 of this report).

For ERA5 at least, the downloaded data appears to be in NETCDF3 format and is not chunked.
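One quick way to check this (a sketch only; the file name is hypothetical) is ncdump's -s option, which prints the hidden "special" attributes, including the file format and any per-variable chunk sizes:

```bash
ncdump -hs era5_pl_201011.nc | grep -E "_Format|_Storage|_ChunkSizes"
# A NETCDF3 file reports _Format = "classic" (or "64-bit offset") and has
# no _Storage/_ChunkSizes attributes, i.e. the data is stored contiguously.
```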

It may be possible to speed up access by using chunking:

Make copies of the data using nccopy, converting the files to netCDF4 and adding chunking. For ERA5, try chunk sizes:

time 1y, lon 1, lat 1, level (all)
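As a sketch, that suggestion maps onto nccopy's -c chunk specification roughly as follows (the 8760 hourly steps per year and 37 pressure levels are assumed values, not read from any actual file):

```bash
# netCDF-4 classic output (-k 4), one chunk per grid cell:
# a full year of times and all levels, 1x1 in lat/lon
nccopy -k 4 -c "time/8760,level/37,latitude/1,longitude/1" input.nc output.nc
```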
@nicholas512 (Contributor, Author):
  • ERA5 comes in one-month files (by default), so the time chunk is limited to one month (744 hourly steps)
  • use a big chunk cache; try 100 MB (see the sketch below)
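A minimal sketch of setting the cache from Python, assuming a recent netCDF4-python, which exposes the HDF5 chunk cache via set_chunk_cache / set_var_chunk_cache (the file and variable names here are hypothetical):

```python
import netCDF4

# Default chunk cache for files opened after this call: 100 MB
netCDF4.set_chunk_cache(100 * 1024 * 1024)

nc = netCDF4.Dataset("./era5_chunked/era5_pl_201011.nc")  # hypothetical file
var = nc.variables["t"]                                   # hypothetical variable
var.set_var_chunk_cache(size=100 * 1024 * 1024)           # per-variable cache
series = var[:, :, 20, 21]  # long time series at a single grid cell
nc.close()
```

nccopy's -h flag (used in the commands below) sets the same cache for the copy operation itself.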

@nicholas512 (Contributor, Author):
```bash
nccopy -k 4 -d 0 -w -h 100M -c "longitude/4,latitude/4,level/15,time/744" input.nc output.nc
```

This sped up access by roughly 10x to 45x when reading all levels and times from the pressure-level (pl) files, depending on how many lat/lon cells were included in the slice.

nicholas512 self-assigned this Nov 9, 2022
@nicholas512 (Contributor, Author):
Script for ERA5 pressure-level files:

```bash
#!/bin/bash
# Re-chunk each monthly ERA5 pressure-level file for fast time-series access.
for fullpath in ./era5/era5_pl_*; do

  f=$(basename "$fullpath")
  newpath="./era5_chunked/$f"

  if [[ -f "$newpath" ]]; then
    echo "$f already chunked"
  else
    # Read the time and level dimension sizes from the file header
    NTIME=$(ncdump -h "$fullpath" | grep "time = .*" | grep -oEi "[0-9]+")
    NLEV=$(ncdump -h "$fullpath" | grep "level = .*" | grep -oEi "[0-9]+")
    echo "Chunking $f with $NTIME times and $NLEV levels"
    # -k 4: netCDF-4 classic model; -d 0: no compression;
    # -w: build output diskless, writing on close; -h 100M: 100 MB chunk cache
    nccopy -k 4 -d 0 -w -h 100M \
      -c "longitude/4,latitude/4,level/$NLEV,time/$NTIME" \
      "$fullpath" "$newpath"
  fi

done
```

Speedup with MFDataset, reading [:, :, slice(20, 21), slice(20, 21)] (a single lat/lon cell):

  • original: 48.4 s
  • chunked: 0.79 s
  • ~60x faster

Speedup with MFDataset, reading [:, :, slice(0, 5), slice(0, 5)] (5x5 lat/lon cells):

  • original: 10.1 s
  • chunked: 1.55 s
  • ~6.5x faster

Speedup with MFDataset, reading [:, :, slice(5, 15), slice(5, 15)] (10x10 lat/lon cells):

  • original: 28.45 s
  • chunked: 4.30 s
  • ~6.6x faster
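For reference, a hedged sketch of how such a comparison can be timed (the paths, glob pattern, and variable name "t" are assumptions, not taken from the runs above); MFDataset aggregates the monthly files along the time dimension:

```python
import time
from netCDF4 import MFDataset

def time_read(pattern, idx, varname="t"):
    """Open a set of monthly files and time one multi-file read."""
    nc = MFDataset(pattern)
    t0 = time.perf_counter()
    _ = nc.variables[varname][idx]
    elapsed = time.perf_counter() - t0
    nc.close()
    return elapsed

idx = (slice(None), slice(None), slice(20, 21), slice(20, 21))  # one cell
orig = time_read("./era5/era5_pl_*.nc", idx)
fast = time_read("./era5_chunked/era5_pl_*.nc", idx)
print(f"original: {orig:.2f} s, chunked: {fast:.2f} s, {orig / fast:.1f}x faster")
```

Note that MFDataset only reads NETCDF3 and NETCDF4_CLASSIC files, which is why the copies above use nccopy -k 4 (netCDF-4 classic model).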

@nicholas512 (Contributor, Author):
A more generic script:

```bash
#!/bin/bash
DIRECTORY="era5"
LEV_VAR="level"
TIME_VAR="time"

# Process each .nc file found
for fullpath in ./"$DIRECTORY"/*.nc; do

  f=$(basename "$fullpath")
  newpath="./${DIRECTORY}_chunked/$f"

  if [[ -f "$newpath" ]]; then
    echo "$f already chunked"
  else
    # Read dimension sizes from the file header
    NTIME=$(ncdump -h "$fullpath" | grep "${TIME_VAR} = .*" | grep -oEi "[0-9]+")
    NLEV=$(ncdump -h "$fullpath" | grep "${LEV_VAR} = .*" | grep -oEi "[0-9]+")

    if [ -n "$NLEV" ]; then  # the file has a level dimension
      echo "Chunking $f with $NTIME times and $NLEV levels"
      nccopy -k 4 -d 0 -w -h 100M \
        -c "longitude/4,latitude/4,${LEV_VAR}/${NLEV},${TIME_VAR}/${NTIME}" \
        "$fullpath" "$newpath"
    else
      echo "Chunking $f with $NTIME times"
      nccopy -k 4 -d 0 -w -h 100M \
        -c "longitude/4,latitude/4,${TIME_VAR}/${NTIME}" \
        "$fullpath" "$newpath"
    fi
  fi

done
```
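To confirm the copies carry the requested chunking, something like this can be used (a sketch; the file name is hypothetical):

```python
from netCDF4 import Dataset

nc = Dataset("./era5_chunked/era5_pl_201011.nc")  # hypothetical file
for name, var in nc.variables.items():
    if var.ndim == 4:
        # Returns the chunk shape, e.g. [744, 15, 4, 4], or "contiguous"
        print(name, var.chunking())
nc.close()
```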

@nicholas512 (Contributor, Author):
#162
