quickcdc-cuda is a fast content-defined chunker for &[u8] slices with CUDA acceleration.
- For background, see AE: An Asymmetric Extremum Content Defined Chunking Algorithm by Yucheng Zhang et al.
- Modifications:
- Users may provide a salt, introducing cutpoint variation (files re-processed with different salt values produce different cutpoints).
- Warp forward (reduced window size): skips unnecessary processing before the minimum chunk size is reached.
- CUDA acceleration for parallel processing of large data sets.
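The core AE cutpoint rule can be sketched as follows. This is a simplified illustration of the idea from the paper, not this crate's actual implementation (which adds salting, warp forward, and CUDA): a position becomes a cut point when a local extremum stays maximal for a full window of subsequent bytes.

```rust
/// Simplified AE (Asymmetric Extremum) cutpoint sketch: a cut is placed
/// `window` bytes after a byte value that remains the running maximum.
fn ae_cutpoints(data: &[u8], window: usize) -> Vec<usize> {
    let mut cuts = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let mut max_pos = start; // position of the current extremum
        let mut cut = None;
        let mut i = start + 1;
        while i < data.len() {
            if data[i] > data[max_pos] {
                max_pos = i; // new extremum; restart the window
            } else if i - max_pos >= window {
                cut = Some(i); // extremum survived a full window: cut here
                break;
            }
            i += 1;
        }
        match cut {
            Some(c) => {
                cuts.push(c);
                start = c; // continue scanning after the cut
            }
            None => break, // tail shorter than a window: no further cuts
        }
    }
    cuts
}

fn main() {
    // The extremum 5 at index 0 survives 3 bytes (cut at 3); the
    // extremum 9 at index 5 survives 3 bytes (cut at 8).
    let cuts = ae_cutpoints(&[5, 1, 1, 1, 1, 9, 1, 1, 1, 1], 3);
    println!("{:?}", cuts); // [3, 8]
}
```

Because the decision depends only on a small window of local byte values, the rule needs no backward scan on a cut, which is what makes it amenable to the parallel processing described below.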
This implementation leverages CUDA for parallel processing, which can significantly improve performance on systems with NVIDIA GPUs. A CPU fallback implementation is provided for systems without CUDA support.
- Original quickcdc: James Howard jrobhoward@gmail.com
- CUDA Implementation: Sayantan Das sdas.codes@gmail.com
The CUDA-accelerated version can deliver significant speedups over the CPU-only version, especially for large datasets; actual performance depends on your GPU hardware.
In our testing, the CUDA implementation showed:
- 2-5x speedup for files larger than 100MB
- Best performance with chunk sizes between 64KB and 256KB
- Diminishing returns for very small files due to GPU data transfer overhead
- Rust 2021 edition or later
- CUDA toolkit (for CUDA acceleration)
- NVIDIA GPU with CUDA support (for CUDA acceleration)
- Install the CUDA toolkit from NVIDIA's website: https://developer.nvidia.com/cuda-downloads
- Make sure the CUDA toolkit is in your PATH
- Set the CUDA_PATH environment variable to your CUDA installation directory
For Ubuntu/Debian:

```shell
sudo apt-get install nvidia-cuda-toolkit
# NVIDIA's own installer uses /usr/local/cuda; the Debian/Ubuntu
# package installs under /usr, so adjust CUDA_PATH accordingly
export CUDA_PATH=/usr/local/cuda
```
For macOS:
NVIDIA no longer provides CUDA toolkits for macOS (support ended with CUDA 10.2), and recent Macs do not ship with NVIDIA GPUs. Use the CPU fallback on macOS.
For Windows:

```shell
# Install the CUDA toolkit from the NVIDIA website
set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.x
```
```shell
# Clone the repository
git clone https://github.com/yourusername/quickcdc-cuda.git
cd quickcdc-cuda

# Build the project
cargo build --release
```
```rust
use quickcdc_cuda;
use rand::Rng;

// Initialize CUDA (only needed once per application)
quickcdc_cuda::Chunker::init_cuda().unwrap();

let mut rng = rand::thread_rng();
let mut sample = [0u8; 1024];
rng.fill(&mut sample[..]);

let target_size = 64;
let max_chunksize = 128;
let salt = 15222894464462204665;

// Use CUDA-accelerated version
let chunker = quickcdc_cuda::Chunker::with_cuda(&sample[..], target_size, max_chunksize, salt).unwrap();
for x in chunker {
    println!("{}", x.len());
}

// Or use CPU version
let chunker = quickcdc_cuda::Chunker::with_params(&sample[..], target_size, max_chunksize, salt).unwrap();
for x in chunker {
    println!("{}", x.len());
}
```
```rust
use quickcdc_cuda;
use std::fs::File;
use std::io::Read;

// Initialize CUDA
quickcdc_cuda::Chunker::init_cuda().unwrap();

// Read a file
let mut file = File::open("large_file.bin").unwrap();
let mut buffer = Vec::new();
file.read_to_end(&mut buffer).unwrap();

// Process with CUDA
let target_size = 128 * 1024; // 128KB target chunk size
let max_size = 512 * 1024; // 512KB maximum chunk size
let salt = quickcdc_cuda::Chunker::get_random_salt();
let chunker = quickcdc_cuda::Chunker::with_cuda(&buffer, target_size, max_size, salt).unwrap();

// Process chunks
for (i, chunk) in chunker.enumerate() {
    println!("Chunk {}: {} bytes", i, chunk.len());
    // Process chunk data...
}
```
The project includes a command-line example that can process directories of files:
```shell
# CPU version
cargo run --release --example chunkdir_cuda -- /path/to/directory

# CUDA version
cargo run --release --example chunkdir_cuda -- /path/to/directory --cuda
```
- For general purpose use, a target chunk size of 64KB-128KB works well
- For large files (>1GB), larger chunk sizes (256KB-512KB) may improve performance
- For small files (<10MB), smaller chunk sizes (16KB-32KB) may be more appropriate
- The CUDA implementation performs best with large datasets
- For small files, the CPU implementation may be faster due to GPU data transfer overhead
- If processing many small files, consider batching them together before processing
- The CUDA implementation requires additional memory for GPU buffers
- For very large files, ensure your GPU has sufficient memory
- If processing files larger than GPU memory, consider chunking the file first
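The batching tip above can be sketched with a small helper. This is a hypothetical function, not part of the crate's API: it concatenates many small inputs into one buffer so a single chunking pass can amortize the GPU transfer overhead, while recording per-input offsets so results can be mapped back to their source.

```rust
/// Hypothetical batching helper (not part of quickcdc-cuda):
/// concatenates many small inputs into one buffer for a single
/// chunking pass. `offsets[i]` is where input `i` begins in the
/// combined buffer, so chunk positions can be mapped back per input.
fn batch_inputs(inputs: &[Vec<u8>]) -> (Vec<u8>, Vec<usize>) {
    let mut combined = Vec::new();
    let mut offsets = Vec::with_capacity(inputs.len());
    for input in inputs {
        offsets.push(combined.len());
        combined.extend_from_slice(input);
    }
    (combined, offsets)
}

fn main() {
    let files = vec![vec![1u8, 2], vec![3, 4, 5], vec![6]];
    let (combined, offsets) = batch_inputs(&files);
    println!("{:?} {:?}", combined, offsets); // [1, 2, 3, 4, 5, 6] [0, 2, 5]
}
```

Note that concatenation changes cutpoints near file joins, so this trade-off suits workloads where throughput matters more than exact per-file boundaries.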
The CUDA implementation parallelizes the chunking process by:
- Transferring the input data to the GPU
- Running a CUDA kernel that identifies potential chunk boundaries in parallel
- Collecting the results and sorting them to ensure correct order
- Iterating through the pre-computed boundaries when chunks are requested
This approach is particularly effective for large files where the overhead of GPU data transfer is outweighed by the parallel processing benefits.
The CUDA kernel divides the input data into blocks and processes them in parallel:
- Each thread examines a window of bytes to find potential chunk boundaries
- The kernel uses atomic operations to collect the results
- The host code sorts the boundaries to ensure correct ordering
- The chunker iterator uses these pre-computed boundaries to yield chunks
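The block-parallel strategy above can be mimicked on the CPU with scoped threads. This is an illustrative sketch, not the actual CUDA kernel, and it uses a toy boundary predicate (byte == 0xFF) in place of the real boundary test:

```rust
use std::thread;

/// CPU sketch of the kernel strategy: split the input into blocks,
/// scan each block for candidate boundaries in parallel, then sort
/// the merged results on the host to restore global order.
/// The `byte == 0xFF` predicate is a stand-in for the real test.
fn parallel_boundaries(data: &[u8], num_blocks: usize) -> Vec<usize> {
    let block_len = (data.len() + num_blocks - 1) / num_blocks;
    let mut found: Vec<usize> = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(block_len.max(1))
            .enumerate()
            .map(|(b, block)| {
                s.spawn(move || {
                    let base = b * block_len; // global offset of this block
                    block
                        .iter()
                        .enumerate()
                        .filter(|&(_, &byte)| byte == 0xFF)
                        .map(|(i, _)| base + i)
                        .collect::<Vec<usize>>()
                })
            })
            .collect();
        // Merge per-block results (arrival order is nondeterministic
        // within a block's worth of results, hence the sort below).
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    found.sort_unstable(); // host-side sort restores global ordering
    found
}

fn main() {
    let data = [0u8, 0xFF, 7, 0xFF, 2, 0xFF];
    println!("{:?}", parallel_boundaries(&data, 2)); // [1, 3, 5]
}
```

The actual kernel replaces the per-block vectors with atomic appends into a shared result buffer, but the structure — independent block scans followed by a host-side sort — is the same.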
- Ensure CUDA toolkit is properly installed
- Check that your GPU supports CUDA
- Verify that CUDA_PATH environment variable is set correctly
- Make sure you have the correct CUDA toolkit version
- Check that your Rust toolchain supports the 2021 edition
- Ensure all dependencies are installed
- Try different chunk sizes
- Check GPU utilization with nvidia-smi
- For large files, ensure your GPU has sufficient memory
quickcdc-cuda is dual licensed under the MIT and Apache 2.0 licenses, the same licenses as the Rust compiler.