Feature Request: Add --worker Flag for Path-Hash-Based Partitioned Transfers in rclone #8400

Open
zackees opened this issue Feb 16, 2025 · 1 comment

Comments

Contributor

zackees commented Feb 16, 2025

Feature Request: Add --worker Flag for Partitioned, Hash-Based File Transfers

I'm going to do this myself in my Python API to increase throughput, but I thought I'd write a feature request for completeness. Feel free to close it if it is not applicable. I may be able to implement this feature in rclone myself if it is something you are interested in.

Overview

I'd like to request a new feature that allows rclone to transfer only a portion of a server's content. This feature would enable users to run multiple rclone instances concurrently, with each instance responsible for a distinct subset of files. The goal is to facilitate distributed transfers and avoid duplicate work when syncing or copying large datasets.

Proposed Approach

Introduce a new flag, --worker, where the argument is formatted as worker_id:(n_workers-1). For example:

  • rclone copy ... --worker 0:1
  • rclone copy ... --worker 1:1

In the above example, two workers are deployed, and each will handle roughly 50% of the files.
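
As a rough illustration of the proposed argument format, here is a minimal Python sketch that turns a `worker_id:(n_workers-1)` string into a worker id and a worker count. The function name `parse_worker_spec` is hypothetical and the validation rules are assumptions, not part of rclone:

```python
def parse_worker_spec(spec: str) -> tuple[int, int]:
    """Parse 'worker_id:(n_workers-1)' into (worker_id, n_workers).

    Hypothetical helper for illustration only; not an rclone API.
    """
    worker_str, last_index_str = spec.split(":", 1)
    worker_id = int(worker_str)
    # The second field is the highest worker index, i.e. n_workers - 1.
    n_workers = int(last_index_str) + 1
    if not 0 <= worker_id < n_workers:
        raise ValueError(
            f"worker id {worker_id} is out of range for {n_workers} workers"
        )
    return worker_id, n_workers


print(parse_worker_spec("0:1"))  # (0, 2)
print(parse_worker_spec("1:1"))  # (1, 2)
```

With this reading, `--worker 0:1` and `--worker 1:1` both describe a pool of two workers.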

How It Works

For each file to be transferred, rclone will calculate a hash based on the file's path (e.g., using MD5). Then, using the worker parameters, it determines if the current worker should process the file based on the following pseudocode:

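worker_id = [provided worker id]
n_workers = [total number of workers]  # second field of --worker, plus one

for each file in files_to_copy:
    md5_hash = int(md5(file.path))  # interpret the digest as an integer
    # For a given hash, exactly one worker_id in 0..n_workers-1 passes this
    # test, so every file is handled by exactly one worker
    if (md5_hash + worker_id) % n_workers == 0:
        transfer(file)
    else:
        skip(file)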
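
The pseudocode above could be made runnable roughly as follows. This is only a sketch in Python using `hashlib`; the function name `should_transfer`, the UTF-8 encoding of the path, and the big-endian reading of the digest are all assumptions chosen for illustration:

```python
import hashlib


def should_transfer(path: str, worker_id: int, n_workers: int) -> bool:
    """Return True if this worker is responsible for the given path.

    Hypothetical helper; mirrors the (hash + worker_id) % n_workers == 0 test.
    """
    digest = hashlib.md5(path.encode("utf-8")).digest()
    md5_hash = int.from_bytes(digest, "big")
    return (md5_hash + worker_id) % n_workers == 0


# Check that every path is claimed by exactly one of the workers.
n_workers = 2
paths = [f"data/file_{i}.bin" for i in range(1000)]
for p in paths:
    claims = [w for w in range(n_workers) if should_transfer(p, w, n_workers)]
    assert len(claims) == 1

per_worker = [sum(should_transfer(p, w, n_workers) for p in paths)
              for w in range(n_workers)]
print(per_worker)  # two counts that add up to 1000
```

Because the decision depends only on the path and the worker parameters, the workers need no coordination with each other, but each one still has to list every file to decide what to skip.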
zackees changed the title from "Feature Request: Add --worker Flag for Hash-Based Partitioned Transfers in rclone" to "Feature Request: Add --worker Flag for Path-Hash-Based Partitioned Transfers in rclone" on Feb 16, 2025

Member

ncw commented Feb 17, 2025

This is a great idea. So great that I'm actually already in the middle of implementing it :-)

Here is the proposal I made - comments welcome

Hash Filter

This proposal describes a new flag --hash-filter which is used to make a deterministic selection of a random subset of files.

Uses include:

  1. Running a big sync on multiple machines
  2. Checking a subset of files for bitrot

The flag takes two parameters expressed as a fraction, so --hash-filter 1/3 for example. Here the 3 represents the total number of subsets of files and the 1 represents which subset to select. So --hash-filter 1/3, --hash-filter 2/3 and --hash-filter 3/3 will all select different non-overlapping subsets of files.

Note that rclone will still have to traverse all directories to select these files.
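
To make the K/N selection concrete, here is a small Python sketch of one way such a filter could behave. It is not rclone's implementation: the proposal text does not say which hash is used or how hash values map to subsets, so the MD5-of-path choice, the modulo mapping, and the names `parse_hash_filter` and `in_subset` are assumptions:

```python
import hashlib


def parse_hash_filter(spec: str) -> tuple[int, int]:
    """Parse 'K/N' into (k, n), with subsets numbered 1..n."""
    k_str, n_str = spec.split("/", 1)
    k, n = int(k_str), int(n_str)
    if n < 1 or not 1 <= k <= n:
        raise ValueError(f"invalid hash filter {spec!r}")
    return k, n


def in_subset(path: str, k: int, n: int) -> bool:
    """Return True if path falls into subset k of n (assumed mapping)."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    h = int.from_bytes(digest, "big")
    # h % n is in 0..n-1; shift it to 1..n to match the K/N numbering.
    return h % n + 1 == k


# Every path is visited (the full traversal still happens); the filter
# only decides which subset each path belongs to.
paths = [f"photos/img_{i}.jpg" for i in range(300)]
subsets = {spec: [p for p in paths if in_subset(p, *parse_hash_filter(spec))]
           for spec in ("1/3", "2/3", "3/3")}
print({spec: len(files) for spec, files in subsets.items()})
```

In this sketch the three subsets are disjoint and together cover every path, which is the property the proposal describes for `1/3`, `2/3` and `3/3`.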

The first parameter can be replaced with @ to select a random subset of files. In the example above, --hash-filter @/3 means rclone will replace the @ with a random number between 1 and 3 inclusive. The value is chosen once and remains constant throughout the life of that set of filters, so any retries that are needed will use the same value.
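
The "choose once, then keep it" behaviour of @ might look like the following Python sketch. The function name `resolve_hash_filter` and the use of Python's `random` module are assumptions for illustration; how rclone picks the number is not specified in the proposal:

```python
import random
from typing import Optional


def resolve_hash_filter(spec: str, rng: Optional[random.Random] = None) -> str:
    """Replace a leading '@' in 'K/N' with a random K in 1..N (inclusive)."""
    k_str, n_str = spec.split("/", 1)
    n = int(n_str)
    if k_str == "@":
        if rng is None:
            rng = random.Random()
        k_str = str(rng.randint(1, n))  # randint includes both endpoints
    return f"{k_str}/{n}"


# Resolve once up front, then reuse the same value for every retry.
resolved = resolve_hash_filter("@/3")
for attempt in range(3):
    print(f"attempt {attempt}: using --hash-filter {resolved}")
```

Resolving the value before the retry loop, rather than inside it, is what keeps every retry working on the same subset of files.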
