Feature Request: Add --worker Flag for Partitioned, Hash-Based File Transfers
I'm going to do this myself in my Python API to increase throughput, but I thought I'd write a feature request for completeness. Feel free to close this feature request if it's not applicable. I may be able to implement this feature in rclone myself if it's something you're interested in.
Overview
I'd like to request a new feature that allows rclone to transfer only a portion of a server's content. This feature would enable users to run multiple rclone instances concurrently, with each instance responsible for a distinct subset of files. The goal is to facilitate distributed transfers and avoid duplicate work when syncing or copying large datasets.
Proposed Approach
Introduce a new flag, --worker, whose argument is formatted as worker_id:(n_workers-1). For example:
rclone copy ... --worker 0:1
rclone copy ... --worker 1:1
In the above example, two workers are deployed, and each will handle roughly 50% of the files.
How It Works
For each file to be transferred, rclone will calculate a hash based on the file's path (e.g., using MD5). Then, using the worker parameters, it determines if the current worker should process the file based on the following pseudocode:
worker_id = provided_worker_id        # this worker's id
n_workers = total_number_of_workers

for file in files_to_copy:
    # Hash the path and convert the hex digest to an integer so it can be bucketed.
    path_hash = int(md5(file.path).hexdigest(), 16)
    # The addition of worker_id helps in balancing the distribution.
    if (path_hash + worker_id) % n_workers == 0:
        transfer(file)
    else:
        skip(file)
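To illustrate that this scheme yields a true partition, here is a minimal self-contained sketch. The helper name `assigned_worker_partition` is hypothetical, chosen for this example; it only demonstrates the hashing logic described above.

```python
import hashlib

def assigned_worker_partition(paths, worker_id, n_workers):
    """Return the subset of paths this worker should transfer.

    Hypothetical helper illustrating the proposed --worker logic:
    each path is assigned to exactly one worker by hashing its path.
    """
    selected = []
    for path in paths:
        path_hash = int(hashlib.md5(path.encode("utf-8")).hexdigest(), 16)
        if (path_hash + worker_id) % n_workers == 0:
            selected.append(path)
    return selected

paths = [f"photos/img_{i:04d}.jpg" for i in range(1000)]
subsets = [assigned_worker_partition(paths, w, 4) for w in range(4)]

# Every path lands in exactly one subset: the union covers all files
# and the subsets are pairwise disjoint.
assert sorted(p for s in subsets for p in s) == sorted(paths)
```

Because `(path_hash + worker_id) % n_workers == 0` holds for exactly one worker_id in 0..n_workers-1, no file is transferred twice and none is skipped by all workers.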
zackees changed the title from "Feature Request: Add --worker Flag for Hash-Based Partitioned Transfers in rclone" to "Feature Request: Add --worker Flag for Path-Hash-Based Partitioned Transfers in rclone" on Feb 16, 2025.
This is a great idea. So great that I'm actually already in the middle of implementing it :-)
Here is the proposal I made - comments welcome
Hash Filter
This proposal describes a new flag --hash-filter which is used to make a deterministic selection of a random subset of files.
Uses include:
Running a big sync on multiple machines
Checking a subset of files for bitrot
The flag takes two parameters expressed as a fraction, so --hash-filter 1/3 for example. Here the 3 represents the total number of subsets of files and the 1 represents which subset to select. So --hash-filter 1/3, --hash-filter 2/3 and --hash-filter 3/3 will all select different non-overlapping subsets of files.
Note that rclone will still have to traverse all directories to select these files.
The first parameter can be replaced with @ to select a random subset of files. In the example above --hash-filter @/3 means rclone will substitute the @ for a random number between 1 and 3 inclusive. The @ will be chosen and remain constant throughout the life of that set of filters, so any retries that are needed will use the same value.
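The selection semantics described above can be sketched as follows. This is only an illustration of the proposal's behavior, not rclone's actual implementation; the function name `hash_filter` and the use of MD5 here are assumptions made for the example.

```python
import hashlib
import random

def hash_filter(paths, k, n):
    """Sketch of the proposed --hash-filter k/n selection.

    k may be "@" to pick a random subset: the substitution happens
    once and then stays constant for the life of the filter set.
    """
    if k == "@":
        k = random.randint(1, n)  # chosen once, constant thereafter
    for path in paths:
        h = int(hashlib.md5(path.encode("utf-8")).hexdigest(), 16)
        if h % n == k - 1:  # subsets 1..n are non-overlapping
            yield path

paths = [f"dir/file_{i}.bin" for i in range(300)]
subsets = [list(hash_filter(paths, k, 3)) for k in (1, 2, 3)]

# --hash-filter 1/3, 2/3 and 3/3 together select every file exactly once.
assert sorted(p for s in subsets for p in s) == sorted(paths)
```

Note that, as stated above, rclone still has to traverse every directory; the filter only decides which of the listed files each instance acts on.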