copy from one filesystem to the other #909
I just found this method:

```python
import fsspec
from tqdm import tqdm

a = fsspec.get_mapper("https://some/folder")
b = fsspec.get_mapper("hdfs://root/some/other/folder")

for k in tqdm(a):
    b[k] = a[k]
```

Working well! The same with threads:

```python
from multiprocessing.pool import ThreadPool
from tqdm import tqdm
import fsspec

a = fsspec.get_mapper("https://some/folder")
b = fsspec.get_mapper("hdfs://root/some/other/folder")

def f(k):
    b[k] = a[k]

with ThreadPool(32) as p:
    keys = list(a.keys())
    for _ in tqdm(p.imap_unordered(f, keys), total=len(keys)):
        pass
```

This can be faster, depending on the filesystems.
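The mapper recipe above can be tried end-to-end without any remote services by using fsspec's built-in `memory://` filesystem. A minimal sketch, in which the memory protocol simply stands in for the `https://` and `hdfs://` URLs:

```python
import fsspec

# Stand-ins for the two remote filesystems in the snippet above
a = fsspec.get_mapper("memory://src")
b = fsspec.get_mapper("memory://dst")

a["f.txt"] = b"hello"  # seed the "source" filesystem

for k in a:            # same loop as above: key-by-key copy
    b[k] = a[k]        # each value is the entire file as a bytes object
```

Note that `a[k]` materialises the whole file as bytes, which is what the maintainer's caveat below about "every file fits in memory" refers to.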
I would still be curious about any other ways to interact between two filesystems with fsspec.
That's an elegant way to do it that would not have occurred to me, although I suppose I should add that working with the mappers assumes that every file fits in memory, and will iterate through the files serially. Closing this as a duplicate.
Does that mean the 'copy' operation will first load the source file into memory and then write the data to the destination?
If you use the mappers, then indeed whole files are passed in memory. The filesystems' copy() method reads and writes a chunk at a time.
Yes, it looks like the 'PyArrowHDFS' filesystem uses shutil.copyfileobj to load a chunk from the source file into a buffer and then write it to the destination file.
Is it possible to use the copy operation provided by the HDFS client? That might reduce the cost of copying. BTW, is there any interface other than 'mapper' which can be used like a file system?
See also the copy function in the generic filesystem, specially designed for inter-filesystem copy: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.generic.GenericFileSystem
This is what you will end up doing. You can use shutil directly if you like, or manually like:
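The manual snippet appears to have been lost from this page, but the shutil-based pattern presumably looks something like the following sketch (not the exact code from the thread; `memory://` stands in for the real remote URLs):

```python
import shutil
import fsspec

# Seed a "source" file
with fsspec.open("memory://src/a.bin", "wb") as f:
    f.write(b"payload")

# Open source and destination on their respective filesystems and let
# shutil.copyfileobj stream the data one chunk at a time, so memory use
# stays constant regardless of file size
with fsspec.open("memory://src/a.bin", "rb") as src, \
     fsspec.open("memory://dst/a.bin", "wb") as dst:
    shutil.copyfileobj(src, dst)
```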
You mean the filesystem instance itself, perhaps? Also, there is universal_pathlib and other packages built on top of fsspec, if you want them.
Thanks for your reply.
Is there any way to do an intra-filesystem copy without loading data into memory using fsspec? For example, copying a source file on HDFS to a destination also on HDFS.
Most filesystems implement a copy which doesn't need reading into memory. The title of this issue is about copies between filesystems.
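For the intra-filesystem case, a minimal sketch using the filesystem's own copy(), with the `memory` filesystem standing in for HDFS (whether the copy happens server-side or is streamed in chunks by the client depends on the backend):

```python
import fsspec

fs = fsspec.filesystem("memory")   # stand-in for e.g. hdfs
fs.pipe_file("/data/in.bin", b"abc")

# Copy within one filesystem; the file is not round-tripped through
# Python memory as a single whole-file bytes object
fs.copy("/data/in.bin", "/data/out.bin")
```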
@hangweiqiang-uestc, you probably wanted the HadoopFileSystem rather than PyArrowHDFS. We are due to switch from the old to the new when we get to it.
^ protocol would be "arrow_hdfs" |
Thanks for your code sample. This works. I have tried using fsspec.generic.GenericFileSystem, but I just can't make it work.
Perhaps you'd like to raise a new issue showing what you tried and how it failed? GenericFileSystem is still new and experimental, I'm sure we can fix it. |
Hi,
Thanks for creating this lib, it's really convenient and makes code using multiple file systems clean!
Is there a way to copy a file (or even a folder?) from one filesystem to the other using fsspec natively?
It's of course possible to implement it by using ls/walk and copying to local then from local to the other fs, but I'm wondering if there's a native way to do it.