fix: #462 parallel uploads to the same blob storage #485


cmp0xff
Contributor

@cmp0xff cmp0xff commented Oct 15, 2024

Closes #462.

@cmp0xff
Contributor Author

cmp0xff commented Nov 5, 2024

May I ask someone for a review? Maybe @TomAugspurger @martindurant since you reviewed the last PRs? Thank you in advance.

@martindurant
Member

Is it possible to test this change, something that would fail without it?

@cmp0xff
Contributor Author

cmp0xff commented Nov 7, 2024

Hi @martindurant, thank you for the reply. In the original issue #462, we discovered that:

> There seems to be an issue when 2 instances of this file system write to the same blob storage from 2 different processes in parallel, where one of the uploads fails

I don't think it is easy to write a pytest test that reproduces this issue explicitly.

@martindurant
Member

I'm afraid I don't know anything about the function of these IDs... The problem was with assigning a constant ID to multiple uploads? I'm not sure I understand "who wins" in the case that two processes are writing at the same time.

@martindurant
Member

Please add

from __future__ import annotations

Someone should also update the CI to a newer version of python, I would say at least 3.10.

@cmp0xff cmp0xff force-pushed the hotfix/cmp0xff/462-parallel-uploads-to-the-same-blob branch from 31cc6ab to d83487a Compare November 11, 2024 22:13
@cmp0xff
Contributor Author

cmp0xff commented Nov 11, 2024

Hi @martindurant,

> I'm afraid I don't know anything about the function of these IDs... The problem was with assigning a constant ID to multiple uploads? I'm not sure I understand "who wins" in the case that two processes are writing at the same time.

The problem happens even when two different blocks are being written to the blob storage at the same time, because the original implementation takes into account only the number of blocks. This is the case we want to solve here.
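To illustrate the failure mode described above, here is a minimal sketch. It assumes that each writer derives a block ID purely from the number of blocks it has staged so far, and that both writers target the same blob. The names `count_based_id`, `writer_a`, and `writer_b` are hypothetical and are not taken from the adlfs code base.

```python
def count_based_id(staged_ids):
    """Hypothetical scheme: the ID depends only on how many blocks were staged."""
    return f"{len(staged_ids):07d}"


# Two independent writers (for example, two processes) staging blocks
# for the same blob. Each one keeps its own local list of staged IDs.
writer_a = []
writer_b = []
for _ in range(3):
    writer_a.append(count_based_id(writer_a))
    writer_b.append(count_based_id(writer_b))

# Both writers generated identical IDs, even though their data may differ,
# so their blocks clash when staged against the same blob.
shared_ids = sorted(set(writer_a) & set(writer_b))
print(shared_ids)
```

Because neither writer knows about the other's local list, the clash cannot be detected on the client side under this scheme.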

> Please add
>
> from __future__ import annotations
>
> Someone should also update the CI to a newer version of python, I would say at least 3.10.

d83487a should have fixed the pipeline for py38. Please run the pipeline again, thanks.

@cmp0xff cmp0xff changed the title fix: #462 parallel uploads to the same blob fix: #462 parallel uploads to the same blob storage Nov 11, 2024
@martindurant
Member

Rather than hashing the data (which might be expensive), would any random value do?
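A small sketch of the trade-off raised here, under the assumption that block IDs only need to be unique and of equal length within a blob. `hashed_id` and `random_id` are hypothetical helpers, not adlfs functions. A content hash has to read every byte of the chunk and maps identical chunks to the same ID, while a random UUID costs the same regardless of chunk size.

```python
import hashlib
import uuid


def hashed_id(chunk: bytes) -> str:
    """Hypothetical: ID derived from the chunk content (cost grows with size)."""
    return hashlib.sha256(chunk).hexdigest()[:32]


def random_id() -> str:
    """Hypothetical: ID independent of the content (fixed-length hex string)."""
    return uuid.uuid4().hex


chunk = b"\0" * 1024
same_hash = hashed_id(chunk) == hashed_id(chunk)  # identical chunks share an ID
distinct_random = random_id() != random_id()      # random IDs differ per call
print(same_hash, distinct_random, len(random_id()))
```

Whether the merged change uses exactly this call is not stated in the thread; the sketch only shows why a random value would be enough to keep concurrent writers apart.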

@cmp0xff cmp0xff force-pushed the hotfix/cmp0xff/462-parallel-uploads-to-the-same-blob branch from d83487a to da89523 Compare November 13, 2024 08:51
@martindurant martindurant merged commit e6cfe24 into fsspec:main Nov 13, 2024
4 checks passed
@cmp0xff cmp0xff deleted the hotfix/cmp0xff/462-parallel-uploads-to-the-same-blob branch November 13, 2024 16:15
Development

Successfully merging this pull request may close these issues.

Issue with parallel uploads to the same blob