Update spec data in parallel #1743
Conversation
Using a queue and threads. The bulk of the logic is from the queue example given in https://docs.python.org/3/library/queue.html#queue.Queue.join. Usage of os.sched_getaffinity is explained in https://docs.python.org/3/library/os.html?highlight=os#os.cpu_count.

I manually verified the correctness of this by comparing bikeshed/spec-data/ of the current update with the one generated with this patch applied.

$ time bikeshed update # current
real 2m17.526s
user 0m21.730s
sys 0m1.682s

$ time bikeshed update # with this patch, ran on 28 cores
real 0m13.869s
user 0m32.152s
sys 0m3.807s

We can probably speed this up more, maybe with multiprocessing? But this looks good enough to me for now.

Fixed #1741.
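For illustration, here is a minimal sketch of the queue + worker-thread pattern this patch follows, adapted from the queue docs example linked above. The fetch callable and the function/variable names are illustrative placeholders, not the actual Bikeshed code.

import os
import queue
import threading

def updateInParallel(filePaths, fetch):
    # fetch: a callable that downloads one spec-data file (blocking HTTP call).
    updateQueue = queue.Queue()
    success = True

    def worker():
        nonlocal success
        while True:
            i, filePath = updateQueue.get()
            try:
                fetch(filePath)
            except Exception:
                success = False
            finally:
                updateQueue.task_done()

    # One worker thread per usable CPU, as in the patch.
    for _ in range(len(os.sched_getaffinity(0))):
        threading.Thread(target=worker, daemon=True).start()

    for i, filePath in enumerate(filePaths):
        updateQueue.put((i, filePath))

    updateQueue.join()  # blocks until every queued item has been marked task_done()
    return success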
# This worker will download from remote, and update success if it fails.
def worker():
    # Needed to update status and progress.
    nonlocal success
Even Python has race conditions between threads, so results should be sent back using a queue also.
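A minimal sketch of that suggestion, with hypothetical names (not code from this PR): each worker puts its outcome on a result queue, and the main thread aggregates after the join instead of sharing a mutable flag.

import queue

def worker(updateQueue, resultQueue, fetch):
    # fetch: hypothetical callable that downloads one file; raises on failure.
    while True:
        filePath = updateQueue.get()
        try:
            fetch(filePath)
            resultQueue.put((filePath, True))
        except Exception:
            resultQueue.put((filePath, False))
        finally:
            updateQueue.task_done()

def overallSuccess(resultQueue):
    # Called on the main thread after updateQueue.join(); drains the results.
    results = []
    while not resultQueue.empty():
        results.append(resultQueue.get())
    return all(ok for _, ok in results)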
Good point. There is only one result: success is initialized to True, any thread that sees a failure sets it to False, and no thread touches it otherwise. Since we use a .join() below, any thread that encountered a failure will have set success to False, and by the time we read success in the main thread the threads will have joined, so the value is consistent. Lmk if my understanding of this is wrong. Thanks!
Yeah, since the only effect of worker() on the sync world is (a) possibly setting success to False, and (b) modifying a unique file, I think we're safe here.
And the race on lastMsgTime will probably just result in writing the "Updated n/m" message twice. And, of course, the number of files the output claims to have already fetched can actually go down if HTTP responses come back out of order.
Ah yup the printing is indeed racy.
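One way to avoid that race (not something this PR does) would be to guard the counter and the message under a lock. A minimal sketch, with hypothetical names:

import threading
import time

progressLock = threading.Lock()
lastMsgTime = 0.0
fetchedCount = 0

def reportProgress(total):
    # Called by each worker after it finishes a file.
    global lastMsgTime, fetchedCount
    with progressLock:
        fetchedCount += 1
        now = time.time()
        if now - lastMsgTime >= 1:  # rate-limit to roughly one message per second
            lastMsgTime = now
            print(f"Updated {fetchedCount}/{total}")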
# Create as many threads as we can.
for i in range(len(os.sched_getaffinity(0))):
    t = threading.Thread(target=worker, daemon=True).start()
https://docs.aiohttp.org/en/stable/ suggests that we can avoid threads entirely by using a different fetching library.
Can you point to where on that page it mentions avoiding threads?
Also, I'm hesitant to pull in an entire library just to make this parallel, but if @tabatkins is okay with it, I can do that!
These threads don't seem too complicated, so I'm fine with using them straight for now.
https://docs.aiohttp.org/en/stable/glossary.html#term-asyncio mentions that it's single-threaded, which reduces how many race conditions you need to handle.
updateQueue.put((i, filePath))

# Create as many threads as we can.
for i in range(len(os.sched_getaffinity(0))):
Whether or not you use threads, the number of parallel HTTP fetches shouldn't be based on the number of usable CPUs. HTTP handling isn't CPU-bound, and Python threads run under a global lock that prevents them from using more than 1 net CPU anyway. (That's why multiprocessing exists, but again, because HTTP handling isn't CPU-bound, it won't help here.)
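For illustration only (not from this PR), the fetch concurrency could be a tunable constant that is independent of CPU count, e.g. via a thread pool; MAX_PARALLEL_FETCHES and the fetch callable here are assumed names.

from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_FETCHES = 16  # tuned for the network and the server, not for CPUs

def updateAll(filePaths, fetch):
    # fetch: callable that downloads one file and returns True on success.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_FETCHES) as pool:
        results = list(pool.map(fetch, filePaths))
    return all(results)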
Whether or not you use threads, the number of parallel HTTP fetches shouldn't be based on the number of usable CPUs.
Sure, this is merely a heuristic. If I have 10 cores, I should be able to run up to 10 threads that do blocking work. It's a quick way of running blocking network calls in parallel.
Python threads run under a global lock that prevents them from using more than 1 net CPU anyway.
I don't know how the GIL interacts here, but from my testing, I see a speedup. If the GIL prevents these threads from being parallelized, why do we see a speedup?
HTTP handling isn't CPU-bound, it won't help here
Yeah, it's a network call, but in this case the network call blocks, so you can think of it as a CPU busy-wait (for the network response).
Sorry, I didn't write that clearly. The GIL prevents multiple Python threads from using CPU concurrently, but it gets released while a Python thread is waiting on an OS function, which is what happens while one of these HTTP fetches waits for the response to come back. That's not a busy-wait.
So, even with just 1 core, you can efficiently run lots of threads here to get that many HTTP fetches to be sent in parallel. They can't actually process results in parallel, but that's fine because processing results takes much less wall-time than waiting for the server to respond.
Okay, I ended up rewriting this myself, relying on aiohttp (and aiofiles for the file creation). Thanks so much for the impetus to get this done, @ngzhian, and the pointer to aiohttp, @jyasskin! On my laptop, updating went from ~3 files/second to, uh, about 200 files/second. I can regen the entire set of files now (about 1700) in less than 10 seconds. This is astonishing, I had no idea I was blocking so long on network like that, phew. Anyway, this was great! Yay!
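For reference, a minimal sketch of what an aiohttp + aiofiles fetch-and-write loop can look like. This is illustrative only, not the actual Bikeshed implementation; the URLs, paths, and function names are placeholders.

import asyncio
import aiofiles
import aiohttp

async def fetchOne(session, url, path):
    # Download one file and write it to disk without blocking the event loop.
    async with session.get(url) as resp:
        data = await resp.text()
    async with aiofiles.open(path, "w") as fh:
        await fh.write(data)

async def updateAll(entries):
    # entries: iterable of (url, localPath) pairs.
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetchOne(session, url, path) for url, path in entries))

# Example usage (placeholder URL and path):
# asyncio.run(updateAll([("https://example.com/spec.json", "spec-data/spec.json")]))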
Amazing, thanks Tab and Jeffrey!
Heads up, I see that some of our builds are hanging; see https://travis-ci.org/github/WebAssembly/simd/builds/716629866 for example. Could be related to this.
A local run of
Seems like f539c4a fixed it; we were pulling a broken alpha version (maybe).