Parallel Write and XOF #25

Merged: 3 commits into master, Feb 24, 2025


lukechampine
Owner

Adds guts.CompressEigentree for concurrently compressing 2^n buffers. Write also spawns a goroutine for each eigentree. This should get us close to full CPU load for large inputs.

Also parallelizes XOF while we're at it, because why not.

@glycerine, would you mind running go test -bench=.? I don't have access to an AVX-512 machine at the moment and I'm curious about the improvement here.

lukechampine and others added 2 commits February 10, 2025 18:32
Co-authored-by: Jason E. Aten, Ph.D. <jason@devnull>
@glycerine

here you go. I'm not sure if it's working...

$ go test -bench=.
goos: darwin
goarch: amd64
pkg: lukechampine.com/blake3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkWrite-8         1387850               862.0 ns/op      1187.98 MB/s        1118 B/op          0 allocs/op
BenchmarkXOF/64-8        5405406               222.8 ns/op       287.22 MB/s           0 B/op          0 allocs/op
BenchmarkXOF/1024-8      5443389               216.7 ns/op      4726.10 MB/s           0 B/op          0 allocs/op
BenchmarkXOF/65536-8               37803             27599 ns/op        2374.58 MB/s       11281 B/op        129 allocs/op
BenchmarkXOF/1048576-8              3523            328438 ns/op        3192.61 MB/s      180558 B/op       2049 allocs/op
BenchmarkSum256/64-8            10230352               113.8 ns/op       562.19 MB/s           0 B/op          0 allocs/op
BenchmarkSum256/1024-8            699298              1666 ns/op         614.49 MB/s           0 B/op          0 allocs/op
BenchmarkSum256/65536-8            33650             35670 ns/op        1837.28 MB/s       39451 B/op         33 allocs/op
BenchmarkSum256/1048576-8           9456            123944 ns/op        8460.07 MB/s       48587 B/op        174 allocs/op
PASS
ok      lukechampine.com/blake3 12.947s
jaten@Js-MacBook-Pro ~/go/src/github.com/lukechampine/blake3 (parallel) $ 

jaten@Js-MacBook-Pro ~/go/src/github.com/lukechampine/blake3 (parallel) $ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
jaten@Js-MacBook-Pro ~/go/src/github.com/lukechampine/blake3 (master) $ go test -bench=.
goos: darwin
goarch: amd64
pkg: lukechampine.com/blake3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkWrite-8         4052662               279.3 ns/op      3666.64 MB/s           0 B/op          0 allocs/op
BenchmarkXOF-8           5653120               203.8 ns/op      5024.51 MB/s           0 B/op          0 allocs/op
BenchmarkSum256/64-8            10829556               105.3 ns/op       607.59 MB/s           0 B/op          0 allocs/op
BenchmarkSum256/1024-8            717979              1631 ns/op         627.76 MB/s           0 B/op          0 allocs/op
BenchmarkSum256/65536-8            61836             18020 ns/op        3636.80 MB/s           0 B/op          0 allocs/op
PASS
ok      lukechampine.com/blake3 6.828s
jaten@Js-MacBook-Pro ~/go/src/github.com/lukechampine/blake3 (master) $ 

@lukechampine
Owner Author

ah, the tests changed too; try testing master with the parallel version of blake3_test.go

@lukechampine
Owner Author

Seems like the parallelism in XOF is backfiring:

BenchmarkXOF/64-8        5405406               222.8 ns/op       287.22 MB/s           0 B/op          0 allocs/op
BenchmarkXOF/1024-8      5443389               216.7 ns/op      4726.10 MB/s           0 B/op          0 allocs/op
BenchmarkXOF/65536-8               37803             27599 ns/op        2374.58 MB/s       11281 B/op        129 allocs/op
BenchmarkXOF/1048576-8              3523            328438 ns/op        3192.61 MB/s      180558 B/op       2049 allocs/op

Probably because it spawns a separate goroutine for each 1KB. I'll try switching this to at most runtime.NumCPU goroutines.

@glycerine

Yeah that will hurt.

I did a study of how much to feed each goroutine on the original PR. You can see how much the final speed depends on this. It's pretty dramatic.

https://github.com/glycerine/blake3/blob/51f61af988805dbdad54fc4e2ffdb56609bc1e1b/parallel_test.go#L127

I've heard that Hetzner's VPS (might?!) have AVX and can be rented for $5/month. Probably worth chatting with them first to confirm.

@lukechampine
Owner Author

ok, should be better now. I explored adding a similar limit to (Hasher).Write, but there was no significant benefit, probably because there will be at most 2*log(n) eigentrees (as opposed to n XOF blocks).
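
To illustrate why the eigentree count stays small, here is a sketch that splits a range of chunks into aligned power-of-two runs. It assumes an eigentree is a maximal run of 2^k chunks whose starting index is a multiple of 2^k; the thread does not define the term, and `subtrees` is a hypothetical helper, not a function from the library.

```go
package main

import (
	"fmt"
	"math/bits"
)

// subtrees splits the chunk range [start, start+n) into aligned
// power-of-two runs and returns the size of each run in chunks.
func subtrees(start, n uint64) []uint64 {
	var sizes []uint64
	for n > 0 {
		// Largest power of two that fits in the remaining length.
		size := uint64(1) << (bits.Len64(n) - 1)
		// A run must start at a multiple of its own size, so it can be
		// no larger than the lowest set bit of start.
		if start != 0 {
			if align := uint64(1) << bits.TrailingZeros64(start); align < size {
				size = align
			}
		}
		sizes = append(sizes, size)
		start += size
		n -= size
	}
	return sizes
}

func main() {
	fmt.Println(subtrees(0, 13)) // prints [8 4 1]
	fmt.Println(subtrees(3, 13)) // prints [1 4 8]
	// Even a million chunks at an awkward offset yields only a few runs.
	fmt.Println(len(subtrees(12345, 1<<20)))
}
```

Run sizes first grow (to satisfy alignment) and then shrink (to fit the remaining length), which is where a bound on the order of 2*log(n) runs comes from, compared with n separate XOF blocks.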

Thanks for the tip about Hetzner, I'll look into that!

@lukechampine lukechampine merged commit 02493b4 into master Feb 24, 2025