Parallel Write and XOF #25

lukechampine · 2025-02-10T23:39:03Z

Adds guts.CompressEigentree for concurrently compressing 2^n buffers. Write also spawns a goroutine for each eigentree. This should get us close to full CPU load for large inputs.
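
For readers unfamiliar with the idea, here is a minimal sketch of "one goroutine per eigentree". It assumes an eigentree means a power-of-two-sized run of 1 KiB chunks and that the input is a whole number of chunks. The names splitPow2, compressSubtree, and compressAll are hypothetical, and compressSubtree is only a byte-sum stand-in, not the real guts.CompressEigentree.

```go
package main

import (
	"fmt"
	"math/bits"
	"sync"
)

const chunkSize = 1024 // BLAKE3 chunk size in bytes

// splitPow2 (hypothetical helper) splits n chunks into power-of-two-sized
// runs, largest first, following the binary representation of n.
func splitPow2(n int) []int {
	var sizes []int
	for n > 0 {
		s := 1 << (bits.Len(uint(n)) - 1) // largest power of two <= n
		sizes = append(sizes, s)
		n -= s
	}
	return sizes
}

// compressSubtree is a stand-in for real subtree compression.
// It only sums the bytes so the result is easy to check.
func compressSubtree(buf []byte) uint64 {
	var sum uint64
	for _, b := range buf {
		sum += uint64(b)
	}
	return sum
}

// compressAll processes each power-of-two run in its own goroutine and
// returns one result per run, in input order.
func compressAll(buf []byte) []uint64 {
	sizes := splitPow2(len(buf) / chunkSize)
	out := make([]uint64, len(sizes))
	var wg sync.WaitGroup
	off := 0
	for i, s := range sizes {
		part := buf[off : off+s*chunkSize]
		off += s * chunkSize
		wg.Add(1)
		go func(i int, part []byte) {
			defer wg.Done()
			out[i] = compressSubtree(part) // each goroutine writes a distinct slot
		}(i, part)
	}
	wg.Wait()
	return out
}

func main() {
	buf := make([]byte, 13*chunkSize)
	for i := range buf {
		buf[i] = 1
	}
	fmt.Println(splitPow2(13), compressAll(buf))
}
```

With 13 chunks this yields three runs (8, 4, and 1 chunks), so three goroutines. The real implementation would also have to merge the subtree results into the hasher state, which this sketch leaves out.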

Also parallelizes XOF while we're at it, because why not.

@glycerine, would you mind running go test -bench=.? I don't have access to an AVX-512 machine at the moment and I'm curious about the improvement here.

Co-authored-by: Jason E. Aten, Ph.D. <jason@devnull>

glycerine · 2025-02-11T17:04:37Z

here you go. I'm not sure if it's working...

$ go test -bench=.
go test -bench=.
goos: darwin
goarch: amd64
pkg: lukechampine.com/blake3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkWrite-8         1387850               862.0 ns/op      1187.98 MB/s        1118 B/op          0 allocs/op
BenchmarkXOF/64-8        5405406               222.8 ns/op       287.22 MB/s           0 B/op          0 allocs/op
BenchmarkXOF/1024-8      5443389               216.7 ns/op      4726.10 MB/s           0 B/op          0 allocs/op
BenchmarkXOF/65536-8               37803             27599 ns/op        2374.58 MB/s       11281 B/op        129 allocs/op
BenchmarkXOF/1048576-8              3523            328438 ns/op        3192.61 MB/s      180558 B/op       2049 allocs/op
BenchmarkSum256/64-8            10230352               113.8 ns/op       562.19 MB/s           0 B/op          0 allocs/op
BenchmarkSum256/1024-8            699298              1666 ns/op         614.49 MB/s           0 B/op          0 allocs/op
BenchmarkSum256/65536-8            33650             35670 ns/op        1837.28 MB/s       39451 B/op         33 allocs/op
BenchmarkSum256/1048576-8           9456            123944 ns/op        8460.07 MB/s       48587 B/op        174 allocs/op
PASS
ok      lukechampine.com/blake3 12.947s
jaten@Js-MacBook-Pro ~/go/src/github.com/lukechampine/blake3 (parallel) $ 

jaten@Js-MacBook-Pro ~/go/src/github.com/lukechampine/blake3 (parallel) $ git checkout master
git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
jaten@Js-MacBook-Pro ~/go/src/github.com/lukechampine/blake3 (master) $ go test -bench=.
go test -bench=.
goos: darwin
goarch: amd64
pkg: lukechampine.com/blake3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkWrite-8         4052662               279.3 ns/op      3666.64 MB/s           0 B/op          0 allocs/op
BenchmarkXOF-8           5653120               203.8 ns/op      5024.51 MB/s           0 B/op          0 allocs/op
BenchmarkSum256/64-8            10829556               105.3 ns/op       607.59 MB/s           0 B/op          0 allocs/op
BenchmarkSum256/1024-8            717979              1631 ns/op         627.76 MB/s           0 B/op          0 allocs/op
BenchmarkSum256/65536-8            61836             18020 ns/op        3636.80 MB/s           0 B/op          0 allocs/op
PASS
ok      lukechampine.com/blake3 6.828s
jaten@Js-MacBook-Pro ~/go/src/github.com/lukechampine/blake3 (master) $
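
To make the two runs easier to compare, here is a small sketch of how the MB/s column relates to ns/op. It assumes the numeric suffix in the benchmark name (for example Sum256/65536) is the number of bytes hashed per op, and it uses 1 MB = 1e6 bytes, which is how go test reports throughput when a benchmark calls b.SetBytes. The helper name mbPerSec is made up for illustration.

```go
package main

import "fmt"

// mbPerSec converts bytes processed per op and nanoseconds per op into
// MB/s, with 1 MB = 1e6 bytes.
func mbPerSec(bytesPerOp int64, nsPerOp float64) float64 {
	return float64(bytesPerOp) / 1e6 / (nsPerOp / 1e9)
}

func main() {
	// Sum256/65536 timings taken from the two runs above.
	fmt.Printf("parallel: %.0f MB/s\n", mbPerSec(65536, 35670))
	fmt.Printf("master:   %.0f MB/s\n", mbPerSec(65536, 18020))
}
```

The computed values agree with the reported 1837.28 and 3636.80 MB/s to within rounding of the printed ns/op.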

lukechampine · 2025-02-12T16:56:49Z

ah, the tests changed too; try testing master with the parallel version of blake3_test.go

lukechampine · 2025-02-13T16:16:55Z

Seems like the parallelism in XOF is backfiring:

BenchmarkXOF/64-8        5405406               222.8 ns/op       287.22 MB/s           0 B/op          0 allocs/op
BenchmarkXOF/1024-8      5443389               216.7 ns/op      4726.10 MB/s           0 B/op          0 allocs/op
BenchmarkXOF/65536-8               37803             27599 ns/op        2374.58 MB/s       11281 B/op        129 allocs/op
BenchmarkXOF/1048576-8              3523            328438 ns/op        3192.61 MB/s      180558 B/op       2049 allocs/op

Probably because it spawns a separate goroutine for each 1KB. I'll try switching this to at most runtime.NumCPU.
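
A rough sketch of that change, under some assumptions: the output is a whole number of 1 KiB blocks, each block can be produced independently from its index, and fillBlock is a placeholder rather than the real XOF block function. fillParallel is a hypothetical name. Instead of one goroutine per block, the blocks are divided into at most maxWorkers contiguous ranges.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

const blockSize = 1024 // work unit per block, matching the "1KB" above

// fillBlock is a placeholder for producing one block of XOF output.
// It fills the block with the low byte of the block counter.
func fillBlock(dst []byte, counter uint64) {
	for i := range dst {
		dst[i] = byte(counter)
	}
}

// fillParallel fills dst using at most maxWorkers goroutines, each handling
// a contiguous range of blocks. It returns the number of goroutines spawned.
func fillParallel(dst []byte, maxWorkers int) int {
	nBlocks := len(dst) / blockSize
	workers := maxWorkers
	if workers > nBlocks {
		workers = nBlocks
	}
	if workers < 1 {
		return 0
	}
	per := (nBlocks + workers - 1) / workers // blocks per goroutine, rounded up
	var wg sync.WaitGroup
	spawned := 0
	for start := 0; start < nBlocks; start += per {
		end := start + per
		if end > nBlocks {
			end = nBlocks
		}
		wg.Add(1)
		spawned++
		go func(start, end int) {
			defer wg.Done()
			for b := start; b < end; b++ {
				fillBlock(dst[b*blockSize:(b+1)*blockSize], uint64(b))
			}
		}(start, end)
	}
	wg.Wait()
	return spawned
}

func main() {
	out := make([]byte, 1<<20) // 1 MiB of output, i.e. 1024 blocks
	n := fillParallel(out, runtime.NumCPU())
	fmt.Println("goroutines spawned:", n)
}
```

For a 1 MiB output this spawns at most runtime.NumCPU goroutines rather than 1024, which should cut the scheduling and allocation overhead visible in the allocs/op column.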

glycerine · 2025-02-13T21:41:37Z

Yeah that will hurt.

I did a study of how much to feed each goroutine on the original PR. You can see how much the final speed depends on this. It's pretty dramatic.

https://github.com/glycerine/blake3/blob/51f61af988805dbdad54fc4e2ffdb56609bc1e1b/parallel_test.go#L127

I've heard that Hetzner's VPS (might?!) have AVX and can be rented for $5/month. Probably worth chatting with them first to confirm.

lukechampine · 2025-02-14T15:54:42Z

ok, should be better now. I explored adding a similar limit to (Hasher).Write, but there was no significant benefit, probably because there will be at most 2*log(n) eigentrees (as opposed to n XOF blocks).
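
One way to see why the eigentree count stays small is the sketch below. It assumes an eigentree is a power-of-two run of chunks whose starting chunk index is a multiple of its length, which is my reading and not something the code confirms. alignedRuns is a hypothetical helper that greedily decomposes a chunk range into such runs.

```go
package main

import (
	"fmt"
	"math/bits"
)

// alignedRuns decomposes the chunk range [off, off+n) into power-of-two
// runs whose start index is a multiple of their length. It returns the
// run lengths in order.
func alignedRuns(off, n uint64) []uint64 {
	var runs []uint64
	for n > 0 {
		// Largest power of two that fits in the remaining length.
		size := uint64(1) << (bits.Len64(n) - 1)
		// Shrink the run if the current offset is not aligned to it.
		// Offset 0 is aligned to every power of two.
		if off != 0 {
			if align := uint64(1) << bits.TrailingZeros64(off); align < size {
				size = align
			}
		}
		runs = append(runs, size)
		off += size
		n -= size
	}
	return runs
}

func main() {
	const chunks = 1024 // 1 MiB of 1 KiB chunks
	fmt.Println(len(alignedRuns(0, chunks))) // write starts at chunk 0
	fmt.Println(len(alignedRuns(1, chunks))) // write starts at chunk 1
}
```

For 1024 chunks this gives 1 run when the write starts aligned and 11 runs when it starts at chunk 1 (growing runs of 1, 2, ..., 512, then a final run of 1). Either way that is far fewer than the 1024 separate 1 KiB blocks the XOF was spawning goroutines for.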

Thanks for the tip about Hetzner, I'll look into that!

lukechampine and others added 2 commits February 10, 2025 18:32

implement parallel eigentree compression

a5e025b

Co-authored-by: Jason E. Aten, Ph.D. <jason@devnull>

implement parallel XOF output

de42635

lukechampine mentioned this pull request Feb 10, 2025

Parallel hashing for 6-9x speedup #24

Closed

spawn at most runtime.NumCPU goroutines in XOF

84bf553

lukechampine merged commit 02493b4 into master Feb 24, 2025
