Instruction-level parallelism (MMX, SSE, AVX, NEON, SVE, ...) and "restrict" #1839
Comments
@dumblob I think you need to invoke the V compiler with the -prod argument to enable optimizations for the intermediate C code - later on, V will likely get its own built-in code optimizer.
Other optimizations are not that important as …
I see now what you mean: passing by value is "copying the source content" in my understanding, whereas passing by reference just copies the pointer to that value - which should be faster almost all of the time, especially for structures bigger than the pointer itself. Thanks for clarifying!
Not necessarily "bigger than the pointer itself" - the inflection point actually seems to be strictly bigger than the size of a cache line (nowadays most commonly 64 bytes, sometimes 128), and in practice even more than that, because passing by reference means "randomly" jumping over memory, which defeats caching and severely degrades performance. That seems to be the major reason why V is so fast nowadays despite copying nearly everything. Btw. generating instructions is super complicated for special cases like spin locks (which will be needed e.g. for #1868). It would also make sense to give programmers full control over CPU caches to squeeze out another 10-20% of speed in tight loops - not just via inline assembly, but rather with intrinsics which play well with V's internals (e.g. the alignment of structs/arrays/integers/etc.).
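To make the cache-control idea concrete, here is a minimal C sketch (my illustration, not anything V emits today) using the GCC/Clang `__builtin_prefetch` builtin and C11 `alignas`; the 64-byte line size and the prefetch distance of 16 floats are assumptions:

```c
#include <stddef.h>
#include <stdio.h>
#include <stdalign.h>

// Align the hot array to a 64-byte cache-line boundary so elements
// never straddle two lines (assumes a 64-byte line size).
static alignas(64) float samples[1024];

static float sum_with_prefetch(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        // Hint the CPU to pull the line ~16 floats ahead into cache.
        // Args: address, 0 = prefetch for a read, 3 = high temporal locality.
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);
        s += a[i];
    }
    return s;
}

int main(void) {
    for (size_t i = 0; i < 1024; i++) samples[i] = 1.0f;
    printf("%f\n", sum_with_prefetch(samples, 1024));
    return 0;
}
```

A V-level intrinsic could lower to exactly these builtins while staying aware of V's own struct/array alignment rules.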
There is an interesting attempt (well implemented, tested, and soon to be used in production) to optimize nested (tight) loops in a different way than just by smartly unrolling them (loop unrolling seems less efficient for nested loops than for non-nested ones, whereas this quasi-novel approach maintains its efficiency). See mratsim/weave#34.
Actually what I did in mratsim/weave#34 is just parallelizing reductions across multiple cores. That does not remove the effect of, or the need for, instruction-level parallelism. For reductions, due to the heavy latency of register-level data dependencies, naive floating-point reductions are often 2-3x slower, see https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE&expand=158 :

```nim
for i in 0 ..< N:
  result += a[i]
```

A scalar floating-point add has a latency of 3-4 cycles but a throughput of about 1 per cycle, so independent adds can issue every cycle. But when each add depends on the previous one, only 1 completes every 3 or 4 cycles - a slowdown of 3 to 8x depending on the CPU architecture. In practice the actual slowdown is 2-3x due to the loop becoming bound by memory speed. Why floating point? Because floating-point addition is not associative, so without -ffast-math the compiler cannot reorder the accumulation into independent chains by itself. Plenty of detailed benchmarks (multiple accumulators, OpenMP, SSE, AVX, ffastmath or combinations thereof) on reduction latencies in my High-Performance Computing primitives repo: https://github.com/numforge/laser/tree/d1e6ae6106564bfb350d4e566261df97dbb578b3/benchmarks/fp_reduction_latency
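A minimal C sketch (my addition, not from the comment above) of the multiple-accumulator technique those benchmarks measure; splitting the sum into independent chains lets the adds overlap in the pipeline:

```c
#include <stddef.h>

// Naive reduction: each add waits on the previous one (loop-carried
// dependency), so the FPU retires ~1 add every 3-4 cycles.
float sum_naive(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

// Four independent accumulators: adds from different chains overlap in
// the pipeline, approaching 1 add per cycle. This changes the rounding
// order, which is why compilers only do it themselves under -ffast-math.
float sum_ilp(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  // scalar remainder
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```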
Another optimization for …
Not many people have the time, space, or patience for BinTuner, especially if you have to regenerate all the intermediate steps. There's nothing stopping people who do have those resources from running BinTuner on their own, but I don't think I'd want to see it as a standard feature of …
Somewhat distantly related to the generation of …
This issue was moved to a discussion. You can continue the conversation there.
A quick glance over the code suggested there is currently no support for instruction-level parallelism in the C code generation.
To my greatest surprise though, even the simplest thing is not being used: the keyword `restrict` is missing everywhere. Without it, benchmarks involving pointers do not make as much sense as it might seem (https://github.com/vlang/v/blob/master/compiler/tests/bench/val_vs_ptr.c). Not to mention the very important language-level decision to "copy the value instead of passing a reference" which I've seen somewhere in the discussions.

IMHO the path to instruction-level parallelism should start with properly implementing the actual V semantics for passing without explicitly specifying `&` (and later also in some cases where `&`, aka references, are being passed around). In these cases `restrict` should be used nearly everywhere. `restrict` improves performance significantly: without it the compiler must assume that a store through one pointer may modify the data behind any other, so it keeps re-loading values from memory, which makes work with pointers super slow (orders of magnitude in the worst case).
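As a hedged illustration (my example, not actual V output), this is the kind of C where `restrict` pays off; without the qualifiers the compiler must assume `out` may alias `a` or `b` and cannot freely vectorize:

```c
// With restrict the compiler may assume the three arrays never overlap,
// so it can keep a[i] and b[i] in registers across the store to out[i]
// and emit SIMD loads/stores for the whole loop.
void add_arrays(float *restrict out,
                const float *restrict a,
                const float *restrict b,
                int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```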
Using `restrict` basically everywhere pointers are used should IMHO be the very first step. And the very few places where pointers really do overlap, aka "are aliased" (and thus `restrict` can't be used), should probably often be changed so that they no longer overlap.

The second step could be some tiny loop preparation (inlining, padding, etc.) to allow vectorization. I don't mean V should do the vectorization (loop unrolling, etc.) itself, but it should use V semantics to enhance the generated C code and make sure it's more easily vectorizable.

Last but not least, there should be some portable API for SIMD - see e.g. how Rust does it: https://github.com/rust-lang/project-portable-simd/blob/master/CHARTER.md .
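For a sense of what such a portable API would wrap per architecture, here is a minimal sketch with raw SSE intrinsics in C (the intrinsics are real; the function itself is my illustration, and it assumes n is a multiple of 4 to keep the sketch short):

```c
#include <immintrin.h>  // SSE intrinsics
#include <stddef.h>

// Add two float arrays 4 lanes at a time. A real implementation needs a
// scalar tail loop for n not divisible by 4 and per-ISA variants (AVX,
// NEON, ...), which is exactly the boilerplate a portable API hides.
void add_arrays_sse(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            // load 4 floats, unaligned ok
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&out[i], _mm_add_ps(va, vb)); // out[i..i+3] = a + b
    }
}
```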