Instruction-level parallelism (MMX, SSE, AVX, NEON, SVE, ...) and "restrict" #1839

Closed · dumblob opened this issue Sep 3, 2019 · 9 comments
Labels: Feature/Enhancement Request - This issue is made to request a feature or an enhancement to an existing one.

Comments

@dumblob (Contributor) commented Sep 3, 2019

A quick glance over the code suggested there is currently no support for instruction-level parallelism in the C code generation.

To my great surprise though, even the "simplest" thing is not being used: the keyword restrict is missing everywhere. Without it, any benchmark involving pointers means less than it might seem (https://github.com/vlang/v/blob/master/compiler/tests/bench/val_vs_ptr.c ). Not to mention the very important language-level decision between "copy the value" and "pass a reference", which I've seen somewhere in the discussions.

IMHO the path to instruction-level parallelism should start with properly implementing the actual V semantics for passing without an explicit & (and later also in some cases where &, i.e. references, are passed around). In these cases restrict should be used nearly everywhere. restrict can improve performance significantly: without it the compiler must assume that any store may alias any pointer, so it keeps reloading values from memory instead of keeping them in registers, making any work with pointers very slow (orders of magnitude in the worst case).

Using restrict basically everywhere pointers are used should IMHO be the very first step. The few places where pointers really do overlap, i.e. "are aliased" (and where restrict therefore can't be used), should probably often be restructured so they no longer overlap.
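
For illustration, a minimal C sketch of the difference (function names are mine, not from V's generator):

#include <stddef.h>

/* Without restrict the compiler must assume dst and src may overlap,
   so it re-loads src[i] after every store to dst[i] and cannot
   vectorize freely. */
void add_may_alias(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* With restrict the caller promises the arrays do not overlap, so the
   compiler may keep values in registers and emit SIMD loads/stores. */
void add_no_alias(float *restrict dst, const float *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}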

The second step could be some small loop preparation (inlining, padding, etc.) to allow vectorization. I don't mean that V should do the vectorization (loop unrolling, etc.) itself, but that it should use V semantics to shape the generated C code so that it is more easily vectorizable.
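
A hypothetical sketch of such preparation, assuming the runtime over-allocates arrays to the padded length (function name and padding width are illustrative):

#include <stddef.h>

/* Pad the trip count to a multiple of 8 floats so the hot loop has no
   scalar tail; auto-vectorizers handle such fixed-stride, tail-free
   loops best. Assumes the buffer is allocated to the padded length. */
void scale(float *restrict a, size_t n, float k) {
    size_t padded = (n + 7) & ~(size_t)7;  /* round n up to a multiple of 8 */
    for (size_t i = 0; i < padded; i++)
        a[i] *= k;                         /* extra lanes are scratch space */
}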

Last but not least, there should be some portable API for SIMD - see e.g. how Rust does it: https://github.com/rust-lang/project-portable-simd/blob/master/CHARTER.md .
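
For comparison, GCC and Clang already offer one portable SIMD notation for C that generated code could target - vector extensions (a sketch, not a proposal for the exact V API):

/* GCC/Clang vector extensions: element-wise operations on v4f compile
   to SSE on x86, NEON on ARM, etc., without per-ISA intrinsics. */
typedef float v4f __attribute__((vector_size(16)));  /* 4 x float */

v4f madd(v4f a, v4f b, v4f c) {
    return a * b + c;  /* element-wise multiply-add; may fuse into FMA */
}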

dumblob added the Feature/Enhancement Request label on Sep 3, 2019
@gslicer commented Sep 3, 2019

@dumblob I think you need to invoke the V compiler with the -prod argument to enable optimizations on the intermediate C code. Later on, V may get its own built-in code optimizer.

@dumblob (Contributor, Author) commented Sep 3, 2019

I think you need to invoke the V compiler with the -prod argument to enable optimizations on the intermediate C code

restrict is not used at all (with or without -prod), but it should be, as explained above.

Other optimizations are not as important as restrict, because restrict might change @medvednikov's opinion on passing by value vs. passing by reference in certain scenarios. Sure, passing by value will be faster for small structs and ints, but it suddenly gets utterly slow - especially in loops - after crossing some size boundary, and that boundary is what we should avoid by passing by reference.
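
As a C sketch of that boundary (types, sizes and names are illustrative):

#include <stddef.h>

typedef struct { int x, y; } Small;       /* 8 bytes: a copy is cheap */
typedef struct { double m[8][8]; } Big;   /* 512 bytes: a copy is not */

/* Small structs travel in registers; big ones are cheaper by pointer. */
int sum_small(Small s) { return s.x + s.y; }
double first_big(const Big *restrict b) { return b->m[0][0]; }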

@gslicer commented Sep 3, 2019

Other optimizations are not as important as restrict, because restrict might change @medvednikov's opinion on passing by value vs. passing by reference in certain scenarios. Sure, passing by value will be faster for small structs and ints, but it suddenly gets utterly slow - especially in loops - after crossing some size boundary, and that boundary is what we should avoid by passing by reference.

I see now what you mean: passing by value is "copying the source content", as I understand it, whereas by reference it's just "copying the pointer to that value" - which should be faster almost all the time, especially for structures bigger than the pointer itself.

Thanks for clarifying!

@dumblob (Contributor, Author) commented Nov 29, 2019

whereas by reference it's just "copying the pointer to that value" - which should be faster almost all the time, especially for structures bigger than the pointer itself.

Not necessarily "bigger than the pointer itself" - the "inflection point" actually seems to be strictly bigger than a cache line (nowadays most commonly 64 bytes, on some CPUs 128), and in practice even more than that, because following pointers means "randomly" jumping over memory, which defeats the cache and severely degrades performance. That seems to be the major reason why V is so fast nowadays despite copying nearly everything.

Anyway, restrict is far from a silver bullet. Those interested in instruction-level parallelism should dive into the recommendations of the famous guru Agner Fog and his ForwardCom - it pays off immediately. The V compiler could e.g. easily unroll loops that the C compiler can't easily unroll itself from the V-generated source, because it's much easier for the C compiler to compose small independent steps into one AVX instruction than to split one line of code into several steps that can then be mapped to an AVX instruction.
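
A sketch of what such V-side unrolling could emit (hypothetical; the 8-wide body mirrors one 256-bit AVX register of floats):

#include <stddef.h>

/* Pre-unrolled independent element-wise steps: the C compiler can fuse
   each group of 8 into a single AVX multiply/add. */
void axpy(float *restrict y, const float *restrict x, float a, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        y[i]     += a * x[i];      y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];  y[i + 3] += a * x[i + 3];
        y[i + 4] += a * x[i + 4];  y[i + 5] += a * x[i + 5];
        y[i + 6] += a * x[i + 6];  y[i + 7] += a * x[i + 7];
    }
    for (; i < n; i++)             /* scalar tail */
        y[i] += a * x[i];
}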

Btw. generation of instructions is super complicated for special cases like spin locks (which will be needed e.g. for #1868 ).
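
For reference, a minimal C11 spin lock sketch (what #1868 actually needs may differ; production locks also add backoff and a relaxed-load spin):

#include <stdatomic.h>

typedef struct { atomic_flag f; } spinlock;  /* init with ATOMIC_FLAG_INIT */

static void spin_lock(spinlock *l) {
    /* test-and-set loops until the flag was previously clear */
    while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
        ;  /* busy-wait */
}

static void spin_unlock(spinlock *l) {
    atomic_flag_clear_explicit(&l->f, memory_order_release);
}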

It would also make sense to give programmers full control over CPU caches to squeeze another 10-20% of speed out of tight loops - not just via inline assembly, but with intrinsics that play well with V's internals (e.g. the alignment of structs/arrays/integers/etc.).
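
A sketch of what such intrinsics could lower to, using C11 alignment and the GCC/Clang __builtin_prefetch builtin (the prefetch distance of 16 elements is an illustrative guess, not a tuned value):

#include <stddef.h>

_Alignas(64) static float table[1024];  /* cache-line-aligned hot data */

/* Prefetch ~16 elements ahead so the line is in cache when the loop
   reaches it. GCC documents that prefetching an invalid address does
   not fault, so the prefetch needs no bound check. */
float sum_prefetched(size_t n) {        /* assumes n <= 1024 */
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&table[i + 16], /*rw=*/0, /*locality=*/1);
        s += table[i];
    }
    return s;
}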

@dumblob (Contributor, Author) commented Dec 8, 2019

There is an interesting attempt (well implemented, tested, and soon to be used in production) to optimize nested (tight) loops in a different way than just by smartly unrolling them: loop unrolling seems less efficient for nested loops than for non-nested ones, but this quasi-novel approach keeps its efficiency. See mratsim/weave#34 .

@mratsim commented Dec 8, 2019

Actually, what I did in mratsim/weave#34 is just parallelizing reductions across multiple cores. That does not remove the effects of, or the need for, instruction-level parallelism.

For reductions, due to the latency of register-level data dependencies, naive floating-point reductions are often 2x-3x slower; see https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE&expand=158
The latency of an FP add is 3 or 4 cycles and the (reciprocal) throughput is 1 or 0.5, which reads as: you can issue 1 or 2 independent adds per cycle, for example

for i in 0 ..< N:
  result[i] = a[i] + b[i]  # independent adds: no cross-iteration dependency
But if there is a data dependency (e.g. result += a[i]), it's only 1 add every 3 or 4 cycles, i.e. a slowdown of 3x to 8x depending on the CPU architecture. In practice the actual slowdown is 2-3x, because the loop becomes bound by memory speed.

Why floating point?
Because integer addition is associative, i.e. (a+b)+c == a+(b+c), the compiler can reorder integer computations across multiple accumulators.
This is not true for floating point due to rounding errors, so the compiler only does that when -ffast-math is used.
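
As a C sketch of what that reassociation buys (the multi-accumulator form is what -ffast-math effectively licenses the compiler to do; note the rounding can differ):

#include <stddef.h>

/* Serial: each add must wait 3-4 cycles for the previous one. */
float reduce_serial(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent partial sums pipeline: several adds stay in flight. */
float reduce_4acc(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];      s1 += a[i + 1];
        s2 += a[i + 2];  s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];      /* scalar tail */
    return (s0 + s1) + (s2 + s3);
}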

Plenty of detailed benchmarks (multiple accumulators, OpenMP, SSE, AVX, ffastmath or combinations thereof) on reduction latencies in my High-Performance Computing primitives repo: https://github.com/numforge/laser/tree/d1e6ae6106564bfb350d4e566261df97dbb578b3/benchmarks/fp_reduction_latency

@dumblob (Contributor, Author) commented Apr 15, 2021

Another optimization for -prod builds might be BinTuner (a genetic-algorithm search over compiler optimization sequences to produce more efficient final binaries).

@JalonSolov (Contributor) commented

Not many people have the time, space, or patience for BinTuner, especially if you have to regenerate all the intermediate steps.

There's nothing stopping people with those options from running BinTuner on their own, but I don't think I'd want to see it as a standard feature of -prod.

@dumblob (Contributor, Author) commented May 10, 2021

Somewhat distantly related to generating restrict, together with using the restrict-enabled C stdlib functions (guarded by the __STDC_LIB_EXT1__ feature-test macro - e.g. memcpy_s(), strncpy_s(), ...), is the new [heap] attribute in V - see e.g. PR #10033 (comment).
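
For context, Annex K is optional and few libcs ship it, so generated code would have to probe for it - a sketch:

#define __STDC_WANT_LIB_EXT1__ 1  /* request Annex K declarations */
#include <string.h>

void checked_copy(char *dst, size_t dstsz, const char *src, size_t n) {
#ifdef __STDC_LIB_EXT1__
    /* bounds-checked: fails (returns nonzero) instead of overflowing */
    memcpy_s(dst, dstsz, src, n);
#else
    if (n <= dstsz)
        memcpy(dst, src, n);  /* portable fallback with a manual check */
#endif
}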

vlang locked and limited conversation to collaborators on Sep 22, 2021

This issue was moved to a discussion. You can continue the conversation there.
