Instruction-level parallelism (MMX, SSE, AVX, NEON, SVE, ...) and "restrict" #1839

Closed · dumblob opened this issue Sep 3, 2019 · 9 comments
Labels: Feature/Enhancement Request - This issue is made to request a feature or an enhancement to an existing one.

Comments

@dumblob (Contributor) commented Sep 3, 2019

A quick glance over the code suggested there is currently no support for instruction-level parallelism in the C code generation.

To my great surprise though, even the "simplest" thing is not being used: the keyword restrict is missing everywhere. Without it, any benchmark involving pointers means less than it might seem (https://github.com/vlang/v/blob/master/compiler/tests/bench/val_vs_ptr.c ). Not to mention the very important language-level decision between "copy the value" and "pass a reference", which I've seen somewhere in the discussions.

IMHO the path to instruction-level parallelism should start with properly implementing the actual V semantics for passing without an explicit & (and later also in some cases where &, i.e. references, are passed around). In these cases restrict should be used nearly everywhere. restrict can improve performance significantly: without it the compiler must assume that any store may alias any pointer, so it keeps reloading values from memory instead of keeping them in registers, making any work with pointers very slow (orders of magnitude in the worst case).

Using restrict basically everywhere pointers are used should IMHO be the very first step. The few places where pointers really do overlap, i.e. "are aliased" (and where restrict therefore can't be used), should probably often be restructured so they no longer overlap.
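
For illustration, a minimal C sketch of the difference (function names are mine, not from V's generator):

#include <stddef.h>

/* Without restrict the compiler must assume dst and src may overlap,
   so it re-loads src[i] after every store to dst[i] and cannot
   vectorize freely. */
void add_may_alias(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}

/* With restrict the caller promises the arrays do not overlap, so the
   compiler may keep values in registers and emit SIMD loads/stores. */
void add_no_alias(float *restrict dst, const float *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}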

The second step could be some small loop preparation (inlining, padding, etc.) to allow vectorization. I don't mean that V should do the vectorization (loop unrolling, etc.) itself, but that it should use V semantics to shape the generated C code so that it is more easily vectorizable.
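
A hypothetical sketch of such preparation, assuming the runtime over-allocates arrays to the padded length (function name and padding width are illustrative):

#include <stddef.h>

/* Pad the trip count to a multiple of 8 floats so the hot loop has no
   scalar tail; auto-vectorizers handle such fixed-stride, tail-free
   loops best. Assumes the buffer is allocated to the padded length. */
void scale(float *restrict a, size_t n, float k) {
    size_t padded = (n + 7) & ~(size_t)7;  /* round n up to a multiple of 8 */
    for (size_t i = 0; i < padded; i++)
        a[i] *= k;                         /* extra lanes are scratch space */
}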

Last but not least, there should be some portable API for SIMD - see e.g. how Rust does it: https://github.com/rust-lang/project-portable-simd/blob/master/CHARTER.md .
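
For comparison, GCC and Clang already offer one portable SIMD notation for C that generated code could target - vector extensions (a sketch, not a proposal for the exact V API):

/* GCC/Clang vector extensions: element-wise operations on v4f compile
   to SSE on x86, NEON on ARM, etc., without per-ISA intrinsics. */
typedef float v4f __attribute__((vector_size(16)));  /* 4 x float */

v4f madd(v4f a, v4f b, v4f c) {
    return a * b + c;  /* element-wise multiply-add; may fuse into FMA */
}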

dumblob added the Feature/Enhancement Request label on Sep 3, 2019
@gslicer commented Sep 3, 2019

@dumblob I think you need to invoke the V compiler with the -prod argument to enable optimizations on the intermediate C code. Later on, V may get its own built-in code optimizer.

@dumblob (Contributor, Author) commented Sep 3, 2019

I think you need to invoke the V compiler with the -prod argument to enable optimizations on the intermediate C code

restrict is not used at all (with or without -prod), but it should be, as explained above.

Other optimizations are not as important as restrict, because restrict might change @medvednikov's opinion on passing by value vs. passing by reference in certain scenarios. Sure, passing by value will be faster for small structs and ints, but it suddenly gets utterly slow - especially in loops - after crossing some size boundary, and that boundary is what we should avoid by passing by reference.
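
As a C sketch of that boundary (types, sizes and names are illustrative):

#include <stddef.h>

typedef struct { int x, y; } Small;       /* 8 bytes: a copy is cheap */
typedef struct { double m[8][8]; } Big;   /* 512 bytes: a copy is not */

/* Small structs travel in registers; big ones are cheaper by pointer. */
int sum_small(Small s) { return s.x + s.y; }
double first_big(const Big *restrict b) { return b->m[0][0]; }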

@gslicer commented Sep 3, 2019

Other optimizations are not as important as restrict, because restrict might change @medvednikov's opinion on passing by value vs. passing by reference in certain scenarios. Sure, passing by value will be faster for small structs and ints, but it suddenly gets utterly slow - especially in loops - after crossing some size boundary, and that boundary is what we should avoid by passing by reference.

I see now what you mean: passing by value is "copying the source content", as I understand it, whereas by reference it's just "copying the pointer to that value" - which should be faster almost all the time, especially for structures bigger than the pointer itself.

Thanks for clarifying!

@dumblob (Contributor, Author) commented Nov 29, 2019

whereas by reference it's just "copying the pointer to that value" - which should be faster almost all the time, especially for structures bigger than the pointer itself.

Not necessarily "bigger than the pointer itself" - the "inflection point" actually seems to be strictly bigger than a cache line (nowadays most commonly 64 bytes, on some CPUs 128), and in practice even more than that, because following pointers means "randomly" jumping over memory, which defeats the cache and severely degrades performance. That seems to be the major reason why V is so fast nowadays despite copying nearly everything.

Anyway, restrict is far from a silver bullet. Those interested in instruction-level parallelism should dive into the recommendations of the famous guru Agner Fog and his ForwardCom - it pays off immediately. The V compiler could e.g. easily unroll loops that the C compiler can't easily unroll itself from the V-generated source, because it's much easier for the C compiler to compose small independent steps into one AVX instruction than to split one line of code into several steps that can then be mapped to an AVX instruction.
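
A sketch of what such V-side unrolling could emit (hypothetical; the 8-wide body mirrors one 256-bit AVX register of floats):

#include <stddef.h>

/* Pre-unrolled independent element-wise steps: the C compiler can fuse
   each group of 8 into a single AVX multiply/add. */
void axpy(float *restrict y, const float *restrict x, float a, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        y[i]     += a * x[i];      y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];  y[i + 3] += a * x[i + 3];
        y[i + 4] += a * x[i + 4];  y[i + 5] += a * x[i + 5];
        y[i + 6] += a * x[i + 6];  y[i + 7] += a * x[i + 7];
    }
    for (; i < n; i++)             /* scalar tail */
        y[i] += a * x[i];
}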

Btw. generation of instructions is super complicated for special cases like spin locks (which will be needed e.g. for #1868 ).
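
For reference, a minimal C11 spin lock sketch (what #1868 actually needs may differ; production locks also add backoff and a relaxed-load spin):

#include <stdatomic.h>

typedef struct { atomic_flag f; } spinlock;  /* init with ATOMIC_FLAG_INIT */

static void spin_lock(spinlock *l) {
    /* test-and-set loops until the flag was previously clear */
    while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
        ;  /* busy-wait */
}

static void spin_unlock(spinlock *l) {
    atomic_flag_clear_explicit(&l->f, memory_order_release);
}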

It would also make sense to give programmers full control over CPU caches to squeeze another 10-20% of speed out of tight loops - not just via inline assembly, but with intrinsics that play well with V's internals (e.g. the alignment of structs/arrays/integers/etc.).
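
A sketch of what such intrinsics could lower to, using C11 alignment and the GCC/Clang __builtin_prefetch builtin (the prefetch distance of 16 elements is an illustrative guess, not a tuned value):

#include <stddef.h>

_Alignas(64) static float table[1024];  /* cache-line-aligned hot data */

/* Prefetch ~16 elements ahead so the line is in cache when the loop
   reaches it. GCC documents that prefetching an invalid address does
   not fault, so the prefetch needs no bound check. */
float sum_prefetched(size_t n) {        /* assumes n <= 1024 */
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&table[i + 16], /*rw=*/0, /*locality=*/1);
        s += table[i];
    }
    return s;
}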

@dumblob (Contributor, Author) commented Dec 8, 2019

There is an interesting attempt (well implemented, tested, and soon to be used in production) to optimize nested (tight) loops in a different way than just by smartly unrolling them: loop unrolling seems less efficient for nested loops than for non-nested ones, but this quasi-novel approach keeps its efficiency. See mratsim/weave#34 .

@mratsim commented Dec 8, 2019

Actually, what I did in mratsim/weave#34 is just parallelizing reductions across multiple cores. That does not remove the effects of, or the need for, instruction-level parallelism.

For reductions, due to the latency of register-level data dependencies, naive floating-point reductions are often 2x-3x slower; see https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE&expand=158
The latency of an FP add is 3 or 4 cycles and the (reciprocal) throughput is 1 or 0.5, which reads as: you can issue 1 or 2 independent adds per cycle, for example

for i in 0 ..< N:
  result[i] = a[i] + b[i]  # independent adds: no cross-iteration dependency
But if there is a data dependency (e.g. result += a[i]), it's only 1 add every 3 or 4 cycles, i.e. a slowdown of 3x to 8x depending on the CPU architecture. In practice the actual slowdown is 2-3x, because the loop becomes bound by memory speed.

Why floating point?
Because integer addition is associative, i.e. (a+b)+c == a+(b+c), the compiler can reorder integer computations across multiple accumulators.
This is not true for floating point due to rounding errors, so the compiler only does that when -ffast-math is used.
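
As a C sketch of what that reassociation buys (the multi-accumulator form is what -ffast-math effectively licenses the compiler to do; note the rounding can differ):

#include <stddef.h>

/* Serial: each add must wait 3-4 cycles for the previous one. */
float reduce_serial(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent partial sums pipeline: several adds stay in flight. */
float reduce_4acc(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];      s1 += a[i + 1];
        s2 += a[i + 2];  s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];      /* scalar tail */
    return (s0 + s1) + (s2 + s3);
}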

Plenty of detailed benchmarks (multiple accumulators, OpenMP, SSE, AVX, ffastmath or combinations thereof) on reduction latencies in my High-Performance Computing primitives repo: https://github.com/numforge/laser/tree/d1e6ae6106564bfb350d4e566261df97dbb578b3/benchmarks/fp_reduction_latency

@dumblob (Contributor, Author) commented Apr 15, 2021

Another optimization for -prod builds might be BinTuner (a genetic-algorithm search over compiler optimization sequences to produce more efficient final binaries).

@JalonSolov (Contributor) commented

Not many people have the time, space, or patience for BinTuner, especially if you have to regenerate all the intermediate steps.

There's nothing stopping people with those options from running BinTuner on their own, but I don't think I'd want to see it as a standard feature of -prod.

@dumblob (Contributor, Author) commented May 10, 2021

Somewhat distantly related to generating restrict, together with using the restrict-enabled C stdlib functions (guarded by the __STDC_LIB_EXT1__ feature-test macro - e.g. memcpy_s(), strncpy_s(), ...), is the new [heap] attribute in V - see e.g. PR #10033 (comment).
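
For context, Annex K is optional and few libcs ship it, so generated code would have to probe for it - a sketch:

#define __STDC_WANT_LIB_EXT1__ 1  /* request Annex K declarations */
#include <string.h>

void checked_copy(char *dst, size_t dstsz, const char *src, size_t n) {
#ifdef __STDC_LIB_EXT1__
    /* bounds-checked: fails (returns nonzero) instead of overflowing */
    memcpy_s(dst, dstsz, src, n);
#else
    if (n <= dstsz)
        memcpy(dst, src, n);  /* portable fallback with a manual check */
#endif
}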

vlang locked and limited conversation to collaborators on Sep 22, 2021

This issue was moved to a discussion. You can continue the conversation there.
