Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for SSE4 intrinsics by RyuJIT #14781

Closed
redknightlois opened this issue Jun 30, 2015 · 78 comments
Closed

Support for SSE4 intrinsics by RyuJIT #14781

redknightlois opened this issue Jun 30, 2015 · 78 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Numerics
Milestone

Comments

@redknightlois
Copy link

Support for many of the interesting instructions like popcnt (technically SSE4a) could be an interesting addition and prove to be useful to avoid using unmanaged code in certain performance sensitive applications.

Many (technically all) of the operations can be emulated in CPU when not available with specific optimizations for the target platform or even have the ability with specially crafted if-then-else optimizations. That would allow to even switch to an entirely different algorithm without any runtime impact (if properly done at the jitting phase).

@mellinoe
Copy link
Contributor

+@CarolEidt

@mellinoe
Copy link
Contributor

Do you have any specific suggestions or use cases in mind here? If you're curious/interested in general JIT optimizations, it may be more relevant to discuss over at https://github.com/dotnet/coreclr (Runtime repo). But if there are more specific use cases that could be exposed through some sort of API or client library, that would be interesting to discuss here (and there as well, probably).

@CarolEidt
Copy link
Contributor

It would be great to hear how & where this might be used. It would not be difficult to add an intrinsic, but we would probably want to avoid adding yet another configuration to support (in order to avoid exploding the test matrix) - so perhaps it would be something that could be enabled with AVX2.

For the SIMD intrinsics, we have an IsHardwareAccelerated property on the Vector class that allows the developer to select a different path. Perhaps something similar could be done here, as you seem to suggest.

That said, this is the first request that I've seen, so this is probably not something that would be high on our list.

@redknightlois
Copy link
Author

This request actually came from one particular place where I could have seen insane performance differences. A popcnt enabled select and rank implementation can have massive improvements on very low level database indexing tech. For example this was the actual algorithm I was looking into when I opened the issue: http://link.springer.com/chapter/10.1007%2F978-3-642-38527-8_15

But that is certainly not the only place where hardware intrinsics can make a huge difference. Not long ago I required a very fast non-cryptographic algorithm and ended up building xxHash just because I could achieve "decent" performance without SSE bit packing instructions. If I remember correctly, "decent" was about 70% performance of the memory bandwidth on my i5 (processing 2.5 Gb/sec in hashes) for the 64bits variant. That can certainly be improved with SSE operations.

My biggest gripe with IsHardwareAccelerated is that it is not fine grained enough. I wouldnt mind to have specific "libraries" with Microsoft approved JIT extensions if that helps alliviate the test matrix issues.

About specific use cases, some can be found in Roslyn. We, in the managed world, know for fact we dont have access to low-level primitives, so we end up building stuff like this: http://odetocode.com/blogs/scott/archive/2015/02/19/roslyn-code-gems-counting-bits.aspx ... that is popcnt, the operation that motivated opening the discussion :)

Another example, @stephentoub has opened this issue not too long ago (https://github.com/dotnet/corefx/issues/2025) offloading the crc32 operation to a hardware intrinsic has huge impact on commonly used framework functionality like DefrateStream. A 1.8x speedup on a general use routine like that is not to be taken lightly.

Why would I like to stick with managed code? Because the jump to unmanaged is very costly. Not long ago I was able to gain 30% just replacing the native memcmp (all safeguards off) with unsafe managed code. Mainly, because the jump to unmanaged code for a tight routine that could be called in the billions in just 2 minutes makes a huge different. I wrote a whole series about memory comparisons up to the point of finding the best unmanaged solution (http://ayende.com/blog/169825/excerpts-from-the-ravendb-performance-team-report-optimizing-memory-compare-copy-costs). After that I could get 30% on top of that just because of how the JIT was able to optimize the call-site when going full managed (even at the expense of losing 0.6% in the general case to unmanaged code). The managed code in question: https://github.com/Corvalius/ravendb/blob/master/Raven.Sparrow/Sparrow/Memory.cs

I believe that supports the use cases part. Why there is probably not many requests? I guess because asking for SIMD could be read as access to special purpose operations. Math intrinsics are just a bunch of those (very important and very welcomed) but there are other types like bit packing and manipulation instructions that are very important in other domains, but as of now not many are looking to implement high-performance code in .Net; but with the introduction of SIMD and open sourcing of the CLR which implies support for other platforms will certainly change that.

Most of the optimization issues are related to the JIT emitting better code when it really matters:
https://github.com/dotnet/coreclr/labels/optimization

Interest for performance is out there in the requests, and many of the issues are rooted in sub-par support for dealing with unsafe code or access to exploit the hardware:
dotnet/roslyn#1798
dotnet/roslyn#120
https://github.com/dotnet/coreclr/issues/916
https://github.com/dotnet/coreclr/issues/1015
https://github.com/dotnet/corefx/issues/1168
dotnet/roslyn#166

And those are the ones I am tracking, I am pretty sure with some work we can dig others.

In my dream world I would be able to write memcpy|memcmp|hashes|etc routines in unsafe (but portable) managed code when I need them and compete with the fastest routines available in the C world; while continue writing safe code with the flexibility and productivity I already have. I would also be able to compile specially crafted MSIL to OpenCL/Cuda too, but that is another topic :P

EDIT: @CarolEidt I just noticed you said: "so perhaps it would be something that could be enabled with AVX2". If you plan to implement the whole AVX2 instruction set, I will be VERY HAPPY!!! 😃

EDIT2: More issues.

@CarolEidt
Copy link
Contributor

@redknightlois - thanks! It's really helpful to have such a good articulation of the need. Just to be clear, I don't think there's any chance that I/we will implement the whole AVX2 instruction set, but just that enabling something like popcnt only for the AVX2 target (presuming, I think correctly, though I haven't verified, that AVX2 hardware would always support SSE4a) would allow us to support it without adding another target to test.
Regarding the granularity of IsHardwareAccelerated - I agree that it is too coarse. What do you think of something like a HardwarePopCount that took a reference to an int for the return value, and returned a bool indicating whether it was successful. So you could write code like:
if (HardwarePopCount(long source, ref int count))
{
// code that depends on popcnt
}
else
{
// alternate implementation
}
I don't think it's ideal, but it's certainly finer granularity. The non-accelerated version (i.e. the one that lives in the IL) would always simply return false. One could then also provide a PopCount that looked like the above, but had a managed implementation in the else clause. But providing the HardwarePopCount would allow the developer to choose a completely different algorithm (not counting bits) if popcnt wasn't accelerated.

Thoughts?

@mburbea
Copy link

mburbea commented Jul 1, 2015

I think it would be useful to offer a series of constants that acted as a means of feature detection. That way I could just write.

if( Feature.SupportHardwarePopCount)
{
        // code uses popCount goes here
}
else
{
    // code that can't counts bit.
}

RyuJit can optimize away the never visited branch like always based on the value of the constant.

The current implementation is too rigid, and some algorithms with the lack of intrinsics become difficult or impossible to beat a non-simd implementation, or require writing ugly code.

There is unfortunately little in the way of documentation for writing high-performance code that plays well with the JIT. Pointer tricks that work great in C/C++ do not always get optimized as you would expect. And you're pretty much forced to go and spend lots of time doing trial and error to see if the IL emitted gets turned into quality machine code.

@redknightlois
Copy link
Author

@mburbea I was actually thinking along those lines (even if today the response for those is always false and the code is library call):

Hardware.SSE4.IsAccelerated
Hardware.AVX2.IsAccelerated

If then we have special cases like:

Hardware.SSE4.IsPopCountAccelerated
Hardware.SSE4.IsCrc32Accelerated
Hardware.SSE4.IsBitPackingAccelerated

for groups of funcionality it will give far greater flexibility without losing generality.

@redknightlois
Copy link
Author

@CarolEidt Other intrinsecs that are very important for succinct and compact data structures (along with compression algorithms and indexing algorithms) while not SSE are:

Count the number of leading zeroes in variable (byte, int, long). In GCC: __builtin_clz();
Count the number of trainling zeroes in variable (byte, int, long). In GCC: __builtin_ctz();
Most significative 1 Bit. In VC++ https://msdn.microsoft.com/en-us/library/fbxyd7zd.aspx
Least Significative 1 Bit. In VC++ https://msdn.microsoft.com/en-us/library/wfd9z0bb.aspx
Byte Swaps. In VC++ https://msdn.microsoft.com/en-us/library/a3140177.aspx

Without those you have to go and implement something like this (instead of a single CPU operation):

int LeadingZeros(int x)
{
        x |= (x >> 1);
        x |= (x >> 2);
        x |= (x >> 4);
        x |= (x >> 8);
        x |= (x >> 16);
        return(sizeof(int)*8 -Ones(x));
}

int Ones(int x)
{
        x -= ((x >> 1) & 0x55555555);
        x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
        x = (((x >> 4) + x) & 0x0f0f0f0f);
        x += (x >> 8);
        x += (x >> 16);
        return(x & 0x0000003f);
} 

for every word size.

Given these types of operations are typically used in very hot-paths the difference of having an intrinsic is INSANE!!! :) ... There are plenty framework places where such things are done by hand, specially the byte swapping. Having that available would be a huge win in many situations, they shouldnt either complicate the test matrix.

Good thing is that on platforms that are not available a forced inline library call can be used. Either the platform supports it, or it doesnt... reverting to a library call is just fine.

@benaadams
Copy link
Member

I'd be very interested in Hamming weight/popcnt and bitscan

On Intel the came in with Nehalem (Q4 2008); and have been in the chips since then Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell and now Skylake; AMD since Barcelona (Q4? 2007); ARM in NEON Cortex A8/9 (2007?). So the fallback would probably be the road less taken.

Probably could have better names than the intrinsics though :)

@benaadams
Copy link
Member

@CarolEidt an example of where popcnt would be helpful in the aspnet code: https://github.com/aspnet/KestrelHttpServer/blob/dev/src/Microsoft.AspNet.Server.Kestrel/Http/FrameHeaders.Generated.cs#L66

Could replace with single instruction

@redknightlois
Copy link
Author

@benaadams I just hope that implementation is not in a hot-path, it is 15x slower than the naive (shift, add, and) implementation, and almost 30x of the optimized one using 12 arithmetic operations and one multiply. o.O

BenchmarkDotNet=v0.7.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz, ProcessorCount=4
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Type=Algo_BitCount Mode=Throughput Platform=HostPlatform Jit=HostJit .NET=HostFramework

Method AvrTime StdDev op/s
PopCount1 6.7466 us 0.9003 us 148,221.90
PopCount2 4.3872 us 0.0174 us 227,933.86
PopCount3 3.8315 us 0.0394 us 260,996.33
PopCountParallel2 3.0998 us 0.0256 us 322,604.58
Asp.Net 99.8271 us 0.7559 us 10,017.32

@redknightlois
Copy link
Author

@dadhi
Copy link

dadhi commented Nov 27, 2015

Another required use of popcount are persistent data structures like ideal hash tries HAMT or CHAMP.

This is a foundation for very efficient immutable data structures, that could provide an alternative to current AVL tree based collections in BCL.

The Clojure collections for instance are based on HAMT.

Using Hamming Weight instead of native popcount drastically degrades performance of such structures.

So ±100 @redknightlois

@ghost
Copy link

ghost commented Dec 16, 2015

👍 would be nice to have both variants in runtime:

Without codegen, we can do something like this in native:

#include <nmmintrin.h>
static inline bool HasPopcntIntrincis()
{
    static bool is_capable(false), capability_tested(false);

    if (capability_tested)
        return is_capable;

    capability_tested = true

    // see more example at https://msdn.microsoft.com/en-us/library/hskdteyh.aspx
    int CPUInfo[4] = {-1};
    __cpuid(CPUInfo, 0);
    is_capable = (CPUInfo[2] >> 23) & 1;
    return is_capable;
}

static inline int BitCountWithoutPOPCNT(uint64_t x)
{
    x -= ((x >> 1) & 0x5555555555555555ULL);
    x = (((x >> 2) & 0x3333333333333333ULL) + (x & 0x3333333333333333ULL));
    x = (((x >> 4) + x) & 0x0F0F0F0F0F0F0F0FULL);
    x *= 0x0101010101010101ULL;
    return static_cast<int>(x >> 56);
}

static inline int GetBitCount(uint64_t x)
{
    if(HasPopcntIntrincis()) // runtime check
        return _mm_popcnt_u64(x);

    return BitCountWithoutPOPCNT(x);
}

Then expose GetBitCount to managed surface area.

Alternatively, RyuJIT codegen can be equipped with AVX2 instruction set with fallback code to do the same thing bit more efficiently.

@redknightlois
Copy link
Author

Yet another place where JAVA is beating .Net in indexing technology because we don't have popcnt support. It is actually specifically called of as the reason of the performance improvement.

Better bitmap performance with Roaring bitmaps.
http://arxiv.org/pdf/1402.6407.pdf

BTW. I cannot implement this method because I don't have the supporting HW operations.

@jonathanmarston
Copy link

I'd be very interested in seeing support for popcnt. I have a project that I'm working on that heavily uses bitmaps and would benefit greatly. Right now I'm looking at needing to break down and write it in C++ instead of C#...

@CarolEidt
Copy link
Contributor

I don't think that any of these requests would be difficult to implement as intrinsics. The main issue is to define the appropriate API. The "path of least resistance" would probably be to put them in System.Numerics.Vectors.dll, but I'm not sure that's the best place from a design perspective. However, to get the conversation started (and admitting up front that API design is not my field), here is a preliminary proposal for 4 method that might be added to System.Numerics.Vector (the static Vector class):

public static int BitCount(long bits);
public static bool BitCountAccelerated();

public static int FirstSetBit(long bits);
public static bool FirstSetBitAccelerated();

This fixes the length of the "bit vector" at long, but has the attraction of simplicity.

I would not be in favor of a global "Feature" class that subsumed the responsibility for all "is feature XX accelerated", because I think it is better to associate them with the class that exposes the feature. I'm not invested in the "Accelerated" suffix, but I think it would be good to have a standard naming convention for these. One issue would be what "Accelerated" means - what if there is a JIT-generated code sequence that takes multiple instructions, but is otherwise more efficient than one could do in C#/F#/IL?

@redknightlois
Copy link
Author

@CarolEidt I agree with you, "Accelerated" should mean "Better than, even if we do this writing the IL directly".

I can try to build a few examples of how I envision such an API to work (as I have already the stock implementation for a few of the most important routines). But, I have a few questions:

  • Is the idea to provide "stock / even if not hw accelerated" implementations to avoid every single project to repeat itself? Ex, Roslyn, CoreFX, Kestrel, all have/had their own implemention for PopCount/BitCount.
  • Should we focus on "feature-set" or "behavior"? Feature-set: Is it SSE2, AVX, etc or Behavior: "Logical Shift, etc"
  • Should we priorize some to implement an small subset first but have room for improvement API wise?
  • Should we look into leveraging Vector itself (having 256/512 bits implementations of popcount and create an API that is restricted to 64bits does not look like a good choice to me).

@GSPP
Copy link

GSPP commented Feb 19, 2016

Maybe, the JIT can accelerate based on a well-known IL sequence instead of based on a method name. For example, the sequence

c = (v & 0x55555555) + ((v >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >> 16)& 0x0000FFFF);

could be converted to bitcount everywhere, no matter where it is defined. There should be documentation specifying the exact patterns being accelerated. That way there is no need to define an intrinsic method in the framework assemblies at all. Each project that wants to make use of these instructions can just copy and paste this implementation and achieve accelerated performance. This is a zero surface area approach.

I believe GCC and LLVM recognize these "magic" implementations and replace them with intrinsics. This is to create a portable way to implement a fast bitcount.

For each instruction to be exposed that way, the most common 1-3 patterns should be supported. That way user code can pick the fastest unaccelerated pattern for their case and still get it accelerated where possible.

For testing feature availability there could be a method JitCapabilities.IsFeaturePresent(string). User code can pull the result of that into static readonly bool variables. The JIT is currently already capable of inlining the value of such variables and eliminating dead code. User code could be:

static readonly bool isBitcountAccelerated = JitCapabilities.IsFeaturePresent("IsBitcountAccelerated");

if (isBitcountAccelerated) {
c = (v & 0x55555555) + ((v >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >> 16)& 0x0000FFFF);
} else {
 //some other approach
}

After optimizations this should collapse to c = bitcount(c). The if goes away.

Whatever design is chosen, it should be suitable to expose many intrinsics. There are many useful x86 instructions to be exposed. Parallel extract comes to mind. It is very versatile.

@redknightlois
Copy link
Author

@GSPP the problem there is that you have to write an specialized morpher for such complex chains of calls (and all their variations) which cost resources in runtime, giving less time to the JIT to do the rest of the work. In AoT compilers you wont care, but in JIT compilers you have to be very careful about the time it takes to handle that. The use of a library call has the advantage that now everybody will be able to use it from the same place, whether it is accelerated by HW/JIT or not. And only those that really require the performance will need to do JitCapabilities.Features.IsBitCountSupported kind of calls to use alternative codepaths.

@GSPP
Copy link

GSPP commented Feb 19, 2016

@redknightlois If it is documented that only specific sequences will be supported, then shouldn't it be rather quick to check for them? This particular code that I posted starts with (v & 0x55555555) + ... which is very rare. This first check will almost always immediately fail and conserve performance.

I believe the JIT uses a tree format internally. This should be fast to match with that. In SSA form it would be fast, too. Not sure if the JIT uses SSA, though.

Anyway, this is just an idea and I'm not qualified to argue further.

@jonathanmarston
Copy link

I'm not against the JIT pick up optimizations like @GSPP suggests, as long as it doesn't dramatically affect jitting time, but I'd personally like to use intrinsic methods instead. I don't want to have to copy and paste a supported bit counting algorithm, I just want to call a library method and be done.

Would it be best to have intrinsic methods (or extension methods) on the primitive integer types themselves?

This:

long number = 0x6E66399B032839E6;

int count = number.BitCount();
// count == 31 

int first = number.FirstSetBit();
// first == 1

Feels like a natural extension to the framework.

I also agree that there should be support for instructions larger than 64-bits. Most CPUs can at least do 256-bit instructions these days, and there are many use cases that would greatly benefit from taking advantage of that. Using System.Numerics.Vector<> is the most logical way to enable this by simply adding BitCount() and the like to the existing structure.

@redknightlois
Copy link
Author

@GSPP While the check may be simple it introduce other problems. What would you do with a developer that just rearrange the operations to make them look nicer (nice can be symetrical or whatever)?. If you dont want to do that you need to write a complex morpher like the one built to pick up bits rotations (https://github.com/dotnet/coreclr/issues/1619). And even though that one was is quite simple, it had many different ways to express it (some, non so obvious).

In the end, it is far better to just do as @jonathanmarston suggest (which to me is the approach) and support all basic types and Vector<T> with an appropriate method call.

@CarolEidt
Copy link
Contributor

@jonathanmarston - It may be that I have missed something, but I am not aware of an implementation of 256-bit (or even 128-bit) popcount. The x86-64 version is 64-bits only (and, a bit oddly, defined using an SSE FLT encoding, although it operates on memory and general purpose registers). I kind of like the idea of supporting a broad range of numeric types, but I don't think that extending the primitive types is really a practical approach. The easiest would be to make them static extension methods in the static System.Numerics.Vector class. It would avoid having to recognize yet another "special" assembly, and would also avoid having to change the core types.
The idea of a pattern-match is not a bad one, but is a bit fraught with the issues that @redknightlois mentions. Too easy to make a change (either to the user's code, or to the JIT), and have the pattern-match suddenly not work properly. Currently the JIT doesn't have a canonicalization phase for expressions, and it is prone to sometimes splitting expressions if the trees are large. Maybe some time in the future (we envision an SSA-based arch-specific peeps phase "someday") we could add the pattern match.

@redknightlois
Copy link
Author

@CarolEidt the HW instruction is 64bits but extending that to support Vector<T> is straight-forward.

Wouldn't an extension method in the System.Numerics assembly suffice to 'extend' the primitive types?

@CarolEidt
Copy link
Contributor

The problem with putting an extension method on a class in the System.Numerics assembly is that it is part of the "core", and would be dependent on a simultaneous update of the JIT and library on the desktop version. Perhaps that not a big issue. @mellinoe can you comment on that?

@redknightlois
Copy link
Author

@CarolEidt BTW just for context. If you know the size of the Vector<T> for the case of popcount you can provide hand-coded assembly performance (which probably would require a morpher to do properly in "client code") using something like this:

popcntq %r10, %r10
addq    %r10, %rcx
popcntq %r11, %r11
addq    %r11, %r9
popcntq %r14, %r14
addq    %r14, %r8
popcntq %rbx, %rbx

Which bypasses a false dependency bug in Intel HW.

A very detailed analysis of this particular issue can be found at: http://danluu.com/assembly-intrinsics/

@mellinoe
Copy link
Contributor

Perhaps that not a big issue. @mellinoe can you comment on that?

I think we'll have the same "timing issue" wherever we ship the library, since the desktop framework will need a new JIT to recognize the methhod regardless of where it is. Right now, the static Vector class is implemented in System.Numerics.Vectors on all platforms, including .NET Framework, so if we were going to include it with the rest of the SIMD support, that is where it would go, and it would share the same implementation on all platforms. I think having an "extension" to the primitive types is going to be a no-go in general however, assuming we are talking about something like this:

public static int BitCount(this long value) { ... }
public static int LeadingZeroCount(this long value) { ... }

I don't think we ever want to put extension methods on core types like that, especially from a relatively common namespace like System.Numerics. On the other hand, this is an extremely specialized operation, and I don't think we have anything quite like it, i.e. a fundamental operation on a primitive type but which is implemented separately from it / on top of it. So this may not fit anywhere cleanly in our current design guidelines.

@terrajobst Any thoughts on where an operation like this could live?

@redknightlois
Copy link
Author

@mellinoe we can still hide it a little bit. Instead of using the System.Numerics use something like System.Numerics.Intrisics, System.Numeric.Binary or even System.Intrinsics (or something along the lines). It wont show up for System.Numerics normal use case, but being available for use against fundamental types as an extension method. We can look into the rationale on Java to support it as an extension on the fundamental type to understand why they did it like that.

IMHO it makes sense it to be a fundamental operation. For example, bit rotations using ror operations now can be put in the ulong and not in the long type where such a thing doesnt exist.

@redknightlois
Copy link
Author

redknightlois commented Jun 22, 2017

@dsyme if we are going the intrisics support, I would add a few ones like prefetch, branch prediction and temporal loads and stores into the mix. They are kinda important for high-performance on certain data structures and algorithms.

It's difficult to propose an API without knowing what are the design constraints we are facing. Some ideas have been layed out on different issues like:

https://github.com/dotnet/corefx/issues/12425
https://github.com/dotnet/coreclr/issues/5025
https://github.com/dotnet/coreclr/issues/6024
https://github.com/dotnet/coreclr/issues/2725 (I kinda like the Contract.Assume.[operation] for those markers)

I remember having had a conversation probably with @mellinoe where I essentially proposed to make the low level register entities available outside of System.Numerics and then build this operations on top of those Register abstractions instead, because different instruction sets may have multiple word sizes available. And then build System.Numerics on top of that.

@tannergooding
Copy link
Member

@redknightlois, I still think something akin to your proposal here: https://github.com/dotnet/coreclr/issues/6906 (I commented my thoughts on it, herehttps://github.com/dotnet/coreclr/issues/6906#issuecomment-307164495) is probably the best route overall.

The API shape for some of these are easier than others (several of these 'fit' in a general BitManipulation class, but others like Prefetch don't really have an API fit).

@redknightlois
Copy link
Author

@tannergooding yes, from all I have witness there is agreement among the ones needing those intrinsics is that having a simple straight-to-the-metal approach with a very big you can shoot yourself in the foot warning label across the namespace would be the one that will provide flexibility to build upon. So in essense the API issues boils down to the actual static class name and method name than design abstraction per se.

@damageboy
Copy link
Contributor

I want to add to what @redknightlois wrote and say that I can't think of a single PL / environment where at the very least, when intrinsics are supported at all, they are at least supported with the straight-to-opcode approach.

MS needs not go anyfurther than revisit its own C++ compiler to witness that.

I'm all for a more generalized (a-la System.Numerics) approach for an XP experience where that make sense. But that cannot come instead of having the straight-to-opcode versions provided....

There are multiple reasons for straight-to-opcode approach:

  • Programmers already wanting to do intrinsics would probably also want to manually control unrolling and inteleaving different intrinsics to accomplish higher IPC, having simple, conventional naming allows them to actually understand more intuitively what they are about to ask of the CPU to do, and be able to meaningfully consult resources like agner.org for reference...
  • .NET, via Nuget 3.x, already supports providing different implementation of managed code for different OS/arch, thus allowing library writers that actually do care, to provide different implementations for arm/x64 etc. via these requested straight-to-opcode intrinsics
  • Users are very likely to use some base-line implementation already written in C/C++ with intrinsics as starting point for whatever they do. While I do understand the immediate urge to throw up upon seeing something like System.Unsafe.Intrinsics.x64._pdep_u64() or worse-yet: System.Unsafe.Intrinsics.x64._mm256_slli_epi64() I think it is actually the right and possibly only sane way to present these to a would be user

@tannergooding
Copy link
Member

While I do understand the immediate urge to throw up upon seeing something like System.Unsafe.Intrinsics.x64._pdep_u64() or worse-yet: System.Unsafe.Intrinsics.x64._mm256_slli_epi64()

I really hope that if such a feature is implemented, we choose better names:

  • _pdep_u64 -> DepositContiguousLowBits or DepositContiguousBits
  • _mm256_slli_epi64 -> ShiftPackedInt64 or ShiftPacked or even just Shift

No reason why we have to make it hard to read 😉

@damageboy
Copy link
Contributor

damageboy commented Jul 21, 2017

@tannergooding I understand where you are coming for, and am definitely all for having readable/meaningful names...

However, people, in this specific case, are not going to use these sorts of intrinsics with a clean slate, or at least many of them will have "prior convictions" and baggage coming from C/C++....

So while having nice meaningful names is something I would definitely like, I do strongly feel that the "ugly" names should be supported, for code portability purposes if nothing else.

C# designers has the good instinct of not breaking with C/C++ where it wasn't required previously, and this allows for easier porting of existing code when needed...

I feel the same here..., and also feel that if anything, the GCC names and coverage of intrinsics is a better starting point than MSVC....

For example, if I have the following working piece of code:

static const int32_t CHUNKMASK_SHIFT = 6;

int32_t GetKeyForIndexIntrinsicsUnrolled(int64_t index, uint64_t *bits)
{
  index++;
  

  auto p = (uint64_t *) bits;

  for (; index >= 256; p += 4)
    index -= __popcntq(p[0]) + __popcntq(p[1]) + __popcntq(p[2]) + __popcntq(p[3]);

  // As long as we are still looking for more than 64 bits
  auto prevIndex = index;
  while (index > 0) {
    prevIndex = index;
    index -= __popcntq(*(p++));
  }

  auto pos = __bsfq(_pdep_u64(1ULL << (prevIndex - 1), *(p - 1)));
  return ((p - 1 - bits) << CHUNKMASK_SHIFT) + pos;
}

The last thing I care about, is finding out the exact correct name that the CLR guys thought the __bsfq or anything else here should get.

I just want the code to work... And given that this is a very niche API, I don't see a good reason to make it pretty over functional for the target audience...

@benaadams
Copy link
Member

benaadams commented Jul 22, 2017

The C++ intrinsics don't match the asm opcodes in name anyway.

Would it not be better to match the asm descriptions and merge opcodes with overloading? Casting to a defined clr type if needed can be done via the Unsafe class As you can do for Vector2, Vector4 and Vector<T> (Vector3 is an oddity instruction-wise, though a useful one).

While at the same time, not shying away from the use of vowels, but staying away from underscore exuberance?

@benaadams
Copy link
Member

benaadams commented Jul 22, 2017

@damageboy C# version could look something like this

using System.Numerics;

const int CHUNKMASK_SHIFT = 6;

unsafe int GetKeyForIndexIntrinsicsUnrolled(long index, ulong* bits)
{
    index++;

    var p = bits;

    for (; index >= 256; p += 4)
    {
        index -= Bits.Count(p[0]) + Bits.Count(p[1]) + Bits.Count(p[2]) + Bits.Count(p[3]);
    }

    // As long as we are still looking for more than 64 bits
    var prevIndex = index;
    while (index > 0)
    {
        prevIndex = index;
        index -= Bits.Count(*(p++));
    }
    // or Bits.ScanForward(...)
    var pos = Bits.First(Bits.Scatter(1UL << (prevIndex - 1), *(p - 1)));
    return ((p - 1 - bits) << CHUNKMASK_SHIFT) + pos;
}

or with using static

using static System.Numerics.Bits;

unsafe int GetKeyForIndexIntrinsicsUnrolled(long index, ulong* bits)
{
    index++;

    var p = bits;

    for (; index >= 256; p += 4)
    {
        index -= Count(p[0]) + Count(p[1]) + Count(p[2]) + Count(p[3]);
    }

    // As long as we are still looking for more than 64 bits
    var prevIndex = index;
    while (index > 0)
    {
        prevIndex = index;
        index -= Count(*(p++));
    }
    // or ScanForward(...)
    var pos = First(Scatter(1UL << (prevIndex - 1), *(p - 1)));
    return ((p - 1 - bits) << CHUNKMASK_SHIFT) + pos;
}

@benaadams
Copy link
Member

benaadams commented Jul 22, 2017

@dsyme @redknightlois @jonathanmarston @mellinoe @damageboy @CarolEidt @russellhadley @mgravell @terrajobst

API starter for comment/feedback (example use in comment above)

namespace System.Numerics
{
    public static class Bits
    {
        // POPCNT on Intel
        public static byte Count(byte value);
        public static ushort Count(ushort value);
        public static uint Count(uint value);
        public static ulong Count(ulong value);

        // +/- shift values to rotate left and right
        public static byte Rotate(byte value, sbyte shift);
        public static short Rotate(short value, sbyte shift);
        public static int Rotate(int value, sbyte shift);
        public static long Rotate(long value, sbyte shift);

        // BSF on Intel
        public static int First(int value);
        public static int First(long value);

        // BSR on Intel
        public static int Last(int value);
        public static int Last(long value);

        // PEXT on Intel 
        public static uint Gather(uint value, uint bitMask);
        public static ulong Gather(ulong value, ulong bitMask);

        // PDEP on Intel
        public static uint Scatter(uint value, uint bitMask);
        public static ulong Scatter(ulong value, ulong bitMask);

        public static byte Crc(byte crc, byte value);
        public static short Crc(short crc, short value);
        public static int Crc(int crc, int value);
        public static long Crc(long crc, long value);

        // Byteswap
        public static short SwitchEndianness(short value);
        public static int SwitchEndianness(int value);
        public static long SwitchEndianness(long value);

        // LZCNT on Intel
        public static int LeadingZeros(int bitMask);
        public static int LeadingZeros(long bitMask);

        // TZCNT on Intel
        public static int TrailingZeros(int bitMask);
        public static int TrailingZeros(long bitMask);
    }
}

None are too exotic, so probably could have software fallbacks - not sure about detection of HW support though.

@benaadams
Copy link
Member

benaadams commented Jul 22, 2017

Perhaps to address @damageboy's concerns also have a Intrinsics Interop

namespace System.Numerics.Intrinsics
{
    public static class Interop
    {
        uint _BitScanForward(uint value) => Bits.First(value);
        ulong _BitScanForward64(ulong value) => Bits.First(value);

        uint __bsfd(uint value) => Bits.First(value);
        ulong __bsfdq(ulong value) => Bits.First(value);

        uint _pdep_u32(uint source,  uint mask) => Bits.Scatter(source, mask);
        ulong _pdep_u64(ulong source, uint mask) => Bits.Scatter(source, mask);

        int __popcnt16(ushort value) => Bits.Count(value);
        int __popcnt(uint value) => Bits.Count(value);
        int __popcnt64(uint value) => Bits.Count(value);

        int __popcntd(uint __X) => Bits.Count(__X);
        int __popcntq(ulong __X) => Bits.Count(__X);

        // ...
    }
}

Then you just need to a the header using static System.Numerics.Intrinsics.Interop; and all the C-style functions are available? So this would then be vaild:

var pos = __bsfq(_pdep_u64(1UL << (prevIndex - 1), *(p - 1)));

Or if you were MSVC rather than gcc

var pos = _BitScanForward64(_pdep_u64(1UL << (prevIndex - 1), *(p - 1)));

@damageboy
Copy link
Contributor

@benaadams Having those two versions is basically what I meant...

I like meaningful names just like any sane person, but when porting or trying to implement some paper you may be reading it just makes some sense to have the interop version around.

Few comment though

  • Wouldn't it make more sense the have the interop version in some x64/x86 namespace, along side ARM ones?
  • Can/Should he canonical Bits stay uniform across archs?
  • There is some sort of need for a more higher level CPUID wrapper that would be able to report the various capabilities to the user, as in bits that inform the user when popcnt and friends are not supported...

@benaadams
Copy link
Member

benaadams commented Jul 23, 2017

Wouldn't it make more sense the have the interop version in some x64/x86 namespace, along side ARM ones?

Yes, throw for intrinsics of wrong platform, also some are x-plat so something like?

namespace System.Numerics.Intrinsics
{
    [Flags]
    public enum CpuPlatform
    {
        x86   = 1 << 0,
        x64   = 1 << 1 | x86,

        ARM   = 1 << 8,
        ARM64 = 1 << 9 | ARM
    }

    public static class Interop
    {
        uint _BitScanForward(uint value) => Bits.First(value);
        ulong _BitScanForward64(ulong value) => Bits.First(value);

        // ...
    }
}

namespace System.Numerics.Intrinsics.x64
{
    public static class Interop
    {
        private static void ThrowPlatformNotSupportedException()
            => throw new PlatformNotSupportedException();

        private static void CheckPlatform()
        {
            if (!Environment.Is64BitProcess 
                || Environment.CpuPlatform & CpuPlatform.x64 != CpuPlatform.x64)
                ThrowPlatformNotSupportedException();
        }

        public byte _mm_crc32_u8(byte crc, byte value)
        {
            CheckPlatform();
            Bits.Crc(crc, value);
        }

        public ushort _mm_crc32_u16(ushort crc, ushort value)
        {
            CheckPlatform();
            Bits.Crc(crc, value);
        }

        public uint _mm_crc32_u32(uint crc, uint value)
        {
            CheckPlatform();
            Bits.Crc(crc, value);
        }

        public ulong _mm_crc32_u64(ulong crc, ulong value)
        {
            CheckPlatform();
            Bits.Crc(crc, value);
        }

        // ...
    }

    namespace System.Numerics.Intrinsics.x86
    {
        public static class Interop
        {
            private static void CheckPlatform()
            {
                if (Environment.CpuPlatform & CpuPlatform.x86 != CpuPlatform.x86)
                    ThrowPlatformNotSupportedException();
            }
        }
    }

    namespace System.Numerics.Intrinsics.ARM
    {
        public static class Interop
        {
            private static void CheckPlatform()
            {
                if (Environment.CpuPlatform & CpuPlatform.ARM != CpuPlatform.ARM)
                    ThrowPlatformNotSupportedException();
            }
        }
    }
}

Can/Should he canonical Bits stay uniform across archs?

Yes. They are fairly universal functions and the software fallback is well known; so I'd think they sit well as platform independent "intrinsics".

Note this is different than interop intrinsics (as above) and platform/cpu specific intrinsics that either aren't common or have a complex software fallback (e.g. encryption opcodes) - but I think that's a different discussion.

There is some sort of need for a more higher level CPUID wrapper that would be able to report the various capabilities to the user, as in bits that inform the user when popcnt and friends are not supported...

For platform independent intrinsics should be a "is hardware accelerated" check; that is branch eliminated at Jit time. Something equivalent to a readonly static; that has prechecked CPUID rather than doing the expensive check always.

For platform specific intrinsics (always same cpu opcode; though with type overloading); same mechanism but "is hardware supported"; with a branch eliminated PNS exception path (as above)

Seem sensible?

Not sure on AoT

@damageboy
Copy link
Contributor

Seems pretty sensible to me so far, yes.

Not sure on AoT

Well, there is something like the intel way of doing things in ICC where they can generate functions for several archs and then they basically do a synamic dispatch to the appropriate function.

For anything that take a considerable amount of cycles that sort of approach is both inclusive as far as compiling once, running "everywhere"...

@benaadams
Copy link
Member

For detection I was hoping there was some way to directly tie to the method/method group itself with an extension like:

namespace System.Numerics.Intrinsics
{
    public static class IntrinsicExstensions
    {
        public static bool IsHardwareAccelerated(this MethodInfo intrinsicFunction);
        public static bool IsHardwareAccelerated(this MethodGroup intrinsicFunction);
    }
}

To do

int bits;
if (Bits.Count.IsHardwareAccelerated())
{
    bits = Bits.Count(value);
}
else
{
    // ...
}

But it doesn't seem that's valid C# 😞

@damageboy
Copy link
Contributor

On the other hand

Bits.IsHardwareAccelerated(Bits.Count)

Is really not that bad

@damageboy
Copy link
Contributor

One thing I'm not really clear about in this discussion is are we talking about numeric intrinsics per-se here, or general intrinsics?

All of the examples so far are fine for System.Numerics, but if we start going into prefetching and clearing cache instructions, then this becomes something completely different... at least from the title...

Maybe the whole thing needs to become slightly wider in scope and move into some future sounding System.Runtime.Unsafe of some sort...?

@tannergooding
Copy link
Member

@damageboy, there have been a few proposal on the subject of general intrinsics (https://github.com/dotnet/coreclr/issues/6906#issuecomment-307164495).

@benaadams
Copy link
Member

benaadams commented Jul 23, 2017

All of the examples so far are fine for System.Numerics, but if we start going into prefetching and clearing cache instructions, then this becomes something completely different... at least from the title...

Maybe the whole thing needs to become slightly wider in scope and move into some future sounding System.Runtime.Unsafe of some sort...?

Prefetching and clearing cache can be inadvisable, but its not strictly unsafe..? i.e. its only performance that can go wrong, not a failure in operation.

e.g. a prefetch byref would be safe; while a prefetch by pointer would be unsafe, but both are valid

@damageboy
Copy link
Contributor

@benaadams Right, bad naming...

@tannergooding Haven't seen that one before

There seem to be a few of these slogging around...

@danmoseley
Copy link
Member

@ericstj fyi.

@redknightlois
Copy link
Author

@benaadams That C# example would work right off the bat with software fallbacks I had to build because no intrinsics (even though perf sucks :D).
@damageboy While I agree with you about most of the post, I am not in the camp of using the C++ intrinsic names which are cryptic and do not give an idea... Having the Interop.XXX version that @benaadams has also the interesting extra that we can build Roslyn "suggestions" to replace them with the appropriate API if you run into them (porting code would be a bliss that way).

@damageboy
Copy link
Contributor

@redknightlois I really like attacking this through roslyn suggestions. I'm the last person to willingly push forward the cryptic naming, but it really helps with getting stuff off the ground...

@jnm2
Copy link
Contributor

jnm2 commented Jul 29, 2017

You could make Roslyn fixes that work without the interop class too.

@redknightlois
Copy link
Author

@jnm2 the downside of that is while you are coding an algorithm that has been published, you would use the interop naming just to be able to follow the algorithm properly. Later on you move to the better notation.

@tannergooding
Copy link
Member

@fiigii
Copy link
Contributor

fiigii commented Aug 3, 2017

Intel hardware intrinsic API proposal has been opened at dotnet/corefx#22940

@jkotas jkotas closed this as completed Nov 18, 2017
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 5, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Numerics
Projects
None yet
Development

No branches or pull requests