Support for SSE4 intrinsics by RyuJIT #14781
Do you have any specific suggestions or use cases in mind here? If you're curious/interested in general JIT optimizations, it may be more relevant to discuss over at https://github.com/dotnet/coreclr (Runtime repo). But if there are more specific use cases that could be exposed through some sort of API or client library, that would be interesting to discuss here (and there as well, probably). |
It would be great to hear how & where this might be used. It would not be difficult to add an intrinsic, but we would probably want to avoid adding yet another configuration to support (in order to avoid exploding the test matrix) - so perhaps it would be something that could be enabled with AVX2. For the SIMD intrinsics, we have an IsHardwareAccelerated property on the Vector class that allows the developer to select a different path. Perhaps something similar could be done here, as you seem to suggest. That said, this is the first request that I've seen, so this is probably not something that would be high on our list. |
This request actually came from one particular place where I have seen the potential for insane performance differences. A popcnt-enabled select-and-rank implementation can bring massive improvements to very low-level database indexing tech. For example, this was the actual algorithm I was looking into when I opened the issue: http://link.springer.com/chapter/10.1007%2F978-3-642-38527-8_15 But that is certainly not the only place where hardware intrinsics can make a huge difference. Not long ago I required a very fast non-cryptographic hash and ended up building xxHash just because I could achieve "decent" performance without SSE bit-packing instructions. If I remember correctly, "decent" was about 70% of the memory bandwidth on my i5 (processing 2.5 GB/sec in hashes) for the 64-bit variant. That can certainly be improved with SSE operations. My biggest gripe with IsHardwareAccelerated is that it is not fine-grained enough. I wouldn't mind having specific "libraries" with Microsoft-approved JIT extensions if that helps alleviate the test matrix issues. About specific use cases, some can be found in Roslyn. We, in the managed world, know for a fact that we don't have access to low-level primitives, so we end up building stuff like this: http://odetocode.com/blogs/scott/archive/2015/02/19/roslyn-code-gems-counting-bits.aspx ... that is popcnt, the operation that motivated opening the discussion :) Another example: @stephentoub opened this issue not too long ago (https://github.com/dotnet/corefx/issues/2025); offloading the crc32 operation to a hardware intrinsic has a huge impact on commonly used framework functionality like DeflateStream. A 1.8x speedup on a general-use routine like that is not to be taken lightly. Why would I like to stick with managed code? Because the jump to unmanaged is very costly. Not long ago I was able to gain 30% just by replacing the native memcmp (all safeguards off) with unsafe managed code. 
Mainly, because the jump to unmanaged code for a tight routine that could be called billions of times in just 2 minutes makes a huge difference. I wrote a whole series about memory comparisons, up to the point of finding the best unmanaged solution (http://ayende.com/blog/169825/excerpts-from-the-ravendb-performance-team-report-optimizing-memory-compare-copy-costs). After that I could get 30% on top of that just because of how the JIT was able to optimize the call-site when going full managed (even at the expense of losing 0.6% in the general case to unmanaged code). The managed code in question: https://github.com/Corvalius/ravendb/blob/master/Raven.Sparrow/Sparrow/Memory.cs I believe that supports the use cases part. Why are there probably not many requests? I guess because asking for SIMD could be read as asking for access to special-purpose operations. Math intrinsics are just a bunch of those (very important and very welcome), but there are other types, like bit-packing and bit-manipulation instructions, that are very important in other domains; as of now not many are looking to implement high-performance code in .NET, but the introduction of SIMD and the open sourcing of the CLR (which implies support for other platforms) will certainly change that. Most of the optimization issues are related to the JIT emitting better code when it really matters: Interest in performance is out there in the requests, and many of the issues are rooted in sub-par support for dealing with unsafe code or access to exploit the hardware: And those are the ones I am tracking; I am pretty sure with some work we can dig up others. In my dream world I would be able to write memcpy|memcmp|hashes|etc routines in unsafe (but portable) managed code when I need them and compete with the fastest routines available in the C world, while continuing to write safe code with the flexibility and productivity I already have. 
I would also be able to compile specially crafted MSIL to OpenCL/Cuda too, but that is another topic :P EDIT: @CarolEidt I just noticed you said: "so perhaps it would be something that could be enabled with AVX2". If you plan to implement the whole AVX2 instruction set, I will be VERY HAPPY!!! 😃 EDIT2: More issues. |
@redknightlois - thanks! It's really helpful to have such a good articulation of the need. Just to be clear, I don't think there's any chance that I/we will implement the whole AVX2 instruction set, but just that enabling something like popcnt only for the AVX2 target (presuming, I think correctly, though I haven't verified, that AVX2 hardware would always support SSE4a) would allow us to support it without adding another target to test. Thoughts? |
I think it would be useful to offer a series of constants that act as a means of feature detection. That way I could just write: if (Feature.SupportHardwarePopCount)
{
// code that uses popcount goes here
}
else
{
// fallback code that counts bits without popcnt
} RyuJIT can then optimize away the never-visited branch, as it always does, based on the value of the constant. The current implementation is too rigid, and for some algorithms the lack of intrinsics makes it difficult or impossible to beat a non-SIMD implementation, or requires writing ugly code. There is unfortunately little in the way of documentation for writing high-performance code that plays well with the JIT. Pointer tricks that work great in C/C++ do not always get optimized as you would expect. And you're pretty much forced to go and spend lots of time doing trial and error to see if the emitted IL gets turned into quality machine code. |
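For reference, here is how the branch-on-a-constant idea above looks in native code. This is only a sketch of the concept: `kHasHardwarePopCount` and `PopCount` are made-up names, `__POPCNT__` is the macro GCC/Clang define under `-mpopcnt`, and `__builtin_popcountll` is the GCC/Clang builtin; none of this is a proposed .NET API.

```cpp
#include <cstdint>

// Hypothetical feature flag: a compile-time constant, so the compiler
// removes the branch that is never taken (the native analogue of what is
// proposed above for RyuJIT).
#ifdef __POPCNT__
constexpr bool kHasHardwarePopCount = true;
#else
constexpr bool kHasHardwarePopCount = false;
#endif

// Portable SWAR fallback, from the same family of bit tricks as the
// bit-counting routines elsewhere in this thread.
inline int PopCountFallback(uint64_t x) {
    x -= (x >> 1) & 0x5555555555555555ULL;
    x = ((x >> 2) & 0x3333333333333333ULL) + (x & 0x3333333333333333ULL);
    x = ((x >> 4) + x) & 0x0F0F0F0F0F0F0F0FULL;
    return static_cast<int>((x * 0x0101010101010101ULL) >> 56);
}

inline int PopCount(uint64_t x) {
    if (kHasHardwarePopCount)
        return __builtin_popcountll(x);  // lowers to a single POPCNT when enabled
    return PopCountFallback(x);
}
```

Because the condition is a constant, only one of the two bodies survives compilation; the comment above asks for exactly this behavior from RyuJIT.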
@mburbea I was actually thinking along those lines (even if today the response for those is always false and the code is a library call): Hardware.SSE4.IsAccelerated If we then have special cases like Hardware.SSE4.IsPopCountAccelerated for groups of functionality, it will give far greater flexibility without losing generality. |
@CarolEidt Other intrinsics that are very important for succinct and compact data structures (along with compression algorithms and indexing algorithms), while not SSE, are: counting the number of leading zeroes in a variable (byte, int, long). In GCC: __builtin_clz(); Without those you have to go and implement something like this (instead of a single CPU operation): int LeadingZeros(int x)
{
x |= (x >> 1);
x |= (x >> 2);
x |= (x >> 4);
x |= (x >> 8);
x |= (x >> 16);
return(sizeof(int)*8 -Ones(x));
}
int Ones(int x)
{
x -= ((x >> 1) & 0x55555555);
x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
x = (((x >> 4) + x) & 0x0f0f0f0f);
x += (x >> 8);
x += (x >> 16);
return(x & 0x0000003f);
} for every word size. Given that these types of operations are typically used in very hot paths, the difference from having an intrinsic is INSANE!!! :) ... There are plenty of framework places where such things are done by hand, especially the byte swapping. Having that available would be a huge win in many situations, and they shouldn't complicate the test matrix either. The good thing is that on platforms where they are not available, a force-inlined library call can be used. Either the platform supports it, or it doesn't... reverting to a library call is just fine. |
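To make the byte-swapping point concrete, here is a native sketch of the hand-written version next to the compiler builtin. `__builtin_bswap32` is a GCC/Clang builtin that lowers to a single BSWAP instruction on x86; the function names are illustrative only.

```cpp
#include <cstdint>

// Byte swap written out by hand, as it typically appears in framework code.
inline uint32_t ByteSwapManual(uint32_t x) {
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x & 0x00FF0000u) >> 8)  |
           ((x & 0xFF000000u) >> 24);
}

// Compiler builtin: a single BSWAP instruction on x86.
inline uint32_t ByteSwapBuiltin(uint32_t x) {
    return __builtin_bswap32(x);
}
```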
I'd be very interested in Hamming weight/popcnt and bitscan. On Intel they came in with Nehalem (Q4 2008) and have been in the chips since then: Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell and now Skylake; AMD since Barcelona (Q4? 2007); ARM in NEON Cortex A8/9 (2007?). So the fallback would probably be the road less taken. Probably could have better names than the intrinsics though :) |
@CarolEidt an example of where popcnt would be helpful in the aspnet code: https://github.com/aspnet/KestrelHttpServer/blob/dev/src/Microsoft.AspNet.Server.Kestrel/Http/FrameHeaders.Generated.cs#L66 Could replace with single instruction |
@benaadams I just hope that implementation is not in a hot path; it is 15x slower than the naive (shift, add, and) implementation, and almost 30x slower than the optimized one using 12 arithmetic operations and one multiply. o.O BenchmarkDotNet=v0.7.7.0
|
Another use of PopCount at CoreFX. https://github.com/dotnet/corefx/blob/master/src/System.Reflection.Metadata/src/System/Reflection/Internal/Utilities/BitArithmetic.cs |
Another required use of popcount is persistent data structures like ideal hash tries (HAMT or CHAMP). These are the foundation for very efficient immutable data structures that could provide an alternative to the current AVL-tree-based collections in the BCL. The Clojure collections, for instance, are based on HAMT. Using Hamming weight instead of native popcount drastically degrades the performance of such structures. So +100 @redknightlois |
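For readers unfamiliar with HAMT internals, the popcount dependency comes from node compression: a bitmap records which of a node's 32 logical slots are occupied, and the popcount of the bits below the target slot yields the index into the densely packed child array. A minimal sketch; names are illustrative and `std::bitset::count` stands in for the popcnt instruction:

```cpp
#include <bitset>
#include <cstdint>

// Index of logical slot `slot` (0..31) within a compressed HAMT node:
// count how many occupied slots precede it in the bitmap. With a hardware
// popcnt this whole function is a mask plus a single instruction.
inline int CompressedIndex(uint32_t bitmap, int slot) {
    uint32_t below = bitmap & ((1u << slot) - 1);  // bits strictly below `slot`
    return static_cast<int>(std::bitset<32>(below).count());
}
```

This runs on every trie descent, which is why a multi-instruction Hamming-weight fallback hurts these structures so much.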
👍 would be nice to have both variants in runtime: Without codegen, we can do something like this in native: #include <nmmintrin.h>
static inline bool HasPopcntIntrincis()
{
static bool is_capable(false), capability_tested(false);
if (capability_tested)
return is_capable;
capability_tested = true;
// see more example at https://msdn.microsoft.com/en-us/library/hskdteyh.aspx
int CPUInfo[4] = {-1};
__cpuid(CPUInfo, 0);
is_capable = (CPUInfo[2] >> 23) & 1;
return is_capable;
}
static inline int BitCountWithoutPOPCNT(uint64_t x)
{
x -= ((x >> 1) & 0x5555555555555555ULL);
x = (((x >> 2) & 0x3333333333333333ULL) + (x & 0x3333333333333333ULL));
x = (((x >> 4) + x) & 0x0F0F0F0F0F0F0F0FULL);
x *= 0x0101010101010101ULL;
return static_cast<int>(x >> 56);
}
static inline int GetBitCount(uint64_t x)
{
if(HasPopcntIntrincis()) // runtime check
return _mm_popcnt_u64(x);
return BitCountWithoutPOPCNT(x);
} Then expose GetBitCount to managed surface area. Alternatively, RyuJIT codegen can be equipped with AVX2 instruction set with fallback code to do the same thing bit more efficiently. |
Yet another place where Java is beating .NET in indexing technology because we don't have popcnt support. It is actually specifically called out as the reason for the performance improvement: Better bitmap performance with Roaring bitmaps. BTW, I cannot implement this method because I don't have the supporting HW operations. |
I'd be very interested in seeing support for popcnt. I have a project that I'm working on that heavily uses bitmaps and would benefit greatly. Right now I'm looking at needing to break down and write it in C++ instead of C#... |
I don't think that any of these requests would be difficult to implement as intrinsics. The main issue is to define the appropriate API. The "path of least resistance" would probably be to put them in System.Numerics.Vectors.dll, but I'm not sure that's the best place from a design perspective. However, to get the conversation started (and admitting up front that API design is not my field), here is a preliminary proposal for four methods that might be added to System.Numerics.Vector (the static Vector class):
This fixes the length of the "bit vector" at long, but has the attraction of simplicity. I would not be in favor of a global "Feature" class that subsumed the responsibility for all "is feature XX accelerated", because I think it is better to associate them with the class that exposes the feature. I'm not invested in the "Accelerated" suffix, but I think it would be good to have a standard naming convention for these. One issue would be what "Accelerated" means - what if there is a JIT-generated code sequence that takes multiple instructions, but is otherwise more efficient than one could do in C#/F#/IL? |
@CarolEidt I agree with you, "Accelerated" should mean "Better than, even if we do this writing the IL directly". I can try to build a few examples of how I envision such an API to work (as I have already the stock implementation for a few of the most important routines). But, I have a few questions:
|
Maybe the JIT could accelerate based on a well-known IL sequence instead of a method name. For example, the sequence
could be converted to bitcount everywhere, no matter where it is defined. There should be documentation specifying the exact patterns being accelerated. That way there is no need to define an intrinsic method in the framework assemblies at all. Each project that wants to make use of these instructions can just copy and paste this implementation and achieve accelerated performance. This is a zero surface area approach. I believe GCC and LLVM recognize these "magic" implementations and replace them with intrinsics. This is to create a portable way to implement a fast For each instruction to be exposed that way, the most common 1-3 patterns should be supported. That way user code can pick the fastest unaccelerated pattern for their case and still get it accelerated where possible. For testing feature availability there could be a method
After optimizations this should collapse to Whatever design is chosen, it should be suitable to expose many intrinsics. There are many useful x86 instructions to be exposed. Parallel extract comes to mind. It is very versatile. |
@GSPP the problem there is that you have to write a specialized morpher for such complex chains of calls (and all their variations), which costs resources at runtime, giving the JIT less time to do the rest of the work. In AoT compilers you won't care, but in JIT compilers you have to be very careful about the time it takes to handle that. The use of a library call has the advantage that now everybody will be able to use it from the same place, whether it is accelerated by HW/JIT or not. And only those that really require the performance will need to do |
@redknightlois If it is documented that only specific sequences will be supported, then shouldn't it be rather quick to check for them? This particular code that I posted starts with I believe the JIT uses a tree format internally. This should be fast to match with that. In SSA form it would be fast, too. Not sure if the JIT uses SSA, though. Anyway, this is just an idea and I'm not qualified to argue further. |
I'm not against the JIT pick up optimizations like @GSPP suggests, as long as it doesn't dramatically affect jitting time, but I'd personally like to use intrinsic methods instead. I don't want to have to copy and paste a supported bit counting algorithm, I just want to call a library method and be done. Would it be best to have intrinsic methods (or extension methods) on the primitive integer types themselves? This:
Feels like a natural extension to the framework. I also agree that there should be support for instructions larger than 64-bits. Most CPUs can at least do 256-bit instructions these days, and there are many use cases that would greatly benefit from taking advantage of that. Using |
@GSPP While the check may be simple, it introduces other problems. What would you do with a developer who just rearranges the operations to make them look nicer (nicer can be symmetrical or whatever)? If you don't want to miss those, you need to write a complex morpher like the one built to pick up bit rotations (https://github.com/dotnet/coreclr/issues/1619). And even though that one is quite simple, there were many different ways to express it (some not so obvious). In the end, it is far better to just do as @jonathanmarston suggests (which to me is the right approach) and support all basic types and |
@jonathanmarston - It may be that I have missed something, but I am not aware of an implementation of 256-bit (or even 128-bit) popcount. The x86-64 version is 64 bits only (and, a bit oddly, defined using an SSE FLT encoding, although it operates on memory and general purpose registers). I kind of like the idea of supporting a broad range of numeric types, but I don't think that extending the primitive types is really a practical approach. The easiest would be to make them static extension methods in the static |
@CarolEidt the HW instruction is 64bits but extending that to support Wouldn't an extension method in the |
The problem with putting an extension method on a class in the |
@CarolEidt BTW just for context. If you know the size of the popcntq %r10, %r10
addq %r10, %rcx
popcntq %r11, %r11
addq %r11, %r9
popcntq %r14, %r14
addq %r14, %r8
popcntq %rbx, %rbx Which bypasses a false dependency bug in Intel HW. A very detailed analysis of this particular issue can be found at: http://danluu.com/assembly-intrinsics/ |
I think we'll have the same "timing issue" wherever we ship the library, since the desktop framework will need a new JIT to recognize the method regardless of where it is. Right now, the static public static int BitCount(this long value) { ... }
public static int LeadingZeroCount(this long value) { ... } I don't think we ever want to put extension methods on core types like that, especially from a relatively common namespace like System.Numerics. On the other hand, this is an extremely specialized operation, and I don't think we have anything quite like it, i.e. a fundamental operation on a primitive type but which is implemented separately from it / on top of it. So this may not fit anywhere cleanly in our current design guidelines. @terrajobst Any thoughts on where an operation like this could live? |
@mellinoe we can still hide it a little bit. Instead of using the System.Numerics use something like IMHO it makes sense it to be a fundamental operation. For example, bit rotations using |
@dsyme if we are going for intrinsics support, I would add a few like prefetch, branch prediction, and temporal loads and stores into the mix. They are kinda important for high performance on certain data structures and algorithms. It's difficult to propose an API without knowing what design constraints we are facing. Some ideas have been laid out on different issues like: https://github.com/dotnet/corefx/issues/12425 I remember having had a conversation, probably with @mellinoe, where I essentially proposed making the low-level register entities available outside of |
@redknightlois, I still think something akin to your proposal here: https://github.com/dotnet/coreclr/issues/6906 (I commented my thoughts on it here: https://github.com/dotnet/coreclr/issues/6906#issuecomment-307164495) is probably the best route overall. The API shape for some of these is easier than for others (several of these 'fit' in a general BitManipulation class, but others like |
@tannergooding yes, from all I have witnessed, there is agreement among the ones needing those intrinsics that having a simple straight-to-the-metal approach with a very big |
I want to add to what @redknightlois wrote and say that I can't think of a single PL/environment where, when intrinsics are supported at all, they are not at least supported with the straight-to-opcode approach. MS need not go any further than revisiting its own C++ compiler to witness that. I'm all for a more generalized (a-la System.Numerics) approach for a cross-platform experience where that makes sense. But that cannot come instead of having the straight-to-opcode versions provided.... There are multiple reasons for the straight-to-opcode approach:
|
I really hope that if such a feature is implemented, we choose better names:
No reason why we have to make it hard to read 😉 |
@tannergooding I understand where you are coming from, and am definitely all for having readable/meaningful names... However, people, in this specific case, are not going to use these sorts of intrinsics with a clean slate; at least many of them will have "prior convictions" and baggage coming from C/C++.... So while having nice meaningful names is something I would definitely like, I do strongly feel that the "ugly" names should be supported, for code portability purposes if nothing else. C#'s designers had the good instinct of not breaking with C/C++ where it wasn't required, and this allows for easier porting of existing code when needed... I feel the same here..., and also feel that if anything, the GCC names and coverage of intrinsics is a better starting point than MSVC.... For example, if I have the following working piece of code: static const int32_t CHUNKMASK_SHIFT = 6;
int32_t GetKeyForIndexIntrinsicsUnrolled(int64_t index, uint64_t *bits)
{
index++;
auto p = (uint64_t *) bits;
for (; index >= 256; p += 4)
index -= __popcntq(p[0]) + __popcntq(p[1]) + __popcntq(p[2]) + __popcntq(p[3]);
// As long as we are still looking for more than 64 bits
auto prevIndex = index;
while (index > 0) {
prevIndex = index;
index -= __popcntq(*(p++));
}
auto pos = __bsfq(_pdep_u64(1ULL << (prevIndex - 1), *(p - 1)));
return ((p - 1 - bits) << CHUNKMASK_SHIFT) + pos;
} The last thing I care about, is finding out the exact correct name that the CLR guys thought the I just want the code to work... And given that this is a very niche API, I don't see a good reason to make it pretty over functional for the target audience... |
The C++ intrinsics don't match the asm opcodes in name anyway. Would it not be better to match the asm descriptions and merge opcodes with overloading? Casting to a defined clr type if needed can be done via the While at the same time, not shying away from the use of vowels, but staying away from underscore exuberance? |
@damageboy C# version could look something like this using System.Numerics;
const int CHUNKMASK_SHIFT = 6;
unsafe int GetKeyForIndexIntrinsicsUnrolled(long index, ulong* bits)
{
index++;
var p = bits;
for (; index >= 256; p += 4)
{
index -= Bits.Count(p[0]) + Bits.Count(p[1]) + Bits.Count(p[2]) + Bits.Count(p[3]);
}
// As long as we are still looking for more than 64 bits
var prevIndex = index;
while (index > 0)
{
prevIndex = index;
index -= Bits.Count(*(p++));
}
// or Bits.ScanForward(...)
var pos = Bits.First(Bits.Scatter(1UL << (prevIndex - 1), *(p - 1)));
return ((p - 1 - bits) << CHUNKMASK_SHIFT) + pos;
} or with using static System.Numerics.Bits;
unsafe int GetKeyForIndexIntrinsicsUnrolled(long index, ulong* bits)
{
index++;
var p = bits;
for (; index >= 256; p += 4)
{
index -= Count(p[0]) + Count(p[1]) + Count(p[2]) + Count(p[3]);
}
// As long as we are still looking for more than 64 bits
var prevIndex = index;
while (index > 0)
{
prevIndex = index;
index -= Count(*(p++));
}
// or ScanForward(...)
var pos = First(Scatter(1UL << (prevIndex - 1), *(p - 1)));
return ((p - 1 - bits) << CHUNKMASK_SHIFT) + pos;
} |
@dsyme @redknightlois @jonathanmarston @mellinoe @damageboy @CarolEidt @russellhadley @mgravell @terrajobst API starter for comment/feedback (example use in comment above) namespace System.Numerics
{
public static class Bits
{
// POPCNT on Intel
public static byte Count(byte value);
public static ushort Count(ushort value);
public static uint Count(uint value);
public static ulong Count(ulong value);
// +/- shift values to rotate left and right
public static byte Rotate(byte value, sbyte shift);
public static short Rotate(short value, sbyte shift);
public static int Rotate(int value, sbyte shift);
public static long Rotate(long value, sbyte shift);
// BSF on Intel
public static int First(int value);
public static int First(long value);
// BSR on Intel
public static int Last(int value);
public static int Last(long value);
// PEXT on Intel
public static uint Gather(uint value, uint bitMask);
public static ulong Gather(ulong value, ulong bitMask);
// PDEP on Intel
public static uint Scatter(uint value, uint bitMask);
public static ulong Scatter(ulong value, ulong bitMask);
public static byte Crc(byte crc, byte value);
public static short Crc(short crc, short value);
public static int Crc(int crc, int value);
public static long Crc(long crc, long value);
// BSWAP on Intel
public static short SwitchEndianness(short value);
public static int SwitchEndianness(int value);
public static long SwitchEndianness(long value);
// LZCNT on Intel
public static int LeadingZeros(int bitMask);
public static int LeadingZeros(long bitMask);
// TZCNT on Intel
public static int TrailingZeros(int bitMask);
public static int TrailingZeros(long bitMask);
}
} None are too exotic, so probably could have software fallbacks - not sure about detection of HW support though. |
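To back up the point that the software fallbacks are well known, here are portable native sketches of two of the proposed operations. The names echo the proposal above, but these are just the standard bit tricks, not actual runtime code; the convention that `TrailingZeros` returns 32 for a zero input is an assumption.

```cpp
#include <cstdint>

// Rotate left by `shift` bits (a negative shift rotates right),
// mirroring the proposed Bits.Rotate. Masking keeps the shift count in
// range and avoids undefined behavior for counts >= 32.
inline uint32_t Rotate(uint32_t value, int shift) {
    unsigned s = static_cast<unsigned>(shift) & 31u;
    return (value << s) | (value >> ((32u - s) & 31u));
}

// Software fallback for TZCNT; by convention returns 32 for zero.
inline int TrailingZeros(uint32_t value) {
    if (value == 0) return 32;
    int n = 0;
    while ((value & 1u) == 0) {
        value >>= 1;
        ++n;
    }
    return n;
}
```

Both compile down to a handful of instructions even without hardware support, which is what makes "always available, accelerated where possible" a plausible contract.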
Perhaps to address @damageboy's concerns also have a Intrinsics Interop namespace System.Numerics.Intrinsics
{
public static class Interop
{
uint _BitScanForward(uint value) => Bits.First(value);
ulong _BitScanForward64(ulong value) => Bits.First(value);
uint __bsfd(uint value) => Bits.First(value);
ulong __bsfq(ulong value) => Bits.First(value);
uint _pdep_u32(uint source, uint mask) => Bits.Scatter(source, mask);
ulong _pdep_u64(ulong source, ulong mask) => Bits.Scatter(source, mask);
int __popcnt16(ushort value) => Bits.Count(value);
int __popcnt(uint value) => Bits.Count(value);
int __popcnt64(ulong value) => (int)Bits.Count(value);
int __popcntd(uint __X) => Bits.Count(__X);
int __popcntq(ulong __X) => Bits.Count(__X);
// ...
}
} Then you just need to add the header var pos = __bsfq(_pdep_u64(1UL << (prevIndex - 1), *(p - 1))); Or if you were coming from MSVC rather than gcc var pos = _BitScanForward64(_pdep_u64(1UL << (prevIndex - 1), *(p - 1))); |
@benaadams Having those two versions is basically what I meant... I like meaningful names just like any sane person, but when porting or trying to implement some paper you may be reading, it just makes sense to have the interop version around. A few comments though:
|
Yes, throw for intrinsics of wrong platform, also some are x-plat so something like? namespace System.Numerics.Intrinsics
{
[Flags]
public enum CpuPlatform
{
x86 = 1 << 0,
x64 = 1 << 1 | x86,
ARM = 1 << 8,
ARM64 = 1 << 9 | ARM
}
public static class Interop
{
uint _BitScanForward(uint value) => Bits.First(value);
ulong _BitScanForward64(ulong value) => Bits.First(value);
// ...
}
}
namespace System.Numerics.Intrinsics.x64
{
public static class Interop
{
private static void ThrowPlatformNotSupportedException()
=> throw new PlatformNotSupportedException();
private static void CheckPlatform()
{
if (!Environment.Is64BitProcess
|| (Environment.CpuPlatform & CpuPlatform.x64) != CpuPlatform.x64)
ThrowPlatformNotSupportedException();
}
public static byte _mm_crc32_u8(byte crc, byte value)
{
CheckPlatform();
return Bits.Crc(crc, value);
}
public static ushort _mm_crc32_u16(ushort crc, ushort value)
{
CheckPlatform();
// Bits.Crc is defined on the signed types above, hence the casts
return (ushort)Bits.Crc((short)crc, (short)value);
}
public static uint _mm_crc32_u32(uint crc, uint value)
{
CheckPlatform();
return (uint)Bits.Crc((int)crc, (int)value);
}
public static ulong _mm_crc32_u64(ulong crc, ulong value)
{
CheckPlatform();
return (ulong)Bits.Crc((long)crc, (long)value);
}
// ...
}
namespace System.Numerics.Intrinsics.x86
{
public static class Interop
{
private static void CheckPlatform()
{
if ((Environment.CpuPlatform & CpuPlatform.x86) != CpuPlatform.x86)
ThrowPlatformNotSupportedException();
}
}
}
namespace System.Numerics.Intrinsics.ARM
{
public static class Interop
{
private static void CheckPlatform()
{
if ((Environment.CpuPlatform & CpuPlatform.ARM) != CpuPlatform.ARM)
ThrowPlatformNotSupportedException();
}
}
}
}
Yes. They are fairly universal functions and the software fallback is well known; so I'd think they sit well as platform independent "intrinsics". Note this is different than interop intrinsics (as above) and platform/cpu specific intrinsics that either aren't common or have a complex software fallback (e.g. encryption opcodes) - but I think that's a different discussion.
For platform-independent intrinsics there should be an "is hardware accelerated" check that is branch-eliminated at JIT time: something equivalent to a readonly static that caches a CPUID check rather than doing the expensive check every time. For platform-specific intrinsics (always the same cpu opcode, though with type overloading), same mechanism but "is hardware supported", with a branch-eliminated PNS exception path (as above). Seem sensible? Not sure about AoT |
Seems pretty sensible to me so far, yes.
Well, there is something like the Intel way of doing things in ICC, where they can generate functions for several archs and then basically do a dynamic dispatch to the appropriate function. For anything that takes a considerable number of cycles, that sort of approach is inclusive as far as compiling once and running "everywhere"... |
For detection I was hoping there was some way to directly tie to the method/method group itself with an extension like: namespace System.Numerics.Intrinsics
{
public static class IntrinsicExtensions
{
public static bool IsHardwareAccelerated(this MethodInfo intrinsicFunction);
public static bool IsHardwareAccelerated(this MethodGroup intrinsicFunction);
}
} To do int bits;
if (Bits.Count.IsHardwareAccelerated())
{
bits = Bits.Count(value);
}
else
{
// ...
} But it doesn't seem that's valid C# 😞 |
On the other hand Bits.IsHardwareAccelerated(Bits.Count) Is really not that bad |
One thing I'm not really clear about in this discussion is: are we talking about numeric intrinsics per se here, or general intrinsics? All of the examples so far are fine for Maybe the whole thing needs to become slightly wider in scope and move into some future sounding |
@damageboy, there have been a few proposal on the subject of general intrinsics (https://github.com/dotnet/coreclr/issues/6906#issuecomment-307164495). |
Prefetching and clearing cache can be inadvisable, but it's not strictly unsafe..? i.e. it's only performance that can go wrong, not a failure in operation. e.g. a prefetch byref would be safe, while a prefetch by pointer would be unsafe, but both are valid |
@benaadams Right, bad naming... @tannergooding Haven't seen that one before There seem to be a few of these slogging around... |
@ericstj fyi. |
@benaadams That C# example would work right off the bat with the software fallbacks I had to build because there are no intrinsics (even though the perf sucks :D). |
@redknightlois I really like attacking this through roslyn suggestions. I'm the last person to willingly push forward the cryptic naming, but it really helps with getting stuff off the ground... |
You could make Roslyn fixes that work without the interop class too. |
@jnm2 the downside of that is while you are coding an algorithm that has been published, you would use the interop naming just to be able to follow the algorithm properly. Later on you move to the better notation. |
Intel hardware intrinsic API proposal has been opened at dotnet/corefx#22940 |
Support for many of the interesting instructions like popcnt (technically SSE4a) could be an interesting addition and prove useful for avoiding unmanaged code in certain performance-sensitive applications.
Many (technically all) of the operations can be emulated in software when not available on the CPU, with specific optimizations for the target platform, or even with specially crafted if-then-else optimizations. That would even allow switching to an entirely different algorithm without any runtime impact (if properly done at the jitting phase).