Vector Shuffling Operations #14362
When we discussed Shuffle operations internally, this was the general roadblock we hit, and we never came up with a great solution. Our initial designs had this notion of a "BitVector" type, which could act as a sort of variable-sized mask for operating with different-sized data types. It was part of our very early designs, so I'm struggling to remember exactly how you interacted with it, for example, how you constructed it, how you modified it, etc.
How attached are you guys to this particular implementation? I feel like dynamically sized vectors are an interesting idea for specialized workloads but not tractable for more general cases. It's easy to imagine running into further issues when trying to expose things like horizontal adds. It may be worth looking at how different C++ libraries provide abstractions over SIMD; the venerable Agner Fog has a great library that I think would serve well as an API model for .NET: http://www.agner.org/optimize/vectorclass.pdf I don't mind doing this work if you think it's a good idea; I feel pretty strongly that we should get the API right now so that we're not locked out when trying to add other functionality in the future.
@MikePopoloski - There are really two usage models for SIMD vectors. Fixed-size types work well for modeling systems with a fixed number of dimensions (e.g. points in space, or colors expressed as RGB); for that we have types like Vector3, etc. Vector<T> is useful for solving large parallel problems, such as in mathematics, finance or the physical sciences, where you really want to be able to leverage all the available parallelism without having to write conditional code for different targets.
As we have Vector<double> for simple stream processing on doubles, we could have something similar for complex processing on 2D vectors. However, that would mean such a vector has to occupy two xmm registers on x86 without AVX. Maybe a bad idea.
Anyone mind if I post a real-world example to provide some feedback? This particular example is complex multiply. It's kind of interesting because it has fairly simple requirements, but is also a bit peculiar. In essence, it requires some pairwise (i.e. even/odd) permutation. Please see the attached code for a working proof of concept (using Vector<T>, but with the shuffles done scalar - the code is tested and correct, but slow).

For "vAeven", the even elements are copied to the odd positions, and for "vAodd", the odd elements are copied to the even positions. For "vBswap", it's a simple permutation ("swizzle" in the Intel lingo). The algorithm is agnostic to vector length, as long as it is a multiple of 2. In all cases, it would use the proposed Swizzle2().

Now, it turns out there is a "horizontal add" on AVX (and SSE3) which seems ready-made for complex arithmetic. If that is used, then the two permutations used to generate "vAeven" and "vAodd" can be eliminated - so if you're after the best possible performance, you'd need to consider exposing the horizontal add as well. Another possibility is exposing complex multiply as an intrinsic, since the optimal forms vary across architectures. (It's common enough, in signal processing anyway.) The precedent is that the dot product is already exposed that way, as well as some matrix operations, and "lerp".

Above, there's some discussion of whether to continue trying to abstract away register length. From the perspective of this test case, one possibility might be to add Vector8 - which would turn into two Vector4s on SSE2, but be a single register on AVX. Maybe I've just muddied the water, and complex multiply should be split off into its own issue?
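To make the data movement concrete, here is a minimal scalar sketch of the interleaved complex multiply described above, assuming the usual [re, im, re, im, ...] layout. The method name and array-based signature are illustrative only; the variable names echo the vAeven/vAodd/vBswap roles, and Swizzle2 itself appears only in comments since it is a proposed, not existing, API.

```csharp
// Scalar reference for the vectorized algorithm described above.
// a, b and c hold complex numbers interleaved as [re0, im0, re1, im1, ...].
static void ComplexMultiply(float[] a, float[] b, float[] c)
{
    for (int i = 0; i < a.Length; i += 2)
    {
        float aEven = a[i];      // vAeven: the even (real) element, duplicated across the pair
        float aOdd  = a[i + 1];  // vAodd: the odd (imaginary) element, duplicated across the pair
        float bRe   = b[i];
        float bIm   = b[i + 1];  // vBswap would present this pair as (bIm, bRe)

        c[i]     = aEven * bRe - aOdd * bIm; // real part
        c[i + 1] = aEven * bIm + aOdd * bRe; // imaginary part
    }
}
```

With a pairwise swizzle, vAeven, vAodd and vBswap become whole-vector permutes, and the loop body reduces to two multiplies plus an add/subtract pair (or a single fused addsub where the hardware has one, e.g. ADDSUBPS on SSE3).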
I think instead of having:
we could have:
The only "difference" between On a machine with SSE2, Swizzle(a, 1, 0) could behave as follow: (x0, x1) -> (x1, x0) but on a machine with AVX, it could behave like this: (x0, x1, x2, x3) -> (x1, x0, x3, x2) Thus it does not break the abstraction of Vector of not needing to know the underlying register size. |
The problem with the proposed APIs (both the one proposed by @MikePopoloski and those proposed by @tfenise) is that the hardware (at least SSE and AVX and their derivatives, but presumably other implementations as well) does not support shuffles or permutations with variable selectors. While one could imagine that the JIT could simply do constant propagation to deal with that, in the event that it could not derive constant values the code generation would be complex, and would not give the developer the performance they might expect. That is why it would be more desirable to come up with useful names for the types of permutation to be done.
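For reference, this constraint is visible in the hardware-level intrinsics that shipped later in System.Runtime.Intrinsics.X86: PSHUFD's selector is an immediate byte, so the call only lowers to a single instruction when the control value is a compile-time constant. A small sketch (the class and method names are illustrative):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ShuffleExamples
{
    static Vector128<int> ReverseLanes(Vector128<int> value)
    {
        // PSHUFD semantics: result[i] = value[(control >> (2 * i)) & 3].
        // 0x1B == 0b00_01_10_11 selects lanes (3, 2, 1, 0), i.e. a full reversal.
        return Sse2.Shuffle(value, 0x1B);
    }
}
```

If the control byte were a runtime variable instead, the JIT could not emit a single PSHUFD and would have to generate something slower, which is exactly the unpredictable-performance problem described above.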
That would seem reasonable to me. Others?
What about permutations among four or more elements? |
Shuffles and permutations of more than a couple of elements are problematic with Vector<T>, since the number of elements depends on the target.
I find it a bit more difficult to come up with good names for the permutes. It could get really cumbersome:
There are over 200 shuffles of Vector4 (4^4 = 256 possible element selections), so it might be impractical to have over 200 shuffle functions, let alone permutations. For the complex case, I would also suggest a dedicated API or an equivalent one. There could also be a shift or rotation API for the vector types. Maybe I'm asking for too much.
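As a rough illustration of the rotation idea, a software fallback could look like the following; the name RotateLanes and the array-based signature are hypothetical, used only to pin down the semantics.

```csharp
// Rotates the lanes left by n positions: result[i] = source[(i + n) % length].
// Assumes n is non-negative; an implementation over Vector<T> would map this to
// the target's permute/align instructions when n is a constant.
static T[] RotateLanes<T>(T[] source, int n)
{
    var result = new T[source.Length];
    for (int i = 0; i < source.Length; i++)
    {
        result[i] = source[(i + n) % source.Length];
    }
    return result;
}
```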
For swizzling, the field names (i.e. X, Y, Z, W) map to indices:

```csharp
struct Vector2 {
    public float this[int index] { get; } // might as well add this for symmetry
    public Vector2 this[int x, int y] { get; }
    public Vector3 this[int x, int y, int z] { get; }
    public Vector4 this[int x, int y, int z, int w] { get; }
}
```

The jitter should look to see if the numbers passed in are all constants so it (1) uses the register selection feature of some SIMD processors and (2) skips over the index-out-of-range checks. For the basic case, it should do fancy tricks like:

```csharp
// Vector3 (valid indices are 0..2)
public unsafe Vector4 this[int x, int y, int z, int w] {
    get {
        // 1. use | instead of || to avoid branching and evaluate the checks in parallel
        // 2. cast to uint so negative indices fail the same unsigned comparison
        if (unchecked(((uint)x > 2) | ((uint)y > 2) | ((uint)z > 2) | ((uint)w > 2))) {
            throw new ArgumentOutOfRangeException();
        }
        // treat the struct as if it were a float array
        fixed (float* as_array = &this.X) {
            return new Vector4(
                as_array[x],
                as_array[y],
                as_array[z],
                as_array[w]
            );
        }
    }
}
```
With some help from Roslyn (analyzers or the compiler itself) we could even do this:

```csharp
public enum Operator { X, Y, Z, W };

struct Vector2
{
    public float this[int index] { get; }
    public Vector2 this[Operator x, Operator y] { get; }
    public Vector3 this[Operator x, Operator y, Operator z] { get; }
    public Vector4 this[Operator x, Operator y, Operator z, Operator w] { get; }
}
```

And it would be used like this:

```csharp
Vector2 t = new Vector2() { X = 1, Y = 2 };
Vector4 res = t[Operator.X, Operator.X, Operator.Y, Operator.Y];
```
I believe that the following would be just as easy for the JIT to translate into a shuffle:
@CarolEidt I do like that!!! On a side note, I found this issue because I needed to know whether PSHUFB was available, to optimize the IPAddress.NetworkToHost() operation, which currently performs quite badly. We have a better version built with inline code, but PSHUFB should be the way to go (it is even faster than BSWAP). How would that work with byte vectors? Would it even be possible?
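For what it's worth, a byte reversal of each 32-bit lane (i.e. a batched network-to-host conversion) is a single PSHUFB once a constant mask is available. A hedged sketch using the System.Runtime.Intrinsics.X86 API that shipped later; the class and method names are illustrative:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class ByteSwapExample
{
    // Ssse3.Shuffle maps to PSHUFB: result[i] = value[mask[i] & 0x0F]
    // (or zero when the mask byte's high bit is set).
    static Vector128<byte> ReverseEach32BitLane(Vector128<byte> value)
    {
        Vector128<byte> mask = Vector128.Create(
            (byte)3, (byte)2, (byte)1, (byte)0,
            (byte)7, (byte)6, (byte)5, (byte)4,
            (byte)11, (byte)10, (byte)9, (byte)8,
            (byte)15, (byte)14, (byte)13, (byte)12);
        return Ssse3.Shuffle(value, mask);
    }
}
```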
@redknightlois - the byte vectors get us back into the variable-length Vector<T> issue.
@CarolEidt Couldn't the morpher figure out things like:

It will probably be hard and costly to match, but it would also ensure that we get the improvement in those cases as well.

EDIT: In the case of Vector<T>, the morpher could also look into things like this:
@redknightlois We don't have such a constructor for Vector<T>.
@CarolEidt Probably I am not getting the question, but let me elaborate on how I think it could work.
I am OK with requiring that you code the appropriate operation explicitly for each T. The rationale: let's say that I now need an XXYY shuffle for some T other than byte. I can go and build a custom operation with type guards and code every single version of it with the low-level primitives. If I need a very complex operation, I could just go and build it with the low-level optimized codepath and then compose them. For example, let's say that I want to build a movelh primitive; the code would be something like this:
You would notice that I put a plain serial fallback in there. Also, for the JIT to pick it up, there would probably be restrictions on what you can put in the parameters. If there is no match, then the code is just plain, boring, serial CPU code.

The upside is that if the morphers are able to handle the general cases for the most interesting operations - Load, Store, Shuffle, Rotations, Swizzles, Conditionals, etc. - you can build the primitive yourself and expect it to be inlined directly into your calling code. It probably sounds easier than it is, but I see it as an alternative to having to find "the proper" API to reflect things that are just not friendly to deal with in a high-level language. There are a few that would require special APIs, but I believe they could be minimal if the coverage can be done and restricted through special-purpose morphers.
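A rough sketch of the plain serial fallback for a movelh-style primitive; the name MoveLowToHigh, the array-based signature, and the even-length assumption are illustrative.

```csharp
// MOVLHPS-style behavior: the low half of 'a' stays put and the low half of 'b'
// is moved into the high half of the result.
static float[] MoveLowToHigh(float[] a, float[] b)
{
    int half = a.Length / 2; // assumes a.Length == b.Length and the length is even
    var result = new float[a.Length];
    for (int i = 0; i < half; i++)
    {
        result[i] = a[i];
        result[half + i] = b[i];
    }
    return result;
}
```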
@CarolEidt A potential shuffle selection would also be the four presented in this NVIDIA presentation: http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf

```csharp
public Vector<T> ShuffleDown(Vector<T> v, int n);
public Vector<T> ShuffleUp(Vector<T> v, int n);
public Vector<T> ShuffleXor(Vector<T> v);
public Vector<T> Shuffle(Vector<T> v, Vector<int> index);
```
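To pin down what ShuffleDown would mean here, a scalar sketch of its semantics with an array standing in for Vector<T>; the out-of-range behavior (lanes past the end keep their own value, as on the GPU) is an assumption carried over from the presentation.

```csharp
// Lane i receives the value from lane i + n; lanes that would read past the end
// keep their own value.
static T[] ShuffleDown<T>(T[] v, int n)
{
    var result = new T[v.Length];
    for (int i = 0; i < v.Length; i++)
    {
        result[i] = (i + n < v.Length) ? v[i + n] : v[i];
    }
    return result;
}
```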
@mellinoe & @CarolEidt we should chat. It seems we need a more holistic plan for exposing more intrinsics from our library.
@terrajobst it would be great to nudge this one along. |
Going to close this for now. This API needs some more thought about how shuffling would work with regard to variable-length vectors (when exposed as a general-purpose API). If someone feels strongly about this, please feel free to open a new issue. Our API review process is detailed here, with a "good example" being listed under step 1. Ideally the new issue would provide many of the sections given, but it would at a minimum provide the base rationale and proposed API surface sections. For this API in particular, the exposure of the System.Runtime.Intrinsics types and the methods for converting to/from Vector<T> should also be taken into account.
As mentioned in #14281, a lot of useful SIMD code relies on shuffling data around within registers. There are a whole bunch of instructions available to do this, so we'd like to expose a coherent subset of them via Vector<T>.
Rationale and Usage
Simple stream processing, such as the operations already supported on Vector<T>, is entirely parallel and thus doesn't benefit from shuffling operations. However, lots of common uses of SIMD do require it, such as geometric vector operations like matrix multiplies and vector cross products. These are key scenarios for certain users, such as game developers.
If rearranging elements is required for an algorithm and fast hardware support is not provided, the cost of loading and storing to and from SIMD registers might exceed any benefits gained from using SIMD in the first place.
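As a concrete instance of the geometric case mentioned above, a cross product is two swizzles, two multiplies and a subtract. The swizzle identity is shown in the comment, and the scalar body below (equivalent to what System.Numerics.Vector3.Cross computes) spells out the same data movement element by element; the class wrapper is only there to make the sketch self-contained.

```csharp
using System.Numerics;

static class CrossExample
{
    // Swizzle form: cross(a, b) = a.yzx * b.zxy - a.zxy * b.yzx
    // With shuffle support, each .yzx / .zxy is a single register permute; without it,
    // the components have to be extracted and recombined in scalar code as below.
    static Vector3 Cross(Vector3 a, Vector3 b)
    {
        return new Vector3(
            a.Y * b.Z - a.Z * b.Y,  // x component of a.yzx * b.zxy - a.zxy * b.yzx
            a.Z * b.X - a.X * b.Z,  // y component
            a.X * b.Y - a.Y * b.X); // z component
    }
}
```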
When looking at existing libraries that wrap SIMD functionality at a low level, it's apparent that there are a few main use cases of shuffle:
Rather than expose a single shuffle method directly, I think it makes sense to go with this slightly higher-level functionality so that the API doesn't become too tied to a particular architecture's instruction set.
Proposed API
New methods on the Vector<T> struct:
New methods on the Vector class that are simple wrappers around the Vector<T> methods.
Details and Questions