-
Notifications
You must be signed in to change notification settings - Fork 5
Rework u128 #885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Some results on ARM64 where old is our current emulated128 and new is the one I'm building now:
|
Here are x64 results:
|
Here's where we sit with Windows (ARM64). I'm happy with the big speedup in comps and mul since those are the main things we use:
|
Me too. Great work, Matt (@mborland)! Thank you for charging ahead with this. |
Thanks. I'm going to poke at u256 afterwards. Writing ADX assembly didn't help at all here because it's only 2 adds with one carry, but maybe in the 256 bit case it'll help out? We'll see. |
I've added additional benchmarks and division is also available now. Here is ARM64 (M4 Mac):
|
Here's the complete new x64. Looks like some heuristics can be applied to DIV to bring timing down:
|
Here is a look at MSVC x64. With MUL I tried using the MSVC intrinsics, but clearly that's not the way to go so I'll pull those out. I'm not sure what's up with the DIV result Updated with generic mul:
|
All tests are passing on all platforms now so we can have better results. I think the main ones that we are worried about are speeding up comps and divs since we use big integer div for decimal64 and decimal128. |
Windows MSVC 14.3: ARM64:
X64:
|
macOS ARM64:
|
Ubuntu 24.04: X64:
S390x:
PPC64LE
|
Everything got better. Some trivially, some profoundly. This is really a nice result @mborland and I know you worked hard for this. |
Thanks. It looks like the other 64 bit platforms can take the same optimizations as x64. I need to find some good docker images for ARM32 and x86. The ubuntu ones fail to build, and that's where the emulation is important since they won't have native types. |
Hmmm. Do you/we really need to do that? I don't really see this as a clearly sensible use of development time. I think the improvements compared to earlier are clear enough? |
Probably not worth optimizing for, but checking the results are the same. I assume the majority of users are running x64 Linux or Win64 where we are sitting well. |
Looks like we're sitting good for 32 bits, and they pass all tests: I386:
Arm32v7:
|
I've made some improvements to the 64-bit platforms, and edited their posts rather than making more. We're almost universally better now. The only outstanding TODO I have is worthwhile to specialize 128 by 64 bit division? Passing it to the wide integer knuth might be overkill, especially because the first few steps of that method is to find information we already know (e.g. where is the head word, div by 0, etc). We also still have the old 128 by 32 bit division which uses 4 integer divs, and 4 integer mods. I also re-investigated speeding up the digit counting methods since that is one of the slower operations we have that is used frequently (it's the basis for normalization). Below is the benchmark of just that operation on Mac ARM64:
|
@NAThompson here are the benchmarks as run on a native ARM64 Graviton runner:
Apparently the CLZ instruction is horrible since on M4 the speedup was ~3x. I can probably get the mul timing value down using the same memcpy trick that is used on x64. |
We should check the MSVC comparison vs: https://github.com/microsoft/STL/blob/main/stl/inc/__msvc_int128.hpp. It extensively uses intrinsics which at least on ARM I found to perform worse than the regular scalar implementations. |
Here is the data with MSVC's std::_Unsigned128 as the "builtin" type:
I am going to have to see what they are doing for the div operation since it's about 2-5x faster. I'm sure we are having some loss by converting to wide-integer and back |
Took a peek into the source for MSVC. It looks like they are also using Knuth 4.3.1D, but specialized per operation e.g. 128 by 128 or 128 by 64 div. I think we can benefit from doing the same. It would also port well to a future portable u128 library since then we don't have to package all of wide-integer along with it. |
Made some simplification on the two word path:
My current concern with the benchmarks of MSVC on the other operations is that the results are not matching our numerical results. Old and new both have identical values of s (as they should), but MSVC does not. |
Using portions of wide-integer was, indeed, a stepping stone to get the right division answers. It was also a mash-up because I had to separate into half-limbs for 64-bit limbs. We can certainly do better than that in the future. And you are getting this for sure. Also 128/64 is can be specialized. The limb-counting is not needed for this known case. There will be (as you've found out) a lot more cycles to optimize in long division. I uncovered an older work I had started in 2003 that packages all of Thanks Matt (@mborland) |
I think for MSVC we need both 128 by 128 and 128 by 64. The reason for their numerical differences is their div follows |
Hi Matt (@mborland) this concerns me at tad. And I need to query. Did you find a bug in something related to my / our Knuth long-division? Or did you find what seems like a bug in a third party work? |
No. I just had the benchmarks mislabeled. What should have been one word testing was two word and vice versa. The implementations of everything is totally fine. Just changed what I needed to look. |
Right now it has macros to define operators and is entirely missing a bunch of comparison operators. This can also be the time to explore inserting intrinsics into say the mul operator
The text was updated successfully, but these errors were encountered: