maxmath.cbrt returns NaN for an input value of 0 #16

chadefranklin · 2023-10-02T03:43:34Z

As per the title of the issue, maxmath.cbrt() returns NaN for an input of 0. This is the case for the method both Bursted and non-Bursted.

I also wanted your opinion on an issue with the explicit fnmadd_ps instructions. Is it wise to use these instructions directly rather than let Burst handle using fused instructions automatically when using FloatMode.Fast (which it does do, though differently it seems)? For my use case, I need to use FloatMode.Strict and the explicit use of fused instructions sort of goes against that. Maybe it would be worth having a separate fastcbrt() method.

Edit:
Looking further into the fused instruction issue, it seems using the fused instructions directly leads to less optimal code than without? Could you take a closer look at this?

MrUnbelievable92 · 2023-10-05T01:34:55Z

'cbrt(0f)' returning NaN (and 'rcbrt(0f/d)' not returning NaN) is indeed an issue - fixed in the upcoming release.

We looked at the generated code. Just to recap:

Regarding explicit 'fmadd' and friends: It's always advantageous to use it when possible. Unfortunately Unity.Burst still does not support deterministic compilation so this library will neither.
Indeed LLVM sometimes chooses to generate more RISC like code (explicit load and broadcast instead of memory operands). This generates more instructions but the performance remains the same, except for maybe instruction decode bottlenecks. Additionally, the use of 'Xse.fdadd' i.e. "fake" fused-divide-add uses the rcpps instruction + some Newton-Raphson, replacing a divps instruction with 5 instructions. The latency is a little higher on the most recent CPUs but the throughput is much higher, which is relevant for tight loops, which is the only place where performance matters anyways.

## Known Issues - `half8` `==` and `!=` operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... *hint*) - `bool` vectors generated from operations on non-`(s)byte` vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties - `float8` `min()` and `max()` functions don't handle NaNs the same way Unity.Mathematics does ## Fixes - Fixed XML documentation not showing descriptions for valid `Promise` flags - Fixed `cminmax` documentation - `bitmask64` with `numBits` equal to 64 now correctly returns a bitmask with all 64 bits set if not compiling for Bmi1 i.e. AVX2 - Fixed `uint8` to `float8` type conversion if compiling for AVX2 - Fixed incorrect `mod` implementations - (ISSUE #16) Fixed `float` and `double` `(r)cbrt` edge cases (+/-0, Infinity and NaN). Additionally, the scalar- and vector `float` implementation now returns accurate results for subnormal numbers. Performance is affected negatively yet minimally (~2 clock cycles, + ~10 instructions); new valid `Promise` flags allow for call-site selection of faster code paths ## Additions # `Divider<T>` `Divider<T>` is an opaque OOP-like struct which performs fast integer division and modulo operations as well as divisibility checks For any divisor of any scalar- or vector integer type `T`, a `Divider<T>` instance replaces division operations by multiplication-, shift- and rounding operations, utilizing the most suitable of 2 algorithms, typically used by compilers for compile time constant divisors. `Divider<T>` was carefully crafted in a way that allows for complete compile-time evaluation of constant divisors of all types in Burst compiled code `Divider<T>` is NOT meant to replace divison operations; a (notable) performance gain is only to be expected in case the same divisor is used multiple times, or when multiple divisors are computed at once, utilizing SIMD (for instance, when a very predictable `i` is the divisor in a for-loop) Numerous `Promise` flags allow for faster operations, provided that the `Divider<T>` instance is both initialized and used in the same block of Burst compiled code and not loaded from RAM The implementation is pseudo-generic and only works for integer types known to MaxMath. Furthermore, Bursts inabilty to compile-time evaluate `typeof(T)` often requires explicit initialization (example: `new Divider<byte>((byte)42))`. `DEBUG` build only validity checks ensure correct initialization and usage The current Divider<T> API consists of...: - `/` and `%` operators: - LHS: scalar RHS: Divider<scalar>: requires both scalars to be of the same type; returns a scalar of the that type - LHS: vector RHS: Divider<vector>: requires both vectors to be of the same type; returns a vector of the that type - LHS: scalar RHS: Divider<vector>: requires the vector type to contain integers of the scalar type; returns an instance of the vector type - LHS: vector RHS: Divider<scalar>: requires the vector type to contain integers of the scalar type; returns an instance of the vector type - `DivRem` member methods - `EvenlyDivides` member methods - `T Divisor` as a readonly property - `public const` `Promise`s within `Divider<T>`, documenting valid promise flags with appropriate naming, starting with "PROMISE_" - `Get/SetInnerDivider<U>` methods: get or set a scalar- or vector `Divider<U>` within a `Divider<T>` - Component shuffles: `Divider<T>.wzxy` swizzle "operators" as properties NOTE: `Get/SetInnerDivider<U>` methods and `Divider<T>.[a][b][c][d]` properties will change in the future. Due to current limitations regarding C# generics, swizzle operators only take in or return the same type the respective property is a member of, i.e. you cannot use these to get a `Divider<int2>` from a `Divider<int4>`. `Get/SetInnerDivider<U>` are placeholderholders both for these operations as well as for the `v[a]_[b]` properties for vectors with 8 or more components. [C# will at some point get more complex type extension language support](dotnet/csharplang#5811), at which point this API will change. # quadruple (PREVIEW) Analogous to (U)Int128, this library now supports 128 bit floating point operations with its respective software-implemented type. It is fully IEEE754 compliant and in the typical 1 sign bit, 15 exponent bits, 113 mantissa bits format. NOTE: `quadruple` is in preview for an unforseeable amount of time. This means that it is neither completely optimized, nor are all `maxmath` functions available for it at this time. The following functions have been implemented: `ToString` and `Parse` (no perfect roundtrip guaranteed), All constants (example: PI_QUAD), `Random128` `NextQuadruple` (optionally with min and max values), all type conversions except for `decimal`, `-`(unary), `+`(binary), `-`(binary), `*`, `/`, `%`, `==`, `!=`, `<`, `<=`, `>`, `>=`, `fmod`, `mad`, `msub`, `rcp`, `isnan`, `isinf`, `isfinite`, `isnormal`, `issubnormal`, `round`, `floor`, `ceil`, `trunc`, `roundtoint` (and all other integer variations), `fastsqrt`, `(r)sqrt`, `(r)cbrt`, `isinrange`, `approx`, `select`, `compareto`, `min`, `max`, `copysign`, `nextgreater`, `nextsmaller`, `nexttoward`, `radians`, `degrees`, `chgsign` # Functions - Added `isnormal` and `issubnormal` functions for floating point types - Added `hypot` and `inthypot` functions for calculating `[int]sqrt(a * a + b * b)` without overflow, unless an optional `Promise` parameter with its `NoOverflow` flag set is passed as a compile time constant argument - Added `roundto(s)byte/(u)short/(u)int/(u)long/(U)Int128`. These take in floating point values of any type and convert them to the respective integer scalar- or vector type while rounding toward the nearest integer - Added `cor` and `cxor`. These reduce vectors of a given integer type to a scalar integer of that type by applying bitwise OR or XOR operations between each element - Split `approx` into two overloads: one with a custom tolerance parameter (the old version) and one without, which calculates an appropriate tolerance instead - Added `roundmultiple(x, m)`, `floormultiple(x, m)`, `ceilmultiple(x, m)` and `truncmultiple(x, m)` for all types, rounding x toward the nearest multiple of any positive m with the selected rounding mode (for example: ceilmultiple rounds x toward the nearest _greater_ multiple of m) - Added a whole stack of bit manipulation functions for all scalar- and vector integer types: `parityodd`, `parityeven`, `countzerobits`, `l1cnt`, `t1cnt`, `lzmask`, `tzmask`, `l1mask`, `t1mask`, `bits_extractlowest0`, `bits_masktolowest`, `bits_masktolowest0`, `bits_maskfromlowest`, `bits_maskfromlowest0`, `bits_setlowest`, `bits_surroundlowest` and `bits_surroundlowest0` # Global Compilation Options - Added Global Compilation Options for `OptimizeFor`, `FloatMode` and `FloatPrecision`. A proposal for compile-time access to job-specific options has been forwarded to the Burst team and is on their backlog. For now, these global options are dependency-injection-style placeholders and thus hard-coded to `OptimizeFor.Performance`, `FloatMode.Default` and `FloatPrecision.Standard`, respectively, and can be customized within the source code itself at .../MaxMath/Runtime/Compiler Extensions/Compilation Options.cs ## Improvements # Meta - This library now fully supports ARM CPUs' SIMD instructions (huge!). It utilizes [SSE2NEON](https://github.com/DLTcollab/sse2neon) and [SIMDe](https://github.com/simd-everywhere/simde) to convert x86 SIMD instructions to ARM SIMD instructions or instruction sequences. Because of this, generated ARM code will sometimes remain slightly unoptimized, because the author is unable to verify correctness of ARM specific optimizations with unit tests in most cases. # Performance - Implemented optimized `(u)long` vector to `float` vector type convesion operators - Implemented the execution of two loop bodies in one for functions that use loop-based algorithms, when a vector type wider than 128 bits is used without compiling for AVX(2) - Implemented an `AssumeRangeAttribute` equivalent for all vectorized functions with known return value ranges - Implemented more optimal `(U)Int128` comparison operators - Implemented optimal `(U)Int128` multiplication operations with- and division and modulo operations by compile time constants - Implemented optimal `(U)Int128` division and modulo operations by replacing a loop algorithm with straight line code. Because Burst does not expose the hardware-supported 128x64 narrowing division instruction as an intrinsic, this instruction, which is fundamentally important to the algorithm, is implemented with fallback code. A highly optimized (speed & size) native DLL written in Windows x86-64 assembly containing the most optimal implementation of any varation of 128 bit integer division was added to utilize this hardware instruction. This does mean that 128 bit integer division now results in a function call that cannot be inlined, yet the performance gain is worth it. Additionally, the C#/assembly interface was carefully crafted to avoid calling external functions partially or even entirely by utilizing `Unity.Burst.CompilerServices.Constant.IsConstantExpression<T>()` - Increased valid `Promise.Unsafe0` range for `(u)long` `intcbrt` from [0, 2^46} to [0, 2^48] - Added an optional `Promise` parameter to `gamma` - Added an optional `Promise` parameter to `erf(c)` - Added an optional `Promise` parameter to `gcd` and `lcm` - Added `quarter` and `half` scalar- and vector function overloads for `min`, `max`, `minmax`, `clamp`, `saturate`, `isinrange`, `trunc`, `round`, `ceil`, `floor` and `sign` - Removed the only non-optimizing branch in vector code in the entire library within the `long2/3/4` `>>` operator if the shift amount is not a compile time constant for a ~30% performance gain - Reduced branch predictor penalty of `gcd` by moving the loop condition to an earlier part of the loops code - Reduced the latency of internal `byte` to `uint`/`float` and `ushort` to `ulong`/`double` vector zero-extension conversions (and back) if an entire SIMD register is converted, by ~6 cycles. This operation is especially relevant in `byte` vector `shl` and `shr` operations, which are very common throughout the library, including within loops. Consequently, this improvement should yield significant performance gains - Reduced latency of `gcd` within the loop by 1 to 8 clock cycles at the cost of 0 to 9 clock cycles outside the loop if compiling for SSE4 or higher, by adding 1 to 6 instructions. - Reduced latency of `comb` for all scalar and vector types. 1) `(s)byte`, `(u)short` and scalar `(u)int` overloads now make use of an algorithm that only requires one division per loop iteration instead of two, if the type fits into a single hardware register when cast to the next wider integer type. This algorithm is more than twice as fast. 2) Implemented `Divider<T>` into both algorithms. This yields a massive performance gain of 2.3x to 20x (!), at the cost of an immense amount of code size 3) The first eight loop iterations of both algorithms have been extracted from the loop and optimized by hand. This yields a performance gain between ~15 (16 bit) and ~600 (64 bit) clock cycles. Both 2) and 3) are disabled if the global compilation tag `OptimizeFor` is set to `OptimizeFor.Size`. - Reduced latency of `half{X}` to `(s)byte[X}`, `(u)short{X}`, `(u)long{X}`, and `half8` to `(u)int8` type conversion from 10 down to 4 or 5 cycles, also affecting managed C# fallback performance. Unity.Mathematics' implementation remains suboptimal (for now... *hint*) - Reduced latency of SSE2 fallback code for `ulong` vector `<` and `>` operators by 1 cycle and reduced code size by removing 1 instruction - Reduced latency of `(s)byte` vector `floorlog2`/`ceillog2` by 1 cycle and reduced code size by removing 1 to 2 instructions, as well as 1 constant read from RAM - Reduced latency of vector `long` `(n)abs` by 0 to 2 cycles (highly CPU specific) and reduced code size by removing 1 to 2 instructions - Reduced latency of `all_dif` overloads for all `(s)byte` vectors and `short16` by 10% to 20%; the `(s)byte` overloads now use ~35% less constant data read from RAM - Reduced latency of `(s)byte16/32` division and modulo operations by 6 cycles by adding 6 instructions - Reduced latency of vector `float` to `(u)long` type conversion by 5 to 10 cycles - Reduced latency of `quarter`/`half` to integer type conversion operations by 6 or more cycles - Reduced latency, code size and constant data read from RAM of SSE2 fallback code for `(s)byte` vector `shrl`, `shra` and `shl` drastically when the shift amount is constant - Reduced latency of vector `rol`/`ror` drastically when the rotation value is constant and a multiple of 8 - Reduced latency of `(u)int8` and `(u)long` vector `l/tzcnt` by up to 2 cycles and removed 1 instruction - Reduced latency of `float8` to `uint8` conversion operator by ~6 cycles if compiling for AVX2 - Reduced latency of scalar- and vector `double` `fastrsqrt` by 4 cycles and increased accuracy of the result - Reduced latency of `(u)long` `intsqrt` by reducing the latency of each loop interation by 1 (signed) or 2 (unsigned) cycles, at the cost of an extra 5 (unsigned) or 6 (signed) instructions with a latency of 3 cycles outside the loop - Reduced latency of scalar `(u)int` `intlog10` by 1 cycle and removed 1 instruction - Reduced latency of vector `double` to `(u)long` type conversion operators by 1 cycle and removed 5 instructions - Reduced latency of previously internal `roundto(u)long(double x)` by ~10 cycles, also reducing `(u)long` vector division latency by ~20 cycles down to ~94 cycles - Reduced latency of squaring `(u)long` vectors by 2 to 3 cycles and removed 2 instructions, also reducing `(u)long` vector `inpow` latency, as squaring is part of the loop - Reduced latency of all floating point `nexttoward` functions by 1 cycle - Reduced latency of `(u)long` vector multiplication by scalar- or vector integer types with bit-width less than or equal to 32 by 2 cycles and removed 1 instruction - Reduced latency of signed 8 bit integer division by 1 cycle if compiling for SSSE3 or higher - Reduced latency of managed- and SSE2 fallback code paths for integer vector `isinrange(x, min, max)` function overloads by ~2 cycles - Reduced latency of floating point `isinrange(x, min, max)` function overloads by ~7 cycles - Reduced latency of `float`/`double` to `(U)Int128` type conversion drastically (hundreds of cycles...!) - Reduced code size of vector `bits_depositparallel` and `bits_extractparallel` by removing 1 instruction if compiling for SSE4 or higher - Reduced code size of vectorized `perm` by removing 3 instructions - Reduced code size of vectorized `intpow` by removing 5 or more instructions - Reduced code size of ushort8 to float8 type conversion by removing 1 instruction - Added SSE2 fallback code for `(u)long2/3/4` to `(s)byte2/3/4` type conversion - Added SSE2 fallback code for `countbits` - Added SSE2 fallback code for `mulsaturated` for `int2/3/4` - Added AVX fallback code for `float8` and `double3/4` `gamma` - Added AVX fallback code for `float8` and `double3/4` `erf(c)` - Added AVX fallback code for `double3/4` `(r)cbrt` ## Changes - Bumped C# Dev Tools dependency to version 1.0.9 - Bumped Unity.Mathematics dependency to version 1.3.1 and implemented `square` for all types, `chgsign` for floating point types and the constants `TAU` (present since 2020, updated documentation), `PI2`, `PIHALF`, `TODEGREES` and `TORADIANS` - Bumped Unity.Burst dependency to version 1.8.18 - Moved `Promise` type declaration from `MaxMath.maxmath` (class) to `MaxMath` (namespace) - Two-parameter `all_dif` overloads are no longer a wrapper for `math.all(a != b)`; these now return true if and only if both vectors do not share any components with each other - `double` to `(u)long` type conversion outside of the `(u)long` domain is now undefined behavior (just like in the C standard) - the Mono runtime changed its behavior with an upgrade to some Unity 2022 version and is more performance intensive to follow "correctly" now. Reduced latency by 2 to 4 cycles, removed up to 5 instructions and 3 pieces of constant data read from RAM - `rol`/`ror` return values are now undefined if the rotation amount is outside [0, BITS) (compliant with Unity.Mathematics). Reduced latency by up to 2 cycles if the rotation amount is not a compile time constant ## Fixed Oversights - Added `asbyte(sbyte{X})`, `assbyte(byte{X})`, `asshort(ushort{X})`, `asushort(short{X})`, `asint(uint8`), `asuint(int8)`, `aslong(ulong{X})`, `asulong(long{X})` - Added `minmax(a, b, out min, out max)` for scalar types; Improved managed fallback performance also - Added `minmaxmag(a, b, out minmag, out maxmag)` for scalar types; Improved managed fallback performance also - Added `bitmask` overloads for `bool2` and `bool3`, an oversight by the Unity.Mathematics team - Added missing `(u)long` matrix from/to `(u)int` matrix conversion operators - Added C# unboxing casts to `(U)Int128` `Equals(object other)` - `countbits(uint8 x)` now returns an `int8` ## Upcoming The next release, 3.0, is going to be a major deviation from the previous premise of this library being an extension to Unity.Mathematics. It is going to become a wrapper library instead, replacing Unity.Mathematics eniterly, while retaining the things that work well and cannot be done with MaxMath and Burst alone. The most important guideline is an easy transition from either Unity.Mathematics alone or in combination with MaxMath. The `maxmath` class will be renamed to `math`; a simple case sensitive find&replace ("Unity.Mathematics" -> "MaxMath" and "maxmath" -> "math") will be enough to migrate to this new version. Reasons for this are a) Performance: Integer division/mod is not vectorized in Unity.Mathematics - just as an example. More dramatically Unity.Mathematics' `bool` vectors perform poorly and a solution is at hand. b) Unity.Mathematics is not updated anymore. Thus we cannot expect certain fixes, like `half` types not being IEEE754 compliant at all. Release 3.0 is going to be released as both an extension- and as a wrapper library. It is only going to focus on extra test coverage and possible bugfixes, i.e. there are not going to be any improvements or additions. Questions are welcome. Please use the issue tracker or seek contact on [Discord](https://discord.gg/ur5z3pW8)

MrUnbelievable92 · 2024-10-26T19:27:30Z

Fixed in 2.9.0

MrUnbelievable92 self-assigned this Oct 5, 2023

MrUnbelievable92 added the bug Something isn't working label Oct 5, 2023

MrUnbelievable92 mentioned this issue Oct 26, 2024

Version 2.9.0 #18

Merged

MrUnbelievable92 closed this as completed Oct 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

maxmath.cbrt returns NaN for an input value of 0 #16

maxmath.cbrt returns NaN for an input value of 0 #16

chadefranklin commented Oct 2, 2023 •

edited

Loading

MrUnbelievable92 commented Oct 5, 2023

MrUnbelievable92 commented Oct 26, 2024

maxmath.cbrt returns NaN for an input value of 0 #16

maxmath.cbrt returns NaN for an input value of 0 #16

Comments

chadefranklin commented Oct 2, 2023 • edited Loading

MrUnbelievable92 commented Oct 5, 2023

MrUnbelievable92 commented Oct 26, 2024

chadefranklin commented Oct 2, 2023 •

edited

Loading