@c Copyright @copyright{} 2022 Richard Stallman and Free Software Foundation, Inc.
This is part of the GNU C Intro and Reference Manual
and covered by its license.
@node Floating Point in Depth
@chapter Floating Point in Depth
@menu
* Floating Representations::
* Floating Type Specs::
* Special Float Values::
* Invalid Optimizations::
* Exception Flags::
* Exact Floating-Point::
* Rounding::
* Rounding Issues::
* Significance Loss::
* Fused Multiply-Add::
* Error Recovery::
@c * Double-Rounding Problems::
* Exact Floating Constants::
* Handling Infinity::
* Handling NaN::
* Signed Zeros::
* Scaling by the Base::
* Rounding Control::
* Machine Epsilon::
* Complex Arithmetic::
* Round-Trip Base Conversion::
* Further Reading::
@end menu
@node Floating Representations
@section Floating-Point Representations
@cindex floating-point representations
@cindex representation of floating-point numbers
@cindex IEEE 754-2008 Standard
Storing numbers as @dfn{floating point} allows representation of
numbers with fractional values, in a range larger than that of
hardware integers. A floating-point number consists of a sign bit, a
@dfn{significand} (also called the @dfn{mantissa}), and a power of a
fixed base. GNU C uses the floating-point representations specified by
the @cite{IEEE 754-2008 Standard for Floating-Point Arithmetic}.
The IEEE 754-2008 specification defines basic binary floating-point
formats of five different sizes: 16-bit, 32-bit, 64-bit, 128-bit, and
256-bit. The formats of 32, 64, and 128 bits are used for the
standard C types @code{float}, @code{double}, and @code{long double}.
GNU C supports the 16-bit floating point type @code{_Float16} on some
platforms, but does not support the 256-bit floating point type.
Each of the formats begins with a sign bit. After this comes an
exponent that specifies a power of 2 (with a fixed offset). Then
comes the significand.
The first bit of the significand, before the binary point, is always
1, so there is no need to store it in memory. It is called the
@dfn{hidden bit} because it doesn't appear in the floating-point
number as used in the computer itself.
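As a small illustration (a sketch, not part of the format definition),
the standard @code{frexp} function and the @samp{%a} output conversion
let a program look at the significand and the power of 2 of a
@code{double} value:
@example
#include <math.h>
#include <stdio.h>

int main (void)
@{
  double x = 6.5;               /* 1.625 * 2**2, exactly representable */
  int exponent;
  double significand = frexp (x, &exponent);

  /* frexp scales the significand into [0.5, 1); the stored IEEE
     form uses [1, 2) with the hidden bit.  */
  printf ("%g = %g * 2**%d\n", x, significand, exponent);
  printf ("hexadecimal form: %a\n", x);
  return 0;
@}
@end example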
All of those floating-point formats are sign-magnitude representations,
so +0 and @minus{}0 are different values.
Besides the IEEE 754 format 128-bit float, GNU C also offers a format
consisting of a pair of 64-bit floating point numbers. This lacks the
full exponent range of the IEEE 128-bit format, but is useful when the
underlying hardware platform does not support that.
@node Floating Type Specs
@section Floating-Point Type Specifications
The standard library header file @file{float.h} defines a number of
constants that describe the platform's implementation of
floating-point types @code{float}, @code{double} and @code{long
double}. They include:
@findex FLT_MIN
@findex DBL_MIN
@findex LDBL_MIN
@findex FLT_HAS_SUBNORM
@findex DBL_HAS_SUBNORM
@findex LDBL_HAS_SUBNORM
@findex FLT_TRUE_MIN
@findex DBL_TRUE_MIN
@findex LDBL_TRUE_MIN
@findex FLT_MAX
@findex DBL_MAX
@findex LDBL_MAX
@findex FLT_DECIMAL_DIG
@findex DBL_DECIMAL_DIG
@findex LDBL_DECIMAL_DIG
@table @code
@item FLT_MIN
@itemx DBL_MIN
@itemx LDBL_MIN
Defines the minimum normalized positive floating-point values that can
be represented with the type.
@item FLT_HAS_SUBNORM
@itemx DBL_HAS_SUBNORM
@itemx LDBL_HAS_SUBNORM
Defines whether the floating-point type supports subnormal (or
``denormalized'') numbers (@pxref{subnormal numbers}).
@item FLT_TRUE_MIN
@itemx DBL_TRUE_MIN
@itemx LDBL_TRUE_MIN
Defines the minimum positive values (including subnormal values) that
can be represented with the type.
@item FLT_MAX
@itemx DBL_MAX
@itemx LDBL_MAX
Defines the largest values that can be represented with the type.
@item FLT_DECIMAL_DIG
@itemx DBL_DECIMAL_DIG
@itemx LDBL_DECIMAL_DIG
Defines the number of decimal digits @code{n} such that any
floating-point number that can be represented in the type can be
rounded to a floating-point number with @code{n} decimal digits, and
back again, without losing any precision of the value.
@end table
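For instance, here is a small program (a sketch, assuming a C11 or
later @file{float.h} that defines @code{DBL_TRUE_MIN} and
@code{DBL_DECIMAL_DIG}) that prints a few of these constants for the
@code{double} type:
@example
#include <float.h>
#include <stdio.h>

int main (void)
@{
  printf ("DBL_MIN         = %g\n", DBL_MIN);
  printf ("DBL_TRUE_MIN    = %g\n", DBL_TRUE_MIN);
  printf ("DBL_MAX         = %g\n", DBL_MAX);
  printf ("DBL_DECIMAL_DIG = %d\n", DBL_DECIMAL_DIG);
  return 0;
@}
@end example
@noindent
On IEEE 754 systems, @code{DBL_DECIMAL_DIG} is 17.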
@node Special Float Values
@section Special Floating-Point Values
@cindex special floating-point values
@cindex floating-point values, special
IEEE floating point provides for special values that are not ordinary
numbers.
@table @asis
@item infinities
@code{+Infinity} and @code{-Infinity} are two different infinite
values, one positive and one negative. These result from
operations such as @code{1 / 0}, @code{Infinity + Infinity},
@code{Infinity * Infinity}, and @code{Infinity + @var{finite}}, and also
from a result that is finite, but larger than the most positive possible
value or smaller than the most negative possible value.
@xref{Handling Infinity}, for more about working with infinities.
@item NaNs (not a number)
@cindex QNaN
@cindex SNaN
There are two special values, called Not-a-Number (NaN): a quiet
NaN (QNaN), and a signaling NaN (SNaN).
A QNaN is produced by operations for which the value is undefined
in real arithmetic, such as @code{0 / 0}, @code{sqrt (-1)},
@code{Infinity - Infinity}, and any basic operation in which an
operand is a QNaN.
The signaling NaN is intended for initializing
otherwise-unassigned storage, and the goal is that unlike a
QNaN, an SNaN @emph{does} cause an interrupt that can be caught
by a software handler, diagnosed, and reported. In practice,
little use has been made of signaling NaNs, because the most
common CPUs in desktop and portable computers fail to implement
the full IEEE 754 Standard, and supply only one kind of NaN, the
quiet one. Also, programming-language standards have taken
decades to catch up to the IEEE 754 standard, and implementations of
those language standards add a further delay before programmers
become willing to use these features.
To enable support for signaling NaNs, use the GCC command-line option
@option{-fsignaling-nans}, but this is an experimental feature and may
not work as expected in every situation.
A NaN has a sign bit, but its value means nothing.
@xref{Handling NaN}, for more about working with NaNs.
@item subnormal numbers
@cindex subnormal numbers
@cindex underflow, floating
@cindex floating underflow
@anchor{subnormal numbers}
It can happen that a computed floating-point value is too small to
represent, such as when two tiny numbers are multiplied. The result
is then said to @dfn{underflow}. The traditional behavior before
the IEEE 754 Standard was to use zero as the result, and possibly to report
the underflow in some sort of program output.
The IEEE 754 Standard is vague about whether rounding happens
before detection of floating underflow and overflow, or after, and CPU
designers may choose either.
However, the Standard does something unusual compared to earlier
designs, and that is that when the result is smaller than the
smallest @dfn{normalized} representable value (i.e., one in
which the leading significand bit is @code{1}), the normalization
requirement is relaxed, leading zero bits are permitted, and
precision is gradually lost until there are no more bits in the
significand. That phenomenon is called @dfn{gradual underflow},
and it serves important numerical purposes, although it does
reduce the precision of the final result. Some floating-point
designs allow you to choose at compile time, or even at
run time, whether underflows are gradual, or are flushed abruptly
to zero. Numbers that have entered the region of gradual
underflow are called @dfn{subnormal}; a short demonstration
appears after this table.
You can use the library functions @code{fesetround} and
@code{fegetround} to set and get the rounding mode. Rounding modes
are defined (if supported by the platform) in @code{fenv.h} as:
@code{FE_UPWARD} to round toward positive infinity; @code{FE_DOWNWARD}
to round toward negative infinity; @code{FE_TOWARDZERO} to round
toward zero; and @code{FE_TONEAREST} to round to the nearest
representable value, the default mode. It is best to use
@code{FE_TONEAREST} except when there is a special need for some other
mode.
@end table
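Here is a short demonstration of gradual underflow (a sketch, assuming
IEEE 754 @code{double} and the C99 @code{fpclassify} macro from
@code{math.h}): starting from the smallest normalized value, scaling
down repeatedly produces subnormal values, and finally zero.
@example
#include <float.h>
#include <math.h>
#include <stdio.h>

int main (void)
@{
  double x = DBL_MIN;           /* smallest normalized positive double */

  while (x > 0.0)
    @{
      const char *kind;
      switch (fpclassify (x))
        @{
        case FP_NORMAL:    kind = "normal";    break;
        case FP_SUBNORMAL: kind = "subnormal"; break;
        default:           kind = "other";     break;
        @}
      printf ("%-10s %a\n", kind, x);
      x /= 0x1p20;              /* drop 20 bits of magnitude per step */
    @}
  return 0;
@}
@end example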
@node Invalid Optimizations
@section Invalid Optimizations
@cindex invalid optimizations in floating-point arithmetic
@cindex floating-point arithmetic invalid optimizations
Signed zeros, Infinity, and NaN invalidate some optimizations by
programmers and compilers that might otherwise have seemed obvious:
@itemize @bullet
@item
@code{x + 0} and @code{x - 0} are not the same as @code{x} when
@code{x} is zero, because the result depends on the rounding rule.
@xref{Rounding}, for more about rounding rules.
@item
@code{x * 0.0} is not the same as @code{0.0} when @code{x} is
Infinity, a NaN, or negative zero.
@item
@code{x / x} is not the same as @code{1.0} when @code{x} is Infinity,
a NaN, or zero.
@item
@code{(x - y)} is not the same as @code{-(y - x)} because when the
operands are finite and equal, one evaluates to @code{+0} and the
other to @code{-0}.
@item
@code{x - x} is not the same as @code{0.0} when @var{x} is Infinity or
a NaN.
@item
@code{x == x} and @code{x != x} are not equivalent to @code{1} and
@code{0} when @var{x} is a NaN.
@item
@code{x < y} and @code{isless (x, y)} are not equivalent, because the
first sets a sticky exception flag (@pxref{Exception Flags}) when an
operand is a NaN, whereas the second does not affect that flag. The
same holds for the other @code{isxxx} functions that are companions to
relational operators. @xref{FP Comparison Functions, , , libc, The
GNU C Library Reference Manual}.
@end itemize
The @option{-funsafe-math-optimizations} option enables
these optimizations.
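Here is a small program (an illustrative sketch, assuming IEEE 754
arithmetic and compiled @emph{without} that option) that demonstrates
a few of the cases above:
@example
#include <math.h>
#include <stdio.h>

int main (void)
@{
  double inf = INFINITY;
  double qnan = NAN;

  /* x * 0.0 is not 0.0 when x is Infinity: the result is a NaN.  */
  printf ("inf * 0.0      = %g\n", inf * 0.0);

  /* x == x is false when x is a NaN.  */
  printf ("qnan == qnan   = %d\n", qnan == qnan);

  /* When x and y are finite and equal, (x - y) is +0
     but -(y - x) is -0.  */
  printf ("(1.0 - 1.0)    = %g\n", 1.0 - 1.0);
  printf ("-(1.0 - 1.0)   = %g\n", -(1.0 - 1.0));
  return 0;
@}
@end example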
@node Exception Flags
@section Floating Arithmetic Exception Flags
@cindex floating arithmetic exception flags
@cindex exception flags (floating point)
@cindex sticky exception flags (floating point)
@cindex floating overflow
@cindex overflow, floating
@cindex floating underflow
@cindex underflow, floating
@dfn{Sticky exception flags} record the occurrence of particular
conditions: once set, they remain set until the program explicitly
clears them.
The conditions include @emph{invalid operand},
@emph{division-by-zero}, @emph{inexact result} (i.e., one that
required rounding), @emph{underflow}, and @emph{overflow}. Some
extended floating-point designs offer several additional exception
flags. The functions @code{feclearexcept}, @code{feraiseexcept},
@code{fetestexcept}, @code{fegetexceptflags}, and
@code{fesetexceptflags} provide a standardized interface to those
flags. @xref{Status bit operations, , , libc, The GNU C Library
Reference Manual}.
One important use of those @anchor{fetestexcept}flags is to do a
computation that is normally expected to be exact in floating-point
arithmetic, but occasionally might not be, in which case, corrective
action is needed. You can clear the @emph{inexact result} flag with a
call to @code{feclearexcept (FE_INEXACT)}, do the computation, and
then test the flag with @code{fetestexcept (FE_INEXACT)}; the result
of that call is 0 if the flag is not set (there was no rounding), and
nonzero when there was rounding (which, we presume, implies the
program has to correct for that).
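Here is that pattern as a small program (a sketch, assuming C99
@code{fenv.h}; the @code{volatile} qualifier keeps the compiler from
doing the division at compile time, where the flag would not be set):
@example
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main (void)
@{
  volatile double three = 3.0;
  double x;

  feclearexcept (FE_INEXACT);
  x = 1.0 / three;              /* this division must round */
  if (fetestexcept (FE_INEXACT))
    printf ("rounding occurred; x = %.17g\n", x);
  else
    printf ("the result is exact; x = %.17g\n", x);
  return 0;
@}
@end example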
@c =====================================================================
@ignore
@node IEEE 754 Decimal Arithmetic
@section IEEE 754 Decimal Arithmetic
@cindex IEEE 754 decimal arithmetic
One of the difficulties that users of computers for numerical
work face, whether they realize it or not, is that the computer
does not operate in the number base that most people are familiar
with. As a result, input decimal fractions must be converted to
binary floating-point values for use in computations, and then
the final results converted back to decimal for humans. Because the
precision is finite and limited, and because algorithms for correct
round-trip conversion between number bases were not known until the
1990s, and are still not implemented on most systems and most
programming languages, the result is frequent confusion for users,
when a simple expression like @code{3.0*(1.0/3.0)} evaluates to
0.999999 instead of the expected 1.0. Here is an
example from a floating-point calculator that allows rounding-mode
control, with the mode set to @emph{round-towards-zero}:
@example
for (k = 1; k <= 10; ++k)
(void)printf ("%2d\t%10.6f\n", k, k*(1.0/k))
1 1.000000
2 1.000000
3 0.999999
4 1.000000
5 0.999999
6 0.999999
7 0.999999
8 1.000000
9 0.999999
10 0.999999
@end example
Increasing working precision can sometimes help by reducing
intermediate rounding errors, but the reality is that when
fractional values are involved, @emph{no amount} of extra
precision can suffice for some computations. For example, the
nice decimal value @code{1/10} in C99-style binary representation
is @code{+0x1.999999999999ap-4}; that final digit @code{a} is the
rounding of an infinite string of @code{9}'s.
Financial computations in particular depend critically on correct
arithmetic, and the losses due to rounding errors can be
large, especially for businesses with large numbers of small
transactions, such as grocery stores and telephone companies.
Tax authorities are particularly picky, and demand specific
rounding rules, including one that instead of rounding ties to
the nearest number, rounds instead in the direction that favors
the taxman.
Programming languages used for business applications, notably the
venerable Cobol language, have therefore always implemented
financial computations in @emph{fixed-point decimal arithmetic}
in software, and because of the large monetary amounts that must be
processed, successive Cobol standards have increased the minimum
number size from 18 to 32 decimal digits, and the most recent one
requires a decimal exponent range of at least @code{[-999, +999]}.
The revised IEEE 754-2008 standard therefore requires decimal
floating-point arithmetic, as well as the now-widely used binary
formats from 1985. Like the binary formats, the decimal formats
also support Infinity, NaN, and signed zero, and the five basic
operations are also required to produce correctly rounded
representations of infinitely precise exact results.
However, the financial applications of decimal arithmetic
introduce some new features:
@itemize @bullet
@item
There are three decimal formats occupying 32, 64, and 128 bits of
storage, and offering exactly 7, 16, and 34 decimal digits of
precision. If one size has @code{n} digits, the next larger size
has @code{2 n + 2} digits. Thus, a future 256-bit format would
supply 70 decimal digits, and at least one library already
supports the 256-bit binary and decimal formats.
@item
Decimal arithmetic has an additional rounding mode, called
@emph{round-ties-to-away-from-zero}, meaning that a four-digit
rounding of @code{1.2345} is @code{1.235}, and @code{-1.2345}
becomes @code{-1.235}. That rounding mode is mandated by
financial laws in several countries.
@item
The decimal significand is an
@anchor{decimal-significand}@emph{integer}, instead of a fractional
value, and trailing zeros are only removed at user request. That
feature allows floating-point arithmetic to emulate the
@emph{fixed-point arithmetic} traditionally used in financial
computations.
@end itemize
@noindent
We can easily estimate how many digits are likely to be needed for
financial work: seven billion people on Earth, with an average annual
income of less than US$10,000, means a world financial base that can
be represented in just 15 decimal digits. Even allowing for alternate
currencies, future growth, multiyear accounting, and intermediate
computations, the 34 digits supplied by the 128-bit format are more
than enough for financial purposes.
We return to decimal arithmetic later in this chapter
(@pxref{More on decimal floating-point arithmetic}), after we have
covered more about floating-point arithmetic in general.
@c =====================================================================
@end ignore
@node Exact Floating-Point
@section Exact Floating-Point Arithmetic
@cindex exact floating-point arithmetic
@cindex floating-point arithmetic, exact
As long as the numbers are exactly representable (fractions whose
denominator is a power of 2), and intermediate results do not require
rounding, then floating-point arithmetic is @emph{exact}. It is easy
to predict how many digits are needed for the results of arithmetic
operations:
@itemize @bullet
@item
addition and subtraction of two @var{n}-digit values with the
@emph{same} exponent require at most @code{@var{n} + 1} digits, but
when the exponents differ, many more digits may be needed;
@item
multiplication of two @var{n}-digit values requires exactly
2 @var{n} digits;
@item
although integer division produces a quotient and a remainder of
no more than @var{n} digits, floating-point remainder and square
root may require an unbounded number of digits, and the quotient
can need many more digits than can be stored.
@end itemize
Whenever a result requires more than @var{n} digits, rounding
is needed.
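As a small illustration (a sketch, assuming binary IEEE 754
@code{double}), sums of fractions whose denominators are powers of 2
are exact and compare as expected, while decimal fractions such as
@code{0.1} are rounded on input and generally are not:
@example
#include <stdio.h>

int main (void)
@{
  /* 0.5, 0.25, and 0.75 are exact binary fractions.  */
  printf ("0.5 + 0.25 == 0.75 : %d\n", 0.5 + 0.25 == 0.75);

  /* 0.1 and 0.2 are rounded on input, and their sum is rounded
     again, so it differs from the (also rounded) constant 0.3.  */
  printf ("0.1 + 0.2 == 0.3   : %d\n", 0.1 + 0.2 == 0.3);
  return 0;
@}
@end example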
@c =====================================================================
@node Rounding
@section Rounding
@cindex rounding
When floating-point arithmetic produces a result that can't fit
exactly in the significand of the type that's in use, it has to
@dfn{round} the value. The basic arithmetic operations---addition,
subtraction, multiplication, division, and square root---always produce
a result that is equivalent to the exact, possibly infinite-precision
result rounded to storage precision according to the current rounding
rule.
Rounding sets the @code{FE_INEXACT} exception flag (@pxref{Exception
Flags}). This enables programs to determine that rounding has
occurred.
Rounding consists of adjusting the exponent to bring the significand
back to the required base-point alignment, then applying the current
@dfn{rounding rule} to squeeze the significand into the fixed
available size.
The current rule is selected at run time from four options. Here they
are:
@itemize *
@item
@emph{round-to-nearest}, with ties rounded to an even integer;
@item
@emph{round-up}, towards @code{+Infinity};
@item
@emph{round-down}, towards @code{-Infinity};
@item
@emph{round-towards-zero}.
@end itemize
Under those four rounding rules, a decimal value
@code{-1.2345} that is to be rounded to a four-digit result would
become @code{-1.234}, @code{-1.234}, @code{-1.235}, and
@code{-1.234}, respectively.
The default rounding rule is @emph{round-to-nearest}, because that has
the least bias, and produces the lowest average error. When the true
result lies exactly halfway between two representable machine numbers,
the result is rounded to the one that ends with an even digit.
The @emph{round-towards-zero} rule was common on many early computer
designs, because it is the easiest to implement: it just requires
silent truncation of all extra bits.
The two other rules, @emph{round-up} and @emph{round-down}, are
essential for implementing @dfn{interval arithmetic}, whereby
each arithmetic operation produces lower and upper bounds that
are guaranteed to enclose the exact result.
@xref{Rounding Control}, for details on getting and setting the
current rounding mode.
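As a quick illustration here (a sketch, assuming the platform defines
the four rounding-mode macros in @code{fenv.h}), the same inexact
division gives results that differ in the last place under the
different rules:
@example
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main (void)
@{
  static const int modes[] =
    @{ FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO @};
  static const char *const names[] =
    @{ "to-nearest", "upward", "downward", "toward-zero" @};
  volatile double ten = 10.0;   /* volatile prevents constant folding */
  int i;

  for (i = 0; i < 4; ++i)
    @{
      fesetround (modes[i]);
      printf ("%-12s 1.0 / 10.0 = %.20g\n", names[i], 1.0 / ten);
    @}
  fesetround (FE_TONEAREST);    /* restore the default mode */
  return 0;
@}
@end example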
@node Rounding Issues
@section Rounding Issues
@cindex rounding issues (floating point)
@cindex floating-point rounding issues
The default IEEE 754 rounding mode minimizes errors, and most
normal computations should not suffer any serious accumulation of
errors from rounding.
Of course, you can contrive examples where that is not so. Here
is one: iterate a square root, then attempt to recover the
original value by repeated squaring.
@example
#include <stdio.h>
#include <math.h>
int main (void)
@{
  double x = 100.0;
  double y;
  int n, k;

  for (n = 10; n <= 100; n += 10)
    @{
      y = x;
      for (k = 0; k < n; ++k) y = sqrt (y);
      for (k = 0; k < n; ++k) y *= y;
      printf ("n = %3d; x = %.0f\ty = %.6f\n", n, x, y);
    @}
  return 0;
@}
@end example
@noindent
Here is the output:
@example
n = 10; x = 100 y = 100.000000
n = 20; x = 100 y = 100.000000
n = 30; x = 100 y = 99.999977
n = 40; x = 100 y = 99.981025
n = 50; x = 100 y = 90.017127
n = 60; x = 100 y = 1.000000
n = 70; x = 100 y = 1.000000
n = 80; x = 100 y = 1.000000
n = 90; x = 100 y = 1.000000
n = 100; x = 100 y = 1.000000
@end example
After 50 iterations, @code{y} has barely one correct digit, and
soon after, there are no correct digits.
@c =====================================================================
@node Significance Loss
@section Significance Loss
@cindex significance loss (floating point)
@cindex floating-point significance loss
A much more serious source of error in floating-point computation is
@dfn{significance loss} from subtraction of nearly equal values. This
means that the number of bits in the significand of the result is
fewer than the size of the value would permit. If the values being
subtracted are close enough, but still not equal, a @emph{single
subtraction} can wipe out all correct digits, possibly contaminating
all future computations.
Floating-point calculations can sometimes be carefully designed so
that significance loss is not possible, such as summing a series where
all terms have the same sign. For example, the Taylor series
expansions of the trigonometric and hyperbolic sines have terms of
identical magnitude, of the general form @code{@var{x}**(2*@var{n} +
1) / (2*@var{n} + 1)!}. However, those in the trigonometric sine series
alternate in sign, while those in the hyperbolic sine series are all
positive. Here is the output of two small programs that sum @var{k}
terms of the series for @code{sin (@var{x})}, and compare the computed
sums with known-to-be-accurate library functions:
@example
x = 10 k = 51
s (x) = -0.544_021_110_889_270
sin (x) = -0.544_021_110_889_370
x = 20 k = 81
s (x) = 0.912_945_250_749_573
sin (x) = 0.912_945_250_727_628
x = 30 k = 109
s (x) = -0.987_813_746_058_855
sin (x) = -0.988_031_624_092_862
x = 40 k = 137
s (x) = 0.617_400_430_980_474
sin (x) = 0.745_113_160_479_349
x = 50 k = 159
s (x) = 57_105.187_673_745_720_532
sin (x) = -0.262_374_853_703_929
// sinh(x) series summation with positive signs
// with k terms needed to converge to machine precision
x = 10 k = 47
t (x) = 1.101_323_287_470_340e+04
sinh (x) = 1.101_323_287_470_339e+04
x = 20 k = 69
t (x) = 2.425_825_977_048_951e+08
sinh (x) = 2.425_825_977_048_951e+08
x = 30 k = 87
t (x) = 5.343_237_290_762_229e+12
sinh (x) = 5.343_237_290_762_231e+12
x = 40 k = 105
t (x) = 1.176_926_334_185_100e+17
sinh (x) = 1.176_926_334_185_100e+17
x = 50 k = 121
t (x) = 2.592_352_764_293_534e+21
sinh (x) = 2.592_352_764_293_536e+21
@end example
@noindent
We have added underscores to the numbers to enhance readability.
The @code{sinh (@var{x})} series with positive terms can be summed to
high accuracy. By contrast, the series for @code{sin (@var{x})}
suffers increasing significance loss, so that when @var{x} = 30 only
two correct digits remain. Soon after, all digits are wrong, and the
answers are complete nonsense.
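Here is a sketch (our own reconstruction, not the exact program that
produced the output above) of such a term-by-term summation for
@code{sin (@var{x})}; each term is built from the previous one, and
the alternating signs are what cause the cancellation:
@example
#include <math.h>
#include <stdio.h>

/* Sum the Taylor series of sin(x) until the terms no longer
   change the sum, then compare with the library function.  */
static double
sin_series (double x)
@{
  double term = x;              /* first term: x**1 / 1! */
  double sum = x;
  int n;

  for (n = 1; n < 1000; ++n)
    @{
      /* next term = -previous * x*x / ((2n) * (2n + 1)) */
      term *= -(x * x) / ((2.0 * n) * (2.0 * n + 1.0));
      double new_sum = sum + term;
      if (new_sum == sum)       /* converged to machine precision */
        break;
      sum = new_sum;
    @}
  return sum;
@}

int main (void)
@{
  double x;
  for (x = 10.0; x <= 50.0; x += 10.0)
    printf ("x = %2.0f  series = %.15g  sin = %.15g\n",
            x, sin_series (x), sin (x));
  return 0;
@}
@end example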
An important skill in numerical programming is to recognize when
significance loss is likely to contaminate a computation, and revise
the algorithm to reduce this problem. Sometimes, the only practical
way to do so is to compute in higher intermediate precision, which is
why the extended types like @code{long double} are important.
@c Formerly mentioned @code{__float128}
@c =====================================================================
@node Fused Multiply-Add
@section Fused Multiply-Add
@cindex fused multiply-add in floating-point computations
@cindex floating-point fused multiply-add
In 1990, when IBM introduced the POWER architecture, the CPU
provided a previously unknown instruction, the @dfn{fused
multiply-add} (FMA). It computes the value @code{x * y + z} with
an @strong{exact} double-length product, followed by an addition with a
@emph{single} rounding. Numerical computation often needs pairs of
multiply and add operations, for which the FMA is well-suited.
On the POWER architecture, there are two dedicated registers that
hold permanent values of @code{0.0} and @code{1.0}, and the
normal @emph{multiply} and @emph{add} instructions are just
wrappers around the FMA that compute @code{x * y + 0.0} and
@code{x * 1.0 + z}, respectively.
In the early days, it appeared that the main benefit of the FMA
was getting two floating-point operations for the price of one,
almost doubling the performance of some algorithms. However,
numerical analysts have since shown numerous uses of the FMA for
significantly enhancing accuracy. We discuss one of the most
important ones in the next section.
A few other architectures have since included the FMA, and most
provide variants for the related operations @code{x * y - z}
(FMS), @code{-x * y + z} (FNMA), and @code{-x * y - z} (FNMS).
@c The IEEE 754-2008 revision requires implementations to provide
@c the FMA, as a sixth basic operation.
The functions @code{fmaf}, @code{fma}, and @code{fmal} implement fused
multiply-add for the @code{float}, @code{double}, and @code{long
double} data types. Correct implementation of the FMA in software is
difficult, and some systems that appear to provide those functions do
not satisfy the single-rounding requirement. That situation should
change as more programmers use the FMA operation, and more CPUs
provide FMA in hardware.
Use the @option{-ffp-contract=fast} option to allow generation of FMA
instructions, or @option{-ffp-contract=off} to disallow it.
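The following fragment (an illustrative sketch, assuming IEEE 754
@code{double} and a correctly rounded @code{fma}, and compiled with
@option{-ffp-contract=off} so that the compiler does not itself
contract the first expression) shows the difference: the separately
rounded product loses its low-order part, while the fused operation
preserves it:
@example
#include <math.h>
#include <stdio.h>

int main (void)
@{
  volatile double x = 1.0 + 0x1p-30;   /* 1 + 2**-30 */

  /* The exact square is 1 + 2**-29 + 2**-60; the rounded product
     drops the 2**-60 term, so the subtraction cannot recover it.  */
  double separate = x * x - 1.0;
  double fused    = fma (x, x, -1.0);

  printf ("x*x - 1.0        = %a\n", separate);
  printf ("fma (x, x, -1.0) = %a\n", fused);
  return 0;
@}
@end example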
@c =====================================================================
@node Error Recovery
@section Error Recovery
@cindex error recovery (floating point)
@cindex floating-point error recovery
When two numbers are combined by one of the four basic
operations, the result often requires rounding to storage
precision. For accurate computation, one would like to be able
to recover that rounding error. With historical floating-point
designs, it was difficult to do so portably, but now that IEEE
754 arithmetic is almost universal, the job is much easier.
For addition with the default @emph{round-to-nearest} rounding
mode, we can determine the error in a sum like this:
@example
volatile double err, sum, tmp, x, y;

if (fabs (x) >= fabs (y))
  @{
    sum = x + y;
    tmp = sum - x;
    err = y - tmp;
  @}
else /* fabs (x) < fabs (y) */
  @{
    sum = x + y;
    tmp = sum - y;
    err = x - tmp;
  @}
@end example
@noindent
@cindex twosum
Now, @code{x + y} is @emph{exactly} represented by @code{sum + err}.
This basic operation, which has come to be called @dfn{twosum}
in the numerical-analysis literature, is the first key to tracking,
and accounting for, rounding error.
To determine the error in subtraction, just swap the @code{+} and
@code{-} operators.
We used the @code{volatile} qualifier (@pxref{volatile}) in the
declaration of the variables, which forces the compiler to store and
retrieve them from memory, and prevents the compiler from optimizing
@code{err = y - ((x + y) - x)} into @code{err = 0}.
For multiplication, we can compute the rounding error without
magnitude tests with the FMA operation (@pxref{Fused Multiply-Add}),
like this:
@example
volatile double err, prod, x, y;
prod = x * y; /* @r{rounded product} */
err = fma (x, y, -prod); /* @r{exact product @code{= @var{prod} + @var{err}}} */
@end example
For addition, subtraction, and multiplication, we can represent the
exact result with the notional sum of two values. However, the exact
result of division, remainder, or square root potentially requires an
infinite number of digits, so we can at best approximate it.
Nevertheless, we can compute an error term that is close to the true
error: it is just that error value, rounded to machine precision.
For division, you can approximate @code{x / y} with @code{quo + err}
like this:
@example
volatile double err, quo, x, y;
quo = x / y;
err = fma (-quo, y, x) / y;
@end example
For square root, we can approximate @code{sqrt (x)} with @code{root +
err} like this:
@example
volatile double err, root, x;
root = sqrt (x);
err = fma (-root, root, x) / (root + root);
@end example
With the reliable and predictable floating-point design provided
by IEEE 754 arithmetic, we now have the tools we need to track
errors in the five basic floating-point operations, and we can
effectively simulate computing in twice working precision, which
is sometimes sufficient to remove almost all traces of arithmetic
errors.
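As an example of putting those error terms to work (a sketch of
standard compensated summation, not a technique specific to GNU C),
the rounding error recovered at each step of a running sum can be fed
back into the next step, so the final result is nearly as accurate as
if the sum had been computed in twice the working precision:
@example
#include <stdio.h>

/* Kahan compensated summation: carry the rounding error of each
   addition along in c and fold it into the next term.  */
static double
compensated_sum (const double *a, int n)
@{
  volatile double sum = 0.0, c = 0.0, y, t;
  int i;

  for (i = 0; i < n; ++i)
    @{
      y = a[i] - c;             /* corrected next term */
      t = sum + y;              /* new sum, rounded */
      c = (t - sum) - y;        /* rounding error of that addition */
      sum = t;
    @}
  return sum;
@}

int main (void)
@{
  static double a[1000001];
  double plain = 0.0;
  int i;

  a[0] = 1.0;
  for (i = 1; i <= 1000000; ++i)
    a[i] = 1.0e-16;             /* terms too small for plain summation */
  for (i = 0; i <= 1000000; ++i)
    plain += a[i];

  printf ("plain sum       = %.17g\n", plain);
  printf ("compensated sum = %.17g\n", compensated_sum (a, 1000001));
  return 0;
@}
@end example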
@c =====================================================================
@ignore
@node Double-Rounding Problems
@section Double-Rounding Problems
@cindex double-rounding problems (floating point)
@cindex floating-point double-rounding problems
Most developers today use 64-bit x86 processors with a 64-bit
operating system, with a Streaming SIMD Extensions (SSE) instruction
set. In the past, using a 32-bit x87 instruction set was common,
leading to issues described in this section.
To offer a few more digits of precision and a wider exponent range,
the IEEE 754 Standard included an optional @emph{temporary real}
format, with 11 more bits in the significand, and 4 more bits in the
biased exponent.
Compilers are free to exploit the longer format, and most do so.
That is usually a @emph{good thing}, such as in computing a
lengthy sum or product, or in implementing the computation of the
hypotenuse of a right-triangle as @code{sqrt (x*x + y*y)}: the
wider exponent range is critical there for avoiding disastrous
overflow or underflow.
@findex fesetprec
@findex fegetprec
However, sometimes it is critical to know what the intermediate
precision and rounding mode are, such as in tracking errors with
the techniques of the preceding section. Some compilers provide
options to prevent the use of the 80-bit format in computations
with 64-bit @code{double}, but ensuring that code is always
compiled that way may be difficult. The x86 architecture has the
ability to force rounding of all operations in the 80-bit
registers to the 64-bit storage format, and some systems provide
a software interface with the functions @code{fesetprec} and
@code{fegetprec}. Unfortunately, neither of those functions is
defined by the ISO Standards for C and C++, and consequently,
they are not standardly available on all platforms that use
the x86 floating-point design.
When @code{double} computations are done in the 80-bit format,
results necessarily involve a @dfn{double rounding}: first to the
64-bit significand in intermediate operations in registers, and
then to the 53-bit significand when the register contents are
stored to memory. Here is an example in decimal arithmetic where
such a double rounding results in the wrong answer: round
@code{1_234_999} from seven to five to four digits. The result is
@code{1_235_000}, whereas the correct representation to four
significant digits is @code{1_234_000}.
@cindex -ffloat-store
One way to reduce the use of the 80-bit format is to declare variables
as @code{volatile double}: that way, the compiler is required to store
and load intermediates from memory, rather than keeping them in 80-bit
registers over long sequences of floating-point instructions. Doing
so does not, however, eliminate double rounding. The now-common
x86-64 architecture has separate sets of 32-bit and 64-bit
floating-point registers. The option @option{-ffloat-store} says that
floating-point computation should use only those registers, thus eliminating
the possibility of double rounding.
@end ignore
@c =====================================================================
@node Exact Floating Constants
@section Exact Floating-Point Constants
@cindex exact specification of floating-point constants
@cindex floating-point constants, exact specification of
One of the frustrations that numerical programmers have suffered
with since the dawn of digital computers is the inability to
precisely specify numbers in their programs. On the early
decimal machines, that was not an issue: you could write a
constant @code{1e-30} and be confident of that exact value
being used in floating-point operations. However, when the
hardware works in a base other than 10, then human-specified
numbers have to be converted to that base, and then converted
back again at output time. The two base conversions are rarely
exact, and unwanted rounding errors are introduced.
@cindex hexadecimal floating-point constants
C addresses this problem with hexadecimal floating-point constants,
which specify a binary value exactly. For example,
@code{+0x1.fffffcp-1} is the IEEE 754 32-bit value equal to
@code{1 - 0x1p-23}, just below @code{1.0}. The significand is
represented as a hexadecimal fraction, and the @emph{power of two} is
written in decimal following the exponent letter @code{p} (the
traditional exponent letter @code{e} is not possible, because it is a
hexadecimal digit).
In @code{printf} and @code{scanf} and related functions, you can use
the @samp{%a} and @samp{%A} format specifiers for writing and reading
hexadecimal floating-point values. @samp{%a} writes them with lower
case letters and @samp{%A} writes them with upper case letters. For
instance, this code reproduces our sample number:
@example
printf ("%a\n", 1.0 - pow (2.0, -23));
@print{} 0x1.fffffcp-1
@end example
@noindent
The @code{strtod} family was similarly extended to recognize
numbers in that new format.
If you want to ensure exact data representation for transfer of
floating-point numbers between C programs on different
computers, then hexadecimal constants are an optimum choice.
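For example (a small sketch), @code{strtod} accepts the same notation
on input, so a value written with @samp{%a} can be read back without
any rounding:
@example
#include <stdlib.h>
#include <stdio.h>

int main (void)
@{
  double x = strtod ("0x1.fffffcp-1", NULL);
  printf ("%a = %.17g\n", x, x);
  return 0;
@}
@end example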
@c =====================================================================
@node Handling Infinity
@section Handling Infinity
@cindex infinity in floating-point arithmetic
@cindex floating-point infinity
As we noted earlier, the IEEE 754 model of computing is not to stop
the program when exceptional conditions occur. It takes note of
exceptional values or conditions by setting sticky @dfn{exception
flags}, or by producing results with the special values Infinity and
QNaN. In this section, we discuss Infinity; @pxref{Handling NaN} for
the other.
In GNU C, you can create a value of negative Infinity in software like
this:
@example
double x;
x = -1.0 / 0.0;
@end example
GNU C supplies the @code{__builtin_inf}, @code{__builtin_inff}, and
@code{__builtin_infl} macros, and the GNU C Library provides the
@code{INFINITY} macro, all of which are compile-time constants for
positive infinity.
GNU C also provides a standard function to test for an Infinity:
@code{isinf (x)} returns a nonzero value if the argument is an
infinity of either sign, and @code{0} if not.
Infinities can be compared, and all Infinities of the same sign are
equal: there is no notion in IEEE 754 arithmetic of different kinds of
Infinities, as there are in some areas of mathematics. Positive
Infinity is larger than any finite value, and negative Infinity is
smaller than any finite value.
Infinities propagate in addition, subtraction, multiplication,
and square root, but in division, they disappear, because of the
rule that @code{finite / Infinity} is @code{0.0}. Thus, an
overflow in an intermediate computation that produces an Infinity
is likely to be noticed later in the final results. The programmer can
then decide whether the overflow is expected, and acceptable, or whether
the code possibly has a bug, or needs to be run in higher
precision, or redesigned to avoid the production of the Infinity.
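Here is a sketch (assuming IEEE 754 @code{double}) of an overflow that
produces an Infinity, which is then noticed with @code{isinf}:
@example
#include <float.h>
#include <math.h>
#include <stdio.h>

int main (void)
@{
  volatile double huge = DBL_MAX;
  double x = huge * 2.0;        /* overflows to +Infinity */

  if (isinf (x))
    printf ("overflow: x = %g\n", x);

  /* finite / Infinity is 0.0, so the Infinity disappears here.  */
  printf ("1.0 / x = %g\n", 1.0 / x);
  return 0;
@}
@end example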
@c =====================================================================
@node Handling NaN
@section Handling NaN
@cindex NaN in floating-point arithmetic
@cindex not a number
@cindex floating-point NaN
NaNs are not numbers: they represent values from computations that
produce undefined results. They have a distinctive property that
makes them unlike any other floating-point value: they are
@emph{unequal to everything, including themselves}! Thus, you can
write a test for a NaN like this:
@example
if (x != x)
printf ("x is a NaN\n");
@end example
@noindent
This test works in GNU C, but some compilers might evaluate that test
expression as false without properly checking for the NaN value.
A more portable way to test for NaN is to use the @code{isnan}
function declared in @code{math.h}:
@example
if (isnan (x))
printf ("x is a NaN\n");
@end example
@noindent
@xref{Floating Point Classes, , , libc, The GNU C Library Reference Manual}.
One important use of NaNs is marking of missing data. For
example, in statistics, such data must be omitted from
computations. Use of any particular finite value for missing
data would eventually collide with real data, whereas such data
could never be a NaN, so it is an ideal marker. Functions that
deal with collections of data that may have holes can be written
to test for, and ignore, NaN values.
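Here is a sketch of that technique: a mean function that treats NaN
entries as missing data and skips them.
@example
#include <math.h>
#include <stdio.h>

/* Average of the non-NaN entries of a[0..n-1];
   returns a NaN if every entry is missing.  */
static double
mean_ignoring_nan (const double *a, int n)
@{
  double sum = 0.0;
  int count = 0;
  int i;

  for (i = 0; i < n; ++i)
    if (!isnan (a[i]))
      @{
        sum += a[i];
        ++count;
      @}
  return count ? sum / count : NAN;
@}

int main (void)
@{
  double data[] = @{ 1.0, NAN, 2.0, NAN, 6.0 @};
  printf ("mean = %g\n", mean_ignoring_nan (data, 5));   /* prints 3 */
  return 0;
@}
@end example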
It is easy to generate a NaN in computations: evaluating @code{0.0 /
0.0} is the commonest way, but @code{Infinity - Infinity},
@code{Infinity / Infinity}, and @code{sqrt (-1.0)} also work.
Functions that receive out-of-bounds arguments can choose to return a
stored NaN value, such as with the @code{NAN} macro defined in
@code{math.h}, but that does not set the @emph{invalid operand}
exception flag, and that can fool some programs.
@cindex NaNs-always-propagate rule
Like Infinity, NaNs propagate in computations, but they are even
stickier, because they never disappear in division. Thus, once a
NaN appears in a chain of numerical operations, it is almost