Commit b3be7c9

man: updated the sections on floating point numbers in the manual.
1 parent c6f4a40 commit b3be7c9

3 files changed

+224
-79
lines changed

doc/manual/float.tex

Lines changed: 172 additions & 73 deletions
@@ -1,68 +1,133 @@
11

2-
\chapter{Floating point}
2+
\chapter{Floating point arithmetic}
33
\label{floatingpoint}
44

5-
Starting with version 5.0 \FORM\ is equiped with arbitrary floating point
6-
capability. The low level routines are part of the GMP and mpfr libraries
7-
which should be available on most systems. If not they can be picked up
8-
easily from the internet. The main commands involving the floating point
9-
system are
5+
Starting with version 5.0, \FORM{} is equipped with arbitrary precision floating point
6+
arithmetic. The low level routines are handled by the GMP and MPFR libraries,
7+
which are available on most systems and if missing can be easily picked up
8+
from the internet. This chapter describes the commands, functions, and behaviour
9+
of \FORM's floating point system.
10+
11+
\section{Initializing and closing the floating point system}
12+
Before any floating-point operations can be performed, \FORM{} must activate the
13+
floating point system and set the working precision. This initialization allocates
14+
the internal data structures used by the GMP and MPFR libraries. The system remains
15+
active until the end of the program, or until it is explicitly closed.
16+
The two statements that control these operations are:
1017
\begin{description}
11-
\item[\#startfloat] This instruction is needed to startup the floating
12-
point system. Invoking it will allocate a number of arrays. The instruction
13-
has either one or two arguments:
18+
\item[\#StartFloat] This instruction initializes the floating
19+
point system and allocates the necessary internal arrays.
20+
It takes either one or two arguments:
1421
\begin{verbatim}
15-
#startfloat <precision> [,MZV=<maximumweight>]
22+
#StartFloat <precision> [,MZV=<maximumweight>]
1623
\end{verbatim}
1724
The first argument is mandatory and specifies the desired precision. It must
18-
be a positive integer followed by either \texttt{b} (for precision in bits)
25+
be a positive integer followed by either a \texttt{b} (for precision in bits)
1926
or \texttt{d} (for precision in decimal digits).
20-
\FORM{} will round to at least this precision. Because the internal
21-
routines work with WORDs, the precision (in bits) will internally be rounded up to the nearest
22-
integer number of WORDs. The second argument is optional for when one wants
23-
to work with multiple zeta values (MZVs) or Euler sums. It specifies the
24-
maximum weight that will be used. The evaluation of the sums requires a
25-
number of auxiliary arrays. The default value is zero. If one would like to
26-
change the precision during a run, this is possible. The effect would be
27-
that the existing arrays are released and new arrays will be allocated.
28-
\item[\#endfloat] This instruction releases all arrays allocated for the
29-
floating point system.
27+
\FORM{} will round to at least this precision.
28+
The second argument is optional and only needed when working with multiple
29+
zeta values (MZVs) or Euler sums. It specifies the maximum weight
30+
that will be used. The evaluation of the sums requires a
31+
number of auxiliary arrays that depend on this weight. The default weight is zero.
32+
\item[\#EndFloat] This instruction releases all arrays allocated for the
33+
floating point system. Note that if one would like to change the precision during a run,
34+
this is now possible with a new \texttt{\#StartFloat} instruction.
3035
\end{description}
36+
Example programs that illustrate the use of these statements and the
37+
functionality of \FORM's floating point system are given below.
38+
39+
40+
\section{Conversion between rational and floating point coefficients}
41+
A term in an expression can have a rational or floating point coefficient.
42+
The following statements convert between the two.
3143
\begin{description}
32-
\item[tofloat] Converts the rational coefficients at the ground level to
33-
floating point numbers in the precision specified in the \#startfloat
34-
instruction. From this point on the coefficient at this level will be
35-
floating point. If one needs to convert numbers inside a function argument
36-
one should use the argument environment. This can be nested.
37-
\item[torational] Tries to convert the floating point coefficients to
38-
rational numbers. To this end it uses repeated fractions as in
44+
\item[ToFloat] Converts rational coefficients to
45+
floating point numbers in the precision specified by \texttt{\#StartFloat}.
46+
From this point on, the coefficient will be floating point.
47+
\item[ToRational] Attempts to convert floating point coefficients to
48+
rational numbers. To this end it uses continued fractions as in
3949
\begin{eqnarray}
40-
x & \rightarrow & n_0 + 1/(n_1+1/(n_2+1/(n_3+\cdots))) \nonumber
50+
x \;\rightarrow\; n_0 + \frac{1}{\,n_1 + \frac{1}{\,n_2 + \frac{1}{\,n_3 + \cdots}}}\;,
51+
\nonumber
4152
\end{eqnarray}
4253
with $x$ a floating point number. The algorithm keeps track of the
4354
remaining precision and if $1/n_i$ is close to this precision it truncates
44-
the sequence at $n_{i-1}$. After that it works out the fraction. It could
45-
be that $x$ cannot be expressed as a fraction within the given precision.
55+
the sequence at $n_{i-1}$. After that it works out the corresponding fraction.
56+
It could be that $x$ cannot be expressed as a fraction within the given precision.
4657
This can usually be seen from the fact that the fractions are `rather wild', or that
4758
the result changes when the precision is increased. This statement can also
48-
be abbreviated to `torat'.
49-
\item[evaluate] If this command has no arguments all floating point
50-
functions that \FORM{} knows about will be evaluated. The currently allowed
51-
arguments are the functions mzv\_, euler\_, sqrt\_ and mzvhalf\_. If any
52-
(or more than one) of these are specified only those functions will be
53-
evaluated.
54-
\item[strictrounding] This statement rounds floating point numbers to a
55-
given precision. The syntax is
59+
be abbreviated as \texttt{ToRat}.
60+
\end{description}
61+
62+
The above statements operate on ground level coefficients only. To convert numbers
63+
inside a function argument, one must use the \texttt{Argument} environment.
64+
For example:
65+
\begin{verbatim}
66+
CFunction f;
67+
#StartFloat 10d
68+
Local F = 0.1666666666*f(0.1428571429);
69+
ToRat;
70+
Print "<1> %t";
71+
Argument f;
72+
ToRat;
73+
EndArgument;
74+
Print "<2> %t";
75+
.end
76+
<1> + 1/6*f(1.428571429e-01)
77+
<2> + 1/6*f(1/7)
78+
\end{verbatim}
79+
The argument environment may be nested.
80+
Similarly, the statements \texttt{Evaluate}, \texttt{StrictRounding} and \texttt{Chop} act at
81+
the ground level. To have them act on function arguments, one uses the \texttt{Argument} environment.
82+
These statements are explained further below.
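
As an illustration of the continued fraction conversion described for \texttt{ToRational}, the following Python sketch expands a number into partial quotients and returns the first convergent that agrees with the input to within a tolerance. It is not \FORM's implementation: the name \texttt{to\_rational} and the tolerance argument are assumptions made for this example, and the stopping rule is simpler than the precision tracking described above.

```python
from fractions import Fraction


def to_rational(x, tol):
    """Return the first continued fraction convergent p/q with |x - p/q| <= tol.

    Hypothetical helper for illustration only; FORM's ToRational instead
    tracks the remaining precision while expanding.
    """
    x = Fraction(x)
    tol = Fraction(tol)
    # Convergent recurrences: h_n = a_n*h_{n-1} + h_{n-2}, same for k_n.
    h_prev, h = 0, 1
    k_prev, k = 1, 0
    y = x
    while True:
        a = y.numerator // y.denominator  # partial quotient n_i (floor of y)
        h_prev, h = h, a * h + h_prev
        k_prev, k = k, a * k + k_prev
        approx = Fraction(h, k)
        if abs(x - approx) <= tol:
            return approx
        y = 1 / (y - a)  # invert the fractional part and continue


# The two 10-digit inputs from the example program above.
print(to_rational("0.1666666666", Fraction(1, 10**9)))  # 1/6
print(to_rational("0.1428571429", Fraction(1, 10**9)))  # 1/7
```

Using exact \texttt{Fraction} arithmetic keeps the sketch free of rounding effects of its own; with a much tighter tolerance the same inputs would produce fractions with large denominators, which is the `rather wild' behaviour mentioned above.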
83+
84+
\section{Evaluation of functions and symbols}
85+
Before version 5.0, \FORM{} already reserved function names for many common mathematical
86+
functions. These functions can now be evaluated numerically using:
87+
88+
\begin{description}
89+
\item[Evaluate] This statement evaluates the mathematical functions and/or symbols numerically:
90+
\begin{verbatim}
91+
Evaluate [function(s)],[symbol(s)];
92+
\end{verbatim}
93+
where the argument specifies the function(s) and/or symbol(s) to evaluate.
94+
More than one function and/or symbol may be listed.
95+
If this statement is used without arguments, all floating point functions and symbols that \FORM{}
96+
knows will be evaluated. Currently, the full list of functions that can be evaluated numerically reads
97+
\begin{verbatim}
98+
sqrt_, ln_, eexp_, li2_, gamma_, agm_,
99+
sin_, cos_, tan_, asin_, acos_, atan_, atan2_,
100+
sinh_, cosh_, tanh_, asinh_, acosh_, atanh_,
101+
mzv_, euler_, mzvhalf_,
102+
\end{verbatim}
103+
where the functions on the last line denote the multiple zeta values, Euler sums and
104+
harmonic polylogarithms of argument $1/2$ respectively.
105+
The list of symbols/constants that can be evaluated is
56106
\begin{verbatim}
57-
strictrounding [precision];
107+
pi_, ee_, em_,
58108
\end{verbatim}
59-
where precision is an optional argument that specifies the rounding
109+
where \texttt{ee\_}\index{ee\_} denotes the base of the natural logarithm
110+
and \texttt{em\_}\index{em\_} the Euler-Mascheroni constant.
111+
112+
In addition, the functions \texttt{lin\_}, \texttt{hpl\_} and \texttt{mpl\_} are reserved function names,
113+
but currently have no numerical evaluation.
114+
\end{description}
115+
116+
117+
\section{Rounding behaviour}
118+
\begin{description}
119+
\item[StrictRounding] This statement rounds floating point numbers to a
120+
given precision:
121+
\begin{verbatim}
122+
StrictRounding [<precision>];
123+
\end{verbatim}
124+
where \texttt{<precision>} is an optional argument that specifies the rounding
60125
precision in either digits or bits, using the same syntax as
61-
\texttt{\#startfloat}. If no argument is given, this statement rounds
62-
the floating point coefficients to the default precision. Internally,
63-
the GMP and mpfr libraries may use extra precision beyond that set by
64-
\texttt{\#startfloat}. As a result, terms may not merge due to this
65-
extra precision. For example:
126+
\texttt{\#startfloat}. If omitted, the default precision is used.
127+
128+
Internally, the GMP and MPFR libraries may use extra precision beyond that set by
129+
\texttt{\#startfloat}. As a result, terms that print the same may still differ slightly
130+
due to this extra precision and therefore fail to merge. For example:
66131
\begin{verbatim}
67132
#startfloat 6d
68133
CFunction f;
@@ -89,13 +154,13 @@ \chapter{Floating point}
89154
$1.1100110101011111101*2^{-14}$. When rounded to 5 bits, this becomes
90155
$1.1101*2^{-14}$, which in decimal digits appears as
91156
1.10626220703125e-04.
92-
\item[Chop] This statement removes floating point numbers that are smaller
93-
in absolute magnitude than a specified threshold. It takes one argument delta:
157+
\item[Chop] This statement removes floating point numbers that are {\em smaller}
158+
in absolute magnitude than a specified threshold. It takes one argument:
94159
\begin{verbatim}
95160
Chop <delta>;
96161
\end{verbatim}
97-
All floating point numbers with absolute value less than delta are replaced by 0.
98-
Terms with no floating point coefficient are left untouched. The threshold delta
162+
All floating point numbers with absolute value {\em less} than \texttt{<delta>} are replaced by 0.
163+
Terms with no floating point coefficient are left untouched. The threshold \texttt{<delta>}
99164
can be a floating point number, integer, rational number, or power. Because
100165
statements in \FORM{} act term by term, it is often important to sort before invoking the
101166
chop statement. Otherwise, terms might be removed individually, while after
@@ -109,33 +174,21 @@ \chapter{Floating point}
109174
Format floatprecision;
110175
\end{verbatim}
111176
\FORM{} prints floats with the number of digits specified by the current
112-
\#startfloat instruction. With
177+
\texttt{\#startfloat} instruction. With
113178
\begin{verbatim}
114179
Format floatprecision <precision>;
115180
\end{verbatim}
116181
\FORM{} prints the number of digits specified by \texttt{<precision>}.
117-
The syntax is the same as for the precision in \#startfloat: a positive
118-
integer followed by either \texttt{b} (for bits) or \texttt{d} (for decimal
119-
digits). If the requested precision exceeds the precision specified by
120-
\#startfloat, only the available digits are printed. Finally, with
182+
The syntax is the same as for the precision in \texttt{\#startfloat}.
183+
If the requested precision exceeds the precision specified by
184+
\texttt{\#startfloat}, only the available digits are printed. Finally, with
121185
\begin{verbatim}
122186
Format floatprecision off;
123187
\end{verbatim}
124-
the floating point numbers are printed in raw internal format.
188+
the floating point numbers are printed in raw internal format, see also section \ref{sec:float_raw}.
125189
\end{description}
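
The 5-bit rounding quoted in the \texttt{StrictRounding} example can be checked independently. The Python sketch below rounds a binary floating point number to a given number of significant bits, assuming round-to-nearest; the helper name \texttt{round\_to\_bits} is hypothetical and the code only reproduces the arithmetic, not \FORM's or MPFR's internals.

```python
import math


def round_to_bits(x: float, bits: int) -> float:
    """Round x to `bits` significant binary digits (round-to-nearest assumed)."""
    if x == 0.0:
        return 0.0
    # x = mantissa * 2**exponent with 0.5 <= |mantissa| < 1
    mantissa, exponent = math.frexp(x)
    # Keep the leading `bits` bits as an integer, rounding the rest away.
    scaled = round(mantissa * 2**bits)
    return math.ldexp(scaled, exponent - bits)


# 1.1100110101011111101 (binary) * 2^-14, the value quoted in the text.
x = int("11100110101011111101", 2) * 2.0**-33
rounded = round_to_bits(x, 5)
print(rounded == 1.10626220703125e-04)
```

The rounded mantissa is $1.1101_2 = 1.8125$, and $1.8125 \cdot 2^{-14}$ is exactly the decimal value 1.10626220703125e-04 given in the text.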
126-
In addition to the above commands there are the following functions that
127-
can be evaluated sqrt\_, ln\_, eexp\_, li2\_, gamma\_, agm\_, sin\_, cos\_, tan\_,
128-
asin\_, acos\_, atan\_, atan2\_, sinh\_, cosh\_, tanh\_, asinh\_, acosh\_, atanh\_.
129-
For the function lin\_ there is currently no code.
130-
The agm\_ function is the arithmetic geometric mean of its two input
131-
values.
132-
133-
In addition to the above functions there are also the constant
134-
pi\_\index{pi\_}, the basis of the natural logarithm ee\_\index{ee\_} and the
135-
Euler-Mascheroni constant em\_\index{em\_}. These constants will also be
136-
expanded with the evaluate command. When given as an argument to evaluate,
137-
only the specified constants will be evaluated.
138190

191+
\section{Examples}
139192
The following example shows some work with Multiple Zeta Values (MZV's):
140193
\begin{verbatim}
141194
#StartFloat 500b, MZV=15
@@ -190,10 +243,10 @@ \chapter{Floating point}
190243
191244
0.08 sec out of 0.09 sec
192245
\end{verbatim}
193-
The \#startfloat initializes the floating point system and allocates arrays
194-
for 500 bits of precision. If there is a second number it indicates the
195-
maximum weight for MZVs and Euler sums. The functions are only evaluated
196-
when the proper command is given. In the second module we divide the
246+
In the first module, \texttt{\#startfloat} initializes the floating point system with
247+
500 bits of precision and a maximum weight for the MZVs and Euler sums of 15.
248+
The \texttt{mzv\_} functions are then evaluated with the \texttt{Evaluate}
249+
statement. In the second module we divide the
197250
numbers and convert the result to a rational. It is a good idea to try this
198251
with various precisions to see whether this is stable. With 60 bits the
199252
final answer would be
@@ -202,5 +255,51 @@ \chapter{Floating point}
202255
\end{verbatim}
203256
while at 150 bits we have already the same answer as with 500 bits. The
204257
fraction that is obtained by this program can be proven to be correct.
205-
\vspace{3mm}
206258

259+
260+
\section{Raw form}
261+
\label{sec:float_raw}
262+
Internally, floating point numbers are represented by the function \texttt{float\_},
263+
i.e. \texttt{float\_(prec, size, exp, limbs)}. The integer arguments encode the
264+
internal representation of the floating point number as in the GMP library:
265+
\begin{description}
266+
\item[prec] The precision of the mantissa in limbs.
267+
\item[size] The number of limbs currently in use.
268+
\item[exp] The exponent, determining the location of the implied radix point.
269+
\item[limbs] The limbs packed as the numerator of a \FORM{} rational.
270+
\end{description}
271+
In a normalized term containing \texttt{float\_}, the rational coefficient must
272+
be either $1/1$ or $-1/1$, where the sign of the term is absorbed into the rational
273+
coefficient.
274+
Furthermore, the \texttt{float\_} is protected from the pattern matcher and from
275+
statements that act on functions -- such as \texttt{Transform}, \texttt{Argument},
276+
\texttt{Normalize} etc.
277+
The following program illustrates this:
278+
%
279+
\begin{verbatim}
280+
CFunction f;
281+
#StartFloat 10d
282+
Local F = 1.23456789 + f(1,2);
283+
Identify f?(?a) = f(10);
284+
Print "<1> %t";
285+
.sort
286+
<1> + 1.23456789e+00
287+
<1> + f(10)
288+
#EndFloat
289+
Normalize;
290+
Print "<2> %t";
291+
.sort
292+
<2> + float_(2,3,1,420101683733788795657820481376616399786)
293+
<2> + 10*f(1)
294+
#StartFloat 5d
295+
Print "<3> %t";
296+
.end
297+
<3> + 1.2346e+00
298+
<3> + 10*f(1)
299+
\end{verbatim}
300+
%
301+
As shown, the \texttt{id}-statement does not affect the \texttt{float\_} function.
302+
Here we also see the use of the preprocessor statement \texttt{\#EndFloat} which closes
303+
the floating point system. After this statement, the \texttt{float\_} function becomes a
304+
regular function. Its protected status, however, persists so that \texttt{id}-statements
305+
or statements like \texttt{Normalize} still do not modify it.
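
To make the raw form more concrete, the following Python sketch decodes the \texttt{float\_} term printed in the example program. It rests on two assumptions that the text does not state: limbs are 64 bits wide, and the last argument is the limb array read as a single integer. Under those assumptions the value is the integer scaled by $2^{64\,(\mathtt{exp}-\mathtt{size})}$, following the GMP convention for the radix point. The name \texttt{decode\_float} is hypothetical.

```python
from fractions import Fraction

LIMB_BITS = 64  # assumption: 64-bit limbs, not stated in the manual


def decode_float(prec: int, size: int, exp: int, limbs: int) -> Fraction:
    """Decode float_(prec, size, exp, limbs) under the assumptions above.

    `prec` is not needed for the value itself; it is kept only to mirror
    the argument list of float_.
    """
    # `size` limbs hold the mantissa; `exp` of them lie above the radix point.
    shift = LIMB_BITS * (exp - size)
    return Fraction(limbs) * Fraction(2) ** shift


value = decode_float(2, 3, 1, 420101683733788795657820481376616399786)
print(float(value))  # close to 1.23456789
```

With these assumptions the decoded value agrees with the coefficient 1.23456789 of the example to the working precision, which suggests the interpretation is at least consistent with the printed output.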
