On Tue, Jul 14, 2020 at 2:47 PM Bill Huffman < huffman@...> wrote:
On 7/14/20 2:30 PM, Andrew Waterman wrote:
Hi Andrew et al,
Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.
If my calculations are correct, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.
I have a concern about not including special "step" instructions. The complete NR sequences are as follows:
float newton_raphson_recip(float r)
{
  float x = recip(r);
  int k = fpclassify(x);
  if (k != FP_INFINITE && k != FP_ZERO) {
    float t = fmaf(-r, x, 2.0f); // fma: t = 2 - r*x
    x = x * t;                   // fmul
    t = fmaf(-r, x, 2.0f);       // fma
    x = x * t;                   // fmul
  }
  return x;
}
float newton_raphson_rsqrt(float r)
{
  float x = rsqrt(r);
  int k = fpclassify(r);
  if (k != FP_INFINITE && k != FP_ZERO) {
    float t = r * x;             // fmul
    float h = 0.5f * x;          // fmul
    t = fmaf(-t, x, 3.0f);       // fma: t = 3 - r*x*x
    x = h * t;                   // fmul
    t = r * x;                   // fmul
    h = 0.5f * x;                // fmul
    t = fmaf(-t, x, 3.0f);       // fma
    x = h * t;                   // fmul
  }
  return x;
}
As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do zero and infinity checking in the sequence. This is because, e.g., an input of 0 produces an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN that gets propagated all the way to the end.
Moreover, if the code is vectorized, the condition generates a predicate (in v0) which needs to be applied to all the iterations. If the sequence was already predicated using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.
Roughly, calculating the number of extra instructions in each sequence I get:
- save v0; 1 inst
- classify input; 1 inst
- compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialize the "classes" to compare against and then do the actual comparison(s)
- materialize constant 2.0 for recip; 1 inst
- … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
- restore v0; 1 inst
This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two NR iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers on top of the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we spill/fill registers to memory.
If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):
A recip_step instruction that calculates 2 - a*b with the following special-case handling:
- One input is infinity and the other 0 => result is 2
- One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of the NaN)
A rsqrt_step instruction that calculates (3 - a*b)/2 with the following special-case handling:
- One input is infinity and the other 0 => result is 1.5
- One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of the NaN)
With the above instructions, the NR sequences are reduced to:
float newton_raphson_recip_with_step(float r)
{
  float x = recip(r);
  float t = recip_step(r, x); // fma
  x = x * t;                  // fmul
  t = recip_step(r, x);       // fma
  x = x * t;                  // fmul
  return x;
}
float newton_raphson_rsqrt_with_step(float r)
{
  float x = rsqrt(r);
  float t = r * x;            // fmul
  t = rsqrt_step(t, x);       // fma
  x = x * t;                  // fmul
  t = r * x;                  // fmul
  t = rsqrt_step(t, x);       // fma
  x = x * t;                  // fmul
  return x;
}
Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.
FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight. I suppose a compromise might be to make them destructive. This would have no performance consequence for rsqrt, but would add one extra move instruction to all but the last reciprocal iteration step.
Seems like they shouldn't be so big, as they don't specify rs3 at all. Are we tight on two-register-input opcodes? Actually, none of the vector instructions use the rs3 field (the vector FMAs are destructive to save encoding space).
There are still several R-type code points left in the vector opcode, but it has gotten a bit tight, and there will be stiff competition for those code points over the coming years.
Bill
For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three NR iterations.
Thanks,
Grigorios
PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.
On 9 Jul 2020, at 10:12, Andrew Waterman < andrew@...> wrote:
I'm following up with detailed semantics in the form of a self-contained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` command-line argument, it prints Verilog code for the lookup tables. When invoked with `test` or `test-long`, it self-validates and prints error information, e.g.:
$ g++ -O2 recip.cc && ./a.out test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422
The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman < andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.
Following is a discussion of some design considerations leading to a concrete proposal.
Strict or bounded definition?
Some ISAs have required that these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).
The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.
What precision to choose?
Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.
A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:
- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.
All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7-bit scheme is of interest because it's the cliff at which double precision needs 3 iterations to converge.
Instruction encoding?
As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.
Corner-case details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation. In particular:
- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows
- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising the OF exception
Newton-Raphson iteration instructions?
Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration. (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.