On Thu, Jul 9, 2020 at 1:12 AM Andrew Waterman <andrew@...> wrote: I'm following up with detailed semantics in the form of a self-contained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` command-line argument, it prints Verilog code for the lookup tables. When invoked with `test` or `testlong`, it self-validates and prints error information, e.g.:
$ g++ -O2 recip.cc && ./a.out test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422
The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)
Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.
Following is a discussion of some design considerations leading to a concrete proposal.
Strict or bounded definition?
Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).
The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.
What precision to choose?
Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.
A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:
- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.
All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7-bit scheme is of interest because it's the cliff at which double precision needs 3 iterations to converge.
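For the record, the iteration counts follow from Newton-Raphson's quadratic convergence. Writing the reciprocal estimate as x_n = (1/a)(1 + e_n), one step gives

    x_{n+1} = x_n * (2 - a*x_n) = (1/a)(1 + e_n)(1 - e_n) = (1/a)(1 - e_n^2),

so |e_{n+1}| = e_n^2. Starting from the 7-bit scheme's e_0 = 2^-7.4, the error goes 2^-14.8, 2^-29.6, 2^-59.2 after one, two, and three steps; comparing against the 8-, 11-, 24-, and 53-bit significands of bfloat16, binary16, binary32, and binary64 reproduces the 0/1/2/3 counts above.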
Instruction encoding?
As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation. In particular:
- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows
- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising the OF exception
Newton-Raphson iteration instructions?
Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration. (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.
