Re: VFRECIP/VFRSQRT instructions


I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

  $ g++ -O2 && ./a.out --test
  max recip error on [0.5, 1]: 2^-7.4843
  max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs:

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

Join to automatically receive all group messages.