Re: VFRECIP/VFRSQRT instructions


On Tue, Jul 14, 2020 at 7:36 AM Grigorios Magklis <grigorios.magklis@...> wrote:
Hi Andrew et al,

Thank you for sending the code. I am attaching an updated version of, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.

If my calculations are not wrong, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.

I have a concern about not including special "step" instructions. The complete N-R sequences are as follows:

float newton_raphson_recip(float r)
  float x = recip(r);
  int k = fpclassify(x);
  if (k != FP_INFINITE && k != FP_ZERO) {
    float t = fmaf(-r, x, 2.0f);  // fma
    x = x * t;                    // fmul
    t = fmaf(-r, x, 2.0f);        // fma
    x = x * t;                    // fmul
  return x;

float newton_raphson_rsqrt(float r)
  float x = rsqrt(r);
  int k = fpclassify(r);
  if (k != FP_INFINITE && k != FP_ZERO)  {
    float t = r * x;              // fmul
    float h = 0.5 * x;            // fmul
    t = fmaf(-t, x, 3.0f);        // fma
    x = h * t;                    // fmul
    t = r * x;                    // fmul
    h = 0.5 * x;                  // fmul
    t = fmaf(-t, x, 3.0f);        // fma
    x = h * t;                    // fmul
  return x;

As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.

Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.

Roughly, calculating the number of extra instructions in each sequence I get:
  • save v0; 1 inst
  • classify input; 1 inst
  • compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
  • materialize constant 2.0 for recip; 1 inst
  • … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
  • restore v0; 1 inst

This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two N-R iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.

If we had the following special "step" instructions the sequences would be much shorter (see also the attached file):

A recip_step instruction that calculates 2-a*b with the following special case handling:
  • One input is infinity and the other 0 => result is 2
  • One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3-a*b)/2 with the following special case handling:
  • One input is infinity and the other 0 => result is 1.5
  • One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)

With the above instructions, the N-R sequences are reduced to:

float newton_raphson_recip_with_step(float r)
  float x = recip(r);
  float t = recip_step(r, x);   // fma
  x = x * t;                    // fmul
  t = recip_step(r, x);         // fma
  x = x * t;                    // fmul
  return x;

float newton_raphson_rsqrt_with_step(float r)
  float x = rsqrt(r);
  float t = r * x;              // fmul
  t = rsqrt_step(t, x);         // fma
  x = x * t;                    // fmul
  t = r * x;                    // fmul
  t = rsqrt_step(t, x);         // fma
  x = x * t;                    // fmul
  return x;

Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.

FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight.  I suppose a compromise might be to make them destructive.  This would have no perf. consequence for rsqrt, but would add one extra move instruction to all but the last reciprocal iteration step.

For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three N-R iterations.


PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.

On 9 Jul 2020, at 10:12, Andrew Waterman <andrew@...> wrote:

I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

  $ g++ -O2 && ./a.out --test
  max recip error on [0.5, 1]: 2^-7.4843
  max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs:

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

Join to automatically receive all group messages.