#### VFRECIP/VFRSQRT instructions

andrew@...

The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

andrew@...

I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

\$ g++ -O2 recip.cc && ./a.out --test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

andrew@...

On Thu, Jul 9, 2020 at 1:12 AM Andrew Waterman <andrew@...> wrote:

I'm following up with detailed semantics in the form of a self-contained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables. When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

\$ g++ -O2 recip.cc && ./a.out --test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:

The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation. In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration. (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

Mr Grigorios Magklis

Hi Andrew et al,

Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.

If my calculations are not wrong, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.

I have a concern about not including special "step" instructions. The complete N-R sequences are as follows:

float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(-r, x, 2.0f);  // fma
x = x * t;                    // fmul
t = fmaf(-r, x, 2.0f);        // fma
x = x * t;                    // fmul
}
return x;
}

float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO)  {
float t = r * x;              // fmul
float h = 0.5 * x;            // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
t = r * x;                    // fmul
h = 0.5 * x;                  // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
}
return x;
}

As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.

Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.

Roughly, calculating the number of extra instructions in each sequence I get:
• save v0; 1 inst
• classify input; 1 inst
• compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
• materialize constant 2.0 for recip; 1 inst
• … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
• restore v0; 1 inst

This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two N-R iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.

If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):

A recip_step instruction that calculates 2-a*b with the following special case handling:
• One input is infinity and the other 0 => result is 2
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3-a*b)/2 with the following special case handling:
• One input is infinity and the other 0 => result is 1.5
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)

With the above instructions, the N-R sequences are reduced to:

float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x);   // fma
x = x * t;                    // fmul
t = recip_step(r, x);         // fma
x = x * t;                    // fmul
return x;
}

float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x;              // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
t = r * x;                    // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
return x;
}

Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.

For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three N-R iterations.

Thanks,
Grigorios

PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.

On 9 Jul 2020, at 10:12, Andrew Waterman <andrew@...> wrote:

I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

\$ g++ -O2 recip.cc && ./a.out --test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

andrew@...

On Tue, Jul 14, 2020 at 7:36 AM Grigorios Magklis <grigorios.magklis@...> wrote:
Hi Andrew et al,

Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.

If my calculations are not wrong, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.

I have a concern about not including special "step" instructions. The complete N-R sequences are as follows:

float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(-r, x, 2.0f);  // fma
x = x * t;                    // fmul
t = fmaf(-r, x, 2.0f);        // fma
x = x * t;                    // fmul
}
return x;
}

float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO)  {
float t = r * x;              // fmul
float h = 0.5 * x;            // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
t = r * x;                    // fmul
h = 0.5 * x;                  // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
}
return x;
}

As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.

I forgot to mention that I added sample vector code for estimating square root: https://github.com/riscv/riscv-v-spec/blob/vfrecip/vector-examples.adoc#square-root-approximation-example

Handling the zero and infinity input cases is a bit faster than you estimated, since you don't need to explicitly classify the input.  The approach is to generate the mask by comparing the input against 0.0, then estimate the reciprocal, then compare the estimate against 0.0 (which handles +inf inputs, and also handles -inf inputs for reciprocal, since the comparison against 0.0 also matches -0.0).  The second comparison is performed under the mask and overwrites the mask, effectively ANDing it in.

Of course, a two-instruction savings doesn't nullify your argument, but it does tweak the tradeoffs slightly.

Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.

Roughly, calculating the number of extra instructions in each sequence I get:
• save v0; 1 inst
• classify input; 1 inst
• compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
• materialize constant 2.0 for recip; 1 inst
• … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
• restore v0; 1 inst

This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two N-R iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.

If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):

A recip_step instruction that calculates 2-a*b with the following special case handling:
• One input is infinity and the other 0 => result is 2
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3-a*b)/2 with the following special case handling:
• One input is infinity and the other 0 => result is 1.5
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)

With the above instructions, the N-R sequences are reduced to:

float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x);   // fma
x = x * t;                    // fmul
t = recip_step(r, x);         // fma
x = x * t;                    // fmul
return x;
}

float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x;              // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
t = r * x;                    // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
return x;
}

Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.

For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three N-R iterations.

Thanks,
Grigorios

PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.

Halving the input discards one bit of precision for subnormal inputs, so that checks out.  It's presumably only severe for the smallest subnormals; whether that is acceptable is obviously application-dependent.

On 9 Jul 2020, at 10:12, Andrew Waterman <andrew@...> wrote:

I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

\$ g++ -O2 recip.cc && ./a.out --test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

andrew@...

On Tue, Jul 14, 2020 at 7:36 AM Grigorios Magklis <grigorios.magklis@...> wrote:
Hi Andrew et al,

Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.

If my calculations are not wrong, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.

I have a concern about not including special "step" instructions. The complete N-R sequences are as follows:

float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(-r, x, 2.0f);  // fma
x = x * t;                    // fmul
t = fmaf(-r, x, 2.0f);        // fma
x = x * t;                    // fmul
}
return x;
}

float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO)  {
float t = r * x;              // fmul
float h = 0.5 * x;            // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
t = r * x;                    // fmul
h = 0.5 * x;                  // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
}
return x;
}

As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.

Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.

Roughly, calculating the number of extra instructions in each sequence I get:
• save v0; 1 inst
• classify input; 1 inst
• compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
• materialize constant 2.0 for recip; 1 inst
• … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
• restore v0; 1 inst

This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two N-R iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.

If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):

A recip_step instruction that calculates 2-a*b with the following special case handling:
• One input is infinity and the other 0 => result is 2
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3-a*b)/2 with the following special case handling:
• One input is infinity and the other 0 => result is 1.5
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)

With the above instructions, the N-R sequences are reduced to:

float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x);   // fma
x = x * t;                    // fmul
t = recip_step(r, x);         // fma
x = x * t;                    // fmul
return x;
}

float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x;              // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
t = r * x;                    // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
return x;
}

Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.

FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight.  I suppose a compromise might be to make them destructive.  This would have no perf. consequence for rsqrt, but would add one extra move instruction to all but the last reciprocal iteration step.

For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three N-R iterations.

Thanks,
Grigorios

PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.

On 9 Jul 2020, at 10:12, Andrew Waterman <andrew@...> wrote:

I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

\$ g++ -O2 recip.cc && ./a.out --test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

Bill Huffman

On 7/14/20 2:30 PM, Andrew Waterman wrote:
EXTERNAL MAIL

On Tue, Jul 14, 2020 at 7:36 AM Grigorios Magklis <grigorios.magklis@...> wrote:
Hi Andrew et al,

Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.

If my calculations are not wrong, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.

I have a concern about not including special "step" instructions. The complete N-R sequences are as follows:

float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(-r, x, 2.0f);  // fma
x = x * t;                    // fmul
t = fmaf(-r, x, 2.0f);        // fma
x = x * t;                    // fmul
}
return x;
}

float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO)  {
float t = r * x;              // fmul
float h = 0.5 * x;            // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
t = r * x;                    // fmul
h = 0.5 * x;                  // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
}
return x;
}

As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.

Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.

Roughly, calculating the number of extra instructions in each sequence I get:
• save v0; 1 inst
• classify input; 1 inst
• compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
• materialize constant 2.0 for recip; 1 inst
• … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
• restore v0; 1 inst

This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two N-R iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.

If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):

A recip_step instruction that calculates 2-a*b with the following special case handling:
• One input is infinity and the other 0 => result is 2
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3-a*b)/2 with the following special case handling:
• One input is infinity and the other 0 => result is 1.5
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)

With the above instructions, the N-R sequences are reduced to:

float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x);   // fma
x = x * t;                    // fmul
t = recip_step(r, x);         // fma
x = x * t;                    // fmul
return x;
}

float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x;              // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
t = r * x;                    // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
return x;
}

Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.

FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight.  I suppose a compromise might be to make them destructive.  This would have no perf. consequence for rsqrt, but would add one extra move instruction to all but the last reciprocal iteration step.

Seems like they shouldn't be so big as they don't specify rs3 at all.  Are we tight on two register input opcodes?

Bill

For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three N-R iterations.

Thanks,
Grigorios

PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.

On 9 Jul 2020, at 10:12, Andrew Waterman <andrew@...> wrote:

I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

\$ g++ -O2 recip.cc && ./a.out --test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

andrew@...

On Tue, Jul 14, 2020 at 2:47 PM Bill Huffman <huffman@...> wrote:

On 7/14/20 2:30 PM, Andrew Waterman wrote:
EXTERNAL MAIL

On Tue, Jul 14, 2020 at 7:36 AM Grigorios Magklis <grigorios.magklis@...> wrote:
Hi Andrew et al,

Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.

If my calculations are not wrong, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.

I have a concern about not including special "step" instructions. The complete N-R sequences are as follows:

float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(-r, x, 2.0f);  // fma
x = x * t;                    // fmul
t = fmaf(-r, x, 2.0f);        // fma
x = x * t;                    // fmul
}
return x;
}

float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO)  {
float t = r * x;              // fmul
float h = 0.5 * x;            // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
t = r * x;                    // fmul
h = 0.5 * x;                  // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
}
return x;
}

As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.

Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.

Roughly, calculating the number of extra instructions in each sequence I get:
• save v0; 1 inst
• classify input; 1 inst
• compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
• materialize constant 2.0 for recip; 1 inst
• … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
• restore v0; 1 inst

This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two N-R iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.

If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):

A recip_step instruction that calculates 2-a*b with the following special case handling:
• One input is infinity and the other 0 => result is 2
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3-a*b)/2 with the following special case handling:
• One input is infinity and the other 0 => result is 1.5
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)

With the above instructions, the N-R sequences are reduced to:

float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x);   // fma
x = x * t;                    // fmul
t = recip_step(r, x);         // fma
x = x * t;                    // fmul
return x;
}

float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x;              // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
t = r * x;                    // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
return x;
}

Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.

FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight.  I suppose a compromise might be to make them destructive.  This would have no perf. consequence for rsqrt, but would add one extra move instruction to all but the last reciprocal iteration step.

Seems like they shouldn't be so big as they don't specify rs3 at all.  Are we tight on two register input opcodes?

Actually, none of the vector instructions use the rs3 field (the vector FMAs are destructive to save encoding space).

There are still several R-type code points left in the vector opcode, but it has gotten a bit tight, and there will be stiff competition for those code points over the coming years.

Bill

For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three N-R iterations.

Thanks,
Grigorios

PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.

On 9 Jul 2020, at 10:12, Andrew Waterman <andrew@...> wrote:

I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

\$ g++ -O2 recip.cc && ./a.out --test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

Bill Huffman

On 7/14/20 2:54 PM, Andrew Waterman wrote:
EXTERNAL MAIL

On Tue, Jul 14, 2020 at 2:47 PM Bill Huffman <huffman@...> wrote:

On 7/14/20 2:30 PM, Andrew Waterman wrote:
EXTERNAL MAIL

On Tue, Jul 14, 2020 at 7:36 AM Grigorios Magklis <grigorios.magklis@...> wrote:
Hi Andrew et al,

Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.

If my calculations are not wrong, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.

I have a concern about not including special "step" instructions. The complete N-R sequences are as follows:

float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(-r, x, 2.0f);  // fma
x = x * t;                    // fmul
t = fmaf(-r, x, 2.0f);        // fma
x = x * t;                    // fmul
}
return x;
}

float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO)  {
float t = r * x;              // fmul
float h = 0.5 * x;            // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
t = r * x;                    // fmul
h = 0.5 * x;                  // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
}
return x;
}

As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.

Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.

Roughly, calculating the number of extra instructions in each sequence I get:
• save v0; 1 inst
• classify input; 1 inst
• compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
• materialize constant 2.0 for recip; 1 inst
• … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
• restore v0; 1 inst

This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two N-R iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.

If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):

A recip_step instruction that calculates 2-a*b with the following special case handling:
• One input is infinity and the other 0 => result is 2
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3-a*b)/2 with the following special case handling:
• One input is infinity and the other 0 => result is 1.5
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)

With the above instructions, the N-R sequences are reduced to:

float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x);   // fma
x = x * t;                    // fmul
t = recip_step(r, x);         // fma
x = x * t;                    // fmul
return x;
}

float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x;              // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
t = r * x;                    // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
return x;
}

Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.

FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight.  I suppose a compromise might be to make them destructive.  This would have no perf. consequence for rsqrt, but would add one extra move instruction to all but the last reciprocal iteration step.

Seems like they shouldn't be so big as they don't specify rs3 at all.  Are we tight on two register input opcodes?

Actually, none of the vector instructions use the rs3 field (the vector FMAs are destructive to save encoding space).

There are still several R-type code points left in the vector opcode, but it has gotten a bit tight, and there will be stiff competition for those code points over the coming years.

Sorry, I had scalar on the mind....

Bill

Bill

For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three N-R iterations.

Thanks,
Grigorios

PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.

On 9 Jul 2020, at 10:12, Andrew Waterman <andrew@...> wrote:

I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

\$ g++ -O2 recip.cc && ./a.out --test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

Bill Huffman

Do we improve accuracy a bit if the step is:

t = 1.0 - r*x; x = x + t*x

t = 2.0 - r*x; x = t*x

Bill

On 7/14/20 2:58 PM, Bill Huffman wrote:

On 7/14/20 2:54 PM, Andrew Waterman wrote:
EXTERNAL MAIL

On Tue, Jul 14, 2020 at 2:47 PM Bill Huffman <huffman@...> wrote:

On 7/14/20 2:30 PM, Andrew Waterman wrote:
EXTERNAL MAIL

On Tue, Jul 14, 2020 at 7:36 AM Grigorios Magklis <grigorios.magklis@...> wrote:
Hi Andrew et al,

Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.

If my calculations are not wrong, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.

I have a concern about not including special "step" instructions. The complete N-R sequences are as follows:

float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(-r, x, 2.0f);  // fma
x = x * t;                    // fmul
t = fmaf(-r, x, 2.0f);        // fma
x = x * t;                    // fmul
}
return x;
}

float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO)  {
float t = r * x;              // fmul
float h = 0.5 * x;            // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
t = r * x;                    // fmul
h = 0.5 * x;                  // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
}
return x;
}

As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.

Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.

Roughly, calculating the number of extra instructions in each sequence I get:
• save v0; 1 inst
• classify input; 1 inst
• compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
• materialize constant 2.0 for recip; 1 inst
• … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
• restore v0; 1 inst

This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two N-R iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.

If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):

A recip_step instruction that calculates 2-a*b with the following special case handling:
• One input is infinity and the other 0 => result is 2
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3-a*b)/2 with the following special case handling:
• One input is infinity and the other 0 => result is 1.5
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)

With the above instructions, the N-R sequences are reduced to:

float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x);   // fma
x = x * t;                    // fmul
t = recip_step(r, x);         // fma
x = x * t;                    // fmul
return x;
}

float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x;              // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
t = r * x;                    // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
return x;
}

Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.

FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight.  I suppose a compromise might be to make them destructive.  This would have no perf. consequence for rsqrt, but would add one extra move instruction to all but the last reciprocal iteration step.

Seems like they shouldn't be so big as they don't specify rs3 at all.  Are we tight on two register input opcodes?

Actually, none of the vector instructions use the rs3 field (the vector FMAs are destructive to save encoding space).

There are still several R-type code points left in the vector opcode, but it has gotten a bit tight, and there will be stiff competition for those code points over the coming years.

Sorry, I had scalar on the mind....

Bill

Bill

For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three N-R iterations.

Thanks,
Grigorios

PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.

On 9 Jul 2020, at 10:12, Andrew Waterman <andrew@...> wrote:

I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

\$ g++ -O2 recip.cc && ./a.out --test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

andrew@...

On Tue, Jul 14, 2020 at 3:09 PM Bill Huffman <huffman@...> wrote:

Do we improve accuracy a bit if the step is:

t = 1.0 - r*x; x = x + t*x

t = 2.0 - r*x; x = t*x

Yes.

Bill

On 7/14/20 2:58 PM, Bill Huffman wrote:

On 7/14/20 2:54 PM, Andrew Waterman wrote:
EXTERNAL MAIL

On Tue, Jul 14, 2020 at 2:47 PM Bill Huffman <huffman@...> wrote:

On 7/14/20 2:30 PM, Andrew Waterman wrote:
EXTERNAL MAIL

On Tue, Jul 14, 2020 at 7:36 AM Grigorios Magklis <grigorios.magklis@...> wrote:
Hi Andrew et al,

Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.

If my calculations are not wrong, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.

I have a concern about not including special "step" instructions. The complete N-R sequences are as follows:

float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(-r, x, 2.0f);  // fma
x = x * t;                    // fmul
t = fmaf(-r, x, 2.0f);        // fma
x = x * t;                    // fmul
}
return x;
}

float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO)  {
float t = r * x;              // fmul
float h = 0.5 * x;            // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
t = r * x;                    // fmul
h = 0.5 * x;                  // fmul
t = fmaf(-t, x, 3.0f);        // fma
x = h * t;                    // fmul
}
return x;
}

As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.

Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.

Roughly, calculating the number of extra instructions in each sequence I get:
• save v0; 1 inst
• classify input; 1 inst
• compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
• materialize constant 2.0 for recip; 1 inst
• … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
• restore v0; 1 inst

This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two N-R iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.

If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):

A recip_step instruction that calculates 2-a*b with the following special case handling:
• One input is infinity and the other 0 => result is 2
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3-a*b)/2 with the following special case handling:
• One input is infinity and the other 0 => result is 1.5
• One input is infinity and the other NaN => result is infinity with the sign of -a*b (using only the sign of NaN)

With the above instructions, the N-R sequences are reduced to:

float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x);   // fma
x = x * t;                    // fmul
t = recip_step(r, x);         // fma
x = x * t;                    // fmul
return x;
}

float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x;              // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
t = r * x;                    // fmul
t = rsqrt_step(t, x);         // fma
x = x * t;                    // fmul
return x;
}

Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.

FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight.  I suppose a compromise might be to make them destructive.  This would have no perf. consequence for rsqrt, but would add one extra move instruction to all but the last reciprocal iteration step.

Seems like they shouldn't be so big as they don't specify rs3 at all.  Are we tight on two register input opcodes?

Actually, none of the vector instructions use the rs3 field (the vector FMAs are destructive to save encoding space).

There are still several R-type code points left in the vector opcode, but it has gotten a bit tight, and there will be stiff competition for those code points over the coming years.

Sorry, I had scalar on the mind....

Bill

Bill

For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three N-R iterations.

Thanks,
Grigorios

PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.

On 9 Jul 2020, at 10:12, Andrew Waterman <andrew@...> wrote:

I'm following up with detailed semantics in the form of a self-contained C++ program.  The `recip` and `rsqrt` functions model the proposed instructions.  When the program is invoked with the `--verilog` command-line argument, it prints Verilog code for the lookup tables.  When invoked with `--test` or `--test-long`, it self-validates and prints error information, e.g.:

\$ g++ -O2 recip.cc && ./a.out --test
max recip error on [0.5, 1]: 2^-7.4843
max rsqrt error on [0.25, 1]: 2^-7.31422

The core of both schemes is a 2^7-entry, 7-bit-wide LUT, whose entries are chosen to minimize the maximum error on the interval.

(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values.  For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)

Code is here; LMK if you spot any bugs: https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots.  These are both useful for -ffast-math code where it's acceptable to sacrifice a bit of precision in exchange for greater performance.  Reciprocal square root in particular is ubiquitous in graphics and physics computations.  These instructions can feed into Newton-Raphson schemes to provide refined estimates of quotients and square roots.

Following is a discussion of some design considerations leading to a concrete proposal.

Strict or bounded definition?

Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound.  A strict definition is obviously preferable for software portability and compliance testing.  A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).

The task group has recommended adopting a strict definition.  I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookup-table components of the instructions.

What precision to choose?

Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them.  So, I'm recommending using a 7-bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for Newton-Raphson refinement to converge reasonably quickly.

A common goal is to produce an estimate within a couple ulps of the IEEE 754 result.  With that in mind, a few precision sweet spots emerge:

- 4-bit estimates (which, due to careful rounding, actually have relative error of 2^-4.6 for reciprocal and 2^-4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
- 5-bit (2^-5.5, 2^-5.3) take 1, 1, 3, and 4 iterations, respectively.
- 6-bit (2^-6.4, 2^-6.3) take 0, 1, 2, and 4 iterations, respectively.
- 7-bit (2^-7.4, 2^-7.3) take 0, 1, 2, and 3 iterations, respectively.
- 9-bit (2^-9.4, 2^-9.2) take 0, 0, 2, and 3 iterations, respectively.
- 12-bit (2^-12.4, 2^-12.3) take 0, 0, 1, and 3 iterations, respectively.

All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations.  The 7-bit scheme is of interest because it's the cliff at which double-precision needs 3 iterations to converge.

Instruction encoding?

As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.

If, over time, more precise estimates are required, they can be accommodated in the same unary-operation encoding space.

Corner-case details?

For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754-compliant C implementation.  In particular:

- rsqrt(+inf) = +0
- rsqrt(+/- 0) = +/- inf, raising DZ exception
- rsqrt(nonzero negative) = qNaN, raising NV exception
- rsqrt(NaN) = qNaN, raising NV for sNaN inputs
- rsqrt(subnormal) is handled correctly; the result is normal and never overflows

- recip(+/- inf) = +/- 0
- recip(+/- 0) = +/- inf, raising DZ exception
- recip(NaN) = qNaN, raising NV if signaling
- recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception

Newton-Raphson iteration instructions?

Other ISAs have provided instructions that perform part of a Newton-Raphson iteration step, e.g., computing (3 - a * b) / 2.0 in a single instruction.  These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per Newton-Raphson loop, not one instruction per Newton-Raphson iteration.  (This is because the expression can be rewritten as (3/2 - a/2 * b), where a/2 is the same for all iteration steps.)

To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.

David Horner

Excellent observation.
at least 1 bits better. and when r*x close to 1, much better.

David Horner

The current LUT generator assumes N-by-N look up table.

I will load in my github Andrew's program modified to  take input (index size) and output (estimate number of bits) arguments.
(--verilog still generates LUT, and test generates the values below),

Of course, how well the table can be synthesized is more importance than LUT dimensions per se.
I have used yosys with varying results.

I continue to try to profile the accuracy within input segments rather than over total float range.

Other dimensions are possible with resultant increase of decrease in accuracy:
parameters are: output width / input width  array size

7/7 896
max recip error on [0.5, 1]: 2^-7.36951
max rsqrt error on [0.25, 1]: 2^-7.2998
8/7 1024
max recip error on [0.5, 1]: 2^-7.78083
max rsqrt error on [0.25, 1]: 2^-7.69943
9/7 1152
max recip error on [0.5, 1]: 2^-7.89148
max rsqrt error on [0.25, 1]: 2^-7.8783
10/7 1280
max recip error on [0.5, 1]: 2^-7.94879
max rsqrt error on [0.25, 1]: 2^-7.91063
11/7 1408 by 1
max recip error on [0.5, 1]: 2^-7.97629
max rsqrt error on [0.25, 1]: 2^-8
11/7 1408 by 0x1000
max recip error on [0.5, 1]: 2^-8
max rsqrt error on [0.25, 1]: 2^-8
6/8 1536
max recip error on [0.5, 1]: 2^-6.90724
max rsqrt error on [0.25, 1]: 2^-6.88438
7/8 1792
max recip error on [0.5, 1]: 2^-7.78083
max rsqrt error on [0.25, 1]: 2^-7.73897
8/8 2048 by 0X1000
max recip error on [0.5, 1]: 2^-8.45311
max rsqrt error on [0.25, 1]: 2^-8.36032
10/8 2560 by 0X1000
max recip error on [0.5, 1]: 2^-8.88626
max rsqrt error on [0.25, 1]: 2^-8.8425
10/8 2560 by 0X1
max recip error on [0.5, 1]: 2^-8.88626
max rsqrt error on [0.25, 1]: 2^-8.83164
12/8
max recip error on [0.5, 1]: 2^-8.98041
max rsqrt error on [0.25, 1]: 2^-8.9616
13/8 2560
max recip error on [0.5, 1]: 2^-9
max rsqrt error on [0.25, 1]: 2^-9
9/9
max recip error on [0.5, 1]: 2^-9.47252
max rsqrt error on [0.25, 1]: 2^-9.31953

observation:
7 input bits has a minimum max error of 2^-8.
8 input bits has a minimum max error of 2^-9.

Bill Huffman

David,

Are the max errors absolute? Or relative to the recip or rsqrt, which
is presumably in the range (1.0, 2.0]?

That you use [0.5, 1] when you might have meant [0.5, 1) leaves some
question about what is happening with the powers of two (even powers for
rsqrt). Hopefully, they're always precise and there's no issue of error
there.

Bill

On 7/31/20 4:33 AM, David Horner wrote:
EXTERNAL MAIL

The current LUT generator assumes N-by-N look up table.

I will load in my github Andrew's program modified to  take input (index
size) and output (estimate number of bits) arguments.
(--verilog still generates LUT, and test generates the values below),

Of course, how well the table can be synthesized is more importance than
LUT dimensions per se.
I have used yosys with varying results.

I continue to try to profile the accuracy within input segments rather
than over total float range.

Other dimensions are possible with resultant increase of decrease in
accuracy:
parameters are: output width / input width  array size

7/7 896
max recip error on [0.5, 1]: 2^-7.36951
max rsqrt error on [0.25, 1]: 2^-7.2998
8/7 1024
max recip error on [0.5, 1]: 2^-7.78083
max rsqrt error on [0.25, 1]: 2^-7.69943
9/7 1152
max recip error on [0.5, 1]: 2^-7.89148
max rsqrt error on [0.25, 1]: 2^-7.8783
10/7 1280
max recip error on [0.5, 1]: 2^-7.94879
max rsqrt error on [0.25, 1]: 2^-7.91063
11/7 1408 by 1
max recip error on [0.5, 1]: 2^-7.97629
max rsqrt error on [0.25, 1]: 2^-8
11/7 1408 by 0x1000
max recip error on [0.5, 1]: 2^-8
max rsqrt error on [0.25, 1]: 2^-8
6/8 1536
max recip error on [0.5, 1]: 2^-6.90724
max rsqrt error on [0.25, 1]: 2^-6.88438
7/8 1792
max recip error on [0.5, 1]: 2^-7.78083
max rsqrt error on [0.25, 1]: 2^-7.73897
8/8 2048 by 0X1000
max recip error on [0.5, 1]: 2^-8.45311
max rsqrt error on [0.25, 1]: 2^-8.36032
10/8 2560 by 0X1000
max recip error on [0.5, 1]: 2^-8.88626
max rsqrt error on [0.25, 1]: 2^-8.8425
10/8 2560 by 0X1
max recip error on [0.5, 1]: 2^-8.88626
max rsqrt error on [0.25, 1]: 2^-8.83164
12/8
max recip error on [0.5, 1]: 2^-8.98041
max rsqrt error on [0.25, 1]: 2^-8.9616
13/8 2560
max recip error on [0.5, 1]: 2^-9
max rsqrt error on [0.25, 1]: 2^-9
9/9
max recip error on [0.5, 1]: 2^-9.47252
max rsqrt error on [0.25, 1]: 2^-9.31953

observation:
7 input bits has a minimum max error of 2^-8.
8 input bits has a minimum max error of 2^-9.

David Horner

The error is relative error.
The calculation is unchanged from Andrew's original. (Although I explicitly force double even when it shouldn't matter).
The test range is from 0.5 to 1 inclusive.
Again I left Andrew's choices unchanged.
As you point out the 1 case should not contribute to max error in the reciprocal case as the error should be zero. The 1 case for rsqrt for odd powers of 2 exponent is non zero by definition as sqrt  2 is irrational.
Andrew provides test_long which tests all single precision values that are not NaN.

I hope to post my code soon.
I get a seg fault when no parm are provided on the command line.
Argh. I belive it is relates to handling the argv as a vector of values. It appears the support construct moves some code out of reates out of conditional scope and thus frees even when no explicit allocation is made.

F6

On Fri, Jul 31, 2020, 13:29 Bill Huffman, <huffman@...> wrote:
David,

Are the max errors absolute?  Or relative to the recip or rsqrt, which
is presumably in the range (1.0, 2.0]?

That you use [0.5, 1] when you might have meant [0.5, 1) leaves some
question about what is happening with the powers of two (even powers for
rsqrt).  Hopefully, they're always precise and there's no issue of error
there.

Bill

On 7/31/20 4:33 AM, David Horner wrote:
> EXTERNAL MAIL
>
>
> The current LUT generator assumes N-by-N look up table.
>
> I will load in my github Andrew's program modified to  take input (index
> size) and output (estimate number of bits) arguments.
>    (--verilog still generates LUT, and test generates the values below),
>
> Of course, how well the table can be synthesized is more importance than
> LUT dimensions per se.
> I have used yosys with varying results.
>
> I continue to try to profile the accuracy within input segments rather
> than over total float range.
>
>
> Other dimensions are possible with resultant increase of decrease in
> accuracy:
> parameters are: output width / input width  array size
>
> 7/7 896
> max recip error on [0.5, 1]: 2^-7.36951
> max rsqrt error on [0.25, 1]: 2^-7.2998
> 8/7 1024
> max recip error on [0.5, 1]: 2^-7.78083
> max rsqrt error on [0.25, 1]: 2^-7.69943
> 9/7 1152
> max recip error on [0.5, 1]: 2^-7.89148
> max rsqrt error on [0.25, 1]: 2^-7.8783
> 10/7 1280
> max recip error on [0.5, 1]: 2^-7.94879
> max rsqrt error on [0.25, 1]: 2^-7.91063
> 11/7 1408 by 1
> max recip error on [0.5, 1]: 2^-7.97629
> max rsqrt error on [0.25, 1]: 2^-8
> 11/7 1408 by 0x1000
> max recip error on [0.5, 1]: 2^-8
> max rsqrt error on [0.25, 1]: 2^-8
> 6/8 1536
> max recip error on [0.5, 1]: 2^-6.90724
> max rsqrt error on [0.25, 1]: 2^-6.88438
> 7/8 1792
> max recip error on [0.5, 1]: 2^-7.78083
> max rsqrt error on [0.25, 1]: 2^-7.73897
> 8/8 2048 by 0X1000
> max recip error on [0.5, 1]: 2^-8.45311
> max rsqrt error on [0.25, 1]: 2^-8.36032
> 10/8 2560 by 0X1000
> max recip error on [0.5, 1]: 2^-8.88626
> max rsqrt error on [0.25, 1]: 2^-8.8425
> 10/8 2560 by 0X1
> max recip error on [0.5, 1]: 2^-8.88626
> max rsqrt error on [0.25, 1]: 2^-8.83164
> 12/8
> max recip error on [0.5, 1]: 2^-8.98041
> max rsqrt error on [0.25, 1]: 2^-8.9616
> 13/8 2560
> max recip error on [0.5, 1]: 2^-9
> max rsqrt error on [0.25, 1]: 2^-9
> 9/9
> max recip error on [0.5, 1]: 2^-9.47252
> max rsqrt error on [0.25, 1]: 2^-9.31953
>
>
> observation:
> 7 input bits has a minimum max error of 2^-8.
> 8 input bits has a minimum max error of 2^-9.
>
>
>
>

Bill Huffman

David,

Because of the errors you get, I'm assuming your "output width" and "input width" do not include the hidden bit.  Right?

It's interesting.  I did a similar exercise a number of years ago and got a few hundredths of a bit better accuracy from 7/7 tables.  It's possible I did it wrong.  It's also possible that there's a slight improvement available.

If you want to send me the tables I can compare.  Mine are in decimal numbers from 128 to 255.  I could send you tables as well.

Bill

On 7/31/20 12:24 PM, David Horner wrote:

EXTERNAL MAIL

The error is relative error.
The calculation is unchanged from Andrew's original. (Although I explicitly force double even when it shouldn't matter).
The test range is from 0.5 to 1 inclusive.
Again I left Andrew's choices unchanged.
As you point out the 1 case should not contribute to max error in the reciprocal case as the error should be zero. The 1 case for rsqrt for odd powers of 2 exponent is non zero by definition as sqrt  2 is irrational.
Andrew provides test_long which tests all single precision values that are not NaN.

I hope to post my code soon.
I get a seg fault when no parm are provided on the command line.
Argh. I belive it is relates to handling the argv as a vector of values. It appears the support construct moves some code out of reates out of conditional scope and thus frees even when no explicit allocation is made.

F6

On Fri, Jul 31, 2020, 13:29 Bill Huffman, <huffman@...> wrote:
David,

Are the max errors absolute?  Or relative to the recip or rsqrt, which
is presumably in the range (1.0, 2.0]?

That you use [0.5, 1] when you might have meant [0.5, 1) leaves some
question about what is happening with the powers of two (even powers for
rsqrt).  Hopefully, they're always precise and there's no issue of error
there.

Bill

On 7/31/20 4:33 AM, David Horner wrote:
> EXTERNAL MAIL
>
>
> The current LUT generator assumes N-by-N look up table.
>
> I will load in my github Andrew's program modified to  take input (index
> size) and output (estimate number of bits) arguments.
>    (--verilog still generates LUT, and test generates the values below),
>
> Of course, how well the table can be synthesized is more importance than
> LUT dimensions per se.
> I have used yosys with varying results.
>
> I continue to try to profile the accuracy within input segments rather
> than over total float range.
>
>
> Other dimensions are possible with resultant increase of decrease in
> accuracy:
> parameters are: output width / input width  array size
>
> 7/7 896
> max recip error on [0.5, 1]: 2^-7.36951
> max rsqrt error on [0.25, 1]: 2^-7.2998
> 8/7 1024
> max recip error on [0.5, 1]: 2^-7.78083
> max rsqrt error on [0.25, 1]: 2^-7.69943
> 9/7 1152
> max recip error on [0.5, 1]: 2^-7.89148
> max rsqrt error on [0.25, 1]: 2^-7.8783
> 10/7 1280
> max recip error on [0.5, 1]: 2^-7.94879
> max rsqrt error on [0.25, 1]: 2^-7.91063
> 11/7 1408 by 1
> max recip error on [0.5, 1]: 2^-7.97629
> max rsqrt error on [0.25, 1]: 2^-8
> 11/7 1408 by 0x1000
> max recip error on [0.5, 1]: 2^-8
> max rsqrt error on [0.25, 1]: 2^-8
> 6/8 1536
> max recip error on [0.5, 1]: 2^-6.90724
> max rsqrt error on [0.25, 1]: 2^-6.88438
> 7/8 1792
> max recip error on [0.5, 1]: 2^-7.78083
> max rsqrt error on [0.25, 1]: 2^-7.73897
> 8/8 2048 by 0X1000
> max recip error on [0.5, 1]: 2^-8.45311
> max rsqrt error on [0.25, 1]: 2^-8.36032
> 10/8 2560 by 0X1000
> max recip error on [0.5, 1]: 2^-8.88626
> max rsqrt error on [0.25, 1]: 2^-8.8425
> 10/8 2560 by 0X1
> max recip error on [0.5, 1]: 2^-8.88626
> max rsqrt error on [0.25, 1]: 2^-8.83164
> 12/8
> max recip error on [0.5, 1]: 2^-8.98041
> max rsqrt error on [0.25, 1]: 2^-8.9616
> 13/8 2560
> max recip error on [0.5, 1]: 2^-9
> max rsqrt error on [0.25, 1]: 2^-9
> 9/9
> max recip error on [0.5, 1]: 2^-9.47252
> max rsqrt error on [0.25, 1]: 2^-9.31953
>
>
> observation:
> 7 input bits has a minimum max error of 2^-8.
> 8 input bits has a minimum max error of 2^-9.
>
>
>
>

David Horner

This is the program Andrew wrote.
https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On 2020-07-31 4:46 p.m., Bill Huffman wrote:

David,

Because of the errors you get, I'm assuming your "output width" and "input width" do not include the hidden bit.  Right?

That is correct, Andrew's approach assumes the implicit high hidden bit.

It's interesting.  I did a similar exercise a number of years ago and got a few hundredths of a bit better accuracy from 7/7 tables.  It's possible I did it wrong.  It's also possible that there's a slight improvement available.

Andrew chose a range from [xn , (x+1)n) perhaps (xn,(x+1)n] will work better.
I will give it a try.

If you want to send me the tables I can compare.  Mine are in decimal numbers from 128 to 255.  I could send you tables as well.

As mentioned Andrew's --verilog directive creates a table;  both input and output in range from 0 to 127.
I wouldn't expect the bias to make any significant difference.

I'd be happy to see your tables.
If you want I will send Andrew's program's output for 7x7.
And any of the other listed combinations from my mods to his program.
I will post my mods even though I still get that seg fault with null (or single) command line args.
Then you could run your own.

Bill

On 7/31/20 12:24 PM, David Horner wrote:
EXTERNAL MAIL

The error is relative error.
The calculation is unchanged from Andrew's original. (Although I explicitly force double even when it shouldn't matter).
The test range is from 0.5 to 1 inclusive.
Again I left Andrew's choices unchanged.
As you point out the 1 case should not contribute to max error in the reciprocal case as the error should be zero. The 1 case for rsqrt for odd powers of 2 exponent is non zero by definition as sqrt  2 is irrational.
Andrew provides test_long which tests all single precision values that are not NaN.

I hope to post my code soon.
I get a seg fault when no parm are provided on the command line.
Argh. I belive it is relates to handling the argv as a vector of values. It appears the support construct moves some code out of reates out of conditional scope and thus frees even when no explicit allocation is made.

F6

On Fri, Jul 31, 2020, 13:29 Bill Huffman, <huffman@...> wrote:
David,

Are the max errors absolute?  Or relative to the recip or rsqrt, which
is presumably in the range (1.0, 2.0]?

That you use [0.5, 1] when you might have meant [0.5, 1) leaves some
question about what is happening with the powers of two (even powers for
rsqrt).  Hopefully, they're always precise and there's no issue of error
there.

Bill

On 7/31/20 4:33 AM, David Horner wrote:
> EXTERNAL MAIL
>
>
> The current LUT generator assumes N-by-N look up table.
>
> I will load in my github Andrew's program modified to  take input (index
> size) and output (estimate number of bits) arguments.
>    (--verilog still generates LUT, and test generates the values below),
>
> Of course, how well the table can be synthesized is more importance than
> LUT dimensions per se.
> I have used yosys with varying results.
>
> I continue to try to profile the accuracy within input segments rather
> than over total float range.
>
>
> Other dimensions are possible with resultant increase of decrease in
> accuracy:
> parameters are: output width / input width  array size
>
> 7/7 896
> max recip error on [0.5, 1]: 2^-7.36951
> max rsqrt error on [0.25, 1]: 2^-7.2998
> 8/7 1024
> max recip error on [0.5, 1]: 2^-7.78083
> max rsqrt error on [0.25, 1]: 2^-7.69943
> 9/7 1152
> max recip error on [0.5, 1]: 2^-7.89148
> max rsqrt error on [0.25, 1]: 2^-7.8783
> 10/7 1280
> max recip error on [0.5, 1]: 2^-7.94879
> max rsqrt error on [0.25, 1]: 2^-7.91063
> 11/7 1408 by 1
> max recip error on [0.5, 1]: 2^-7.97629
> max rsqrt error on [0.25, 1]: 2^-8
> 11/7 1408 by 0x1000
> max recip error on [0.5, 1]: 2^-8
> max rsqrt error on [0.25, 1]: 2^-8
> 6/8 1536
> max recip error on [0.5, 1]: 2^-6.90724
> max rsqrt error on [0.25, 1]: 2^-6.88438
> 7/8 1792
> max recip error on [0.5, 1]: 2^-7.78083
> max rsqrt error on [0.25, 1]: 2^-7.73897
> 8/8 2048 by 0X1000
> max recip error on [0.5, 1]: 2^-8.45311
> max rsqrt error on [0.25, 1]: 2^-8.36032
> 10/8 2560 by 0X1000
> max recip error on [0.5, 1]: 2^-8.88626
> max rsqrt error on [0.25, 1]: 2^-8.8425
> 10/8 2560 by 0X1
> max recip error on [0.5, 1]: 2^-8.88626
> max rsqrt error on [0.25, 1]: 2^-8.83164
> 12/8
> max recip error on [0.5, 1]: 2^-8.98041
> max rsqrt error on [0.25, 1]: 2^-8.9616
> 13/8 2560
> max recip error on [0.5, 1]: 2^-9
> max rsqrt error on [0.25, 1]: 2^-9
> 9/9
> max recip error on [0.5, 1]: 2^-9.47252
> max rsqrt error on [0.25, 1]: 2^-9.31953
>
>
> observation:
> 7 input bits has a minimum max error of 2^-8.
> 8 input bits has a minimum max error of 2^-9.
>
>
>
>

Bill Huffman

David,

Here are a series of statements leading to my worst case answer:

• For the mantissa range 0xF5_0000 to 0xF5_FFFF, the reciprocal estimate is 0x85_0000
• The largest error is for 0xF5_0000
• The reciprocal of 0xF5_0000, to "infinite" precision is 0x85_BF37.612D
• The relative error, then, is (0x85_0000 - 0x85_BF37.612D)/0x85_BF37.612D => -0x0.016E_0000_0022
• The log2 of the absolute value of that error is: -7.484300

I don't have any errors as large as to have a log2 of -7.36951.  Where did that error come from for you?

Bill

On 7/31/20 2:25 PM, DSHORNER wrote:

EXTERNAL MAIL

This is the program Andrew wrote.
https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On 2020-07-31 4:46 p.m., Bill Huffman wrote:

David,

Because of the errors you get, I'm assuming your "output width" and "input width" do not include the hidden bit.  Right?

That is correct, Andrew's approach assumes the implicit high hidden bit.

It's interesting.  I did a similar exercise a number of years ago and got a few hundredths of a bit better accuracy from 7/7 tables.  It's possible I did it wrong.  It's also possible that there's a slight improvement available.

Andrew chose a range from [xn , (x+1)n) perhaps (xn,(x+1)n] will work better.
I will give it a try.

If you want to send me the tables I can compare.  Mine are in decimal numbers from 128 to 255.  I could send you tables as well.

As mentioned Andrew's --verilog directive creates a table;  both input and output in range from 0 to 127.
I wouldn't expect the bias to make any significant difference.

I'd be happy to see your tables.
If you want I will send Andrew's program's output for 7x7.
And any of the other listed combinations from my mods to his program.
I will post my mods even though I still get that seg fault with null (or single) command line args.
Then you could run your own.

Bill

On 7/31/20 12:24 PM, David Horner wrote:
EXTERNAL MAIL

The error is relative error.
The calculation is unchanged from Andrew's original. (Although I explicitly force double even when it shouldn't matter).
The test range is from 0.5 to 1 inclusive.
Again I left Andrew's choices unchanged.
As you point out the 1 case should not contribute to max error in the reciprocal case as the error should be zero. The 1 case for rsqrt for odd powers of 2 exponent is non zero by definition as sqrt  2 is irrational.
Andrew provides test_long which tests all single precision values that are not NaN.

I hope to post my code soon.
I get a seg fault when no parm are provided on the command line.
Argh. I belive it is relates to handling the argv as a vector of values. It appears the support construct moves some code out of reates out of conditional scope and thus frees even when no explicit allocation is made.

F6

On Fri, Jul 31, 2020, 13:29 Bill Huffman, <huffman@...> wrote:
David,

Are the max errors absolute?  Or relative to the recip or rsqrt, which
is presumably in the range (1.0, 2.0]?

That you use [0.5, 1] when you might have meant [0.5, 1) leaves some
question about what is happening with the powers of two (even powers for
rsqrt).  Hopefully, they're always precise and there's no issue of error
there.

Bill

On 7/31/20 4:33 AM, David Horner wrote:
> EXTERNAL MAIL
>
>
> The current LUT generator assumes N-by-N look up table.
>
> I will load in my github Andrew's program modified to  take input (index
> size) and output (estimate number of bits) arguments.
>    (--verilog still generates LUT, and test generates the values below),
>
> Of course, how well the table can be synthesized is more importance than
> LUT dimensions per se.
> I have used yosys with varying results.
>
> I continue to try to profile the accuracy within input segments rather
> than over total float range.
>
>
> Other dimensions are possible with resultant increase of decrease in
> accuracy:
> parameters are: output width / input width  array size
>
> 7/7 896
> max recip error on [0.5, 1]: 2^-7.36951
> max rsqrt error on [0.25, 1]: 2^-7.2998
> 8/7 1024
> max recip error on [0.5, 1]: 2^-7.78083
> max rsqrt error on [0.25, 1]: 2^-7.69943
> 9/7 1152
> max recip error on [0.5, 1]: 2^-7.89148
> max rsqrt error on [0.25, 1]: 2^-7.8783
> 10/7 1280
> max recip error on [0.5, 1]: 2^-7.94879
> max rsqrt error on [0.25, 1]: 2^-7.91063
> 11/7 1408 by 1
> max recip error on [0.5, 1]: 2^-7.97629
> max rsqrt error on [0.25, 1]: 2^-8
> 11/7 1408 by 0x1000
> max recip error on [0.5, 1]: 2^-8
> max rsqrt error on [0.25, 1]: 2^-8
> 6/8 1536
> max recip error on [0.5, 1]: 2^-6.90724
> max rsqrt error on [0.25, 1]: 2^-6.88438
> 7/8 1792
> max recip error on [0.5, 1]: 2^-7.78083
> max rsqrt error on [0.25, 1]: 2^-7.73897
> 8/8 2048 by 0X1000
> max recip error on [0.5, 1]: 2^-8.45311
> max rsqrt error on [0.25, 1]: 2^-8.36032
> 10/8 2560 by 0X1000
> max recip error on [0.5, 1]: 2^-8.88626
> max rsqrt error on [0.25, 1]: 2^-8.8425
> 10/8 2560 by 0X1
> max recip error on [0.5, 1]: 2^-8.88626
> max rsqrt error on [0.25, 1]: 2^-8.83164
> 12/8
> max recip error on [0.5, 1]: 2^-8.98041
> max rsqrt error on [0.25, 1]: 2^-8.9616
> 13/8 2560
> max recip error on [0.5, 1]: 2^-9
> max rsqrt error on [0.25, 1]: 2^-9
> 9/9
> max recip error on [0.5, 1]: 2^-9.47252
> max rsqrt error on [0.25, 1]: 2^-9.31953
>
>
> observation:
> 7 input bits has a minimum max error of 2^-8.
> 8 input bits has a minimum max error of 2^-9.
>
>
>
>

David Horner

What I initially posted was a compilation from days previously, and I pulled in some bogus test results.
Here is a fresh run :
./a.out 7 5  ;./a.out 7 6  ;./a.out 7 7  ;./a.out 7 8  ;./a.out 7 9  ;./a.out 7 10  ;./a.out 7 11  ;./a.out 8 7  ;./a.out 8  ;./a.out 8 9  ;./a.out 9  ;
ip 7 op 5 LUT #bits 640 verilog 0  test/test-long 1
max recip 7x5 error: 2^-5.89148
max rsqrt 7x5 error: 2^-5.98208
ip 7 op 6 LUT #bits 768 verilog 0  test/test-long 1
max recip 7x6 error: 2^-6.79055
max rsqrt 7x6 error: 2^-6.73312
ip 7 op 7 LUT #bits 896 verilog 0  test/test-long 1
max recip 7x7 error: 2^-7.4843
max rsqrt 7x7 error: 2^-7.31422
ip 7 op 8 LUT #bits 1024 verilog 0  test/test-long 1
max recip 7x8 error: 2^-7.77603
max rsqrt 7x8 error: 2^-7.6318
ip 7 op 9 LUT #bits 1152 verilog 0  test/test-long 1
max recip 7x9 error: 2^-7.8889
max rsqrt 7x9 error: 2^-7.87831
ip 7 op 10 LUT #bits 1280 verilog 0  test/test-long 1
max recip 7x10 error: 2^-7.94879
max rsqrt 7x10 error: 2^-7.89712
ip 7 op 11 LUT #bits 1408 verilog 0  test/test-long 1
max recip 7x11 error: 2^-7.97629
max rsqrt 7x11 error: 2^-8
ip 8 op 7 LUT #bits 1792 verilog 0  test/test-long 1
max recip 8x7 error: 2^-7.77602
max rsqrt 8x7 error: 2^-7.72555
estimate width, op=0, out of range reset to default
ip 8 op 8 LUT #bits 2048 verilog 0  test/test-long 1
max recip 8x8 error: 2^-8.45311
max rsqrt 8x8 error: 2^-8.25349
ip 8 op 9 LUT #bits 2304 verilog 0  test/test-long 1
max recip 8x9 error: 2^-8.71923
max rsqrt 8x9 error: 2^-8.67807
estimate width, op=0, out of range reset to default
ip 9 op 9 LUT #bits 4608 verilog 0  test/test-long 1
max recip 9x9 error: 2^-9.43021
max rsqrt 9x9 error: 2^-9.28082

On 2020-08-01 12:27 a.m., Bill Huffman wrote:

David,

Here are a series of statements leading to my worst case answer:

• For the mantissa range 0xF5_0000 to 0xF5_FFFF, the reciprocal estimate is 0x85_0000
• The largest error is for 0xF5_0000
• The reciprocal of 0xF5_0000, to "infinite" precision is 0x85_BF37.612D
• The relative error, then, is (0x85_0000 - 0x85_BF37.612D)/0x85_BF37.612D => -0x0.016E_0000_0022
• The log2 of the absolute value of that error is: -7.484300

I don't have any errors as large as to have a log2 of -7.36951.  Where did that error come from for you?

It came from some testing I was performing on adjacent index values.
Completely bogus as I mentioned above.

I will instrument the code for more details, but I suspect this code has exactly the same worst case situation (for 7x7).

Bill
On 7/31/20 2:25 PM, DSHORNER wrote:
EXTERNAL MAIL

This is the program Andrew wrote.
https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On 2020-07-31 4:46 p.m., Bill Huffman wrote:

David,

Because of the errors you get, I'm assuming your "output width" and "input width" do not include the hidden bit.  Right?

That is correct, Andrew's approach assumes the implicit high hidden bit.

It's interesting.  I did a similar exercise a number of years ago and got a few hundredths of a bit better accuracy from 7/7 tables.  It's possible I did it wrong.  It's also possible that there's a slight improvement available.

Andrew chose a range from [xn , (x+1)n) perhaps (xn,(x+1)n] will work better.
I will give it a try.

If you want to send me the tables I can compare.  Mine are in decimal numbers from 128 to 255.  I could send you tables as well.

As mentioned Andrew's --verilog directive creates a table;  both input and output in range from 0 to 127.
I wouldn't expect the bias to make any significant difference.

I'd be happy to see your tables.
If you want I will send Andrew's program's output for 7x7.
And any of the other listed combinations from my mods to his program.
I will post my mods even though I still get that seg fault with null (or single) command line args.
Then you could run your own.

David Horner

This is the link to the revised code that does n by m LUT

https://github.com/David-Horner/recip/blob/master/vrecip.cc

On 2020-08-01 4:51 p.m., David Horner via lists.riscv.org wrote:

What I initially posted was a compilation from days previously, and I pulled in some bogus test results.
Here is a fresh run :
./a.out 7 5  ;./a.out 7 6  ;./a.out 7 7  ;./a.out 7 8  ;./a.out 7 9  ;./a.out 7 10  ;./a.out 7 11  ;./a.out 8 7  ;./a.out 8  ;./a.out 8 9  ;./a.out 9  ;
ip 7 op 5 LUT #bits 640 verilog 0  test/test-long 1
max recip 7x5 error: 2^-5.89148
max rsqrt 7x5 error: 2^-5.98208
ip 7 op 6 LUT #bits 768 verilog 0  test/test-long 1
max recip 7x6 error: 2^-6.79055
max rsqrt 7x6 error: 2^-6.73312
ip 7 op 7 LUT #bits 896 verilog 0  test/test-long 1
max recip 7x7 error: 2^-7.4843
max rsqrt 7x7 error: 2^-7.31422
ip 7 op 8 LUT #bits 1024 verilog 0  test/test-long 1
max recip 7x8 error: 2^-7.77603
max rsqrt 7x8 error: 2^-7.6318
ip 7 op 9 LUT #bits 1152 verilog 0  test/test-long 1
max recip 7x9 error: 2^-7.8889
max rsqrt 7x9 error: 2^-7.87831
ip 7 op 10 LUT #bits 1280 verilog 0  test/test-long 1
max recip 7x10 error: 2^-7.94879
max rsqrt 7x10 error: 2^-7.89712
ip 7 op 11 LUT #bits 1408 verilog 0  test/test-long 1
max recip 7x11 error: 2^-7.97629
max rsqrt 7x11 error: 2^-8
ip 8 op 7 LUT #bits 1792 verilog 0  test/test-long 1
max recip 8x7 error: 2^-7.77602
max rsqrt 8x7 error: 2^-7.72555
estimate width, op=0, out of range reset to default
ip 8 op 8 LUT #bits 2048 verilog 0  test/test-long 1
max recip 8x8 error: 2^-8.45311
max rsqrt 8x8 error: 2^-8.25349
ip 8 op 9 LUT #bits 2304 verilog 0  test/test-long 1
max recip 8x9 error: 2^-8.71923
max rsqrt 8x9 error: 2^-8.67807
estimate width, op=0, out of range reset to default
ip 9 op 9 LUT #bits 4608 verilog 0  test/test-long 1
max recip 9x9 error: 2^-9.43021
max rsqrt 9x9 error: 2^-9.28082

On 2020-08-01 12:27 a.m., Bill Huffman wrote:

David,

Here are a series of statements leading to my worst case answer:

• For the mantissa range 0xF5_0000 to 0xF5_FFFF, the reciprocal estimate is 0x85_0000
• The largest error is for 0xF5_0000
• The reciprocal of 0xF5_0000, to "infinite" precision is 0x85_BF37.612D
• The relative error, then, is (0x85_0000 - 0x85_BF37.612D)/0x85_BF37.612D => -0x0.016E_0000_0022
• The log2 of the absolute value of that error is: -7.484300

I don't have any errors as large as to have a log2 of -7.36951.  Where did that error come from for you?

It came from some testing I was performing on adjacent index values.
Completely bogus as I mentioned above.

I will instrument the code for more details, but I suspect this code has exactly the same worst case situation (for 7x7).

Bill
On 7/31/20 2:25 PM, DSHORNER wrote:
EXTERNAL MAIL

This is the program Andrew wrote.
https://github.com/riscv/riscv-v-spec/blob/vfrecip/recip.cc

On 2020-07-31 4:46 p.m., Bill Huffman wrote:

David,

Because of the errors you get, I'm assuming your "output width" and "input width" do not include the hidden bit.  Right?

That is correct, Andrew's approach assumes the implicit high hidden bit.

It's interesting.  I did a similar exercise a number of years ago and got a few hundredths of a bit better accuracy from 7/7 tables.  It's possible I did it wrong.  It's also possible that there's a slight improvement available.

Andrew chose a range from [xn , (x+1)n) perhaps (xn,(x+1)n] will work better.
I will give it a try.

If you want to send me the tables I can compare.  Mine are in decimal numbers from 128 to 255.  I could send you tables as well.

As mentioned Andrew's --verilog directive creates a table;  both input and output in range from 0 to 127.
I wouldn't expect the bias to make any significant difference.

I'd be happy to see your tables.
If you want I will send Andrew's program's output for 7x7.
And any of the other listed combinations from my mods to his program.
I will post my mods even though I still get that seg fault with null (or single) command line args.
Then you could run your own.

 1 - 20 of 51