The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots. Following is a discussion of some design considerations leading to a concrete proposal. Strict or bounded definition? Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x). The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions. What precision to choose? Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly. A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:  4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.  5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.  6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.  7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.  9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.  12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively. All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge. Instruction encoding? As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0  rsqrt(+/ 0) = +/ inf, raising DZ exception  rsqrt(nonzero negative) = qNaN, raising NV exception  rsqrt(NaN) = qNaN, raising NV for sNaN inputs  rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0  recip(+/ 0) = +/ inf, raising DZ exception  recip(NaN) = qNaN, raising NV if signaling  recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


I'm following up with detailed semantics in the form of a selfcontained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` commandline argument, it prints Verilog code for the lookup tables. When invoked with `test` or `testlong`, it selfvalidates and prints error information, e.g.:
$ g++ O2 recip.cc && ./a.out test max recip error on [0.5, 1]: 2^7.4843 max rsqrt error on [0.25, 1]: 2^7.31422
The core of both schemes is a 2^7entry, 7bitwide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)
toggle quoted message
Show quoted text
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman < andrew@...> wrote: The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots. Following is a discussion of some design considerations leading to a concrete proposal. Strict or bounded definition? Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x). The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions. What precision to choose? Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly. A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:  4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.  5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.  6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.  7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.  9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.  12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively. All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge. Instruction encoding? As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0  rsqrt(+/ 0) = +/ inf, raising DZ exception  rsqrt(nonzero negative) = qNaN, raising NV exception  rsqrt(NaN) = qNaN, raising NV for sNaN inputs  rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0  recip(+/ 0) = +/ inf, raising DZ exception  recip(NaN) = qNaN, raising NV if signaling  recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


toggle quoted message
Show quoted text
On Thu, Jul 9, 2020 at 1:12 AM Andrew Waterman <andrew@...> wrote: I'm following up with detailed semantics in the form of a selfcontained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` commandline argument, it prints Verilog code for the lookup tables. When invoked with `test` or `testlong`, it selfvalidates and prints error information, e.g.:
$ g++ O2 recip.cc && ./a.out test max recip error on [0.5, 1]: 2^7.4843 max rsqrt error on [0.25, 1]: 2^7.31422
The core of both schemes is a 2^7entry, 7bitwide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)
Code is here; LMK if you spot any bugs: https://github.com/riscv/riscvvspec/blob/vfrecip/recip.cc
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman <andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots.
Following is a discussion of some design considerations leading to a concrete proposal.
Strict or bounded definition?
Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).
The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions.
What precision to choose?
Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly.
A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:
 4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.  5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.  6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.  7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.  9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.  12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively.
All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge.
Instruction encoding?
As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0  rsqrt(+/ 0) = +/ inf, raising DZ exception  rsqrt(nonzero negative) = qNaN, raising NV exception  rsqrt(NaN) = qNaN, raising NV for sNaN inputs  rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0  recip(+/ 0) = +/ inf, raising DZ exception  recip(NaN) = qNaN, raising NV if signaling  recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


Hi Andrew et al,
Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete NewtonRaphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.
If my calculations are not wrong, with a 7bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.
I have a concern about not including special "step" instructions. The complete NR sequences are as follows:
float newton_raphson_recip(float r) { float x = recip(r); int k = fpclassify(x); if (k != FP_INFINITE && k != FP_ZERO) { float t = fmaf(r, x, 2.0f); // fma x = x * t; // fmul t = fmaf(r, x, 2.0f); // fma x = x * t; // fmul } return x; }
float newton_raphson_rsqrt(float r) { float x = rsqrt(r); int k = fpclassify(r); if (k != FP_INFINITE && k != FP_ZERO) { float t = r * x; // fmul float h = 0.5 * x; // fmul t = fmaf(t, x, 3.0f); // fma x = h * t; // fmul t = r * x; // fmul h = 0.5 * x; // fmul t = fmaf(t, x, 3.0f); // fma x = h * t; // fmul } return x; }
As you can see above, even with proper cornercase handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.
Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.
Roughly, calculating the number of extra instructions in each sequence I get:  save v0; 1 inst
 classify input; 1 inst
 compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
 materialize constant 2.0 for recip; 1 inst
 … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
 restore v0; 1 inst
This is 67 extra instructions for recip and 78 extra instructions for rsqrt. The two NR iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 1112 instructions for recip and 1617 instructions for rsqrt. Apart from the extra instructions, the sequences would require 35 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (79 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.
If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):
A recip_step instruction that calculates 2a*b with the following special case handling:  One input is infinity and the other 0 => result is 2
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3a*b)/2 with the following special case handling:  One input is infinity and the other 0 => result is 1.5
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
With the above instructions, the NR sequences are reduced to:
float newton_raphson_recip_with_step(float r) { float x = recip(r); float t = recip_step(r, x); // fma x = x * t; // fmul t = recip_step(r, x); // fma x = x * t; // fmul return x; }
float newton_raphson_rsqrt_with_step(float r) { float x = rsqrt(r); float t = r * x; // fmul t = rsqrt_step(t, x); // fma x = x * t; // fmul t = r * x; // fmul t = rsqrt_step(t, x); // fma x = x * t; // fmul return x; }
Which is a total of 5 instructions for recip (vs 1112 without recip_step) and 7 instructions for rsqrt (vs 1617 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.
For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three NR iterations.
Thanks, Grigorios
PS: I also tried the optimization that you propose of using (3/2  a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.
toggle quoted message
Show quoted text
On 9 Jul 2020, at 10:12, Andrew Waterman < andrew@...> wrote:
I'm following up with detailed semantics in the form of a selfcontained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` commandline argument, it prints Verilog code for the lookup tables. When invoked with `test` or `testlong`, it selfvalidates and prints error information, e.g.:
$ g++ O2 recip.cc && ./a.out test max recip error on [0.5, 1]: 2^7.4843 max rsqrt error on [0.25, 1]: 2^7.31422
The core of both schemes is a 2^7entry, 7bitwide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman < andrew@...> wrote: The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots. Following is a discussion of some design considerations leading to a concrete proposal. Strict or bounded definition? Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x). The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions. What precision to choose? Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly. A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:  4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.  5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.  6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.  7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.  9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.  12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively. All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge. Instruction encoding? As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0  rsqrt(+/ 0) = +/ inf, raising DZ exception  rsqrt(nonzero negative) = qNaN, raising NV exception  rsqrt(NaN) = qNaN, raising NV for sNaN inputs  rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0  recip(+/ 0) = +/ inf, raising DZ exception  recip(NaN) = qNaN, raising NV if signaling  recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


Hi Andrew et al,
Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete NewtonRaphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.
If my calculations are not wrong, with a 7bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.
I have a concern about not including special "step" instructions. The complete NR sequences are as follows:
float newton_raphson_recip(float r) { float x = recip(r); int k = fpclassify(x); if (k != FP_INFINITE && k != FP_ZERO) { float t = fmaf(r, x, 2.0f); // fma x = x * t; // fmul t = fmaf(r, x, 2.0f); // fma x = x * t; // fmul } return x; }
float newton_raphson_rsqrt(float r) { float x = rsqrt(r); int k = fpclassify(r); if (k != FP_INFINITE && k != FP_ZERO) { float t = r * x; // fmul float h = 0.5 * x; // fmul t = fmaf(t, x, 3.0f); // fma x = h * t; // fmul t = r * x; // fmul h = 0.5 * x; // fmul t = fmaf(t, x, 3.0f); // fma x = h * t; // fmul } return x; }
As you can see above, even with proper cornercase handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.
Handling the zero and infinity input cases is a bit faster than you estimated, since you don't need to explicitly classify the input. The approach is to generate the mask by comparing the input against 0.0, then estimate the reciprocal, then compare the estimate against 0.0 (which handles +inf inputs, and also handles inf inputs for reciprocal, since the comparison against 0.0 also matches 0.0). The second comparison is performed under the mask and overwrites the mask, effectively ANDing it in.
Of course, a twoinstruction savings doesn't nullify your argument, but it does tweak the tradeoffs slightly.
Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.
Roughly, calculating the number of extra instructions in each sequence I get:  save v0; 1 inst
 classify input; 1 inst
 compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
 materialize constant 2.0 for recip; 1 inst
 … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
 restore v0; 1 inst
This is 67 extra instructions for recip and 78 extra instructions for rsqrt. The two NR iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 1112 instructions for recip and 1617 instructions for rsqrt. Apart from the extra instructions, the sequences would require 35 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (79 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.
If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):
A recip_step instruction that calculates 2a*b with the following special case handling:  One input is infinity and the other 0 => result is 2
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3a*b)/2 with the following special case handling:  One input is infinity and the other 0 => result is 1.5
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
With the above instructions, the NR sequences are reduced to:
float newton_raphson_recip_with_step(float r) { float x = recip(r); float t = recip_step(r, x); // fma x = x * t; // fmul t = recip_step(r, x); // fma x = x * t; // fmul return x; }
float newton_raphson_rsqrt_with_step(float r) { float x = rsqrt(r); float t = r * x; // fmul t = rsqrt_step(t, x); // fma x = x * t; // fmul t = r * x; // fmul t = rsqrt_step(t, x); // fma x = x * t; // fmul return x; }
Which is a total of 5 instructions for recip (vs 1112 without recip_step) and 7 instructions for rsqrt (vs 1617 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.
For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three NR iterations.
Thanks, Grigorios
PS: I also tried the optimization that you propose of using (3/2  a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.
Halving the input discards one bit of precision for subnormal inputs, so that checks out. It's presumably only severe for the smallest subnormals; whether that is acceptable is obviously applicationdependent.
On 9 Jul 2020, at 10:12, Andrew Waterman < andrew@...> wrote:
I'm following up with detailed semantics in the form of a selfcontained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` commandline argument, it prints Verilog code for the lookup tables. When invoked with `test` or `testlong`, it selfvalidates and prints error information, e.g.:
$ g++ O2 recip.cc && ./a.out test max recip error on [0.5, 1]: 2^7.4843 max rsqrt error on [0.25, 1]: 2^7.31422
The core of both schemes is a 2^7entry, 7bitwide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman < andrew@...> wrote: The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots. Following is a discussion of some design considerations leading to a concrete proposal. Strict or bounded definition? Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x). The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions. What precision to choose? Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly. A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:  4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.  5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.  6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.  7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.  9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.  12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively. All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge. Instruction encoding? As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0  rsqrt(+/ 0) = +/ inf, raising DZ exception  rsqrt(nonzero negative) = qNaN, raising NV exception  rsqrt(NaN) = qNaN, raising NV for sNaN inputs  rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0  recip(+/ 0) = +/ inf, raising DZ exception  recip(NaN) = qNaN, raising NV if signaling  recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


Hi Andrew et al,
Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete NewtonRaphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.
If my calculations are not wrong, with a 7bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.
I have a concern about not including special "step" instructions. The complete NR sequences are as follows:
float newton_raphson_recip(float r) { float x = recip(r); int k = fpclassify(x); if (k != FP_INFINITE && k != FP_ZERO) { float t = fmaf(r, x, 2.0f); // fma x = x * t; // fmul t = fmaf(r, x, 2.0f); // fma x = x * t; // fmul } return x; }
float newton_raphson_rsqrt(float r) { float x = rsqrt(r); int k = fpclassify(r); if (k != FP_INFINITE && k != FP_ZERO) { float t = r * x; // fmul float h = 0.5 * x; // fmul t = fmaf(t, x, 3.0f); // fma x = h * t; // fmul t = r * x; // fmul h = 0.5 * x; // fmul t = fmaf(t, x, 3.0f); // fma x = h * t; // fmul } return x; }
As you can see above, even with proper cornercase handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.
Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence.
Roughly, calculating the number of extra instructions in each sequence I get:  save v0; 1 inst
 classify input; 1 inst
 compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
 materialize constant 2.0 for recip; 1 inst
 … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
 restore v0; 1 inst
This is 67 extra instructions for recip and 78 extra instructions for rsqrt. The two NR iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 1112 instructions for recip and 1617 instructions for rsqrt. Apart from the extra instructions, the sequences would require 35 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (79 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.
If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file):
A recip_step instruction that calculates 2a*b with the following special case handling:  One input is infinity and the other 0 => result is 2
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3a*b)/2 with the following special case handling:  One input is infinity and the other 0 => result is 1.5
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
With the above instructions, the NR sequences are reduced to:
float newton_raphson_recip_with_step(float r) { float x = recip(r); float t = recip_step(r, x); // fma x = x * t; // fmul t = recip_step(r, x); // fma x = x * t; // fmul return x; }
float newton_raphson_rsqrt_with_step(float r) { float x = rsqrt(r); float t = r * x; // fmul t = rsqrt_step(t, x); // fma x = x * t; // fmul t = r * x; // fmul t = rsqrt_step(t, x); // fma x = x * t; // fmul return x; }
Which is a total of 5 instructions for recip (vs 1112 without recip_step) and 7 instructions for rsqrt (vs 1617 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values.
FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight. I suppose a compromise might be to make them destructive. This would have no perf. consequence for rsqrt, but would add one extra move instruction to all but the last reciprocal iteration step.
For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three NR iterations.
Thanks, Grigorios
PS: I also tried the optimization that you propose of using (3/2  a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.
On 9 Jul 2020, at 10:12, Andrew Waterman < andrew@...> wrote:
I'm following up with detailed semantics in the form of a selfcontained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` commandline argument, it prints Verilog code for the lookup tables. When invoked with `test` or `testlong`, it selfvalidates and prints error information, e.g.:
$ g++ O2 recip.cc && ./a.out test max recip error on [0.5, 1]: 2^7.4843 max rsqrt error on [0.25, 1]: 2^7.31422
The core of both schemes is a 2^7entry, 7bitwide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest, but it would increase cost, so we don't do so.)
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman < andrew@...> wrote: The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots. Following is a discussion of some design considerations leading to a concrete proposal. Strict or bounded definition? Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x). The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions. What precision to choose? Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly. A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:  4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.  5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.  6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.  7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.  9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.  12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively. All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge. Instruction encoding? As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0  rsqrt(+/ 0) = +/ inf, raising DZ exception  rsqrt(nonzero negative) = qNaN, raising NV exception  rsqrt(NaN) = qNaN, raising NV for sNaN inputs  rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0  recip(+/ 0) = +/ inf, raising DZ exception  recip(NaN) = qNaN, raising NV if signaling  recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


On 7/14/20 2:30 PM, Andrew Waterman wrote:
EXTERNAL MAIL
Hi Andrew et al,
Thank you for sending the code. I am attaching an updated version of
recip.cc, implementing the complete NewtonRaphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.
If my calculations are not wrong, with a 7bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.
I have a concern about not including special "step" instructions. The complete NR sequences are as follows:
float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(r, x, 2.0f); // fma
x = x * t; // fmul
t = fmaf(r, x, 2.0f); // fma
x = x * t; // fmul
}
return x;
}
float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = r * x; // fmul
float h = 0.5 * x; // fmul
t = fmaf(t, x, 3.0f); // fma
x = h * t; // fmul
t = r * x; // fmul
h = 0.5 * x; // fmul
t = fmaf(t, x, 3.0f); // fma
x = h * t; // fmul
}
return x;
}
As you can see above, even with proper cornercase handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence
you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.
Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value
of v0 at the beginning and end of the sequence.
Roughly, calculating the number of extra instructions in each sequence I get:
 save v0; 1 inst
 classify input; 1 inst
 compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
 materialize constant 2.0 for recip; 1 inst
 … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
 restore v0; 1 inst
This is 67 extra instructions for recip and 78 extra instructions for rsqrt. The two NR iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 1112 instructions for recip and 1617 instructions
for rsqrt. Apart from the extra instructions, the sequences would require 35 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value,
the rsqrt sequence would need more than 4 registers (79 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.
If we had the following special "step" instructions the sequences would be much shorter (see also the attached
recip_step.cc file):
A recip_step instruction that calculates 2a*b with the following special case handling:
 One input is infinity and the other 0 => result is 2
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3a*b)/2 with the following special case handling:
 One input is infinity and the other 0 => result is 1.5
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
With the above instructions, the NR sequences are reduced to:
float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x); // fma
x = x * t; // fmul
t = recip_step(r, x); // fma
x = x * t; // fmul
return x;
}
float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x; // fmul
t = rsqrt_step(t, x); // fma
x = x * t; // fmul
t = r * x; // fmul
t = rsqrt_step(t, x); // fma
x = x * t; // fmul
return x;
}
Which is a total of 5 instructions for recip (vs 1112 without recip_step) and 7 instructions for rsqrt (vs 1617 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and
input value), so they can be executed for all LMUL values.
FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight. I suppose a compromise might be to make them destructive. This would have no perf. consequence for rsqrt, but would add one extra move instruction
to all but the last reciprocal iteration step.
Seems like they shouldn't be so big as they don't specify rs3 at all. Are we tight on two register input opcodes?
Bill
For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three NR iterations.
Thanks,
Grigorios
PS: I also tried the optimization that you propose of using (3/2  a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.
On 9 Jul 2020, at 10:12, Andrew Waterman < andrew@...> wrote:
I'm following up with detailed semantics in the form of a selfcontained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` commandline argument, it prints Verilog
code for the lookup tables. When invoked with `test` or `testlong`, it selfvalidates and prints error information, e.g.:
$ g++ O2
recip.cc && ./a.out test
max recip error on [0.5, 1]: 2^7.4843
max rsqrt error on [0.25, 1]: 2^7.31422
The core of both schemes is a 2^7entry, 7bitwide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest,
but it would increase cost, so we don't do so.)
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman < andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater
performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots.
Following is a discussion of some design considerations leading to a concrete proposal.
Strict or bounded definition?
Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition
allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).
The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions.
What precision to choose?
Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps
the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly.
A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:
 4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
 5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.
 6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.
 7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.
 9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.
 12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively.
All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge.
Instruction encoding?
As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0
 rsqrt(+/ 0) = +/ inf, raising DZ exception
 rsqrt(nonzero negative) = qNaN, raising NV exception
 rsqrt(NaN) = qNaN, raising NV for sNaN inputs
 rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0
 recip(+/ 0) = +/ inf, raising DZ exception
 recip(NaN) = qNaN, raising NV if signaling
 recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they
will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


On Tue, Jul 14, 2020 at 2:47 PM Bill Huffman < huffman@...> wrote:
On 7/14/20 2:30 PM, Andrew Waterman wrote:
EXTERNAL MAIL
Hi Andrew et al,
Thank you for sending the code. I am attaching an updated version of
recip.cc, implementing the complete NewtonRaphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.
If my calculations are not wrong, with a 7bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.
I have a concern about not including special "step" instructions. The complete NR sequences are as follows:
float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(r, x, 2.0f); // fma
x = x * t; // fmul
t = fmaf(r, x, 2.0f); // fma
x = x * t; // fmul
}
return x;
}
float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = r * x; // fmul
float h = 0.5 * x; // fmul
t = fmaf(t, x, 3.0f); // fma
x = h * t; // fmul
t = r * x; // fmul
h = 0.5 * x; // fmul
t = fmaf(t, x, 3.0f); // fma
x = h * t; // fmul
}
return x;
}
As you can see above, even with proper cornercase handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence
you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.
Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value
of v0 at the beginning and end of the sequence.
Roughly, calculating the number of extra instructions in each sequence I get:
 save v0; 1 inst
 classify input; 1 inst
 compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
 materialize constant 2.0 for recip; 1 inst
 … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
 restore v0; 1 inst
This is 67 extra instructions for recip and 78 extra instructions for rsqrt. The two NR iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 1112 instructions for recip and 1617 instructions
for rsqrt. Apart from the extra instructions, the sequences would require 35 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value,
the rsqrt sequence would need more than 4 registers (79 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.
If we had the following special "step" instructions the sequences would be much shorter (see also the attached
recip_step.cc file):
A recip_step instruction that calculates 2a*b with the following special case handling:
 One input is infinity and the other 0 => result is 2
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3a*b)/2 with the following special case handling:
 One input is infinity and the other 0 => result is 1.5
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
With the above instructions, the NR sequences are reduced to:
float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x); // fma
x = x * t; // fmul
t = recip_step(r, x); // fma
x = x * t; // fmul
return x;
}
float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x; // fmul
t = rsqrt_step(t, x); // fma
x = x * t; // fmul
t = r * x; // fmul
t = rsqrt_step(t, x); // fma
x = x * t; // fmul
return x;
}
Which is a total of 5 instructions for recip (vs 1112 without recip_step) and 7 instructions for rsqrt (vs 1617 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and
input value), so they can be executed for all LMUL values.
FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight. I suppose a compromise might be to make them destructive. This would have no perf. consequence for rsqrt, but would add one extra move instruction
to all but the last reciprocal iteration step.
Seems like they shouldn't be so big as they don't specify rs3 at all. Are we tight on two register input opcodes? Actually, none of the vector instructions use the rs3 field (the vector FMAs are destructive to save encoding space).
There are still several Rtype code points left in the vector opcode, but it has gotten a bit tight, and there will be stiff competition for those code points over the coming years.
Bill
For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three NR iterations.
Thanks,
Grigorios
PS: I also tried the optimization that you propose of using (3/2  a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.
On 9 Jul 2020, at 10:12, Andrew Waterman < andrew@...> wrote:
I'm following up with detailed semantics in the form of a selfcontained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` commandline argument, it prints Verilog
code for the lookup tables. When invoked with `test` or `testlong`, it selfvalidates and prints error information, e.g.:
$ g++ O2
recip.cc && ./a.out test
max recip error on [0.5, 1]: 2^7.4843
max rsqrt error on [0.25, 1]: 2^7.31422
The core of both schemes is a 2^7entry, 7bitwide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest,
but it would increase cost, so we don't do so.)
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman < andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater
performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots.
Following is a discussion of some design considerations leading to a concrete proposal.
Strict or bounded definition?
Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition
allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).
The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions.
What precision to choose?
Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps
the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly.
A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:
 4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
 5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.
 6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.
 7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.
 9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.
 12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively.
All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge.
Instruction encoding?
As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0
 rsqrt(+/ 0) = +/ inf, raising DZ exception
 rsqrt(nonzero negative) = qNaN, raising NV exception
 rsqrt(NaN) = qNaN, raising NV for sNaN inputs
 rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0
 recip(+/ 0) = +/ inf, raising DZ exception
 recip(NaN) = qNaN, raising NV if signaling
 recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they
will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


On 7/14/20 2:54 PM, Andrew Waterman wrote:
EXTERNAL MAIL
On Tue, Jul 14, 2020 at 2:47 PM Bill Huffman < huffman@...> wrote:
On 7/14/20 2:30 PM, Andrew Waterman wrote:
EXTERNAL MAIL
Hi Andrew et al,
Thank you for sending the code. I am attaching an updated version of
recip.cc, implementing the complete NewtonRaphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.
If my calculations are not wrong, with a 7bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.
I have a concern about not including special "step" instructions. The complete NR sequences are as follows:
float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(r, x, 2.0f); // fma
x = x * t; // fmul
t = fmaf(r, x, 2.0f); // fma
x = x * t; // fmul
}
return x;
}
float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = r * x; // fmul
float h = 0.5 * x; // fmul
t = fmaf(t, x, 3.0f); // fma
x = h * t; // fmul
t = r * x; // fmul
h = 0.5 * x; // fmul
t = fmaf(t, x, 3.0f); // fma
x = h * t; // fmul
}
return x;
}
As you can see above, even with proper cornercase handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence
you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.
Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value
of v0 at the beginning and end of the sequence.
Roughly, calculating the number of extra instructions in each sequence I get:
 save v0; 1 inst
 classify input; 1 inst
 compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
 materialize constant 2.0 for recip; 1 inst
 … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
 restore v0; 1 inst
This is 67 extra instructions for recip and 78 extra instructions for rsqrt. The two NR iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 1112 instructions for recip and 1617 instructions
for rsqrt. Apart from the extra instructions, the sequences would require 35 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value,
the rsqrt sequence would need more than 4 registers (79 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.
If we had the following special "step" instructions the sequences would be much shorter (see also the attached
recip_step.cc file):
A recip_step instruction that calculates 2a*b with the following special case handling:
 One input is infinity and the other 0 => result is 2
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3a*b)/2 with the following special case handling:
 One input is infinity and the other 0 => result is 1.5
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
With the above instructions, the NR sequences are reduced to:
float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x); // fma
x = x * t; // fmul
t = recip_step(r, x); // fma
x = x * t; // fmul
return x;
}
float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x; // fmul
t = rsqrt_step(t, x); // fma
x = x * t; // fmul
t = r * x; // fmul
t = rsqrt_step(t, x); // fma
x = x * t; // fmul
return x;
}
Which is a total of 5 instructions for recip (vs 1112 without recip_step) and 7 instructions for rsqrt (vs 1617 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and
input value), so they can be executed for all LMUL values.
FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight. I suppose a compromise might be to make them destructive. This would have no perf. consequence for rsqrt, but would add one extra move instruction
to all but the last reciprocal iteration step.
Seems like they shouldn't be so big as they don't specify rs3 at all. Are we tight on two register input opcodes?
Actually, none of the vector instructions use the rs3 field (the vector FMAs are destructive to save encoding space).
There are still several Rtype code points left in the vector opcode, but it has gotten a bit tight, and there will be stiff competition for those code points over the coming years.
Sorry, I had scalar on the mind....
Bill
Bill
For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three NR iterations.
Thanks,
Grigorios
PS: I also tried the optimization that you propose of using (3/2  a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.
On 9 Jul 2020, at 10:12, Andrew Waterman < andrew@...> wrote:
I'm following up with detailed semantics in the form of a selfcontained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` commandline argument, it prints Verilog
code for the lookup tables. When invoked with `test` or `testlong`, it selfvalidates and prints error information, e.g.:
$ g++ O2
recip.cc && ./a.out test
max recip error on [0.5, 1]: 2^7.4843
max rsqrt error on [0.25, 1]: 2^7.31422
The core of both schemes is a 2^7entry, 7bitwide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest,
but it would increase cost, so we don't do so.)
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman < andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater
performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots.
Following is a discussion of some design considerations leading to a concrete proposal.
Strict or bounded definition?
Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition
allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).
The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions.
What precision to choose?
Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps
the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly.
A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:
 4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
 5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.
 6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.
 7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.
 9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.
 12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively.
All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge.
Instruction encoding?
As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0
 rsqrt(+/ 0) = +/ inf, raising DZ exception
 rsqrt(nonzero negative) = qNaN, raising NV exception
 rsqrt(NaN) = qNaN, raising NV for sNaN inputs
 rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0
 recip(+/ 0) = +/ inf, raising DZ exception
 recip(NaN) = qNaN, raising NV if signaling
 recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they
will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


Do we improve accuracy a bit if the step is:
t = 1.0  r*x; x = x + t*x
instead of:
t = 2.0  r*x; x = t*x
Bill
On 7/14/20 2:58 PM, Bill Huffman wrote:
toggle quoted message
Show quoted text
On 7/14/20 2:54 PM, Andrew Waterman wrote:
EXTERNAL MAIL
On Tue, Jul 14, 2020 at 2:47 PM Bill Huffman < huffman@...> wrote:
On 7/14/20 2:30 PM, Andrew Waterman wrote:
EXTERNAL MAIL
Hi Andrew et al,
Thank you for sending the code. I am attaching an updated version of
recip.cc, implementing the complete NewtonRaphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.
If my calculations are not wrong, with a 7bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.
I have a concern about not including special "step" instructions. The complete NR sequences are as follows:
float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(r, x, 2.0f); // fma
x = x * t; // fmul
t = fmaf(r, x, 2.0f); // fma
x = x * t; // fmul
}
return x;
}
float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = r * x; // fmul
float h = 0.5 * x; // fmul
t = fmaf(t, x, 3.0f); // fma
x = h * t; // fmul
t = r * x; // fmul
h = 0.5 * x; // fmul
t = fmaf(t, x, 3.0f); // fma
x = h * t; // fmul
}
return x;
}
As you can see above, even with proper cornercase handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence
you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.
Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value
of v0 at the beginning and end of the sequence.
Roughly, calculating the number of extra instructions in each sequence I get:
 save v0; 1 inst
 classify input; 1 inst
 compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
 materialize constant 2.0 for recip; 1 inst
 … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
 restore v0; 1 inst
This is 67 extra instructions for recip and 78 extra instructions for rsqrt. The two NR iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 1112 instructions for recip and 1617 instructions
for rsqrt. Apart from the extra instructions, the sequences would require 35 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value,
the rsqrt sequence would need more than 4 registers (79 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.
If we had the following special "step" instructions the sequences would be much shorter (see also the attached
recip_step.cc file):
A recip_step instruction that calculates 2a*b with the following special case handling:
 One input is infinity and the other 0 => result is 2
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3a*b)/2 with the following special case handling:
 One input is infinity and the other 0 => result is 1.5
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
With the above instructions, the NR sequences are reduced to:
float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x); // fma
x = x * t; // fmul
t = recip_step(r, x); // fma
x = x * t; // fmul
return x;
}
float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x; // fmul
t = rsqrt_step(t, x); // fma
x = x * t; // fmul
t = r * x; // fmul
t = rsqrt_step(t, x); // fma
x = x * t; // fmul
return x;
}
Which is a total of 5 instructions for recip (vs 1112 without recip_step) and 7 instructions for rsqrt (vs 1617 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and
input value), so they can be executed for all LMUL values.
FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight. I suppose a compromise might be to make them destructive. This would have no perf. consequence for rsqrt, but would add one extra move instruction
to all but the last reciprocal iteration step.
Seems like they shouldn't be so big as they don't specify rs3 at all. Are we tight on two register input opcodes?
Actually, none of the vector instructions use the rs3 field (the vector FMAs are destructive to save encoding space).
There are still several Rtype code points left in the vector opcode, but it has gotten a bit tight, and there will be stiff competition for those code points over the coming years.
Sorry, I had scalar on the mind....
Bill
Bill
For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three NR iterations.
Thanks,
Grigorios
PS: I also tried the optimization that you propose of using (3/2  a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.
On 9 Jul 2020, at 10:12, Andrew Waterman < andrew@...> wrote:
I'm following up with detailed semantics in the form of a selfcontained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` commandline argument, it prints Verilog
code for the lookup tables. When invoked with `test` or `testlong`, it selfvalidates and prints error information, e.g.:
$ g++ O2
recip.cc && ./a.out test
max recip error on [0.5, 1]: 2^7.4843
max rsqrt error on [0.25, 1]: 2^7.31422
The core of both schemes is a 2^7entry, 7bitwide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest,
but it would increase cost, so we don't do so.)
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman < andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater
performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots.
Following is a discussion of some design considerations leading to a concrete proposal.
Strict or bounded definition?
Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition
allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).
The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions.
What precision to choose?
Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps
the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly.
A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:
 4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
 5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.
 6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.
 7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.
 9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.
 12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively.
All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge.
Instruction encoding?
As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0
 rsqrt(+/ 0) = +/ inf, raising DZ exception
 rsqrt(nonzero negative) = qNaN, raising NV exception
 rsqrt(NaN) = qNaN, raising NV for sNaN inputs
 rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0
 recip(+/ 0) = +/ inf, raising DZ exception
 recip(NaN) = qNaN, raising NV if signaling
 recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they
will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


On Tue, Jul 14, 2020 at 3:09 PM Bill Huffman < huffman@...> wrote:
Do we improve accuracy a bit if the step is:
t = 1.0  r*x; x = x + t*x
instead of:
t = 2.0  r*x; x = t*x
Yes.
Bill
On 7/14/20 2:58 PM, Bill Huffman wrote:
On 7/14/20 2:54 PM, Andrew Waterman wrote:
EXTERNAL MAIL
On Tue, Jul 14, 2020 at 2:47 PM Bill Huffman < huffman@...> wrote:
On 7/14/20 2:30 PM, Andrew Waterman wrote:
EXTERNAL MAIL
Hi Andrew et al,
Thank you for sending the code. I am attaching an updated version of
recip.cc, implementing the complete NewtonRaphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error.
If my calculations are not wrong, with a 7bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root.
I have a concern about not including special "step" instructions. The complete NR sequences are as follows:
float newton_raphson_recip(float r)
{
float x = recip(r);
int k = fpclassify(x);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = fmaf(r, x, 2.0f); // fma
x = x * t; // fmul
t = fmaf(r, x, 2.0f); // fma
x = x * t; // fmul
}
return x;
}
float newton_raphson_rsqrt(float r)
{
float x = rsqrt(r);
int k = fpclassify(r);
if (k != FP_INFINITE && k != FP_ZERO) {
float t = r * x; // fmul
float h = 0.5 * x; // fmul
t = fmaf(t, x, 3.0f); // fma
x = h * t; // fmul
t = r * x; // fmul
h = 0.5 * x; // fmul
t = fmaf(t, x, 3.0f); // fma
x = h * t; // fmul
}
return x;
}
As you can see above, even with proper cornercase handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence
you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end.
Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value
of v0 at the beginning and end of the sequence.
Roughly, calculating the number of extra instructions in each sequence I get:
 save v0; 1 inst
 classify input; 1 inst
 compare with infinity and zero and write predicate into v0; 2 or 3 inst: need to materialie the "classes" to compare against and then the actual comparison(s)
 materialize constant 2.0 for recip; 1 inst
 … or materialize constants 0.5 and 3.0 for rsqrt; 2 inst
 restore v0; 1 inst
This is 67 extra instructions for recip and 78 extra instructions for rsqrt. The two NR iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 1112 instructions for recip and 1617 instructions
for rsqrt. Apart from the extra instructions, the sequences would require 35 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value,
the rsqrt sequence would need more than 4 registers (79 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory.
If we had the following special "step" instructions the sequences would be much shorter (see also the attached
recip_step.cc file):
A recip_step instruction that calculates 2a*b with the following special case handling:
 One input is infinity and the other 0 => result is 2
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
A rsqrt_step instruction that calculates (3a*b)/2 with the following special case handling:
 One input is infinity and the other 0 => result is 1.5
 One input is infinity and the other NaN => result is infinity with the sign of a*b (using only the sign of NaN)
With the above instructions, the NR sequences are reduced to:
float newton_raphson_recip_with_step(float r)
{
float x = recip(r);
float t = recip_step(r, x); // fma
x = x * t; // fmul
t = recip_step(r, x); // fma
x = x * t; // fmul
return x;
}
float newton_raphson_rsqrt_with_step(float r)
{
float x = rsqrt(r);
float t = r * x; // fmul
t = rsqrt_step(t, x); // fma
x = x * t; // fmul
t = r * x; // fmul
t = rsqrt_step(t, x); // fma
x = x * t; // fmul
return x;
}
Which is a total of 5 instructions for recip (vs 1112 without recip_step) and 7 instructions for rsqrt (vs 1617 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and
input value), so they can be executed for all LMUL values.
FWIW, one of my concerns with adding the "step" instructions is opcode space, since we are already very tight. I suppose a compromise might be to make them destructive. This would have no perf. consequence for rsqrt, but would add one extra move instruction
to all but the last reciprocal iteration step.
Seems like they shouldn't be so big as they don't specify rs3 at all. Are we tight on two register input opcodes?
Actually, none of the vector instructions use the rs3 field (the vector FMAs are destructive to save encoding space).
There are still several Rtype code points left in the vector opcode, but it has gotten a bit tight, and there will be stiff competition for those code points over the coming years.
Sorry, I had scalar on the mind....
Bill
Bill
For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three NR iterations.
Thanks,
Grigorios
PS: I also tried the optimization that you propose of using (3/2  a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.
On 9 Jul 2020, at 10:12, Andrew Waterman < andrew@...> wrote:
I'm following up with detailed semantics in the form of a selfcontained C++ program. The `recip` and `rsqrt` functions model the proposed instructions. When the program is invoked with the `verilog` commandline argument, it prints Verilog
code for the lookup tables. When invoked with `test` or `testlong`, it selfvalidates and prints error information, e.g.:
$ g++ O2
recip.cc && ./a.out test
max recip error on [0.5, 1]: 2^7.4843
max rsqrt error on [0.25, 1]: 2^7.31422
The core of both schemes is a 2^7entry, 7bitwide LUT, whose entries are chosen to minimize the maximum error on the interval.
(Note, other ISAs have split one or both of these functions into two LUTs to account for the first derivative having greater magnitude for smaller input values. For this particular scheme, doing so wouldn't improve convergence for any FP format of interest,
but it would increase cost, so we don't do so.)
On Tue, Jun 23, 2020 at 11:05 PM Andrew Waterman < andrew@...> wrote:
The task group has recommended moving forward with adding instructions that estimate reciprocals and reciprocal square roots. These are both useful for ffastmath code where it's acceptable to sacrifice a bit of precision in exchange for greater
performance. Reciprocal square root in particular is ubiquitous in graphics and physics computations. These instructions can feed into NewtonRaphson schemes to provide refined estimates of quotients and square roots.
Following is a discussion of some design considerations leading to a concrete proposal.
Strict or bounded definition?
Some ISAs have required these instructions compute the same estimate on any implementation, whereas others have only specified a maximum error bound. A strict definition is obviously preferable for software portability and compliance testing. A bounded definition
allows more implementation flexibility: for example, implementations with a fast divider could compute the reciprocal estimate as (1.0 / x).
The task group has recommended adopting a strict definition. I'll supply C source code that models the instruction semantics and prints Verilog code that implements the lookuptable components of the instructions.
What precision to choose?
Adopting a strict definition increases the pressure to keep the cost low, since some applications won't benefit from these instructions, but implementations are still effectively required to provide them. So, I'm recommending using a 7bit scheme, which keeps
the cost down (hundreds of gates to implement as a LUT) while providing just enough precision for NewtonRaphson refinement to converge reasonably quickly.
A common goal is to produce an estimate within a couple ulps of the IEEE 754 result. With that in mind, a few precision sweet spots emerge:
 4bit estimates (which, due to careful rounding, actually have relative error of 2^4.6 for reciprocal and 2^4.5 for rsqrt) take 1, 2, 3, and 4 iterations to get within a couple ulps of bfloat16, binary16, binary32, and binary64, respectively.
 5bit (2^5.5, 2^5.3) take 1, 1, 3, and 4 iterations, respectively.
 6bit (2^6.4, 2^6.3) take 0, 1, 2, and 4 iterations, respectively.
 7bit (2^7.4, 2^7.3) take 0, 1, 2, and 3 iterations, respectively.
 9bit (2^9.4, 2^9.2) take 0, 0, 2, and 3 iterations, respectively.
 12bit (2^12.4, 2^12.3) take 0, 0, 1, and 3 iterations, respectively.
All of these are local optima; the global optimum obviously depends on your favorite FP format and the ubiquity of these calculations. The 7bit scheme is of interest because it's the cliff at which doubleprecision needs 3 iterations to converge.
Instruction encoding?
As unary instructions, these instructions fit conveniently in the VFUNARY1 opcode, next to the existing VFSQRT instruction.
If, over time, more precise estimates are required, they can be accommodated in the same unaryoperation encoding space.
Cornercase details?
For values outside the domain, or near its edge, these instructions should behave like you'd expect 1.0/x or 1.0/sqrt(x) to behave on an IEEE 754compliant C implementation. In particular:
 rsqrt(+inf) = +0
 rsqrt(+/ 0) = +/ inf, raising DZ exception
 rsqrt(nonzero negative) = qNaN, raising NV exception
 rsqrt(NaN) = qNaN, raising NV for sNaN inputs
 rsqrt(subnormal) is handled correctly; the result is normal and never overflows
 recip(+/ inf) = +/ 0
 recip(+/ 0) = +/ inf, raising DZ exception
 recip(NaN) = qNaN, raising NV if signaling
 recip(subnormal) is handled correctly, noting that all but the largest subnormals will overflow to infinity, raising OF the exception
NewtonRaphson iteration instructions?
Other ISAs have provided instructions that perform part of a NewtonRaphson iteration step, e.g., computing (3  a * b) / 2.0 in a single instruction. These instructions would avoid needing to splat a constant to a vector register, but I think otherwise they
will only save one instruction per NewtonRaphson loop, not one instruction per NewtonRaphson iteration. (This is because the expression can be rewritten as (3/2  a/2 * b), where a/2 is the same for all iteration steps.)
To keep the ISA simple, and to avoid eating up opcode space, I'm recommending against adding these iteration instructions.


Excellent observation. at least 1 bits better. and when r*x close to 1, much better.


The current LUT generator assumes NbyN look up table.
I will load in my github Andrew's program modified to take input (index size) and output (estimate number of bits) arguments. (verilog still generates LUT, and test generates the values below),
Of course, how well the table can be synthesized is more importance than LUT dimensions per se. I have used yosys with varying results.
I continue to try to profile the accuracy within input segments rather than over total float range.
Other dimensions are possible with resultant increase of decrease in accuracy: parameters are: output width / input width array size
7/7 896 max recip error on [0.5, 1]: 2^7.36951 max rsqrt error on [0.25, 1]: 2^7.2998 8/7 1024 max recip error on [0.5, 1]: 2^7.78083 max rsqrt error on [0.25, 1]: 2^7.69943 9/7 1152 max recip error on [0.5, 1]: 2^7.89148 max rsqrt error on [0.25, 1]: 2^7.8783 10/7 1280 max recip error on [0.5, 1]: 2^7.94879 max rsqrt error on [0.25, 1]: 2^7.91063 11/7 1408 by 1 max recip error on [0.5, 1]: 2^7.97629 max rsqrt error on [0.25, 1]: 2^8 11/7 1408 by 0x1000 max recip error on [0.5, 1]: 2^8 max rsqrt error on [0.25, 1]: 2^8 6/8 1536 max recip error on [0.5, 1]: 2^6.90724 max rsqrt error on [0.25, 1]: 2^6.88438 7/8 1792 max recip error on [0.5, 1]: 2^7.78083 max rsqrt error on [0.25, 1]: 2^7.73897 8/8 2048 by 0X1000 max recip error on [0.5, 1]: 2^8.45311 max rsqrt error on [0.25, 1]: 2^8.36032 10/8 2560 by 0X1000 max recip error on [0.5, 1]: 2^8.88626 max rsqrt error on [0.25, 1]: 2^8.8425 10/8 2560 by 0X1 max recip error on [0.5, 1]: 2^8.88626 max rsqrt error on [0.25, 1]: 2^8.83164 12/8 max recip error on [0.5, 1]: 2^8.98041 max rsqrt error on [0.25, 1]: 2^8.9616 13/8 2560 max recip error on [0.5, 1]: 2^9 max rsqrt error on [0.25, 1]: 2^9 9/9 max recip error on [0.5, 1]: 2^9.47252 max rsqrt error on [0.25, 1]: 2^9.31953
observation: 7 input bits has a minimum max error of 2^8. 8 input bits has a minimum max error of 2^9.


David,
Are the max errors absolute? Or relative to the recip or rsqrt, which is presumably in the range (1.0, 2.0]?
That you use [0.5, 1] when you might have meant [0.5, 1) leaves some question about what is happening with the powers of two (even powers for rsqrt). Hopefully, they're always precise and there's no issue of error there.
Bill
toggle quoted message
Show quoted text
On 7/31/20 4:33 AM, David Horner wrote: EXTERNAL MAIL
The current LUT generator assumes NbyN look up table.
I will load in my github Andrew's program modified to take input (index size) and output (estimate number of bits) arguments. (verilog still generates LUT, and test generates the values below),
Of course, how well the table can be synthesized is more importance than LUT dimensions per se. I have used yosys with varying results.
I continue to try to profile the accuracy within input segments rather than over total float range.
Other dimensions are possible with resultant increase of decrease in accuracy: parameters are: output width / input width array size
7/7 896 max recip error on [0.5, 1]: 2^7.36951 max rsqrt error on [0.25, 1]: 2^7.2998 8/7 1024 max recip error on [0.5, 1]: 2^7.78083 max rsqrt error on [0.25, 1]: 2^7.69943 9/7 1152 max recip error on [0.5, 1]: 2^7.89148 max rsqrt error on [0.25, 1]: 2^7.8783 10/7 1280 max recip error on [0.5, 1]: 2^7.94879 max rsqrt error on [0.25, 1]: 2^7.91063 11/7 1408 by 1 max recip error on [0.5, 1]: 2^7.97629 max rsqrt error on [0.25, 1]: 2^8 11/7 1408 by 0x1000 max recip error on [0.5, 1]: 2^8 max rsqrt error on [0.25, 1]: 2^8 6/8 1536 max recip error on [0.5, 1]: 2^6.90724 max rsqrt error on [0.25, 1]: 2^6.88438 7/8 1792 max recip error on [0.5, 1]: 2^7.78083 max rsqrt error on [0.25, 1]: 2^7.73897 8/8 2048 by 0X1000 max recip error on [0.5, 1]: 2^8.45311 max rsqrt error on [0.25, 1]: 2^8.36032 10/8 2560 by 0X1000 max recip error on [0.5, 1]: 2^8.88626 max rsqrt error on [0.25, 1]: 2^8.8425 10/8 2560 by 0X1 max recip error on [0.5, 1]: 2^8.88626 max rsqrt error on [0.25, 1]: 2^8.83164 12/8 max recip error on [0.5, 1]: 2^8.98041 max rsqrt error on [0.25, 1]: 2^8.9616 13/8 2560 max recip error on [0.5, 1]: 2^9 max rsqrt error on [0.25, 1]: 2^9 9/9 max recip error on [0.5, 1]: 2^9.47252 max rsqrt error on [0.25, 1]: 2^9.31953
observation: 7 input bits has a minimum max error of 2^8. 8 input bits has a minimum max error of 2^9.


The error is relative error. The calculation is unchanged from Andrew's original. (Although I explicitly force double even when it shouldn't matter). The test range is from 0.5 to 1 inclusive. Again I left Andrew's choices unchanged. As you point out the 1 case should not contribute to max error in the reciprocal case as the error should be zero. The 1 case for rsqrt for odd powers of 2 exponent is non zero by definition as sqrt 2 is irrational. Andrew provides test_long which tests all single precision values that are not NaN.
I hope to post my code soon. I get a seg fault when no parm are provided on the command line. Argh. I belive it is relates to handling the argv as a vector of values. It appears the support construct moves some code out of reates out of conditional scope and thus frees even when no explicit allocation is made.
F6
toggle quoted message
Show quoted text
On Fri, Jul 31, 2020, 13:29 Bill Huffman, < huffman@...> wrote: David,
Are the max errors absolute? Or relative to the recip or rsqrt, which
is presumably in the range (1.0, 2.0]?
That you use [0.5, 1] when you might have meant [0.5, 1) leaves some
question about what is happening with the powers of two (even powers for
rsqrt). Hopefully, they're always precise and there's no issue of error
there.
Bill
On 7/31/20 4:33 AM, David Horner wrote:
> EXTERNAL MAIL
>
>
> The current LUT generator assumes NbyN look up table.
>
> I will load in my github Andrew's program modified to take input (index
> size) and output (estimate number of bits) arguments.
> (verilog still generates LUT, and test generates the values below),
>
> Of course, how well the table can be synthesized is more importance than
> LUT dimensions per se.
> I have used yosys with varying results.
>
> I continue to try to profile the accuracy within input segments rather
> than over total float range.
>
>
> Other dimensions are possible with resultant increase of decrease in
> accuracy:
> parameters are: output width / input width array size
>
> 7/7 896
> max recip error on [0.5, 1]: 2^7.36951
> max rsqrt error on [0.25, 1]: 2^7.2998
> 8/7 1024
> max recip error on [0.5, 1]: 2^7.78083
> max rsqrt error on [0.25, 1]: 2^7.69943
> 9/7 1152
> max recip error on [0.5, 1]: 2^7.89148
> max rsqrt error on [0.25, 1]: 2^7.8783
> 10/7 1280
> max recip error on [0.5, 1]: 2^7.94879
> max rsqrt error on [0.25, 1]: 2^7.91063
> 11/7 1408 by 1
> max recip error on [0.5, 1]: 2^7.97629
> max rsqrt error on [0.25, 1]: 2^8
> 11/7 1408 by 0x1000
> max recip error on [0.5, 1]: 2^8
> max rsqrt error on [0.25, 1]: 2^8
> 6/8 1536
> max recip error on [0.5, 1]: 2^6.90724
> max rsqrt error on [0.25, 1]: 2^6.88438
> 7/8 1792
> max recip error on [0.5, 1]: 2^7.78083
> max rsqrt error on [0.25, 1]: 2^7.73897
> 8/8 2048 by 0X1000
> max recip error on [0.5, 1]: 2^8.45311
> max rsqrt error on [0.25, 1]: 2^8.36032
> 10/8 2560 by 0X1000
> max recip error on [0.5, 1]: 2^8.88626
> max rsqrt error on [0.25, 1]: 2^8.8425
> 10/8 2560 by 0X1
> max recip error on [0.5, 1]: 2^8.88626
> max rsqrt error on [0.25, 1]: 2^8.83164
> 12/8
> max recip error on [0.5, 1]: 2^8.98041
> max rsqrt error on [0.25, 1]: 2^8.9616
> 13/8 2560
> max recip error on [0.5, 1]: 2^9
> max rsqrt error on [0.25, 1]: 2^9
> 9/9
> max recip error on [0.5, 1]: 2^9.47252
> max rsqrt error on [0.25, 1]: 2^9.31953
>
>
> observation:
> 7 input bits has a minimum max error of 2^8.
> 8 input bits has a minimum max error of 2^9.
>
>
>
>


David,
Because of the errors you get, I'm assuming your "output width" and "input width" do not include the hidden bit. Right?
It's interesting. I did a similar exercise a number of years ago and got a few hundredths of a bit better accuracy from 7/7 tables. It's possible I did it wrong. It's also possible that there's a slight improvement available.
If you want to send me the tables I can compare. Mine are in decimal numbers from 128 to 255. I could send you tables as well.
Bill
On 7/31/20 12:24 PM, David Horner wrote:
toggle quoted message
Show quoted text
EXTERNAL MAIL
The error is relative error.
The calculation is unchanged from Andrew's original. (Although I explicitly force double even when it shouldn't matter).
The test range is from 0.5 to 1 inclusive.
Again I left Andrew's choices unchanged.
As you point out the 1 case should not contribute to max error in the reciprocal case as the error should be zero. The 1 case for rsqrt for odd powers of 2 exponent is non zero by definition as sqrt 2 is irrational.
Andrew provides test_long which tests all single precision values that are not NaN.
I hope to post my code soon.
I get a seg fault when no parm are provided on the command line.
Argh. I belive it is relates to handling the argv as a vector of values. It appears the support construct moves some code out of reates out of conditional scope and thus frees even when no explicit allocation is made.
F6
On Fri, Jul 31, 2020, 13:29 Bill Huffman, < huffman@...> wrote:
David,
Are the max errors absolute? Or relative to the recip or rsqrt, which
is presumably in the range (1.0, 2.0]?
That you use [0.5, 1] when you might have meant [0.5, 1) leaves some
question about what is happening with the powers of two (even powers for
rsqrt). Hopefully, they're always precise and there's no issue of error
there.
Bill
On 7/31/20 4:33 AM, David Horner wrote:
> EXTERNAL MAIL
>
>
> The current LUT generator assumes NbyN look up table.
>
> I will load in my github Andrew's program modified to take input (index
> size) and output (estimate number of bits) arguments.
> (verilog still generates LUT, and test generates the values below),
>
> Of course, how well the table can be synthesized is more importance than
> LUT dimensions per se.
> I have used yosys with varying results.
>
> I continue to try to profile the accuracy within input segments rather
> than over total float range.
>
>
> Other dimensions are possible with resultant increase of decrease in
> accuracy:
> parameters are: output width / input width array size
>
> 7/7 896
> max recip error on [0.5, 1]: 2^7.36951
> max rsqrt error on [0.25, 1]: 2^7.2998
> 8/7 1024
> max recip error on [0.5, 1]: 2^7.78083
> max rsqrt error on [0.25, 1]: 2^7.69943
> 9/7 1152
> max recip error on [0.5, 1]: 2^7.89148
> max rsqrt error on [0.25, 1]: 2^7.8783
> 10/7 1280
> max recip error on [0.5, 1]: 2^7.94879
> max rsqrt error on [0.25, 1]: 2^7.91063
> 11/7 1408 by 1
> max recip error on [0.5, 1]: 2^7.97629
> max rsqrt error on [0.25, 1]: 2^8
> 11/7 1408 by 0x1000
> max recip error on [0.5, 1]: 2^8
> max rsqrt error on [0.25, 1]: 2^8
> 6/8 1536
> max recip error on [0.5, 1]: 2^6.90724
> max rsqrt error on [0.25, 1]: 2^6.88438
> 7/8 1792
> max recip error on [0.5, 1]: 2^7.78083
> max rsqrt error on [0.25, 1]: 2^7.73897
> 8/8 2048 by 0X1000
> max recip error on [0.5, 1]: 2^8.45311
> max rsqrt error on [0.25, 1]: 2^8.36032
> 10/8 2560 by 0X1000
> max recip error on [0.5, 1]: 2^8.88626
> max rsqrt error on [0.25, 1]: 2^8.8425
> 10/8 2560 by 0X1
> max recip error on [0.5, 1]: 2^8.88626
> max rsqrt error on [0.25, 1]: 2^8.83164
> 12/8
> max recip error on [0.5, 1]: 2^8.98041
> max rsqrt error on [0.25, 1]: 2^8.9616
> 13/8 2560
> max recip error on [0.5, 1]: 2^9
> max rsqrt error on [0.25, 1]: 2^9
> 9/9
> max recip error on [0.5, 1]: 2^9.47252
> max rsqrt error on [0.25, 1]: 2^9.31953
>
>
> observation:
> 7 input bits has a minimum max error of 2^8.
> 8 input bits has a minimum max error of 2^9.
>
>
>
>


This is the program Andrew wrote.
https://github.com/riscv/riscvvspec/blob/vfrecip/recip.cc
On 20200731 4:46 p.m., Bill Huffman
wrote:
David,
Because of the errors you get,
I'm assuming your "output width" and "input width" do not
include the hidden bit. Right?
That is correct, Andrew's approach
assumes the implicit high hidden bit.
It's interesting. I did a
similar exercise a number of years ago and got a few
hundredths of a bit better accuracy from 7/7 tables. It's
possible I did it wrong. It's also possible that there's a
slight improvement available.
Andrew chose a range from [xn ,
(x+1)n) perhaps (xn,(x+1)n] will work better.
I will give it a try.
If you want to send me the
tables I can compare. Mine are in decimal numbers from 128 to
255. I could send you tables as well.
As mentioned Andrew's verilog
directive creates a table; both input and output in range from 0
to 127.
I wouldn't expect the bias to make any significant difference.
I'd be happy to see your tables.
If you want I will send Andrew's program's output for 7x7.
And any of the other listed combinations from my mods to his
program.
I will post my mods even though I still get that seg fault with
null (or single) command line args.
Then you could run your own.
Bill
On 7/31/20 12:24 PM, David Horner
wrote:
EXTERNAL MAIL
The error is relative error.
The calculation is unchanged from Andrew's
original. (Although I explicitly force double even when it
shouldn't matter).
The test range is from 0.5 to 1 inclusive.
Again I left Andrew's choices unchanged.
As you point out the 1 case should not
contribute to max error in the reciprocal case as the error
should be zero. The 1 case for rsqrt for odd powers of 2
exponent is non zero by definition as sqrt 2 is irrational.
Andrew provides test_long which tests all
single precision values that are not NaN.
I hope to post my code soon.
I get a seg fault when no parm are provided on
the command line.
Argh. I belive it is relates to handling the
argv as a vector of values. It appears the support construct
moves some code out of reates out of conditional scope and
thus frees even when no explicit allocation is made.
F6
On Fri, Jul 31, 2020, 13:29
Bill Huffman, < huffman@...> wrote:
David,
Are the max errors absolute? Or relative to the recip or
rsqrt, which
is presumably in the range (1.0, 2.0]?
That you use [0.5, 1] when you might have meant [0.5, 1)
leaves some
question about what is happening with the powers of two
(even powers for
rsqrt). Hopefully, they're always precise and there's no
issue of error
there.
Bill
On 7/31/20 4:33 AM, David Horner wrote:
> EXTERNAL MAIL
>
>
> The current LUT generator assumes NbyN look up table.
>
> I will load in my github Andrew's program modified to
take input (index
> size) and output (estimate number of bits) arguments.
> (verilog still generates LUT, and test generates
the values below),
>
> Of course, how well the table can be synthesized is
more importance than
> LUT dimensions per se.
> I have used yosys with varying results.
>
> I continue to try to profile the accuracy within input
segments rather
> than over total float range.
>
>
> Other dimensions are possible with resultant increase
of decrease in
> accuracy:
> parameters are: output width / input width array size
>
> 7/7 896
> max recip error on [0.5, 1]: 2^7.36951
> max rsqrt error on [0.25, 1]: 2^7.2998
> 8/7 1024
> max recip error on [0.5, 1]: 2^7.78083
> max rsqrt error on [0.25, 1]: 2^7.69943
> 9/7 1152
> max recip error on [0.5, 1]: 2^7.89148
> max rsqrt error on [0.25, 1]: 2^7.8783
> 10/7 1280
> max recip error on [0.5, 1]: 2^7.94879
> max rsqrt error on [0.25, 1]: 2^7.91063
> 11/7 1408 by 1
> max recip error on [0.5, 1]: 2^7.97629
> max rsqrt error on [0.25, 1]: 2^8
> 11/7 1408 by 0x1000
> max recip error on [0.5, 1]: 2^8
> max rsqrt error on [0.25, 1]: 2^8
> 6/8 1536
> max recip error on [0.5, 1]: 2^6.90724
> max rsqrt error on [0.25, 1]: 2^6.88438
> 7/8 1792
> max recip error on [0.5, 1]: 2^7.78083
> max rsqrt error on [0.25, 1]: 2^7.73897
> 8/8 2048 by 0X1000
> max recip error on [0.5, 1]: 2^8.45311
> max rsqrt error on [0.25, 1]: 2^8.36032
> 10/8 2560 by 0X1000
> max recip error on [0.5, 1]: 2^8.88626
> max rsqrt error on [0.25, 1]: 2^8.8425
> 10/8 2560 by 0X1
> max recip error on [0.5, 1]: 2^8.88626
> max rsqrt error on [0.25, 1]: 2^8.83164
> 12/8
> max recip error on [0.5, 1]: 2^8.98041
> max rsqrt error on [0.25, 1]: 2^8.9616
> 13/8 2560
> max recip error on [0.5, 1]: 2^9
> max rsqrt error on [0.25, 1]: 2^9
> 9/9
> max recip error on [0.5, 1]: 2^9.47252
> max rsqrt error on [0.25, 1]: 2^9.31953
>
>
> observation:
> 7 input bits has a minimum max error of 2^8.
> 8 input bits has a minimum max error of 2^9.
>
>
>
>


David,
Here are a series of statements leading to my worst case answer:
 For the mantissa range 0xF5_0000 to 0xF5_FFFF, the reciprocal estimate is 0x85_0000
 The largest error is for 0xF5_0000
 The reciprocal of 0xF5_0000, to "infinite" precision is 0x85_BF37.612D
 The relative error, then, is (0x85_0000  0x85_BF37.612D)/0x85_BF37.612D
=> 0x0.016E_0000_0022
 The log2 of the absolute value of that error is: 7.484300
I don't have any errors as large as to have a log2 of 7.36951. Where did that error come from for you?
Bill
On 7/31/20 2:25 PM, DSHORNER wrote:
toggle quoted message
Show quoted text
EXTERNAL MAIL
This is the program Andrew wrote.
https://github.com/riscv/riscvvspec/blob/vfrecip/recip.cc
On 20200731 4:46 p.m., Bill Huffman wrote:
David,
Because of the errors you get, I'm assuming your "output width" and "input width" do not include the hidden bit. Right?
That is correct, Andrew's approach assumes the implicit high hidden bit.
It's interesting. I did a similar exercise a number of years ago and got a few hundredths of a bit better accuracy from 7/7 tables. It's possible I did it wrong. It's also possible that there's a slight improvement available.
Andrew chose a range from [xn , (x+1)n) perhaps (xn,(x+1)n] will work better.
I will give it a try.
If you want to send me the tables I can compare. Mine are in decimal numbers from 128 to 255. I could send you tables as well.
As mentioned Andrew's verilog directive creates a table; both input and output in range from 0 to 127.
I wouldn't expect the bias to make any significant difference.
I'd be happy to see your tables.
If you want I will send Andrew's program's output for 7x7.
And any of the other listed combinations from my mods to his program.
I will post my mods even though I still get that seg fault with null (or single) command line args.
Then you could run your own.
Bill
On 7/31/20 12:24 PM, David Horner wrote:
EXTERNAL MAIL
The error is relative error.
The calculation is unchanged from Andrew's original. (Although I explicitly force double even when it shouldn't matter).
The test range is from 0.5 to 1 inclusive.
Again I left Andrew's choices unchanged.
As you point out the 1 case should not contribute to max error in the reciprocal case as the error should be zero. The 1 case for rsqrt for odd powers of 2 exponent is non zero by definition as sqrt 2 is irrational.
Andrew provides test_long which tests all single precision values that are not NaN.
I hope to post my code soon.
I get a seg fault when no parm are provided on the command line.
Argh. I belive it is relates to handling the argv as a vector of values. It appears the support construct moves some code out of reates out of conditional scope and thus frees even when no explicit allocation is made.
F6
On Fri, Jul 31, 2020, 13:29 Bill Huffman, < huffman@...> wrote:
David,
Are the max errors absolute? Or relative to the recip or rsqrt, which
is presumably in the range (1.0, 2.0]?
That you use [0.5, 1] when you might have meant [0.5, 1) leaves some
question about what is happening with the powers of two (even powers for
rsqrt). Hopefully, they're always precise and there's no issue of error
there.
Bill
On 7/31/20 4:33 AM, David Horner wrote:
> EXTERNAL MAIL
>
>
> The current LUT generator assumes NbyN look up table.
>
> I will load in my github Andrew's program modified to take input (index
> size) and output (estimate number of bits) arguments.
> (verilog still generates LUT, and test generates the values below),
>
> Of course, how well the table can be synthesized is more importance than
> LUT dimensions per se.
> I have used yosys with varying results.
>
> I continue to try to profile the accuracy within input segments rather
> than over total float range.
>
>
> Other dimensions are possible with resultant increase of decrease in
> accuracy:
> parameters are: output width / input width array size
>
> 7/7 896
> max recip error on [0.5, 1]: 2^7.36951
> max rsqrt error on [0.25, 1]: 2^7.2998
> 8/7 1024
> max recip error on [0.5, 1]: 2^7.78083
> max rsqrt error on [0.25, 1]: 2^7.69943
> 9/7 1152
> max recip error on [0.5, 1]: 2^7.89148
> max rsqrt error on [0.25, 1]: 2^7.8783
> 10/7 1280
> max recip error on [0.5, 1]: 2^7.94879
> max rsqrt error on [0.25, 1]: 2^7.91063
> 11/7 1408 by 1
> max recip error on [0.5, 1]: 2^7.97629
> max rsqrt error on [0.25, 1]: 2^8
> 11/7 1408 by 0x1000
> max recip error on [0.5, 1]: 2^8
> max rsqrt error on [0.25, 1]: 2^8
> 6/8 1536
> max recip error on [0.5, 1]: 2^6.90724
> max rsqrt error on [0.25, 1]: 2^6.88438
> 7/8 1792
> max recip error on [0.5, 1]: 2^7.78083
> max rsqrt error on [0.25, 1]: 2^7.73897
> 8/8 2048 by 0X1000
> max recip error on [0.5, 1]: 2^8.45311
> max rsqrt error on [0.25, 1]: 2^8.36032
> 10/8 2560 by 0X1000
> max recip error on [0.5, 1]: 2^8.88626
> max rsqrt error on [0.25, 1]: 2^8.8425
> 10/8 2560 by 0X1
> max recip error on [0.5, 1]: 2^8.88626
> max rsqrt error on [0.25, 1]: 2^8.83164
> 12/8
> max recip error on [0.5, 1]: 2^8.98041
> max rsqrt error on [0.25, 1]: 2^8.9616
> 13/8 2560
> max recip error on [0.5, 1]: 2^9
> max rsqrt error on [0.25, 1]: 2^9
> 9/9
> max recip error on [0.5, 1]: 2^9.47252
> max rsqrt error on [0.25, 1]: 2^9.31953
>
>
> observation:
> 7 input bits has a minimum max error of 2^8.
> 8 input bits has a minimum max error of 2^9.
>
>
>
>


What I initially posted was a compilation from days previously, and
I pulled in some bogus test results.
Here is a fresh run :
./a.out 7 5 ;./a.out 7 6 ;./a.out 7 7
;./a.out 7 8 ;./a.out 7 9 ;./a.out 7 10 ;./a.out 7 11
;./a.out 8 7 ;./a.out 8 ;./a.out 8 9 ;./a.out 9 ;
ip 7 op 5 LUT #bits 640 verilog 0 test/testlong 1
max recip 7x5 error: 2^5.89148
max rsqrt 7x5 error: 2^5.98208
ip 7 op 6 LUT #bits 768 verilog 0 test/testlong 1
max recip 7x6 error: 2^6.79055
max rsqrt 7x6 error: 2^6.73312
ip 7 op 7 LUT #bits 896 verilog 0 test/testlong 1
max recip 7x7 error: 2^7.4843
max rsqrt 7x7 error: 2^7.31422
ip 7 op 8 LUT #bits 1024 verilog 0 test/testlong 1
max recip 7x8 error: 2^7.77603
max rsqrt 7x8 error: 2^7.6318
ip 7 op 9 LUT #bits 1152 verilog 0 test/testlong 1
max recip 7x9 error: 2^7.8889
max rsqrt 7x9 error: 2^7.87831
ip 7 op 10 LUT #bits 1280 verilog 0 test/testlong 1
max recip 7x10 error: 2^7.94879
max rsqrt 7x10 error: 2^7.89712
ip 7 op 11 LUT #bits 1408 verilog 0 test/testlong 1
max recip 7x11 error: 2^7.97629
max rsqrt 7x11 error: 2^8
ip 8 op 7 LUT #bits 1792 verilog 0 test/testlong 1
max recip 8x7 error: 2^7.77602
max rsqrt 8x7 error: 2^7.72555
estimate width, op=0, out of range reset to default
ip 8 op 8 LUT #bits 2048 verilog 0 test/testlong 1
max recip 8x8 error: 2^8.45311
max rsqrt 8x8 error: 2^8.25349
ip 8 op 9 LUT #bits 2304 verilog 0 test/testlong 1
max recip 8x9 error: 2^8.71923
max rsqrt 8x9 error: 2^8.67807
estimate width, op=0, out of range reset to default
ip 9 op 9 LUT #bits 4608 verilog 0 test/testlong 1
max recip 9x9 error: 2^9.43021
max rsqrt 9x9 error: 2^9.28082
On 20200801 12:27 a.m., Bill Huffman
wrote:
David,
Here are a series of statements
leading to my worst case answer:
 For the mantissa range 0xF5_0000 to
0xF5_FFFF, the reciprocal estimate is 0x85_0000
 The largest error is for
0xF5_0000
 The reciprocal of 0xF5_0000,
to "infinite" precision is 0x85_BF37.612D
 The relative error, then, is
(0x85_0000  0x85_BF37.612D)/0x85_BF37.612D =>
0x0.016E_0000_0022
 The log2 of the absolute value of that error is:
7.484300
I don't have any errors as large
as to have a log2 of 7.36951. Where did that error come from
for you?
It came from some testing I was
performing on adjacent index values.
Completely bogus as I mentioned above.
I will instrument the code for more details, but I suspect this
code has exactly the same worst case situation (for 7x7).
Bill
On 7/31/20 2:25 PM, DSHORNER wrote:
EXTERNAL MAIL
This is the program Andrew wrote.
https://github.com/riscv/riscvvspec/blob/vfrecip/recip.cc
On 20200731 4:46 p.m., Bill
Huffman wrote:
David,
Because of the errors you
get, I'm assuming your "output width" and "input width" do
not include the hidden bit. Right?
That is correct, Andrew's
approach assumes the implicit high hidden bit.
It's interesting. I did a
similar exercise a number of years ago and got a few
hundredths of a bit better accuracy from 7/7 tables. It's
possible I did it wrong. It's also possible that there's
a slight improvement available.
Andrew chose a range from [xn ,
(x+1)n) perhaps (xn,(x+1)n] will work better.
I will give it a try.
If you want to send me the
tables I can compare. Mine are in decimal numbers from
128 to 255. I could send you tables as well.
As mentioned Andrew's verilog
directive creates a table; both input and output in range
from 0 to 127.
I wouldn't expect the bias to make any significant difference.
I'd be happy to see your tables.
If you want I will send Andrew's program's output for 7x7.
And any of the other listed combinations from my mods to his
program.
I will post my mods even though I still get that seg fault
with null (or single) command line args.
Then you could run your own.


toggle quoted message
Show quoted text
What I initially posted was a compilation from days previously,
and I pulled in some bogus test results.
Here is a fresh run :
./a.out 7 5 ;./a.out 7 6 ;./a.out 7 7
;./a.out 7 8 ;./a.out 7 9 ;./a.out 7 10 ;./a.out 7 11
;./a.out 8 7 ;./a.out 8 ;./a.out 8 9 ;./a.out 9 ;
ip 7 op 5 LUT #bits 640 verilog 0 test/testlong 1
max recip 7x5 error: 2^5.89148
max rsqrt 7x5 error: 2^5.98208
ip 7 op 6 LUT #bits 768 verilog 0 test/testlong 1
max recip 7x6 error: 2^6.79055
max rsqrt 7x6 error: 2^6.73312
ip 7 op 7 LUT #bits 896 verilog 0 test/testlong 1
max recip 7x7 error: 2^7.4843
max rsqrt 7x7 error: 2^7.31422
ip 7 op 8 LUT #bits 1024 verilog 0 test/testlong 1
max recip 7x8 error: 2^7.77603
max rsqrt 7x8 error: 2^7.6318
ip 7 op 9 LUT #bits 1152 verilog 0 test/testlong 1
max recip 7x9 error: 2^7.8889
max rsqrt 7x9 error: 2^7.87831
ip 7 op 10 LUT #bits 1280 verilog 0 test/testlong 1
max recip 7x10 error: 2^7.94879
max rsqrt 7x10 error: 2^7.89712
ip 7 op 11 LUT #bits 1408 verilog 0 test/testlong 1
max recip 7x11 error: 2^7.97629
max rsqrt 7x11 error: 2^8
ip 8 op 7 LUT #bits 1792 verilog 0 test/testlong 1
max recip 8x7 error: 2^7.77602
max rsqrt 8x7 error: 2^7.72555
estimate width, op=0, out of range reset to default
ip 8 op 8 LUT #bits 2048 verilog 0 test/testlong 1
max recip 8x8 error: 2^8.45311
max rsqrt 8x8 error: 2^8.25349
ip 8 op 9 LUT #bits 2304 verilog 0 test/testlong 1
max recip 8x9 error: 2^8.71923
max rsqrt 8x9 error: 2^8.67807
estimate width, op=0, out of range reset to default
ip 9 op 9 LUT #bits 4608 verilog 0 test/testlong 1
max recip 9x9 error: 2^9.43021
max rsqrt 9x9 error: 2^9.28082
On 20200801 12:27 a.m., Bill
Huffman wrote:
David,
Here are a series of
statements leading to my worst case answer:
 For the mantissa range 0xF5_0000 to
0xF5_FFFF, the reciprocal estimate is 0x85_0000
 The largest error is for
0xF5_0000
 The reciprocal of
0xF5_0000, to "infinite" precision is 0x85_BF37.612D
 The relative error, then,
is (0x85_0000  0x85_BF37.612D)/0x85_BF37.612D =>
0x0.016E_0000_0022
 The log2 of the absolute value of that error
is: 7.484300
I don't have any errors as
large as to have a log2 of 7.36951. Where did that error
come from for you?
It came from some testing I was
performing on adjacent index values.
Completely bogus as I mentioned above.
I will instrument the code for more details, but I suspect this
code has exactly the same worst case situation (for 7x7).
Bill
On 7/31/20 2:25 PM, DSHORNER wrote:
EXTERNAL MAIL
This is the program Andrew wrote.
https://github.com/riscv/riscvvspec/blob/vfrecip/recip.cc
On 20200731 4:46 p.m., Bill
Huffman wrote:
David,
Because of the errors you
get, I'm assuming your "output width" and "input width"
do not include the hidden bit. Right?
That is correct, Andrew's
approach assumes the implicit high hidden bit.
It's interesting. I did a
similar exercise a number of years ago and got a few
hundredths of a bit better accuracy from 7/7 tables.
It's possible I did it wrong. It's also possible that
there's a slight improvement available.
Andrew chose a range from [xn ,
(x+1)n) perhaps (xn,(x+1)n] will work better.
I will give it a try.
If you want to send me the
tables I can compare. Mine are in decimal numbers from
128 to 255. I could send you tables as well.
As mentioned Andrew's verilog
directive creates a table; both input and output in range
from 0 to 127.
I wouldn't expect the bias to make any significant
difference.
I'd be happy to see your tables.
If you want I will send Andrew's program's output for 7x7.
And any of the other listed combinations from my mods to his
program.
I will post my mods even though I still get that seg fault
with null (or single) command line args.
Then you could run your own.

