Re: VFRECIP/VFRSQRT instructions
Mr Grigorios Magklis
Hi Andrew et al, Thank you for sending the code. I am attaching an updated version of recip.cc, implementing the complete Newton-Raphson sequence, using the proposed reciprocal estimate instructions as seed. The code also uses a slightly different way to calculate the error. If my calculations are not wrong, with a 7-bit accurate estimate, two iterations achieve ~1 ULP error for reciprocal and ~2 ULP error for reciprocal square root. I have a concern about not including special "step" instructions. The complete N-R sequences are as follows: float newton_raphson_recip(float r) { float x = recip(r); int k = fpclassify(x); if (k != FP_INFINITE && k != FP_ZERO) { float t = fmaf(-r, x, 2.0f); // fma x = x * t; // fmul t = fmaf(-r, x, 2.0f); // fma x = x * t; // fmul } return x; } float newton_raphson_rsqrt(float r) { float x = rsqrt(r); int k = fpclassify(r); if (k != FP_INFINITE && k != FP_ZERO) { float t = r * x; // fmul float h = 0.5 * x; // fmul t = fmaf(-t, x, 3.0f); // fma x = h * t; // fmul t = r * x; // fmul h = 0.5 * x; // fmul t = fmaf(-t, x, 3.0f); // fma x = h * t; // fmul } return x; } As you can see above, even with proper corner-case handling in the estimate instruction, we still need to do 0 and inf checking in the sequence. This is because, e.g. an input of 0 will produce an infinity result in the estimate, but then during the sequence you need to multiply the original value with the estimate, which generates a NaN which gets propagated all the way to the end. Moreover, if the code is vectorized, the condition will generate a predicate (in v0) which needs to be applied to all the iterations. If the sequence already was predicated, using v0, we need two extra instructions to save and restore the original value of v0 at the beginning and end of the sequence. Roughly, calculating the number of extra instructions in each sequence I get:
This is 6-7 extra instructions for recip and 7-8 extra instructions for rsqrt. The two N-R iterations plus estimate would be 5 instructions for recip and 9 instructions for rsqrt. This would be a total of 11-12 instructions for recip and 16-17 instructions for rsqrt. Apart from the extra instructions, the sequences would require 3-5 extra registers apart from the ~2 temporary registers needed to hold arithmetic results (rough calculation). Given v0 used for the predicate, and 1 register holding the input value, the rsqrt sequence would need more than 4 registers (7-9 I think), which means it would not be able to run on a configuration with LMUL=8 and maybe not with LMUL=4 either, unless we fill/spill registers to memory. If we had the following special "step" instructions the sequences would be much shorter (see also the attached recip_step.cc file): A recip_step instruction that calculates 2-a*b with the following special case handling:
A rsqrt_step instruction that calculates (3-a*b)/2 with the following special case handling:
With the above instructions, the N-R sequences are reduced to: float newton_raphson_recip_with_step(float r) { float x = recip(r); float t = recip_step(r, x); // fma x = x * t; // fmul t = recip_step(r, x); // fma x = x * t; // fmul return x; } float newton_raphson_rsqrt_with_step(float r) { float x = rsqrt(r); float t = r * x; // fmul t = rsqrt_step(t, x); // fma x = x * t; // fmul t = r * x; // fmul t = rsqrt_step(t, x); // fma x = x * t; // fmul return x; } Which is a total of 5 instructions for recip (vs 11-12 without recip_step) and 7 instructions for rsqrt (vs 16-17 without rsqrt_step), including under predication. Both sequences need 2 temporary registers, for a total of 4 registers (including mask and input value), so they can be executed for all LMUL values. For double precision, the analysis is similar (number of extra instructions and registers), but instead of two there are three N-R iterations. Thanks, Grigorios PS: I also tried the optimization that you propose of using (3/2 - a/2 * b) in the rsqrt sequence to save 1 instruction/iteration, but for subnormal inputs it generates significant error compared to the IEEE result.
|
|