Re: VFRECIP/VFRSQRT instructions


swallach
 

I have a question.

If one implements square root using a non-restoring divide approach (which I did at one time), is that OK within the approaches proposed and described here?



On Aug 14, 2020, at 1:46 AM, Krste Asanovic <krste@...> wrote:


As Andrew says, verification/compliance concerns outweighed allowing
a more flexible definition. Also, the fixed 7-bit implementation was seen
as being cheap to provide even if more accurate approximations are
added later.

Krste

On Thu, 13 Aug 2020 17:37:01 -0700, "Andrew Waterman" <andrew@...> said:
| The task group did consider that possibility but concluded that forcing
| compatibility is more important. As Bill points out, more precise (or
| flexibly precise) variants can be defined in the future, since there's
| beaucoup opcode space available for unary operations.

| On Thu, Aug 13, 2020 at 4:55 PM Brian Grayson <brian.grayson@...>
| wrote:

| As it stands, I think the spec prevents an implementer from being more
| accurate than described, right? Should the spec specify "accurate to at
| least 7 bits" instead?
| I could envision an embedded implementer who would like just a few bits
| more accuracy and fewer (or none) Newton-Raphson iterations for their
| specific use-case.
| (I've seen architectures that state a minimum accuracy, but leave the
| actual accuracy up to the implementer, which is enough for standard
| software to do the right thing.)

| Brian

| On Thu, Aug 13, 2020 at 5:11 PM Bill Huffman <huffman@...> wrote:

| On 8/13/20 2:33 PM, Andrew Waterman wrote:

| EXTERNAL MAIL

| On Thu, Aug 13, 2020 at 2:29 PM Bill Huffman <huffman@...>
| wrote:

| I think maybe I'm done complaining. :-)

| Hopefully because we've converged, not simply due to exhaustion :)

| Happily, yes. :-)

| Except that the initial paragraph on recip operation needs the
| words "concatenated and" removed.

| Thanks.

| I'm going to merge the pull request now, but additional feedback
| is still welcome, of course.

| Sounds good.

| Bill

| Bill

| On 8/13/20 2:11 PM, Andrew Waterman wrote:

| Good thinking. I've added analogous language for recip,
| too.

| On Thu, Aug 13, 2020 at 12:58 PM Bill Huffman <
| huffman@...> wrote:

| Andrew,

| I'll start at the top here... and with rsqrt since
| it's simpler. I think the table and most of the
| commentary is fine. I can follow the operation
| description. Sort of. But I'm trying to figure out
| how it can be improved. It currently says:

| For the non-exceptional cases, the result is
| computed as follows. Let the normalized input
| exponent be equal to the input exponent if the
| input is normal, or 0 minus the number of leading
| zeros in the significand otherwise. If the input
| is subnormal, the normalized input significand is
| given by shifting the input significand left by 1
| minus the normalized input exponent, discarding
| the leading 1 bit. The output exponent equals
| floor((3*B - 1 - the normalized input exponent) /
| 2). The output sign equals the input sign.

| The following table gives the seven MSBs of the
| output significand as a function of the LSB of the
| normalized input exponent and the six MSBs of the
| normalized input significand; the other bits of
| the output significand are zero.
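
The quoted rsqrt description can be sketched for single precision as follows. This is only an illustration: the function name is hypothetical, B = 127 and a 23-bit stored significand are assumed, and placing the exponent LSB as the high bit of the table index is an assumption about the concatenation order. The lookup table itself is not reproduced.

```python
import struct

B = 127          # single-precision exponent bias (assumed format)
SIG_BITS = 23    # stored significand bits, hidden bit excluded

def rsqrt7_lookup_inputs(x: float):
    """Return (output_exponent, table_index) for a positive, nonzero, finite float32 x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exp = (bits >> SIG_BITS) & 0xFF
    sig = bits & ((1 << SIG_BITS) - 1)
    if exp == 0:
        # Subnormal: normalized exponent is 0 minus the number of leading zeros
        exp = -(SIG_BITS - sig.bit_length())
        # Shift left by (1 - normalized exponent), discarding the leading 1 bit
        sig = (sig << (1 - exp)) & ((1 << SIG_BITS) - 1)
    out_exp = (3 * B - 1 - exp) // 2              # floor((3*B - 1 - e) / 2)
    # Assumed index layout: exponent LSB on top of the six significand MSBs
    index = ((exp & 1) << 6) | (sig >> (SIG_BITS - 6))
    return out_exp, index
```

For example, an input of 1.0 gives a biased output exponent of 126, so the result lands just below 1.0 once the table supplies a significand close to all ones.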

| I wonder if a high level description given first might
| help. For example:

| For the non-exceptional cases the low bit of exponent
| and the six bits of significand (after the leading
| one) are concatenated and used to address the
| following table. The output of the table becomes the
| seven bits of the result significand (after the
| leading one) and the remainder of the result
| significand is zero. Denorm inputs are normalized and
| the exponent adjusted appropriately before the
| lookup. The output exponent is chosen to make the
| result approximate the reciprocal of the square root
| of the argument.

| More precisely, the result is computed as follows.
| .... <your description>

| Bill

| On 8/12/20 9:19 PM, Andrew Waterman wrote:

| On Wed, Aug 12, 2020 at 8:36 PM Bill Huffman <
| huffman@...> wrote:

| On 8/12/20 7:05 PM, Andrew Waterman wrote:

| On Wed, Aug 12, 2020 at 6:56 PM Bill
| Huffman <huffman@...> wrote:

| On 8/12/20 4:21 PM, Andrew Waterman
| wrote:

| On Wed, Aug 12, 2020 at 3:37 PM Bill
| Huffman <huffman@...> wrote:

| On 8/12/20 3:32 PM, Andrew Waterman
| wrote:

| On Wed, Aug 12, 2020 at 3:18 PM Bill
| Huffman <huffman@...> wrote:

| On 8/11/20 4:11 PM, Andrew Waterman
| wrote:

| On Tue, Aug 11, 2020 at 3:35 PM Bill
| Huffman <huffman@...> wrote:

| On 8/11/20 3:00 PM, Andrew Waterman
| wrote:

| On Tue, Aug 11, 2020 at 1:56 PM Bill
| Huffman <huffman@...> wrote:

| Hi Andrew,

| I'm looking at the cases where the
| reciprocal is near the boundary
| between finite and infinite or between
| normal and denormal. Are you trying
| to get the boundaries approximately
| right? Or exactly? For example, the
| point at which the reciprocal of a
| large positive denorm falls over the
| boundary between MAXPOS and +Inf is
| different for RUP and RNE. The
| setting of OF changes at the same
| point. There's yet another point at
| which OF changes for RDN even though
| the answer doesn't change. You don't
| show UF set anywhere.

| There are no cases where UF should be
| raised because there are no cases
| where denormalization causes loss of
| precision. When the result is
| subnormal, it is only subnormal by
| either one or two positions; the
| denormalized 7-bit significand plus
| two bits of right-shift fits within
| all of our formats' significands.
| (This property doesn't hold for
| bfloat16, but that point might be moot
| if our variant of that format always
| flushes subnormals to zero.)
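
The no-UF argument can be checked mechanically. The sketch below is an illustration with a hypothetical function name: it takes every 7-bit output significand plus the hidden bit, places it in a format with `frac_bits` stored fraction bits, and tests whether a right shift of one or two positions drops any nonzero bit.

```python
def denorm_is_exact(frac_bits: int, shift: int) -> bool:
    """True if shifting any 1.sssssss significand right by `shift` loses no bits.

    frac_bits is the number of stored fraction bits (must be >= 7).
    """
    for s in range(128):
        # Hidden bit followed by the 7-bit table output, left-aligned in the field
        full = (0x80 | s) << (frac_bits - 7)
        if full & ((1 << shift) - 1):
            return False   # a nonzero bit would be shifted out
    return True
```

With 10, 23, or 52 fraction bits (half, single, double) both shifts are exact, while a 7-bit fraction, as in bfloat16, already loses a bit at a shift of one.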

| Ah, so you're counting the 7-bit (plus
| hidden bit) result as the absolutely
| correct answer. There's no
| relationship here to the infinite
| precision reciprocal we're
| approximating. This is an instruction
| that throws away 16 bits of input
| mantissa, does a table lookup, and
| gives an answer that's exactly 7 bits
| (plus hidden bit). The relationship
| of this instruction to a reciprocal is
| one of motivation and not closer than
| that.

| I think that's the answer to the
| paradigm question I had. I'll think
| about that a bit and see what I think
| of your edge case results and flags
| then.

| Ah, that clarifies your earlier
| question. Yeah, LMK what you think.

| With that re-orientation to what the
| instruction means, it looks correct.
| I have a couple of comments:

| ★ Just above the table you use the
| concept of the instruction's
| "domain." But the idea of its domain
| does not seem very clear to me. I
| lean toward removing the statement and
| depending on the table.
| ★ In the first normative paragraph
| after the table, you use the number of
| leading zeros in the significand.
| That assumes that the term
| "significand" does not include the
| "hidden" bit, which is zero in the
| case of interest. I think a
| single-precision significand may be
| considered to be 23 bits by some and
| 24 bits by others, leading to some
| confusion about that sentence. It
| might work to reference the leading
| zeros in the represented part of the
| significand.
| ★ As I read farther, it's pretty
| confusing. I worry for most people
| reading it. I wonder if there should
| be a second table referenced where the
| first table says "estimate of 1/x" and
| dealing only with the magnitude of the
| argument. The second table would have
| five rows labeled by operand range -
| as below - and detail each range with
| regard to exponent and
| denormalization:
| ◎ 2^(-B-1) <= x < 2^(-B)
| ◎ 2^(-B) <= x < 2^(-B+1)
| ◎ 2^(-B+1) <= x < 2^(B-1)
| ◎ 2^(B-1) <= x < 2^(B)
| ◎ 2^(B) <= x < 2^(B+1)
| ★ The reciprocal square root would
| be a little different but the same
| idea would apply.
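
The five operand ranges proposed for the second table can be written as a small classifier. This is a sketch with hypothetical names, assuming single precision (B = 127) and working only on the magnitude of the argument, as the suggestion does.

```python
import math

B = 127  # single-precision bias, assumed for illustration

# (low exponent, high exponent) for each proposed row: 2^lo <= |x| < 2^hi
RANGES = [(-B - 1, -B), (-B, -B + 1), (-B + 1, B - 1), (B - 1, B), (B, B + 1)]

def recip_range(x: float):
    """Return the row number (0..4) whose range contains |x|, or None."""
    m = abs(x)
    for row, (lo, hi) in enumerate(RANGES):
        if math.ldexp(1.0, lo) <= m < math.ldexp(1.0, hi):
            return row
    return None
```

Rows 0 and 1 hold the two largest binades of subnormal inputs, row 2 the inputs whose reciprocal stays normal, and rows 3 and 4 the inputs whose reciprocal is subnormal by one or two positions.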

| Any of that make sense?

| Yeah, let me play around with the
| presentation a bit. I'm not sure
| whether breaking it into two tables or
| expanding the current table will be
| clearer, but your suggestion holds
| either way. Thanks for being my
| guinea pig.

| I almost suggested expanding the
| current table. That makes it quite a
| bit larger. But then, it also means
| there's no need to clarify the
| relationship between the two tables.
| Maybe that's better. And it doesn't
| expand the recip sqrt table.

| How about this... it's a beast, but I
| think it works.
| https://github.com/riscv/riscv-v-spec/blob/vfrecip/v-spec.adoc#149-vector-floating-point-reciprocal-estimate-instruction

| It is pretty big... I'm just looking
| at the recip at this point. I have a
| couple of thoughts:

| Yeah, but big is OK, I think.

| Probably so.

| I didn't change the rsqrt table at all.
| Since the subnormal cases are mostly
| uninteresting, I think the NOTE that
| positive subnormal and normal inputs
| always produce normal outputs suffices.

| That's probably OK. It's much less
| confusing. I wonder if two examples for each
| (recip and rsqrt) would help. One with a
| denormal input and the other normal?

| I had been hoping that the reference C code would
| scratch that itch, but you're probably right.
| I've added a tiny example and a huge example for
| each.

| □ In the "Output" column for the 5
| new positive and negative entries, you
| have ... > y > ... but I think you
| should have ... >= y > ... because
| when the input is 127 the table has
| output 0. So when the input is near
| the "left" end of the input range as
| expressed in the table, the output is
| all the way at the left end of the
| output range and needs the "equal."

| It's actually correct as-is, because the
| output value is never exactly a power of
| 2. When the input is exactly a power of
| 2, the result is always slightly larger
| than the true reciprocal. (It's the
| reciprocal of some number near the
| midpoint of the interval ( 2^n,
| nextafter(2^n) )).

| When the input mantissa (including hidden bit)
| is 0xFF0000, the output mantissa is 0x800000,
| if I'm reading the table correctly - 127 in
| leads to zero out.

| The second row in the table has input:

| -2^(B+1) < x ≤ -2^B (normal)

| Table input 127 is near the left end of the
| range while table input 0 is absolutely at the
| right end.

| The left end is not representable but is just
| farther from zero than 0xFF7F_FFFF
| single-precision. The right end is
| 0xFF00_0000 single-precision. These turn into
| 127 and 0 as table inputs and into 0 and 127
| as table outputs. Then they're 0x8020_0000
| and 0x803F_C000 as single-precision. So the
| left end is equal to -2^-(B+1).
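
The two endpoint encodings above can be reproduced with a few bit operations. The sketch below is limited to this one row (single-precision inputs with biased exponent 254, whose result is subnormal by two positions), uses a hypothetical function name, and carries only the two table entries quoted in the thread.

```python
# Only the two reciprocal-table entries mentioned above: index -> 7-bit output
RECIP7 = {0: 127, 127: 0}

def recip_subnormal_by_two(bits: int) -> int:
    """Encode the estimate for a float32 input whose biased exponent is 254."""
    sign = bits & 0x80000000
    index = (bits >> 16) & 0x7F            # seven MSBs of the stored significand
    out = RECIP7[index]
    # Hidden bit plus 7 output bits, then denormalized by two positions
    frac = ((0x80 | out) << 16) >> 2
    return sign | frac                     # exponent field stays zero (subnormal)
```

Feeding in 0xFF7F_FFFF yields 0x8020_0000 and 0xFF00_0000 yields 0x803F_C000, matching the values worked out by hand.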

| and output is listed as:

| -2^-(B+1) > y > -2^-B (subnormal, sig
| [MSB:MSB-1]=01)

| but should allow the equal on the left,
| shouldn't it?

|
| My mistake. I was thinking of the fact that
| power-of-2 inputs never produce power-of-2
| outputs. You're of course right that
| just-smaller-than-power-of-2 inputs do produce
| power-of-2 outputs. Thanks for the correction.

| I also reordered the spec so that vfrsqrte7 shows
| up before vfrece7, since the former is so much
| simpler to explain. More sanity-checking
| appreciated.
| https://github.com/riscv/riscv-v-spec/blob/vfrecip/v-spec.adoc#149-vector-floating-point-reciprocal-square-root-estimate-instruction

| Bill

| □ The expressions of subnormal are
| still awkward. What about (subnormal
| 01...) or (subnormal 1...) and explain
| later what that means. It would be
| easier to read (and the table would be
| a bit smaller).

| Thanks, I was hoping someone would suggest
| a better way of expressing that.

| Bill

|

| Bill

| Bill

| Bill

| As to the large positive denorm input
| case: the only case where this scheme
| and IEEE (1.0 / x) differ in the
| finity of the result, or differ in
| whether OF is raised, is for the exact
| input 2^-(B+1), depending on the
| rounding mode. We always produce a
| finite result for this case, but
| there's an arguable reason for it:
| we're actually computing the
| reciprocal of some number near the
| midpoint of the interval ( 2^-(B+1),
| nextafter(2^-(B+1)) ), the result of
| which is finite, regardless of the
| rounding mode.
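
A quick numeric check of the 2^-(B+1) case, as a sketch for single precision. It assumes the estimate for this input is the table output 127 (the entry listed for index 0 in the dump further down) with an unbiased exponent of B; that pairing is inferred from the thread, not quoted from the spec.

```python
import math

B = 127                                   # single-precision bias (assumed)
FLT_MAX = (2 - 2.0 ** -23) * 2.0 ** B     # largest finite float32 (MAXPOS)

x = math.ldexp(1.0, -(B + 1))             # the exact input 2^-(B+1)
exact = 1.0 / x                           # 2^(B+1), computed in double precision
estimate = (1 + 127 / 128) * 2.0 ** B     # assumed 7-bit estimate for this input

print(exact > FLT_MAX, estimate < FLT_MAX)
```

The exact reciprocal lies above MAXPOS, so whether an IEEE divide returns +Inf or MAXPOS depends on the rounding mode, while the estimate stays below MAXPOS in every mode.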

| So, I'm wondering what your paradigm
| is for the edge cases. I can see it
| might not be worth being too
| complicated since the answer isn't
| very exact. The paradigm is further
| complicated by the idea that the
| answer may be refined by further
| steps. :-)

| Yeah... the intent was to have
| reasonable fidelity. I think you can
| argue the 2^-(B+1) case either way,
| but other ISAs have resolved it the
| same way I did. And it's clearly a
| feature that corner-case detection
| doesn't depend on the significand
| (except for its zeroness, that is).

| Bill

| On 8/10/20 8:54 PM, Andrew Waterman
| wrote:

| I've PRed a full definition of these
| instructions. Please sanity-check my
| work:
| https://github.com/riscv/riscv-v-spec/blob/78191da47644053d0605b21628e1f5e7961ad5bf/v-spec.adoc#149-vector-floating-point-reciprocal-estimate-instruction

| On Mon, Aug 3, 2020 at 5:44 PM Bill
| Huffman <huffman@...> wrote:

| On 8/3/20 1:41 PM, Andrew Waterman
| wrote:

| On Mon, Aug 3, 2020 at 12:40 PM Bill
| Huffman <huffman@...> wrote:

| The recip table matches mine as does
| the worst case error.

| I have one different entry in the
| square root table. For entry 77,
| where you have 36, I have 37. I'm not
| sure whether it matters. Also, ages
| ago, I got a very small difference in
| worst case error of 2^-7.317 but I
| haven't gone back to trace anything
| down about that.

| Thanks for validating against your
| table, Bill.

| With my value for that entry, the
| worst error on the interval of
| interest is 2^-7.32041, for input
| 0x3f1a0000. With yours, it's 2^
| -7.3164 for 0x3f1bfffd.
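
The comparison of 36 and 37 for entry 77 can be reproduced approximately. The sketch assumes the entry covers inputs in [0.6015625, 0.609375) (0x3f1a0000 up to just below 0x3f1c0000), that the estimate is 1 + out/128, and that the error is relative; the function name is hypothetical. Because the relative error is monotonic in x for a fixed output, checking the two interval ends is enough.

```python
import math

def worst_rel_error(out: int, lo: float, hi: float) -> float:
    """Largest relative error of the estimate 1 + out/128 for 1/sqrt(x) on [lo, hi]."""
    est = 1 + out / 128
    return max(abs(est * math.sqrt(x) - 1) for x in (lo, hi))

LO, HI = 0.6015625, 0.609375   # assumed input interval for entry 77
for cand in (36, 37):
    print(cand, math.log2(worst_rel_error(cand, LO, HI)))
```

Under these assumptions 36 gives about 2^-7.3204 (worst at the left end) and 37 about 2^-7.3164 (worst at the right end), in line with the figures quoted above.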

| I agree with your computation with a
| really tiny difference (I get that it
| just barely rounds to 2^-7.32040). I
| can't say why I got 37 when I did it
| 8-10 years ago - and I don't think I'm
| going to chase that. I'm good with 36
| at that position in the table.

| So, I'm good with the table values
| below.

| Bill

| Presumably the error's slightly
| smaller for my scheme because I'm
| picking the output value that
| minimizes the maximum error on the
| interval, rather than picking the
| midpoint or similar. Of course, the
| overall worst error is unaffected.

| Bill

| On 8/3/20 11:38 AM, DSHORNER wrote:

| Now annotated version --detail
| https://github.com/David-Horner/recip/blob/master/vrecip.cc

| For the 7x7 below notice the biased
| value does not exceed 21 for recip (5
| of 7 bits) and 15 for rsqrt (4 of 7
| bits).

| ip 7 op 7 LUT #bits 896 verilog 0 test/test-long 1
| Recip7x7LUT (input [6:0] in, output reg [6:0] out);
| in[6:0] corresponds to sig[S-1:S-6]
| out[6:0] corresponds to sig[S-1:S-6]
| biased : ((ipN-1) - in) << (op - ip) // or >> if neg
| base bias 127 left-shift 0 right-shift 0
| 0: out = 127 biased 0; lerr
| 0.00390625 rerr 0.00387573 larg 0.5
| rarg 0.503906
| 1: out = 125 biased 1; lerr 0.0039978
| rerr 0.00372314 larg 0.503906 rarg
| 0.507812
| 2: out = 123 biased 2; lerr
| 0.00421143 rerr 0.00344849 larg
| 0.507812 rarg 0.511719
| 3: out = 121 biased 3; lerr
| 0.00454712 rerr 0.00305176 larg
| 0.511719 rarg 0.515625
| 4: out = 119 biased 4; lerr
| 0.00500488 rerr 0.00253296 larg
| 0.515625 rarg 0.519531
| 5: out = 117 biased 5; lerr
| 0.00558472 rerr 0.00189209 larg
| 0.519531 rarg 0.523438
| 6: out = 116 biased 5; lerr
| 0.00219727 rerr 0.00524902 larg
| 0.523438 rarg 0.527344
| 7: out = 114 biased 6; lerr
| 0.00299072 rerr 0.00439453 larg
| 0.527344 rarg 0.53125
| 8: out = 112 biased 7; lerr
| 0.00390625 rerr 0.00341797 larg
| 0.53125 rarg 0.535156
| 9: out = 110 biased 8; lerr
| 0.00494385 rerr 0.00231934 larg
| 0.535156 rarg 0.539062
| 10: out = 109 biased 8; lerr
| 0.00189209 rerr 0.00534058 larg
| 0.539062 rarg 0.542969
| 11: out = 107 biased 9; lerr
| 0.00314331 rerr 0.00402832 larg
| 0.542969 rarg 0.546875
| 12: out = 105 biased 10; lerr
| 0.0045166 rerr 0.00259399 larg
| 0.546875 rarg 0.550781
| 13: out = 104 biased 10; lerr
| 0.00170898 rerr 0.00537109 larg
| 0.550781 rarg 0.554688
| 14: out = 102 biased 11; lerr
| 0.0032959 rerr 0.00372314 larg
| 0.554688 rarg 0.558594
| 15: out = 100 biased 12; lerr
| 0.00500488 rerr 0.00195312 larg
| 0.558594 rarg 0.5625
| 16: out = 99 biased 12; lerr
| 0.00244141 rerr 0.00448608 larg 0.5625
| rarg 0.566406
| 17: out = 97 biased 13; lerr
| 0.00436401 rerr 0.00250244 larg
| 0.566406 rarg 0.570312
| 18: out = 96 biased 13; lerr
| 0.00195312 rerr 0.00488281 larg
| 0.570312 rarg 0.574219
| 19: out = 94 biased 14; lerr
| 0.00408936 rerr 0.00268555 larg
| 0.574219 rarg 0.578125
| 20: out = 93 biased 14; lerr
| 0.00183105 rerr 0.00491333 larg
| 0.578125 rarg 0.582031
| 21: out = 91 biased 15; lerr
| 0.00418091 rerr 0.00250244 larg
| 0.582031 rarg 0.585938
| 22: out = 90 biased 15; lerr
| 0.0020752 rerr 0.00457764 larg
| 0.585938 rarg 0.589844
| 23: out = 88 biased 16; lerr
| 0.00463867 rerr 0.00195312 larg
| 0.589844 rarg 0.59375
| 24: out = 87 biased 16; lerr
| 0.00268555 rerr 0.00387573 larg
| 0.59375 rarg 0.597656
| 25: out = 85 biased 17; lerr
| 0.00546265 rerr 0.0010376 larg
| 0.597656 rarg 0.601562
| 26: out = 84 biased 17; lerr
| 0.00366211 rerr 0.00280762 larg
| 0.601562 rarg 0.605469
| 27: out = 83 biased 17; lerr
| 0.00192261 rerr 0.0045166 larg
| 0.605469 rarg 0.609375
| 28: out = 81 biased 18; lerr
| 0.00500488 rerr 0.00137329 larg
| 0.609375 rarg 0.613281
| 29: out = 80 biased 18; lerr
| 0.00341797 rerr 0.00292969 larg
| 0.613281 rarg 0.617188
| 30: out = 79 biased 18; lerr
| 0.00189209 rerr 0.00442505 larg
| 0.617188 rarg 0.621094
| 31: out = 77 biased 19; lerr
| 0.00527954 rerr 0.000976562 larg
| 0.621094 rarg 0.625
| 32: out = 76 biased 19; lerr
| 0.00390625 rerr 0.00231934 larg 0.625
| rarg 0.628906
| 33: out = 75 biased 19; lerr
| 0.00259399 rerr 0.00360107 larg
| 0.628906 rarg 0.632812
| 34: out = 74 biased 19; lerr
| 0.00134277 rerr 0.00482178 larg
| 0.632812 rarg 0.636719
| 35: out = 72 biased 20; lerr
| 0.00512695 rerr 0.000976562 larg
| 0.636719 rarg 0.640625
| 36: out = 71 biased 20; lerr
| 0.00402832 rerr 0.00204468 larg
| 0.640625 rarg 0.644531
| 37: out = 70 biased 20; lerr
| 0.00299072 rerr 0.00305176 larg
| 0.644531 rarg 0.648438
| 38: out = 69 biased 20; lerr
| 0.00201416 rerr 0.0039978 larg
| 0.648438 rarg 0.652344
| 39: out = 68 biased 20; lerr
| 0.00109863 rerr 0.00488281 larg
| 0.652344 rarg 0.65625
| 40: out = 66 biased 21; lerr
| 0.00537109 rerr 0.000549316 larg
| 0.65625 rarg 0.660156
| 41: out = 65 biased 21; lerr
| 0.00460815 rerr 0.00128174 larg
| 0.660156 rarg 0.664062
| 42: out = 64 biased 21; lerr
| 0.00390625 rerr 0.00195312 larg
| 0.664062 rarg 0.667969
| 43: out = 63 biased 21; lerr
| 0.00326538 rerr 0.00256348 larg
| 0.667969 rarg 0.671875
| 44: out = 62 biased 21; lerr
| 0.00268555 rerr 0.00311279 larg
| 0.671875 rarg 0.675781
| 45: out = 61 biased 21; lerr
| 0.00216675 rerr 0.00360107 larg
| 0.675781 rarg 0.679688
| 46: out = 60 biased 21; lerr
| 0.00170898 rerr 0.00402832 larg
| 0.679688 rarg 0.683594
| 47: out = 59 biased 21; lerr
| 0.00131226 rerr 0.00439453 larg
| 0.683594 rarg 0.6875
| 48: out = 58 biased 21; lerr
| 0.000976562 rerr 0.00469971 larg
| 0.6875 rarg 0.691406
| 49: out = 57 biased 21; lerr
| 0.000701904 rerr 0.00494385 larg
| 0.691406 rarg 0.695312
| 50: out = 56 biased 21; lerr
| 0.000488281 rerr 0.00512695 larg
| 0.695312 rarg 0.699219
| 51: out = 55 biased 21; lerr
| 0.000335693 rerr 0.00524902 larg
| 0.699219 rarg 0.703125
| 52: out = 54 biased 21; lerr
| 0.000244141 rerr 0.00531006 larg
| 0.703125 rarg 0.707031
| 53: out = 53 biased 21; lerr
| 0.000213623 rerr 0.00531006 larg
| 0.707031 rarg 0.710938
| 54: out = 52 biased 21; lerr
| 0.000244141 rerr 0.00524902 larg
| 0.710938 rarg 0.714844
| 55: out = 51 biased 21; lerr
| 0.000335693 rerr 0.00512695 larg
| 0.714844 rarg 0.71875
| 56: out = 50 biased 21; lerr
| 0.000488281 rerr 0.00494385 larg
| 0.71875 rarg 0.722656
| 57: out = 49 biased 21; lerr
| 0.000701904 rerr 0.00469971 larg
| 0.722656 rarg 0.726562
| 58: out = 48 biased 21; lerr
| 0.000976562 rerr 0.00439453 larg
| 0.726562 rarg 0.730469
| 59: out = 47 biased 21; lerr
| 0.00131226 rerr 0.00402832 larg
| 0.730469 rarg 0.734375
| 60: out = 46 biased 21; lerr
| 0.00170898 rerr 0.00360107 larg
| 0.734375 rarg 0.738281
| 61: out = 45 biased 21; lerr
| 0.00216675 rerr 0.00311279 larg
| 0.738281 rarg 0.742188
| 62: out = 44 biased 21; lerr
| 0.00268555 rerr 0.00256348 larg
| 0.742188 rarg 0.746094
| 63: out = 43 biased 21; lerr
| 0.00326538 rerr 0.00195312 larg
| 0.746094 rarg 0.75
| 64: out = 42 biased 21; lerr
| 0.00390625 rerr 0.00128174 larg 0.75
| rarg 0.753906
| 65: out = 41 biased 21; lerr
| 0.00460815 rerr 0.000549316 larg
| 0.753906 rarg 0.757812
| 66: out = 40 biased 21; lerr
| 0.00537109 rerr 0.000244141 larg
| 0.757812 rarg 0.761719
| 67: out = 40 biased 20; lerr
| 0.000244141 rerr 0.00488281 larg
| 0.761719 rarg 0.765625
| 68: out = 39 biased 20; lerr
| 0.00109863 rerr 0.0039978 larg
| 0.765625 rarg 0.769531
| 69: out = 38 biased 20; lerr
| 0.00201416 rerr 0.00305176 larg
| 0.769531 rarg 0.773438
| 70: out = 37 biased 20; lerr
| 0.00299072 rerr 0.00204468 larg
| 0.773438 rarg 0.777344
| 71: out = 36 biased 20; lerr
| 0.00402832 rerr 0.000976562 larg
| 0.777344 rarg 0.78125
| 72: out = 35 biased 20; lerr
| 0.00512695 rerr 0.000152588 larg
| 0.78125 rarg 0.785156
| 73: out = 35 biased 19; lerr
| 0.000152588 rerr 0.00482178 larg
| 0.785156 rarg 0.789062
| 74: out = 34 biased 19; lerr
| 0.00134277 rerr 0.00360107 larg
| 0.789062 rarg 0.792969
| 75: out = 33 biased 19; lerr
| 0.00259399 rerr 0.00231934 larg
| 0.792969 rarg 0.796875
| 76: out = 32 biased 19; lerr
| 0.00390625 rerr 0.000976562 larg
| 0.796875 rarg 0.800781
| 77: out = 31 biased 19; lerr
| 0.00527954 rerr 0.000427246 larg
| 0.800781 rarg 0.804688
| 78: out = 31 biased 18; lerr
| 0.000427246 rerr 0.00442505 larg
| 0.804688 rarg 0.808594
| 79: out = 30 biased 18; lerr
| 0.00189209 rerr 0.00292969 larg
| 0.808594 rarg 0.8125
| 80: out = 29 biased 18; lerr
| 0.00341797 rerr 0.00137329 larg 0.8125
| rarg 0.816406
| 81: out = 28 biased 18; lerr
| 0.00500488 rerr 0.000244141 larg
| 0.816406 rarg 0.820312
| 82: out = 28 biased 17; lerr
| 0.000244141 rerr 0.0045166 larg
| 0.820312 rarg 0.824219
| 83: out = 27 biased 17; lerr
| 0.00192261 rerr 0.00280762 larg
| 0.824219 rarg 0.828125
| 84: out = 26 biased 17; lerr
| 0.00366211 rerr 0.0010376 larg
| 0.828125 rarg 0.832031
| 85: out = 25 biased 17; lerr
| 0.00546265 rerr 0.000793457 larg
| 0.832031 rarg 0.835938
| 86: out = 25 biased 16; lerr
| 0.000793457 rerr 0.00387573 larg
| 0.835938 rarg 0.839844
| 87: out = 24 biased 16; lerr
| 0.00268555 rerr 0.00195312 larg
| 0.839844 rarg 0.84375
| 88: out = 23 biased 16; lerr
| 0.00463867 rerr 3.05176E-05 larg
| 0.84375 rarg 0.847656
| 89: out = 23 biased 15; lerr
| 3.05176E-05 rerr 0.00457764 larg
| 0.847656 rarg 0.851562
| 90: out = 22 biased 15; lerr
| 0.0020752 rerr 0.00250244 larg
| 0.851562 rarg 0.855469
| 91: out = 21 biased 15; lerr
| 0.00418091 rerr 0.000366211 larg
| 0.855469 rarg 0.859375
| 92: out = 21 biased 14; lerr
| 0.000366211 rerr 0.00491333 larg
| 0.859375 rarg 0.863281
| 93: out = 20 biased 14; lerr
| 0.00183105 rerr 0.00268555 larg
| 0.863281 rarg 0.867188
| 94: out = 19 biased 14; lerr
| 0.00408936 rerr 0.000396729 larg
| 0.867188 rarg 0.871094
| 95: out = 19 biased 13; lerr
| 0.000396729 rerr 0.00488281 larg
| 0.871094 rarg 0.875
| 96: out = 18 biased 13; lerr
| 0.00195312 rerr 0.00250244 larg 0.875
| rarg 0.878906
| 97: out = 17 biased 13; lerr
| 0.00436401 rerr 6.10352E-05 larg
| 0.878906 rarg 0.882812
| 98: out = 17 biased 12; lerr
| 6.10352E-05 rerr 0.00448608 larg
| 0.882812 rarg 0.886719
| 99: out = 16 biased 12; lerr
| 0.00244141 rerr 0.00195312 larg
| 0.886719 rarg 0.890625
| 100: out = 15 biased 12; lerr
| 0.00500488 rerr 0.000640869 larg
| 0.890625 rarg 0.894531
| 101: out = 15 biased 11; lerr
| 0.000640869 rerr 0.00372314 larg
| 0.894531 rarg 0.898438
| 102: out = 14 biased 11; lerr
| 0.0032959 rerr 0.0010376 larg 0.898438
| rarg 0.902344
| 103: out = 14 biased 10; lerr
| 0.0010376 rerr 0.00537109 larg
| 0.902344 rarg 0.90625
| 104: out = 13 biased 10; lerr
| 0.00170898 rerr 0.00259399 larg
| 0.90625 rarg 0.910156
| 105: out = 12 biased 10; lerr
| 0.0045166 rerr 0.000244141 larg
| 0.910156 rarg 0.914062
| 106: out = 12 biased 9; lerr
| 0.000244141 rerr 0.00402832 larg
| 0.914062 rarg 0.917969
| 107: out = 11 biased 9; lerr
| 0.00314331 rerr 0.00109863 larg
| 0.917969 rarg 0.921875
| 108: out = 11 biased 8; lerr
| 0.00109863 rerr 0.00534058 larg
| 0.921875 rarg 0.925781
| 109: out = 10 biased 8; lerr
| 0.00189209 rerr 0.00231934 larg
| 0.925781 rarg 0.929688
| 110: out = 9 biased 8; lerr
| 0.00494385 rerr 0.000762939 larg
| 0.929688 rarg 0.933594
| 111: out = 9 biased 7; lerr
| 0.000762939 rerr 0.00341797 larg
| 0.933594 rarg 0.9375
| 112: out = 8 biased 7; lerr
| 0.00390625 rerr 0.000244141 larg
| 0.9375 rarg 0.941406
| 113: out = 8 biased 6; lerr
| 0.000244141 rerr 0.00439453 larg
| 0.941406 rarg 0.945312
| 114: out = 7 biased 6; lerr
| 0.00299072 rerr 0.00112915 larg
| 0.945312 rarg 0.949219
| 115: out = 7 biased 5; lerr
| 0.00112915 rerr 0.00524902 larg
| 0.949219 rarg 0.953125
| 116: out = 6 biased 5; lerr
| 0.00219727 rerr 0.00189209 larg
| 0.953125 rarg 0.957031
| 117: out = 5 biased 5; lerr
| 0.00558472 rerr 0.00152588 larg
| 0.957031 rarg 0.960938
| 118: out = 5 biased 4; lerr
| 0.00152588 rerr 0.00253296 larg
| 0.960938 rarg 0.964844
| 119: out = 4 biased 4; lerr
| 0.00500488 rerr 0.000976562 larg
| 0.964844 rarg 0.96875
| 120: out = 4 biased 3; lerr
| 0.000976562 rerr 0.00305176 larg
| 0.96875 rarg 0.972656
| 121: out = 3 biased 3; lerr
| 0.00454712 rerr 0.000549316 larg
| 0.972656 rarg 0.976562
| 122: out = 3 biased 2; lerr
| 0.000549316 rerr 0.00344849 larg
| 0.976562 rarg 0.980469
| 123: out = 2 biased 2; lerr
| 0.00421143 rerr 0.000244141 larg
| 0.980469 rarg 0.984375
| 124: out = 2 biased 1; lerr
| 0.000244141 rerr 0.00372314 larg
| 0.984375 rarg 0.988281
| 125: out = 1 biased 1; lerr 0.0039978
| rerr 6.10352E-05 larg 0.988281 rarg
| 0.992188
| 126: out = 1 biased 0; lerr
| 6.10352E-05 rerr 0.00387573 larg
| 0.992188 rarg 0.996094
| 127: out = 0 biased 0; lerr
| 0.00390625 rerr 0 larg 0.996094 rarg 1
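
The lerr and rerr columns of the reciprocal dump appear to be relative errors at the left and right ends of each input interval. The sketch below recomputes them under that reading, assuming entry i covers [0.5 + i/256, 0.5 + (i+1)/256) and that the estimate is 1 + out/128; the function name is hypothetical and this is not the code that produced the dump.

```python
def recip_errors(index: int, out: int):
    """Return (lerr, rerr): relative error of the estimate at both interval ends."""
    larg = 0.5 + index / 256          # left end of the input interval
    rarg = 0.5 + (index + 1) / 256    # right end of the input interval
    est = 1 + out / 128               # estimate of 1/x, hidden bit plus 7 bits
    return abs(est * larg - 1), abs(est * rarg - 1)
```

For entry 0 (out 127) this gives 0.00390625 and about 0.00387573, and for entry 127 (out 0) it gives 0.00390625 and 0, which agrees with the listed rows.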

| ... [removed hex data dumping]

| RSqrt7x7LUT (input [6:0] in, output reg [6:0] out);
| // in[6] corresponds to exp[0]
| // in[5:0] corresponds to sig[S-1:S-5]
| // out[6:0] corresponds to sig[S-1:S-6]
| // biased : ((ipN-1) - in) << (op - ip)
| 0: out 127 biased 0; lerr 0.00390625
| rerr 0.00384557 larg 0.25 rarg
| 0.253906
| 1: out 125 biased 1; lerr 0.00402773
| rerr 0.00360435 larg 0.253906 rarg
| 0.257812
| 2: out 123 biased 2; lerr 0.00432928
| rerr 0.00318533 larg 0.257812 rarg
| 0.261719
| 3: out 121 biased 3; lerr 0.00480818
| rerr 0.00259111 larg 0.261719 rarg
| 0.265625
| 4: out 119 biased 4; lerr 0.00546183
| rerr 0.00182426 larg 0.265625 rarg
| 0.269531
| 5: out 118 biased 4; lerr 0.0022317
| rerr 0.00497249 larg 0.269531 rarg
| 0.273438
| 6: out 116 biased 5; lerr 0.00319802
| rerr 0.00389675 larg 0.273438 rarg
| 0.277344
| 7: out 114 biased 6; lerr 0.00433191
| rerr 0.00265532 larg 0.277344 rarg
| 0.28125
| 8: out 113 biased 6; lerr 0.00148789
| rerr 0.00542232 larg 0.28125 rarg
| 0.285156
| 9: out 111 biased 7; lerr 0.00292144
| rerr 0.00388464 larg 0.285156 rarg
| 0.289062
| 10: out 109 biased 8; lerr 0.00451607
| rerr 0.0021876 larg 0.289062 rarg
| 0.292969
| 11: out 108 biased 8; lerr 0.00204104
| rerr 0.00458999 larg 0.292969 rarg
| 0.296875
| 12: out 106 biased 9; lerr 0.00392348
| rerr 0.00260824 larg 0.296875 rarg
| 0.300781
| 13: out 105 biased 9; lerr 0.00167641
| rerr 0.00478529 larg 0.300781 rarg
| 0.304688
| 14: out 103 biased 10; lerr
| 0.00383947 rerr 0.00252584 larg
| 0.304688 rarg 0.308594
| 15: out 102 biased 10; lerr 0.0018141
| rerr 0.00448366 larg 0.308594 rarg
| 0.3125
| 16: out 100 biased 11; lerr
| 0.00425098 rerr 0.00195312 larg 0.3125
| rarg 0.316406
| 17: out 99 biased 11; lerr 0.00244141
| rerr 0.00369747 larg 0.316406 rarg
| 0.320312
| 18: out 97 biased 12; lerr 0.00514568
| rerr 0.000902127 larg 0.320312 rarg
| 0.324219
| 19: out 96 biased 12; lerr 0.00354633
| rerr 0.00243843 larg 0.324219 rarg
| 0.328125
| 20: out 95 biased 12; lerr 0.00203674
| rerr 0.00388594 larg 0.328125 rarg
| 0.332031
| 21: out 93 biased 13; lerr 0.00511752
| rerr 0.000717621 larg 0.332031 rarg
| 0.335938
| 22: out 92 biased 13; lerr 0.00381051
| rerr 0.00196455 larg 0.335938 rarg
| 0.339844
| 23: out 91 biased 13; lerr 0.00258984
| rerr 0.00312603 larg 0.339844 rarg
| 0.34375
| 24: out 90 biased 13; lerr 0.00145446
| rerr 0.00420307 larg 0.34375 rarg
| 0.347656
| 25: out 88 biased 14; lerr 0.0050098
| rerr 0.000564416 larg 0.347656 rarg
| 0.351562
| 26: out 87 biased 14; lerr 0.00406783
| rerr 0.00144985 larg 0.351562 rarg
| 0.355469
| 27: out 86 biased 14; lerr 0.00320806
| rerr 0.00225385 larg 0.355469 rarg
| 0.359375
| 28: out 85 biased 14; lerr 0.00242958
| rerr 0.00297735 larg 0.359375 rarg
| 0.363281
| 29: out 84 biased 14; lerr 0.00173146
| rerr 0.00362122 larg 0.363281 rarg
| 0.367188
| 30: out 83 biased 14; lerr 0.00111284
| rerr 0.00418633 larg 0.367188 rarg
| 0.371094
| 31: out 82 biased 14; lerr
| 0.000572846 rerr 0.00467353 larg
| 0.371094 rarg 0.375
| 32: out 80 biased 15; lerr 0.00489479
| rerr 0.00027462 larg 0.375 rarg
| 0.378906
| 33: out 79 biased 15; lerr 0.00453439
| rerr 0.000583717 larg 0.378906 rarg
| 0.382812
| 34: out 78 biased 15; lerr 0.00425002
| rerr 0.000817442 larg 0.382812 rarg
| 0.386719
| 35: out 77 biased 15; lerr 0.0040409
| rerr 0.000976562 larg 0.386719 rarg
| 0.390625
| 36: out 76 biased 15; lerr 0.00390625 rerr 0.00106183 larg 0.390625 rarg 0.394531
| 37: out 75 biased 15; lerr 0.00384534 rerr 0.00107398 larg 0.394531 rarg 0.398438
| 38: out 74 biased 15; lerr 0.00385742 rerr 0.00101372 larg 0.398438 rarg 0.402344
| 39: out 73 biased 15; lerr 0.00394179 rerr 0.00088176 larg 0.402344 rarg 0.40625
| 40: out 72 biased 15; lerr 0.00409775 rerr 0.000678786 larg 0.40625 rarg 0.410156
| 41: out 71 biased 15; lerr 0.00432461 rerr 0.000405468 larg 0.410156 rarg 0.414062
| 42: out 70 biased 15; lerr 0.0046217 rerr 6.24637E-05 larg 0.414062 rarg 0.417969
| 43: out 70 biased 14; lerr 6.24637E-05 rerr 0.00472478 larg 0.417969 rarg 0.421875
| 44: out 69 biased 14; lerr 0.000349583 rerr 0.00426776 larg 0.421875 rarg 0.425781
| 45: out 68 biased 14; lerr 0.000830041 rerr 0.00374284 larg 0.425781 rarg 0.429688
| 46: out 67 biased 14; lerr 0.00137829 rerr 0.00315063 larg 0.429688 rarg 0.433594
| 47: out 66 biased 14; lerr 0.00199374 rerr 0.00249171 larg 0.433594 rarg 0.4375
| 48: out 65 biased 14; lerr 0.00267578 rerr 0.00176667 larg 0.4375 rarg 0.441406
| 49: out 64 biased 14; lerr 0.00342383 rerr 0.000976086 larg 0.441406 rarg 0.445312
| 50: out 63 biased 14; lerr 0.00423733 rerr 0.000120513 larg 0.445312 rarg 0.449219
| 51: out 63 biased 13; lerr 0.000120513 rerr 0.00445945 larg 0.449219 rarg 0.453125
| 52: out 62 biased 13; lerr 0.000799499 rerr 0.00349816 larg 0.453125 rarg 0.457031
| 53: out 61 biased 13; lerr 0.00178341 rerr 0.00247339 larg 0.457031 rarg 0.460938
| 54: out 60 biased 13; lerr 0.0028307 rerr 0.00138568 larg 0.460938 rarg 0.464844
| 55: out 59 biased 13; lerr 0.00394084 rerr 0.00023553 larg 0.464844 rarg 0.46875
| 56: out 59 biased 12; lerr 0.00023553 rerr 0.00439453 larg 0.46875 rarg 0.472656
| 57: out 58 biased 12; lerr 0.000976562 rerr 0.00314314 larg 0.472656 rarg 0.476562
| 58: out 57 biased 12; lerr 0.0022501 rerr 0.00183069 larg 0.476562 rarg 0.480469
| 59: out 56 biased 12; lerr 0.00358461 rerr 0.000457659 larg 0.480469 rarg 0.484375
| 60: out 56 biased 11; lerr 0.000457659 rerr 0.00448366 larg 0.484375 rarg 0.488281
| 61: out 55 biased 11; lerr 0.000975489 rerr 0.00301265 larg 0.488281 rarg 0.492188
| 62: out 54 biased 11; lerr 0.00246829 rerr 0.00148234 larg 0.492188 rarg 0.496094
| 63: out 53 biased 11; lerr 0.00402031 rerr 0.000106817 larg 0.496094 rarg 0.5
| 64: out 52 biased 11; lerr 0.00563109 rerr 0.00210731 larg 0.5 rarg 0.507812
| 65: out 51 biased 11; lerr 0.00345996 rerr 0.00417648 larg 0.507812 rarg 0.515625
| 66: out 50 biased 11; lerr 0.00143345 rerr 0.00610301 larg 0.515625 rarg 0.523438
| 67: out 48 biased 12; lerr 0.00520152 rerr 0.00219486 larg 0.523438 rarg 0.53125
| 68: out 47 biased 12; lerr 0.00349943 rerr 0.00380104 larg 0.53125 rarg 0.539062
| 69: out 46 biased 12; lerr 0.00193497 rerr 0.00527137 larg 0.539062 rarg 0.546875
| 70: out 44 biased 13; lerr 0.00628347 rerr 0.000789331 larg 0.546875 rarg 0.554688
| 71: out 43 biased 13; lerr 0.00502921 rerr 0.00195312 larg 0.554688 rarg 0.5625
| 72: out 42 biased 13; lerr 0.00390625 rerr 0.00298721 larg 0.5625 rarg 0.570312
| 73: out 41 biased 13; lerr 0.00291271 rerr 0.00389343 larg 0.570312 rarg 0.578125
| 74: out 40 biased 13; lerr 0.00204677 rerr 0.00467353 larg 0.578125 rarg 0.585938
| 75: out 39 biased 13; lerr 0.00130667 rerr 0.00532924 larg 0.585938 rarg 0.59375
| 76: out 38 biased 13; lerr 0.000690699 rerr 0.00586222 larg 0.59375 rarg 0.601562
| 77: out 36 biased 14; lerr 0.0062566 rerr 0.000175461 larg 0.601562 rarg 0.609375
| 78: out 35 biased 14; lerr 0.00592317 rerr 0.000428823 larg 0.609375 rarg 0.617188
| 79: out 34 biased 14; lerr 0.00570878 rerr 0.000564416 larg 0.617188 rarg 0.625
| 80: out 33 biased 14; lerr 0.00561191 rerr 0.000583717 larg 0.625 rarg 0.632812
| 81: out 32 biased 14; lerr 0.00563109 rerr 0.000488162 larg 0.632812 rarg 0.640625
| 82: out 31 biased 14; lerr 0.00576489 rerr 0.000279149 larg 0.640625 rarg 0.648438
| 83: out 30 biased 14; lerr 0.00601191 rerr 4.19626E-05 larg 0.648438 rarg 0.65625
| 84: out 30 biased 13; lerr 4.19626E-05 rerr 0.00589256 larg 0.65625 rarg 0.664062
| 85: out 29 biased 13; lerr 0.00047385 rerr 0.00538852 larg 0.664062 rarg 0.671875
| 86: out 28 biased 13; lerr 0.00101522 rerr 0.00477604 larg 0.671875 rarg 0.679688
| 87: out 27 biased 13; lerr 0.00166483 rerr 0.00405633 larg 0.679688 rarg 0.6875
| 88: out 26 biased 13; lerr 0.00242145 rerr 0.0032306 larg 0.6875 rarg 0.695312
| 89: out 25 biased 13; lerr 0.00328389 rerr 0.0023 larg 0.695312 rarg 0.703125
| 90: out 24 biased 13; lerr 0.00425098 rerr 0.00126568 larg 0.703125 rarg 0.710938
| 91: out 23 biased 13; lerr 0.0053216 rerr 0.000128738 larg 0.710938 rarg 0.71875
| 92: out 23 biased 12; lerr 0.000128738 rerr 0.00554953 larg 0.71875 rarg 0.726562
| 93: out 22 biased 12; lerr 0.00110974 rerr 0.00424628 larg 0.726562 rarg 0.734375
| 94: out 21 biased 12; lerr 0.0024487 rerr 0.00284339 larg 0.734375 rarg 0.742188
| 95: out 20 biased 12; lerr 0.0038871 rerr 0.00134187 larg 0.742188 rarg 0.75
| 96: out 19 biased 12; lerr 0.00542395 rerr 0.000257287 larg 0.75 rarg 0.757812
| 97: out 19 biased 11; lerr 0.000257287 rerr 0.00488281 larg 0.757812 rarg 0.765625
| 98: out 18 biased 11; lerr 0.00195312 rerr 0.00312603 larg 0.765625 rarg 0.773438
| 99: out 17 biased 11; lerr 0.0037447 rerr 0.00127425 larg 0.773438 rarg 0.78125
| 100: out 16 biased 11; lerr 0.00563109 rerr 0.000671612 larg 0.78125 rarg 0.789062
| 101: out 16 biased 10; lerr 0.000671612 rerr 0.00426337 larg 0.789062 rarg 0.796875
| 102: out 15 biased 10; lerr 0.00271068 rerr 0.00216607 larg 0.796875 rarg 0.804688
| 103: out 14 biased 10; lerr 0.00484208 rerr 2.28884E-05 larg 0.804688 rarg 0.8125
| 104: out 14 biased 9; lerr 2.28884E-05 rerr 0.00477319 larg 0.8125 rarg 0.820312
| 105: out 13 biased 9; lerr 0.00230268 rerr 0.00243701 larg 0.820312 rarg 0.828125
| 106: out 12 biased 9; lerr 0.00467248 rerr 1.1444E-05 larg 0.828125 rarg 0.835938
| 107: out 12 biased 8; lerr 1.1444E-05 rerr 0.00467353 larg 0.835938 rarg 0.84375
| 108: out 11 biased 8; lerr 0.00250271 rerr 0.00210469 larg 0.84375 rarg 0.851562
| 109: out 10 biased 8; lerr 0.0051047 rerr 0.000551376 larg 0.851562 rarg 0.859375
| 110: out 10 biased 7; lerr 0.000551376 rerr 0.00398129 larg 0.859375 rarg 0.867188
| 111: out 9 biased 7; lerr 0.00329393 rerr 0.00118567 larg 0.867188 rarg 0.875
| 112: out 9 biased 6; lerr 0.00118567 rerr 0.00564531 larg 0.875 rarg 0.882812
| 113: out 8 biased 6; lerr 0.00169516 rerr 0.00271239 larg 0.882812 rarg 0.890625
| 114: out 7 biased 6; lerr 0.0046605 rerr 0.000304507 larg 0.890625 rarg 0.898438
| 115: out 7 biased 5; lerr 0.000304507 rerr 0.00403259 larg 0.898438 rarg 0.90625
| 116: out 6 biased 5; lerr 0.00340469 rerr 0.00088176 larg 0.90625 rarg 0.914062
| 117: out 6 biased 4; lerr 0.00088176 rerr 0.00514993 larg 0.914062 rarg 0.921875
| 118: out 5 biased 4; lerr 0.00235119 rerr 0.00186722 larg 0.921875 rarg 0.929688
| 119: out 4 biased 4; lerr 0.00566562 rerr 0.00149648 larg 0.929688 rarg 0.9375
| 120: out 4 biased 3; lerr 0.00149648 rerr 0.00265532 larg 0.9375 rarg 0.945312
| 121: out 3 biased 3; lerr 0.00494055 rerr 0.0008372 larg 0.945312 rarg 0.953125
| 122: out 3 biased 2; lerr 0.0008372 rerr 0.00324937 larg 0.953125 rarg 0.960938
| 123: out 2 biased 2; lerr 0.00440902 rerr 0.000370094 larg 0.960938 rarg 0.96875
| 124: out 2 biased 1; lerr 0.000370094 rerr 0.00365258 larg 0.96875 rarg 0.976562
| 125: out 1 biased 1; lerr 0.00406783 rerr 9.20338E-05 larg 0.976562 rarg 0.984375
| 126: out 1 biased 0; lerr 9.20338E-05 rerr 0.00386801 larg 0.984375 rarg 0.992188
| 127: out 0 biased 0; lerr 0.00391391 rerr 0 larg 0.992188 rarg 1

| ... [removed hex data dumping]

| max recip 7x7 error at 0.519531: 0.00558472 or 2^-7.4843
| max rsqrt 7x7 error at 0.546875: 0.00628347 or 2^-7.31422
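
The "2^-x" figures in the two lines above are the base-2 logarithm of the worst-case relative error, so x can be read as the number of accurate bits. A small Python sketch of that conversion (the helper name is hypothetical, not from the thread):

```python
import math


def accuracy_bits(rel_err):
    # Number of accurate bits implied by a relative error: err == 2^-bits.
    return -math.log2(rel_err)


if __name__ == "__main__":
    # Worst-case errors quoted in the thread for the 7-bit-in, 7-bit-out tables.
    print("recip:", accuracy_bits(0.00558472))
    print("rsqrt:", accuracy_bits(0.00628347))
```

Both values land a little above 7, which is consistent with the "7 bits" accuracy discussed earlier in the thread.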

| On 2020-08-03 1:17 p.m., Bill Huffman wrote:

| I should have said that my results are for the 7/7 case. And it sounds
| like we're in agreement then. We probably have the same table.

| Bill

| On 8/2/20 9:50 AM, DSHORNER wrote:

| This is the link to the revised code, which handles an n-by-m LUT:

| https://github.com/David-Horner/recip/blob/master/vrecip.cc

| On 2020-08-01 4:51 p.m., David Horner via lists.riscv.org wrote:



