Re: Integer Overflow/Saturation Operations
Andy Glew Si5
For extended precision arithmetic, e.g. such as is often performed in cryptography, 2X widening multiply accumulate is the best that I have found. (And as far as I know other members of the cryptography working group - It is a topic of much discussion.) Something of the form of
vd.128[i] += vs2.64[i] * vs1.64[i]
i.e. VMACCU.VV, with SEW = 64 bits (Assuming I remember correctly that SEW is the pre-widened element width)
Overflow problems are avoided by putting only 56 bits of meaningful data in each of the 64 bits => 112 bits of product => 16 guard bits in the accumulator. Every 2^16 (less a few) iterations you need to propagate carries. If 16 guard bits are not enough, then put only 48 bits of meaningful data => 32 guard bits.
VMACC.VV, the signed version, is similarly useful for people who are using signed redundant data rather than unsigned.
This approach, of course, only works well if operations such as VRGATHER or scatter/gather (indexed) memory accesses are efficient, or at least not horribly slow.
It only wants to take advantage of the largest possible width doubling multiply. I.e. if you can do 64*64=+128, then it doesn't need 32*32=+64, except for running code written with that size but has not been optimized to take advantage of wider multiply.
People from other companies report success doing such computations with 24 bits used in every 32-bit word, and even 28 in 32 bits - although that requires bit manipulation, not just byte manipulation. In any case, we nearly always want to use the largest possible multiplier.
In a packed SIMD manner (aka "divided elements"), the 56r64 approach works well with cross-multiplies:
vd.128[i] += vs2.128[i].hi64 * vs1.128[i].lo64 + vs2.128[i].lo64 * vs1.128[i].hi64
Although again that is not in the current vector extension.
Exact (non-saturating, non-lossy) integer/fixed point DSP, of course, really wants 4X widening operations, such as
vd.32[i] += vs2.8[i] * vs1.8[i]
As well as mixed width
vd.32[i] += vs2.16[i] * vs1.8[i]
but these are not in the current vector proposal.
Being familiar with 4X widening operations like the above I tried to use them for cryptography, but it's just plain more efficient to use 2X widening, if you can arrange to get 56 bits in each 64-bit vector element efficiently enough.
These examples show how 2X widening multiply accumulate can be used even without saturation or overflow flags.
However, if you only provided a saturating 2X widening multiply accumulate, extended precision arithmetic could still use the 56r64 (and 112r128) approach above and just back off a few iterations before propagating carry.
From: Cds <cohen.steed@...>
Sent: Friday, August 07, 2020 8:51AM
To: Tech-Vector-Ext <tech-vector-ext@...>
Subject: [RISC-V] [tech-vector-ext] Integer Overflow/Saturation Operations
On 8/7/2020 8:51 AM, CDS wrote:
Vector-widening multiply & accumulate instructions:
How else could these instructions be used practically? What is the expectation for utility when the operations overflow quickly?
--- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis