Re: Integer Overflow/Saturation Operations
Andy Glew Si5
For extended precision arithmetic,
such as is often performed in cryptography, 2X widening
multiply accumulate is the best that I have found. (And, as far as
I know, the same is true for other members of the cryptography
working group. It is a topic of much discussion.) Something of the
form of
vd.128[i] += vs2.64[i] * vs1.64[i]
i.e. VMACCU.VV, with SEW = 64 bits (assuming I remember
correctly that SEW is the pre-widened element width).
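To make the element-wise semantics concrete, here is a minimal Python sketch of the operation written above. The function name `widening_macc_u` is hypothetical, and wrapping modulo 2^128 on overflow is an assumption about the non-saturating form, not something taken from the spec.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def widening_macc_u(vd, vs2, vs1):
    """Emulate vd.128[i] += vs2.64[i] * vs1.64[i] on unsigned elements."""
    out = []
    for acc, a, b in zip(vd, vs2, vs1):
        product = (a & MASK64) * (b & MASK64)   # full 128-bit product
        out.append((acc + product) & MASK128)   # assumed: wraps modulo 2**128
    return out
```

With full 64-bit inputs a single product can already come close to filling the accumulator, which is why the data is restricted to fewer bits per element.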
Overflow problems are avoided by putting only 56 bits of
meaningful data in each 64-bit element => 112 bits of product
=> 16 guard bits in the 128-bit accumulator. Every 2^16 (less a
few) iterations you need to propagate carries. If 16 guard bits
are not enough, then put only 48 bits of meaningful data => 96
bits of product => 32 guard bits.
VMACC.VV, the signed version, is similarly useful for people
who are using signed redundant data rather than unsigned.
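The guard-bit budgeting described above can be sketched in Python as follows. This is only an illustration: Python integers stand in for the 128-bit accumulator elements, the helper names (`max_safe_accumulations`, `to_limbs`, `multiply`) are made up for this example, and the bound assumes an accumulator that starts at zero with limbs strictly below 2^56, so a real loop would stop somewhat earlier.

```python
LIMB_BITS = 56                      # meaningful bits per 64-bit element
ACC_BITS = 128                      # width of each accumulator element
LIMB_MASK = (1 << LIMB_BITS) - 1

def max_safe_accumulations(limb_bits=LIMB_BITS, acc_bits=ACC_BITS):
    """Number of worst-case limb products that fit an accumulator starting at zero."""
    largest_product = ((1 << limb_bits) - 1) ** 2
    return ((1 << acc_bits) - 1) // largest_product

def to_limbs(x, count):
    """Split a non-negative integer into `count` limbs of LIMB_BITS each."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]

def multiply(a, b, count):
    """Schoolbook product with carries deferred until all products are summed."""
    assert 0 <= a < (1 << (LIMB_BITS * count))
    assert 0 <= b < (1 << (LIMB_BITS * count))
    # Each column receives at most `count` products.
    assert count <= max_safe_accumulations()
    xs, ys = to_limbs(a, count), to_limbs(b, count)
    columns = [0] * (2 * count)
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            columns[i + j] += x * y                 # multiply accumulate, no carry
            assert columns[i + j] < (1 << ACC_BITS) # guard bits absorb the growth
    # Single carry-propagation pass at the end.
    result, carry = 0, 0
    for k, col in enumerate(columns):
        total = col + carry
        result |= (total & LIMB_MASK) << (LIMB_BITS * k)
        carry = total >> LIMB_BITS
    return result
```

For example, `multiply(a, b, 4)` handles operands of up to 224 bits and should agree with Python's built-in `a * b`.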

This approach, of course, only works well if operations such
as VRGATHER or scatter/gather (indexed) memory accesses are
efficient, or at least not horribly slow.
This approach only wants to take advantage of the largest possible
width-doubling multiply. I.e. if you can do 64*64=+128, then it
doesn't need 32*32=+64, except for running code that was written
for that size and has not been optimized to take advantage of the
wider multiply.
People from other companies report success doing such
computations with 24 bits used in every 32-bit word, and even 28
in 32 bits, although that requires bit manipulation, not just
byte manipulation. In any case, we nearly always want to use the
largest possible multiplier.

In a packed SIMD manner (aka "divided elements"), the 56r64
approach works well with cross-multiplies:
vd.128[i] += vs2.128[i].hi64 * vs1.128[i].lo64 +
vs2.128[i].lo64 * vs1.128[i].hi64
Although again that is not in the current vector extension.
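A rough Python emulation of the cross-multiply form above, for readers who want to see the data flow. The name `cross_macc` is hypothetical, and the wrap modulo 2^128 is an assumption. With 56 meaningful bits in each 64-bit half, the sum of the two products stays below 2^113, so one such operation uses up roughly one more guard bit than a single product does.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def cross_macc(vd, vs2, vs1):
    """Emulate vd.128[i] += vs2.hi64 * vs1.lo64 + vs2.lo64 * vs1.hi64."""
    out = []
    for acc, a, b in zip(vd, vs2, vs1):
        a_hi, a_lo = (a >> 64) & MASK64, a & MASK64   # halves of vs2.128[i]
        b_hi, b_lo = (b >> 64) & MASK64, b & MASK64   # halves of vs1.128[i]
        cross = a_hi * b_lo + a_lo * b_hi
        out.append((acc + cross) & MASK128)           # assumed: wraps modulo 2**128
    return out
```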

Exact (non-saturating, non-lossy) integer/fixed-point DSP, of
course, really wants 4X widening operations, such as
vd.32[i] += vs2.8[i] * vs1.8[i]
as well as mixed-width operations such as
vd.32[i] += vs2.16[i] * vs1.8[i]
but these are not in the current vector proposal.
Being familiar with 4X widening operations like the above, I
tried to use them for cryptography, but it's just plain more
efficient to use 2X widening, if you can arrange to get 56 bits
in each 64-bit vector element efficiently enough.
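For comparison, the 4X widening forms mentioned above (8*8 into 32, and mixed 16*8 into 32) can be emulated along these lines. This is a sketch under assumptions: the function and parameter names are invented for illustration, only the unsigned case is shown, and overflow is modeled as wrapping.

```python
def widening4x_macc_u(vd, vs2, vs1, src2_bits=8, src1_bits=8, acc_bits=32):
    """Emulate vd.32[i] += vs2.N[i] * vs1.M[i] for unsigned elements."""
    mask2 = (1 << src2_bits) - 1
    mask1 = (1 << src1_bits) - 1
    acc_mask = (1 << acc_bits) - 1
    out = []
    for acc, a, b in zip(vd, vs2, vs1):
        product = (a & mask2) * (b & mask1)       # at most src2_bits + src1_bits wide
        out.append((acc + product) & acc_mask)    # assumed: wraps modulo 2**acc_bits
    return out
```

An 8*8 product needs at most 16 bits, so a 32-bit accumulator leaves 16 bits of headroom without any restriction on the input data. That is the exactness DSP code wants, whereas the 56-in-64 trick gets the same headroom from 2X widening by restricting the data instead.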

These examples show how 2X widening multiply accumulate can be
used even without saturation or overflow flags.
However, if you only provided a saturating 2X widening
multiply accumulate, extended precision arithmetic could still use
the 56r64 (and 112r128) approach above and just back off a few
iterations before propagating carry.
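A small Python sketch of that back-off idea, assuming an unsigned accumulator that saturates at 2^128 - 1. Everything here is illustrative: the function names are hypothetical, `margin=4` is an arbitrary stand-in for "a few", and draining the accumulator into a Python integer stands in for real carry propagation.

```python
ACC_MAX = (1 << 128) - 1

def saturating_macc_u(acc, a, b):
    """One saturating multiply accumulate step on a 128-bit unsigned accumulator."""
    return min(acc + a * b, ACC_MAX)

def accumulate_with_backoff(pairs, limb_bits=56, margin=4):
    """Sum products in batches short enough that saturation can never trigger."""
    largest_product = ((1 << limb_bits) - 1) ** 2
    batch = ACC_MAX // largest_product - margin   # back off a few iterations
    total, acc = 0, 0
    for n, (a, b) in enumerate(pairs, start=1):
        acc = saturating_macc_u(acc, a, b)
        if n % batch == 0:
            total += acc      # stands in for carry propagation
            acc = 0
    return total + acc
```

As long as every input really fits in `limb_bits`, the saturating and wrapping forms give identical results, because the accumulator is drained before it can reach the limit.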
From: Cds <cohen.steed@...>
Sent: Friday, August 07, 2020 8:51AM
To: TechVectorExt <techvectorext@...>
Subject: [RISCV] [techvectorext] Integer Overflow/Saturation Operations
On 8/7/2020 8:51 AM, CDS wrote:
Vector-widening multiply & accumulate instructions:
How else could these instructions be used practically? What is the expectation for utility when the operations overflow quickly?
Sorry: Typos (SpeechOs?) Writing Errors <= Speech Recognition
<= Computeritis

