#### Re: Integer Overflow/Saturation Operations

Andy Glew Si5

For extended precision arithmetic, e.g.  such as is often performed in cryptography, 2X widening  multiply accumulate is the best that I have found. (And as far as I know other members of the cryptography working group -  It is a topic of much discussion.)  Something of the form of

vd.128[i] += vs2.64[i] * vs1.64[i]

i.e. VMACCU.VV, with SEW = 64 bits  (Assuming I remember correctly that SEW is the pre-widened  element width)

Overflow problems are avoided  by putting only 56 bits  of meaningful data in each of the 64 bits => 112 bits of product => 16 guard bits  in the accumulator.   Every 2^16 (less a few)  iterations you need to propagate carries.  If 16 guard bits are not enough,  then put only 48 bits of meaningful data => 32 guard bits.

VMACC.VV,  the signed version, is similarly useful for people who are using signed redundant data rather than unsigned.

--

This approach,  of course, only works well if  operations such as VRGATHER or scatter/gather (indexed)  memory accesses are efficient, or at least not horribly slow.

It only wants to take advantage of the largest possible width doubling multiply. I.e. if you can do 64*64=+128,  then it doesn't need 32*32=+64,  except for  running code written with that size but has not been optimized to take advantage of wider multiply.

People from other companies report success doing such computations with 24  bits used in every 32-bit word,  and even 28 in 32 bits -  although that requires bit manipulation, not just byte manipulation.  In any case, we nearly always want to use the largest possible multiplier.

--

In a packed SIMD manner (aka "divided elements"), the 56r64 approach works well with cross-multiplies:

vd.128[i] += vs2.128[i].hi64 * vs1.128[i].lo64    +    vs2.128[i].lo64 * vs1.128[i].hi64

Although again that is not  in the current vector extension.

--

Exact (non-saturating,  non-lossy)  integer/fixed point DSP, of course, really wants 4X widening operations, such as

vd.32[i] += vs2.8[i] * vs1.8[i]

As well as mixed width

vd.32[i] += vs2.16[i] * vs1.8[i]

but these are not in the current  vector proposal.

Being familiar with 4X widening operations like the above I tried to  use them for cryptography, but it's just plain more efficient to use 2X widening,  if you can  arrange to get 56 bits in each 64-bit  vector element efficiently enough.

--

These examples show how 2X widening multiply accumulate can be used even without saturation  or overflow flags.

However, if you only provided a  saturating 2X widening multiply accumulate, extended precision arithmetic could still use the 56r64 (and 112r128)  approach above and just back off a few iterations before propagating carry.

From: Cds <cohen.steed@...>
Sent: Friday, August 07, 2020 8:51AM
To: Tech-Vector-Ext <tech-vector-ext@...>
Subject: [RISC-V] [tech-vector-ext] Integer Overflow/Saturation Operations

On 8/7/2020 8:51 AM, CDS wrote:

Vector-widening multiply & accumulate instructions:

• These instructions, signed or unsigned, will quickly overflow in even simple cases.
• Given absence of flagging (e.g. OVERFLOW), a saturating version of these instructions would prevent users from making unintentional errors.
• The current specification leaves the user doing clunky processing to check for overflow after every iteration in a loop.

How else could these instructions be used practically? What is the expectation for utility when the operations overflow quickly?

--- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis

Join {tech-vector-ext@lists.riscv.org to automatically receive all group messages.