#### Integer Overflow/Saturation Operations

CDS <cohen.steed@...>

Vector-widening multiply & accumulate instructions:

• These instructions, signed or unsigned, will quickly overflow in even simple cases.
• Given absence of flagging (e.g. OVERFLOW), a saturating version of these instructions would prevent users from making unintentional errors.
• The current specification leaves the user doing clunky processing to check for overflow after every iteration in a loop.

How else could these instructions be used practically? What is the expectation for utility when the operations overflow quickly?

Andy Glew Si5

For extended precision arithmetic, e.g.  such as is often performed in cryptography, 2X widening  multiply accumulate is the best that I have found. (And as far as I know other members of the cryptography working group -  It is a topic of much discussion.)  Something of the form of

vd.128[i] += vs2.64[i] * vs1.64[i]

i.e. VMACCU.VV, with SEW = 64 bits  (Assuming I remember correctly that SEW is the pre-widened  element width)

Overflow problems are avoided  by putting only 56 bits  of meaningful data in each of the 64 bits => 112 bits of product => 16 guard bits  in the accumulator.   Every 2^16 (less a few)  iterations you need to propagate carries.  If 16 guard bits are not enough,  then put only 48 bits of meaningful data => 32 guard bits.

VMACC.VV,  the signed version, is similarly useful for people who are using signed redundant data rather than unsigned.

--

This approach,  of course, only works well if  operations such as VRGATHER or scatter/gather (indexed)  memory accesses are efficient, or at least not horribly slow.

It only wants to take advantage of the largest possible width doubling multiply. I.e. if you can do 64*64=+128,  then it doesn't need 32*32=+64,  except for  running code written with that size but has not been optimized to take advantage of wider multiply.

People from other companies report success doing such computations with 24  bits used in every 32-bit word,  and even 28 in 32 bits -  although that requires bit manipulation, not just byte manipulation.  In any case, we nearly always want to use the largest possible multiplier.

--

In a packed SIMD manner (aka "divided elements"), the 56r64 approach works well with cross-multiplies:

vd.128[i] += vs2.128[i].hi64 * vs1.128[i].lo64    +    vs2.128[i].lo64 * vs1.128[i].hi64

Although again that is not  in the current vector extension.

--

Exact (non-saturating,  non-lossy)  integer/fixed point DSP, of course, really wants 4X widening operations, such as

vd.32[i] += vs2.8[i] * vs1.8[i]

As well as mixed width

vd.32[i] += vs2.16[i] * vs1.8[i]

but these are not in the current  vector proposal.

Being familiar with 4X widening operations like the above I tried to  use them for cryptography, but it's just plain more efficient to use 2X widening,  if you can  arrange to get 56 bits in each 64-bit  vector element efficiently enough.

--

These examples show how 2X widening multiply accumulate can be used even without saturation  or overflow flags.

However, if you only provided a  saturating 2X widening multiply accumulate, extended precision arithmetic could still use the 56r64 (and 112r128)  approach above and just back off a few iterations before propagating carry.

From: Cds <cohen.steed@...>
Sent: Friday, August 07, 2020 8:51AM
To: Tech-Vector-Ext <tech-vector-ext@...>
Subject: [RISC-V] [tech-vector-ext] Integer Overflow/Saturation Operations

On 8/7/2020 8:51 AM, CDS wrote:

Vector-widening multiply & accumulate instructions:

• These instructions, signed or unsigned, will quickly overflow in even simple cases.
• Given absence of flagging (e.g. OVERFLOW), a saturating version of these instructions would prevent users from making unintentional errors.
• The current specification leaves the user doing clunky processing to check for overflow after every iteration in a loop.

How else could these instructions be used practically? What is the expectation for utility when the operations overflow quickly?

--- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis

CDS <cohen.steed@...>

Andy,

The concern I'm raising is less about "How do I avoid overflow?" and more about "Why are we avoiding the specification of saturating instructions, or an overflow flag?"

I can always avoid overflow by using smaller numbers, or using a larger data path. Sometimes, I won't know that I've exceeded my data path limits. Sure, design for safe margins, yet we rely on hardware signaling to inform us when we've gone beyond a hard limit. Another way of stating the question: Why NOT have saturation or overflow for these instructions? The alternative is a silent data corruption.

Mikael <mikael.mortensen@...>

As Cohen responded. The examples provided only work, as you rightly noted, if you have implied guard bits in the system.
This is fine, with the caveat that the ISA definition doesn't provision for the users to have manually manage guard bits in their choice of math to fully utilize the vector engine.
The open question is, what would a user of the vector ISA expect the results to be of int64 += int32 * int32? fixedPoint64 += fixedPoint32 * fixedPoint32?

From our perspective we are looking at the ISA extension as defined. Which is without guard bits in the accumulation, and from this its easy to demonstrate that even for signed widening MAC operations overflow happens quite easily. I think we all agree on this point.
I would also like to stress here that we are not trying to build an argument for added guard bits a special accumulation registers, as this would break the vector register model.

Now the challenges we find is the only way out of this conundrum, as you point out, is to use the optional quad widening operation.
As you point out, for DSP operations, integer/fixed point, this is an issue that is unresolved.
Fundamentally this boils down to a realization that the vector ISA as defined is not well suited for DSP operations, unless float is used. Which is the point we are trying to get across.

Saturation is a short-cut to at least detect when an overflow happens. This has one caveat, which is saturation in a DSP filter is a sign of issues. You really do not want the innards of a DSP filter to saturate.
As currently defined. The vector ISA either mandates the use of the optional quad widening instruction for DSP math, which implies the DSP math is optional to begin with, or DSP math has to be performed in floating point, which defeats the purpose of supporting fixed point DSP math to begin with.

There are alternatives involve utilizing block float style fixed point coding. This does introduce extra overhead for each fixed point math operation as well as a need to keep track of various decimal points throughout the signal path.

 1 - 4 of 4