Date
1  4 of 4
Integer Overflow/Saturation Operations
CDS <cohen.steed@...>
Vectorwidening multiply & accumulate instructions:
How else could these instructions be used practically? What is the expectation for utility when the operations overflow quickly?


Andy Glew Si5
For extended precision arithmetic,
e.g. such as is often performed in cryptography, 2X widening
multiply accumulate is the best that I have found. (And as far as
I know other members of the cryptography working group  It is a
topic of much discussion.) Something of the form of
vd.128[i] += vs2.64[i] * vs1.64[i]
i.e. VMACCU.VV, with SEW = 64 bits (Assuming I remember
correctly that SEW is the prewidened element width)
Overflow problems are avoided by putting only 56 bits of
meaningful data in each of the 64 bits => 112 bits of product
=> 16 guard bits in the accumulator. Every 2^16 (less a
few) iterations you need to propagate carries. If 16 guard bits
are not enough, then put only 48 bits of meaningful data => 32
guard bits.
VMACC.VV, the signed version, is similarly useful for people
who are using signed redundant data rather than unsigned.

This approach, of course, only works well if operations such
as VRGATHER or scatter/gather (indexed) memory accesses are
efficient, or at least not horribly slow.
It only wants to take advantage of the largest possible width
doubling multiply. I.e. if you can do 64*64=+128, then it doesn't
need 32*32=+64, except for running code written with that size
but has not been optimized to take advantage of wider multiply.
People from other companies report success doing such
computations with 24 bits used in every 32bit word, and even 28
in 32 bits  although that requires bit manipulation, not just
byte manipulation. In any case, we nearly always want to use the
largest possible multiplier.

In a packed SIMD manner (aka "divided elements"), the 56r64
approach works well with crossmultiplies:
vd.128[i] += vs2.128[i].hi64 * vs1.128[i].lo64 +
vs2.128[i].lo64 * vs1.128[i].hi64
Although again that is not in the current vector extension.

Exact (nonsaturating, nonlossy) integer/fixed point DSP, of
course, really wants 4X widening operations, such as
vd.32[i] += vs2.8[i] * vs1.8[i]
As well as mixed width
vd.32[i] += vs2.16[i] * vs1.8[i]
but these are not in the current vector proposal.
Being familiar with 4X widening operations like the above I
tried to use them for cryptography, but it's just plain more
efficient to use 2X widening, if you can arrange to get 56 bits
in each 64bit vector element efficiently enough.

These examples show how 2X widening multiply accumulate can be
used even without saturation or overflow flags.
However, if you only provided a saturating 2X widening
multiply accumulate, extended precision arithmetic could still use
the 56r64 (and 112r128) approach above and just back off a few
iterations before propagating carry.
From: Cds <cohen.steed@...> Sent: Friday, August 07, 2020 8:51AM To: TechVectorExt <techvectorext@...> Subject: [RISCV] [techvectorext] Integer Overflow/Saturation Operations
On 8/7/2020 8:51 AM, CDS wrote:
Vectorwidening multiply & accumulate instructions:
How else could these instructions be used practically? What is the expectation for utility when the operations overflow quickly? 
Sorry: Typos (SpeechOs?) Writing Errors <= Speech Recognition
<= Computeritis


CDS <cohen.steed@...>
Andy,
Thank you for your response. The concern I'm raising is less about "How do I avoid overflow?" and more about "Why are we avoiding the specification of saturating instructions, or an overflow flag?" I can always avoid overflow by using smaller numbers, or using a larger data path. Sometimes, I won't know that I've exceeded my data path limits. Sure, design for safe margins, yet we rely on hardware signaling to inform us when we've gone beyond a hard limit. Another way of stating the question: Why NOT have saturation or overflow for these instructions? The alternative is a silent data corruption.


Mikael
As Cohen responded. The examples provided only work, as you rightly noted, if you have implied guard bits in the system.
This is fine, with the caveat that the ISA definition doesn't provision for the users to have manually manage guard bits in their choice of math to fully utilize the vector engine. The open question is, what would a user of the vector ISA expect the results to be of int64 += int32 * int32? fixedPoint64 += fixedPoint32 * fixedPoint32? From our perspective we are looking at the ISA extension as defined. Which is without guard bits in the accumulation, and from this its easy to demonstrate that even for signed widening MAC operations overflow happens quite easily. I think we all agree on this point. I would also like to stress here that we are not trying to build an argument for added guard bits a special accumulation registers, as this would break the vector register model. Now the challenges we find is the only way out of this conundrum, as you point out, is to use the optional quad widening operation. As you point out, for DSP operations, integer/fixed point, this is an issue that is unresolved. Fundamentally this boils down to a realization that the vector ISA as defined is not well suited for DSP operations, unless float is used. Which is the point we are trying to get across. Saturation is a shortcut to at least detect when an overflow happens. This has one caveat, which is saturation in a DSP filter is a sign of issues. You really do not want the innards of a DSP filter to saturate. As currently defined. The vector ISA either mandates the use of the optional quad widening instruction for DSP math, which implies the DSP math is optional to begin with, or DSP math has to be performed in floating point, which defeats the purpose of supporting fixed point DSP math to begin with. There are alternatives involve utilizing block float style fixed point coding. This does introduce extra overhead for each fixed point math operation as well as a need to keep track of various decimal points throughout the signal path.

