[riscv/riscv-v-spec] For V1.0 - Make unsigned scalar integer in widening instructions 2 * SEW (#427) (and signed)


David Horner
 

This was on GitHub; since not everyone subscribes and it will be considered at the TG, I include it on this list.

First Krste’s synopsis, then my (modified) GitHub reply, then my thoughts for the TG, and lastly the original post for reference.

kasanovic commented

Concretely, this proposal is to change the widening integer vector-scalar operations to treat their x register input (rs1) as 2*SEW not SEW, first for unsigned, then possibly also for signed.
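
A minimal Python sketch of the difference, using vwaddu.vx as the example. The function name and the `wide_scalar` flag are illustrative assumptions, not spec text; the model only captures the scalar-width treatment described above (rs1 taken as SEW bits today, 2*SEW bits under the proposal, never more than XLEN).

```python
def vwaddu_vx(vs2, rs1, sew, xlen=64, wide_scalar=False):
    """Model an unsigned widening vector-scalar add (hypothetical helper).

    wide_scalar=False: rs1 is truncated to SEW bits (current behaviour).
    wide_scalar=True:  rs1 is truncated to 2*SEW bits (the proposal),
                       limited by XLEN.
    """
    scalar_bits = min(2 * sew if wide_scalar else sew, xlen)
    scalar = rs1 & ((1 << scalar_bits) - 1)
    elem_mask = (1 << sew) - 1            # source elements are SEW wide
    result_mask = (1 << (2 * sew)) - 1    # destination elements are 2*SEW wide
    return [((v & elem_mask) + scalar) & result_mask for v in vs2]


current = vwaddu_vx([1, 255], 0x1234, sew=8)                     # rs1 seen as 0x34
proposed = vwaddu_vx([1, 255], 0x1234, sew=8, wide_scalar=True)  # rs1 seen as 0x1234
```

With SEW=8, the proposed form can add a 16-bit bias in one instruction, whereas the current form only sees the low byte of rs1.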

I think the biggest benefit would be in the widening multiply instructions, but that also implies extra hardware (larger multiplier array) over what is needed for current case. I can see there is also some benefit to short computations into a wider accumulator with a large initial value, but also see that some cases will need more scalar instructions. There is non-zero hardware cost even just for adds to handle the new case, with possibly some software overhead in other cases (e.g., when using an XLEN-wide scalar load to bring in packed 8b elements into one register, then feeding these one at a time into a widening operation - the current spec provides the mask). Also, this would be a large change at a late stage in the proposal process.

I will bring up again in next TG meeting.

David-Horner commented

Thank you very much for the synopsis.

The x input is naturally limited by XLEN. Once 2 * SEW exceeds XLEN, zero extension (or sign extension) takes over, so the maximum precision is XLEN bits.

the biggest benefit would be in the widening multiply instructions

As Krste alludes, this form of the multiply is not otherwise available.

For widening ops (including multiply), the PoR is that EEW has to be a supported SEW width.
As a result, the multiply and add circuitry is already in the design, since the PoR is also that SEW <= XLEN for V.
E.g. RVV64: vmulu.vx at SEW=64 has sufficient multiplier circuitry for the expanded vwmulu.vx at SEW=32.


Simple micro-architectures are likely to leverage the same circuitry for vmulu as for vwmulu (and possibly vadd circuitry for vwadd, but the cost to duplicate that is much less).
High-end performance machines may duplicate circuitry in different execution units, and so their widening units may pay a price for the increased functionality/performance.



feeding these one at a time into a widening operation - the current spec provides the mask

The original proposal discusses this - a single andi is sufficient for a byte.

<digression detailing scalar affects vector>
The "feeding one at a time" would likely entail a shift, so an andi for each of them
(except the last, if the shift is logical).
Simplistically, this would be mixed vector/scalar code, with scalar feeding vector -
 with possible interlock delays between the scalar and vector processors.
Alternatively, four x registers could each have their byte values established before
 the set of vector instructions that consume the values.
This could relax the interface constraints.
But again, a single andi for each byte is not arduous.
</digression detailing scalar affects vector>
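
The shift-and-mask sequence in the digression can be sketched in Python as follows. This is only an illustration of the scalar side; `unpack_bytes` is a hypothetical helper, and the comments map each step to the andi / srli it stands in for.

```python
def unpack_bytes(packed, count=4):
    """Split a packed register value into bytes, low byte first."""
    out = []
    for _ in range(count):
        out.append(packed & 0xFF)  # andi t, x, 0xFF
        packed >>= 8               # srli x, x, 8 (logical shift)
    return out


word = 0x44332211
values = unpack_bytes(word)

# After three logical shifts of a 32-bit value the remaining byte already
# has zeros above it, so the last element needs no andi.
last_without_mask = word >> 24
```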
this would be a large change at a late stage in the proposal process.

Agreed. However, it cannot be retrofitted. The benefits are there, the timing is not excellent.


Further Comments.



The “feeding these one at a time into a widening operation” application highlights the versatility of this format: 16-bit values are directly masked, but 9- and 10-bit values (packed three to an RV32 register) and 11- to 15-bit values are also available for shifting into a 16-bit op.



On reflection, I now advocate for the signed variants: The same 10 instruction variants except signed vs “u”.

As I mentioned, these are potentially more valuable, both because the signed adds especially are more heavily used and because offsetting bias values are more naturally expressed as negative values.

And finally, because my current opinion is that the scalar overhead was weighted too heavily.

The reduction of x to the current SEW to allow PoR emulation is less relevant than what use is made of the extended functionality. Compiler optimizations can compensate when only SEW-width values are required. And when the compiler chooses shift sign extension over code restructuring (for example, negating the intermediate value), it would be as a result of trade-offs (e.g. on machines where sll;sra are fused).
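
For reference, the sll;sra sign extension mentioned above behaves like the following Python sketch. `sign_extend_shift` is a hypothetical name; the body mimics a left shift by XLEN - SEW followed by an arithmetic right shift of the same amount.

```python
def sign_extend_shift(x, sew, xlen=64):
    """Sign-extend the low `sew` bits of x, as slli followed by srai would."""
    shift = xlen - sew
    shifted = (x << shift) & ((1 << xlen) - 1)  # slli: keep only XLEN bits
    if shifted >> (xlen - 1):                   # reinterpret as signed XLEN-bit
        shifted -= 1 << xlen
    return shifted >> shift                     # srai: Python >> is arithmetic


low_byte_negative = sign_extend_shift(0xFF, sew=8)
low_byte_positive = sign_extend_shift(0x7F, sew=8)
```

This is the scalar cost that a signed 2 * SEW rs1 would sometimes add when only SEW-width signed values are wanted.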

My opinion is: had this formulation been proposed and considered earlier in the vector widening instruction set evolution, it would now be included in the base. It has substantive value at low cost.



Original proposal: David-Horner commented

This suggestion weighs the benefits of the increased scalar range of a 2 * SEW unsigned rs1 (X register input) against

  • possible requirement in RV code to clear upper SEW bits

  • gate changes to support 2 * SEW input from previous SEW length

First the increased scalar range.

a) Biasing, preparing and tailoring wide accumulator values. Not all accumulations will start from zero, or SEW size values.
Especially when SEW is 8 or 16 the range of values to prime an accumulator is insufficient in many use cases. #287 opened by Zpedro would avoid a SEW/LMUL switch to condition its accumulators with this change.

b) Extended operating range for foundation operation.
E.g. at SEW=8, the double-widening multiply allows an rs1 X[14-n:0] bits * vs2[n:0] bits operation without overflow (and the quad-widening vqmaccsu.vx will allow the full range of 2 * SEW rs1 by SEW vs2).

c) Such enhanced operations are available in code flow without changing SEW.
Enhanced operations would be:
vwaddu.vx
vwaddu.wx
vwsubu.vx
vwsubu.wx
vwmulu.vx
vwmulsu.vx
vwmaccu.vx
vwmaccsu.vx (or vwmaccus.vx see #426)
vqmaccu.vx
vqmaccsu.vx (or vqmaccus.vx)

Note: once SEW reaches XLEN there is no benefit for this enhancement.
Nevertheless, three enhanced modes (at SEW=8, 16 and 32) for RVV64 are substantive.
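
The no-overflow claim in item b) can be checked with a few lines of Python. This is a sketch of the arithmetic only; `product_fits` is a hypothetical helper that compares the largest unsigned product against the destination width.

```python
def product_fits(rs1_bits, vs2_bits, result_bits):
    """True if the largest unsigned rs1 * vs2 product fits in result_bits."""
    rs1_max = (1 << rs1_bits) - 1
    vs2_max = (1 << vs2_bits) - 1
    return rs1_max * vs2_max < (1 << result_bits)


# SEW=8, double widening: X[14-n:0] is 15-n bits, vs2[n:0] is n+1 bits,
# and the destination is 2*SEW = 16 bits.
double_widening_ok = all(product_fits(15 - n, n + 1, 16) for n in range(8))

# SEW=8, quad widening: full 16-bit rs1 by 8-bit vs2 into a 32-bit destination.
quad_widening_ok = product_fits(16, 8, 32)
```

A full 16-bit rs1 times a full 8-bit vs2 does not fit in 16 bits, which is why the double-widening case is limited to a combined 16 bits of operand width.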

Second, the hardship of conditioning X register values for 2 * SEW.
There may be an additional RV base instruction (and X register) needed to clear the top bits of the X register to simulate current behaviour.
For SEW=8, andi x’FF’ would suffice.
For SEW=16, andi x’3FF’ may suffice, but a lui; andi combo always will.
For SEW=32, lui; andi is required.
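
The reason a single andi covers SEW=8 but not the full SEW=16 or SEW=32 masks is that andi takes a 12-bit sign-extended immediate. A small Python sketch of that constraint follows; `andi_can_mask` is a hypothetical helper, and XLEN=64 is assumed.

```python
def andi_can_mask(mask, xlen=64):
    """True if `mask` equals some 12-bit immediate sign-extended to XLEN."""
    all_ones = (1 << xlen) - 1
    return any((imm & all_ones) == mask for imm in range(-2048, 2048))


byte_mask_ok = andi_can_mask(0xFF)         # SEW=8 mask
ten_bit_mask_ok = andi_can_mask(0x3FF)     # the x'3FF' case above
half_mask_ok = andi_can_mask(0xFFFF)       # full SEW=16 mask
word_mask_ok = andi_can_mask(0xFFFFFFFF)   # full SEW=32 mask
```

Masks that do not fit must first be built in a register (or replaced by a shift pair), which is the extra scalar work being weighed here.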

In other areas, it appears the general feeling is that trading a few RV base operations for enhanced RVV functionality is a good tradeoff.

Third, how disruptive is this change to current micro-architecture designs?
Given that vwop.wv operands already provide a 2 * SEW input along with a SEW input, the circuitry is already there.
The opcode decoding change is also minimal.
The gating of the extended X range is present in current designs to support SEW=XLEN.
The multiply ops are probably the most affected, and then only for minimal implementations using multi-step calculations.

Overall I see this as a win with little hardship. However, I definitely need to have hardware gurus’ input.

What about 2 * SEW for signed rs1 input?

The trade-offs for scalar code are more substantial at SEW=8 and 16. RV64 has addw to sign extend at SEW = 32. Thus sign extension in RVV widening is more valuable for avoiding the sll;sra combination.
I don’t plan to advocate 2 * SEW for signed rs1, but I don’t think I can reasonably oppose it, as the trade-offs are close.