[riscv/riscv-v-spec] For V1.0 - Make unsigned scalar integer in widening instructions 2 * SEW (#427) (and signed)

David Horner

I posted a comment to the closed #427.
Not everyone subscribes to GitHub, so I repost it below.

I am requesting that this proposal be reconsidered and re-evaluated for V1.0 inclusion in light of the posting:

Some additional comments to the post.

Increased overhead.

An extra SEW bits must be distributed to the execution units,
 which on a large VLEN machine could be numerous and physically dispersed on the chip.
More lines to toggle.

Yes, there is extra power, but it is paid only once: the scalar value remains resident through all successive iterations on different channels.

There is no additional distribution circuitry: the sew=XLEN case will have to be wired in anyway, and
    is thus available for the sew=XLEN/2 case (which has an EEW of XLEN for rs1).

The additional power/complexity/transfer is self-limiting: once sew>=XLEN, no extra SEW bits are transferred.

Potential Usage:

The goal is not to save hardware (much can be reused) but to increase functionality.
We have the instructions

# Widening unsigned integer add/subtract, 2*SEW = 2*SEW +/- SEW
vwaddu.wv  vd, vs2,  vs1, vm # vector-vector
vwaddu.wx  vd, vs2,  rs1, vm # vector-scalar

of the form:

VWADDU.WV:    vd(at 2*sew)[i] := vs2(at 2*sew)[i] + zext((to 2*sew) vs1(at sew)[i])
VWADDU.WX:    vd(at 2*sew)[i] := vs2(at 2*sew)[i] + zext((to 2*sew) narrow((to sew bits) rs1))
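The current .wx semantics can be modeled directly. This is a minimal sketch (the function name and Python framing are illustrative, not spec pseudocode): rs1 is first narrowed to SEW bits and then zero-extended, so its useful range is bounded by the source width, not the destination width.

```python
# Model of the *current* vwaddu.wx semantics: narrow rs1 to SEW bits,
# then zero-extend to 2*SEW (the zext is implicit in Python integers).
def vwaddu_wx_current(vs2, rs1, sew):
    mask2 = (1 << (2 * sew)) - 1
    rs1_narrow = rs1 & ((1 << sew) - 1)   # narrow((to sew bits) rs1)
    return [(e + rs1_narrow) & mask2 for e in vs2]

# With SEW=8, a scalar of 0x1FF loses its high bit before the add.
print(vwaddu_wx_current([0x0100], 0x1FF, 8))  # → [511], i.e. 0x100 + 0xFF
```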

the WX form would become:

VWADDU.WX:    vd(at 2*sew)[i] := vs2(at 2*sew)[i] + narrow((to 2*sew bits) rs1)

It effectively becomes a 2*sew add-scalar and replaces a sequence of the form:

vsetvli 0,0, sew2n
vadd.vx vd, vs2, rs1
vsetvli 0,0, sewn

In general, using 2*sew bits of rs1 allows a scalar input range commensurate with the target rather than the source of the vector widening operation.
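The proposed semantics can be sketched the same way (again, the function name is illustrative): rs1 is taken at 2*SEW bits, so the scalar spans the full destination range, matching what the replaced vsetvli sequence would compute.

```python
# Model of the *proposed* vwaddu.wx semantics: rs1 is narrowed to
# 2*SEW bits, i.e. to the width of the destination elements.
def vwaddu_wx_proposed(vs2, rs1, sew):
    mask2 = (1 << (2 * sew)) - 1
    return [(e + (rs1 & mask2)) & mask2 for e in vs2]

# With SEW=8 the scalar may now span the full 16-bit destination range.
print(vwaddu_wx_proposed([0x0100], 0x1FF, 8))  # → [767], i.e. 0x2FF
```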

Github Posting: https://github.com/riscv/riscv-v-spec/issues/427#issuecomment-666487664

Comments on resolution and request to re-frame

We discussed the proposal to widen the scalar input of widening operations to
2 * SEW.

For widening multiplies, this would double the size of
multiplier arrays required.

When SEW < XLEN, I noted that the double-size multiplier arrays (conceptually)

  1. already exist for the next (SEW) level up non-widening multiplies, and
  2. the widened output would be compatible with that next-level-up multiply.
    Note as well that only the vector operand needs to be distributed to the appropriate wider multiplier, as the scalar value is “constant” across all multiply operations.
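The reuse claim in point 1 can be checked directly. This sketch (names are illustrative) shows that a (2*SEW)x(2*SEW) array, the one already needed for the next-level-up non-widening multiply, computes the SEW x (2*SEW) widening product: the zero-extended vector element simply feeds zeros into the high half of one input.

```python
# The 2*SEW x 2*SEW array produces a full 4*SEW product; the widening
# instruction keeps the low 2*SEW bits (the destination element width).
def widen_mul_on_next_level(v_elem, scalar, sew):
    assert v_elem < (1 << sew)        # vector element is SEW bits (zext'd)
    assert scalar < (1 << (2 * sew))  # scalar is 2*SEW bits
    return (v_elem * scalar) & ((1 << (2 * sew)) - 1)

print(widen_mul_on_next_level(0xAB, 0x1234, 8))  # → 10428, i.e. 0x28BC
```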

This approach is quite appropriate for many microarchitectures: those that internally have SLEN=VLEN (= channel width), and of these, especially those that are register-write limited.
In these microarchitectures, VLEN register 0 would be written in one cycle (or process set), with register 1 in the next (and, if LMUL>1, registers 2, 3, etc. in each subsequent cycle or process set). The throughput would be 1/2 that of discrete multiplier units per vector operand, but as the register write port would be saturated there is no actual loss.

This approach does not work well for SLEN<VLEN designs (perhaps with multiple active channels) that might distribute both vector sources from register groups to multiplier units, and double-width results to distant register ports, possibly further complicated by renamed register segments.
These microarchitectures would rather have dedicated SEWxSEW multiply units (potentially sharing segments of the same multipliers with the next (2 * SEW) level up), extended to provide a double-width result.
The benefit of such a configuration is full hardware throughput, tailored to the “normal” vector register file read-port rate. Because a channel-width (likely SLEN) slice would be generating double-width elements in the same physical register (though potentially to renamed segments of the register), the advantage seen in the simpler SLEN=VLEN design (of consecutively writing full VLEN registers) is not present.

There is further impetus to optimize the SEW=8 case. Both the vector x vector and vector x scalar forms are expected to be common use cases. Further, 8-bit is the extreme situation for number of elements, source-operand distribution, and/or widened-result distribution. And lastly, the 8x8 multiplier array is relatively small, so the investment in gates pays substantial dividends at the smaller bit sizes.

The group discussed using a microarchitectural check on scalar width to select a narrower
multiplier, but ... [keep in mind that this dynamic selection is for SLEN<VLEN type microarchitectures]

With the scalar 2 * SEW introduction, dynamically selecting between the two approaches would require reading the scalar value and determining whether its upper half (SEW bits) is all zeros (or all copies of the sign for signed), in which case the optimized approach could be used. If the high SEW bits were not just the sign, the fallback of using the 2 * SEW multipliers would apply. This dynamic re-configuring was rightly trounced. Evaluating the high SEW bits would occur much too late in the process, introducing stalls or complex read-ahead X-register circuitry that is not needed anywhere else and would likely impact cycle timing. Dead on arrival.
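For concreteness, the rejected check itself is simple; the objection is *when* it can run, not its logic. A sketch (illustrative names, not spec pseudocode):

```python
# Would the 2*SEW scalar fit in SEW bits, allowing the narrow multiplier?
# Unsigned: upper SEW bits must be all zero.
# Signed: upper SEW bits must all copy the sign bit of the low half.
def scalar_fits_in_sew(rs1, sew, signed):
    hi = (rs1 >> sew) & ((1 << sew) - 1)   # upper SEW bits of 2*SEW scalar
    if signed:
        sign = (rs1 >> (sew - 1)) & 1
        return hi == (((1 << sew) - 1) if sign else 0)
    return hi == 0

print(scalar_fits_in_sew(0x00FF, 8, False))  # → True: narrow path usable
print(scalar_fits_in_sew(0x01FF, 8, False))  # → False: needs 2*SEW path
```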

Group consensus was that this information should be
supplied through a different opcode.

Given that multiplies would
provide the larger benefit, and that adds would then have a
non-uniform format, the decision was made to stay with the PoR.

I believe this narrative correctly reflects the reasoning.

I fully agree with the final conclusion: uniformity argues for handling all potential integer 2 * SEW cases the same way.

However, I believe I must take blame for framing the issue as a duality: either leverage the next-level (2 * SEW) multiplier, or optimize with a narrower widening-multiplier circuit. And the latter would require a dynamic “macro” selection between them.

I did not present alternatives that change the narrative and basis for decision.

Firstly, a full double-width multiplier is not necessary (though certainly sufficient) for the integer SEWx(2 * SEW) case. By definition, the high SEW bits of the vector operand are zero and do not participate in the (2 * SEW)x(2 * SEW) circuitry. Further, only SEW bits of the product of the vector with the high SEW bits of the scalar are retained, and thus only those need be generated and summed.
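This decomposition can be verified arithmetically. Splitting the 2*SEW scalar into SEW-bit halves hi:lo, the SEW x SEW product v*lo is needed in full, but only the low SEW bits of v*hi reach the 2*SEW result. A sketch under that framing (illustrative names):

```python
# Decomposed SEW x (2*SEW) widening multiply: one full SEW x SEW
# product plus SEW retained bits of a second SEW x SEW product,
# shifted into the high half of the 2*SEW result.
def widen_mul_decomposed(v, rs1, sew):
    mask = (1 << sew) - 1
    lo, hi = rs1 & mask, (rs1 >> sew) & mask
    full = v * lo                  # full SEW x SEW product (2*SEW bits)
    partial = (v * hi) & mask      # only SEW bits of v*hi are retained
    return (full + (partial << sew)) & ((1 << (2 * sew)) - 1)

# Agrees with the direct product truncated to 2*SEW bits:
v, rs1, sew = 0xAB, 0x1234, 8
print(widen_mul_decomposed(v, rs1, sew) == (v * rs1) & 0xFFFF)  # → True
```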

Especially when ELEN > XLEN, but even at lower SEW, the widening multiply (and even the non-widening one) will likely be implemented as temporal iterations of sums of partial products; in some cases this will be driven by the desire to keep cycle time constrained. This temporal circuitry could be utilized for conditionally summing the high-SEW scalar bits with the SEW-wide vector in the narrow multiply. Thus zero/sign high bits of the scalar are not a selection between LARGE and narrow, but rather an optimization of the narrow process.

The optimization of narrow multiplies can be incorporated independently for the various sizes of SEW. The cost of adding the upper X-register SEW bits is nominal at 8 bits and still small at 16 bits. For RV32 these are the only two integer widenings of concern for a 2 * SEW scalar. For RV64 the only other integer widening with a 2 * SEW scalar is 32-bit.
A tradeoff between

  1. half throughput (use the next-level-up full multiplier), or
  2. (as above) conditional temporal summing, or
  3. parallel partial-product generation and fast-sum hardware
    can be chosen independently of the upper 32 bits of the X register.

Re-framing the proposal in these terms changes the question from a dichotomy to a continuum of design options that can be implemented effectively (and as efficiently as possible) on simple microarchitecture designs without hobbling performant designs.

The question then becomes one of worth for complexity at V1.0.

In this context I believe it is worthy, especially as Krste remarked for the expanded multiply.
