[riscv/riscv-v-spec] For V1.0 - Make unsigned scalar integer in widening instructions 2 * SEW (#427) (and signed)
I posted a comment on the closed issue #427.
Not everyone subscribes to GitHub, so I am posting it below.
I am requesting this proposal be reconsidered/re-evaluated for V1.0 inclusion in light of the posting:
Some additional comments to the post.
An additional SEW bits would need to be distributed to the execution units.
Yes, this costs extra power, but only once: the scalar values remain resident through all successive iterations on different channels.
There is no additional distribution circuitry: the sew=XLEN case has to be wired in anyway and
is thus available for the sew=XLEN/2 case (which has an EEW of XLEN for rs1).
The additional power/complexity/transfer is self-limiting: once sew>=XLEN, no extra SEW bits are transferred.
It is not to save hardware (much can be reused), but to increase functionality.
We have instructions
# Widening unsigned integer add/subtract, 2*SEW = 2*SEW +/- SEW
vwaddu.wv vd, vs2, vs1, vm # vector-vector
vwaddu.wx vd, vs2, rs1, vm # vector-scalar
of the form:
VWADDU.WV: vd(at 2*sew)[i] := vs2(at 2*sew)[i] + zext((to 2*sew) vs1(at sew width)[i])
VWADDU.WX: vd(at 2*sew)[i] := vs2(at 2*sew)[i] + zext((to 2*sew) narrow((to sew bits) rs1))
The WX form would become:
VWADDU.WX: vd(at 2*sew)[i] := vs2(at 2*sew)[i] + narrow((to 2*sew bits) rs1)
It effectively becomes a 2*SEW scalar add and replaces the sequence:
vsetvli 0,0, sew2n
vsetvli 0,0, sewn
In general, using 2*SEW bits of rs1 allows a scalar input range commensurate with the target rather than the source of the vector widening operation.
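To make the difference concrete, here is a minimal Python sketch of the two WX definitions for the unsigned add. The function names and test values are illustrative only, and it assumes 2*SEW <= XLEN so that rs1 actually holds 2*SEW meaningful bits.

```python
def mask(bits):
    """All-ones mask of the given width."""
    return (1 << bits) - 1


def vwaddu_wx_current(vs2, rs1, sew):
    """Current form: rs1 is narrowed to SEW bits, then zero-extended to 2*SEW."""
    scalar = rs1 & mask(sew)
    return [(x + scalar) & mask(2 * sew) for x in vs2]


def vwaddu_wx_proposed(vs2, rs1, sew):
    """Proposed form: rs1 is narrowed to 2*SEW bits and used directly."""
    scalar = rs1 & mask(2 * sew)
    return [(x + scalar) & mask(2 * sew) for x in vs2]


# Illustrative values: vs2 elements are already 2*SEW = 16 bits wide.
vs2 = [0x0001, 0x00FF, 0xFF00]
rs1 = 0x1234
current = vwaddu_wx_current(vs2, rs1, sew=8)    # only 0x34 of rs1 is used
proposed = vwaddu_wx_proposed(vs2, rs1, sew=8)  # all of 0x1234 is used
print([hex(x) for x in current])
print([hex(x) for x in proposed])
```

With the current form the scalar range is limited to what fits in the source width. With the proposed form it covers the full destination width.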
Github Posting: https://github.com/riscv/riscv-v-spec/issues/427#issuecomment-666487664
Comments on resolution and request to re-frame
When SEW < XLEN, I noted that the double size multiplier arrays (conceptually)
This approach is quite appropriate for many micro-archs: those
uarchs that internally have SLEN=VLEN (= channel width), and of
these, especially those that are register-write limited.
This approach does not work well for SLEN<VLEN designs (perhaps
with multiple active channels) that might distribute both vector
sources from register groups to multiplier units, and double
width results to distant register ports. This is possibly further
complicated by renamed register segments.
There is further impetus to optimize the SEW=8 case. Both the vector x vector and vector x scalar forms are expected to be common use cases. Further, 8 bit is the extreme situation for the number of elements, source operand distribution and/or widened result distribution. Lastly, the 8x8 multiplier array is relatively small, so the investment in gates pays substantial dividends at the smaller bit sizes.
With the introduction of the 2 * SEW scalar, dynamically selecting between the two approaches would require reading the scalar value and determining whether its upper half (SEW bits) is all zeros (or all ones for signed), in which case the optimized approach could be used. If the high SEW bits were not just the sign, the fallback to the 2 * SEW multiplier approach would be used. This dynamic re-configuring was rightly trounced. Evaluating the high SEW bits would occur much too late in the process, introducing stalls or complex read-ahead X register circuitry that is not needed anywhere else and would likely impact cycle timing. Dead on arrival.
I believe this narrative correctly reflects the reasoning.
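For reference, the test that dynamic selection would have had to perform on the scalar can be sketched in Python as follows. The function name is hypothetical. For the signed case the sketch checks that the high SEW bits are a true sign extension of the low half, which is slightly stricter than "all ones".

```python
def fits_narrow_path(rs1, sew, signed):
    """True when the upper SEW bits of the 2*SEW scalar carry no information."""
    lo_mask = (1 << sew) - 1
    wide = rs1 & ((1 << (2 * sew)) - 1)   # the 2*SEW-bit scalar operand
    hi = wide >> sew                       # its upper SEW bits
    if not signed:
        return hi == 0                     # unsigned: upper half must be zero
    sign_bit = (wide >> (sew - 1)) & 1     # sign of the low SEW-bit half
    return hi == (lo_mask if sign_bit else 0)


print(fits_narrow_path(0x007F, 8, signed=False))
print(fits_narrow_path(0x0100, 8, signed=False))
print(fits_narrow_path(0xFF80, 8, signed=True))
```

The check itself is cheap. The objection recorded above is about when the scalar value becomes available, not about the cost of the comparison.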
I fully agree with the final conclusion that uniformity argues for handling all potential integer 2 * SEW cases the same way.
However, I believe I must take the blame for framing the issue as a duality: either leverage the next-level (2 * SEW) multiplier or optimize with a narrower widening multiplier circuit. And the latter would require a dynamic “macro” selection between them.
I did not present alternatives that change the narrative and basis for decision.
Firstly, a full double width multiplier is not necessary (but certainly sufficient) for the integer SEWx(2 * SEW) case. By definition, the high SEW bits of the vector operand are zero and do not participate in the (2 * SEW)x(2 * SEW) circuitry. Further, only SEW bits of the product of the vector with the high SEW bits of the scalar are retained, and thus need only be generated and summed.
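The arithmetic behind this can be checked with a small Python sketch for the unsigned case. The function names are hypothetical. The scalar is split into low and high SEW-bit halves, and only the low SEW bits of the vector-times-high-half product are kept.

```python
def widening_mul_full(v, s, sew):
    """Reference: SEW-bit v times 2*SEW-bit s, truncated to 2*SEW bits."""
    return (v * s) & ((1 << (2 * sew)) - 1)


def widening_mul_reduced(v, s, sew):
    """Same result using one SEW x SEW widening product plus SEW retained bits."""
    lo_mask = (1 << sew) - 1
    s_lo = s & lo_mask
    s_hi = (s >> sew) & lo_mask
    low_product = v * s_lo                  # full 2*SEW-bit partial product
    high_product = (v * s_hi) & lo_mask     # only the low SEW bits are retained
    return (low_product + (high_product << sew)) & ((1 << (2 * sew)) - 1)


# Exhaustive comparison for SEW=4: 16 vector values x 256 scalar values.
mismatches = [
    (v, s)
    for v in range(1 << 4)
    for s in range(1 << 8)
    if widening_mul_full(v, s, 4) != widening_mul_reduced(v, s, 4)
]
print(len(mismatches))
```

The two agree because v * s = v * s_lo + (v * s_hi) * 2^SEW, and any bits of v * s_hi at or above position SEW fall outside the 2*SEW-bit result.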
Especially when ELEN > XLEN, but even at lower SEW, the widening multiply (and even the non-widening one) will likely be implemented as temporal iterations of a sum of partial products; in some cases this will be driven by the desire to keep cycle time constrained. This temporal circuitry could be used to conditionally sum the product of the high SEW scalar bits and the SEW-wide vector element onto the narrow multiply. Thus zero/sign high bits of the scalar are not a selection between LARGE and narrow but rather an optimization of the narrow process.
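As a rough behavioural model only (not a description of any real datapath), the Python sketch below treats the multiply as a bit-serial shift-and-add over the scalar, unsigned case only. It always runs the SEW narrow steps and continues only while high scalar bits remain. The function name and step counting are illustrative assumptions.

```python
def shift_add_widening_mul(v, s, sew):
    """Return (product mod 2**(2*sew), number of partial-product steps used)."""
    out_mask = (1 << (2 * sew)) - 1
    v &= (1 << sew) - 1        # vector element is SEW bits wide
    s &= out_mask              # scalar is 2*SEW bits wide
    acc = 0
    steps = 0
    # Narrow multiply: one step per low scalar bit, always performed.
    for i in range(sew):
        if (s >> i) & 1:
            acc += v << i
        steps += 1
    # Conditional extension: extra steps only while high scalar bits remain.
    hi = s >> sew
    i = sew
    while hi:
        if hi & 1:
            acc += v << i
        hi >>= 1
        i += 1
        steps += 1
    return acc & out_mask, steps


print(shift_add_widening_mul(200, 0x0003, 8))   # high scalar bits are zero
print(shift_add_widening_mul(200, 0x0103, 8))   # one extra step for bit 8
```

In this model a scalar with zero high bits costs exactly the narrow multiply, and a wider scalar extends the same process rather than switching to a different unit.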
The optimization of narrow multiplies can be incorporated
independently for various sizes of SEW. The cost to add the
upper X-register SEW bits is nominal at 8 bits and still small at
16 bits. For RV32 these are the only two integer widening sizes of
concern for a 2 * SEW scalar. For RV64 the only other integer
widening size with a 2 * SEW scalar is 32 bits.
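The SEW values affected on each base can be enumerated with a short Python sketch. It assumes the x register contributes at most XLEN meaningful bits, so the extra scalar bits are the difference between what the proposed and current forms can take from rs1. The function name is hypothetical.

```python
def extra_scalar_bits(sew, xlen):
    """Extra rs1 bits consumed by a 2*SEW scalar compared with a SEW scalar."""
    current = min(sew, xlen)        # bits of rs1 used today
    proposed = min(2 * sew, xlen)   # bits of rs1 used under the proposal
    return proposed - current


rv32 = {sew: extra_scalar_bits(sew, 32) for sew in (8, 16, 32)}
rv64 = {sew: extra_scalar_bits(sew, 64) for sew in (8, 16, 32, 64)}
print(rv32)
print(rv64)
```

A zero entry means the proposal changes nothing at that SEW, which is the self-limiting behaviour noted earlier for sew>=XLEN.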
Re-framing the proposal in these terms changes the question from a dichotomy to a continuum of design options that can be implemented effectively (and as efficiently as possible) on simple uarch designs without hobbling performant designs.
The question then becomes one of worth for complexity at V1.0.
In this context I believe it is worthy, especially as Krste remarked for the expanded multiple.