[riscv/riscv-v-spec] For V1.0 - Make unsigned scalar integer in widening instructions 2 * SEW (#427) (and signed)


mark
 

great!

again this is meant as informational for when this goes to vote.

this should be discussable now in email with questions and comments.

please include this in the ratification materials (place it in a GitHub subfolder labeled change-rationales).

this is a continual improvement process. please send Stephano and me email on how to improve the resulting content or the process at any time.

Thank you!
Mark

On Thu, Aug 6, 2020 at 9:23 PM David Horner <ds2horner@...> wrote:
I filled out the RISC-V Policy: Change and Extension Rationale
as best I could for the issue #427. I believe it is accessible by all. But I will also paste the contents below.

https://lists.riscv.org/g/tech-vector-ext/files/Change%20Extension%20Rationale%20Submission%20For%20riscv-v-spec%20issue%20%23427.docx

Name: Change and Extension Rationale Submission for riscv-v-spec issue #427


  1. David Horner

  2. In GitHub riscv-v-spec issue #427 originally April 21, 2020;

    Closed: July 24; reconsideration July 30;

    As Change Rationale Aug. 6, 2020

  3. Individual as member of Vector TG.

  4. August 2020, prior to V1.0 submission for ratification

  5. The Kickoff and/or Freeze Milestones.

    It does not need Roadmap visibility.

    It is a refinement to a set of Integer Widening instructions.

  6. List of questions please explain your answers where appropriate (like why did you say yes):

    1. Not a functionality gap? Rather it is an apparent formulation that can improve application performance by avoiding vtype mode shifts.

    2. A horizontal attribute enhancement affecting performance.

      Twice the standard integer scalar range is available for widening integer instructions.

    3. No change to ratified ISA specification, Vector extension in progress.

    4. This request is for a completely new rendering of proposed Vector features.

    5. This can be done with already proposed instructions?

      In general it requires:

      i) executing the current widening with integer identity value

      (1 for multiply, zero otherwise)

      ii) mode switch to twice current Selected Element Width (SEW)

      iii) perform corresponding adjustment on step i) widened vector results

      (multiply by or add/subtract widened integer value, as appropriate)

      iv) mode switch to original SEW.

    6. Users/markets which benefit are restricted to V users

      in which 2*SEW integer values are handled in widening scalar ops.

    7. Not expected to affect base or derived or custom profiles?

    8. Compliance tests and compiler generation will need to handle an enlarged integer scalar register.

      1. No changes in the number of cycles needed for any handler entry and exit, and none in the number of save/restores required.

      2. Changes required to support this extension are typical of other vector instruction tweaks.

      3. No known resources have time to implement either or both of the above.

    9. I expect the impact on logic/gates to be small. Less invasive than ordinal-based mask encoding. Much less disruptive than removing SLEN visibility. More comparable to the mixed-width vrgatherei16 instruction that is being added.

    10. It would not be optional.

    11. It is no more discoverable than any of the other base vector instructions.

    12. Concerns for widening multiply were the problem of leveraging the multiply units needed for the next higher SEW for the current SEW. Concern is that the SEW level multiply unit will have to be enlarged. Initial estimates were by a factor of 2.

      Given that the multiplication result is widening to 2*SEW, some of the needed circuitry is already present for an expanded integer input. The expanded multiplication result will be truncated to 2*SEW, and so, for these two reasons, doubling of the circuit is not required. As a result, a mitigation that dynamically selects paths based on zero (or sign-extended) upper SEW integer bits is not required. Such a scheme was correctly rejected as inappropriate for most implementations, but it does not materially factor into the discussion, as partitioning the next higher multiplier circuitry should be adequate for all anticipated implementations.
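
The four-step sequence in answer 5 can be modelled in a few lines for the multiply case. This is only a behavioural sketch for unsigned values, not spec pseudocode: the function names are made up for illustration, "mode switch" is modelled simply as changing the mask width, and the single-instruction form assumes the proposed 2*SEW scalar with the result truncated to 2*SEW.

```python
def workaround_mul(vs2, scalar, sew):
    """Steps i) to iv): widen with the identity value, then multiply at 2*SEW."""
    narrow_mask = (1 << sew) - 1
    wide_mask = (1 << (2 * sew)) - 1
    # i) widening multiply by the identity value 1: SEW elements become 2*SEW elements
    widened = [(x & narrow_mask) * 1 for x in vs2]
    # ii) mode switch to 2*SEW (modelled here by using wide_mask from now on)
    # iii) non-widening multiply by the 2*SEW scalar, truncated to 2*SEW
    result = [(w * (scalar & wide_mask)) & wide_mask for w in widened]
    # iv) mode switch back to the original SEW (no effect on the values in this model)
    return result


def proposed_mul(vs2, scalar, sew):
    """Hypothetical single widening multiply that reads 2*SEW bits of the scalar."""
    narrow_mask = (1 << sew) - 1
    wide_mask = (1 << (2 * sew)) - 1
    return [((x & narrow_mask) * (scalar & wide_mask)) & wide_mask for x in vs2]


if __name__ == "__main__":
    elements = [0, 1, 200, 255]
    print(workaround_mul(elements, 0x1234, 8) == proposed_mul(elements, 0x1234, 8))
```

Both functions return the same 2*SEW values; the difference in hardware terms is the two mode switches and the extra vector instruction that the workaround needs.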





--
Mark I Himelstein
CTO RISC-V International
+1-408-250-6611
twitter @mark_riscv


David Horner
 



I posted a comment to the closed #427
Not everyone subscribes to GitHub, so I post it below,

I am requesting  this proposal be reconsidered/re-evaluated for V1.0 inclusion in light of the posting:

Some additional comments to the post.

Increased overhead.

An additional SEW bits of the scalar must be distributed to the execution units,
 which on a large VLEN machine could be multiple and physically dispersed on the chip.
More lines to toggle.

Yes, there is extra power, but only once: the scalar values remain resident through all successive iterations on different channels.

There is no additional distribution circuitry: the sew=XLEN case will have to be wired in and
    is thus available for the sew=XLEN/2 case (which has EEW of XLEN for the rs1).

The additional power/complexity/transfer is self-limiting: once sew>=XLEN no extra SEW bits are transferred.


Potential Usage:

It is not to save hardware (much can be reused), but to increase functionality.
We have instructions

# Widening unsigned integer add/subtract, 2*SEW = 2*SEW +/- SEW
vwaddu.wv  vd, vs2,  vs1, vm # vector-vector
vwaddu.wx  vd, vs2,  rs1, vm # vector-scalar

of the form:

VWADDU.WV:    vd(at 2*sew)[i] := vs2.(at 2*sew) [i] + zext((to 2*sew) vs1(at sew width) [i])
VWADDU.WX:    vd(at 2*sew)[i] := vs2.(at 2*sew) [i] + zext((to 2*sew) narrow((to sew bits) rs1))


the WX form would become:

VWADDU.WX:    vd(at 2*sew)[i] := vs2.(at 2*sew) [i] + narrow((to 2*sew bits) rs1)

It effectively becomes a 2sew add scalar and replaces the sequence:

vsetvli 0,0, sew2n
VADDU.VX
vsetvli 0,0, sewn

In general using 2sew of rs1 allows a scalar input range commensurate with the target rather than the source in the vector widening operation.
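
To make the difference concrete, here is a small behavioural model of the .WX form as it stands and as proposed. It is a sketch only: the function names and the `xlen` parameter are mine, and capping the scalar width at XLEN is my reading of the "self limiting" remark above rather than agreed spec text.

```python
def vwaddu_wx_current(vs2, rs1, sew):
    """Current form: rs1 is narrowed to SEW bits, then zero-extended to 2*SEW."""
    narrow_mask = (1 << sew) - 1
    wide_mask = (1 << (2 * sew)) - 1
    scalar = rs1 & narrow_mask
    return [((x & wide_mask) + scalar) & wide_mask for x in vs2]


def vwaddu_wx_proposed(vs2, rs1, sew, xlen=64):
    """Proposed form: rs1 supplies up to 2*SEW bits (assumed capped at XLEN)."""
    wide_mask = (1 << (2 * sew)) - 1
    scalar_bits = min(2 * sew, xlen)
    scalar = rs1 & ((1 << scalar_bits) - 1)
    return [((x & wide_mask) + scalar) & wide_mask for x in vs2]


if __name__ == "__main__":
    vs2 = [0x0100, 0xFFFF]  # elements already at 2*SEW = 16 bits
    print([hex(v) for v in vwaddu_wx_current(vs2, 0x1234, 8)])   # ['0x134', '0x33']
    print([hex(v) for v in vwaddu_wx_proposed(vs2, 0x1234, 8)])  # ['0x1334', '0x1233']
```

With sew=8 the current form only ever sees 0x34 of the scalar, while the proposed form adds the full 0x1234 without leaving the sew=8 mode.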


Github Posting: https://github.com/riscv/riscv-v-spec/issues/427#issuecomment-666487664

Comments on resolution and request to re-frame

We discussed proposal to widen scalar input to widening operations to
2 * SEW.

For widening multiplies, this would double the size of
multiplier arrays required.

When SEW < XLEN, I noted that the double size multiplier arrays (conceptually)

  1. already exist for the next (SEW) level up non-widening multiplies, and
  2. the widened output would be compatible with that next up multiply.
    Note as well that only the vector operand needs to be distributed to the appropriate wider multiplier, as the scalar value is “constant” across all multiply operations.

This approach is quite appropriate for many micro-archs: those Uarchs that internally have SLEN=VLEN (= channel width), and of these, especially those that are register write limited.
In these Uarchs the VLEN register 0 would be written in one cycle (or process set), with register 1 in the next (and, if LMUL>1, registers 2, 3, etc. with each subsequent cycle or process set). The throughput would be 1/2 that of discrete multiplier units per vector operand, but as the register write would be saturated there is no actual loss.

This approach does not work well for SLEN<VLEN (and perhaps multiple active channels) that might distribute both vector source from register groups to multiplier units, and double width results to distant register ports. Possibly further complicated by renamed register segments.
These Uarchs would rather have dedicated SEWxSEW multiply units (potentially sharing segments of the same multipliers for the next (2 * SEW) level up), extended to provide a double-width result.
The benefit of such a configuration is full hardware throughput, tailored to the “normal” vector register file read port rate. Because a channel-width (likely SLEN) slice would be generating double-width elements in the same physical register (but potentially to renamed segments of the register), the advantage seen in the simpler SLEN=VLEN design (of consecutively writing full VLEN registers) is not present.

There is further impetus to optimize the SEW=8 case. Both the vector x vector and vector x scalar forms are expected to be common use cases. Further, 8 bit is the extreme situation for number of elements, source operand distribution and/or widened result distribution. And lastly, the 8x8 multiplier array is relatively small, so the investment in gates pays substantial dividends on the smaller bit sizes.

The group discussed using a microarchitectural check on scalar width to select a narrower
multiplier, but ... [keep in mind that this dynamic selection is for SLEN<VLEN type Uarchs]

With the scalar 2 * SEW introduction, dynamically selecting between the two approaches would require reading the scalar value and determining whether its upper half (SEW bits) were all zeros (or all ones for signed), in which case it could use the optimized approach. If the high SEW bits were not just the sign, then the fallback of using the 2 * SEW multipliers would be taken. This dynamic re-configuring was rightly trounced. Evaluating the high SEW bits would occur much too late in the process, introducing stalls or complex read-ahead X register circuitry that is not needed anywhere else and would likely impact cycle timing. Dead on arrival.

group consensus was that this information should be
supplied through a different opcode.

Given that multiplies would
provide the larger benefit, and that adds would then have a
non-uniform format, the decision was made to stay with the PoR.

I believe this narrative correctly reflects the reasoning.

I fully agree with this final conclusion that uniformity persuades to handling all potential integer 2 * SEW the same way.

However, I believe I must take blame for framing the issue as a duality: either leverage the next level (2 * SEW) multiplier or optimize with a narrower widening multiplier circuit. And the latter would require a dynamic “macro” selection between them.

I did not present alternatives that change the narrative and basis for decision.

Firstly, a full double width multiplier is not necessary (but certainly sufficient) for the integer SEWx(2 * SEW) case. By definition, the high SEW bits of the vector operand are zero and do not participate in the (2 * SEW)x(2 * SEW) circuitry. Further, only SEW bits of the product of the vector with the high SEW bits of the scalar are retained, and thus need only be generated and summed.
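
A quick numerical check of that claim, as a sketch only (unsigned case, hypothetical function names): the truncated SEW x (2 * SEW) product can be rebuilt from one ordinary widening SEW x SEW product plus just the low SEW bits of the vector times the high scalar half.

```python
def full_product(v, s, sew):
    """Reference: SEW-bit vector element times 2*SEW-bit scalar, truncated to 2*SEW."""
    wide_mask = (1 << (2 * sew)) - 1
    return (v * s) & wide_mask


def split_product(v, s, sew):
    """Same result from two narrow partial products."""
    narrow_mask = (1 << sew) - 1
    wide_mask = (1 << (2 * sew)) - 1
    s_lo = s & narrow_mask
    s_hi = (s >> sew) & narrow_mask
    # Existing widening multiply: SEW x SEW gives a full 2*SEW product.
    low_part = v * s_lo
    # Only the low SEW bits of this product survive truncation to 2*SEW.
    high_part = (v * s_hi) & narrow_mask
    return (low_part + (high_part << sew)) & wide_mask


def check_all(sew):
    """Exhaustively compare both forms for a small SEW."""
    return all(
        full_product(v, s, sew) == split_product(v, s, sew)
        for v in range(1 << sew)
        for s in range(1 << (2 * sew))
    )


if __name__ == "__main__":
    print(check_all(4))
```

The signed variants would need sign handling on the high half, which this sketch leaves out.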

Especially when ELEN > XLEN, but even with lower SEW, the widening multiply (and even non-widening) will likely be implemented as temporal iterations of sum of partial products; in some cases this will be driven by the desire to keep cycle time constrained. This temporal circuitry could be utilized for conditionally summing the High-SEW-Scalar-bits with the SEW-vector on the narrow multiply. Thus zero/sign high bits of scalar are not a selection between LARGE/narrow but rather an optimization of the narrow process.

The optimization of narrow multiplies can be incorporated independently for various sizes of SEW. The cost to add the upper X-register SEW bits is nominal at 8 bits and still small at 16 bits. For RV32 these are the only two integer widenings of concern for 2 * SEW scalar. For RV64 the only other integer widening 2 * SEW is 32 bit.
A tradeoff between

  1. half throughput (use next level up full multiplier), or
  2. (as above) conditional temporal, or
  3. parallel partial-product generation and fast sum hardware

can be chosen independent of upper 32 bits of X register.

Re-framing the proposal in these terms changes the question from a dichotomy to a continuum of design options that can be effectively implemented (and as efficient as possible) on simple Uarch designs without hobbling performant designs.

The question then becomes one of worth for complexity at V1.0.

In this context I believe it is worthy, especially as Krste remarked for the expanded multiply.