make SEW be the largest element width


Krste Asanovic
 

I added my proposal to github:
https://github.com/riscv/riscv-v-spec/issues/425

appended below for those not following github

Krste

This proposal is a modification of an earlier idea to add an
effective element width to load/store instructions, to mitigate
dropping fixed-width load/stores and to provide greater efficiency
for mixed-width floating-point codes.

This proposal redefines SEW to be the largest element width (LEW?),
and correspondingly redefines widening/narrowing operations:

Previously a double-widening add was defined as:
2*SEW = SEW + SEW
the new proposal is to specify
SEW = SEW/2 + SEW/2
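
As a minimal sketch of the two readings (register choices here are
illustrative only), the same double-widening add written both ways:

# Previous definition: vtype SEW names the narrower source width
vsetvli t0, a0, e16,m4
vwadd.vv v8, v0, v4 # 16b + 16b sources, 32b result (EEW=32, EMUL=8)

# This proposal: vtype SEW names the widest width in the operation
vsetvli t0, a0, e32,m8
vwadd.vv v8, v0, v4 # sources read at SEW/2=16b, EMUL=4; result at SEW=32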

This proposal does not change the behavior/implementation of existing
instructions, except to change how the effective EW and effective LMUL
are obtained from vtype.

For load/store instructions, we could modify these to have relative
element widths that are fractions of SEW {SEW, SEW/2, SEW/4, SEW/8}.
Assembly syntax could be something like vle, vlef2, vlef4, vlef8.
There is a challenge in readability that these are relative to the
last vtype setting, and also that when SEW is less than 64, some
become useless.
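
A hedged sketch of how the relative forms would read, using the
mnemonic naming suggested above (illustrative only, not settled
syntax):

vsetvli t0, a0, e64,m8
vle.v v8, (a1) # EEW = SEW = 64
vlef2.v v16, (a2) # EEW = SEW/2 = 32
vlef4.v v24, (a3) # EEW = SEW/4 = 16
# at SEW=16 the vlef4/vlef8 forms would ask for 4b and 2b elements,
# which is the "some become useless" problem noted above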

I think for this reason we should stick with fixed vle8, vle16,
vle32, vle, in the base encoding {8,16,32,SEW}. These are more
readable and can be used to interleave load/stores of
larger-than-SEW values without changing vtype (e.g., moving 32b
values when SEW=16). Extending past the base, we could add
relative-EEW load/stores using an unused mop bit, for example.
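
For example, a minimal sketch of the interleaving case (addresses
and registers are illustrative):

vsetvli t0, a0, e16,m2
vle16.v v4, (a1) # 16b working data at SEW
vle32.v v8, (a2) # 32b values at EEW=32 (EMUL=4), no vsetvli needed
vse32.v v8, (a3) # stored back out, still without touching vtype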

Mask registers have an EEW of SEW/LMUL, and so defining the relative
sizes as fractions of SEW also makes it more likely that code can
save/restore a mask register without changing vtype. The fixed sizes
will cover some of these cases, though code might have to
save/restore more than needed if SEW/LMUL < 8.
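
A sketch of a case the fixed sizes do cover, relying only on the
mask EEW of SEW/LMUL noted above (the stack address is illustrative):

vsetvli t0, a0, e32,m4
# mask elements are SEW/LMUL = 8 bits, so vse8/vle8 match them exactly
vse8.v v0, (sp) # spill mask register, no vtype change
...
vle8.v v0, (sp) # reload it later, again without a vsetvli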

There may actually be a minor hardware-checking advantage to knowing
that SEW holds the largest possible element width, since following
instructions cannot request an effective SEW larger than this. Even
with fixed-size load/stores, if the sizes are 8, 16, 32, and SEW,
then all of these will also be legal on standard implementations.

Some of the examples we've been discussing.

----------------------------------------------------------------------
The "worst-case" example from TG slide:

# int32_t a[i] = int8_t b[i] << 15
# With fixed-width load/store
vsetvli t0, a0, e32,m1
vlb.v v4, (rb)
vsll.vi v4, v4, 15
vsw.v v4, (ra)

# This proposal, assuming quad-widening and fractional LMUL
vsetvli t0, a0, e32,m1
vle8.v v1, (rb) # load bytes into e8,m1 register
vqcvt.x.x.v v4, v1 # Quad-widen to 32 bits
vsll.vi v4, v4, 15 # 32-bit shift
vse32.v v4, (ra) # could also use vse

Obviously, adding a quad-widening left shift would remove any
difference between the two.
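
For instance, with a hypothetical mnemonic for that quad-widening
left shift (vqsll.vi below is purely illustrative):

vsetvli t0, a0, e32,m1
vle8.v v1, (rb) # load bytes at EEW=8
vqsll.vi v4, v1, 15 # hypothetical: quad-widen to 32 bits, then shift
vse32.v v4, (ra)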

----------------------------------------------------------------------
The previous widening compute operation:

# Widening C[i]+=A[i]*B[i], where A and B are FP16, C is FP32

vsetvli t0, a0, e32,m8 # vtype SEW=32b
vle16.v v4, (a1)
vle16.v v8, (a2)
vle32.v v16, (a3) # Get previous 32b values, no vsetvli needed.
vfwmacc.vv v16, v4, v8 # EEW,EMUL of source operands are SEW/2,LMUL/2
vse32.v v16, (a3) # EEW=32b, no vsetvli needed

----------------------------------------------------------------------
The example code from #362

Original fixed-width code assuming 32b float.

vsetvli t0, a0, e32,m8 # Assuming 32b float
loop:
[...] # Other instructions with fixed vl and sew
vlb.v v0, (x12) # Get byte value
vadd.vx v16, v0, x11 # Add scalar integer offset
vfcvt.f.x.v v16, v16 # Convert to 32b floating-point value
[...]

converts to:

vsetvli t0, a0, e32, m8
loop:
[...]
vle8.v v0, (x12) # Load bytes
vqcvt.x.x.v v16, v0 # Quad-widening sign-extension
vadd.vx v16, v16, x11 # Add offset
vfcvt.f.x.v v16, v16 # Convert to float
[...]

which is just one more instruction in the inner loop.

With a quad-widening add (vqadd.vx), there would be no penalty in this
case:

vsetvli t0, a0, e32, m8
loop:
[...]
vle8.v v0, (x12) # Load bytes
vqadd.vx v16, v0, x11 # Add offset, quad-widen
vfcvt.f.x.v v16, v16 # Convert to float
[...]

The vle8 version also reduces the number of vector registers tied up
buffering load values when the loop is software-pipelined/loop
unrolled, allowing more loads in flight for a given LMUL.

Basically, in software-pipelined/unrolled loops, widen-at-load
instructions tie up more architectural registers than widen-at-use
(where the widening is part of the compute instruction), reducing
flexibility in scheduling and/or reducing LMUL.

Even if an explicit widen instruction is used, at that point the
values are in the registers and the widen will proceed at max rate,
generally reducing architectural register occupancy versus having the
wider values in-flight from memory. For example:

vsetvli t0, a0, e32,m8

vlb.v v0, () # v0-v7 tied up while
. # scheduling around memory latency
.
.
vadd.vx v16, v0, x11 # consume here
vlb.v v0, () # schedule next iteration here
.

versus

vsetvli t0, a0, e32,m8

vle8.v v0, (x12) # only v0-v1 tied up
.
vle8.v v2, (x13) # fetch next iteration into v2-v3
.
.
vqcvt.x.x.v v16, v0 # Quad-widening sign-extension
vadd.vx v16,v16,x11 # Add offset
.

As mentioned earlier, the widened loads also might run at a slower
rate unless full register write-port bandwidth is provided.

----------------------------------------------------------------------
Another code example, extracted from 3x3 convolutions with 32b += 16b * 8b


vsetvli t0, a0, e16,m2
vle8.v v1, (x) # occupies v1
vwcvt.x.x.v v4, v1 # widen v1 into v4-v5
vsetvli t0, a0, e32,m4
vwmacc.vx v8, x11, v4 # widening mul-add into v8-v11

----------------------------------------------------------------------


David Horner
 

First thoughts below:

- as a response to #424, and

- we need only consider SEW = LEW (new) and 1/2 LEW (POR)

On 2020-04-19 1:00 a.m., Krste Asanovic wrote:
I added my proposal to github:
  https://github.com/riscv/riscv-v-spec/issues/425

appended below for those not following github

Krste

This proposal is a modification of earlier idea to add effective
element width to load/store instructions to mitigate dropping
fixed-width load/stores and to provide greater efficiency for
mixed-width floating-point codes.

This proposal redefines SEW to be the largest element width (LEW?),
When completed is it LEWd?
and correspondingly the definition of widening/narrowing operations:

Previously a double-widening add was defined as:
           2*SEW = SEW + SEW
the new proposal is to specify
             SEW = SEW/2 + SEW/2

With the corresponding LMUL = LMUL/2 + LMUL/2.

This biases towards fractional LMUL and thus interleaved vs. striped groupings, a direction I endorse.

(I expect it is no surprise that I lean towards interleave and enhancing fractional LMUL,

hopefully to the point that striped LMUL becomes a secondary mechanism, if not obsoleted.)

Thus SEW as LEW was one of my preferred options.


SEW at 1/2LEW is the current POR, and has the advantage that the majority of operations are defined and performed at this SEW level.

Conditioning data for widening ops and setting masks occurs at the 1/2LEW level.

This assumes the dominant widening operations are double and not quad/octal. I have little reason to suspect otherwise.

Straight conversions to 1/2 LEW from 1/4 and 1/8 LEW are also important, but I have little reason to suspect they are so important as to compromise the efficiency of other operations.
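
For instance, a minimal sketch under the POR reading (registers illustrative), where both the data conditioning and the mask are set at the 1/2 LEW level and only the final accumulate widens:

vsetvli t0, a0, e16,m4 # SEW = 1/2 LEW = 16
vmslt.vx v0, v8, x11 # condition/mask computed at the 16b level
vwmacc.vv v16, v8, v12, v0.t # single widening step up to LEW = 32b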


Thus, I believe SEW at 1/4 LEW and 1/8 LEW is of little value, being much too removed from where the real work occurs.

Once again I postulate that the best choice is determined by the program's activities, as both SEW = LEW and 1/2 LEW have merit.


As a response to #424, this may be a tactical acceptance of my assertion:

KEY POINT: A locally optimal SEW:LMUL with exceptions is, I believe, possible.

        Scaled load/store goes a long way toward achieving that by providing such exceptions in a key activity.

        However, I believe we may need another exception mechanism.

REQUEST: What I am hoping is that we can get consensus on the above KEY POINT, and then move on to an efficient exception mechanism.

However, it may only be addressing the limitations/tradeoffs inherent in POR load/store and the recently proposed packed fixed-width load/store and subsequent widening mechanisms. All good stuff, but not as general as my KEY POINT.

More comments to come.



David Horner
 

For those not following on github:

On 2020-04-19 1:00 a.m., Krste Asanovic wrote:

....

I think for this reason
[re-positioned below]
we should stick with fixed vle8, vle16, vle32,
vle, in base encoding {8,16,32,SEW}. These are more readable and can
be used to interleave load/stores of larger than SEW values without
changing vtype (e.g., moving 32b values when SEW=16).

[re-positioned here]
For load/store instructions, we could modify these to have relative
element widths that are fractions of SEW {SEW, SEW/2, SEW/4, SEW/8}.
Assembly syntax could be something like vle, vlef2, vlef4, vlef8.
There is a challenge in readability that these are relative to last
vtype setting,



SEW and LMUL values are essential to correct code execution regardless of load/store width encoding.

They should be assembler directive variables set automatically by vsetvli (and vsetvl when its xs2 argument is statically defined).

For dynamic xs2 and vsetvl, a manual assembler directive should be available.

This should help in various situations, including validation that the SEW/LMUL ratio is maintained by a given vsetvli, and also for load/store syntax:

With this in place the assembler can translate e8 to the correct SEW * factor value in load/stores, eliminating the readability concern.

      and also when SEW is less than 64, some become useless.

I agree that the base should have as robust an encoding as possible without overcommitting available bits.

Thus I would also want to move 32b when SEW=16, and in addition:

  • move 64b values when SEW=16
  • move 64b values when SEW=256, 128, or 512
  • and various more combinations
  • and not waste encoding when SEW < 64.

I also believe that loads/stores are so important, so pivotal (e.g., matrix transforms), that flexibility and efficiency are both mandated.

The compressed load/store format seeks to address efficiency.

The encoding needs the flexibility of SEW * factors.

I propose the factors be dependent upon the current SEW value:

For SEW=8 the encoding yields factors of 1,2,4 and 8

For SEW=16 the encoding yields factors of 1/2, 1, 2 and 4

For SEW of 32 and above the encoding yields factors of 1/4, 1/2, 1 and 2.

Thus we always support LEW = SEW*2 operations, and support load/store of SEW/2 and SEW/4 when they exist.
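
To make the mapping concrete, a hedged sketch at SEW=16 (the mnemonics vlefh/vlef1/vlef2/vlef4 for factors 1/2, 1, 2, 4 are purely hypothetical placeholders; the assembler would resolve them against the SEW it tracks from the last vsetvli, as proposed above):

vsetvli t0, a0, e16,m2
vlefh.v v1, (a1) # factor 1/2 -> EEW = 8
vlef1.v v2, (a2) # factor 1 -> EEW = 16 (SEW)
vlef2.v v4, (a3) # factor 2 -> EEW = 32 (2*SEW = LEW)
vlef4.v v8, (a4) # factor 4 -> EEW = 64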