make SEW be the largest element width


Krste Asanovic
 

I added my proposal to github:
https://github.com/riscv/riscv-v-spec/issues/425

appended below for those not following github

Krste

This proposal is a modification of an earlier idea to add an effective
element width to load/store instructions, both to mitigate the loss of
the fixed-width load/stores and to provide greater efficiency for
mixed-width floating-point codes.

This proposal redefines SEW to be the largest element width (LEW?),
and correspondingly the definition of widening/narrowing operations:

Previously, a double-widening add was defined as:
    2*SEW = SEW + SEW
whereas the new proposal specifies:
    SEW = SEW/2 + SEW/2

This proposal does not change the behavior/implementation of existing
instructions, except to change how the effective EW and effective LMUL
are obtained from vtype.
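
As a minimal sketch of what this means for an existing widening
instruction (using vwadd.vv; the register and address choices here are
only illustrative), the sources are read at SEW/2 and the result is
written at SEW:

vsetvli t0, a0, e32,m2 # SEW=32 is the largest element width in this sequence
vle16.v v4, (a1) # sources at EEW=SEW/2=16, EMUL=LMUL/2=1
vle16.v v6, (a2)
vwadd.vv v8, v4, v6 # destination at EEW=SEW=32, EMUL=LMUL=2 (v8-v9)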

For load/store instructions, we could modify these to have relative
element widths that are fractions of SEW: {SEW, SEW/2, SEW/4, SEW/8}.
Assembly syntax could be something like vle, vlef2, vlef4, vlef8.
There is a readability challenge in that these are relative to the
last vtype setting, and also, when SEW is less than 64, some of them
become useless.

I think for this reason we should stick with fixed vle8, vle16, vle32,
vle in the base encoding, i.e., EEW in {8,16,32,SEW}. These are more
readable and can be used to interleave load/stores of larger-than-SEW
values without changing vtype (e.g., moving 32b values when SEW=16, as
sketched below). Extending past the base, we could add relative-EEW
load/stores using an unused mop bit, for example.
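
The interleaving case mentioned above, as a sketch (registers and
addresses are illustrative):

vsetvli t0, a0, e16,m1 # SEW=16 for the surrounding computation
vle16.v v2, (a1) # 16-bit operands at EEW=SEW=16
vle32.v v4, (a2) # move 32-bit values (EEW=32, EMUL=2) without changing vtype
vse32.v v4, (a3) # store them back out, still no vsetvli needed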

Mask registers have an EEW of SEW/LMUL, and so defining the relative
sizes as fractions of SEW also makes it more likely that code can
save/restore a mask register without changing vtype. The fixed sizes
will cover some of these cases, though code might have to save/restore
more than needed when SEW/LMUL < 8 (e.g., SEW=16 with LMUL=4 gives a
mask EEW of 4 bits, so the smallest fixed access, 8 bits, covers more
than needed).
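
For instance (a sketch only, under this draft's mask layout; the base
address a5 is illustrative), with SEW=32 and LMUL=1 the mask EEW is
SEW/LMUL = 32, so a fixed-size load/store matches it directly:

vsetvli t0, a0, e32,m1 # mask EEW = SEW/LMUL = 32
vse32.v v0, (a5) # save mask register, no vtype change needed
[...] # reuse v0 here
vle32.v v0, (a5) # restore mask register, again without touching vtype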

There is possibly a minor hardware-checking advantage to knowing that
SEW holds the largest possible element width, since following
instructions cannot request an effective element width larger than
this. Even with fixed-size load/stores, if the sizes are {8,16,32,SEW},
then all of these will also be legal on standard implementations.

Here are some of the examples we've been discussing.

----------------------------------------------------------------------
The "worst-case" example from TG slide:

# int32_t a[i] = int8_t b[i] << 15
# With fixed-width load/store
vsetvli t0, a0, e32,m1
vlb.v v4, (rb)
vsll.vi v4, v4, 15
vsw.v v4, (ra)

# This proposal, assuming quad-widening and fractional LMUL
vsetvli t0, a0, e32,m1
vle8.v v1, (rb) # load bytes into e8,m1 register
vqcvt.x.x.v v4, v1 # Quad-widen to 32 bits
vsll.vi v4, v4, 15 # 32-bit shift
vse32.v v4, (ra) # could also use vse

Obviously, adding a quad-widening left shift would remove any
difference between the two.
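
For instance, with a hypothetical quad-widening shift (call it
vqsll.vi; this mnemonic is not part of the current draft), the
sequence shrinks back to four instructions:

vsetvli t0, a0, e32,m1
vle8.v v1, (rb) # load bytes
vqsll.vi v4, v1, 15 # quad-widen to 32 bits and shift in one instruction
vse32.v v4, (ra)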

----------------------------------------------------------------------
The previous widening compute operation:

# Widening C[i]+=A[i]*B[i], where A and B are FP16, C is FP32

vsetvli t0, a0, e32,m8 # vtype SEW=32b
vle16.v v4, (a1)
vle16.v v8, (a2)
vle32.v v16, (a3) # Get previous 32b values, no vsetvli needed.
vfwmacc.vv v16, v4, v8 # EEW,ELMUL of source operands is SEW/2,LMUL/2
vse32.v v16, (a3) # EEW=32b, no vsetvli needed

----------------------------------------------------------------------
The example code from #362

Original fixed-width code assuming 32b float.

vsetvli t0, a0, e32,m8 # Assuming 32b float
loop:
[...] # Other instructions with fixed vl and sew
vlb.v v0, (x12) # Get byte values
vadd.vx v16, v0, x11 # Add scalar integer offset
vfcvt.f.x.v v16, v16 # Convert to 32b floating-point value
[...]

converts to:

vsetvli t0, a0, e32, m8
loop:
[...]
vle8.v v0, (x12) # Load bytes
vqcvt.x.x.v v16, v0 # Quad-widening sign-extension
vadd.vx v16, v16, x11 # Add offset
vfcvt.f.x.v v16, v16 # Convert to float
[...]

which is just one more instruction in the inner loop.

With a quad-widening add (vqadd.vx), there would be no penalty in this
case:

vsetvli t0, a0, e32, m8
loop:
[...]
vle8.v v0, (x12) # Load bytes
vqadd.vx v16, v0, x11 # Add offset, quad-widen
vfcvt.f.x.v v16, v16 # Convert to float
[...]

The vle8 version also reduces the number of vector registers tied up
buffering load values when the loop is software-pipelined or
unrolled, allowing more loads in flight for a given LMUL.

Basically, in software-pipelined/unrolled loops, widen-at-load
instructions tie up more architectural registers than widen-at-use,
where the widening is part of the consuming instruction, reducing
flexibility in scheduling and/or reducing the usable LMUL.

Even if an explicit widen instruction is used, at that point the
values are already in registers and the widen will proceed at the
maximum rate, generally reducing architectural register occupancy
versus having the wider values in flight from memory. For example:

vsetvli t0, a0, e32,m8

vlb.v v0, () # v0-v7 tied up while
. # scheduling around memory latency
.
.
vadd.vx v16, v0, x11 # consume here
vlb.v v0, () # schedule next iteration here
.

versus

vsetvli t0, a0, e32,m8

vle8.v v0, (x12) # only v0-v1 tied up
vle8.v v2, (x13) # fetch next iteration into v2-v3
.
.
vqcvt.x.x.v v16, v0 # Quad-widening sign-extension
vadd.vx v16, v16, x11 # Add offset
.

As mentioned earlier, the widened loads also might run at a slower
rate unless full register write-port bandwidth is provided.

----------------------------------------------------------------------
Another code example, extracted from a 3x3 convolution with 32b += 16b * 8b:


vsetvli t0, a0, e16,m2
vle8.v v1, (x) # occupies v1
vwcvt.x.x.v v4, v1 # widen v1 into v4-v5
vsetvli t0, a0, e32,m4
vwmacc.vx v8, x11, v4 # widening mul-add into v8-v11

----------------------------------------------------------------------
