Effective element width encoding in vector load/stores

Krste Asanovic

There are two separate issues noted with the proposal to drop the
fixed-size vector load/stores.  One is the additional vsetvli instructions
needed, and the second is the additional widening instructions
required.  We've discussed adding more widening instructions to help
with the latter.  I have a proposal below to help with the former in a
way that improves FP also, and which also provides a solution to the
indexed vector index size wart we've had for a while.

This proposal still only supports packed load/stores, as opposed to
unpacked load/stores with sign/zero extension.  However, the
problematic instruction overhead of many additional vsetvli
instructions when simply removing fixed-size load/stores is avoided by
repurposing the width field to encode the "effective" element width
(EEW) for the current vector load/store instruction.

Using the width field, EEW is encoded as one of {8,16,32,SEW}.  This
now determines *both* the register element size and the memory element
size, where previously it only set the memory element size and
sign/zero extended this into the SEW-width register element.

Effective LMUL (ELMUL) is now calculated as ELMUL = (EEW/SEW)*LMUL to
keep SEW/LMUL constant.  If this results in an unsupported LMUL
value, an illegal instruction exception is raised.

The effective EEW/ELMUL setting is only in effect for the single
instruction and does not change values in the vtype CSR.

This doesn't add any real hardware complexity, since the underlying
microarchitectural operations were all supported anyway; it just
streamlines loading/storing of different element widths with the same
SEW/LMUL ratio in the same loop.  Our widening/narrowing operations
already do something similar.

So for example for unit-stride, we can reduce the original code:

# Widening C[i]+=A[i]*B[i], where A and B are FP16, C is FP32
vsetvli t0, a0, e16,m4  # vtype SEW=16b
vle.v v4, (a1)
vle.v v8, (a2)
   vsetvli x0, x0, e32,m8
   vle.v v16, (a3)
   vsetvli x0, x0, e16,m4
vfwmacc.vv v16, v4, v8  # EEW,ELMUL of result is 2*SEW,2*LMUL
   vsetvli x0, x0, e32,m8
   vse.v v16, (a3)         # store 32b results

down to:

vsetvli t0, a0, e16,m4  # vtype SEW=16b
vle.v v4, (a1)
vle.v v8, (a2)
   vle32.v v16, (a3)       # Get previous 32b values, no vsetvli needed.
vfwmacc.vv v16, v4, v8  # EEW,ELMUL of result is 2*SEW,2*LMUL
   vse32.v v16, (a3)       # EEW=32b, no vsetvli needed

removing three vsetvli instructions from the inner loop.

Note this approach also helps floating-point code, whereas fixed-size
byte/halfword/word load/stores with sign/zero extension do not.

I'm using vle32 syntax to mirror the assembler syntax for vsetvli e32 etc.

I think this also solves our indexed load/store problem.  We use
vtype.SEW to encode the data elements, but use the width-field-encoded
EEW for indices.  One wrinkle is that the largest EEW encoding
now indicates 64b not SEW, i.e., index EEW is {8,16,32,64}.
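
The two decodings of the width field can be summarized as below (the
width codes 0..3 are illustrative placeholders, not the actual
instruction bit patterns):

```python
def decode_eew(width_code, sew, is_index):
    """Map a width-field code to an effective element width in bits.

    For normal vector loads/stores the largest encoding means EEW=SEW;
    for indexed loads/stores it instead means a 64b index, so index
    EEW is one of {8,16,32,64}.
    """
    table = [8, 16, 32, 64 if is_index else sew]
    return table[width_code]
```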

# Load 32b values using 8b offsets:
vsetvli t0, a0, e32,m4
vlx8.v v8, (a1), v7  # Load 32b values into v8-11, using EEW=8,ELMUL=1 vector in v7

# Load 8b values using 64b offsets:
vsetvli t0, a0, e8,mf2
vlx64.v v1, (a1), v8  # EEW=64,ELMUL=4 indices in v8-v11, SEW=8,LMUL=1/2 data in v1

