Re: RISC-V Vector Task Group: fractional LMUL


Nick Knight
 

Could a fractional LMUL circumvent the constraint that a widening instruction's output register group must be larger than its input register group(s)?

--Nick Knight


On Thu, Feb 6, 2020 at 9:34 PM Krste Asanovic <krste@...> wrote:

In the last meeting, we discussed a problem that would be introduced
if we were to drop the fixed-size (b/h/w) variants of the vector
load/stores and only have the SEW-size (e) variants.  From the
minutes:

    The current design uses constant SEW/LMUL ratios to align data
    types of different element widths.  If only SEW-sized load/stores
    were available, then a computation using a mixture of element
    widths would have to use larger LMUL for larger SEW values, which
    effectively reduces the number of available registers and so
    increases register pressure.  The fixed width load/stores allow,
    e.g., a byte to be loaded into a vector register with four-byte
    width with LMUL=1 so avoids this issue.
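
To make the register-pressure point concrete, here is a rough sketch of
the arithmetic.  It assumes VLMAX = VLEN * LMUL / SEW and a hypothetical
VLEN of 128 bits; the helper names are illustrative only and do not come
from the spec or the minutes.

```python
from fractions import Fraction

VLEN = 128       # assumed vector register length in bits (hypothetical)
NUM_VREGS = 32   # architectural vector registers


def vlmax(sew, lmul):
    """Elements per register group, assuming VLMAX = VLEN * LMUL / SEW."""
    return int(Fraction(VLEN) * Fraction(lmul) / sew)


def lmul_for_ratio(sew, ratio):
    """LMUL needed so that SEW/LMUL equals the given ratio."""
    return Fraction(sew, ratio)


# Hold SEW/LMUL at 8 so byte, halfword and word vectors all contain the
# same number of elements.
RATIO = 8
for sew in (8, 16, 32):
    lmul = lmul_for_ratio(sew, RATIO)
    groups = NUM_VREGS // lmul  # usable register groups at this LMUL
    print(f"SEW={sew:2d} LMUL={lmul} VLMAX={vlmax(sew, lmul)} groups={groups}")
```

Under these assumptions the element count stays the same for every SEW,
but the word-sized operands need LMUL=4, leaving 8 register groups
instead of 32.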

Consider the case of a byte (8b) load into a word (32b) register.
The effect of a byte load is to use only one quarter of the bits in a
register, with widening to replicate zero/sign bits into the other
bits of the register.

A different strategy to use a portion of the bits in a vector register
would be to add the concept of a fractional LMUL, i.e., LMUL=1/2, 1/4,
1/8.  This has the effect of supporting a given SEW/LMUL ratio with
smaller LMUL values.  This can be done without adding state to the
machine; it needs only a new variant of vsetvli that sets vl according
to a shorter VLMAX, calculated with the appropriate reduction in VLEN.

E.g.,    vsetvli rd, rs1, e8,f2    # LMUL=1/2
         vsetvli rd, rs1, e8,f4    # LMUL=1/4
         vsetvli rd, rs1, e8,f8    # LMUL=1/8

These instructions leave LMUL=1 in vtype, and the machine executes
subsequent instructions as before; the only difference is that the vl
they set is shorter.
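
As a rough illustration, the proposed variant could be modeled as below.
This is only a sketch: VLEN=128 is a hypothetical value, the function
name is made up, and vl = min(AVL, VLMAX) is a simplifying assumption
about how vl is chosen.

```python
from fractions import Fraction

VLEN = 128  # assumed vector register length in bits (hypothetical)

# Suffixes from the proposal mapped to the fractional LMUL they denote.
FRACTIONAL_LMUL = {
    "f2": Fraction(1, 2),
    "f4": Fraction(1, 4),
    "f8": Fraction(1, 8),
}


def vsetvli_fractional(avl, sew, suffix):
    """Return (vl, LMUL recorded in vtype) for the proposed variant.

    Assumes vl = min(AVL, VLMAX), with VLMAX computed from a VLEN that
    has been scaled down by the fraction.
    """
    vlmax = int(VLEN * FRACTIONAL_LMUL[suffix] / sew)
    return min(avl, vlmax), 1  # vtype still holds LMUL=1


for suffix in ("f2", "f4", "f8"):
    vl, lmul = vsetvli_fractional(avl=100, sew=8, suffix=suffix)
    print(f"e8,{suffix}: vl={vl} (vtype LMUL={lmul})")
```

With these assumed values, e8,f4 yields vl=4, the same element count
that e32 with LMUL=1 would give.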

The same effect could be achieved without any new ISA instructions by
performing a vsetvli with the widest SEW to set vl, then a second
vsetvli with rd=x0,rs1=x0 to keep this vl value.  However, this would
add an instruction in the general case (sometimes the widest operation
isn't naturally the first in a loop).  In other cases vl is fixed
throughout a loop, and the code can be arranged so that the first
vsetvli uses the widest SEW, avoiding the additional instruction.
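
The two-instruction sequence can be sketched with a toy state model.
Everything here is an assumption made for illustration: VLEN=128, the
class and method names, the rule vl = min(AVL, VLMAX), and the use of
avl=None to stand for the rd=x0,rs1=x0 form described above.

```python
VLEN = 128  # assumed vector register length in bits (hypothetical)


class VectorState:
    """Toy model of vl and SEW with LMUL fixed at 1."""

    def __init__(self):
        self.vl = 0
        self.sew = 8

    def vsetvli(self, avl, sew):
        """Set SEW; set vl from avl unless avl is None.

        avl=None models the rd=x0,rs1=x0 form, which keeps the current
        vl while changing SEW.
        """
        if avl is not None:
            self.vl = min(avl, VLEN // sew)  # assumes vl = min(AVL, VLMAX)
        self.sew = sew
        return self.vl


state = VectorState()
state.vsetvli(100, 32)   # widest SEW first: vl = VLEN / 32
state.vsetvli(None, 8)   # switch to bytes, keep the same vl
print(state.vl, state.sew)
```

Under these assumptions the byte operations run with vl=4, which is
what a single e8 setting with LMUL=1/4 would produce.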

With or without a new vsetvli variant for fractional LMUL, more
dynamic instructions are still required in general than with the
fixed-size loads into SEW elements, which don't need to change SEW.

We can discuss further in the next task group meeting tomorrow.
Members can find login details on the members task group calendar.

Krste



