Re: RISC-V Vector Task Group: fractional LMUL
I'm realizing the idea doesn't quite work unless the machine actually
stores the fractional LMUL value (to cope with SLEN and widening
instruction behavior), and so it would need the new instruction and an
extra bit of state in vtype.
But with this change, yes.
For floating-point code, where there is no way to load a narrower type
into a wider element, this can be used to increase the number of
registers of a wider width. E.g., a matrix multiply accumulating
double-precision values that are the products of single-precision
values, using widening muladds, can be arranged as something like:
    vsetvli x0, x0, e32,f2      # Fractional LMUL
    vle.v v0, (bmatp)           # Get row of matrix
    flw f1, (x15)               # Get scalar
    vfwmacc.vf v1, f1, v0       # One row of outer-product
    add x15, x15, acolstride    # Bump pointer
    ...
    flw f31, (x15)              # Get scalar
    vfwmacc.vf v31, f31, v0     # Last row of outer-product
which is holding 31 rows of the destination matrix accumulators as
doubles while performing widening muladds from a single-precision
vector load held in v0. This is probably overkill register blocking
for this particular example, but shows the general improvement.
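
To put a rough number on the improvement, here is a small sketch that counts how many widened accumulator groups fit in the 32 vector registers next to a single source group. The function name and the simple allocation model (source group at v0, a widening op writes a group of twice the source LMUL, a group never takes less than one whole register, groups aligned to their own size) are assumptions for illustration, not something specified in this thread.

```python
from fractions import Fraction
from math import ceil

NUM_VREGS = 32  # architectural vector registers v0..v31


def accumulator_groups(src_lmul):
    """Count widened destination groups that fit beside one source group.

    Assumed model: the source group starts at v0, the destination group of a
    widening op has twice the source LMUL, and every group is aligned to a
    multiple of its own size in whole registers.
    """
    src_lmul = Fraction(src_lmul)
    src_regs = max(1, ceil(src_lmul))        # registers held by the source
    dst_regs = max(1, ceil(2 * src_lmul))    # registers per widened group
    # First destination group starts at the next aligned register number.
    first_dst = -(-src_regs // dst_regs) * dst_regs
    return (NUM_VREGS - first_dst) // dst_regs


if __name__ == "__main__":
    for lmul in (Fraction(1, 2), 1, 2):
        print(f"source LMUL={lmul}: {accumulator_groups(lmul)} accumulator groups")
```

Under these assumptions, a source LMUL of 1/2 leaves 31 double-width accumulators (v1 to v31, as in the example above), while a source LMUL of 1 leaves only 15 two-register groups.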
On Thu, 6 Feb 2020 22:05:50 -0800, "Nick Knight" <nick.knight@...> said:
| Could a fractional LMUL circumvent the constraint that a widening
| instruction's output register group must be larger than its input
| register group(s)?
| --Nick Knight
| On Thu, Feb 6, 2020 at 9:34 PM Krste Asanovic <krste@...> wrote:
| In the last meeting, we discussed a problem that would be introduced
| if we were to drop the fixed-size (b/h/w) variants of the vector
| load/stores and only have the SEW-size (e) variants. From the
| The current design uses constant SEW/LMUL ratios to align data
| types of different element widths. If only SEW-sized load/stores
| were available, then a computation using a mixture of element
| widths would have to use larger LMUL for larger SEW values, which
| effectively reduces the number of available registers and so
| increases register pressure. The fixed width load/stores allow,
| e.g., a byte to be loaded into a vector register with four-byte
| width with LMUL=1 so avoids this issue.
| Consider the case of a byte (8b) load into a word (32b) register.
| The effect of a byte load is to use only one quarter of the bits in a
| register, with widening to replicate zero/sign bits into the other
| bits of the register.
| A different strategy to use a portion of the bits in a vector register
| would be to add the concept of a fractional LMUL, i.e., LMUL=1/2, 1/4,
| 1/8. This has the effect of supporting a given SEW/LMUL ratio with
| smaller LMUL values. This can be done without adding additional state
| to the machine, but only by adding a new variant of vsetvli that sets
| vl according to a shorter VLMAX calculated with the appropriate
| reduction in VLEN.
| E.g., vsetvli rd, rs1, e8,f2 # LMUL=1/2
| vsetvli rd, rs1, e8,f4 # LMUL=1/4
| vsetvli rd, rs1, e8,f8 # LMUL=1/8
| These instructions leave LMUL=1 in vtype, and the machine executes the
| instructions as before, just the vl will be shorter in these
| The same effect could be achieved without any new ISA instructions by
| performing vsetvli with the widest SEW to set vl, then repeating
| vsetvli with rd=x0,rs1=x0 to keep this vl value. This would add an
| additional instruction in the general case (sometimes the widest
| operation isn't naturally the first in a loop). In other cases, vl
| is fixed throughout a loop and the code can be arranged so that the
| first vsetvli uses the widest SEW, so the additional instruction can
| be avoided.
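
As a rough check of the arithmetic behind the quoted paragraphs above, the sketch below computes VLMAX = LMUL * VLEN / SEW with a fractional LMUL and compares it with the two-step approach of setting vl at the widest SEW first. VLEN=128, the helper names, and the simplification vl = min(AVL, VLMAX) are assumptions for illustration only.

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul):
    """VLMAX = LMUL * VLEN / SEW, where LMUL may be a fraction such as 1/4."""
    return int(Fraction(lmul) * vlen_bits // sew_bits)


def set_vl(avl, vlen_bits, sew_bits, lmul):
    """Simplified vsetvli model (assumption): vl = min(AVL, VLMAX)."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


if __name__ == "__main__":
    VLEN = 128  # assumed vector register width in bits
    AVL = 100   # assumed application vector length

    # Proposed form: vsetvli rd, rs1, e8,f4  (LMUL=1/4)
    vl_fractional = set_vl(AVL, VLEN, 8, Fraction(1, 4))

    # Existing form: vsetvli at the widest SEW (e32, LMUL=1) to set vl,
    # then vsetvli x0, x0, e8 to switch SEW while keeping that vl.
    vl_two_step = set_vl(AVL, VLEN, 32, 1)

    print(vl_fractional, vl_two_step)
```

With these assumed values both approaches give the same vl, since e8 at LMUL=1/4 and e32 at LMUL=1 share the same SEW/LMUL ratio.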
| With or without new vsetvli implementation for fractional LMUL, there
| are still more dynamic instructions required in general than the
| fixed-size loads into SEW elements, which don't need to change SEW.
| We can discuss further in the next task group meeting tomorrow.
| Members can find login details on the members task group calendar.