#### RISC-V Vector Task Group: fractional LMUL

Krste Asanovic

In the last meeting, we discussed a problem that would be introduced
if we were to drop the fixed-size (b/h/w) variants of the vector
load/stores and only have the SEW-size (e) variants. From the
minutes:

The current design uses constant SEW/LMUL ratios to align data
types of different element widths. If only SEW-sized load/stores
were available, then a computation using a mixture of element
widths would have to use larger LMUL for larger SEW values, which
effectively reduces the number of available registers and so
increases register pressure. The fixed-width load/stores allow,
e.g., a byte to be loaded into a vector register with four-byte
element width at LMUL=1, which avoids this issue.

Consider the case of a byte (8b) load into a word (32b) register.
The effect of a byte load is to use only one quarter of the bits in
the register, with widening replicating zero/sign bits into the other
bits of the register.

A different strategy to use a portion of the bits in a vector register
would be to add the concept of a fractional LMUL, i.e., LMUL=1/2, 1/4,
1/8. This has the effect of supporting a given SEW/LMUL ratio with
smaller LMUL values. This can be done without adding additional state
to the machine, but only by adding a new variant of vsetvli that sets
vl according to a shorter VLMAX calculated with the appropriate
reduction in VLEN.

E.g.,    vsetvli rd, rs1, e8,f2    # LMUL=1/2
         vsetvli rd, rs1, e8,f4    # LMUL=1/4
         vsetvli rd, rs1, e8,f8    # LMUL=1/8

These instructions leave LMUL=1 in vtype, and the machine executes
instructions as before; vl is simply shorter under these settings.
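To make the "shorter VLMAX" concrete, here is a small Python sketch (illustrative only; the function names and the simplified vl-granting rule are mine, not from the spec):

```python
# Illustrative model, not normative: how VLMAX and vl behave under
# fractional LMUL. VLEN and SEW are in bits; lmul may be fractional.

def vlmax(vlen_bits, sew_bits, lmul):
    """Elements available in a (possibly fractional) register group."""
    return int(vlen_bits * lmul) // sew_bits

def vsetvl(avl, vlen_bits, sew_bits, lmul):
    """vl that a vsetvli-style request would grant (simplified)."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))
```

With VLEN=128 and e8, LMUL=1 gives VLMAX=16 while LMUL=1/2 gives 8, matching the reduced VLMAX described above.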

The same effect could be achieved without any new ISA instructions by
performing vsetvli with the widest SEW to set vl, then repeating
vsetvli with rd=x0, rs1=x0 to keep this vl value. However, this would
add an additional instruction in the general case (sometimes the
widest operation isn't naturally the first in a loop). In other cases
vl is fixed throughout a loop and the code can arrange for the first
setvl to use the widest SEW, so the additional instructions can be
avoided.

With or without a new vsetvli implementation for fractional LMUL, more
dynamic instructions are still required in general than with the
fixed-size loads into SEW elements, which don't need to change SEW.

We can discuss further in the next task group meeting tomorrow.

Krste

Nick Knight

Could a fractional LMUL circumvent the constraint that a widening instruction's output register group must be larger than its input register group(s)?

--Nick Knight

On Thu, Feb 6, 2020 at 9:34 PM Krste Asanovic <krste@...> wrote:


Krste Asanovic

I'm realizing the idea doesn't quite work unless the machine actually
stores the fractional LMUL value (to cope with SLEN and widening
instruction behavior), and so would need the new instruction and an
extra bit of state in vtype.

But with this change, yes.

For floating-point code, where there is no way to load a narrower type
into a wider element, this can be used to increase the number of
registers of a wider width. E.g., a matrix multiply accumulating
double-precision values that are the product of single-precision
values, using widening muladds, can be arranged as something like:

vsetvli    x0, x0, e32,f2       # Fractional LMUL
vle.v      v0, (bmatp)          # Get row of matrix
flw        f1, (x15)            # Get scalar
vfwmacc.vf v1, f1, v0           # One row of outer-product
add        x15, x15, acolstride # Bump pointer
...
flw        f31, (x15)           # Get scalar
vfwmacc.vf v31, f31, v0         # Last row of outer-product
...

which holds 31 rows of the destination matrix accumulators as doubles
while performing widening muladds from a single-precision vector load
held in v0. This is probably overkill register blocking for this
particular example, but it shows the general improvement.
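As a rough illustration of what one pass of that loop computes (a Python model, not the ISA; Python floats stand in for doubles, and the shapes are arbitrary):

```python
# Illustrative model of one outer-product step: several double-precision
# accumulator rows (v1..v31 above) are each updated by a widening
# multiply-accumulate (vfwmacc.vf) of one single-precision row vector
# (v0) against one single-precision scalar (f1..f31).

def outer_product_step(acc, scalars, row):
    for i, s in enumerate(scalars):          # one vfwmacc.vf per scalar
        acc[i] = [a + s * x for a, x in zip(acc[i], row)]
    return acc
```

The point of fractional LMUL here is that v0 only needs half of a register at e32 while the e64 accumulators use full registers, keeping the SEW/LMUL ratio constant.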

Krste

On Thu, 6 Feb 2020 22:05:50 -0800, "Nick Knight" <nick.knight@...> said:
| Could a fractional LMUL circumvent the constraint that a widening instruction's output register group must be larger than its input
| register group(s)?

| --Nick Knight


David Horner

Special lmul code in vsetvl{i} to derive LMUL from the existing SEW/LMUL ratio and the provided sew code.

As mentioned in the TG, we suggested widening LMUL to 3 bits with 7 (explicit) states.

I suggest we name them l1, l2, l4, l8 as before, and lf2, lf4, lf8 for the fractional 1/2, 1/4, 1/8 respectively.

lf16 may be useful in larger machines, but lf8 already cannot be used on minimal RV32 machines.

(I recommend we use lf2, lf4 and lf8 rather than f2, f4, f8 to avoid confusion with floating-point and to associate the value with lmul as a fraction.)

I propose the remaining value be used to derive LMUL from the existing SEW/LMUL and provided sew code.

Specifically, new LMUL = (new sew / old SEW) * old LMUL, which retains the initial SEW/LMUL ratio.
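A minimal sketch of this derivation rule, assuming exact rational arithmetic (the function name is mine):

```python
from fractions import Fraction

# Sketch of the proposed lx behaviour: derive the new LMUL so that the
# SEW/LMUL ratio established by the first vsetvl{i} is preserved.

def derive_lmul(old_sew, old_lmul, new_sew):
    return Fraction(new_sew, old_sew) * Fraction(old_lmul)
```

For example, starting at SEW=32 with LMUL=1 (ratio 32), switching to e8 derives LMUL=1/4, keeping SEW/LMUL = 32 throughout.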

Tentatively I use lx as the mnemonic for the lmul code.

I recommend the immediate variant using lx have the pseudo-opcode  vsetlmuli.

As a shorthand I refer to vsetlv that uses the lx code in rs1 as vsetlmul.

Removing the fixed-size load/stores motivates the increased lmul range and increases the need to issue additional instructions.

This facility complements the increased lmul range and mitigates the need to explicitly specify the correct lmul in the additional associated vsetlmul{i} instructions.

As a result, the first vsetvl{i} instruction establishes the SEW/LMUL ratio and subsequent vsetlmul{i} instructions maintain it.

Notably, values of vl that were legal in the original vsetvl{i} remain legal in subsequent (legal) vsetlmul{i} instructions.

The additional fractional lmul values provide a substantial dynamic range and reduce the possibility of an illegal calculated LMUL.

Even so, the calculated LMUL can be out of range, in which case vsetlmul{i} will set vill.

Other proposals have the objective of changing SEW and LMUL while enforcing that vl and/or the SEW/LMUL ratio not change.

This proposal complements those recommendations.

In particular, it addresses issue #365 (What is the case about vsetvl{i} x0, x0, {vtypei|rs2} without keeping VLMAX the same):

when rs1 is x0 (to use the current value of vl), the use of lx ensures that vl, VLMAX, and the SEW/LMUL ratio will, if possible, remain the same; if that is not possible, vill will be set.

To have our lf16 and lx too.

As I recommended initially, the lmul code in vsetvl{i} that would otherwise be used for lf16 is used for lx.

That means the vtype lmul field will not have an lf16 value set by an explicit lmul code.

However, we could allow lf16 to be calculated and stored by an instruction using lx.

A read of vtype will correctly show the lf16 code.

Restoring vtype from the value read from vtype is a multi-step process:

If the lmul code is not lf16, issue vsetvl with the saved vtype in rs2.

If the lmul code is lf16:

1. change the sew code in rs2 to one larger than saved (e.g. use e16 if the saved vtype has e8)
2. set lmul in rs2 to lf8
3. issue vsetvl with the revised sew and lmul vtype in rs2
4. change the sew code in rs2 back to the original (in the current example, from e16 back to e8)
5. change lmul in rs2 to lx
6. issue vsetvl with the original sew and lmul=lx vtype in rs2

lmul will now be lf16 and the sew code the same as saved.

This approach will work for all combinations of sew and lf16 that are valid in vtype.
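The restore procedure can be sketched as follows (the encodings are illustrative assumptions, not part of the proposal: sew codes e8=0, e16=1, …; lmul as an exact fraction):

```python
from fractions import Fraction

# Sketch of the two-step vtype restore when the saved lmul code is lf16,
# which cannot be written with an explicit lmul code under this proposal.
# Returns the sequence of vsetvl steps needed.

def restore_vtype(saved_sew_code, saved_lmul):
    if saved_lmul != Fraction(1, 16):
        return [("vsetvl", saved_sew_code, saved_lmul)]   # one instruction
    return [
        # step 1: one-larger sew with lf8 (same SEW/LMUL ratio as the target)
        ("vsetvl", saved_sew_code + 1, Fraction(1, 8)),
        # step 2: lx re-derives lmul for the original sew, yielding lf16
        ("vsetvl_lx", saved_sew_code),
    ]
```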

The question is whether lf16 is worth the extra effort, or whether it potentially causes confusion/failures for application code.

This is standalone from the proposal, but complements it, so I submit it here.

vsetvl is defined not to trap on illegal vtype codes.

It sets vill instead and clears the remainder of vtype, which triggers illegal-operation exceptions in subsequent vector instructions that rely on it.

This facilitates low-overhead interrogation of the hardware to determine supported SEW sizes and ediv support (and any future functionality in vtype).

Arguably, vsetlmul{i} should trap if the original vtype has vill set. But I recommend more than that.

An implementation can trap on any instruction or sub-instruction to emulate it, take corrective action, trigger a context switch, report an anomaly, etc.

As a result, an implementation could choose to provide the behaviour I suggest below and still be compliant.

However, I recommend we make trap mandatory on certain cases of vsetvl{i} use.

These traps would leave vl and vtype undisturbed, as they were before instruction execution.

Specifically trap if:

1) lmul = lx and original vtype has vill set.

2) lmul = lx and calculated lmul is not valid

3) rd = x0 and rs1 = x0 (use existing vl for rs1) and VLMAX changes (that is, the SEW/LMUL ratio is not maintained)

(This only occurs with explicit sew/lmul; if lmul=lx it is either case 1 or 2.)
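The three conditions can be sketched as a predicate (illustrative; the parameter names are mine, and the set of valid LMULs reflects this proposal's seven explicit codes plus lf16):

```python
from fractions import Fraction

# Sketch of the proposed mandatory-trap conditions for vsetvl{i}.
VALID_LMULS = {Fraction(1, 16), Fraction(1, 8), Fraction(1, 4),
               Fraction(1, 2), Fraction(1), Fraction(2), Fraction(4),
               Fraction(8)}

def must_trap(lmul_is_lx, vill, derived_lmul,
              rd_is_x0, rs1_is_x0, old_vlmax, new_vlmax):
    if lmul_is_lx and vill:                                  # case 1
        return True
    if lmul_is_lx and derived_lmul not in VALID_LMULS:       # case 2
        return True
    if rd_is_x0 and rs1_is_x0 and old_vlmax != new_vlmax:    # case 3
        return True
    return False
```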

A precise trap is extremely helpful to debug these cases.

Case 1. Without a precise trap, it requires traceback not only to the last vsetvl{i} but also to at least the previous one to determine the cause of vill.

Case 2. Without a precise trap, it requires traceback not only to the last vsetvl{i} but also to at least the previous one to determine the probable values of sew and lmul.

Case 3. There is no compelling reason to use rd = x0 and rs1 = x0 to interrogate the hardware, so maintaining VLMAX is the probable intent, given that the existing vl is provided.

Without a precise trap, it requires traceback not only to the last vsetvl{i} but also to at least the previous one to determine the probable values of sew and lmul.

On 2020-02-07 1:40 a.m., Krste Asanovic wrote:


David Horner

I left the encoding unspecified in the proposal.

That was intentional, as I saw various tradeoffs.

However, I now recommend the codes be ordered by increasing VLMAX value, as follows:

| vlmul | mnemonic | LMUL | VLMAX       | #groups | grouped registers                  |
|-------|----------|------|-------------|---------|------------------------------------|
| 000   | lf16     | 1/16 | VLEN/SEW/16 | 32      | vn (single register, lower 1/16th) |
| 001   | lf8      | 1/8  | VLEN/SEW/8  | 32      | vn (single register, lower 1/8th)  |
| 010   | lf4      | 1/4  | VLEN/SEW/4  | 32      | vn (single register, lower 1/4th)  |
| 011   | lf2      | 1/2  | VLEN/SEW/2  | 32      | vn (single register, lower 1/2)    |
| 100   | l1       | 1    | VLEN/SEW    | 32      | vn (single register in group)      |
| 101   | l2       | 2    | 2*VLEN/SEW  | 16      | vn, vn+1                           |
| 110   | l4       | 4    | 4*VLEN/SEW  | 8       | vn, …, vn+3                        |
| 111   | l8       | 8    | 8*VLEN/SEW  | 4       | vn, …, vn+7                        |
This allows a quick "andi and bz" check for the "special" LMUL=1/16 code.

It also allows the formula

    new LMUL = (new sew / old SEW) * old LMUL

to be implemented as

    vlmul = new vsew - old vsew + old vlmul
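A small sketch of that additive implementation, assuming the code assignments in the table above and vsew codes e8=0, e16=1, e32=2, e64=3:

```python
# With vlmul codes ordered by increasing VLMAX (lf16=0 ... l8=7, so that
# LMUL = 2**(vlmul - 4)), the lx derivation reduces to addition on the
# 3-bit fields; an out-of-range result corresponds to setting vill.

def derived_vlmul(old_vsew, old_vlmul, new_vsew):
    code = new_vsew - old_vsew + old_vlmul
    return code if 0 <= code <= 7 else None   # None models vill
```

E.g. old e32 (code 2) at l1 (code 4) switching to e8 (code 0) gives vlmul = 0 - 2 + 4 = 2, i.e. lf4, matching LMUL = (8/32) * 1 = 1/4.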

The first column could be inverted to allow the original code values with no appreciable hardware cost, but that complicates software for the two cases above.

On 2020-02-09 12:44 a.m., David Horner via Lists.Riscv.Org wrote:

