Re: RISC-V Vector Task Group: fractional LMUL


David Horner
 

Special lmul code in vsetvl{i} to derive LMUL from existing SEW/LMUL and provided sew code.

As mentioned in the TG, we suggested widening LMUL to 3 bits with 7 (explicit) states.

I suggest we name them l1, l2, l4, l8 as before, and lf2, lf4, lf8 for the fractional 1/2, 1/4, 1/8 respectively.

lf16 may be useful in larger machines, but lf8 already cannot be used on minimal RV32 machines.

(I recommend we use lf2, lf4 and lf8 rather than f2, f4, f8 to avoid confusion with float and to associate them with lmul as a fractional value.)

I propose the remaining value be used to derive LMUL from the existing SEW/LMUL and provided sew code.

Specifically, new LMUL = (new SEW / old SEW) * old LMUL, which retains the initial SEW/LMUL ratio.
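The derivation above can be sketched as a small calculation. This is only a model of the arithmetic, not a description of any implementation; the function name derive_lmul is made up for illustration.

```python
from fractions import Fraction

def derive_lmul(old_sew: int, old_lmul: Fraction, new_sew: int) -> Fraction:
    """Hypothetical helper: new LMUL = (new SEW / old SEW) * old LMUL."""
    return Fraction(new_sew, old_sew) * old_lmul

# Start at e32 with LMUL = 1 (SEW/LMUL ratio 32), then request e8 with lx.
new_lmul = derive_lmul(old_sew=32, old_lmul=Fraction(1), new_sew=8)
print(new_lmul)                 # 1/4, i.e. lf4
print(Fraction(8) / new_lmul)   # 32, the ratio is unchanged
```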


Tentatively I use lx as the mnemonic for the lmul code.

I recommend the immediate variant using lx have the pseudo-opcode vsetlmuli.

As a shorthand, I refer to a vsetvl that uses the lx code in rs2 as vsetlmul.


Removing the fixed-size loads/stores motivates the increased lmul range and increases the need to issue additional instructions.

This facility complements the increased lmul range and mitigates the need to explicitly specify the correct lmul in the additional associated vsetlmul{i} instructions.

As a result, the first vsetvl{i} instruction establishes the SEW/LMUL ratio and subsequent vsetlmul{i} instructions maintain it.


Notably, values of vl that were legal in the original vsetvl{i} are legal in subsequent (legal) vsetlmul{i} instructions.
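A rough way to see why vl stays legal: with VLMAX = VLEN * LMUL / SEW, keeping SEW/LMUL constant keeps VLMAX constant. The sketch below assumes VLEN = 128 purely for illustration, and the helper names are hypothetical.

```python
from fractions import Fraction

VLEN = 128  # assumed vector register length in bits, for illustration only

def derive_lmul(old_sew, old_lmul, new_sew):
    # lx behaviour as proposed: scale LMUL by the change in SEW.
    return Fraction(new_sew, old_sew) * old_lmul

def vlmax(sew, lmul, vlen=VLEN):
    # VLMAX = VLEN * LMUL / SEW
    return int(vlen * Fraction(lmul) / sew)

# The first vsetvl{i} establishes e16, l2; later steps model vsetlmul{i}.
sew, lmul = 16, Fraction(2)
first = vlmax(sew, lmul)
for new_sew in (32, 8, 64):
    lmul = derive_lmul(sew, lmul, new_sew)
    sew = new_sew
    # VLMAX never changes, so any vl that was legal remains legal.
    assert vlmax(sew, lmul) == first
```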


The additional fractional lmul values provide a substantial dynamic range and reduce the possibility of an illegal calculated LMUL.

Even so, the calculated LMUL can be out of range, in which case vsetlmul{i} will set vill.
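As a sketch of the range check, assuming only the seven explicit values l8 down to lf8 are legal results (lf16 is left out here), a derived LMUL outside that set is reported as vill. The function name vsetlmul_result and the string return values are illustrative only.

```python
from fractions import Fraction

# The seven explicit LMUL states named in this proposal.
LMUL_NAMES = {
    Fraction(1, 8): "lf8", Fraction(1, 4): "lf4", Fraction(1, 2): "lf2",
    Fraction(1): "l1", Fraction(2): "l2", Fraction(4): "l4", Fraction(8): "l8",
}

def vsetlmul_result(old_sew, old_lmul, new_sew):
    """Return the derived LMUL name, or "vill" if it is out of range."""
    derived = Fraction(new_sew, old_sew) * old_lmul
    return LMUL_NAMES.get(derived, "vill")

print(vsetlmul_result(64, Fraction(8), 8))      # l1
print(vsetlmul_result(8, Fraction(8), 16))      # vill (LMUL would be 16)
print(vsetlmul_result(32, Fraction(1, 8), 8))   # vill (LMUL would be 1/32)
```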


Other proposals have the objective of changing SEW and LMUL and enforcing that vl and/or the SEW/LMUL ratio not change.

This proposal complements those recommendations.

In particular, it addresses issue #365 (What is the case about vsetvl{i} x0, x0, {vtypei|rs2} without keeping VLMAX the same).

When rs1 is x0 (to use the current value of vl), the use of lx ensures that vl, VLMAX and the SEW/LMUL ratio remain the same if possible; if that is not possible, vill is set.


To have our lf16 and lx too.

As I recommended initially, the lmul code in vsetvl{i} that would otherwise be used for lf16 is used for lx.

That means that the vtype lmul field cannot be set to lf16 by an explicit lmul code.

However, we could allow lf16 to be calculated and stored by an instruction using lx.

A read of vtype will correctly show the lf16 code.

Restoring vtype from the value read from vtype is a multistep process.

    If the lmul code is not lf16, then issue vsetvl with the saved vtype in rs2.

    If the lmul code is lf16:

        change the sew code in rs2 to one larger than provided (e.g. use e16 if e8 is in the saved vtype)

        set lmul in rs2 to lf8

        issue vsetvl with the revised sew and lmul vtype in rs2

        change the sew code in rs2 back to the original (in the current example, from e16 back to e8)

        change lmul in rs2 to lx

        issue vsetvl with the original sew and lmul=lx vtype in rs2

        lmul will now be lf16 and the sew code the same as saved.

This approach will work for all values of sew and lf16 that are valid in vtype.
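The restore sequence can be modelled roughly as follows. This is a toy state model under stated assumptions: it tracks only sew and lmul, ignores vl, vill and range checks, and uses the string "lx" as a stand-in for the proposed code. The class and function names are hypothetical.

```python
from fractions import Fraction

LF8 = Fraction(1, 8)
LF16 = Fraction(1, 16)
LX = "lx"  # stand-in for the proposed derive-LMUL code

class VTypeModel:
    """Toy model holding only the sew and lmul fields of vtype."""
    def __init__(self, sew=8, lmul=Fraction(1)):
        self.sew, self.lmul = sew, lmul

    def vsetvl(self, sew, lmul):
        if lmul == LX:
            # Derive LMUL so that the SEW/LMUL ratio is retained.
            lmul = Fraction(sew, self.sew) * self.lmul
        elif lmul == LF16:
            raise ValueError("lf16 has no explicit lmul code in this scheme")
        self.sew, self.lmul = sew, lmul

def restore_vtype(state, saved_sew, saved_lmul):
    if saved_lmul != LF16:
        state.vsetvl(saved_sew, saved_lmul)   # single-step restore
        return
    state.vsetvl(saved_sew * 2, LF8)          # one sew step larger, lf8
    state.vsetvl(saved_sew, LX)               # (sew / 2*sew) * 1/8 = 1/16

state = VTypeModel()
restore_vtype(state, 8, LF16)
print(state.sew, state.lmul)  # 8 1/16
```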

The question is whether lf16 is worth the extra effort, or whether it potentially causes confusion/failures for application code.


Additional suggestion:

This is standalone from the proposal, but complements it, so I submit it here.

vsetvl is defined not to trap on illegal vtype codes.

It sets vill instead and clears the remainder of vtype, which triggers an illegal instruction exception in subsequent vector instructions that rely on it.

This facilitates low-overhead interrogation of the hardware to determine supported SEW sizes and ediv support (and any future functionality in vtype).

Arguably, vsetlmul{i} should trap if the original vtype has vill set. But I recommend more than that.

An implementation can trap on any instruction or sub-instruction to emulate it, to take corrective action, to trigger a context switch, to report an anomaly, etc.

As a result, an implementation could choose to provide the behaviour I suggest below and still be compliant.

However, I recommend we make a trap mandatory in certain cases of vsetvl{i} use.

These traps would leave vl and vtype undisturbed, as they were before instruction execution.

Specifically trap if:

     1) lmul = lx and original vtype has vill set.

     2) lmul = lx and calculated lmul is not valid 

     3) rd = x0 and rs1 = x0 (use existing vl for rs1) and VLMAX changes (that is SEW/LMUL ratio is not maintained)

            (This only occurs with explicit sew/lmul. If lmul=lx, it is either case 1 or 2.)
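The three conditions can be sketched as a single predicate. The assumptions here: new_lmul=None stands for lx, only the seven explicit LMUL values count as valid, and an unchanged SEW/LMUL ratio is used as the test for an unchanged VLMAX. The function name and argument layout are hypothetical, and case 3 with vill already set is not modelled because it is not addressed above.

```python
from fractions import Fraction

VALID_LMUL = {Fraction(1, 8), Fraction(1, 4), Fraction(1, 2),
              Fraction(1), Fraction(2), Fraction(4), Fraction(8)}

def trap_case(old_sew, old_lmul, old_vill, new_sew, new_lmul, rd_x0, rs1_x0):
    """Return 1, 2 or 3 for the trap case that applies, or 0 for no trap.

    new_lmul=None models lmul = lx.
    """
    if new_lmul is None:
        if old_vill:
            return 1   # lx requested but the original vtype has vill set
        derived = Fraction(new_sew, old_sew) * old_lmul
        if derived not in VALID_LMUL:
            return 2   # calculated lmul is not valid
        return 0
    if rd_x0 and rs1_x0:
        # Existing vl is reused, so VLMAX (the SEW/LMUL ratio) must not change.
        if Fraction(new_sew) / new_lmul != Fraction(old_sew) / old_lmul:
            return 3
    return 0

print(trap_case(32, Fraction(1), False, 16, Fraction(1), True, True))     # 3
print(trap_case(32, Fraction(1), False, 16, Fraction(1, 2), True, True))  # 0
```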


A precise trap is extremely helpful to debug these cases.

Case 1. Without a precise trap, debugging requires tracing back not only to the last vsetvl{i} but also to at least the previous one to determine the cause of vill.

Case 2. Without a precise trap, debugging requires tracing back not only to the last vsetvl{i} but also to at least the previous one to determine the probable values of sew and lmul.

Case 3. There is no compelling reason to use rd = x0 and rs1 = x0 to interrogate the hardware, so maintaining VLMAX is the probable intent given that the existing vl is provided.

                Without a precise trap, debugging requires tracing back not only to the last vsetvl{i} but also to at least the previous one to determine the probable values of sew and lmul.

     

On 2020-02-07 1:40 a.m., Krste Asanovic wrote:

I'm realizing the idea doesn't quite work unless machine actually
stores the fractional LMUL value (to cope with SLEN and widening
instruction behavior), and so would need the new instruction and an
extra bit of state in vtype.

But with this change, yes.

For floating-point code, where there is no way to load a narrower type
into a wider element, this can be used to increase the number of
registers of a wider width, e.g., in a matrix multiply accumulating
double-precision values that are the product of single-precision
values, using widening muladds, can arrange as something like:

     vsetvli    x0, x0, e32,f2     # Fractional Lmul
     vle.v      v0, (bmatp)        # Get row of matrix
     flw f1, (x15)                 # Get scalar
     vfwmacc.vf v1, f1, v0         # One row of outer-product
     add x15, x15, acolstride      # Bump pointer
     ...
     flw f31, (x15)                # Get scalar
     vfwmacc.vf v31, f31, v0       # Last row of outer-product
     ...

which is holding 31 rows of the destination matrix accumulators as
doubles while performing widening muladds from a single-precision
vector load held in v0.  This is probably overkill register blocking
for this particular example, but shows the general improvement.

Krste
