Re: RISC-V Vector Task Group: fractional LMUL
David Horner
Special lmul code in vsetvl{i} to derive LMUL from existing
SEW/LMUL and provided sew code.
As mentioned in the TG, we suggested widening LMUL to 3 bits with 7 (explicit) states. I suggest we name them l1, l2, l4, l8 as before, and lf2, lf4, lf8 for the fractional 1/2, 1/4, 1/8 respectively. lf16 may be useful in larger machines, but already lf8 cannot be used on minimal RV32 machines. (I recommend we use lf2, lf4 and lf8 rather than f2, f4, f8 to avoid confusion with float and to associate them with lmul as a fractional value.)
I propose the remaining value be used to derive LMUL from the existing SEW/LMUL and provided sew code.
Tentatively I use lx as the mnemonic for this lmul code. I recommend the immediate variant using lx have the pseudo-opcode vsetlmuli. As a shorthand, I refer to a vsetvl that uses the lx code in rs2 as vsetlmul.
Removing the fixed-size loads/stores motivates the increased lmul range and increases the need to issue additional instructions. This facility complements the increased lmul range and mitigates the need to explicitly specify the correct lmul in the additional associated vsetlmul{i} instructions. As a result, the first vsetvl{i} instruction establishes the SEW/LMUL ratio and subsequent vsetlmul{i} instructions maintain it.
Notably, values of vl that were legal in the original vsetvl{i} are legal in subsequent (legal) vsetlmul{i} instructions.
The additional fractional lmul values provide a substantial dynamic range and reduce the possibility of an illegal calculated LMUL. Even so, the calculated LMUL can be out of range for vsetlmul{i}, in which case vill is set.
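A minimal sketch of the lx derivation described above, in Python for illustration only. It assumes the seven explicit LMUL values proposed earlier are the only legal results and that the derived value is simply new_sew * old_lmul / old_sew; the names derive_lmul and VALID_LMUL are hypothetical, and returning None stands in for setting vill.

```python
from fractions import Fraction

# Assumption: the seven explicit states of the widened lmul field
# (lf8, lf4, lf2, l1, l2, l4, l8) are the only legal derived values.
VALID_LMUL = {Fraction(1, 8), Fraction(1, 4), Fraction(1, 2),
              Fraction(1), Fraction(2), Fraction(4), Fraction(8)}


def derive_lmul(old_sew, old_lmul, new_sew):
    """Return the LMUL that keeps SEW/LMUL unchanged, or None to model vill."""
    new_lmul = Fraction(new_sew) * Fraction(old_lmul) / Fraction(old_sew)
    return new_lmul if new_lmul in VALID_LMUL else None


# e32,l1 followed by a vsetlmul with e8 keeps the ratio at 32.
print(derive_lmul(32, 1, 8))   # 1/4, i.e. lf4
# e8,l8 followed by e16 would need LMUL=16, which is out of range.
print(derive_lmul(8, 8, 16))   # None, i.e. vill
```

Because the ratio SEW/LMUL is preserved whenever a value is returned, VLMAX is unchanged and any previously legal vl stays legal.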
Other proposals have the objective of changing SEW and LMUL while enforcing that vl and/or the SEW/LMUL ratio not change. This proposal complements those recommendations. In particular, it addresses issue #365 (What is the case about vsetvl{i} x0, x0, {vtypei|rs2} without keeping VLMAX the same): when rs1 is x0, meaning use the current value of vl, the use of lx ensures that vl, VLMAX and the SEW/LMUL ratio remain the same if possible; if that is not possible, vill is set.
To have our lf16 and lx too.
As I recommended initially, the lmul code in vsetvl{i} that would otherwise be used for lf16 is used for lx. That means the vtype lmul field will never have an lf16 value set by an explicit lmul code. However, we could allow lf16 to be calculated and stored by an instruction using lx. A read of vtype will correctly show the lf16 code.
Restoring vtype from the value read from vtype is then a multistep process.
If the lmul code is not lf16, issue vsetvl with the saved vtype in rs2.
If the lmul code is lf16:
1) change the sew code in rs2 to one larger than provided (e.g. use e16 if e8 is in the saved vtype)
2) set lmul in rs2 to lf8
3) issue vsetvl with the revised sew and lmul vtype in rs2
4) change the sew code in rs2 back to the original (in the current example, from e16 back to e8)
5) change lmul in rs2 to lx
6) issue vsetvl with the original sew and lmul=lx vtype in rs2
lmul will now be lf16 and the sew code the same as saved.
This approach will work for all values of sew and lf16 that are valid in vtype. The question is whether lf16 is worth the extra effort, or whether it potentially causes confusion/failures for application code.
Additional suggestion:
This is standalone from the proposal, but complements it, so I submit it here. vsetvl is defined not to trap on illegal vtype codes. It sets vill instead and clears the remainder of vtype, which triggers an illegal operation in subsequent vector instructions that rely on it. This facilitates low-overhead interrogation of the hardware to determine supported SEW sizes and ediv support (and any future functionality in vtype).
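Returning briefly to the lf16 restore sequence described earlier, the steps can be checked with a toy model. This is only a sketch: the functions vsetvl and restore_vtype and the LX marker are hypothetical stand-ins, vtype is reduced to a (sew, lmul) pair, "one sew code larger" is assumed to mean doubling SEW, and no hardware range checks (VLEN, ELEN) are modelled.

```python
from fractions import Fraction

LF8, LF16 = Fraction(1, 8), Fraction(1, 16)
LX = "lx"  # hypothetical marker for the derive-from-ratio lmul code


def vsetvl(state, sew, lmul):
    """Toy vsetvl: state is the current (sew, lmul); returns the new pair."""
    if lmul == LX:
        old_sew, old_lmul = state
        lmul = Fraction(sew) * old_lmul / old_sew  # keep SEW/LMUL constant
    elif lmul == LF16:
        # In this model lf16 has no explicit code; it can only be derived.
        raise ValueError("lf16 cannot be requested explicitly")
    return (sew, lmul)


def restore_vtype(state, saved_sew, saved_lmul):
    """Restore a saved (sew, lmul) pair using the multistep procedure."""
    if saved_lmul != LF16:
        return vsetvl(state, saved_sew, saved_lmul)
    # Steps 1-3: one sew code larger with lmul=lf8 gives the same ratio.
    state = vsetvl(state, saved_sew * 2, LF8)
    # Steps 4-6: original sew with lmul=lx derives lf16.
    return vsetvl(state, saved_sew, LX)


print(restore_vtype((32, Fraction(1)), 8, LF16))
```

In the e8 example from the text, the intermediate state is e16,lf8 (ratio 128), and the final lx step derives 8 / 128 = 1/16.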
Arguably, vsetlmul{i} should trap if the original vtype has vill set. But I recommend more than that.
An implementation can trap on any instruction or sub-instruction to emulate, to take corrective action, to trigger a context switch, to report an anomaly, etc. As a result, an implementation could choose to provide the behaviour I suggest below and still be compliant. However, I recommend we make the trap mandatory in certain cases of vsetvl{i} use. These traps would leave vl and vtype undisturbed, as they were before instruction execution.
Specifically, trap if:
1) lmul = lx and the original vtype has vill set.
2) lmul = lx and the calculated lmul is not valid.
3) rd = x0 and rs1 = x0 (use existing vl for rs1) and VLMAX changes (that is, the SEW/LMUL ratio is not maintained). This only occurs with an explicit sew/lmul; if lmul = lx it is either case 1 or case 2.
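The three trap conditions can be written as a single predicate. The following is a sketch under stated assumptions, not a specification: trap_case and its parameters are hypothetical names, new_lmul=None is used to mean lmul = lx, the legal LMUL set is taken to be the seven explicit values, and for an explicit lmul with vill already set (a situation the list above does not cover) the sketch reports no trap.

```python
from fractions import Fraction

# Assumption: the seven explicit lmul states are the legal derived values.
VALID_LMUL = {Fraction(1, 8), Fraction(1, 4), Fraction(1, 2),
              Fraction(1), Fraction(2), Fraction(4), Fraction(8)}


def trap_case(rd_x0, rs1_x0, old_vill, old_sew, old_lmul, new_sew, new_lmul):
    """Return 1, 2 or 3 for the matching trap case, or None for no trap.

    new_lmul is a Fraction for an explicit code, or None to mean lmul = lx.
    """
    if new_lmul is None:                        # lmul = lx
        if old_vill:
            return 1                            # original vtype has vill set
        derived = Fraction(new_sew) * old_lmul / old_sew
        return None if derived in VALID_LMUL else 2
    # Explicit sew/lmul; assumption: skip the ratio check if vill was set,
    # since the old sew/lmul fields are cleared in that state.
    if rd_x0 and rs1_x0 and not old_vill:
        # VLMAX is unchanged exactly when SEW/LMUL is unchanged.
        if Fraction(old_sew) / old_lmul != Fraction(new_sew) / new_lmul:
            return 3
    return None


# e32,l1 -> explicit e8,l1 with rd = rs1 = x0 changes the ratio from 32 to 8.
print(trap_case(True, True, False, 32, Fraction(1), 8, Fraction(1)))  # 3
```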
Case 1. Without a precise trap, it requires traceback not only to the last vsetvl{i} but also to at least the previous one to determine the cause of vill.
Case 2. Without a precise trap, it requires traceback not only to the last vsetvl{i} but also to at least the previous one to determine the probable values of sew and lmul.
Case 3. There is no compelling reason to use rd = x0 and rs1 = x0 to interrogate the hardware, so maintaining VLMAX is the probable intent given that the existing vl is provided. Without a precise trap, it requires traceback not only to the last vsetvl{i} but also to at least the previous one to determine the probable values of sew and lmul.
On 2020-02-07 1:40 a.m., Krste Asanovic wrote:
I'm realizing the idea doesn't quite work unless machine actually stores the fractional LMUL value (to cope with SLEN and widening instruction behavior), and so would need the new instruction and an extra bit of state in vtype. But with this change, yes.
For floating-point code, where there is no way to load a narrower type into a wider element, this can be used to increase the number of registers of a wider width, e.g., in a matrix multiply accumulating double-precision values that are the product of single-precision values, using widening muladds, can arrange as something like:
    vsetvli x0, x0, e32,f2    # Fractional Lmul
    vle.v v0, (bmatp)         # Get row of matrix
    flw f1, (x15)             # Get scalar
    vfwmacc.vf v1, f1, v0     # One row of outer-product
    add x15, x15, acolstride  # Bump pointer
    ...
    flw f31, (x15)            # Get scalar
    vfwmacc.vf v31, f31, v0   # Last row of outer-product
    ...
which is holding 31 rows of the destination matrix accumulators as doubles while performing widening muladds from a single-precision vector load held in v0. This is probably overkill register blocking for this particular example, but shows the general improvement.
Krste