RISC-V Vector Task Group: fractional LMUL
In the last meeting, we discussed a problem that would be introduced
if we were to drop the fixed-size (b/h/w) variants of the vector
load/stores and only have the SEW-size (e) variants. From the
minutes:

  The current design uses constant SEW/LMUL ratios to align data
  types of different element widths. If only SEW-sized load/stores
  were available, then a computation using a mixture of element
  widths would have to use larger LMUL for larger SEW values, which
  effectively reduces the number of available registers and so
  increases register pressure. The fixed width load/stores allow,
  e.g., a byte to be loaded into a vector register with four-byte
  width with LMUL=1, and so avoid this issue.

Consider the case of a byte (8b) load into a word (32b) register.
The effect of a byte load is to use only one quarter of the bits in a
register, with widening to replicate zero/sign bits into the other
bits of the register.

A different strategy to use a portion of the bits in a vector register
would be to add the concept of a fractional LMUL, i.e., LMUL=1/2, 1/4,
1/8. This has the effect of supporting a given SEW/LMUL ratio with
smaller LMUL values. This can be done without adding additional state
to the machine, only by adding a new variant of vsetvli that sets
vl according to a shorter VLMAX calculated with the appropriate
reduction in VLEN.

E.g.,
    vsetvli rd, rs1, e8,f2   # LMUL=1/2
    vsetvli rd, rs1, e8,f4   # LMUL=1/4
    vsetvli rd, rs1, e8,f8   # LMUL=1/8

These instructions leave LMUL=1 in vtype, and the machine executes the
instructions as before; only the vl set by these instructions is
shorter.

The same effect could be achieved without any new ISA instructions by
performing vsetvli with the widest SEW to set vl, then repeating with
vsetvli with rd=x0,rs1=x0 to keep this vl value. This would add an
additional instruction in the general case (sometimes the widest
operation isn't naturally the first in a loop), but in other cases vl
is fixed throughout a loop and the code can be arranged so that the
first setvl uses the widest SEW, so the additional instructions can be
avoided.

With or without a new vsetvli implementation for fractional LMUL, there
are still more dynamic instructions required in general than with the
fixed-size loads into SEW elements, which don't need to change SEW.

We can discuss further in the next task group meeting tomorrow.
Members can find login details on the members task group calendar.

Krste
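To make the SEW/LMUL arithmetic above concrete, here is a minimal sketch in Python (not part of the original message). It assumes VLMAX = VLEN * LMUL / SEW and a hypothetical VLEN of 128 bits; the function name vlmax is made up for illustration.

```python
from fractions import Fraction

def vlmax(vlen_bits, sew_bits, lmul):
    """Elements per register group, assuming VLMAX = VLEN * LMUL / SEW."""
    return int(Fraction(vlen_bits) * Fraction(lmul) / sew_bits)

VLEN = 128  # hypothetical vector register width in bits

# Bytes at LMUL=1/4 line up element-for-element with words at LMUL=1,
# because both settings have SEW/LMUL = 32.
bytes_fractional = vlmax(VLEN, 8, Fraction(1, 4))
words_full = vlmax(VLEN, 32, 1)

# Bytes at LMUL=1 would give four times as many elements as the words.
bytes_full = vlmax(VLEN, 8, 1)
```

Under these assumptions the fractional setting only shortens VLMAX (and hence vl) for the narrow type, which is the effect the f2/f4/f8 variants are meant to provide.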
Could a fractional LMUL circumvent the constraint that a widening instruction's output register group must be larger than its input register group(s)?

--Nick Knight

On Thu, Feb 6, 2020 at 9:34 PM Krste Asanovic <krste@...> wrote:
I'm realizing the idea doesn't quite work unless the machine actually
stores the fractional LMUL value (to cope with SLEN and widening
instruction behavior), and so would need the new instruction and an
extra bit of state in vtype.

But with this change, yes.

For floating-point code, where there is no way to load a narrower type
into a wider element, this can be used to increase the number of
registers of a wider width. E.g., in a matrix multiply accumulating
double-precision values that are the product of single-precision
values, using widening muladds, one can arrange the code as something
like:

    vsetvli x0, x0, e32,f2     # Fractional Lmul
    vle.v v0, (bmatp)          # Get row of matrix
    flw f1, (x15)              # Get scalar
    vfwmacc.vf v1, f1, v0      # One row of outer-product
    add x15, x15, acolstride   # Bump pointer
    ...
    flw f31, (x15)             # Get scalar
    vfwmacc.vf v31, f31, v0    # Last row of outer-product
    ...

which is holding 31 rows of the destination matrix accumulators as
doubles while performing widening muladds from a single-precision
vector load held in v0.

This is probably overkill register blocking for this particular
example, but shows the general improvement.

Krste

On Thu, 6 Feb 2020 22:05:50 -0800, "Nick Knight" <nick.knight@...> said:

| Could a fractional LMUL circumvent the constraint that a widening
| instruction's output register group must be larger than its input
| register group(s)?
| --Nick Knight
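As a rough check on the "31 rows" figure, the following Python sketch (an editorial illustration, not from the thread) counts how many widened accumulator groups fit next to one source group. The assumptions are mine: 32 vector registers, a destination LMUL of twice the source LMUL, register groups aligned to their own size, and a source that may not share registers with a destination group. The function name is hypothetical.

```python
from fractions import Fraction

def widening_accumulators(src_lmul, num_regs=32):
    """Count destination groups available after reserving one source group.

    Assumes dest LMUL = 2 * source LMUL, size-aligned register groups, and
    no overlap between the source and any destination group.
    """
    src_lmul = Fraction(src_lmul)
    dst_regs = max(1, int(2 * src_lmul))  # registers per destination group
    src_regs = max(1, int(src_lmul))      # registers per source group
    # The source occupies (and so blocks) one destination-sized aligned slot.
    blocked = max(dst_regs, src_regs)
    return (num_regs - blocked) // dst_regs

with_fractional = widening_accumulators(Fraction(1, 2))  # e32,f2 source, LMUL=1 doubles
without_fractional = widening_accumulators(1)            # e32 LMUL=1 source, LMUL=2 doubles
```

With these assumptions the fractional source leaves v1 to v31 free for accumulators, while a source at LMUL=1 leaves only the even-numbered groups v2 to v30.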
David Horner
Special lmul code in vsetvl{i} to derive LMUL from existing
SEW/LMUL and provided sew code.
As mentioned in the TG, we suggested widening LMUL to 3 bits with 7 (explicit) states. I suggest we name them l1, l2, l4, l8 as before, and lf2, lf4, lf8 for the fractional 1/2, 1/4, 1/8 respectively. lf16 may be useful in larger machines, but already lf8 cannot be used on minimal RV32 machines. (I recommend we use lf2, lf4 and lf8 rather than f2, f4, f8 to avoid confusion with float and to associate with lmul as a fraction value.)

I propose the remaining value be used to derive LMUL from the existing SEW/LMUL and provided sew code. Tentatively I use lx as the mnemonic for the lmul code.

I recommend the immediate variant using lx have the pseudo-opcode vsetlmuli. As a shorthand I refer to vsetvl that uses the lx code in rs2 as vsetlmul.
Removing the fixed size load/stores motivates the increased lmul range and increases the need to issue additional instructions. This facility complements the increased lmul range and mitigates the need to explicitly specify the correct lmul in the additional associated vsetlmul{i} instructions. As a result, the first vsetvl{i} instruction establishes the SEW/LMUL ratio and subsequent vsetlmul{i} instructions maintain it.

Notably, values of vl that were legal in the original vsetvl{i} are legal in subsequent (legal) vsetlmul{i} instructions.

The additional fractional lmul values provide a substantial dynamic range and reduce the possibility of an illegal calculated LMUL. Even so, the calculated LMUL can be out of range for vsetlmul{i}, which will then set vill.

Other proposals have the objective of changing SEW and LMUL while enforcing that vl and/or the SEW/LMUL ratio not change. This proposal complements those recommendations. In particular, it addresses issue #365 (What is the case about vsetvl{i} x0, x0, {vtypei|rs2} without keeping VLMAX the same): when rs1 is x0, so that the current value of vl is used, lx ensures that vl, VLMAX and the SEW/LMUL ratio remain the same if possible; if that is not possible, vill is set.
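The lx behaviour described above can be sketched as a small Python model (an illustration only; the names LEGAL_LMUL and derive_lmul are hypothetical, and returning None stands in for setting vill). It assumes the legal range is 1/8 through 8, i.e. the seven explicit states.

```python
from fractions import Fraction

# The seven explicit LMUL states named in the proposal: lf8..lf2, l1..l8.
LEGAL_LMUL = {
    Fraction(1, 8), Fraction(1, 4), Fraction(1, 2),
    Fraction(1), Fraction(2), Fraction(4), Fraction(8),
}

def derive_lmul(old_sew, old_lmul, new_sew):
    """Return the LMUL that keeps SEW/LMUL unchanged, or None to model vill."""
    new_lmul = Fraction(new_sew, old_sew) * Fraction(old_lmul)
    return new_lmul if new_lmul in LEGAL_LMUL else None

# Going from e32,l1 to e8 keeps the ratio at 32 by choosing LMUL=1/4.
narrowed = derive_lmul(32, 1, 8)

# Going from e64,lf8 to e8 would need LMUL=1/64, which is out of range.
out_of_range = derive_lmul(64, Fraction(1, 8), 8)
```

Because the ratio is preserved, VLMAX is preserved too, so any vl that was legal under the first vsetvl{i} stays legal under the derived setting.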
To have our lf16 and lx too.

As I recommended initially, the lmul code in vsetvl{i} that would otherwise be used for lf16 is used for lx. That means the vtype lmul field will not have an lf16 value set by an explicit lmul code. However, we could allow lf16 to be calculated and stored by an instruction using lx. A read of vtype will correctly show the lf16 code.

Restoring vtype from the value read from vtype is then a multistep process.

If the lmul code is not lf16:
    issue vsetvl with the saved vtype in rs2
If the lmul code is lf16:
    1) change the sew code in rs2 to one larger than provided (e.g. use e16 if e8 is in the saved vtype)
    2) set lmul in rs2 to lf8
    3) issue vsetvl with the revised sew and lmul vtype in rs2
    4) change the sew code in rs2 back to the original (in the current example, from e16 back to e8)
    5) change lmul in rs2 to lx
    6) issue vsetvl with the original sew and lmul=lx vtype in rs2
    lmul will now be lf16 and the sew code the same as saved.

This approach will work for all values of sew and lf16 that are valid in vtype. The question is whether lf16 is worth the extra effort, or whether it potentially causes confusion/failures for application code.

Additional suggestion:

This is standalone from the proposal, but complements it, so I submit it here.

vsetvl is defined not to trap on illegal vtype codes. It sets vill instead and clears the remainder of vtype, which triggers an illegal operation in subsequent vector instructions that rely on it. This facilitates low overhead interrogation of the hardware to determine supported SEW size and ediv support (and any future functionality in vtype).
Arguably, vsetlmul{i} should trap if the original vtype has vill set. But I recommend more than that.

An implementation can trap on any instruction or sub-instruction to emulate, to take corrective action, to trigger a context switch, to report an anomaly, etc. As a result, an implementation could choose to provide the behaviour I suggest below and still be compliant.

However, I recommend we make a trap mandatory in certain cases of vsetvl{i} use. These traps would leave vl and vtype undisturbed, as they were before instruction execution.

Specifically, trap if:
1) lmul = lx and the original vtype has vill set.
2) lmul = lx and the calculated lmul is not valid.
3) rd = x0 and rs1 = x0 (use existing vl for rs1) and VLMAX changes (that is, the SEW/LMUL ratio is not maintained). (This only occurs with explicit sew/lmul. If lmul = lx it is either case 1 or 2.)
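The three trap conditions can be written as a single predicate. The Python sketch below is an illustration under stated assumptions, not a specification: must_trap and its parameters are hypothetical names, the legal LMUL range is taken to be 1/8 through 8, and the combination "explicit lmul with vill already set" is treated as non-trapping because the message does not address it.

```python
from fractions import Fraction

LEGAL_LMUL = {
    Fraction(1, 8), Fraction(1, 4), Fraction(1, 2),
    Fraction(1), Fraction(2), Fraction(4), Fraction(8),
}

def must_trap(old, new_sew, new_lmul, rd_is_x0, rs1_is_x0):
    """old is (sew, lmul), or None when vill is set; new_lmul is a Fraction or "lx"."""
    if new_lmul == "lx":
        if old is None:
            return True                       # case 1: vill set in original vtype
        old_sew, old_lmul = old
        derived = Fraction(new_sew, old_sew) * Fraction(old_lmul)
        return derived not in LEGAL_LMUL      # case 2: calculated lmul not valid
    if rd_is_x0 and rs1_is_x0:
        if old is None:
            return False                      # assumption: not covered by the proposal
        old_sew, old_lmul = old
        # case 3: existing vl is kept, so the SEW/LMUL ratio must not change
        return Fraction(new_sew) / new_lmul != Fraction(old_sew) / old_lmul
    return False
```

For example, with a previous e32,l1 setting, an explicit e8,lf4 under rd = x0, rs1 = x0 keeps the ratio and does not trap, whereas an explicit e8,l1 changes VLMAX and would.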
Case 1. Without a precise trap, determining the cause of vill requires a traceback not only to the last vsetvl{i} but also to at least the previous one.

Case 2. Without a precise trap, determining the probable values in sew and lmul requires a traceback not only to the last vsetvl{i} but also to at least the previous one.

Case 3. There is no compelling reason to use rd = x0 and rs1 = x0 to interrogate the hardware, so maintaining VLMAX is the probable intent given that the existing vl is provided. Without a precise trap, determining the probable values in sew and lmul requires a traceback not only to the last vsetvl{i} but also to at least the previous one.

On 2020-02-07 1:40 a.m., Krste Asanovic wrote:
David Horner
I left the encoding unspecified in the proposal. That was intentional as I saw various tradeoffs. However, I now recommend the codes be in order of increasing VLMAX value as so:
This allows for a quick "andi and bz" check of the "special" LMUL 1/16.

It also allows the formula

    new LMUL = ( new SEW / old SEW ) * old LMUL

to be implemented as

    vlmul = new vsew - old vsew + old vlmul

The first column could be inverted to allow the original code values with no appreciable hardware cost, but it complicates software for the two above cases.
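The encoding table itself did not survive in this archive, so the Python sketch below uses an ASSUMED encoding that merely satisfies the properties stated above: codes increase with VLMAX, and the special 1/16 (lx) code is zero. The dictionaries VLMUL_CODE and VSEW_CODE and both function names are hypothetical; vsew is taken to be log2(SEW/8).

```python
from fractions import Fraction

# ASSUMED encoding, ordered by increasing VLMAX; not the author's table.
VLMUL_CODE = {
    Fraction(1, 16): 0,  # special: lx in vsetvl{i}, lf16 if stored in vtype
    Fraction(1, 8): 1, Fraction(1, 4): 2, Fraction(1, 2): 3,
    Fraction(1): 4, Fraction(2): 5, Fraction(4): 6, Fraction(8): 7,
}
VSEW_CODE = {8: 0, 16: 1, 32: 2, 64: 3}  # assumed: vsew = log2(SEW / 8)

def new_vlmul(new_sew, old_sew, old_lmul):
    """vlmul = new vsew - old vsew + old vlmul, on the encoded values."""
    return VSEW_CODE[new_sew] - VSEW_CODE[old_sew] + VLMUL_CODE[Fraction(old_lmul)]

def is_special(vlmul_code):
    """Models the "andi and bz" check: mask the 3-bit field, test for zero."""
    return (vlmul_code & 0b111) == 0
```

Since both fields are base-2 logarithms under this assumption, the multiply and divide in the LMUL formula reduce to an add and a subtract on the codes, which is the simplification the message points out.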
On 2020-02-09 12:44 a.m., David Horner via Lists.Riscv.Org wrote:

    Special lmul code in vsetvl{i} to derive LMUL from existing SEW/LMUL and provided sew code.