make SEW be the largest element width
I added my proposal to github:
https://github.com/riscv/riscv-v-spec/issues/425

Appended below for those not following github.

Krste

This proposal is a modification of an earlier idea to add an effective
element width to load/store instructions, to mitigate dropping the
fixed-width load/stores and to provide greater efficiency for
mixed-width floating-point code.

This proposal redefines SEW to be the largest element width (LEW?),
and correspondingly the definition of widening/narrowing operations.
Previously a double-widening add was defined as:

    2*SEW = SEW + SEW

the new proposal is to specify

    SEW = SEW/2 + SEW/2

This proposal does not change the behavior/implementation of existing
instructions, except to change how the effective EW and effective LMUL
are obtained from vtype.

For load/store instructions, we could modify these to have relative
element widths that are fractions of SEW {SEW, SEW/2, SEW/4, SEW/8}.
Assembly syntax could be something like vle, vlef2, vlef4, vlef8.
There is a readability challenge in that these are relative to the last
vtype setting, and also, when SEW is less than 64, some become useless.
I think for this reason we should stick with fixed vle8, vle16, vle32,
vle in the base encoding, i.e., widths {8,16,32,SEW}.  These are more
readable and can be used to interleave load/stores of larger-than-SEW
values without changing vtype (e.g., moving 32b values when SEW=16).
Extending past the base, we could add relative-EEW load/stores using an
unused mop bit, for example.

Mask registers have an EEW of SEW/LMUL, and so defining the relative
sizes as fractions of SEW also makes it more likely that code can
save/restore a mask register without changing vtype.  The fixed sizes
will cover some of these cases, though code might have to save/restore
more than needed if SEW/LMUL < 8.

There is possibly a minor hardware-checking advantage to knowing that
SEW holds the largest possible element width, since following
instructions cannot request an effective SEW larger than this.

Even with fixed-size load/stores, if the sizes are 8,16,32,SEW then all
of these will also be legal on standard implementations.

Some of the examples we've been discussing:

----------------------------------------------------------------------

The "worst-case" example from the TG slide:

# int32_t a[i] = int8_t b[i] << 15

# With fixed-width load/store
vsetvli t0, a0, e32,m1
vlb.v   v4, (rb)
vsll.vi v4, v4, 15
vsw.v   v4, (ra)

# This proposal, assuming quad-widening and fractional LMUL
vsetvli t0, a0, e32,m1
vle8.v  v1, (rb)        # load bytes into e8,m1 register
vqcvt.x.x.v v4, v1      # Quad-widen to 32 bits
vsll.vi v4, v4, 15      # 32-bit shift
vse32.v v4, (ra)        # could also use vse

Obviously, adding a quad-widening left shift would remove any
difference between the two.

----------------------------------------------------------------------

The previous widening compute operation:

# Widening C[i] += A[i]*B[i], where A and B are FP16, C is FP32
vsetvli t0, a0, e32,m8  # vtype SEW=32b
vle16.v v4, (a1)
vle16.v v8, (a2)
vle32.v v16, (a3)       # Get previous 32b values, no vsetvli needed.
vfwmacc.vv v16, v4, v8  # EEW,ELMUL of source operands is SEW/2,LMUL/2
vse32.v v16, (a3)       # EEW=32b, no vsetvli needed

----------------------------------------------------------------------

The example code from #362.

Original fixed-width code, assuming 32b float:

vsetvli t0, a0, e32,m8  # Assuming 32b float
loop:
  [...]                 # Other instructions with fixed vl and sew
  vlb.v   v0, (x12)     # Get byte value
  vadd.vx v16, v0, x11  # Add scalar integer offset
  vfcvt.f.x.v v16, v16  # Convert to 32b floating value
  [...]
converts to:

vsetvli t0, a0, e32,m8
loop:
  [...]
  vle8.v  v0, (x12)     # Load bytes
  vqcvt.x.x.v v16, v0   # Quad-widening sign-extension
  vadd.vx v16, v16, x11 # Add offset
  vfcvt.f.x.v v16, v16  # Convert to float
  [...]

which is just one more instruction in the inner loop.  With a
quad-widening add (vqadd.vx), there would be no penalty in this case:

vsetvli t0, a0, e32,m8
loop:
  [...]
  vle8.v  v0, (x12)     # Load bytes
  vqadd.vx v16, v0, x11 # Add offset, quad-widen
  vfcvt.f.x.v v16, v16  # Convert to float
  [...]

The vle8 version also reduces the number of vector registers tied up
buffering load values when the loop is software-pipelined/loop-unrolled,
allowing more loads in flight for a given LMUL.  Basically, in
software-pipelined/unrolled loops, widen-at-load instructions tie up
more architectural registers than widen-at-use when the widening is
part of the consuming instruction, reducing flexibility in scheduling
and/or reducing LMUL.  Even if an explicit widen instruction is used,
at that point the values are already in registers and the widen will
proceed at max rate, generally reducing architectural register
occupancy versus having the wider values in flight from memory.  For
example:

vsetvli t0, a0, e32,m8
vlb.v   v0, ()          # v0-v7 tied up while
.                       # scheduling around memory latency
.
.
vadd.vx v16, v0, x11    # consume here
vlb.v   v0, ()          # schedule next iteration here
.

versus

vsetvli t0, a0, e32,m8
vle8.v  v0, (x12)       # only v0-v1 tied up
.
vle8.v  v2, (x13)       # fetch next iteration into v2-v3
.
.
vqcvt.x.x.v v16, v0     # Quad-widening sign-extension
vadd.vx v16, v16, x11   # Add offset
.

As mentioned earlier, the widened loads also might run at a slower rate
unless full register write-port bandwidth is provided.

----------------------------------------------------------------------

Other code example, extracted from 3x3 convolutions with 32b += 16b * 8b:

vsetvli t0, a0, e16,m2
vle8.v  v1, (x)         # occupies v1
vwcvt.x.x.v v4, v1      # widen v1 into v4-v5
vsetvli t0, a0, e32,m4
vwmacc.vx v8, x11, v4   # widening mul-add into v8-v11

----------------------------------------------------------------------
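[Editorial note: the following C sketch is not part of the original
message.  It illustrates the point above that the proposal only changes
how the effective element width (EEW) and effective LMUL (ELMUL) of a
double-widening operation's operands are derived from vtype.  The
function and type names, and the choice to carry LMUL in eighths so
fractional values stay integral, are assumptions made for illustration.]

#include <stdio.h>

struct eff { unsigned eew; unsigned elmul_eighths; };

/* Old definition: 2*SEW = SEW + SEW (vtype SEW names the source width). */
static void widen_old(unsigned sew, unsigned lmul_eighths,
                      struct eff *dst, struct eff *src)
{
    src->eew = sew;      src->elmul_eighths = lmul_eighths;
    dst->eew = 2 * sew;  dst->elmul_eighths = 2 * lmul_eighths;
}

/* New proposal: SEW = SEW/2 + SEW/2 (vtype SEW names the largest width). */
static void widen_new(unsigned sew, unsigned lmul_eighths,
                      struct eff *dst, struct eff *src)
{
    dst->eew = sew;      dst->elmul_eighths = lmul_eighths;
    src->eew = sew / 2;  src->elmul_eighths = lmul_eighths / 2;
}

int main(void)
{
    struct eff d, s;
    /* vfwmacc example above under the new scheme: vtype e32,m8. */
    widen_new(32, 64, &d, &s);
    printf("new: dst EEW=%u ELMUL=%g, src EEW=%u ELMUL=%g\n",
           d.eew, d.elmul_eighths / 8.0, s.eew, s.elmul_eighths / 8.0);
    /* Same operation under the old scheme: vtype e16,m4. */
    widen_old(16, 32, &d, &s);
    printf("old: dst EEW=%u ELMUL=%g, src EEW=%u ELMUL=%g\n",
           d.eew, d.elmul_eighths / 8.0, s.eew, s.elmul_eighths / 8.0);
    return 0;
}

Both derivations land on the same physical operands for the vfwmacc
example (destination EEW=32, ELMUL=8; sources EEW=16, ELMUL=4); what
changes is only which vtype setting (e32,m8 vs. e16,m4) they start from.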
David Horner
First thoughts below, as a response to #424: we need only consider
SEW = LEW (new) and 1/2LEW (POR).

On 2020-04-19 1:00 a.m., Krste Asanovic wrote:

When completed, is it LEWd?

> [...] Previously a double-widening add was defined as:
>
>     2*SEW = SEW + SEW
>
> the new proposal is to specify
>
>     SEW = SEW/2 + SEW/2

With the corresponding

    lmul = lmul/2 + lmul/2

which biases towards fractional lmul and thus interleaved vs. striped
groupings, a direction I endorse.  (I expect it is no surprise that I
lean towards interleave and enhancing fractional LMUL, hopefully to the
point that striped LMUL is a secondary mechanism, if not obsoleted.)

Thus SEW as LEW was one of my preferred options.
SEW at 1/2LEW is the current POR, and has the advantage that the
majority of operations are defined and performed at this SEW level.
Conditioning data for widening ops and setting masks occurs at the
1/2LEW level.  This assumes the dominant widening operations are double
and not quad/octal; I have little reason to suspect otherwise.
Straight conversions to 1/2LEW from 1/4 and 1/8 are also important, but
I have little reason to suspect they are so important as to compromise
the efficiency of other operations.

Thus, I believe SEW at 1/4LEW and 1/8LEW are of little value, much too
removed from where the real work occurs.  Once again I postulate that
the best choice is determined by the program's activities, as both
SEW = LEW and 1/2LEW have merit.
As a response to #424, this may be a tactical acceptance of my
assertion.  However, it may only be addressing the limitations/tradeoffs
inherent in the POR load/store and the recently proposed packed
fixed-width load/store and subsequent widening mechanisms.  All good
stuff, but not as general as my KEY POINT.  More comments to come.
David Horner
For those not following on github: On 2020-04-19 1:00 a.m., Krste
Asanovic wrote:
....
SEW and LMUL values are essential to correct code execution regardless
of the load/store width encoding.  They should be assembler directive
variables set automatically by vsetvli (and by vsetvl when its xs2
argument is statically defined).  For vsetvl with a dynamic xs2, a
manual assembler directive should be available.  This should help in
various situations, including validation that the SEW/LMUL ratio is
maintained by a given vsetvli, and also for load/store syntax: with
this in place the assembler can translate e8 to the correct SEW *
factor value in load/stores, eliminating the readability concern.

I agree that the base should have as robust an encoding as possible
without over-committing available bits.  Thus I would also want to be
able to move 32b values when SEW=16, and in addition I believe that
load/stores are so important, so pivotal (e.g. matrix transforms), that
flexibility and efficiency are both mandated.  The compressed
load/store format seeks to address efficiency.  The encoding needs the
flexibility of SEW * factors.

I propose the factors be dependent upon the current SEW value:

For SEW=8, the encoding yields factors of 1, 2, 4 and 8.
For SEW=16, the encoding yields factors of 1/2, 1, 2 and 4.
For SEW of 32 and above, the encoding yields factors of 1/4, 1/2, 1 and 2.

Thus we always support LEW = SEW*2 operations, and support load/store
of SEW/2 and SEW/4 when they exist.
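[Editorial note: the following C sketch is not part of the original
message.  It spells out the SEW-dependent factor table proposed above,
assuming a 2-bit width field in the load/store encoding; the field
layout and the function name are assumptions made for illustration.]

#include <stdio.h>

/* Effective element width for a load/store, given the current SEW and a
 * hypothetical 2-bit width field (0..3).  Factors are kept in eighths of
 * SEW so the fractional cases stay integral. */
static unsigned ls_eew(unsigned sew, unsigned field)
{
    static const unsigned f8[3][4] = {
        { 8, 16, 32, 64 },   /* SEW == 8 : factors 1, 2, 4, 8     */
        { 4,  8, 16, 32 },   /* SEW == 16: factors 1/2, 1, 2, 4   */
        { 2,  4,  8, 16 },   /* SEW >= 32: factors 1/4, 1/2, 1, 2 */
    };
    unsigned row = (sew == 8) ? 0 : (sew == 16) ? 1 : 2;
    return sew * f8[row][field] / 8;
}

int main(void)
{
    unsigned sews[] = { 8, 16, 32, 64 };
    for (unsigned i = 0; i < 4; i++) {
        printf("SEW=%2u -> EEW candidates:", sews[i]);
        for (unsigned f = 0; f < 4; f++)
            printf(" %u", ls_eew(sews[i], f));
        printf("\n");
    }
    return 0;
}

The printed table shows that every SEW setting can reach EEW = 2*SEW
(i.e., LEW), and that SEW/2 and SEW/4 are reachable whenever they are at
least 8 bits, matching the closing claim above.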