Nick, thanks for that code snippet, it's really insightful.
I have a few comments:
a) this is for LMUL=8, the worst case (most code bloat)
b) this code would be generated automatically by a compiler, so its visual appearance is not meaningful, though code storage may be an issue
c) the repetition of the vsetvli and sub instructions is not needed; the programmer may assume that all vector registers are equal in size
d) the vsetvli / add / sub instructions have minimal runtime cost because they behave like scalar operations
e) the 8-fold repetition (or whatever LMUL you want) of the vle/vse and add instructions can be avoided by a change in the ISA that handles a set of registers in a register group automatically, e.g.:
instead of this, 8 times for v0 to v7:

    vle8.v v0, (a2)
    add a2, a2, t2
we can allow vle to operate on register groups, but done one register at a time in sequence, by doing this just once:

    vle8.v v0, (a2), m8   // does 8 independent loads: to v0 from address a2, v1 from a2+vl, v2 from a2+2*vl, etc.
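For concreteness, a sketch of the full loop under this hypothetical encoding (the ", m8" suffix and its one-register-at-a-time semantics are my proposal, not part of the current spec):

    loop:
        vsetvli t0, a0, e8,m8
        vle8.v v0, (a1), m8   // hypothetical: loads v0..v7 in sequence from a1, a1+vl, ..., a1+7*vl
        vadd.vi v0, v0, 1
        vse8.v v0, (a1), m8   // hypothetical: stores v0..v7 in sequence
        add a1, a1, t0
        sub a0, a0, t0
        bnez a0, loop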
That way ALL of the code bloat is now gone.
Ciao, Guy
On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:

I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.
However, I do have a concern that removing LMUL > 1 memory operations could cause code bloat. This is purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:
# C code:
# int8_t x[N];
# for (int i = 0; i < N; ++i) ++x[i];
# keep N in a0 and &x[0] in a1
# "BEFORE" (original RVV code): loop: vsetvli t0, a0, e8,m8 vle8.v v0, (a1) vadd.vi v0, v0, 1 vse8.v v0, (a1) add a1, a1, t0 sub a0, a0, t0 bnez a0, loop
# "AFTER" removing LMUL > 1 loads/stores: loop: vsetvli t0, a0, e8,m8 mv t1, t0 mv a2, a1
# loads:
    vsetvli t2, t1, e8,m1
    vle8.v v0, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vle8.v v1, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vle8.v v2, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vle8.v v3, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vle8.v v4, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vle8.v v5, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vle8.v v6, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vle8.v v7, (a2)
# cast instructions ...
    vsetvli x0, t0, e8,m8
    vadd.vi v0, v0, 1
# more cast instructions ...
    mv t1, t0
    mv a2, a1
# stores:
    vsetvli t2, t1, e8,m1
    vse8.v v0, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vse8.v v1, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vse8.v v2, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vse8.v v3, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vse8.v v4, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vse8.v v5, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vse8.v v6, (a2)
    add a2, a2, t2
    sub t1, t1, t2
    vsetvli t2, t1, e8,m1
    vse8.v v7, (a2)
    add a1, a1, t0
    sub a0, a0, t0
    bnez a0, loop
On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
The precise data layout pattern does not matter.
What matters is that a single distribution pattern is agreed upon to avoid fragmenting the software ecosystem.
With my additional restriction, the load/store side of an implementation is greatly simplified, allowing for very simple hardware.
The main drawback of my restriction is the overhead of the cast instruction: how can an aggressive implementation avoid it? The cast instruction must rearrange data to translate between LMUL!=1 and LMUL=1 data layouts; my proposal requires these casts to be executed between any loads/stores (which always assume LMUL=1) and compute instructions that use LMUL!=1.

I think this can sometimes be done for "free" by carefully planning your compute instructions. For example, a series of vld instructions with LMUL=1, followed by a cast to LMUL>1 with the same register group as destination, can be macro-op fused. I don't think the same can be done for vst instructions, unless the implementation fuses a longer sequence consisting of cast / vst / clear register group (or some other operation that overwrites the cast destination, indicating the cast is superfluous and used only by the stores).
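To make the fusion idea concrete, here is a sketch of the load side, using vcast as an illustrative mnemonic for the cast instruction (vcast is not a real instruction; its spelling and operands are placeholders):

    vsetvli t2, t1, e8,m1
    vle8.v v0, (a2)       # LMUL=1 loads fill v0 through v7, as in the example above
    ...
    vle8.v v7, (a2)
    vcast.m8 v0           # hypothetical cast: rearranges the v0..v7 group from the
                          # LMUL=1 layout to the LMUL=8 layout; an aggressive
                          # implementation could macro-op fuse this with the
                          # preceding vle8 sequence, making the cast free
    vsetvli x0, t0, e8,m8
    vadd.vi v0, v0, 1     # compute at LMUL=8 on the cast data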
Guy
On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
This is v0.8 with SLEN=8.