Keep in mind:
a) I amended my proposal to reduce the code bloat identified by Nick
b) the effect of the bloat is almost entirely about text segment size, not power or instruction bandwidth, because these are vector instructions that are already amortizing their overhead over VLEN/SEW operations (often over 64 ops).
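To put a rough number on that text-segment cost, here is a quick tally (mine, not from the thread) of static instructions per loop iteration in Nick's BEFORE/AFTER listings below, with the elided cast instructions excluded:

```c
/* Static instruction count per iteration of Nick's AFTER loop below.
 * My own tally, not from the thread; cast instructions are excluded
 * because they are elided in the listing. */
static int after_count(int regs)            /* regs = LMUL, here 8 */
{
    int per_group = 4;                      /* vsetvli, vle8/vse8, add, sub */
    int loads  = per_group * (regs - 1) + 2; /* last register skips add/sub */
    int stores = loads;
    /* outer vsetvli + 2 mv, compute vsetvli + vadd, 2 mv, tail add/sub/bnez */
    return 1 + 2 + loads + 2 + 2 + stores + 3;
}

static int before_count(void)
{
    return 7;   /* vsetvli, vle8, vadd, vse8, add, sub, bnez */
}
```

By this count the AFTER loop is 70 static instructions versus 7 in the BEFORE loop, a 10x text-size difference for this kernel.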
LMUL>1 has almost nothing to do with energy efficiency; it’s almost entirely about storage efficiency (so unused vector registers don’t go to waste), and a little to do with performance (slightly higher memory and compute utilization is possible with longer vectors). Instruction fetch energy is dwarfed by the actual compute and memory access energy for VLEN bits per instruction.
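A back-of-envelope sketch of that amortization; VLEN=512 is an assumed example width here, not a value from the thread:

```c
/* One vector instruction at LMUL=1 covers VLEN/SEW element operations,
 * so its fetch overhead is amortized over that many ops.
 * VLEN = 512 bits is an assumed example width. */
enum { VLEN = 512 };

static int ops_per_insn(int sew)    /* sew in bits: 8, 16, 32, or 64 */
{
    return VLEN / sew;
}
```

With these assumptions, SEW=8 gives 64 element ops per instruction, matching the "often over 64 ops" figure; wider VLEN only increases the amortization.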
I don’t have a citation to give, but I have direct experience implementing and measuring the effect. The FPGA-based vector engine I helped design used less energy per op and offered faster absolute performance than the hard ARM processor used as the host, despite the ARM having a 7x clock speed advantage.
Guy
On Thu, May 28, 2020 at 9:02 PM Alex Solomatnikov <sols@...> wrote: Code bloat is important - not just the number of load and store instructions but also the additional vsetvl/i instructions. This was one of the reasons for vle8, vle16 and others.
LMUL>1 is also great for energy/power, because a large percentage of it, typically >30%, is spent on instruction fetch/handling even in simple processors [1]. LMUL>1 reduces the number of instructions for a given amount of work, and hence the energy/power spent on instruction fetch/handling.
Alex
On Wed, May 27, 2020 at 4:59 PM Guy Lemieux <glemieux@...> wrote: Nick, thanks for that code snippet, it's really insightful.
I have a few comments:
a) this is for LMUL=8, the worst-case (most code bloat)
b) this would be automatically generated by a compiler, so visuals are
not meaningful, though code storage may be an issue
c) the repetition of vsetvli and sub instructions is not needed; the
programmer may assume that all vector registers are equal in size
d) the vsetvli / add / sub instructions have minimal runtime cost,
since they behave like scalar operations
e) the 8-fold repetition (or whatever LMUL you want) of vle/vse and
the add can be replaced by a change in the ISA that handles a set of
registers in a register group automatically, eg:
instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2
we can allow vle to operate on register groups, processed one register
at a time in sequence, by doing this just once:
vle8.v v0, (a2), m8 // does 8 independent loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc
That way ALL of the code bloat is now gone.
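A rough C model of the proposed group-load semantics (all names here are hypothetical; this only illustrates the addressing pattern, not a real instruction or intrinsic):

```c
#include <stdint.h>
#include <string.h>

/* Hedged model of the proposed "vle8.v vd, (rs1), m8": LMUL independent
 * LMUL=1 loads, performed one register at a time in sequence, with
 * register i taking vl bytes from base + i*vl. VLEN_BYTES, vreg, and
 * group_load8 are all made-up names for illustration. */
enum { VLEN_BYTES = 16 };                  /* example VLEN = 128 bits */
typedef struct { uint8_t b[VLEN_BYTES]; } vreg;

static void group_load8(vreg v[], int lmul, const uint8_t *base, int vl)
{
    for (int i = 0; i < lmul; i++)         /* one register per step */
        memcpy(v[i].b, base + (size_t)i * (size_t)vl, (size_t)vl);
}
```

Under this model the eight vle/add pairs collapse into a single sequenced operation, which is exactly the bloat reduction being claimed.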
Ciao,
Guy
On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:
>
> I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.
>
> However, I do have a concern that removing LMUL > 1 memory operations leads to code bloat. This is all purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:
>
> # C code:
> # int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
> # keep N in a0 and &x[0] in a1
>
> # "BEFORE" (original RVV code):
> loop:
> vsetvli t0, a0, e8,m8
> vle8.v v0, (a1)
> vadd.vi v0, v0, 1
> vse8.v v0, (a1)
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> # "AFTER" removing LMUL > 1 loads/stores:
> loop:
> vsetvli t0, a0, e8,m8
> mv t1, t0
> mv a2, a1
>
> # loads:
> vsetvli t2, t1, e8,m1
> vle8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v7, (a2)
>
> # cast instructions ...
>
> vsetvli x0, t0, e8,m8
> vadd.vi v0, v0, 1
>
> # more cast instructions ...
> mv t1, t0
> mv a2, a1
>
> # stores:
> vsetvli t2, t1, e8,m1
> vse8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v7, (a2)
>
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
>>
>> The precise data layout pattern does not matter.
>>
>> What matters is that a single distribution pattern is agreed upon to
>> avoid fragmenting the software ecosystem.
>>
>> With my additional restriction, the load/store side of an
>> implementation is greatly simplified, allowing for simple
>> implementations.
>>
>> The main drawback of my restriction is the overhead of the cast
>> instruction in an aggressive implementation. The cast instruction
>> must rearrange data to translate between the LMUL!=1 and LMUL=1 data
>> layouts, and my proposal requires these casts to be executed between
>> any loads/stores (which always assume LMUL=1) and compute
>> instructions that use LMUL!=1. I think this can sometimes be done for
>> "free" by carefully planning your compute instructions. For example, a
>> series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
>> the same register group destination can be macro-op fused. I don't
>> think the same thing can be done for vst instructions, unless it
>> macro-op fuses a longer sequence consisting of cast / vst / clear
>> register group (or some other operation that overwrites the cast
>> destination, indicating the cast is superfluous and only used by the
>> stores).
>>
>> Guy
>>
>> On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
>> >
>> > This is v0.8 with SLEN=8.
>>