Re: Vector Task Group minutes 2020/5/15


Alex Solomatnikov
 

Code bloat is important - not just the number of load and store instructions but also the additional vsetvl/vsetvli instructions. This was one of the reasons for adding vle8, vle16 and the other width-encoded loads/stores.

LMUL>1 is also great for energy/power because a large percentage of energy/power, typically >30%, is spent on instruction fetch/handling even in simple processors [1]. LMUL>1 reduces the number of instructions for a given amount of work and energy/power spent on instruction fetch/handling.
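To make the instruction-count effect concrete, here is a sketch (essentially the strip-mined loop Nick posts below): at e8,m8 a single vsetvli covers up to eight registers' worth of bytes, so the loop overhead (vsetvli/add/sub/bnez) is paid one-eighth as often as it would be at m1.

```asm
# Sketch only: byte-increment loop at LMUL=8. At e8,m1 the same
# loop body would run up to 8x more iterations, re-paying the
# vsetvli/add/sub/bnez overhead each time.
loop:
    vsetvli t0, a0, e8,m8    # vl = up to 8 registers' worth of bytes
    vle8.v  v0, (a1)         # one instruction loads the whole group
    vadd.vi v0, v0, 1
    vse8.v  v0, (a1)
    add     a1, a1, t0       # advance pointer by elements processed
    sub     a0, a0, t0       # decrement remaining element count
    bnez    a0, loop
```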

Alex

1. R. Hameed, W. Qadeer, M. Wachs, O. Azizi, et al., "Understanding sources of inefficiency in general-purpose chips," Proceedings of the 37th ISCA, 2010. dl.acm.org


On Wed, May 27, 2020 at 4:59 PM Guy Lemieux <glemieux@...> wrote:
Nick, thanks for that code snippet, it's really insightful.

I have a few comments:

a) this is for LMUL=8, the worst case (most code bloat)

b) this would be automatically generated by a compiler, so the visual
clutter is not meaningful, though code storage may be an issue

c) the repetition of the vsetvli and sub instructions is not needed; the
programmer may assume that all vector registers are equal in size

d) the vsetvli / add / sub instructions have minimal runtime cost because
they behave like scalar operations

e) the 8-fold repetition (or whatever LMUL you want) of the vle/vse and
add instructions can be avoided by a change in the ISA that handles a
set of registers in a register group automatically, eg:

instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2

we can allow vle to operate on register groups, processed one register
at a time in sequence, by doing this just once:
vle8.v  v0, (a2), m8  // does 8 independent loads: v0 from address a2,
                      // v1 from a2+vl, v2 from a2+2*vl, etc

That way ALL of the code bloat is now gone.
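For clarity, here is a sketch of what the proposed form could mean (the ", m8" suffix is notation from this thread, not a ratified RVV encoding): one fetched and decoded instruction, sequenced internally as eight independent unit-stride LMUL=1 loads.

```asm
// Hypothetical semantics sketch of: vle8.v v0, (a2), m8
// (illustrative only; the suffix and sequencing are this thread's
// proposal, not existing RVV behavior)
vle8.v v0, (a2)          // bytes [a2,        a2 +   vl)
vle8.v v1, (a2 + vl)     // bytes [a2 +   vl, a2 + 2*vl)
// ... and so on through ...
vle8.v v7, (a2 + 7*vl)   // bytes [a2 + 7*vl, a2 + 8*vl)
```

The scalar add/vsetvli repetition disappears from the program text, while the memory unit still only ever sees simple LMUL=1 accesses.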

Ciao,
Guy


On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:
>
> I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.
>
> However, I do have a concern about code bloat if LMUL > 1 memory operations are removed. This is purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:
>
> # C code:
> # int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
> # keep N in a0 and &x[0] in a1
>
> # "BEFORE" (original RVV code):
> loop:
> vsetvli t0, a0, e8,m8
> vle8.v v0, (a1)
> vadd.vi v0, v0, 1
> vse8.v v0, (a1)
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> # "AFTER" removing LMUL > 1 loads/stores:
> loop:
> vsetvli t0, a0, e8,m8
> mv t1, t0
> mv a2, a1
>
> # loads (t1 tracks the remaining element count):
> vsetvli t2, t1, e8,m1
> vle8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v7, (a2)
>
> # cast instructions ...
>
> vsetvli x0, t0, e8,m8
> vadd.vi v0, v0, 1
>
> # more cast instructions ...
> mv t1, t0
> mv a2, a1
>
> # stores (t1 again tracks the remaining element count):
> vsetvli t2, t1, e8,m1
> vse8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v7, (a2)
>
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
>>
>> The precise data layout pattern does not matter.
>>
>> What matters is that a single distribution pattern is agreed upon to
>> avoid fragmenting the software ecosystem.
>>
>> With my additional restriction, the load/store side of an
>> implementation is greatly simplified, allowing for simple
>> implementations.
>>
>> The main drawback of my restriction is the overhead of the cast
>> instruction in an aggressive implementation: can it be avoided? The cast
>> instruction must rearrange data to translate between LMUL!=1 and
>> LMUL=1 data layouts; my proposal requires these casts to be executed
>> between any load/stores (which always assume LMUL=1) and compute
>> instructions which use LMUL!=1. I think this can sometimes be done for
>> "free" by carefully planning your compute instructions. For example, a
>> series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
>> the same register group destination can be macro-op fused. I don't
>> think the same thing can be done for vst instructions, unless it
>> macro-op fuses a longer sequence consisting of cast / vst / clear
>> register group (or some other operation that overwrites the cast
>> destination, indicating the cast is superfluous and only used by the
>> stores).
>>
>> Guy
>>
>> On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
>> >
>> > This is v0.8 with SLEN=8.
>>
>>
>>


