Re: Vector Task Group minutes 2020/5/15


Nick Knight
 

Hi Guy,

Thanks for your reply. I'll leave a few quick responses, and would like to hear opinions from others on the task group.

On Wed, May 27, 2020 at 4:59 PM Guy Lemieux <glemieux@...> wrote:
Nick, thanks for that code snippet, it's really insightful.

I have a few comments:

a) this is for LMUL=8, the worst-case (most code bloat)

Agreed.
 
b) this would be automatically generated by a compiler, so visuals are
not meaningful, though code storage may be an issue

Agreed, especially regarding code storage.
 
c) the repetition of vsetvli and sub instructions is not needed;
programmer may assume that all vector registers are equal in size

When loop strip-mining, we would like to handle the "fringe case" (if any) VLEN-independently. Suppose, e.g., the last loop iteration doesn't fill all eight vector registers in the group. It wasn't obvious to me how to handle that without more complex fringe-case logic.
 
d) the vsetvli / add / sub instructions have minimal runtime due to
behaving like scalar operations

Agreed: in particular, on machines with separate scalar/vector pipelines, their execution can overlap with vector instructions. This doesn't hide the energy cost, however.
 
e) the repetition 8 times (or whatever LMUL you want) for vle/vse and
the add can be mimicked by a change in the ISA to handle a set of
registers in a register group automatically, e.g.:

instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2

we can allow vle to operate on register groups, but done one register
at a time in sequence, by doing this just once:
vle8.v  v0, (a2), m8  // does 8 independent loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc

This sounds to me like a redefinition of unit-stride loads/stores with LMUL > 1, just changing the pattern in which they access the register file. (An observation, not a criticism.)
 
That way ALL of the code bloat is now gone.

It seems that way to me, too.

Best,
Nick
 

Ciao,
Guy


On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:
>
> I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.
>
> However, I do have a concern about code bloat if we remove LMUL > 1 memory operations. This is purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:
>
> # C code:
> # int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
> # keep N in a0 and &x[0] in a1
>
> # "BEFORE" (original RVV code):
> loop:
> vsetvli t0, a0, e8,m8
> vle8.v v0, (a1)
> vadd.vi v0, v0, 1
> vse8.v v0, (a1)
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> # "AFTER" removing LMUL > 1 loads/stores:
> loop:
> vsetvli t0, a0, e8,m8
> mv t1, t0
> mv a2, a1
>
> # loads:
> vsetvli t2, t1, e8,m1
> vle8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v7, (a2)
>
> # cast instructions ...
>
> vsetvli x0, t0, e8,m8
> vadd.vi v0, v0, 1
>
> # more cast instructions ...
> mv t1, t0
> mv a2, a1
>
> # stores:
> vsetvli t2, t1, e8,m1
> vse8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v7, (a2)
>
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
>>
>> The precise data layout pattern does not matter.
>>
>> What matters is that a single distribution pattern is agreed upon to
>> avoid fragmenting the software ecosystem.
>>
>> With my additional restriction, the load/store side of an
>> implementation is greatly simplified, allowing for simple
>> implementations.
>>
>> The main drawback of my restriction is the overhead of the
>> cast instruction in an aggressive implementation. The cast
>> instruction must rearrange data to translate between LMUL!=1 and
>> LMUL=1 data layouts; my proposal requires these casts to be executed
>> between any load/stores (which always assume LMUL=1) and compute
>> instructions which use LMUL!=1. I think this can sometimes be done for
>> "free" by carefully planning your compute instructions. For example, a
>> series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
>> the same register group destination can be macro-op fused. I don't
>> think the same thing can be done for vst instructions, unless it
>> macro-op fuses a longer sequence consisting of cast / vst / clear
>> register group (or some other operation that overwrites the cast
>> destination, indicating the cast is superfluous and only used by the
>> stores).
>>
>> Guy
>>
>> On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
>> >
>> > This is v0.8 with SLEN=8.
