Re: Vector Task Group minutes 2020/5/15

Nick Knight

I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.

However, I do have a concern that removing LMUL > 1 memory operations will lead to code bloat. This is purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:

# C code:
# int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
# keep N in a0 and &x[0] in a1

# "BEFORE" (original RVV code):
loop:
vsetvli t0, a0, e8,m8
vle8.v v0, (a1)
vadd.vi v0, v0, 1
vse8.v v0, (a1)
add a1, a1, t0
sub a0, a0, t0
bnez a0, loop

# "AFTER" removing LMUL > 1 loads/stores:
loop:
vsetvli t0, a0, e8,m8   # t0 = elements this iteration
mv t1, t0               # t1 = elements left to transfer
mv a2, a1               # a2 = working copy of the base pointer

# loads:
vsetvli t2, t1, e8,m1   # t2 = elements granted at LMUL=1
vle8.v v0, (a2)
add a2, a2, t2          # advance pointer (one byte per element)
sub t1, t1, t2          # decrement remaining-element count
vsetvli t2, t1, e8,m1
vle8.v v1, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v2, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v3, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v4, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v5, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v6, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v7, (a2)

# cast instructions ...

vsetvli x0, t0, e8,m8   # back to LMUL=8 for the compute
vadd.vi v0, v0, 1

# more cast instructions ...
mv t1, t0               # reset remaining-element count
mv a2, a1               # reset working pointer

# stores:
vsetvli t2, t1, e8,m1   # same strip-mining pattern as the loads
vse8.v v0, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v1, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v2, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v3, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v4, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v5, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v6, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v7, (a2)

add a1, a1, t0
sub a0, a0, t0
bnez a0, loop
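
For a rough sense of the bloat: the "BEFORE" loop body is 7
instructions, while the "AFTER" body is about 70 by my count (3 of
setup, 30 for the loads, 2 for the compute, 2 more of setup, 30 for
the stores, 3 for loop control), and that is before counting the
elided casts; roughly a 10x static code size increase for this one
loop.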


On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
The precise data layout pattern does not matter.

What matters is that a single distribution pattern is agreed upon to
avoid fragmenting the software ecosystem.

With my additional restriction, the load/store side of the design
is greatly simplified, allowing for very simple implementations.

The main drawback of my restriction is the overhead of the cast
instruction: how does an aggressive implementation avoid it? The
cast instruction must rearrange data to translate between the
LMUL!=1 and LMUL=1 data layouts; my proposal requires these casts
to be executed between any loads/stores (which always assume
LMUL=1) and compute instructions that use LMUL!=1. I think this can
sometimes be done for "free" by carefully planning your compute
instructions. For example, a series of vld instructions with
LMUL=1, followed by a cast to LMUL>1 with the same register group
as destination, can be macro-op fused. I don't think the same thing
can be done for vst instructions, unless the implementation
macro-op fuses a longer sequence consisting of cast / vst / clear
register group (or some other operation that overwrites the cast
destination, indicating the cast is superfluous and only used by
the stores).
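
As a sketch of the fusible load-side pattern (the "vcast" mnemonic
and operand form below are invented purely for illustration; no
such instruction is in the current spec):

vsetvli t2, t1, e8,m1
vle8.v v0, (a2)        # first of eight LMUL=1 loads into v0..v7
add a2, a2, t2
sub t1, t1, t2
# ... six more such load groups, ending with vle8.v v7 ...
vsetvli x0, t0, e8,m8
vcast v0, v0           # illustrative cast: reinterpret v0..v7 as
                       # one LMUL=8 group; a front-end recognizing
                       # the load/vcast pair could fuse it so the
                       # loads write the LMUL=8 layout directly
vadd.vi v0, v0, 1      # compute then proceeds at LMUL=8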

Guy

On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
>
> This is v0.8 with SLEN=8.


