Vector Task Group minutes 2020/5/15


Guy Lemieux
 

Nick, thanks for that code snippet, it's really insightful.

I have a few comments:

a) this is for LMUL=8, the worst-case (most code bloat)

b) this would be automatically generated by a compiler, so visuals are
not meaningful, though code storage may be an issue

c) the repetition of vsetvli and sub instructions is not needed;
programmer may assume that all vector registers are equal in size

d) the vsetvli / add / sub instructions have minimal runtime due to
behaving like scalar operations

e) the repetition 8 times (or whatever LMUL you want) of vle/vse and
the add can be mimicked by a change in the ISA to handle a set of
registers in a register group automatically, e.g.:

instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2

we can allow vle to operate on register groups, handled one register
at a time in sequence, by doing this just once:
vle8.v v0, (a2), m8 // does 8 independent loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc

That way ALL of the code bloat is now gone.
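
As a rough sketch only (the ", m8" operand is the hypothetical register-group form described above, not an existing encoding, and the per-register vl handling is left open), the "AFTER" loop quoted below could then collapse back to something close to the original:

loop:
vsetvli t0, a0, e8,m8
vle8.v v0, (a1), m8   # hypothetical: v0 from a1, v1 from a1+vl, ..., v7 from a1+7*vl
# cast v0..v7 into the LMUL=8 layout ...
vadd.vi v0, v0, 1
# cast back to the LMUL=1 layout ...
vse8.v v0, (a1), m8   # hypothetical: the stores mirror the loads
add a1, a1, t0
sub a0, a0, t0
bnez a0, loop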

Ciao,
Guy

On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:

I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.

However, I do have a concern about code bloat from removing LMUL > 1 memory operations. This is all purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:

# C code:
# int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
# keep N in a0 and &x[0] in a1

# "BEFORE" (original RVV code):
loop:
vsetvli t0, a0, e8,m8
vle8.v v0, (a1)
vadd.vi v0, v0, 1
vse8.v v0, (a1)
add a1, a1, t0
sub a0, a0, t0
bnez a0, loop

# "AFTER" removing LMUL > 1 loads/stores:
loop:
vsetvli t0, a0, e8,m8
mv t1, t0
mv a2, a1

# loads:
vsetvli t2, t1, e8,m1
vle8.v v0, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v1, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v2, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v3, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v4, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v5, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v6, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v7, (a2)

# cast instructions ...

vsetvli x0, t0, e8,m8
vadd.vi v0, v0, 1

# more cast instructions ...
mv t1, t0
mv a2, a1

# stores:
vsetvli t2, t1, e8,m1
vse8.v v0, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v1, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v2, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v3, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v4, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v5, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v6, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v7, (a2)

add a1, a1, t0
sub a0, a0, t0
bnez a0, loop
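
(For scale, counting static instructions per strip-mined iteration and excluding the elided casts: the BEFORE loop is 7 instructions, while the AFTER loop is 3 + 30 + 2 + 2 + 30 + 3 = 70, i.e. roughly a 10x expansion of the loop body at LMUL=8.)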

On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:

The precise data layout pattern does not matter.

What matters is that a single distribution pattern is agreed upon to
avoid fragmenting the software ecosystem.

With my additional restriction, the load/store side of an
implementation is greatly simplified, allowing for simple
implementations.

The main drawback of my restriction is the question of how to avoid the
overhead of the cast instruction in an aggressive implementation. The cast
instruction must rearrange data to translate between LMUL!=1 and
LMUL=1 data layouts; my proposal requires these casts to be executed
between any load/stores (which always assume LMUL=1) and compute
instructions which use LMUL!=1. I think this can sometimes be done for
"free" by carefully planning your compute instructions. For example, a
series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
the same register group destination can be macro-op fused. I don't
think the same thing can be done for vst instructions, unless it
macro-op fuses a longer sequence consisting of cast / vst / clear
register group (or some other operation that overwrites the cast
destination, indicating the cast is superfluous and only used by the
stores).
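
For instance, a minimal sketch of the fusible load-side pattern at LMUL=4 ("vcast" is only a placeholder mnemonic for the proposed cast instruction, and the register/operand choices are purely illustrative):

vsetvli t2, t1, e8,m1
vle8.v v0, (a2)      # four LMUL=1 loads fill v0..v3 in turn (pointer updates omitted)
vle8.v v1, (a2)
vle8.v v2, (a2)
vle8.v v3, (a2)
vcast v0, m4         # placeholder: regroup v0..v3 from LMUL=1 layouts into one LMUL=4 layout
vsetvli x0, t1, e8,m4
vadd.vi v0, v0, 1    # compute at LMUL=4; the vle8.v sequence plus vcast is the fusion candidate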

Guy

On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:

This is v0.8 with SLEN=8.


Nick Knight
 

Hi Guy,

Thanks for your reply. I'll leave a few quick responses, and would like to hear opinions from others on the task group.

On Wed, May 27, 2020 at 4:59 PM Guy Lemieux <glemieux@...> wrote:
Nick, thanks for that code snippet, it's really insightful.

I have a few comments:

a) this is for LMUL=8, the worst-case (most code bloat)

Agreed.
 
b) this would be automatically generated by a compiler, so visuals are
not meaningful, though code storage may be an issue

Agreed, especially regarding code storage.
 
c) the repetition of vsetvli and sub instructions is not needed;
programmer may assume that all vector registers are equal in size

When loop strip-mining, we would like to handle the "fringe case" (if any) VLEN-independently. Suppose, e.g., the last loop iteration doesn't consume all eight vector registers in the group. It wasn't obvious to me how to handle this case without more complex fringe-case logic.
 
d) the vsetvli / add / sub instructions have minimal runtime due to
behaving like scalar operations

Agreed: in particular, on machines with separate scalar/vector pipelines their executions can be overlapped with vector instructions. This doesn't hide the energy cost, however.
 
e) the repetition 8 times (or whatever LMUL you want) of vle/vse and
the add can be mimicked by a change in the ISA to handle a set of
registers in a register group automatically, e.g.:

instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2

we can allow vle to operate on register groups, handled one register
at a time in sequence, by doing this just once:
vle8.v  v0, (a2), m8  // does 8 independent loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc

This sounds to me like a redefinition of unit-stride loads/stores with LMUL > 1, just changing the pattern in which they access the register file. (An observation, not a criticism.)
 
That way ALL of the code bloat is now gone.

It seems that way to me, too.

Best,
Nick
 


Alex Solomatnikov
 

Code bloat is important - not just the number of load and store instructions but also the additional vsetvl/i instructions. This was one of the reasons for vle8, vle16 and others.

LMUL>1 is also great for energy/power because a large percentage of energy/power, typically >30%, is spent on instruction fetch/handling even in simple processors [1]. LMUL>1 reduces the number of instructions for a given amount of work and energy/power spent on instruction fetch/handling.

Alex

1. Understanding sources of inefficiency in general-purpose chips

R. Hameed, W. Qadeer, M. Wachs, O. Azizi, et al. - Proceedings of the 37th …, 2010 - dl.acm.org


Guy Lemieux
 

Alex,

Keep in mind:

a) I amended my proposal to reduce the code bloat identified by Nick

b) the effect of the bloat is almost entirely about text segment size, not power or instruction bandwidth, because these are vector instructions that are already amortizing their overhead over VLEN/SEW operations (often over 64 ops).

LMUL>1 has almost nothing to do with energy efficiency; it's almost entirely about storage efficiency (so unused vectors don't go to waste) and a little bit to do with performance (slightly higher memory and compute utilization is possible with longer vectors). Instruction fetch energy is dwarfed by the actual compute and memory access energy for the VLEN bits processed per instruction.
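
For example (numbers purely illustrative): at VLEN=512 and SEW=8, a single vector instruction covers VLEN/SEW = 512/8 = 64 element operations, so its fetch/decode cost is spread over 64 ops; even at SEW=32 it is still spread over 16.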

I don’t have a citation to give, but I have direct experience implementing and measuring the effect. The FPGA-based vector engine I helped design used less energy per op and offered faster absolute performance than the hard ARM processor used as the host, despite the ARM having a 7x clock speed advantage.

Guy



Alex Solomatnikov
 

Guy,

FPGA energy/power data is not really relevant to general purpose ISA design, unless your goal is to design an ISA only for soft FPGA cores like Nios or MicroBlaze.

Please take a look at the paper I referenced. It has energy breakdowns not only for a simple RISC core but also for a SIMD+VLIW core (with 2 and 3 slots), which is a reasonable approximation for a vector core (with 2 or 3 instructions per cycle). These are based on designs that are industry-competitive in PPA (Tensilica). For the SIMD+VLIW core, instruction fetch/decode is still ~30%.

Here is a quote from Section 4.1: "SIMD and VLIW speed up the application by 10x, decreasing IF energy by 10x, but the percentage of energy going to IF does not change much. IF still consumes more energy than functional units." Figure 4 of the paper shows this energy breakdown.

Alex


Roger Espasa
 

I was trying to summarize/compare the different proposals. I talked to Grigoris and I think this is a correct summary:

                           v08                    v09                    Grigoris
LMUL=1  SLEN=VLEN          SLEN does not matter   SLEN does not matter   SLEN does not matter
LMUL=1  SLEN<=SEW          SLEN does not matter   SLEN does not matter   SLEN does not matter
LMUL=1  SEW < SLEN < VLEN  ignores SLEN           SLEN affects           --
LMUL>1                     memory ordering lost by using multiple regs
LMUL<1  SLEN=VLEN          --                     SLEN does not matter   LMUL affects (1)
LMUL<1  SLEN<=SEW          --                     SLEN does not matter   LMUL affects (1)
LMUL<1  SEW < SLEN < VLEN  --                     SLEN affects           --

In the table above
  • GREEN indicates memory layout is preserved inside register layout
  • YELLOW indicates memory layout is NOT PRESERVED inside register layout
  • (1) data is packed inside a container of LMUL*SEW bits
  • In Grigori's proposal a couple cases never occur
In terms of MUXING from the memory into the register file in the VPU lanes, we tried to summarize in
this second table when data coming from memory ends up in a lane different from the natural SLEN=VLEN arrangement:

                       v0.9     Grigoris
LMUL < 1  SEW < SLEN   always
LMUL = 1  SEW < SLEN   never
LMUL > 1               always   always

I think Grigori's proposal only suffers in the TAIL of a vector loop when VL is not a nice multiple of your VLEN and LMUL>1. Imagine your tail ends up with 3 elements and you are at LMUL=4. You will get the 3 elements "vertically" across the register groups, so loop EPILOG will take "3 issues" instead of potentially "just 1" (assuming the vector unit has more than one lane).

Is this a correct summary of the three proposals?

roger.



David Horner
 




The very significant correction is that for v09 with SLEN=VLEN, memory ordering IS preserved for LMUL>1. The LMUL>1 row should be:

                     v08                    v09                     Grigoris
LMUL>1  SLEN=VLEN    memory ordering lost   memory order retained   memory ordering lost
LMUL>1  SLEN<VLEN    memory order lost
This is a big win, but also the source of the possible software fragmentation in v09 when SLEN<VLEN


Roger Espasa
 

You're absolutely right. Thanks for the correction.


On Fri, May 29, 2020 at 1:44 PM DSHORNER <ds2horner@...> wrote:


On 2020-05-29 6:18 a.m., Roger Espasa wrote:
I was trying to summarize/compare the different proposals. I talked to Grigoris and I think this is a correct summary:

    v08 v09 Grigoris
LMUL=1 SLEN=VLEN SLEN does not matter SLEN does not matter SLEN does not matter
LMUL=1 SLEN<=SEW SLEN does not matter  SLEN does not matter SLEN does not matter
LMUL=1 SEW < SLEN < VLEN ignores SLEN SLEN affects  --
LMUL>1   memory ordering lost by using multiple regs
LMUL<1 SLEN=VLEN -- SLEN does not matter LMUL affects (1)
LMUL<1 SLEN<=SEW -- SLEN does not matter LMUL affects (1)
LMUL<1 SEW < SLEN < VLEN -- SLEN affects  --

The very significant correction is that for v09
SLEN=VLEN memory ordering IS preserved for LMUL>1
should be


    v08 v09 Grigoris
LMUL>1 SLEN=VLEN memory ordering lost memory order retained
memory ordering lost










LMUL>1
SLEN<VLEN
memory order lost

















This is a big win, but also the source of the possible software fragmentation in v09 when SLEN<VLEN
In the table above
  • GREEN indicates memory layout is preserved inside register layout
  • YELLOW indicates memory layout is NOT PRESERVED inside register layout
  • (1) data is packed inside a container of LMUL*SEW bits
  • In Grigori's proposal a couple cases never occur
In terms of MUXING from the memory into the register file in the VPU lanes, we tried to summaryize into 
this second table when data coming from memory ends up in a lane diffrent from the natural SLEN=VLEN arrangement:

  v0.9 Grigoris
LMUL < 1 SEW < SLEN always
LMUL = 1 SEW < SLEN never
LMUL > 1 always always

I think Grigori's proposal only suffers in the TAIL of a vector loop when VL is not a nice multiple of your VLEN and LMUL>1. Imagine your tail ends up with 3 elements and you are at LMUL=4. You will get the 3 elements "vertically" across the register groups, so loop EPILOG will take "3 issues" instead of potentially "just 1" (assuming the vector unit has more than one lane).

Is this a correct summary of the three proposals?

roger.


On Fri, May 29, 2020 at 9:06 AM Alex Solomatnikov <sols@...> wrote:
Guy,

FPGA energy/power data is not really relevant to general purpose ISA design, unless your goal is to design an ISA only for soft FPGA cores like Nios or MicroBlaze.

Please, take a look at the paper I referenced. It has energy breakdown not only for a simple RISC core but also for a SIMD+VLIW core (with 2 and 3 slots), which is a reasonable approximation for a vector core (with 2 or 3 instructions per cycle). These are based on designs industry competitive in PPA (Tensilica). For SIMD+VLIW core instruction fetch/decode is still ~30%.

Here is a quote from Section 4.1: "SIMD and VLIW speed up the application by 10x, decreasing IF energy by 10x, but the percentage of energy going to IF does not change much. IF still consumes more energy than functional units." Figure 4 showing energy breakdown is below.

image.png

Alex

On Thu, May 28, 2020 at 9:13 PM Guy Lemieux <glemieux@...> wrote:
Alex,

Keep in mind:

a) I amended my proposal to reduce the code bloat identified by Nick

b) the effect of the bloat is almost entirely about text segment size, not power or instruction bandwidth, because these are vector instructions that are already amortizing their overhead over VLEN/SEW operations (often over 64 ops).

LMUL>1 has almost nothing to do with energy efficiency, it’s almost entirely about storage efficiency (so unused vectors don’t go to waste) and a little bit to do with performance (slightly higher memory and compute utilizations are possible with longer vectors). Instruction fetch energy is dwarfed by the actual compute and memory access energy for VLEN bits per instruction.

I don’t have a citation to give, but I have direct experience implementing and measuring the effect. The FPGA-based vector engine I helped design used less energy per op and offered faster absolute performance than the hard ARM processor used as the host, despite the ARM having a 7x clock speed advantage.

Guy



On Thu, May 28, 2020 at 9:02 PM Alex Solomatnikov <sols@...> wrote:
Code bloat is important - not just the number load and store instructions but also additional vsetvl/i instructions. This was one of the reasons for vle8, vle16 and others.

LMUL>1 is also great for energy/power because a large percentage of energy/power, typically >30%, is spent on instruction fetch/handling even in simple processors [1]. LMUL>1 reduces the number of instructions for a given amount of work and energy/power spent on instruction fetch/handling.

Alex

1. Understanding sources of inefficiency in general-purpose chips

R HameedW QadeerM Wachs, O Azizi… - Proceedings of the 37th …, 2010 - dl.acm.org

On Wed, May 27, 2020 at 4:59 PM Guy Lemieux <glemieux@...> wrote:
Nick, thanks for that code snippet, it's really insightful.

I have a few comments:

a) this is for LMUL=8, the worst-case (most code bloat)

b) this would be automatically generated by a compiler, so visuals are
not meaningful, though code storage may be an issue

c) the repetition of vsetvli and sub instructions is not needed;
programmer may assume that all vector registers are equal in size

d) the vsetvli / add / sub instructions have minimal runtime due to
behaving like scalar operations

e) the repetition 8 times (or whatever LMUL you want) vor vle/vse and
the add can be mimicked by a change in the ISA to handle a set of
registers in a register group automatically, eg:

instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2

we can allow vle to operate on registers groups, but done one register
at a time in sequence, by doing this just once:
vle8.v  v0, (a2), m8  // does 8 independent loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc

That way ALL of the code bloat is now gone.

Ciao,
Guy


On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:
>
> I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.
>
> However, I do have a concern about removing LMUL > 1 memory operations regarding code bloat. This is all purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:
>
> # C code:
> # int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
> # keep N in a0 and &x[0] in a1
>
> # "BEFORE" (original RVV code):
> loop:
> vsetvli t0, a0, e8,m8
> vle8.v v0, (a1)
> vadd.vi v0, v0, 1
> vse8.v v0, (a1)
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> # "AFTER" removing LMUL > 1 loads/stores:
> loop:
> vsetvli t0, a0, e8,m8
> mv t1, t0
> mv a2, a1
>
> # loads:
> vsetvli t2, t1, e8,m1
> vle8.v v0, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vle8.v v1, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vle8.v v2, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vle8.v v3, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vle8.v v4, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vle8.v v5, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vle8.v v6, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vle8.v v7, (a2)
>
> # cast instructions ...
>
> vsetvli x0, t0, e8,m8
> vadd.vi v0, (a1)
>
> # more cast instructions ...
> mv t1, t0
> mv a2, a1
>
> # stores:
> vsetvli t2, t1, e8,m1
> vse8.v v0, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vse8.v v1, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vse8.v v2, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vse8.v v3, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vse8.v v4, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vse8.v v5, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vse8.v v6, (a2)
> add a2, a2, t2
> sub t2, t2, t1
> vsetvli t2, t1, e8,m1
> vse8.v v7, (a2)
>
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
>>
>> The precise data layout pattern does not matter.
>>
>> What matters is that a single distribution pattern is agreed upon to
>> avoid fragmenting the software ecosystem.
>>
>> With my additional restriction, the load/store side of an
>> implementation is greatly simplified, allowing for simple
>> implementations.
>>
>> The main drawback of my restriction is how to avoid the overhead of
>> the cast instruction in an aggressive implementation? The cast
>> instruction must rearrange data to translate between LMUL!=1 and
>> LMUL=1 data layouts; my proposal requires these casts to be executed
>> between any load/stores (which always assume LMUL=1) and compute
>> instructions which use LMUL!=1. I think this can sometimes be done for
>> "free" by carefully planning your compute instructions. For example, a
>> series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
>> the same register group destination can be macro-op fused. I don't
>> think the same thing can be done for vst instructions, unless it
>> macro-op fuses a longer sequence consisting of cast / vst / clear
>> register group (or some other operation that overwrites the cast
>> destination, indicating the cast is superfluous and only used by the
>> stores).
>>
>> Guy
>>
>> On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
>> >
>> > This is v0.8 with SLEN=8.
>>
>>
>>
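
A minimal sketch of the macro-op fusion idea in the quoted message
above, assuming a hypothetical "vcast" instruction (the name, syntax,
and operands are illustrative only; no such instruction exists in
v0.9):

# loads always use the LMUL=1 layout under the restriction above
vsetvli t2, t1, e8,m1
vle8.v v0, (a2)
add a2, a2, t2
vle8.v v1, (a2)
# ... v2..v7 loaded the same way ...

vsetvli x0, t0, e8,m8
vcast.v v0, v0        # hypothetical: rearrange v0..v7 from the LMUL=1 layout
                      # to the LMUL=8 layout; an implementation could macro-op
                      # fuse this with the preceding loads
vadd.vi v0, v0, 1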





David Horner
 



On 2020-05-29 8:21 a.m., Roger Espasa wrote:
You're absolutely right. Thanks for the correction.

On Fri, May 29, 2020 at 1:44 PM DSHORNER <ds2horner@...> wrote:


On 2020-05-29 6:18 a.m., Roger Espasa wrote:
I was trying to summarize/compare the different proposals. I talked to Grigoris and I think this is a correct summary:

                         v0.8                  v0.9                  Grigoris
LMUL=1, SLEN=VLEN        SLEN does not matter  SLEN does not matter  SLEN does not matter
LMUL=1, SLEN<=SEW        SLEN does not matter  SLEN does not matter  SLEN does not matter
LMUL=1, SEW<SLEN<VLEN    ignores SLEN          SLEN affects          --
LMUL>1                   memory ordering lost by using multiple regs
LMUL<1, SLEN=VLEN        --                    SLEN does not matter  LMUL affects (1)
LMUL<1, SLEN<=SEW        --                    SLEN does not matter  LMUL affects (1)
LMUL<1, SEW<SLEN<VLEN    --                    SLEN affects          --

The very significant correction is that for v0.9 with SLEN=VLEN,
memory ordering IS preserved for LMUL>1. The LMUL>1 rows should read:


                         v0.8                  v0.9                   Grigoris
LMUL>1, SLEN=VLEN        memory ordering lost  memory order retained  memory ordering lost

LMUL>1, SLEN<VLEN        memory order lost

This is a big win, but it is also the source of the possible software fragmentation in v0.9 when SLEN<VLEN.
In the table above
  • GREEN indicates memory layout is preserved inside register layout
  • YELLOW indicates memory layout is NOT PRESERVED inside register layout
  • (1) data is packed inside a container of LMUL*SEW bits
  • In Grigori's proposal a couple cases never occur
In terms of MUXING from memory into the register file in the VPU lanes, we tried to summarize in
this second table when data coming from memory ends up in a lane different from the natural SLEN=VLEN arrangement:

                      v0.9     Grigoris
LMUL<1, SEW<SLEN      always
LMUL=1, SEW<SLEN      never
LMUL>1                always   always


I'm not sure how to interpret this table (I've been puzzling over it).
But I believe the LMUL>1 case for v0.9 is incorrect, for the same reason as in the previous table.
For v0.9, LMUL=1 and LMUL>1 have the same behaviour; the behaviour depends on whether SLEN<VLEN or SLEN=VLEN.
So "always" is definitely wrong.

I don't see the most relevant characteristic as being SEW < SLEN.
Only when SLEN<VLEN can "MUXING" occur.
For v0.9 SEW>=SLEN means that the SLEN chunk is completely filled for that SEW and thus equivalent to SLEN=VLEN.
However, this is expected to be a rare occurrence. It is most likely that ELEN (that is, the maximum element width) is less than SLEN.
I appreciate its inclusion for completeness, but it is likely an outlier in practice.

What is relevant is the implicit CLSTR size in the current design, which is one byte.
This interacts with SEW for all values of SEW<SLEN (which is to say, in all but the fringe [pathological?] cases).

I think Grigori's proposal only suffers in the TAIL of a vector loop when VL is not a nice multiple of your VLEN and LMUL>1. Imagine your tail ends up with 3 elements and you are at LMUL=4. You will get the 3 elements "vertically" across the register groups, so loop EPILOG will take "3 issues" instead of potentially "just 1" (assuming the vector unit has more than one lane).

Is this a correct summary of the three proposals?

roger.


On Fri, May 29, 2020 at 9:06 AM Alex Solomatnikov <sols@...> wrote:
Guy,

FPGA energy/power data is not really relevant to general purpose ISA design, unless your goal is to design an ISA only for soft FPGA cores like Nios or MicroBlaze.

Please take a look at the paper I referenced. It has an energy breakdown not only for a simple RISC core but also for a SIMD+VLIW core (with 2 and 3 slots), which is a reasonable approximation of a vector core (with 2 or 3 instructions per cycle). These are based on designs that are industry-competitive in PPA (Tensilica). For the SIMD+VLIW core, instruction fetch/decode is still ~30%.

Here is a quote from Section 4.1: "SIMD and VLIW speed up the application by 10x, decreasing IF energy by 10x, but the percentage of energy going to IF does not change much. IF still consumes more energy than functional units." Figure 4 of the paper shows the corresponding energy breakdown.


Alex

On Thu, May 28, 2020 at 9:13 PM Guy Lemieux <glemieux@...> wrote:
Alex,

Keep in mind:

a) I amended my proposal to reduce the code bloat identified by Nick

b) the effect of the bloat is almost entirely about text segment size, not power or instruction bandwidth, because these are vector instructions that are already amortizing their overhead over VLEN/SEW operations (often over 64 ops).
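
(For a rough sense of scale, assuming VLEN=512 bits purely for
illustration: at SEW=8 and LMUL=8 a single vle8.v or vse8.v moves up to
8 * 512/8 = 512 elements, and even at LMUL=1 it moves up to 512/8 = 64
elements, so a few extra scalar bookkeeping instructions per strip are
a small fraction of the work each vector instruction performs.)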

LMUL>1 has almost nothing to do with energy efficiency; it's almost entirely about storage efficiency (so unused vector registers don't go to waste) and a little bit about performance (slightly higher memory and compute utilization is possible with longer vectors). Instruction fetch energy is dwarfed by the actual compute and memory access energy for VLEN bits per instruction.

I don’t have a citation to give, but I have direct experience implementing and measuring the effect. The FPGA-based vector engine I helped design used less energy per op and offered faster absolute performance than the hard ARM processor used as the host, despite the ARM having a 7x clock speed advantage.

Guy



On Thu, May 28, 2020 at 9:02 PM Alex Solomatnikov <sols@...> wrote:
Code bloat is important - not just the number of load and store instructions but also the additional vsetvl/i instructions. This was one of the reasons for vle8, vle16 and the others.

LMUL>1 is also great for energy/power because a large percentage of energy/power, typically >30%, is spent on instruction fetch/handling even in simple processors [1]. LMUL>1 reduces the number of instructions for a given amount of work and energy/power spent on instruction fetch/handling.
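
For comparison only, here is a sketch of Nick's example loop restricted
to LMUL=1 (same e8 element width assumed): the static code is still
seven instructions, but each strip now covers only VLEN/8 bytes instead
of VLEN bytes, so roughly 8x as many dynamic instructions (including
vsetvli and branch overhead) are fetched for the same N:

loop:
vsetvli t0, a0, e8,m1
vle8.v v0, (a1)
vadd.vi v0, v0, 1
vse8.v v0, (a1)
add a1, a1, t0
sub a0, a0, t0
bnez a0, loop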

Alex

1. R. Hameed, W. Qadeer, M. Wachs, O. Azizi, et al., "Understanding
sources of inefficiency in general-purpose chips," Proceedings of the
37th Annual International Symposium on Computer Architecture (ISCA), 2010.



Andy Glew Si5
 

Now I can add AsciiDoc and KramDoc artifacts to those HTML and PDF artifacts :-(


[Andrew Waterman]: We have tagged version 0.9 on github, including HTML/PDF artifacts: https://github.com/riscv/riscv-v-spec/releases/tag/0.9



From: Andrew Waterman <andrew@...>
Sent: Friday, May 15, 2020 3:23PM
To: Krste Asanovic <krste@...>
Cc: Tech-vector-ext <tech-vector-ext@...>
Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/5/15

 

On 5/15/2020 3:23 PM, Andrew Waterman wrote:


On Fri, May 15, 2020 at 11:56 AM Krste Asanovic <krste@...> wrote:

Date: 2020/5/15
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~20
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

# MLEN=1 change

The new layout of mask registers with fixed MLEN=1 was discussed.  The
group was generally in favor of the change, though there is a proposal
in flight to rearrange bits to align with bytes.  This might save some
wiring but could increase bits read/written for the mask in a
microarchitecture.

#434 SLEN=VLEN as optional extension

Most of the time was spent discussing the possible software
fragmentation from having code optimized for SLEN=VLEN versus
SLEN<VLEN, and how to avoid it.  The group was keen to prevent possible
fragmentation, so it is going to consider several options:

- providing cast instructions that are mandatory, so that at least
  SLEN<VLEN code runs correctly on SLEN=VLEN machines.

- considering a different data layout that could allow casting up to
  ELEN (<=SLEN); however, these appear to result in an even greater
  variety of layouts, or in dynamic layouts

- inventing a microarchitecture that can appear as SLEN=VLEN but
  internally restricts datapath communication to within an SLEN-wide
  slice of the datapath, or proving that this is impossible/expensive


# v0.9

The group agreed to declare the current version of the spec as 0.9,
representing a clear stable step for software and implementors.

We have tagged version 0.9 on github, including HTML/PDF artifacts: https://github.com/riscv/riscv-v-spec/releases/tag/0.9