Re: Vector Task Group minutes 2020/5/15

David Horner

On 2020-05-29 6:18 a.m., Roger Espasa wrote:
I was trying to summarize/compare the different proposals. I talked to Grigoris and I think this is a correct summary:

                         v08                   v09                   Grigoris
LMUL=1  SLEN=VLEN        SLEN does not matter  SLEN does not matter  SLEN does not matter
LMUL=1  SLEN<=SEW        SLEN does not matter  SLEN does not matter  SLEN does not matter
LMUL=1  SEW<SLEN<VLEN    ignores SLEN          SLEN affects          --
LMUL>1                   memory ordering lost by using multiple regs
LMUL<1  SLEN=VLEN        --                    SLEN does not matter  LMUL affects (1)
LMUL<1  SLEN<=SEW        --                    SLEN does not matter  LMUL affects (1)
LMUL<1  SEW<SLEN<VLEN    --                    SLEN affects          --

The very significant correction is that for v09 with SLEN=VLEN, memory ordering IS preserved for LMUL>1. The rows should be:

                         v08                   v09                    Grigoris
LMUL>1  SLEN=VLEN        memory ordering lost  memory order retained  memory ordering lost
LMUL>1  SLEN<VLEN        memory order lost
This is a big win, but also the source of possible software fragmentation in v09 when SLEN<VLEN.
In the table above:
  • GREEN indicates the memory layout IS preserved inside the register layout
  • YELLOW indicates the memory layout is NOT preserved inside the register layout
  • (1) data is packed inside a container of LMUL*SEW bits
  • In Grigoris's proposal a couple of cases never occur
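To make the "memory layout not preserved in-register" cases concrete, here is a small Python sketch of a sectioned, SLEN-striped register layout. This is my own illustration of the v0.9-style scheme, not normative spec text: I assume elements are striped round-robin across SLEN-wide sections, so when SEW < SLEN < VLEN the in-register slot order no longer matches memory order.

```python
# Illustrative model of a sectioned (SLEN-striped) register layout.
# The round-robin striping rule is an assumption for illustration;
# parameter names (VLEN, SLEN, SEW) follow the thread.

def striped_layout(vlen_bits, slen_bits, sew_bits):
    """Return the element index stored in each in-register slot,
    lowest-order slot first."""
    sections = vlen_bits // slen_bits   # SLEN-wide sections per register
    per_sec = slen_bits // sew_bits     # elements held by each section
    n = vlen_bits // sew_bits           # elements per register
    slots = [None] * n
    for i in range(n):
        sec = i % sections              # element i -> section i mod S
        pos = i // sections             # position within its section
        slots[sec * per_sec + pos] = i
    return slots

# SLEN == VLEN: one section, in-register order equals memory order.
print(striped_layout(128, 128, 32))   # [0, 1, 2, 3]

# SEW < SLEN < VLEN: two sections, order is NOT memory order.
print(striped_layout(128, 64, 32))    # [0, 2, 1, 3]
```

The second call shows why a store of this register cannot simply stream bytes back to memory: elements 1 and 2 have swapped slots.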
In terms of MUXING from memory into the register file in the VPU lanes, we tried to summarize in this second table when data coming from memory ends up in a lane different from the natural SLEN=VLEN arrangement:

                         v0.9     Grigoris
LMUL < 1, SEW < SLEN     always
LMUL = 1, SEW < SLEN     never
LMUL > 1                 always   always

I think Grigoris's proposal only suffers in the TAIL of a vector loop, when VL is not a nice multiple of your VLEN and LMUL>1. Imagine your tail ends up with 3 elements and you are at LMUL=4. You will get the 3 elements "vertically" across the register group, so the loop EPILOG will take "3 issues" instead of potentially "just 1" (assuming the vector unit has more than one lane).
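The tail cost described above can be sketched with a toy issue count. The model and numbers are my own for illustration: one issue per architectural register of the group that holds at least one live tail element, and 16 elements per register chosen arbitrarily.

```python
import math

# Toy model of vector-loop epilog "issues" (assumption: one issue per
# register of the group holding at least one live tail element).

def epilog_issues(tail_elems, lmul, elems_per_reg, vertical):
    if vertical:
        # tail elements land one-per-register across the group
        return min(tail_elems, lmul)
    # natural packing: tail elements fill the first register(s)
    return math.ceil(tail_elems / elems_per_reg)

# 3 tail elements at LMUL=4, 16 elements per register:
print(epilog_issues(3, 4, 16, vertical=True))   # 3 issues
print(epilog_issues(3, 4, 16, vertical=False))  # 1 issue
```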

Is this a correct summary of the three proposals?

roger.


On Fri, May 29, 2020 at 9:06 AM Alex Solomatnikov <sols@...> wrote:
Guy,

FPGA energy/power data is not really relevant to general purpose ISA design, unless your goal is to design an ISA only for soft FPGA cores like Nios or MicroBlaze.

Please take a look at the paper I referenced. It has an energy breakdown not only for a simple RISC core but also for a SIMD+VLIW core (with 2 and 3 slots), which is a reasonable approximation of a vector core (with 2 or 3 instructions per cycle). These are based on designs that are industry-competitive in PPA (Tensilica). For the SIMD+VLIW core, instruction fetch/decode is still ~30%.

Here is a quote from Section 4.1: "SIMD and VLIW speed up the application by 10x, decreasing IF energy by 10x, but the percentage of energy going to IF does not change much. IF still consumes more energy than functional units." Figure 4 showing energy breakdown is below.

[Figure 4: energy breakdown]

Alex

On Thu, May 28, 2020 at 9:13 PM Guy Lemieux <glemieux@...> wrote:
Alex,

Keep in mind:

a) I amended my proposal to reduce the code bloat identified by Nick

b) the effect of the bloat is almost entirely about text segment size, not power or instruction bandwidth, because these are vector instructions that are already amortizing their overhead over VLEN/SEW operations (often over 64 ops).

LMUL>1 has almost nothing to do with energy efficiency, it’s almost entirely about storage efficiency (so unused vectors don’t go to waste) and a little bit to do with performance (slightly higher memory and compute utilizations are possible with longer vectors). Instruction fetch energy is dwarfed by the actual compute and memory access energy for VLEN bits per instruction.
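The amortization point above is simple arithmetic; here is a quick sketch. The VLEN/SEW values are my own examples, beyond the "over 64 ops" figure quoted in the email.

```python
# Element operations covered by one vector instruction: VLEN/SEW,
# scaled by LMUL when one instruction drives a register group.
# Illustrative numbers only.

def ops_per_instr(vlen_bits, sew_bits, lmul=1):
    return (vlen_bits // sew_bits) * lmul

print(ops_per_instr(512, 8))          # 64 element ops amortize one fetch
print(ops_per_instr(512, 8, lmul=8))  # 512 element ops per fetch at LMUL=8
```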

I don’t have a citation to give, but I have direct experience implementing and measuring the effect. The FPGA-based vector engine I helped design used less energy per op and offered faster absolute performance than the hard ARM processor used as the host, despite the ARM having a 7x clock speed advantage.

Guy



On Thu, May 28, 2020 at 9:02 PM Alex Solomatnikov <sols@...> wrote:
Code bloat is important - not just the number of load and store instructions but also the additional vsetvl/i instructions. This was one of the reasons for vle8, vle16 and others.

LMUL>1 is also great for energy/power because a large percentage of energy/power, typically >30%, is spent on instruction fetch/handling even in simple processors [1]. LMUL>1 reduces the number of instructions for a given amount of work and energy/power spent on instruction fetch/handling.

Alex

1. "Understanding sources of inefficiency in general-purpose chips", R. Hameed, W. Qadeer, M. Wachs, O. Azizi, et al., Proceedings of the 37th …, 2010, dl.acm.org

On Wed, May 27, 2020 at 4:59 PM Guy Lemieux <glemieux@...> wrote:
Nick, thanks for that code snippet, it's really insightful.

I have a few comments:

a) this is for LMUL=8, the worst-case (most code bloat)

b) this would be automatically generated by a compiler, so visuals are
not meaningful, though code storage may be an issue

c) the repetition of vsetvli and sub instructions is not needed; the
programmer may assume that all vector registers are equal in size

d) the vsetvli / add / sub instructions have minimal runtime due to
behaving like scalar operations

e) repeating vle/vse and the add 8 times (or whatever LMUL you want)
can be mimicked by a change in the ISA to handle a set of registers
in a register group automatically, eg:

instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2

we can allow vle to operate on registers groups, but done one register
at a time in sequence, by doing this just once:
vle8.v  v0, (a2), m8  // does 8 independent loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc

That way ALL of the code bloat is now gone.
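The register-at-a-time group load proposed above can be modeled in a few lines. This is only a Python sketch of the behavior described in the email (register k loads from address a2 + k*vl); the function and variable names are mine.

```python
# Model of the proposed "vle8.v v0, (a2), m8" semantics: m sequential
# unit-stride loads, with register k taking vl elements starting at
# base + k*vl. Sketch only, not an implementation.

def group_load(mem, base, vl, m):
    """Return m simulated vector registers loaded back-to-back."""
    regs = []
    for k in range(m):
        start = base + k * vl
        regs.append(mem[start:start + vl])
    return regs

mem = list(range(100))
regs = group_load(mem, base=8, vl=4, m=8)
print(regs[0])   # [8, 9, 10, 11]   (v0 <- a2)
print(regs[1])   # [12, 13, 14, 15] (v1 <- a2 + vl)
```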

Ciao,
Guy


On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:
>
> I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.
>
> However, I do have a concern about removing LMUL > 1 memory operations regarding code bloat. This is all purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:
>
> # C code:
> # int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
> # keep N in a0 and &x[0] in a1
>
> # "BEFORE" (original RVV code):
> loop:
> vsetvli t0, a0, e8,m8
> vle8.v v0, (a1)
> vadd.vi v0, v0, 1
> vse8.v v0, (a1)
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> # "AFTER" removing LMUL > 1 loads/stores:
> loop:
> vsetvli t0, a0, e8,m8
> mv t1, t0
> mv a2, a1
>
> # loads:
> vsetvli t2, t1, e8,m1
> vle8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v7, (a2)
>
> # cast instructions ...
>
> vsetvli x0, t0, e8,m8
> vadd.vi v0, v0, 1
>
> # more cast instructions ...
> mv t1, t0
> mv a2, a1
>
> # stores:
> vsetvli t2, t1, e8,m1
> vse8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v7, (a2)
>
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
>>
>> The precise data layout pattern does not matter.
>>
>> What matters is that a single distribution pattern is agreed upon to
>> avoid fragmenting the software ecosystem.
>>
>> With my additional restriction, the load/store side of an
>> implementation is greatly simplified, allowing for simple
>> implementations.
>>
>> The main drawback of my restriction is how to avoid the overhead of
>> the cast instruction in an aggressive implementation? The cast
>> instruction must rearrange data to translate between LMUL!=1 and
>> LMUL=1 data layouts; my proposal requires these casts to be executed
>> between any load/stores (which always assume LMUL=1) and compute
>> instructions which use LMUL!=1. I think this can sometimes be done for
>> "free" by carefully planning your compute instructions. For example, a
>> series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
>> the same register group destination can be macro-op fused. I don't
>> think the same thing can be done for vst instructions, unless it
>> macro-op fuses a longer sequence consisting of cast / vst / clear
>> register group (or some other operation that overwrites the cast
>> destination, indicating the cast is superfluous and only used by the
>> stores).
>>
>> Guy
>>
>> On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
>> >
>> > This is v0.8 with SLEN=8.