On 2020-05-29 8:21 a.m., Roger Espasa wrote:
You're absolutely right. Thanks for the correction.
On 2020-05-29 6:18 a.m., Roger Espasa wrote:
I was trying to summarize/compare the
different proposals. I talked to Grigoris and I think
this is a correct summary:
        |               | v08                  | v09                  | Grigoris
LMUL=1  | SLEN=VLEN     | SLEN does not matter | SLEN does not matter | SLEN does not matter
LMUL=1  | SLEN<=SEW     | SLEN does not matter | SLEN does not matter | SLEN does not matter
LMUL=1  | SEW<SLEN<VLEN | ignores SLEN         | SLEN affects         | --
LMUL>1  |               | memory ordering lost by using multiple regs
LMUL<1  | SLEN=VLEN     | --                   | SLEN does not matter | LMUL affects (1)
LMUL<1  | SLEN<=SEW     | --                   | SLEN does not matter | LMUL affects (1)
LMUL<1  | SEW<SLEN<VLEN | --                   | SLEN affects         | --
The very significant correction is that for v09 SLEN=VLEN memory
ordering IS preserved for LMUL>1. The LMUL>1 rows should be:

        |           | v08                  | v09                   | Grigoris
LMUL>1  | SLEN=VLEN | memory ordering lost | memory order retained | memory ordering lost
LMUL>1  | SLEN<VLEN | memory order lost
This is a big win, but also the source of the possible software
fragmentation in v09 when SLEN<VLEN.
In the table above:
- GREEN indicates memory layout is preserved inside register layout
- YELLOW indicates memory layout is NOT PRESERVED inside register layout
- (1) data is packed inside a container of LMUL*SEW bits
- In Grigoris's proposal a couple of cases never occur
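As a concrete illustration of why SLEN<VLEN loses the memory
ordering, here is a small C sketch of the striped element-to-section
mapping as I read the v0.9 draft (round-robin across SLEN-wide
sections); the VLEN/SLEN/SEW values are picked only for the example:

    #include <stdio.h>

    int main(void)
    {
        const int VLEN = 256, SLEN = 128, SEW = 32; /* bits, assumed */
        const int sections = VLEN / SLEN;           /* 2 sections    */
        const int elems    = VLEN / SEW;            /* 8 elements    */
        for (int i = 0; i < elems; ++i) {
            int sec  = i % sections;   /* round-robin across sections */
            int slot = i / sections;   /* position within the section */
            printf("element %d -> section %d, bit offset %d\n",
                   i, sec, sec * SLEN + slot * SEW);
        }
        return 0;
    }

With these numbers elements 0,2,4,6 land in section 0 and elements
1,3,5,7 in section 1, so in-register order no longer matches memory
order; with SEW=SLEN each element fills a whole section and the
mapping collapses back to the contiguous SLEN=VLEN layout.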
In terms of MUXING from the memory into the register file in the VPU
lanes, we tried to summarize in this second table when data coming
from memory ends up in a lane different from the natural SLEN=VLEN
arrangement:
         |            | v0.9   | Grigoris
LMUL < 1 | SEW < SLEN | always
LMUL = 1 | SEW < SLEN | never
LMUL > 1 |            | always | always
I'm not sure how to interpret this table (been puzzling over it).
But I believe the LMUL>1 case for v0.9 is incorrect for the same
reason as in the previous table.
For v0.9, LMUL=1 and LMUL>1 have the same behaviour. The behaviour
depends on SLEN<VLEN vs SLEN=VLEN.
So "always" is definitely wrong.
I don't see the most relevant characteristic as being SEW < SLEN.
Only when SLEN<VLEN can "MUXING" occur.
For v0.9, SEW>=SLEN means that the SLEN chunk is completely filled
for that SEW, and is thus equivalent to SLEN=VLEN.
However, this is expected to be a rare occurrence. It is most likely
that ELEN (that is, the maximum element width) is less than SLEN.
I appreciate its inclusion for completeness, but it is likely an
outlier in practice.
What is relevant is the implicit CLSTR size in the current design,
which is byte size.
This interacts with SEW at all values of SEW<SLEN (which is to say,
in all but the fringe [pathological?] cases).
I think Grigoris's proposal only suffers in the TAIL of a vector
loop, when VL is not a nice multiple of your VLEN and LMUL>1. Imagine
your tail ends up with 3 elements and you are at LMUL=4. You will get
the 3 elements "vertically" across the register group, so the loop
EPILOG will take "3 issues" instead of potentially "just 1" (assuming
the vector unit has more than one lane).
Is this a correct summary of the three
proposals?
roger.
On Fri, May 29, 2020 at 9:06 AM Alex Solomatnikov <sols@...> wrote:
Guy,
FPGA energy/power data is not really relevant
to general purpose ISA design, unless your goal
is to design an ISA only for soft FPGA cores
like Nios or MicroBlaze.
Please take a look at the paper I referenced. It has an energy
breakdown not only for a simple RISC core but also for a SIMD+VLIW
core (with 2 and 3 slots), which is a reasonable approximation for a
vector core (with 2 or 3 instructions per cycle). These are based on
designs industry-competitive in PPA (Tensilica). For the SIMD+VLIW
core, instruction fetch/decode is still ~30%.
Here is a quote from Section 4.1: "SIMD and
VLIW speed up the application by 10x, decreasing
IF energy by 10x, but the percentage of energy
going to IF does not change much. IF still
consumes more energy than functional units."
Figure 4 showing energy breakdown is below.
Alex
On Thu, May 28, 2020 at 9:13 PM Guy Lemieux <glemieux@...> wrote:
Keep in mind:
a) I amended my proposal to
reduce the code bloat identified by Nick
b) the effect of the bloat is
almost entirely about text segment size, not
power or instruction bandwidth, because these
are vector instructions that are already
amortizing their overhead over VLEN/SEW
operations (often over 64 ops).
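(For scale, with an assumed VLEN=512 and SEW=8, a single vector
instruction covers 512/8 = 64 element operations; larger VLEN or
LMUL>1 raises that further.)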
LMUL>1 has almost nothing to do with energy efficiency; it's almost
entirely about storage efficiency (so unused vectors don't go to
waste) and a little bit to do with performance (slightly higher
memory and compute utilization is possible with longer vectors).
Instruction fetch energy is dwarfed by the actual compute and memory
access energy for VLEN bits per instruction.
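Back-of-envelope (my numbers, purely illustrative): at VLEN=512 and
SEW=8, one ~4-byte instruction fetch drives 64 element operations
and, for a load or store, 64 bytes of data movement, so the per-fetch
energy is split 64 ways before LMUL>1 enters the picture at all.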
I don’t have a citation to give,
but I have direct experience implementing and
measuring the effect. The FPGA-based vector
engine I helped design used less energy per op
and offered faster absolute performance than
the hard ARM processor used as the host,
despite the ARM having a 7x clock speed
advantage.
Guy
On Thu, May 28, 2020 at 9:02 PM Alex Solomatnikov <sols@...> wrote:
Code bloat is important - not just the number of load and store
instructions but also the additional vsetvl/i instructions. This was
one of the reasons for vle8, vle16 and others.
LMUL>1 is also great for energy/power because a large percentage of
energy/power, typically >30%, is spent on instruction fetch/handling
even in simple processors [1]. LMUL>1 reduces the number of
instructions for a given amount of work, and thus the energy/power
spent on instruction fetch/handling.
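A quick worked example (numbers assumed): with VLEN=128 and SEW=32,
LMUL=1 gives VLMAX = 4 elements per instruction, so 1024 elements
take 256 stripmine iterations of vsetvli/load/op/store/branch, while
LMUL=8 gives VLMAX = 32 and only 32 iterations, an 8x cut in that
per-iteration instruction overhead.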
Alex
On Wed, May 27, 2020 at 4:59 PM Guy Lemieux <glemieux@...> wrote:
Nick,
thanks for that code snippet, it's
really insightful.
I have a few comments:
a) this is for LMUL=8, the worst case (most code bloat)
b) this would be automatically generated by a compiler, so visuals
are not meaningful, though code storage may be an issue
c) the repetition of vsetvli and sub instructions is not needed; the
programmer may assume that all vector registers are equal in size
d) the vsetvli / add / sub instructions have minimal runtime cost,
behaving like scalar operations
e) the repetition, 8 times (or whatever LMUL you want), of vle/vse
and the add can be mimicked by a change in the ISA to handle a set of
registers in a register group automatically, eg:
instead of this 8 times for v0 to v7:

vle8.v v0, (a2)
add a2, a2, t2

we can allow vle to operate on register groups, but done one register
at a time in sequence, by doing this just once:

vle8.v v0, (a2), m8 // does 8 independent loads, loading to v0 from
                    // address a2, v1 from address a2+vl, v2 from
                    // address a2+2*vl, etc
That way ALL of the code bloat is
now gone.
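Here is a C sketch of the semantics Guy describes; the vle8_m8 name,
the fixed VLEN, and the array types are mine, illustrative only, not
from any draft:

    #include <stddef.h>
    #include <stdint.h>

    enum { VLEN_BYTES = 16 };   /* assumed VLEN=128 for the sketch */

    /* Hypothetical "vle8.v v0, (a2), m8": eight independent LMUL=1
       loads, issued one register at a time, in sequence. */
    void vle8_m8(uint8_t vreg[8][VLEN_BYTES], const uint8_t *a2,
                 size_t vl)
    {
        for (int r = 0; r < 8; ++r)          /* v0, v1, ..., v7 */
            for (size_t i = 0; i < vl; ++i)  /* vl <= VLEN_BYTES */
                vreg[r][i] = a2[(size_t)r * vl + i];
                /* v0 from a2, v1 from a2+vl, v2 from a2+2*vl, ... */
    }

Each register receives vl consecutive bytes of memory, so every
register individually keeps the plain LMUL=1 layout.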
Ciao,
Guy
On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:
>
> I appreciate this discussion about making things friendlier to
> software. I've always felt the constraints on SLEN-agnostic software
> to be a nuisance, albeit a mild one.
>
> However, I do have a concern about code bloat from removing LMUL > 1
> memory operations. This is all purely subjective: I have not done
> any concrete analysis. But here is an example that illustrates my
> concern:
>
> # C code:
> # int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
> # keep N in a0 and &x[0] in a1
>
> # "BEFORE" (original RVV code):
> loop:
> vsetvli t0, a0, e8,m8
> vle8.v v0, (a1)
> vadd.vi
v0, v0, 1
> vse8.v v0, (a1)
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> # "AFTER" removing LMUL > 1
loads/stores:
> loop:
> vsetvli t0, a0, e8,m8
> mv t1, t0
> mv a2, a1
>
> # loads:
> vsetvli t2, t1, e8,m1
> vle8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v7, (a2)
>
> # cast instructions ...
>
> vsetvli x0, t0, e8,m8
> vadd.vi v0, v0, 1
>
> # more cast instructions ...
> mv t1, t0
> mv a2, a1
>
> # stores:
> vsetvli t2, t1, e8,m1
> vse8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v7, (a2)
>
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
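> By a rough static count (ignoring the elided cast instructions), the
> BEFORE loop body is 7 instructions while the AFTER version is about
> 70, a roughly 10x expansion.
>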
> On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
>>
>> The precise data layout pattern does not matter.
>>
>> What matters is that a single distribution pattern is agreed upon
>> to avoid fragmenting the software ecosystem.
>>
>> With my additional restriction, the load/store side of an
>> implementation is greatly simplified, allowing for simple
>> implementations.
>>
>> The main drawback of my restriction is how to avoid the overhead
>> of the cast instruction in an aggressive implementation. The cast
>> instruction must rearrange data to translate between LMUL!=1 and
>> LMUL=1 data layouts; my proposal requires these casts to be
>> executed between any loads/stores (which always assume LMUL=1) and
>> compute instructions which use LMUL!=1. I think this can sometimes
>> be done for "free" by carefully planning your compute instructions.
>> For example, a series of vld instructions with LMUL=1 followed by a
>> cast to LMUL>1 to the same register group destination can be
>> macro-op fused. I don't think the same thing can be done for vst
>> instructions, unless it macro-op fuses a longer sequence consisting
>> of cast / vst / clear register group (or some other operation that
>> overwrites the cast destination, indicating the cast is superfluous
>> and only used by the stores).
>>
>> Guy
>>
>> On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
>> >
>> > This is v0.8 with SLEN=8.