On 2020-05-29 6:18 a.m., Roger Espasa wrote:
I was trying to summarize/compare the different
proposals. I talked to Grigoris and I think this is a correct
summary:
        |               | v08                  | v09                  | Grigoris
 LMUL=1 | SLEN=VLEN     | SLEN does not matter | SLEN does not matter | SLEN does not matter
 LMUL=1 | SLEN<=SEW     | SLEN does not matter | SLEN does not matter | SLEN does not matter
 LMUL=1 | SEW<SLEN<VLEN | ignores SLEN         | SLEN affects         | --
 LMUL>1 |               | memory ordering lost by using multiple regs
 LMUL<1 | SLEN=VLEN     | --                   | SLEN does not matter | LMUL affects (1)
 LMUL<1 | SLEN<=SEW     | --                   | SLEN does not matter | LMUL affects (1)
 LMUL<1 | SEW<SLEN<VLEN | --                   | SLEN affects         | --
The very significant correction is that for v09 with SLEN=VLEN, memory
ordering IS preserved for LMUL>1, so the LMUL>1 rows should be:
        |           | v08                  | v09                   | Grigoris
 LMUL>1 | SLEN=VLEN | memory ordering lost | memory order retained | memory ordering lost
 LMUL>1 | SLEN<VLEN | memory order lost    | memory order lost     | memory order lost
This is a big win, but it is also the source of the possible software
fragmentation in v09 when SLEN<VLEN.
In the table above:
- GREEN indicates memory layout is preserved inside the register layout
- YELLOW indicates memory layout is NOT PRESERVED inside the register layout
- (1) data is packed inside a container of LMUL*SEW bits
- In Grigoris's proposal a couple of cases never occur
In terms of MUXING from memory into the register file in the VPU
lanes, we tried to summarize in this second table when data coming
from memory ends up in a lane different from the natural SLEN=VLEN
arrangement:
          | v0.9       | Grigoris
 LMUL < 1 | SEW < SLEN | always
 LMUL = 1 | SEW < SLEN | never
 LMUL > 1 | always     | always
I think Grigoris's proposal only suffers in the TAIL of
a vector loop when VL is not a nice multiple of your VLEN
and LMUL>1. Imagine your tail ends up with 3 elements
and you are at LMUL=4. You will get the 3 elements
"vertically" across the register groups, so loop EPILOG
will take "3 issues" instead of potentially "just 1"
(assuming the vector unit has more than one lane).
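As a minimal sketch of that epilog (illustrative only; it assumes e8
and that the 3 tail elements land at element 0 of v0, v1, and v2):

    # hypothetical epilog: 3 tail elements at LMUL=4, e8
    # the elements sit "vertically", one per register of the group,
    # so draining them takes three issues instead of one
    li      t0, 1
    vsetvli x0, t0, e8,m1    # vl = 1
    vse8.v  v0, (a1)         # issue 1
    addi    a1, a1, 1
    vse8.v  v1, (a1)         # issue 2
    addi    a1, a1, 1
    vse8.v  v2, (a1)         # issue 3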
Is this a correct summary of the three proposals?
roger.
On Fri, May 29, 2020 at 9:06 AM Alex Solomatnikov <sols@...> wrote:
Guy,
FPGA energy/power data is not really relevant to general-purpose ISA
design, unless your goal is to design an ISA only for soft FPGA cores
like Nios or MicroBlaze.
Please take a look at the paper I referenced. It has an energy
breakdown not only for a simple RISC core but also for a SIMD+VLIW
core (with 2 and 3 slots), which is a reasonable approximation for a
vector core (with 2 or 3 instructions per cycle). These are based on
Tensilica designs with industry-competitive PPA. For the SIMD+VLIW
core, instruction fetch/decode is still ~30%.
Here is a quote from Section 4.1: "SIMD and VLIW
speed up the application by 10x, decreasing IF
energy by 10x, but the percentage of energy going to IF
does not
change much. IF still consumes more energy than
functional units." Figure 4 showing energy breakdown is
below.
Alex
On Thu, May 28, 2020 at 9:13 PM Guy Lemieux <glemieux@...> wrote:
Keep in mind:
a) I amended my proposal to reduce the
code bloat identified by Nick
b) the effect of the bloat is almost
entirely about text segment size, not power or
instruction bandwidth, because these are vector
instructions that are already amortizing their
overhead over VLEN/SEW operations (often over 64 ops).
LMUL>1 has almost nothing to do with energy efficiency; it's almost
entirely about storage efficiency (so unused vectors don't go to waste) and a
little bit to do with performance (slightly higher
memory and compute utilizations are possible with
longer vectors). Instruction fetch energy is dwarfed
by the actual compute and memory access energy for
VLEN bits per instruction.
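(For scale, with purely illustrative numbers: at VLEN=512 and SEW=8,
each vector instruction covers 512/8 = 64 element operations, so the
fetch overhead per element op is already ~1/64th that of a scalar
loop, before LMUL enters the picture.)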
I don’t have a citation to give, but I
have direct experience implementing and measuring the
effect. The FPGA-based vector engine I helped design
used less energy per op and offered faster absolute
performance than the hard ARM processor used as the
host, despite the ARM having a 7x clock speed
advantage.
Guy
On Thu, May 28, 2020 at 9:02 PM Alex Solomatnikov <sols@...> wrote:
Code bloat is important - not just the number of load and store
instructions but also the additional vsetvl/i instructions. This was
one of the reasons for vle8, vle16 and the others.
LMUL>1 is also great for energy/power because a large percentage of
energy/power, typically >30%, is spent on instruction fetch/handling
even in simple processors [1]. LMUL>1 reduces the number of
instructions for a given amount of work, and with it the energy/power
spent on instruction fetch/handling.
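As a concrete illustration (VLEN=128 bits assumed just for the
arithmetic): at e8,m8 one vsetvli yields vl up to 8*128/8 = 128
elements, so a ~7-instruction strip-mine loop body is fetched once per
128 elements instead of once per 16 at m1, i.e. roughly 8x fewer
instructions fetched for the same work.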
Alex
On Wed, May 27, 2020 at 4:59 PM Guy Lemieux <glemieux@...> wrote:
Nick,
thanks for that code snippet; it's really insightful.
I have a few comments:
a) this is for LMUL=8, the worst-case (most
code bloat)
b) this would be automatically generated by
a compiler, so visuals are
not meaningful, though code storage may be
an issue
c) the repetition of vsetvli and sub instructions is not needed; the
programmer may assume that all vector registers are equal in size
d) the vsetvli / add / sub instructions have minimal runtime cost,
since they behave like scalar operations
e) the repetition 8 times (or whatever LMUL
you want) for vle/vse and
the add can be mimicked by a change in the
ISA to handle a set of
registers in a register group automatically,
eg:
instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2
we can allow vle to operate on register groups, but done one register
at a time in sequence, by doing this just
once:
vle8.v v0, (a2), m8 // does 8 independent
loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from
address a2+2*vl, etc
That way ALL of the code bloat is now gone.
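A sketch of the intended expansion (hypothetical; the ", m8" suffix is
only proposed syntax, and t2 is assumed to hold vl in bytes as in
Nick's e8 example):

vle8.v v0, (a2), m8   // would behave like:
mv     t3, a2
vle8.v v0, (t3)       // v0 <- mem[a2 .. a2+vl)
add    t3, t3, t2     // advance by vl bytes (e8: 1 byte/element)
vle8.v v1, (t3)       // v1 <- mem[a2+vl .. a2+2*vl)
add    t3, t3, t2
// ... and so on through v7, one register at a time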
Ciao,
Guy
On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:
>
> I appreciate this discussion about making things friendlier to
> software. I've always felt the constraints on SLEN-agnostic software
> to be a nuisance, albeit a mild one.
>
> However, I do have a concern about code bloat from removing
> LMUL > 1 memory operations. This is all purely subjective: I have
> not done any concrete analysis. But here is an example that
> illustrates my concern:
>
> # C code:
> # int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
> # keep N in a0 and &x[0] in a1
>
> # "BEFORE" (original RVV code):
> loop:
> vsetvli t0, a0, e8,m8
> vle8.v v0, (a1)
> vadd.vi v0, v0, 1
> vse8.v v0, (a1)
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> # "AFTER" removing LMUL > 1
loads/stores:
> loop:
> vsetvli t0, a0, e8,m8
> mv t1, t0
> mv a2, a1
>
> # loads:
> vsetvli t2, t1, e8,m1
> vle8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v7, (a2)
>
> # cast instructions ...
>
> vsetvli x0, t0, e8,m8
> vadd.vi v0, v0, 1
>
> # more cast instructions ...
> mv t1, t0
> mv a2, a1
>
> # stores:
> vsetvli t2, t1, e8,m1
> vse8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v7, (a2)
>
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
>>
>> The precise data layout pattern does not matter.
>>
>> What matters is that a single distribution pattern is agreed upon
>> to avoid fragmenting the software ecosystem.
>>
>> With my additional restriction, the load/store side of an
>> implementation is greatly simplified, allowing for simple
>> implementations.
>>
>> The main drawback of my restriction is the overhead of the cast
>> instruction: how can an aggressive implementation avoid it? The
>> cast instruction must rearrange data to translate between LMUL!=1
>> and LMUL=1 data layouts; my proposal requires these casts to be
>> executed between any loads/stores (which always assume LMUL=1) and
>> compute instructions which use LMUL!=1. I think this can sometimes
>> be done for "free" by carefully planning your compute instructions.
>> For example, a series of vld instructions with LMUL=1 followed by a
>> cast to LMUL>1 with the same register group destination can be
>> macro-op fused. I don't think the same thing can be done for vst
>> instructions, unless it macro-op fuses a longer sequence consisting
>> of cast / vst / clear register group (or some other operation that
>> overwrites the cast destination, indicating the cast is superfluous
>> and only used by the stores).
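>>
>> As a sketch of the fusible load-then-cast sequence (vcast.m4 is a
>> purely hypothetical mnemonic for the proposed cast instruction, and
>> the vsetvli settings are illustrative):
>>
>> vsetvli t0, a0, e8,m1  # loads always execute at LMUL=1
>> vle8.v v0, (a1)        # ...one vle8.v per register of the group
>> # the cast rearranges v0..v3 into the LMUL=4 in-register layout;
>> # a vle8.v...vcast pair targeting the same register group could be
>> # macro-op fused so no data movement actually occurs:
>> vcast.m4 v0, v0
>> vsetvli t0, a0, e8,m4
>> vadd.vi v0, v0, 1      # compute at LMUL=4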
>>
>> Guy
>>
>> On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
>> >
>> > This is v0.8 with SLEN=8.