Vector Task Group minutes 2020/5/15


Andy Glew Si5
 

Now I can add AsciiDoc and KramDoc artifacts to those HTML and PDF artifacts :-(


[Andrew Waterman]: We have tagged version 0.9 on github, including HTML/PDF artifacts: https://github.com/riscv/riscv-v-spec/releases/tag/0.9



From: Andrew Waterman <andrew@...>
Sent: Friday, May 15, 2020 3:23 PM
To: Krste Asanovic <krste@...>
Cc: Tech-vector-ext <tech-vector-ext@...>
Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/5/15

 

On 5/15/2020 3:23 PM, Andrew Waterman wrote:


On Fri, May 15, 2020 at 11:56 AM Krste Asanovic <krste@...> wrote:

Date: 2020/5/15
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~20
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

# MLEN=1 change

The new layout of mask registers with fixed MLEN=1 was discussed.  The
group was generally in favor of the change, though there is a proposal
in flight to rearrange the bits to align with bytes.  This might save
some wiring but could increase the number of mask bits read/written in
some microarchitectures.
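A minimal sketch, assuming v0.9 assembler syntax and arbitrary register choices, of what fixed MLEN=1 means in practice: the mask bit for element i is always bit i of v0, for any SEW/LMUL setting, so one mask layout serves every element width.

vsetvli  t0, a0, e32,m4        # 32-bit elements, LMUL=4 register group
vle32.v  v8, (a1)              # load the elements
vmslt.vx v0, v8, x0            # mask: bit i of v0 = (element i < 0)
vadd.vi  v8, v8, 1, v0.t       # increment only the masked elements
vse32.v  v8, (a1), v0.t        # store back only the masked elements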

#434 SLEN=VLEN as optional extension

Most of the time was spent discussing the possible software
fragmentation from having code optimized for SLEN=VLEN versus
SLEN<VLEN, and how to avoid it.  The group was keen to prevent possible
fragmentation, so is going to consider several options:

- providing cast instructions that are mandatory, so at least
  SLEN<VLEN code runs correctly on SLEN=VLEN machines

- considering a different data layout that could allow casting up to
  ELEN (<=SLEN); however, these appear to result in an even greater
  variety of layouts, or in dynamic layouts

- inventing a microarchitecture that can appear as SLEN=VLEN but
  internally restricts datapath communication to within an SLEN-wide
  slice of the datapath, or proving this is impossible/expensive


# v0.9

The group agreed to declare the current version of the spec as 0.9,
representing a clear stable step for software and implementors.

We have tagged version 0.9 on github, including HTML/PDF artifacts: https://github.com/riscv/riscv-v-spec/releases/tag/0.9









David Horner
 



On 2020-05-29 8:21 a.m., Roger Espasa wrote:
You're absolutely right. Thanks for the correction.

On Fri, May 29, 2020 at 1:44 PM DSHORNER <ds2horner@...> wrote:


On 2020-05-29 6:18 a.m., Roger Espasa wrote:
I was trying to summarize/compare the different proposals. I talked to Grigoris and I think this is a correct summary:

Case                       | v0.8                 | v0.9                 | Grigoris
LMUL=1, SLEN=VLEN          | SLEN does not matter | SLEN does not matter | SLEN does not matter
LMUL=1, SLEN<=SEW          | SLEN does not matter | SLEN does not matter | SLEN does not matter
LMUL=1, SEW < SLEN < VLEN  | ignores SLEN         | SLEN affects         | --
LMUL>1                     | memory ordering lost by using multiple regs
LMUL<1, SLEN=VLEN          | --                   | SLEN does not matter | LMUL affects (1)
LMUL<1, SLEN<=SEW          | --                   | SLEN does not matter | LMUL affects (1)
LMUL<1, SEW < SLEN < VLEN  | --                   | SLEN affects         | --

The very significant correction is that, for v0.9 with SLEN=VLEN, memory ordering IS preserved for LMUL>1.
The LMUL>1 rows should instead read:


Case                | v0.8                 | v0.9                     | Grigoris
LMUL>1, SLEN=VLEN   | memory ordering lost | memory ordering retained | memory ordering lost
LMUL>1, SLEN<VLEN   | memory ordering lost | memory ordering lost     | memory ordering lost
This is a big win, but it is also the source of the possible software fragmentation in v0.9 when SLEN<VLEN.
In the table above:
  • GREEN indicates the memory layout is preserved inside the register layout
  • YELLOW indicates the memory layout is NOT preserved inside the register layout
  • (1) data is packed inside a container of LMUL*SEW bits
  • in Grigoris's proposal a couple of the cases never occur
In terms of MUXING from memory into the register file in the VPU lanes, we tried to summarize, in
this second table, when data coming from memory ends up in a lane different from the natural SLEN=VLEN arrangement:

Case                  | v0.9   | Grigoris
LMUL < 1, SEW < SLEN  | always |
LMUL = 1, SEW < SLEN  | never  |
LMUL > 1              | always | always


I'm not sure how to interpret this table (I have been puzzling over it).
But I believe the LMUL>1 case for v0.9 is incorrect, for the same reason as in the previous table.
For v0.9, LMUL=1 and LMUL>1 have the same behaviour; the behaviour depends on whether SLEN<VLEN or SLEN=VLEN.
So "always" is definitely wrong.

I don't see the most relevant characteristic as being SEW < SLEN.
Only when SLEN<VLEN can "MUXING" occur.
For v0.9, SEW>=SLEN means that the SLEN chunk is completely filled for that SEW, and is thus equivalent to SLEN=VLEN.
However, this is expected to be a rare occurrence: it is most likely that ELEN (that is, the maximum element width) is less than SLEN.
I appreciate its inclusion for completeness, but it is likely an outlier in practice.

What is relevant is the implicit CLSTR size in the current design, which is byte-sized.
This interacts with SEW for all values of SEW<SLEN (which is to say, in all but the fringe [pathological?] cases).

I think Grigoris's proposal only suffers in the TAIL of a vector loop, when VL is not a nice multiple of your VLEN and LMUL>1. Imagine your tail ends up with 3 elements and you are at LMUL=4: you will get the 3 elements "vertically" across the register groups, so the loop EPILOG will take "3 issues" instead of potentially "just 1" (assuming the vector unit has more than one lane).

Is this a correct summary of the three proposals?

roger.


On Fri, May 29, 2020 at 9:06 AM Alex Solomatnikov <sols@...> wrote:
Guy,

FPGA energy/power data is not really relevant to general purpose ISA design, unless your goal is to design an ISA only for soft FPGA cores like Nios or MicroBlaze.

Please take a look at the paper I referenced. It has an energy breakdown not only for a simple RISC core but also for a SIMD+VLIW core (with 2 and 3 slots), which is a reasonable approximation of a vector core (with 2 or 3 instructions per cycle). These are based on designs that are industry-competitive in PPA (Tensilica). For the SIMD+VLIW core, instruction fetch/decode is still ~30%.

Here is a quote from Section 4.1: "SIMD and VLIW speed up the application by 10x, decreasing IF energy by 10x, but the percentage of energy going to IF does not change much. IF still consumes more energy than functional units." Figure 4 showing energy breakdown is below.

[Figure 4: energy breakdown]

Alex

On Thu, May 28, 2020 at 9:13 PM Guy Lemieux <glemieux@...> wrote:
Alex,

Keep in mind:

a) I amended my proposal to reduce the code bloat identified by Nick

b) the effect of the bloat is almost entirely about text segment size, not power or instruction bandwidth, because these are vector instructions that are already amortizing their overhead over VLEN/SEW operations (often over 64 ops).

LMUL>1 has almost nothing to do with energy efficiency, it’s almost entirely about storage efficiency (so unused vectors don’t go to waste) and a little bit to do with performance (slightly higher memory and compute utilizations are possible with longer vectors). Instruction fetch energy is dwarfed by the actual compute and memory access energy for VLEN bits per instruction.

I don’t have a citation to give, but I have direct experience implementing and measuring the effect. The FPGA-based vector engine I helped design used less energy per op and offered faster absolute performance than the hard ARM processor used as the host, despite the ARM having a 7x clock speed advantage.

Guy



On Thu, May 28, 2020 at 9:02 PM Alex Solomatnikov <sols@...> wrote:
Code bloat is important - not just the number of load and store instructions but also the additional vsetvl/i instructions. This was one of the reasons for vle8, vle16, and the others.

LMUL>1 is also great for energy/power because a large percentage of energy/power, typically >30%, is spent on instruction fetch/handling even in simple processors [1]. LMUL>1 reduces the number of instructions for a given amount of work and energy/power spent on instruction fetch/handling.

Alex

1. R. Hameed, W. Qadeer, M. Wachs, O. Azizi, et al., "Understanding Sources of Inefficiency in General-Purpose Chips," Proceedings of the 37th International Symposium on Computer Architecture (ISCA), 2010. (dl.acm.org)

On Wed, May 27, 2020 at 4:59 PM Guy Lemieux <glemieux@...> wrote:
Nick, thanks for that code snippet, it's really insightful.

I have a few comments:

a) this is for LMUL=8, the worst-case (most code bloat)

b) this would be automatically generated by a compiler, so visuals are
not meaningful, though code storage may be an issue

c) the repetition of vsetvli and sub instructions is not needed; the
programmer may assume that all vector registers are equal in size

d) the vsetvli / add / sub instructions have minimal runtime due to
behaving like scalar operations

e) the repetition, 8 times (or whatever LMUL you want), of vle/vse and
the add can be mimicked by a change in the ISA to handle a set of
registers in a register group automatically, e.g.:

instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2

we can allow vle to operate on register groups, but done one register
at a time in sequence, by doing this just once:
vle8.v  v0, (a2), m8  // does 8 independent loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc

That way ALL of the code bloat is now gone.

Ciao,
Guy
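A minimal sketch of Nick's loop rewritten with Guy's amended proposal; the ", m8" load/store operand and the casts are hypothetical, not existing RVV 0.9 instructions:

loop:
vsetvli t0, a0, e8,m8
vle8.v  v0, (a1), m8      # hypothetical: 8 sequential LMUL=1 loads into v0..v7
# cast v0..v7 from LMUL=1 layout to LMUL=8 layout here, as in Nick's example
vadd.vi v0, v0, 1
# cast back to LMUL=1 layout before the grouped store
vse8.v  v0, (a1), m8      # hypothetical: 8 sequential LMUL=1 stores from v0..v7
add a1, a1, t0
sub a0, a0, t0
bnez a0, loop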


On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:
>
> I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.
>
> However, I do have a concern about removing LMUL > 1 memory operations regarding code bloat. This is all purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:
>
> # C code:
> # int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
> # keep N in a0 and &x[0] in a1
>
> # "BEFORE" (original RVV code):
> loop:
> vsetvli t0, a0, e8,m8
> vle8.v v0, (a1)
> vadd.vi v0, v0, 1
> vse8.v v0, (a1)
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> # "AFTER" removing LMUL > 1 loads/stores:
> loop:
> vsetvli t0, a0, e8,m8
> mv t1, t0
> mv a2, a1
>
> # loads:
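> # (t1 holds the elements remaining; t2 is the vl granted for each LMUL=1 chunk)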
> vsetvli t2, t1, e8,m1
> vle8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vle8.v v7, (a2)
>
> # cast instructions ...
>
> vsetvli x0, t0, e8,m8
> vadd.vi v0, v0, 1
>
> # more cast instructions ...
> mv t1, t0
> mv a2, a1
>
> # stores:
> vsetvli t2, t1, e8,m1
> vse8.v v0, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v1, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v2, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v3, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v4, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v5, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v6, (a2)
> add a2, a2, t2
> sub t1, t1, t2
> vsetvli t2, t1, e8,m1
> vse8.v v7, (a2)
>
> add a1, a1, t0
> sub a0, a0, t0
> bnez a0, loop
>
> On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
>>
>> The precise data layout pattern does not matter.
>>
>> What matters is that a single distribution pattern is agreed upon to
>> avoid fragmenting the software ecosystem.
>>
>> With my additional restriction, the load/store side of an
>> implementation is greatly simplified, allowing for simple
>> implementations.
>>
>> The main drawback of my restriction is how to avoid the overhead of
>> the cast instruction in an aggressive implementation? The cast
>> instruction must rearrange data to translate between LMUL!=1 and
>> LMUL=1 data layouts; my proposal requires these casts to be executed
>> between any load/stores (which always assume LMUL=1) and compute
>> instructions which use LMUL!=1. I think this can sometimes be done for
>> "free" by carefully planning your compute instructions. For example, a
>> series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
>> the same register group destination can be macro-op fused. I don't
>> think the same thing can be done for vst instructions, unless it
>> macro-op fuses a longer sequence consisting of cast / vst / clear
>> register group (or some other operation that overwrites the cast
>> destination, indicating the cast is superfluous and only used by the
>> stores).
>>
>> Guy
>>
>> On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
>> >
>> > This is v0.8 with SLEN=8.
>>
>>
>>
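A minimal sketch of the macro-op-fusion pattern Guy describes above; "vcast.m8" is an invented placeholder mnemonic for the proposed cast instruction, not an existing RVV op:

vsetvli  t2, t1, e8,m1
vle8.v   v0, (a2)          # LMUL=1 load into v0
add      a2, a2, t2
vle8.v   v1, (a2)          # ...and so on for v1 through v7
# (remaining LMUL=1 loads omitted)
vcast.m8 v0                # hypothetical: reinterpret v0..v7 as one LMUL=8 group;
                           # the final vle8.v plus this cast is the fusion candidate
vsetvli  x0, t0, e8,m8
vadd.vi  v0, v0, 1         # compute on the LMUL=8 register group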






Nick Knight
 

Hi Guy,

Thanks for your reply. I'll leave a few quick responses, and would like to hear opinions from others on the task group.

On Wed, May 27, 2020 at 4:59 PM Guy Lemieux <glemieux@...> wrote:
Nick, thanks for that code snippet, it's really insightful.

I have a few comments:

a) this is for LMUL=8, the worst-case (most code bloat)

Agreed.
 
b) this would be automatically generated by a compiler, so visuals are
not meaningful, though code storage may be an issue

Agreed, especially regarding code storage.
 
c) the repetition of vsetvli and sub instructions is not needed;
programmer may assume that all vector registers are equal in size

When loop strip-mining, we would like to handle the "fringe case" (if any) VLEN-independently. Suppose, e.g., the last loop iteration doesn't consume all eight vector registers in the group. It wasn't obvious to me how to handle this case without more complex fringe-case logic.
 
d) the vsetvli / add / sub instructions have minimal runtime due to
behaving like scalar operations

Agreed: in particular, on machines with separate scalar/vector pipelines their executions can be overlapped with vector instructions. This doesn't hide the energy cost, however.
 
e) the repetition 8 times (or whatever LMUL you want) of vle/vse and
the add can be mimicked by a change in the ISA to handle a set of
registers in a register group automatically, eg:

instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2

we can allow vle to operate on register groups, but done one register
at a time in sequence, by doing this just once:
vle8.v  v0, (a2), m8  // does 8 independent loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc

This sounds to me like a redefinition of unit-stride loads/stores with LMUL > 1, just changing the pattern in which they access the register file. (An observation, not a criticism.)
 
That way ALL of the code bloat is now gone.

It seems that way to me, too.

Best,
Nick
 

Ciao,
Guy


On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:


Guy Lemieux
 

Nick, thanks for that code snippet, it's really insightful.

I have a few comments:

a) this is for LMUL=8, the worst-case (most code bloat)

b) this would be automatically generated by a compiler, so visuals are
not meaningful, though code storage may be an issue

c) the repetition of vsetvli and sub instructions is not needed;
programmer may assume that all vector registers are equal in size

d) the vsetvli / add / sub instructions have minimal runtime due to
behaving like scalar operations

e) the repetition 8 times (or whatever LMUL you want) of vle/vse and
the add can be mimicked by a change in the ISA to handle a set of
registers in a register group automatically, eg:

instead of this 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2

we can allow vle to operate on register groups, but done one register
at a time in sequence, by doing this just once:
vle8.v v0, (a2), m8 // does 8 independent loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc

That way ALL of the code bloat is now gone.
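For illustration, here is a minimal C sketch of the semantics proposed in (e), assuming 8-bit elements so that vl elements equal vl bytes (the array shape and the VLEN_BYTES constant are assumptions made only for this sketch, not anything in the spec):

#include <stdint.h>
#include <string.h>

#define VLEN_BYTES 32   /* e.g. VLEN=256b */

/* Hypothetical "vle8.v v0, (a2), m8": eight independent unit-stride loads,
 * performed one register at a time, register r loading vl bytes starting
 * at a2 + r*vl. */
static void vle8_group_m8(uint8_t vreg[8][VLEN_BYTES],
                          const uint8_t *a2, size_t vl)
{
    for (size_t r = 0; r < 8; r++)
        memcpy(vreg[r], a2 + r * vl, vl);
}

A matching vse8 form would simply mirror this, copying each register out in the other direction.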

Ciao,
Guy

On Wed, May 27, 2020 at 11:29 AM Nick Knight <nick.knight@...> wrote:



Nick Knight
 

I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.

However, I do have a concern about removing LMUL > 1 memory operations regarding code bloat. This is all purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:

# C code:
# int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
# keep N in a0 and &x[0] in a1

# "BEFORE" (original RVV code):
loop:
vsetvli t0, a0, e8,m8
vle8.v v0, (a1)
vadd.vi v0, v0, 1
vse8.v v0, (a1)
add a1, a1, t0
sub a0, a0, t0
bnez a0, loop

# "AFTER" removing LMUL > 1 loads/stores:
loop:
vsetvli t0, a0, e8,m8
mv t1, t0
mv a2, a1

# loads:
vsetvli t2, t1, e8,m1
vle8.v v0, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v1, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v2, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v3, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v4, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v5, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v6, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v7, (a2)

# cast instructions ...

vsetvli x0, t0, e8,m8
vadd.vi v0, v0, 1

# more cast instructions ...
mv t1, t0
mv a2, a1

# stores:
vsetvli t2, t1, e8,m1
vse8.v v0, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v1, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v2, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v3, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v4, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v5, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v6, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v7, (a2)

add a1, a1, t0
sub a0, a0, t0
bnez a0, loop


On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:




Guy Lemieux
 

The precise data layout pattern does not matter.

What matters is that a single distribution pattern is agreed upon to
avoid fragmenting the software ecosystem.

With my additional restriction, the load/store side of an
implementation is greatly simplified, allowing for simple
implementations.

The main drawback of my restriction is the question of how an aggressive
implementation avoids the overhead of the cast instruction. The cast
instruction must rearrange data to translate between LMUL!=1 and
LMUL=1 data layouts; my proposal requires these casts to be executed
between any load/stores (which always assume LMUL=1) and compute
instructions which use LMUL!=1. I think this can sometimes be done for
"free" by carefully planning your compute instructions. For example, a
series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
the same register group destination can be macro-op fused. I don't
think the same thing can be done for vst instructions, unless it
macro-op fuses a longer sequence consisting of cast / vst / clear
register group (or some other operation that overwrites the cast
destination, indicating the cast is superfluous and only used by the
stores).

Guy

On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:

This is v0.8 with SLEN=8.


David Horner
 

This is v0.8 with SLEN=8.



On Wed, May 27, 2020, 07:59 Grigorios Magklis, <grigorios.magklis@...> wrote:
Hi all,

I was wondering if the group has considered before (and rejected) the following
register layout proposal.

In this scheme, there is no SLEN parameter, instead the layout is solely defined
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at the n-th byte of the j-th register (both values starting at 0), as follows:

  n = (i div LMUL)*SEW/8
  j = (i mod LMUL) when LMUL > 1, else j = 0

where 'div' is integer division, e.g., 7 div 4 = 1.
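For concreteness, here is a minimal C sketch of this mapping (an illustrative aside, not part of the original proposal; LMUL is passed as a fraction lmul_num/lmul_den so that fractional LMUL stays in integer arithmetic):

#include <stdio.h>

/* Element i of a register group with the given SEW and LMUL starts at
 * byte *n of register *j of the group, per the formulas above. */
static void element_position(unsigned i, unsigned sew_bits,
                             unsigned lmul_num, unsigned lmul_den,
                             unsigned *n, unsigned *j)
{
    unsigned i_div_lmul = (i * lmul_den) / lmul_num;   /* i div LMUL */
    *n = i_div_lmul * (sew_bits / 8);                  /* byte offset       */
    *j = (lmul_num > 1) ? (i % lmul_num) : 0;          /* register in group */
}

int main(void)
{
    unsigned n, j;
    /* Cross-check against the tables below:
     * SEW=16b, LMUL=4: element 5 should land at byte 2 of v[2*n+1]. */
    element_position(5, 16, 4, 1, &n, &j);
    printf("SEW=16 LMUL=4   i=5 -> j=%u, n=%u\n", j, n);
    /* SEW=16b, LMUL=1/4: element 1 should land at byte 8 of v[n]. */
    element_position(1, 16, 1, 4, &n, &j);
    printf("SEW=16 LMUL=1/4 i=1 -> j=%u, n=%u\n", j, n);
    return 0;
}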

As shown in the examples below, with this scheme, when LMUL=1 the register
layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1
elements are allocated "vertically" across the register group (similar to
SLEN=SEW), and when LMUL<1 elements are evenly spaced-out across the register
(similar to SLEN=SEW/LMUL):

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[2*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[2*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[2*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3

VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    1C  18  14  10   C   8   4   0
v[2*n+1]  1D  19  15  11   D   9   5   1
v[2*n+2]  1E  1A  16  12   E   A   6   2
v[2*n+3]  1F  1B  17  13   F   B   7   3

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       1       -       0


VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   -   -   1   -   -   -   0

VLEN=128b, SEW=32b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       -       -       0

The benefit of this scheme is that software always knows the layout of the elements
in the register just by programming SEW and LMUL, as this is now the same for
all implementations, and thus can optimize code accordingly if it so wishes.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.

SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]      1C    18    14    10     C     8     4     0
v[2*n+1]    1D    19    15    11     D     9     5     1
v[2*n+2]    1E    1A    16    12     E     A     6     2
v[2*n+3]    1F    1B    17    13     F     B     7     3

Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
for software to eliminate any typecast instructions). When LMUL>1 or LMUL<1,
then loads and stores need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.

So, I think this scheme has the same benefits and similar implementation costs
to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that
SLEN=VLEN does not need to space-out elements when load/storing with LMUL<1),
but also similar implementation costs to the v0.9 SLEN<VLEN scheme for machines
with long vectors (which is the whole point of avoiding SLEN=VLEN), but with
the benefit that the layout is architected and we do not introduce
fragmentation to the ecosystem.

Thanks,
Grigorios Magklis

On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:

I appreciate your agreement with my analysis, Krste.  However, I wasn't 
drawing a conclusion.  I lean toward the conclusion that we keep the 
"new v0.9 scheme" below and live with casts.  But I wasn't fully sure 
and wanted to see where the discussion might go.  I suspect each of the 
extra gates for memory access and the slower speed of short vectors is 
sufficient by itself to argue pretty strongly against the "v0.8 scheme" 
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where 
bytes are contiguous within ELEN."  I worked out what I think are the 
outlines of a proof that it's not possible, but I thought I'd suggest 
what I did at a high level first and only try to make the proof more 
rigorous if necessary.

      Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:
EXTERNAL MAIL



I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

  Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n           13      12      11      10       3       2       1       0
   v4*n+1         17      16      15      14       7       6       5       4
   v4*n+2         1B      1A      19      18       B       A       9       8
   v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background.  We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates.  That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n            7       5       3       1       6       4       2       0
   v4*n+1          F       D       B       9       E       C       A       8
   v4*n+2         17      15      13      11      16      14      12      10
   v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually reverting back to an older layout (Figure 7.2 in my
thesis).  This has the property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units.  For example, in the above 0.8 example, if
AVL is less than 17, only half the datapath is used.  This is why I
don't agree with Bill's conclusion.  I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting.  The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting.  I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words.  E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with single lane design I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)
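A toy C sketch of the tagging mechanism described above (the names, the table, and the choice to rewrite the tag on a mismatching read are illustrative assumptions, not a worked-out design):

#include <stdio.h>

#define NUM_VREGS 32

/* EEW (in bits) each architectural vector register was last written with;
 * 0 means "not written yet" in this toy model. */
static unsigned vreg_eew[NUM_VREGS];

static void on_vreg_write(unsigned vd, unsigned eew)
{
    vreg_eew[vd] = eew;                /* remember the layout of the data */
}

/* Returns the number of extra cast uops this read injects. */
static unsigned on_vreg_read(unsigned vs, unsigned eew)
{
    if (vreg_eew[vs] == eew)
        return 0;                      /* layout matches, no extra work */
    printf("inject cast uop: v%u EEW %u -> %u\n", vs, vreg_eew[vs], eew);
    vreg_eew[vs] = eew;                /* rewrite so repeated reads don't re-cast */
    return 1;
}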

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
clustering.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.
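For comparison, a short C example of the explicit-widening style referred to here, using SSE4.1 intrinsics purely as a familiar illustration: widening 16 signed bytes to 16-bit elements takes one instruction for the low half and an extra shuffle plus another instruction for the high half.

#include <smmintrin.h>   /* SSE4.1 */

static void widen_bytes(__m128i v, __m128i *lo16, __m128i *hi16)
{
    *lo16 = _mm_cvtepi8_epi16(v);                     /* sign-extend bytes 0..7  */
    *hi16 = _mm_cvtepi8_epi16(_mm_srli_si128(v, 8));  /* bring bytes 8..15 down, then extend */
}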

Krste


On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@...> said:

| that may be


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me the definitions contained in 1, 2 and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours described
| are provably inconsistent in aggregate.
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul, (for lmul 2,4 or 8)  then cache lines can be
| requested in an order such that when they arrive corresponding segments
| can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, it appears to me that a facility that clusters elements
| smaller than a certain size still allows behaviours 1 and 2.
| Further, for element lengths up to that cluster size the in-register order
| matches the in-memory order.
||
|| The SLEN solution we've had for some time allows for #1 and #2.  We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree reconsidering #2
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here I believe is where you introduce efficacy in implementation.
| Once implementation design considerations are introduced the proof
| becomes much more complex;
| Compounded by objectives and technical tradeoffs and less a mathematics
| rigor issue .

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that perhaps will result in less eco fragmentation.
| However, anticipating a community's response to even the smallest of
| changes is crystal ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim I believe it may be worthwhile exercises to consider
| equivalences of functionality.

| Specifically, vertical stripping vs horizontal interleave for widening
| ops, in-register vs in-memory order for element width alignment.
| I hope that the more we identify the easier it will be to compare them
| and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian
| with concerns about granularity (inherent in big endian and obscured
| with little-endian (aligned vs unaligned still relevant))



||
|| Bill
||
|| On 5/15/20 11:55 AM, Krste Asanovic wrote:
||| EXTERNAL MAIL
|||
|||
|||
||| Date: 2020/5/15
||| Task Group: Vector Extension
||| Chair: Krste Asanovic
||| Co-Chair: Roger Espasa
||| Number of Attendees: ~20
||| Current issues on github: https://urldefense.com/v3/__https://github.com/riscv/riscv-v-spec__;!!EHscmS1ygiU1lA!W3LXrGwuFwNIJ12NX5xQnmMbzk4zgzIDO39xVFEgrQGQSggvT8Zg9M2ElNRv61w$
|||
||| Issues discussed:
|||
||| # MLEN=1 change
|||
||| The new layout of mask registers with fixed MLEN=1 was discussed.  The
||| group was generally in favor of the change, though there is a proposal
||| in flight to rearrange bits to align with bytes.  This might save some
||| wiring but could increase bits read/written for the mask in a
||| microarchitecture.
|||
||| #434 SLEN=VLEN as optional extension
|||
||| Most of the time was spent discussing the possible software
||| fragmentation from having code optimized for SLEN=LEN versus
||| SLEN<VLEN, and how to avoid.  The group was keen to prevent possible
||| fragmentation, so is going to consider several options:
|||
||| - providing cast instructions that are mandatory, so at least
||| SLEN<VLEN code runs correctly on SLEN=VLEN machines.
|||
||| - consider a different data layout that could allow casting up to ELEN
||| (<=SLEN), however these appear to result in even greater variety of
||| layouts or dynamic layouts
|||
||| - invent a microarchitecture that can appear as SLEN=VLEN but
||| internally restrict datapath communication within SLEN width of
||| datapath, or prove this is impossible/expensive
|||
|||
||| # v0.9
|||
||| The group agreed to declare the current version of the spec as 0.9,
||| representing a clear stable step for software and implementors.
|||
|||
|||
|||
|||
|||
|||
||
||


|






David Horner
 

for those not on Github I posted this to #461:

CLSTR can be considered a progressive SLEN=VLEN switch.

Rather than the all-or-nothing in-memory compatibility that the SLEN=VLEN switch provides,
clstr provides either a fixed or a variable degradation for widening operations with SEW<CLSTR.

With a fixed clstr, all operations run at peak performance; only mixed-SEW widening/narrowing operations with SEW<CLSTR are affected.

In-memory and in-register formats align when SEW<=CLSTR.
Software is fully aware of the mapping, and already accommodates this behaviour for many existing architectures. (Analogous to big-endian vs little-endian in many respects, although with big-endian all the bytes are present at each SEW level.)

The clstr parameter is potentially writable, and for higher-end machines it appears very reasonable that they would provide at least two settings for CLSTR: byte and XLEN.
This would provide in-memory alignment at XLEN for code that is unsure of its dependence on it, and an optimization for widening/narrowing at SEW<XLEN for code that is sure it does not depend on the in-memory format in that section of code.

Because clstr is potentially writable, software can avoid performance penalties by other means as well, and leverage other potential structural benefits, turning a liability into a feature.

I have a pending proposal for exactly that idea, “It's not a bug, it's a feature”,
which enables clstr for SLEN=VLEN implementations as well, and allows addressing of even/odd groupings for SLEN<VLEN, too.



On 2020-05-26 5:03 p.m., David Horner via lists.riscv.org wrote:

for those not on Github I posted this to #461:

I gather what was missing from this were examples.
I prefer to consider clstr as a dynamic parameter, for which some implementations will support a range of values.

However, for the sake of examples we can consider the cases where CLSTR=32.
So after a load:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                                  7 6 5 4                         3 2 1 0 SEW=8b

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b

VLEN=256b, SLEN=64, vl=13, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                        C         B A 9 8         7 6 5 4         3 2 1 0 SEW=8b

  

                7       3       6       2       5       1       4       0 SEW=32b

                        B               A               9       C       8  @ reg+1


By inspection, unary and binary single-SEW operations do not affect element order.
However, a widening operation has EEW=16 and EEW=64 respectively for the two examples above, which yields:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b


                        3               1               2               0 SEW=64b

                        7               5               6               4  @ reg+1


Narrowing work in reverse.
When SLEN=VLEN, clstr is irrelevant and effectively infinite, as there is no other SLEN group in which to advance, so the current SLEN chunk has to be used (in the round-robin fashion).
Thank you for the template to use.
I don’t think SLEN = 1/4 VLEN has to be diagrammed.
And of course, store also works in reverse of load.

On 2020-05-26 11:17 a.m., David Horner via lists.riscv.org wrote:

On Tue, May 26, 2020, 04:38 , <krste@...> wrote:

.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting. 
Correct 
I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words. 
I don't understand the reason for this constraint.
E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).
Still not got it.

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b
clstr is not a count but a size.
When CLSTR is 32 this last row is
                7       5       3       1       6       4       2       0 SEW=ELEN=32b
If I understood your diagram correctly.

See #461. It is effectively what SLEN was under v0.8. But potentially configurable.

I'm doing this from my cell phone. I'll work it up better when I'm at my laptop.



Guy Lemieux
 

As a follow-up, the main goal of LMUL>1 is to get better storage efficiency out of the register file, allowing for slightly higher compute unit utilization.

The memory system should not require LMUL>1 to get better bandwidth utilization. An advanced memory system can fuse back-to-back loads (or stores) to get improved bandwidth. Some memory systems may break up vector memory transfers into fixed-size quanta (e.g., cache lines) anyway.

Restricting loads/stores to LMUL=1 therefore primarily impacts instruction issue bandwidth and executable size. These shouldn’t be huge drawbacks.

Guy



On Wed, May 27, 2020 at 6:56 AM Guy Lemieux via lists.riscv.org <glemieux=vectorblox.com@...> wrote:
I support this scheme, but I would further add a restriction on loads/stores to only support LMUL=1 (no register groups). Instead, any data stored in a register group with LMUL!=1 must first be “cast” into registers with LMUL=1. To do this, special cast instructions would be required; likely this cast can be done in-place (same source and dest registers).

The key advantage of this new restriction is to remove all data shuffling from the interface between external memory and the register file — just transfer bytes in memory order. This vastly simplifies “basic” implementations by keeping data shuffling exclusively on the ALU side of the register file.

Guy



On Wed, May 27, 2020 at 4:59 AM Mr Grigorios Magklis <grigorios.magklis@...> wrote:
Hi all,

I was wondering if the group has considered before (and rejected) the following
register layout proposal.

In this scheme, there is no SLEN parameter, instead the layout is solely defined
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at the n-th byte of the j-th register (both values starting at 0), as follows:

  n = (i div LMUL)*SEW/8
  j = (i mod LMUL) when LMUL > 1, else j = 0

where 'div' is integer division, e.g., 7 div 4 = 1.

As shown in the examples below, with this scheme, when LMUL=1 the register
layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1
elements are allocated "vertically" across the register group (similar to
SLEN=SEW), and when LMUL<1 elements are evenly spaced-out across the register
(similar to SLEN=SEW/LMUL):

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[2*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[2*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[2*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3

VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    1C  18  14  10   C   8   4   0
v[2*n+1]  1D  19  15  11   D   9   5   1
v[2*n+2]  1E  1A  16  12   E   A   6   2
v[2*n+3]  1F  1B  17  13   F   B   7   3

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       1       -       0


VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   -   -   1   -   -   -   0

VLEN=128b, SEW=32b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       -       -       0

The benefit of this scheme is that software always knows the layout of the elements
in the register just by programming SEW and LMUL, as this is now the same for
all implementations, and thus can optimize code accordingly if it so wishes.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.

SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]      1C    18    14    10     C     8     4     0
v[2*n+1]    1D    19    15    11     D     9     5     1
v[2*n+2]    1E    1A    16    12     E     A     6     2
v[2*n+3]    1F    1B    17    13     F     B     7     3

Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
for software to eliminate any typecast instructions). When LMUL>1 or LMUL<1,
then loads and stores need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.

So, I think this scheme has the same benefits and similar implementation costs
to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that
SLEN=VLEN does not need to space-out elements when load/storing with LMUL<1),
but also similar implementation costs to the v0.9 SLEN<VLEN scheme for machines
with long vectors (which is the whole point of avoiding SLEN=VLEN), but with
the benefit that the layout is architected and we do not introduce
fragmentation to the ecosystem.

Thanks,
Grigorios Magklis

On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:

I appreciate your agreement with my analysis, Krste.  However, I wasn't 
drawing a conclusion.  I lean toward the conclusion that we keep the 
"new v0.9 scheme" below and live with casts.  But I wasn't fully sure 
and wanted to see where the discussion might go.  I suspect each of the 
extra gates for memory access and the slower speed of short vectors is 
sufficient by itself to argue pretty strongly against the "v0.8 scheme" 
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where 
bytes are contiguous within ELEN."  I worked out what I think are the 
outlines of a proof that it's not possible, but I thought I'd suggest 
what I did at a high level first and only try to make the proof more 
rigorous if necessary.

      Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:
EXTERNAL MAIL



I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

  Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n           13      12      11      10       3       2       1       0
   v4*n+1         17      16      15      14       7       6       5       4
   v4*n+2         1B      1A      19      18       B       A       9       8
   v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background.  We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates.  That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n            7       5       3       1       6       4       2       0
   v4*n+1          F       D       B       9       E       C       A       8
   v4*n+2         17      15      13      11      16      14      12      10
   v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually reverting back to an older layout (Figure 7.2 in my
thesis).  This has property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units.  For example, in the above 0.8 example, if
AVL is less than 17, only half the datapath is used.  This is why I
don't agree with Bill's conclusion.  I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting.  The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting.  I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words.  E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with single lane design I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
clustering.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste


On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@...> said:

| that may be


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me the definitions contained in 1,2 and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours  described
| are provably inconsistent on aggregate..
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul, (for lmul 2,4 or 8)  then cache lines can be
| requested in an order such that when they arrive corresponding segments
| can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, it appears to me that a facility that clusters elements
| smaller than a certain size still allows behaviours 1 and 2.
| Further, for element lengths up to that cluster size, in-register order
| matches the in-memory order.
||
|| The SLEN solution we've had for some time allows for #1 and #2.  We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree with reconsidering #2.
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here, I believe, is where you introduce implementation efficiency considerations.
| Once implementation design considerations are introduced, the proof
| becomes much more complex: it is compounded by objectives and technical
| tradeoffs, and becomes less a matter of mathematical rigor.

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that perhaps will result in less ecosystem fragmentation.
| However, anticipating a community's response to even the smallest of
| changes is crystal-ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim I believe it may be worthwhile exercises to consider
| equivalences of functionality.

| Specifically, vertical striping vs horizontal interleaving for widening
| ops, and in-register vs in-memory order for element width alignment.
| I hope that the more we identify the easier it will be to compare them
| and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian,
| with concerns about granularity (inherent in big-endian and obscured
| with little-endian; aligned vs unaligned is still relevant).



||
|| Bill
||
|| On 5/15/20 11:55 AM, Krste Asanovic wrote:
||| EXTERNAL MAIL
|||
|||
|||
||| Date: 2020/5/15
||| Task Group: Vector Extension
||| Chair: Krste Asanovic
||| Co-Chair: Roger Espasa
||| Number of Attendees: ~20
||| Current issues on github: https://urldefense.com/v3/__https://github.com/riscv/riscv-v-spec__;!!EHscmS1ygiU1lA!W3LXrGwuFwNIJ12NX5xQnmMbzk4zgzIDO39xVFEgrQGQSggvT8Zg9M2ElNRv61w$
|||
||| Issues discussed:
|||
||| # MLEN=1 change
|||
||| The new layout of mask registers with fixed MLEN=1 was discussed.  The
||| group was generally in favor of the change, though there is a proposal
||| in flight to rearrange bits to align with bytes.  This might save some
||| wiring but could increase bits read/written for the mask in a
||| microarchitecture.
|||
||| #434 SLEN=VLEN as optional extension
|||
||| Most of the time was spent discussing the possible software
||| fragmentation from having code optimized for SLEN=VLEN versus
||| SLEN<VLEN, and how to avoid.  The group was keen to prevent possible
||| fragmentation, so is going to consider several options:
|||
||| - providing cast instructions that are mandatory, so at least
||| SLEN<VLEN code runs correctly on SLEN=VLEN machines.
|||
||| - consider a different data layout that could allow casting up to ELEN
||| (<=SLEN), however these appear to result in even greater variety of
||| layouts or dynamic layouts
|||
||| - invent a microarchitecture that can appear as SLEN=VLEN but
||| internally restrict datapath communication within SLEN width of
||| datapath, or prove this is impossible/expensive
|||
|||
||| # v0.9
|||
||| The group agreed to declare the current version of the spec as 0.9,
||| representing a clear stable step for software and implementors.
|||
|||
|||
|||
|||
|||
|||
||
||


|






Guy Lemieux
 

I support this scheme, but I would further add a restriction on loads/stores to support only LMUL=1 (no register groups). Instead, any data stored in a register group with LMUL!=1 must first be “cast” into registers with LMUL=1. To do this, special cast instructions would be required; this cast can likely be done in place (same source and dest registers).

The key advantage of this new restriction is to remove all data shuffling from the interface between external memory and the register file — just transfer bytes in memory order. This vastly simplifies “basic” implementations by keeping data shuffling exclusively on the ALU side of the register file.
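
As a rough sketch of what such a cast amounts to in terms of data
movement (assuming the LMUL>1 layout of the proposal quoted below; the
Python function and names here are only an illustration, not a proposed
instruction):

def cast_to_group(regs_mem_order, sew, lmul):
    # Permute bytes held in memory order (LMUL=1 registers) into an LMUL>1
    # group layout where element i lands in register (i mod LMUL) at byte
    # (i div LMUL)*SEW/8.  Purely a model of the byte shuffling involved.
    vlenb = len(regs_mem_order[0])
    ewb = sew // 8
    flat = b''.join(bytes(r) for r in regs_mem_order)   # bytes in memory order
    out = [bytearray(vlenb) for _ in range(lmul)]
    for i in range(len(flat) // ewb):
        j, n = i % lmul, (i // lmul) * ewb
        out[j][n:n + ewb] = flat[i * ewb:(i + 1) * ewb]
    return out

# Two 16-byte registers loaded in memory order, SEW=32, LMUL=2: the first
# output register ends up holding elements 0,2,4,6 and the second 1,3,5,7.
grouped = cast_to_group([bytes(range(16)), bytes(range(16, 32))], 32, 2)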

Guy



On Wed, May 27, 2020 at 4:59 AM Mr Grigorios Magklis <grigorios.magklis@...> wrote:
Hi all,

I was wondering if the group has considered before (and rejected) the following
register layout proposal.

In this scheme there is no SLEN parameter; instead, the layout is solely defined
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at the n-th byte of the j-th register (both values starting at 0), as follows:

  n = (i div LMUL)*SEW/8
  j = (i mod LMUL) when LMUL > 1, else j = 0

where 'div' is integer division, e.g., 7 div 4 = 1.

As shown in the examples below, with this scheme, when LMUL=1 the register
layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1
elements are allocated "vertically" across the register group (similar to
SLEN=SEW), and when LMUL<1 elements are evenly spaced-out across the register
(similar to SLEN=SEW/LMUL):

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[2*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[2*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[2*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3

VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    1C  18  14  10   C   8   4   0
v[2*n+1]  1D  19  15  11   D   9   5   1
v[2*n+2]  1E  1A  16  12   E   A   6   2
v[2*n+3]  1F  1B  17  13   F   B   7   3

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       1       -       0


VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   -   -   1   -   -   -   0

VLEN=128b, SEW=32b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       -       -       0

The benefit of this scheme is that software always knows the layout of the elements
in the register just by programming SEW and LMUL, as this is now the same for
all implementations, and thus can optimize code accordingly if it so wishes.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.

SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]      1C    18    14    10     C     8     4     0
v[2*n+1]    1D    19    15    11     D     9     5     1
v[2*n+2]    1E    1A    16    12     E     A     6     2
v[2*n+3]    1F    1B    17    13     F     B     7     3

Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
for software to eliminate any typecast instructions). When LMUL>1 or LMUL<1,
then loads and stores need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.

So, I think this scheme has the same benefits and similar implementation costs
to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that
SLEN=VLEN does not need to space-out elements when load/storing with LMUL<1),
but also similar implementation costs to the v0.9 SLEN<VLEN scheme for machines
with long vectors (which is the whole point of avoiding SLEN=VLEN), but with
the benefit that the layout is architected and we do not introduce
fragmentation to the ecosystem.

Thanks,
Grigorios Magklis

On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:

I appreciate your agreement with my analysis, Krste.  However, I wasn't 
drawing a conclusion.  I lean toward the conclusion that we keep the 
"new v0.9 scheme" below and live with casts.  But I wasn't fully sure 
and wanted to see where the discussion might go.  I suspect each of the 
extra gates for memory access and the slower speed of short vectors is 
sufficient by itself to argue pretty strongly against the "v0.8 scheme" 
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where 
bytes are contiguous within ELEN."  I worked out what I think are the 
outlines of a proof that it's not possible, but I thought I'd suggest 
what I did at a high level first and only try to make the proof more 
rigorous if necessary.

      Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:
EXTERNAL MAIL



I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

  Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n           13      12      11      10       3       2       1       0
   v4*n+1         17      16      15      14       7       6       5       4
   v4*n+2         1B      1A      19      18       B       A       9       8
   v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background.  We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates.  That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n            7       5       3       1       6       4       2       0
   v4*n+1          F       D       B       9       E       C       A       8
   v4*n+2         17      15      13      11      16      14      12      10
   v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually reverting back to an older layout (Figure 7.2 in my
thesis).  This has the property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.
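
For concreteness, the two mappings can be written down in a few lines of
Python (a sketch that just reproduces the two tables above; the function
names are mine):

def v08_pos(i, SLEN=128, SEW=32, LMUL=4):
    # v0.8: contiguous elements packed within each SLEN partition
    per_part_reg = SLEN // SEW                    # elements per partition per register
    part, within = divmod(i, per_part_reg * LMUL)
    r, slot = divmod(within, per_part_reg)
    return r, part * SLEN // 8 + slot * SEW // 8  # (register in group, start byte)

def v09_pos(i, VLEN=256, SLEN=128, SEW=32, LMUL=4):
    # v0.9: elements striped round-robin across the SLEN partitions
    nparts = VLEN // SLEN
    r, k = divmod(i, VLEN // SEW)
    part, slot = k % nparts, k // nparts
    return r, part * SLEN // 8 + slot * SEW // 8

# Element 0x12 starts at byte 0x18 of v4*n under v0.8, but at byte 0x4 of
# v4*n+2 under v0.9, matching the tables.
print(v08_pos(0x12), v09_pos(0x12))   # -> (0, 24) (2, 4)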

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units.  For example, in the above 0.8 example, if
AVL is less than 17, only half the datapath is used.  This is why I
don't agree with Bill's conclusion.  I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting.  The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting.  I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words.  E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN, but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with single lane design I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
clustering.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste


On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@...> said:

| that may be


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me the definitions contained in 1,2 and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours  described
| are provably inconsistent on aggregate..
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul, (for lmul 2,4 or 8)  then cache lines can be
| requested in an order such that when they arrive corresponding segments
| can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, I appears to me that a facility that clusters elements
| of smaller than a certain size still allows behaviours 1 and 2.
| Further,for element lengths up to that cluster size in-register order
| matches the in-memory order.
||
|| The SLEN solution we've had for some time allows for #1 and #2.  We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree reconsidering #2
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here I believe is where you introduce efficacy in implementation.
| Once implementation design considerations are introduced the proof
| becomes much more complex;
| Compounded by objectives and technical tradeoffs and less a mathematics
| rigor issue .

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that perhaps will result in less eco fragmentation.
| However, anticipating a communities response to even the smallest of
| changes is crystal ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim I believe it may be worthwhile exercises to consider
| equivalences of functionality.

| Specifically, vertical stripping vs horizontal interleave for widening
| ops, in-register vs in-memory order for element width alignment.
| I hope that the more we identify the easier it will be to compare them
| and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian
| with concerns about granularity (inherent in big endian and obscured
| with little-endian (aligned vs unaligned still relevant))



||
|| Bill
||
|| On 5/15/20 11:55 AM, Krste Asanovic wrote:
||| EXTERNAL MAIL
|||
|||
|||
||| Date: 2020/5/15
||| Task Group: Vector Extension
||| Chair: Krste Asanovic
||| Co-Chair: Roger Espasa
||| Number of Attendees: ~20
||| Current issues on github: https://urldefense.com/v3/__https://github.com/riscv/riscv-v-spec__;!!EHscmS1ygiU1lA!W3LXrGwuFwNIJ12NX5xQnmMbzk4zgzIDO39xVFEgrQGQSggvT8Zg9M2ElNRv61w$
|||
||| Issues discussed:
|||
||| # MLEN=1 change
|||
||| The new layout of mask registers with fixed MLEN=1 was discussed.  The
||| group was generally in favor of the change, though there is a proposal
||| in flight to rearrange bits to align with bytes.  This might save some
||| wiring but could increase bits read/written for the mask in a
||| microarchitecture.
|||
||| #434 SLEN=VLEN as optional extension
|||
||| Most of the time was spent discussing the possible software
||| fragmentation from having code optimized for SLEN=VLEN versus
||| SLEN<VLEN, and how to avoid.  The group was keen to prevent possible
||| fragmentation, so is going to consider several options:
|||
||| - providing cast instructions that are mandatory, so at least
||| SLEN<VLEN code runs correctly on SLEN=VLEN machines.
|||
||| - consider a different data layout that could allow casting up to ELEN
||| (<=SLEN), however these appear to result in even greater variety of
||| layouts or dynamic layouts
|||
||| - invent a microarchitecture that can appear as SLEN=VLEN but
||| internally restrict datapath communication within SLEN width of
||| datapath, or prove this is impossible/expensive
|||
|||
||| # v0.9
|||
||| The group agreed to declare the current version of the spec as 0.9,
||| representing a clear stable step for software and implementors.
|||
|||
|||
|||
|||
|||
|||
||
||


|






Mr Grigorios Magklis
 

Hi all,

I was wondering if the group has considered before (and rejected) the following
register layout proposal.

In this scheme there is no SLEN parameter; instead, the layout is solely defined
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at the n-th byte of the j-th register (both values starting at 0), as follows:

  n = (i div LMUL)*SEW/8
  j = (i mod LMUL) when LMUL > 1, else j = 0

where 'div' is integer division, e.g., 7 div 4 = 1.
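
A small Python sketch of this mapping, just to make the formula concrete
(the helper names and the byte-map printer are illustrative only):

from fractions import Fraction

def element_position(i, sew, lmul):
    # n = (i div LMUL) * SEW/8 ; j = i mod LMUL (j = 0 when LMUL <= 1)
    lmul = Fraction(lmul)
    n = int(Fraction(i) // lmul) * (sew // 8)
    j = i % int(lmul) if lmul > 1 else 0
    return j, n

def print_layout(vlen, sew, lmul):
    vlenb = vlen // 8
    nregs = max(1, int(Fraction(lmul)))
    nelem = int(Fraction(vlen, sew) * Fraction(lmul))
    grid = [['-'] * vlenb for _ in range(nregs)]
    for i in range(nelem):
        j, n = element_position(i, sew, lmul)
        grid[j][n] = format(i, 'X')              # mark the first byte of each element
    for j, row in enumerate(grid):
        print('v[%d]' % j, ' '.join(reversed(row)))   # byte VLEN/8-1 ... 0

print_layout(128, 16, 2)
# v[0] - E - C - A - 8 - 6 - 4 - 2 - 0
# v[1] - F - D - B - 9 - 7 - 5 - 3 - 1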

As shown in the examples below, with this scheme, when LMUL=1 the register
layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1
elements are allocated "vertically" across the register group (similar to
SLEN=SEW), and when LMUL<1 elements are evenly spaced-out across the register
(similar to SLEN=SEW/LMUL):

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[2*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[2*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[2*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3

VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    1C  18  14  10   C   8   4   0
v[2*n+1]  1D  19  15  11   D   9   5   1
v[2*n+2]  1E  1A  16  12   E   A   6   2
v[2*n+3]  1F  1B  17  13   F   B   7   3

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       1       -       0


VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   -   -   1   -   -   -   0

VLEN=128b, SEW=32b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       -       -       0

The benefit of this scheme is that software always knows the layout of the elements
in the register just by programming SEW and LMUL, as this is now the same for
all implementations, and thus can optimize code accordingly if it so wishes.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.

SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]      1C    18    14    10     C     8     4     0
v[2*n+1]    1D    19    15    11     D     9     5     1
v[2*n+2]    1E    1A    16    12     E     A     6     2
v[2*n+3]    1F    1B    17    13     F     B     7     3

Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
for software to eliminate any typecast instructions). When LMUL>1 or LMUL<1,
then loads and stores need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.

So, I think this scheme has the same benefits and similar implementation costs
to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that
SLEN=VLEN does not need to space-out elements when load/storing with LMUL<1),
but also similar implementation costs to the v0.9 SLEN<VLEN scheme for machines
with long vectors (which is the whole point of avoiding SLEN=VLEN), but with
the benefit that the layout is architected and we do not introduce
fragmentation to the ecosystem.

Thanks,
Grigorios Magklis

On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:

I appreciate your agreement with my analysis, Krste.  However, I wasn't 
drawing a conclusion.  I lean toward the conclusion that we keep the 
"new v0.9 scheme" below and live with casts.  But I wasn't fully sure 
and wanted to see where the discussion might go.  I suspect each of the 
extra gates for memory access and the slower speed of short vectors is 
sufficient by itself to argue pretty strongly against the "v0.8 scheme" 
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where 
bytes are contiguous within ELEN."  I worked out what I think are the 
outlines of a proof that it's not possible, but I thought I'd suggest 
what I did at a high level first and only try to make the proof more 
rigorous if necessary.

      Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:
EXTERNAL MAIL



I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

  Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n           13      12      11      10       3       2       1       0
   v4*n+1         17      16      15      14       7       6       5       4
   v4*n+2         1B      1A      19      18       B       A       9       8
   v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background.  We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates.  That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n            7       5       3       1       6       4       2       0
   v4*n+1          F       D       B       9       E       C       A       8
   v4*n+2         17      15      13      11      16      14      12      10
   v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually reverting back to an older layout (Figure 7.2 in my
thesis).  This has property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units.  For example, in the above 0.8 example, if
AVL is less than 17, only half the datapath is used.  This is why I
don't agree with Bill's conclusion.  I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting.  The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting.  I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words.  E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with single lane design I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
clustering.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste


On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@...> said:

| that may be


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me the definitions contained in 1,2 and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours  described
| are provably inconsistent on aggregate..
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul, (for lmul 2,4 or 8)  then cache lines can be
| requested in an order such that when they arrive corresponding segments
| can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, I appears to me that a facility that clusters elements
| of smaller than a certain size still allows behaviours 1 and 2.
| Further,for element lengths up to that cluster size in-register order
| matches the in-memory order.
||
|| The SLEN solution we've had for some time allows for #1 and #2.  We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree reconsidering #2
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here I believe is where you introduce efficacy in implementation.
| Once implementation design considerations are introduced the proof
| becomes much more complex;
| Compounded by objectives and technical tradeoffs and less a mathematics
| rigor issue .

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that perhaps will result in less eco fragmentation.
| However, anticipating a communities response to even the smallest of
| changes is crystal ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim I believe it may be worthwhile exercises to consider
| equivalences of functionality.

| Specifically, vertical stripping vs horizontal interleave for widening
| ops, in-register vs in-memory order for element width alignment.
| I hope that the more we identify the easier it will be to compare them
| and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian
| with concerns about granularity (inherent in big endian and obscured
| with little-endian (aligned vs unaligned still relevant))



||
|| Bill
||
|| On 5/15/20 11:55 AM, Krste Asanovic wrote:
||| EXTERNAL MAIL
|||
|||
|||
||| Date: 2020/5/15
||| Task Group: Vector Extension
||| Chair: Krste Asanovic
||| Co-Chair: Roger Espasa
||| Number of Attendees: ~20
||| Current issues on github: https://urldefense.com/v3/__https://github.com/riscv/riscv-v-spec__;!!EHscmS1ygiU1lA!W3LXrGwuFwNIJ12NX5xQnmMbzk4zgzIDO39xVFEgrQGQSggvT8Zg9M2ElNRv61w$
|||
||| Issues discussed:
|||
||| # MLEN=1 change
|||
||| The new layout of mask registers with fixed MLEN=1 was discussed.  The
||| group was generally in favor of the change, though there is a proposal
||| in flight to rearrange bits to align with bytes.  This might save some
||| wiring but could increase bits read/written for the mask in a
||| microarchitecture.
|||
||| #434 SLEN=VLEN as optional extension
|||
||| Most of the time was spent discussing the possible software
||| fragmentation from having code optimized for SLEN=VLEN versus
||| SLEN<VLEN, and how to avoid.  The group was keen to prevent possible
||| fragmentation, so is going to consider several options:
|||
||| - providing cast instructions that are mandatory, so at least
||| SLEN<VLEN code runs correctly on SLEN=VLEN machines.
|||
||| - consider a different data layout that could allow casting up to ELEN
||| (<=SLEN), however these appear to result in even greater variety of
||| layouts or dynamic layouts
|||
||| - invent a microarchitecture that can appear as SLEN=VLEN but
||| internally restrict datapath communication within SLEN width of
||| datapath, or prove this is impossible/expensive
|||
|||
||| # v0.9
|||
||| The group agreed to declare the current version of the spec as 0.9,
||| representing a clear stable step for software and implementors.
|||
|||
|||
|||
|||
|||
|||
||
||


|






Bill Huffman
 

I appreciate your agreement with my analysis, Krste. However, I wasn't
drawing a conclusion. I lean toward the conclusion that we keep the
"new v0.9 scheme" below and live with casts. But I wasn't fully sure
and wanted to see where the discussion might go. I suspect each of the
extra gates for memory access and the slower speed of short vectors is
sufficient by itself to argue pretty strongly against the "v0.8 scheme"
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where
bytes are contiguous within ELEN." I worked out what I think are the
outlines of a proof that it's not possible, but I thought I'd suggest
what I did at a high level first and only try to make the proof more
rigorous if necessary.

Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:
EXTERNAL MAIL



I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
v4*n           13      12      11      10       3       2       1       0
v4*n+1         17      16      15      14       7       6       5       4
v4*n+2         1B      1A      19      18       B       A       9       8
v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background. We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates. That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
v4*n            7       5       3       1       6       4       2       0
v4*n+1          F       D       B       9       E       C       A       8
v4*n+2         17      15      13      11      16      14      12      10
v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually reverting back to an older layout (Figure 7.2 in my
thesis). This has property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units. For example, in the above 0.8 example, if
AVL is less than 17, only half the datapath is used. This is why I
don't agree with Bill's conclusion. I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting. The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting. I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words. E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with single lane design I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines. I think this is viable in a
complex microarch. Basically, the microarchitecture tags the vector
register with the EEW used to write it. Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle. Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting. The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing. This might be what David is referring to as
clustering.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste


On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@...> said:
| that may be


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me the definitions contained in 1,2 and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours  described
| are provably inconsistent on aggregate..
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul, (for lmul 2,4 or 8)  then cache lines can be
| requested in an order such that when they arrive corresponding segments
| can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, I appears to me that a facility that clusters elements
| of smaller than a certain size still allows behaviours 1 and 2.
| Further,for element lengths up to that cluster size in-register order
| matches the in-memory order.
||
|| The SLEN solution we've had for some time allows for #1 and #2. We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree reconsidering #2
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here I believe is where you introduce efficacy in implementation.
| Once implementation design considerations are introduced the proof
| becomes much more complex;
| Compounded by objectives and technical tradeoffs and less a mathematics
| rigor issue .

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that perhaps will result in less eco fragmentation.
| However, anticipating a communities response to even the smallest of
| changes is crystal ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim I believe it may be worthwhile exercises to consider
| equivalences of functionality.

| Specifically, vertical stripping vs horizontal interleave for widening
| ops, in-register vs in-memory order for element width alignment.
| I hope that the more we identify the easier it will be to compare them
| and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian
| with concerns about granularity (inherent in big endian and obscured
| with little-endian (aligned vs unaligned still relevant))



||
|| Bill
||
|| On 5/15/20 11:55 AM, Krste Asanovic wrote:
||| EXTERNAL MAIL
|||
|||
|||
||| Date: 2020/5/15
||| Task Group: Vector Extension
||| Chair: Krste Asanovic
||| Co-Chair: Roger Espasa
||| Number of Attendees: ~20
||| Current issues on github: https://urldefense.com/v3/__https://github.com/riscv/riscv-v-spec__;!!EHscmS1ygiU1lA!W3LXrGwuFwNIJ12NX5xQnmMbzk4zgzIDO39xVFEgrQGQSggvT8Zg9M2ElNRv61w$
|||
||| Issues discussed:
|||
||| # MLEN=1 change
|||
||| The new layout of mask registers with fixed MLEN=1 was discussed. The
||| group was generally in favor of the change, though there is a proposal
||| in flight to rearrange bits to align with bytes. This might save some
||| wiring but could increase bits read/written for the mask in a
||| microarchitecture.
|||
||| #434 SLEN=VLEN as optional extension
|||
||| Most of the time was spent discussing the possible software
||| fragmentation from having code optimized for SLEN=VLEN versus
||| SLEN<VLEN, and how to avoid. The group was keen to prevent possible
||| fragmentation, so is going to consider several options:
|||
||| - providing cast instructions that are mandatory, so at least
||| SLEN<VLEN code runs correctly on SLEN=VLEN machines.
|||
||| - consider a different data layout that could allow casting up to ELEN
||| (<=SLEN), however these appear to result in even greater variety of
||| layouts or dynamic layouts
|||
||| - invent a microarchitecture that can appear as SLEN=VLEN but
||| internally restrict datapath communication within SLEN width of
||| datapath, or prove this is impossible/expensive
|||
|||
||| # v0.9
|||
||| The group agreed to declare the current version of the spec as 0.9,
||| representing a clear stable step for software and implementors.
|||
|||
|||
|||
|||
|||
|||
||
||


|




David Horner
 

for those not on Github I posted this to #461:

I gather what was missing from this was examples.
I prefer to consider CLSTR a dynamic parameter, for which some implementations will support a range of values.

However, for the sake of example, we can consider the case where CLSTR=32.
So after a load:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                                  7 6 5 4                         3 2 1 0 SEW=8b

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b

VLEN=256b, SLEN=64, vl=13, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                        C         B A 9 8         7 6 5 4         3 2 1 0 SEW=8b

                7       3       6       2       5       1       4       0 SEW=32b

                        B               A               9       C       8  @ reg+1


By inspection, unary and binary single-SEW operations do not affect order.
However, for a widening operation the EEWs become 16 and 64 respectively, which will yield:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b


                        3               1               2               0 SEW=64b

                        7               5               6               4  @ reg+1


Narrowing works in reverse.
When SLEN=VLEN, CLSTR is irrelevant and effectively infinite, as there is no other SLEN group to advance into, so the current SLEN chunk has to be used (in round-robin fashion).
Thank you for the template to use.
I don’t think SLEN = 1/4 VLEN has to be diagrammed.
And of course, stores also work in reverse of loads.
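
If I am reading the tables above correctly, the CLSTR mapping can be
summarized in a few lines of Python (a sketch; the function and
parameter names are mine):

def clstr_pos(i, VLEN=256, SLEN=128, SEW=8, CLSTR=32):
    # Group elements into clusters of CLSTR/SEW consecutive elements, keep
    # each cluster's bytes contiguous, and stripe clusters round-robin
    # across the SLEN partitions, as in the tables above.
    nparts = VLEN // SLEN
    per_cluster = CLSTR // SEW
    r, k = divmod(i, VLEN // SEW)            # register in group, element within register
    c, within = divmod(k, per_cluster)       # cluster index, position within cluster
    part, slot = c % nparts, c // nparts
    return r, part * SLEN // 8 + slot * CLSTR // 8 + within * SEW // 8

# Reproduces the SLEN=64 example: element 0xC of the SEW=8 vector lands at
# byte 0x18 (partition 3), and element 9 of the SEW=32 view lands at byte 8
# of the second register.
print(clstr_pos(0xC, SLEN=64, SEW=8))            # -> (0, 24)
print(clstr_pos(9, SLEN=64, SEW=32, CLSTR=32))   # -> (1, 8)
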
On 2020-05-26 11:17 a.m., David Horner via lists.riscv.org wrote:


On Tue, May 26, 2020, 04:38 , <krste@...> wrote:

.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting. 
Correct 
I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words. 
I don't understand the reason for this constraint.
E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).
Still not got it.

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b
CLSTR is not a count but a size.
When CLSTR is 32, this last row is
                7       5       3       1       6       4       2       0 SEW=ELEN=32b
If I understood your diagram correctly.

See #461. It is effectively what SLEN was under v0.8. But potentially configurable.

I'm doing this from my cell phone; I'll work it up better when I'm at my laptop.


David Horner
 



On Tue, May 26, 2020, 04:38 , <krste@...> wrote:

.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting. 
Correct 
I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words. 
I don't understand the reason for this constraint.
E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).
Still not got it.

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b
CLSTR is not a count but a size.
When CLSTR is 32, this last row is
                7       5       3       1       6       4       2       0 SEW=ELEN=32b
If I understood your diagram correctly.

See #461. It is effectively what SLEN was under v0.8. But potentially configurable.

I'm doing this from my cell phone; I'll work it up better when I'm at my laptop.

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this: there is little incentive to
make SLEN!=ELEN when SLEN<VLEN, which might cut down on variants (although
someone might want to support various ELEN options with a single-lane
design, I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. One issue is that if you keep reading with the wrong EEW you will
generate a lot of additional uops; ideally you would eventually rewrite the
register at that EEW and avoid the wire crossings (sounds like an ISCA
paper...).
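A toy model of that tagging idea, just to pin down the bookkeeping (class and field names are mine and purely illustrative; a real renamed machine would be far more involved):

class VRegFile:
    # Each architectural vector register remembers the EEW it was last
    # written with; reading it under a different EEW charges a cast uop.
    def __init__(self, nregs=32):
        self.write_eew = [None] * nregs
        self.cast_uops = 0

    def write(self, vd, eew):
        self.write_eew[vd] = eew          # retag on every write

    def read(self, vs, eew):
        tag = self.write_eew[vs]
        if tag is not None and tag != eew:
            self.cast_uops += 1           # on-the-fly cast micro-op

vrf = VRegFile()
vrf.write(2, eew=8)    # e.g. a byte load into v2
vrf.read(2, eew=32)    # consumer treats v2 as 32-bit elements -> cast uop
vrf.read(2, eew=32)    # keeps paying until v2 is rewritten at EEW=32
print(vrf.cast_uops)   # 2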

The problem is determining when you would need to insert these micro-ops. That's what the flagging solution I proposed in response to Bill was meant to address. Again, more when I'm on my laptop.

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
clustering.
It is an approach that incorporates the consecutive-byte clustering concept, but it is more limiting.
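A small, deliberately simplistic illustration of the EDIV-style pattern (plain little-endian byte manipulation, nothing RISC-V specific): load the widest unit in memory order, then subdivide it in place.

import struct

memory = bytes(range(16))                 # 16 contiguous bytes in memory
words = struct.unpack('<4I', memory)      # load as four 32-bit elements
# Subdivide each element into four 8-bit pieces, no byte shuffling needed.
sub_elements = [[(w >> (8 * k)) & 0xFF for k in range(4)] for w in words]
print(sub_elements[0])                    # [0, 1, 2, 3] -- memory order kept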

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste


>>>>> On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@...> said:

| That may be.


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me, the definitions in 1, 2, and 3 need to be stated more rigorously
| before I can agree that the constraints/behaviours described
| are provably inconsistent in aggregate.
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul (for lmul of 2, 4, or 8), cache lines can be
| requested in an order such that, when they arrive, the corresponding
| segments can be filled.
| So I'm not sure whether the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, it appears to me that a facility that clusters elements
| smaller than a certain size still allows behaviours 1 and 2.
| Further, for element lengths up to that cluster size, in-register order
| matches the in-memory order.
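One way to see that claim, under my formulation of a clustered layout (the parameters and helper below are illustrative): the memory-byte to register-byte mapping never looks at SEW, so storing at one element width and reloading at another width up to the cluster size puts every byte back in the same register position.

VLEN, SLEN, CLSTR = 256, 128, 32
CB, SECTIONS = CLSTR // 8, VLEN // SLEN

def reg_byte(mem_byte):
    # Byte placement depends only on the byte address, never on SEW.
    cluster, off = divmod(mem_byte, CB)
    return (cluster % SECTIONS) * (SLEN // 8) + (cluster // SECTIONS) * CB + off

# Any two element widths <= CLSTR therefore see identical byte placement,
# which is property #3 restricted to widths up to the cluster size.
print([reg_byte(b) for b in range(8)])    # [0, 1, 2, 3, 16, 17, 18, 19]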
||
|| The SLEN solution we've had for some time allows for #1 and #2.  We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree reconsidering #2
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here, I believe, is where you introduce implementation efficiency.
| Once implementation design considerations are introduced, the proof
| becomes much more complex;
| it is compounded by objectives and technical trade-offs, and becomes less
| a matter of mathematical rigor.

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that will perhaps result in less ecosystem fragmentation.
| However, anticipating a community's response to even the smallest of
| changes is crystal-ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim, I believe it may be a worthwhile exercise to consider
| equivalences of functionality.

| Specifically: vertical striping vs. horizontal interleave for widening
| ops, and in-register vs. in-memory order for element-width alignment.
| I hope that the more equivalences we identify, the easier it will be to
| compare them and evaluate trade-offs.

| I also think it constructive to consider big-endian vs. little-endian,
| with concerns about granularity (inherent in big-endian and obscured
| with little-endian, where aligned vs. unaligned is still relevant).



||
|| Bill
||
|| On 5/15/20 11:55 AM, Krste Asanovic wrote: