
Vector Task Group minutes 2020/5/15 - precise layout not matter


David Horner
 



On 2020-05-27 7:58 p.m., Guy Lemieux wrote:
On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
The precise data layout pattern does not matter.

What matters is that a single distribution pattern is agreed upon to
avoid fragmenting the software ecosystem.
I believe this can be weakened to require only:

selectable distribution patterns that are sufficiently compatible that they
avoid fragmenting the software ecosystem.



On 2020-05-27 10:29 a.m., Guy Lemieux wrote:
> As a follow-up, the main goal of LMUL>1 is to get better storage efficiency out of the register file, allowing for slightly higher compute unit utilization.
>
> The memory system should not require LMUL>1 to get better bandwidth utilization. An advanced memory system can fuse back-to-back loads (or stores) to get improved bandwidth. Some memory systems may break up vector memory transfers into fixed-size quanta (eg, cache lines) anyways.
>
> Restricting loads/stores to LMUL=1 therefore primarily impacts instruction issue bandwidth and executable size. These shouldn’t be huge drawbacks.
>
> Guy
>
>
>
> On Wed, May 27, 2020 at 6:56 AM Guy Lemieux via lists.riscv.org <glemieux=vectorblox.com@...> wrote:
>
>     I support this scheme, but I would further add a restriction on loads/stores to only support LMUL=1 (no register groups). Instead, any data stored in a register group with LMUL!=1 must first be “cast” into registers with LMUL=1. To do this, special cast instructions would be required; likely this cast can be done in-place (same source and dest registers).

I am confused, as it appears you have backed off from this in your response to Nick on 2020-05-27, 7:58 p.m.
There, the load can target multiple registers in the group.

The most puzzling concern I have is: what will these cast instructions actually do?
Presumably apply some mapping from the in-memory format to an internal in-register format?
But which one? Does the precise layout not really matter?
The different layouts considered each have advantages and disadvantages.
From a simplicity and software-fragmentation perspective, the v0.9 SLEN=VLEN layout is perfect.
SLEN can be completely ignored as it matches VLEN; the in-register format matches the in-memory format, even for LMUL>1.
Sweet. But as VLEN increases, the performance impacts increase non-linearly and substantially for an important target group.
So if high-performance, large-VLEN implementations are going to happen, some accommodation must occur.
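To make that concrete, a sketch (VLEN=128, SEW=8, LMUL=2, so v0 and v1 form one group; the SLEN<VLEN behaviour is my reading of v0.9):

  memory bytes:  b0 b1 ... b15 | b16 b17 ... b31
  v0:            b0 b1 ... b15
  v1:            b16 b17 ... b31

With SLEN=VLEN, element i is simply byte i of the group, exactly as in memory. With SLEN<VLEN, elements are instead distributed across SLEN-wide sections of the group, and the in-register order no longer matches memory order.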
>
>     The key advantage of this new restriction is to remove all data shuffling from the interface between external memory and the register file — just transfer bytes in memory order. This vastly simplifies “basic” implementations by keeping data shuffling exclusively on the ALU side of the register file.


I think this is where you envision cast instructions providing the solution, but a restricted set of cast instructions that
 - only operate on LMUL=1 structures
 - are capable (if not restricted) of executing on the ALU rather than the memory side.

I don't believe either of these restrictions is necessary for the v0.9 design when SLEN=VLEN, as the LMUL>1 structure is identical to LMUL=1,
 and a microarchitecture can fuse a following cast (if, as you suggest, casts can be done in-place) to operate on the memory side if it so chooses.




With my additional restriction, the load/store side of an
implementation is greatly simplified, allowing for simple
implementations.

The main drawback of my restriction is the overhead of the cast
instruction in an aggressive implementation. The cast instruction must
rearrange data to translate between LMUL!=1 and LMUL=1 data layouts;
my proposal requires these casts to be executed between any
loads/stores (which always assume LMUL=1) and compute instructions
that use LMUL!=1.
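Spelling that out as I understand it (vcast.m2 below is your hypothetical cast, named purely for illustration; vl bookkeeping is elided):

  vsetvli t0, a0, e8, m1    # loads restricted to LMUL=1
  vle8.v  v4, (a1)          # first register's worth of elements
  add     a1, a1, t0        # advance by vl bytes (SEW=8)
  vle8.v  v5, (a1)          # second register's worth
  vcast.m2 v4               # hypothetical: reinterpret v4,v5 in place
                            #  as one LMUL=2 group in the internal layout
  vsetvli t0, a0, e8, m2
  vadd.vv v4, v4, v4        # compute proceeds on the LMUL=2 group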

I expect you envision these operators to be elective, included only where needed so that performance is not adversely affected.
If so, I believe the fragmentation concern is reintroduced.
If not, then unique characteristics of the cast instructions are required, as you allude, and I suspect the target format will be important for that to happen.

 I think this can sometimes be done for
"free" by carefully planning your compute instructions. For example, a
series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
the same register group destination can be macro-op fused.

So far there has been no cast magic bullet.
Although I agree that deferring casting (however it is done) to the ultimate operation is of considerable potential benefit, all of this is quite heavy-handed.
As it is not required for the simple v0.9 SLEN=VLEN model, industry adoption is certainly not assured.


 I don't
think the same thing can be done for vst instructions, unless it
macro-op fuses a longer sequence consisting of cast / vst / clear
register group (or some other operation that overwrites the cast
destination, indicating the cast is superfluous and only used by the
stores).
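That is, something like this on the store side (again, vcast.m1 is purely illustrative; the vsetvli back to e8,m1 is elided):

  vcast.m1 v4               # hypothetical: convert the v4,v5 group back
                            #  to two LMUL=1 memory-layout registers
  vse8.v  v4, (a1)
  add     a1, a1, t0
  vse8.v  v5, (a1)
  # unless something later overwrites v4,v5, the cast's result remains
  # architecturally live, so the fusion window must prove it dead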
I think you have done a good job of illustrating the complexity of optimizing for mixed-SEW operators, and especially of the casting approach to "fix it".
Therefore, I am trending away from a cast-instruction solution.

I have not delivered my promised comparison of alternatives.
The considerations you have provided have helped me come closer to a formulation of it.
To me, CLSTR looks most promising. 
But only as these alternatives are championed, discussed and analyzed do we get a better idea of the nature of this mixed SEW beast.


Guy

On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
This is v0.8 with SLEN=8.



On 2020-05-27 7:58 p.m., Guy Lemieux wrote:
Nick, thanks for that code snippet, it's really insightful.

I have a few comments:

a) this is for LMUL=8, the worst-case (most code bloat)

b) this would be automatically generated by a compiler, so visuals are
not meaningful, though code storage may be an issue

c) the repetition of vsetvli and sub instructions is not needed; the
programmer may assume that all vector registers are equal in size

d) the vsetvli / add / sub instructions have minimal runtime due to
behaving like scalar operations

e) the 8-fold repetition (or whatever LMUL you want) of vle/vse and
the add can be mimicked by a change in the ISA to handle a set of
registers in a register group automatically, eg:

instead of this, 8 times for v0 to v7:
vle8.v v0, (a2)
add a2, a2, t2
vle8.v v1, (a2)
add a2, a2, t2
... (and so on through v7)

we can allow vle to operate on register groups, but done one register
at a time in sequence, by doing this just once:
vle8.v  v0, (a2), m8  // does 8 independent loads, loading to v0 from
address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc

That way ALL of the code bloat is now gone.
It would appear your premise of applying casts only to LMUL=1 is also gone.
I may be completely wrong, so I would greatly appreciate your expounding on this.
How does this apply to v0.9 with SLEN=VLEN or SLEN<VLEN?
Does it only apply to v0.8?
Thanks.

Ciao,
Guy


Guy Lemieux
 

I propose 2 data layouts: memory layout, and internal register group layout.

I am not going to specify which internal register group layout to
operate upon, because I haven't read the 0.9 spec and don't understand
all of the discussion around EEW etc. However, all of the discussion
about layout issues have been from people trying to simplify the
hardware by shortening wires so they don't have to cross lanes. To me,
this is compromising software at the expense of making hardware easy;
there's nothing that says you can't have long, pipelined wires, but
I'll stop arguing this any further. To me, the right "internal
register group" layout would be one that is *useful* to software, in
particular, one that allows easy conversion between A-o-S and S-o-A
formats (array of structs, struct of arrays, ie 2/4/8-way
interleaving/deinterleaving which is often needed but difficult to do
on a vector engine), but it also must easily support this for
different widths of data 8/16/32/64 and for mixed width operations.
Again, I haven't had time to investigate how to best support these
multiple requirements, but I support a *single* data layout for all
CPUs to use to avoid fragmentation.

The purpose of the cast instruction is to convert between the memory
layout and internal register group layout.

Hence:

a) "register groups" with compute instructions (eg, vadd) operate on
the internal register group layout element order. This way, if
widening is done etc, it can avoid crossing lanes (if that's the spec)
or more easily operate on deinterleaved data. This particular layout
order can be specified as anything we wish (eg, SLEN=32 and VLEN >
32), but the spec should be standardized to a *single* layout.

b) "register groups" with load/store instructions operate using memory
layout, ie SLEN=VLEN layout. Earlier, I think I incorrectly assumed
setting LMUL=1 would enforce this layout; I should have thought it
through more carefully. In particular, this ordering will operate on
each vector register in sequence (v0, v1, v2, v3 for a 4-register
group) rather than the interleaved ordering that might be implied by
the "internal register group layout". That is, elements are
contiguously packed, with the lowest-indexed elements in vn, and the
highest ones in vn+k-1, where k=2/4/8. Thus, a programmer could use k
separate load instructions, or a single group load instruction (as a
shorthand).
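For concreteness, assuming vl = VLEN/8 so each load fills a whole register, with t0 holding that byte count, and borrowing the suffix notation from my earlier m8 example:

  # k = 4 separate LMUL=1 loads, in memory order:
  vle8.v  v8, (a0)
  add     a0, a0, t0
  vle8.v  v9, (a0)
  add     a0, a0, t0
  vle8.v  v10, (a0)
  add     a0, a0, t0
  vle8.v  v11, (a0)

  # or the single group-load shorthand (one instruction, same result):
  vle8.v  v8, (a0), m4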

I added (b) to soften my hard line on "no register groups for
loads/stores" because, as Nick pointed out, it causes a lot of bloat
(and requires an integer add to increment the address pointer, in
addition to replicating the load or store instructions).


The key advantage of this new restriction is to remove all data shuffling from the interface between external memory and the register file — just transfer bytes in memory order. This vastly simplifies “basic” implementations by keeping data shuffling exclusively on the ALU side of the register file.

I think this is where you envision cast instructions providing the solution, but a restricted set of cast instructions that
- only operate on LMUL=1 structures
- are capable (if not restricted) of executing on the ALU rather than the memory side.
casting between SLEN=VLEN layout and some other specified layout
yes to "capable (if not restricted)"

I don't believe either of these restrictions is necessary for the v0.9 design when SLEN=VLEN, as the LMUL>1 structure is identical to LMUL=1,
and a microarchitecture can fuse a following cast (if, as you suggest, casts can be done in-place) to operate on the memory side if it so chooses.
can be done on memory-side, but architectural state must change after
a cast (unless the result is discarded by a subsequent operation that
writes to the same destination)

I expect you envision these operators to be elective, included only where needed so that performance is not adversely affected.
casting is elective, only where data layout conversion is needed.

If so, I believe the fragmentation concern is reintroduced.
I don't see how... if two layouts are known to the programmer, they
can choose which to use. eg, an OS doing a context switch can choose
to avoid cast operations and simply spill register groups using
"register layout" ordering, because it knows it will read them back
again with the same ordering and the memory format conversion would be
superfluous.
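eg, roughly (illustrative; saving vl and vtype themselves is elided):

  vsetvli t0, x0, e8, m8    # vl = VLMAX: the whole eight-register group
  vse8.v  v0, (a0)          # spill v0..v7 raw, in register order, no cast
  ...
  vle8.v  v0, (a0)          # restore with the same ordering; the memory
                            #  format conversion would indeed be superfluous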

If not, then unique characteristics of the cast instructions are required, as you allude, and I suspect the target format will be important for that to happen.
?

I think this can sometimes be done for
"free" by carefully planning your compute instructions. For example, a
series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
the same register group destination can be macro-op fused.


So far there has been no cast magic bullet.
Although I agree that deferring casting (however it is done) to the ultimate operation is of considerable potential benefit, all of this is quite heavy-handed.
As it is not required for the simple v0.9 SLEN=VLEN model, industry adoption is certainly not assured.
In my world, there would be no option to choose your own SLEN, so
there would be no implementations choosing to support only the simple
SLEN=VLEN. Instead, the spec would specify SLEN=VLEN for memory
layout, and some other combo for register layout (to support register
groups for either widening operations, avoiding lane-crossing, or
de/interleaving between SoA and AoS).

I don't
think the same thing can be done for vst instructions, unless it
macro-op fuses a longer sequence consisting of cast / vst / clear
register group (or some other operation that overwrites the cast
destination, indicating the cast is superfluous and only used by the
stores).

I think you have done a good job of illustrating the complexity of optimizing for mixed-SEW operators, and especially of the casting approach to "fix it".
Therefore, I am trending away from a cast-instruction solution.
I don't see how the above leads to your conclusion.

I have not delivered my promised comparison of alternatives.
The considerations you have provided have helped me come closer to a formulation of it.
To me, CLSTR looks most promising.
Sorry that I haven't read your proposal. I have skimmed it, and I find
it confusing. That scares me.

I suspect you may also be confused by my proposal, which is my fault
for not explaining it well enough. My proposal is incredibly simple,
though, which I tend to like due to Occam's razor, and because I
always try to advocate for the programmer :-)

But only as these alternatives are championed, discussed and analyzed do we get a better idea of the nature of this mixed SEW beast.
agreed :-) I fully appreciate your feedback and the time you've invested.

g