Re: Vector Task Group minutes 2020/5/15 - precise layout not matter


Guy Lemieux
 

I propose 2 data layouts: memory layout, and internal register group layout.

I am not going to specify which internal register group layout to
operate upon, because I haven't read the 0.9 spec and don't understand
all of the discussion around EEW etc. However, all of the discussion
about layout issues has been from people trying to simplify the
hardware by shortening wires so they don't have to cross lanes. To me,
this is compromising software for the sake of making hardware easy;
there's nothing that says you can't have long, pipelined wires, but
I'll stop arguing this any further. To me, the right "internal
register group" layout would be one that is *useful* to software, in
particular, one that allows easy conversion between A-o-S and S-o-A
formats (array of structs, struct of arrays, ie 2/4/8-way
interleaving/deinterleaving which is often needed but difficult to do
on a vector engine), but it also must easily support this for
different widths of data 8/16/32/64 and for mixed width operations.
Again, I haven't had time to investigate how to best support these
multiple requirements, but I support a *single* data layout for all
CPUs to use to avoid fragmentation.
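
To make the A-o-S / S-o-A requirement concrete, here is a minimal sketch of
k-way deinterleaving and interleaving on plain Python lists. The function
names are hypothetical and this only shows the element reordering software
wants; it says nothing about how a vector unit would implement it.

```python
def deinterleave(aos, k):
    """Split a flat array-of-structs (k fields per struct) into k field arrays."""
    if len(aos) % k != 0:
        raise ValueError("length must be a multiple of the struct size k")
    # Field f of every struct sits at positions f, f+k, f+2k, ...
    return [aos[f::k] for f in range(k)]


def interleave(soa):
    """Inverse of deinterleave: merge k equal-length field arrays into one AoS."""
    k = len(soa)
    n = len(soa[0])
    if any(len(field) != n for field in soa):
        raise ValueError("all field arrays must have the same length")
    return [soa[f][i] for i in range(n) for f in range(k)]
```

For example, with k=2 the sequence x0 y0 x1 y1 ... becomes one array of x
values and one array of y values, and interleave() restores the original.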

The purpose of the cast instruction is to convert between the memory
layout and internal register group layout.

Hence:

a) "register groups" with compute instructions (eg, vadd) operate on
the internal register group layout element order. This way, if
widening is done etc, it can avoid crossing lanes (if that's the spec)
or more easily operate on deinterleaved data. This particular layout
order can be specified as anything we wish (eg, SLEN=32 and VLEN >
32), but the spec should be standardized to a *single* layout.

b) "register groups" with load/store instructions operate using memory
layout, ie SLEN=VLEN layout. Earlier, I think I incorrectly assumed
setting LMUL=1 would enforce this layout; I should have thought it
through more carefully. In particular, this ordering will operate on
each vector register in sequence (v0, v1, v2, v3 for a 4-register
group) rather than the interleaved ordering that might be implied by
the "internal register group layout". That is, elements are
contiguously packed, with the lowest indexed elements in vn, and the
highest ones in vn+k, where k=2/4/8. Thus, a programmer could use k
separate load instructions, or a single group load instruction (as a
shorthand).
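
A minimal sketch of what (b) means, assuming a byte-addressed memory modelled
as a Python bytes object and VLEN given in bytes. The names load_register and
load_group_memory_layout are hypothetical, not instructions from the spec.

```python
def load_register(memory, addr, vlen_bytes):
    """Model of one LMUL=1 load: copy VLEN/8 contiguous bytes into a register."""
    return bytes(memory[addr:addr + vlen_bytes])


def load_group_memory_layout(memory, addr, vlen_bytes, k):
    """Model of a group load using memory layout.

    Register vn+i receives the i-th contiguous VLEN/8-byte chunk, so the
    group load is just shorthand for k separate loads, each with the
    address bumped by VLEN/8.
    """
    return [load_register(memory, addr + i * vlen_bytes, vlen_bytes)
            for i in range(k)]
```

Concatenating the registers of the group in order (vn, vn+1, ...) gives back
the bytes exactly as they appear in memory.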

I added (b) to soften my hard line on "no register groups for
loads/stores" because, as Nick pointed out, it causes a lot of bloat
(and requires integer add to increment the address pointer in addition
to replicating the load or store instructions).


The key advantage of this new restriction is to remove all data shuffling from the interface between external memory and the register file — just transfer bytes in memory order. This vastly simplifies “basic” implementations by keeping data shuffling exclusively on the ALU side of the register file.

I think this is where you envision cast instructions providing the solution, but a restricted set of cast instructions that
- only operate on LMUL=1 structures
- are capable (if not restricted) to occur on the ALU rather than memory side.
casting between SLEN=VLEN layout and some other specified layout
yes to "capable (if not restricted)"

I don't believe either of these restrictions are necessary for v0.9 design when SLEN=VLEN, as LMUL>1 structure is identical to LMUL=1
and micro architectures can fuse a following cast (if as you suggest they are able to be done in-place) to operate on memory side if it so chooses.
can be done on memory-side, but architectural state must change after
a cast (unless the result is discarded by a subsequent operation that
writes to the same destination)

I expect you envision these operators to be elective, only included where needed so performance is not affected adversely.
casting is elective, only where data layout conversion is needed.

If so, I believe the fragmentation concern is reintroduced.
I don't see how... if two layouts are known to the programmer, they
can choose which to use. eg, an OS doing a context switch can choose
to avoid cast operations and simply spill register groups using
"register layout" ordering, because it knows it will read them back
again with the same ordering and the memory format conversion would be
superfluous.
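
A minimal sketch of that argument. The "register layout" used here (element i
of the group goes to register i mod k, slot i div k) is purely an assumption
for illustration and is not the v0.9 SLEN mapping; all function names are
hypothetical.

```python
def cast_to_register_layout(group):
    """Memory layout -> illustrative register layout (element-interleaved)."""
    k = len(group)
    flat = [e for reg in group for e in reg]  # elements in memory order
    return [flat[r::k] for r in range(k)]


def cast_to_memory_layout(group):
    """Inverse cast: illustrative register layout -> memory layout."""
    k = len(group)
    per_reg = len(group[0])
    flat = [group[i % k][i // k] for i in range(k * per_reg)]
    return [flat[r * per_reg:(r + 1) * per_reg] for r in range(k)]


def spill(group):
    """Context-switch save: copy each register out as-is, no cast."""
    return [list(reg) for reg in group]


def restore(saved):
    """Context-switch restore: copy each register back as-is, no cast."""
    return [list(reg) for reg in saved]
```

Since restore(spill(g)) returns g for any layout, the OS never needs to know
which layout the group is in; the casts only matter when the data has to be
seen in memory order.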

If not, then unique characteristics of the cast instructions are required as you allude, and I suspect target format will be important for that to happen.
?

I think this can sometimes be done for
"free" by carefully planning your compute instructions. For example, a
series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
the same register group destination can be macro-op fused.


So far there has been no cast magic-bullet.
Although I agree deferring casting (however it is done) to the ultimate operation is of considerable potential benefit, all of this is quite heavy-handed.
As it is not required for the simple v0.9 SLEN=VLEN model, industry adoption is certainly not assured.
In my world, there would be no option to choose your own SLEN, so
there would be no implementations choosing to support only the simple
SLEN=VLEN. Instead, the spec would specify SLEN=VLEN for memory
layout, and some other combo for register layout (to support register
groups for either widening operations, avoiding lane-crossing, or
de/interleaving between SoA and AoS).

I don't
think the same thing can be done for vst instructions, unless it
macro-op fuses a longer sequence consisting of cast / vst / clear
register group (or some other operation that overwrites the cast
destination, indicating the cast is superfluous and only used by the
stores).

I think you have done a good job at illustrating the complexity of optimizing for mixed SEW operators and especially the casting approach to "fix it".
Therefore, I am trending away from a cast instruction solution.
I don't see how the above leads to your conclusion.

I have not delivered my promised comparison of alternatives.
The considerations you have provided have helped me to come closer to a formulation of it.
To me, CLSTR looks most promising.
Sorry that I haven't read your proposal. I have skimmed it, and I find
it confusing. That scares me.

I suspect you may also be confused by my proposal, which is my fault
for not explaining it well enough. My proposal is incredibly simple
though, which I tend to like due to Occam's razor, and because I
always try to advocate for the programmer :-)

But only as these alternatives are championed, discussed and analyzed do we get a better idea of the nature of this mixed SEW beast.
agreed :-) I fully appreciate your feedback and the time you've invested.

g
