Re: Vector Task Group minutes 2020/5/15 - precise layout not matter
I propose 2 data layouts: memory layout, and internal register group layout.
I am not going to specify which internal register group layout to
operate upon, because I haven't read the 0.9 spec and don't understand
all of the discussion around EEW etc. However, all of the discussion
about layout issues has come from people trying to simplify the
hardware by shortening wires so they don't have to cross lanes. To me,
this compromises software for the sake of making hardware easy;
there's nothing that says you can't have long, pipelined wires, but
I'll stop arguing this any further. To me, the right "internal
register group" layout would be one that is *useful* to software, in
particular, one that allows easy conversion between A-o-S and S-o-A
formats (array of structs, struct of arrays, ie 2/4/8-way
interleaving/deinterleaving which is often needed but difficult to do
on a vector engine), but it also must easily support this for
different widths of data 8/16/32/64 and for mixed width operations.
Again, I haven't had time to investigate how to best support these
multiple requirements, but I support a *single* data layout for all
CPUs to use to avoid fragmentation.
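
To make the A-o-S / S-o-A conversion concrete, here is a minimal Python
sketch of k-way deinterleaving and interleaving on a flat array. It only
illustrates the data movement software wants; the function names are my
own for illustration and say nothing about how a vector instruction
would implement it.

```python
def deinterleave(aos, k):
    """AoS -> SoA: split a flat array of k-field structs into k field arrays."""
    if len(aos) % k != 0:
        raise ValueError("length must be a multiple of k")
    # Field f of every struct sits at positions f, f+k, f+2k, ...
    return [aos[f::k] for f in range(k)]


def interleave(soa):
    """SoA -> AoS: merge k equal-length field arrays back into one flat array."""
    n = len(soa[0])
    if any(len(field) != n for field in soa):
        raise ValueError("all field arrays must have the same length")
    return [field[i] for i in range(n) for field in soa]


# Example: four (x, y) pairs stored as x0 y0 x1 y1 ...
points = ["x0", "y0", "x1", "y1", "x2", "y2", "x3", "y3"]
xs, ys = deinterleave(points, 2)
```

The same two functions cover the 2/4/8-way cases by changing k, and
interleave(deinterleave(a, k)) returns the original array.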
The purpose of the cast instruction is to convert between the memory
layout and internal register group layout.
a) "register groups" with compute instructions (eg, vadd) operate on
the internal register group layout element order. This way, if
widening is done etc, it can avoid crossing lanes (if that's the spec)
or more easily operate on deinterleaved data. This particular layout
order can be specified as anything we wish (eg, SLEN=32 and VLEN >
32), but the spec should be standardized to a *single* layout.
b) "register groups" with load/store instructions operate using memory
layout, ie SLEN=VLEN layout. Earlier, I think I incorrectly assumed
setting LMUL=1 would enforce this layout; I should have thought it
through more carefully. In particular, this ordering will operate on
each vector register in sequence (v0, v1, v2, v3 for a 4-register
group) rather than the interleaved ordering that might be implied by
the "internal register group layout". That is, elements are
contiguously packed, with the lowest indexed elements in vn, and the
highest ones in vn+k, where k=2/4/8. Thus, a programmer could use k
separate load instructions, or a single group load instruction (as a
I added (b) to soften my hard line on "no register groups for
loads/stores" because, as Nick pointed out, it causes a lot of bloat
(and requires integer add to increment the address pointer in addition
to replicating the load or store instructions).
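
As a rough sketch of (a), (b) and the cast, the Python below models a
register group as a list of lists. The memory layout follows the
description above (contiguous packing, lowest elements in vn). The
"interleaved" register layout is purely HYPOTHETICAL, picked only so
the cast has something to convert to; I am not proposing it as the
layout, and all function names are made up for illustration.

```python
def memory_layout(i, per_reg):
    """Memory (SLEN=VLEN) layout: element i -> (register offset, slot)."""
    return i // per_reg, i % per_reg


def interleaved_layout(i, k):
    """HYPOTHETICAL register layout: elements dealt round-robin over k registers."""
    return i % k, i // k


def load_group(data, k, per_reg):
    """Group load in memory layout: fills vn, vn+1, ... in sequence."""
    regs = [[None] * per_reg for _ in range(k)]
    for i, x in enumerate(data[: k * per_reg]):
        r, s = memory_layout(i, per_reg)
        regs[r][s] = x
    return regs


def cast_to_register_layout(regs):
    """Cast: permute a group from memory layout to the hypothetical layout."""
    k, per_reg = len(regs), len(regs[0])
    out = [[None] * per_reg for _ in range(k)]
    for i in range(k * per_reg):
        r, s = memory_layout(i, per_reg)
        r2, s2 = interleaved_layout(i, k)
        out[r2][s2] = regs[r][s]
    return out


def cast_to_memory_layout(regs):
    """Inverse cast: back to contiguous memory layout before a store."""
    k, per_reg = len(regs), len(regs[0])
    out = [[None] * per_reg for _ in range(k)]
    for i in range(k * per_reg):
        r, s = memory_layout(i, per_reg)
        r2, s2 = interleaved_layout(i, k)
        out[r][s] = regs[r2][s2]
    return out


group = load_group(list(range(8)), k=2, per_reg=4)
```

With k=2 and four elements per register, the load leaves elements 0..3
in the first register and 4..7 in the second; the cast rearranges them,
and casting back restores the memory order.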
casting between SLEN=VLEN layout and some other specified layout
> The key advantage of this new restriction is to remove all data shuffling from the interface between external memory and the register file: just transfer bytes in memory order. This vastly simplifies "basic" implementations by keeping data shuffling exclusively on the ALU side of the register file.
yes to "capable (if not restricted)"
> I don't believe either of these restrictions are necessary for v0.9 design when SLEN=VLEN, as LMUL>1 structure is identical to LMUL=1
can be done on memory-side, but architectural state must change after
a cast (unless the result is discarded by a subsequent operation that
writes to the same destination)
> I expect you envision these operators to be elective, only included where needed so performance is not affected adversely.
casting is elective, only where data layout conversion is needed.
> If so, I believe the fragmentation concern is reintroduced.
I don't see how... if two layouts are known to the programmer, they
can choose which to use. eg, an OS doing a context switch can choose
to avoid cast operations and simply spill register groups using
"register layout" ordering, because it knows it will read them back
again with the same ordering and the memory format conversion would be
> If not, then unique characteristics of the cast instructions are required as you allude, and I suspect target format will be important for that to happen.
?
> I think this can sometimes be done for
In my world, there would be no option to choose your own SLEN, so
there would be no implementations choosing to support only the simple
SLEN=VLEN. Instead, the spec would specify SLEN=VLEN for memory
layout, and some other combo for register layout (to support register
groups for either widening operations, avoiding lane-crossing, or
de/interleaving between SoA and AoS).
> I don't
I don't see how the above leads to your conclusion.
> I have not delivered my promised comparison of alternatives.
Sorry that I haven't read your proposal. I have skimmed it, and I find
it confusing. That scares me.
I suspect you may also be confused by my proposal, which is my fault
for not explaining it well enough. My proposal is incredibly simple
though, which I tend to like due to Occam's razor, and because I
always try to advocate for the programmer :-)
> But only as these alternatives are championed, discussed and analyzed do we get a better idea of the nature of this mixed SEW beast.
agreed :-) I fully appreciate your feedback and the time you've invested.