More thoughts on Git update (8a9fbce) Added fractional LMUL

Krste Asanovic

On Sun, 26 Apr 2020 00:47:34 -0400, "David Horner" <ds2horner@...> said:
| The aspect that will probably be most problematic for programmers is the
| loss of the memory-mapping paradigm.

| Whereas adjacent bytes in memory are in the same or adjacent words
| (ditto for halfwords and doubles),
|       once stored in vector registers this will no longer hold when
| SLEN <= 1/4 VLEN.
| Indeed, in memory consecutive bytes advance through halfwords, words,
| and doubles,
|    but in vector registers with SLEN<= 1/2 VLEN, they jump to
| consecutive SLEN chunks.

Application programmers are not supposed to rely on the underlying
mapping in their code. If they only access a vector register using
element indices, and never access values stored in a vector register
group with more than one SEW/LMUL setting, then SLEN and the
in-register element-to-byte mapping should be transparent to them.

Debuggers, emulators, and other tools that look at the register values
obviously have to parse the register layout, which is why we're
standardizing SLEN as a parameter, but application programmers
shouldn't have to worry about the data layout.
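To make the distinction concrete, here is a toy Python model of a striped register layout. The striping rule below (consecutive elements jump to consecutive SLEN sections) is illustrative only, not the exact mapping from the spec tables, but it shows why element-index access is layout-independent while raw byte positions are not:

```python
# Illustrative model of an SLEN-striped vector register (a sketch, not
# the exact spec mapping): the register is split into VLEN/SLEN
# sections, and consecutive elements are striped across sections.

def element_byte_offsets(idx, sew, vlen, slen):
    """Return the register byte offsets holding element `idx` (bits in SEW/VLEN/SLEN)."""
    nsec = vlen // slen              # number of SLEN sections
    sec = idx % nsec                 # consecutive elements jump sections
    slot = idx // nsec               # position within the section
    ebytes = sew // 8
    base = sec * (slen // 8) + slot * ebytes
    return list(range(base, base + ebytes))

# With SLEN == VLEN the layout matches memory order:
assert element_byte_offsets(1, 8, 256, 256) == [1]
# With SLEN == VLEN/2, consecutive SEW=8 elements are 16 bytes apart:
assert element_byte_offsets(0, 8, 256, 128) == [0]
assert element_byte_offsets(1, 8, 256, 128) == [16]
# Reading the same register at SEW=16 sees different bytes than the two
# SEW=8 elements above -- hence the one-SEW/LMUL-setting caveat:
assert element_byte_offsets(0, 16, 256, 128) == [0, 1]
```

A debugger reconstructing register contents would need `element_byte_offsets` (i.e., SLEN); code that only indexes elements never observes it.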

| Due to this SLEN relative to VLEN dependency,
|     it is at least as hard to get one's mind around (to grok)
| as the various big-endian formats.

| It may prove challenging to port code that assumes the memory-mapping
| model in overlapping registers of differing power-of-two widths.
| I have no immediate solution.

Do you have an example of such code? For which architecture?

I can see a case where vector operations are being used to accelerate
operations on in-memory multi-byte data structures, e.g., an IP packet
containing a mix of SEW=8,16,32 fields. A SEW=8 vector load of an
in-memory structure can always be accessed using element byte indices
to obtain the same in-register byte mapping as for an in-memory data
structure, but it is not efficient to extract or operate on a multi-byte
field from the register in the presence of SLEN (whereas this is trivial
when SLEN=VLEN).

We could consider later adding "cast" instructions that convert a
vector of N SEW=8 elements into a vector of N/2 SEW=16 elements by
concatenating the two bytes (and similarly for other combinations of
source and destination SEWs). These would be a simple move/copy on an
SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine,
with bytes crossing between SLEN sections (probably reusing the memory
pipeline crossbar in an implementation: store the source vector in
its memory format, then load the destination vector in its register
format). So the vector is loaded once from memory as SEW=8, then cast
into the appropriate type to extract the other fields. Misaligned words
might need a slide before casting.
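The store-in-memory-format / reload-in-register-format path can be sketched in Python. The striping rule and the function names are assumptions for illustration, not the spec mapping, but the cast semantics follow the description above: identity on SLEN=VLEN, a cross-section permute otherwise.

```python
# Sketch of the proposed SEW=8 -> SEW=16 cast under an assumed striped
# layout: recover memory order, then re-stripe at the wider SEW.

def stripe(mem_bytes, sew, vlen, slen):
    """Place memory-order bytes into a striped register image (bits in args)."""
    nsec, ebytes = vlen // slen, sew // 8
    reg = bytearray(vlen // 8)
    for idx in range(len(mem_bytes) // ebytes):
        sec, slot = idx % nsec, idx // nsec
        base = sec * (slen // 8) + slot * ebytes
        reg[base:base + ebytes] = mem_bytes[idx * ebytes:(idx + 1) * ebytes]
    return bytes(reg)

def unstripe(reg, sew, vlen, slen):
    """Recover the memory-order bytes from a striped register image."""
    nsec, ebytes = vlen // slen, sew // 8
    out = bytearray(len(reg))
    for idx in range(len(reg) // ebytes):
        sec, slot = idx % nsec, idx // nsec
        base = sec * (slen // 8) + slot * ebytes
        out[idx * ebytes:(idx + 1) * ebytes] = reg[base:base + ebytes]
    return bytes(out)

def cast_sew8_to_sew16(reg, vlen, slen):
    # "Store the source in memory format, then load the destination in
    # register format" -- the memory-pipeline-crossbar implementation path.
    return stripe(unstripe(reg, 8, vlen, slen), 16, vlen, slen)

mem = bytes(range(32))                     # a 32-byte in-memory structure
v8 = stripe(mem, 8, 256, 128)              # SEW=8 vector load, SLEN<VLEN
v16 = cast_sew8_to_sew16(v8, 256, 128)     # cast to SEW=16
assert v16 == stripe(mem, 16, 256, 128)    # matches a direct SEW=16 load
# On an SLEN=VLEN machine the cast is a plain copy:
assert cast_sew8_to_sew16(stripe(mem, 8, 256, 256), 256, 256) == mem
```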

An alternative approach, without adding new instructions, would be to
load the in-memory structure several times with different SEW
formats, or, more simply, to use scalar instructions to process the
fields in the in-memory struct and use the vector instructions only to
shuffle structures around in memory.
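The multiple-load alternative can be modeled as below: each load at a given SEW already yields the fields of that width as elements, so no in-register byte extraction is needed. The packet-like field layout is made up for illustration.

```python
# Sketch of the one-load-per-field-width alternative: the same memory
# bytes viewed at SEW=8, 16, and 32. Fields here are hypothetical.
import struct

packet = struct.pack("<BBHI", 4, 5, 0x1234, 0xDEADBEEF)  # 8/8/16/32-bit fields

# "SEW=8 load": every byte is an element.
sew8 = list(packet)
# "SEW=16 load" of the same bytes: every halfword is an element.
sew16 = [int.from_bytes(packet[i:i + 2], "little") for i in range(0, len(packet), 2)]
# "SEW=32 load": every word is an element.
sew32 = [int.from_bytes(packet[i:i + 4], "little") for i in range(0, len(packet), 4)]

assert sew8[0] == 4 and sew8[1] == 5   # byte fields, from the SEW=8 load
assert sew16[1] == 0x1234              # halfword field, from the SEW=16 load
assert sew32[1] == 0xDEADBEEF          # word field, from the SEW=32 load
```

The cost is the extra memory traffic of reloading the structure once per field width, which is what the proposed cast instructions would avoid.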

Yet another alternative approach is to make SLEN<VLEN an architectural
option that software has to deal with, but that will fragment the
software ecosystem.


