Re: More thoughts on Git update (8a9fbce) Added fractional LMUL

Krste Asanovic

I created a GitHub issue for this, #434; the text is repeated below.

Should SLEN=VLEN be an extension?

SLEN<VLEN introduces internal rearrangements that reduce cross-datapath
wiring in wide datapaths, with the consequence that bytes of different
SEWs are laid out differently in register bytes versus memory bytes,
whereas when SLEN=VLEN, the in-register format matches the in-memory
format for all SEWs.

Many vector routines can be written to be agnostic to SLEN, but some
routines use the vector extension to manipulate data structures that
are not simple arrays of a single-width datatype (e.g., a network
packet). These routines can exploit SLEN=VLEN, where SEW can be
changed to access different element widths within the same vector
register value, and many implementations will have SLEN=VLEN.

To support these kinds of routines portably on both SLEN<VLEN and
SLEN=VLEN machines, we could provide SEW "casting" operations that
internally rearrange in-register representations, e.g., converting a
vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
in the halfwords as they would if the vector was held in memory. For
SLEN=VLEN machines, all cast operations are a simple copy. However,
preserving compatibility between both types of machine incurs an
efficiency cost on the common SLEN=VLEN machines, and the "cast"
operation is not necessarily very efficient on the SLEN<VLEN machines
as it requires communication between the SLEN-wide sections, and
reloading the vector from memory with a different SEW might actually
be more efficient, depending on the microarchitecture.

Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
this where available, avoiding needless casts. A downside would be
that this splits the software ecosystem if code that does not need to
depend on SLEN=VLEN inadvertently requires it. However, software
developers will be motivated to test for SLEN=VLEN to drop the need
to perform cast operations even without an extension, so this split
will likely happen anyway.

A second issue either way is whether we should add "cast"
operations. They are primarily useful on SLEN<VLEN machines, though
they are difficult to implement efficiently there; the SLEN=VLEN
implementation is just a register-register copy. We could choose to
add the cast operations as another optional extension, which is my
preference at this time.

On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:
| Hi Krste,
| On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:

| Could consider later adding "cast" instructions that convert a vector
| of N SEW=8 elements into a vector of N/2 SEW=16 elements by
| concatenating the two bytes (and similar for other combinations of
| source and dest SEWs).  These would be a simple move/copy on an
| SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
| with bytes crossing between SLEN sections (probably reusing the memory
| pipeline crossbar in an implementation, to store the source vector in
| its memory format, then load the destination vector in its register
| format).  So vector is loaded once from memory as SEW=8, then cast
| into appropriate type to extract other fields.  Misaligned words might
| need a slide before casting.

| I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete
| example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They
| currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.

| For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if
| SLEN==VLEN. For example,

| # SEW = 32, LMUL = 4
| v_reinterpret v0, e64, m1

| would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up
| to, or perhaps ignoring VL). And we would perform the inverse operation, "register-group fusion", as follows:

| # SEW = 64, LMUL = 1
| v_reinterpret v0, e32, m4

| Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN ==
| VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden
| from the programmer. Hence, I think it would be useful in developing portable software.

| Best,
| Nick Knight
