On 4/27/20 7:02 AM, Krste Asanovic wrote:
I created a GitHub issue for this, #434 - text repeated below,

Krste
Should SLEN=VLEN be an extension?
SLEN<VLEN introduces internal rearrangements that reduce cross-datapath wiring in wide datapaths, such that bytes of different SEWs are laid out differently in register bytes versus memory bytes. When SLEN=VLEN, the in-register format matches the in-memory format for vectors of all SEWs.
Many vector routines can be written to be agnostic to SLEN, but some routines use the vector extension to manipulate data structures that are not simple arrays of a single-width datatype (e.g., a network packet). These routines can exploit SLEN=VLEN, under which SEW can be changed to access different element widths within the same vector register value, and many implementations will have SLEN=VLEN.
To support these kinds of routines portably on both SLEN<VLEN and SLEN=VLEN machines, we could provide SEW "casting" operations that internally rearrange in-register representations, e.g., converting a vector of N SEW=8 bytes to N/2 SEW=16 halfwords, with bytes appearing in the halfwords as they would if the vector were held in memory. For SLEN=VLEN machines, all cast operations are a simple copy. However, preserving compatibility between both types of machine incurs an efficiency cost on the common SLEN=VLEN machines, and the cast operation is not necessarily very efficient on the SLEN<VLEN machines either, as it requires communication between the SLEN-wide sections; reloading the vector from memory with a different SEW might actually be more efficient, depending on the microarchitecture.
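To pin down the intended semantics, here is a minimal scalar C model (a sketch only; the function name is illustrative, and little-endian byte order is assumed). The destination is defined by the memory image: the result must equal what a store at the old SEW followed by a load at the new SEW would produce.

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch: reference semantics of a cast from n SEW=8 elements to
     * n/2 SEW=16 elements, assuming little-endian byte order and even n.
     * The destination halfwords contain the source bytes as they would
     * appear if the vector were stored to memory at SEW=8 and reloaded
     * at SEW=16.  On an SLEN=VLEN machine the in-register bytes are
     * already in this order, so the cast is a plain copy. */
    static void cast_e8_to_e16(const uint8_t *src, uint16_t *dst, size_t n)
    {
        for (size_t j = 0; j < n / 2; j++)
            dst[j] = (uint16_t)((uint16_t)src[2 * j]            /* low byte  */
                              | (uint16_t)src[2 * j + 1] << 8); /* high byte */
    }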
Making SLEN=VLEN an extension (Zveqs?) enables software to exploit this where available, avoiding needless casts. A downside is that this splits the software ecosystem if code that does not need to depend on SLEN=VLEN inadvertently requires it. However, software developers will be motivated to test for SLEN=VLEN to drop the need to perform cast operations even without an extension, so this split will likely happen anyway.

It may be that the machines with SLEN=VLEN are the same machines where it is attractive to use vectors for such code: machines where vectors provide larger registers and some parallelism, rather than machines where vector instructions usually complete in one or a few cycles and that wouldn't deal well with irregular operations. That probably increases the value of an extension. On the other hand, adding casting operations would seem to decrease the value of an extension (see below).

A second issue either way is whether we should add "cast" operations. They are primarily useful for the SLEN<VLEN machines, though they are difficult to implement efficiently there; the SLEN=VLEN implementation is just a register-register copy. We could choose to add the cast operations as another optional extension, which is my preference at this time.
Where SLEN<VLEN, cast operations might be implemented as vector register gather operations with element index values determined by SLEN, VLEN, and SEW; where SLEN=VLEN, they would be moves. If we then add casts, would an SLEN=VLEN extension still be valuable?

Bill
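To make that concrete, the following C sketch builds the SEW=8 vrgather index table for such a cast. It assumes the draft striped layout in which, with S = VLEN/SLEN sections, element i of width sew sits in section i % S at slot i / S within that section; all helper names are illustrative.

    /* Physical byte offset within the register of byte b of element i,
     * under the assumed striped layout. */
    static unsigned phys_byte(unsigned vlen, unsigned slen,
                              unsigned sew, unsigned i, unsigned b)
    {
        unsigned S = vlen / slen;          /* number of SLEN sections */
        return (i % S) * (slen / 8)        /* start of its section    */
             + (i / S) * (sew / 8)         /* slot within the section */
             + b;
    }

    /* Inverse mapping at SEW=8: which logical byte index occupies
     * physical byte offset p? */
    static unsigned logical8(unsigned vlen, unsigned slen, unsigned p)
    {
        unsigned S = vlen / slen;
        return (p % (slen / 8)) * S + p / (slen / 8);
    }

    /* idx[k] = source logical byte for destination logical byte k, such
     * that a vrgather at SEW=8 turns a register holding sew_src-wide
     * elements into the layout a sew_dst-wide load of the same memory
     * image would produce.  idx must hold vlen/8 entries. */
    static void build_cast_indices(unsigned vlen, unsigned slen,
                                   unsigned sew_src, unsigned sew_dst,
                                   unsigned char *idx)
    {
        unsigned ws = sew_src / 8, wd = sew_dst / 8;
        for (unsigned j = 0; j < vlen / sew_dst; j++)   /* dest elements */
            for (unsigned b = 0; b < wd; b++) {         /* bytes within  */
                unsigned mem = j * wd + b;              /* memory offset */
                unsigned sp = phys_byte(vlen, slen, sew_src,
                                        mem / ws, mem % ws);
                unsigned dp = phys_byte(vlen, slen, sew_dst, j, b);
                idx[logical8(vlen, slen, dp)] = logical8(vlen, slen, sp);
            }
    }

Note that with SLEN=VLEN (S=1) both mappings collapse to the identity, so the table is idx[k] = k and the cast degenerates to a register move, consistent with the point above.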
On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:
| Hi Krste,
| On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:
| Could consider later adding "cast" instructions that convert a vector of N SEW=8 elements into a vector of N/2 SEW=16 elements by concatenating the two bytes (and similar for other combinations of source and dest SEWs). These would be a simple move/copy on an SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine with bytes crossing between SLEN sections (probably reusing the memory pipeline crossbar in an implementation, to store the source vector in its memory format, then load the destination vector in its register format). So the vector is loaded once from memory as SEW=8, then cast into the appropriate type to extract other fields. Misaligned words might need a slide before casting.
| I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.
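The SLEN-independent result wanted here can be modeled in scalar C by defining both reinterprets through the memory image (a sketch only; the 64-bit increment is an arbitrary stand-in for the real arithmetic, and little-endian order plus an even VL are assumed):

    #include <stdint.h>
    #include <string.h>

    /* Sketch of the EPI pattern: view vl e32 elements as vl/2 e64
     * elements, do 64-bit arithmetic, then view the result as e32
     * again.  Defining both reinterprets via the memory image (memcpy)
     * makes the result independent of SLEN; the +1 stands in for the
     * real 64-bit arithmetic. */
    static void crypto_kernel(uint32_t *v, size_t vl)   /* vl even */
    {
        uint64_t wide[64];                      /* assumes vl/2 <= 64    */
        memcpy(wide, v, vl * sizeof(uint32_t)); /* reinterpret e32->e64  */
        for (size_t j = 0; j < vl / 2; j++)
            wide[j] += 1;                       /* 64-bit arithmetic     */
        memcpy(v, wide, vl * sizeof(uint32_t)); /* reinterpret e64->e32  */
    }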
| For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if SLEN==VLEN. For example,
| # SEW = 32, LMUL = 4
| v_reinterpret v0, e64, m1
| would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up to, or perhaps ignoring, VL). And we would perform the inverse operation, "register-group fusion", as follows:
| # SEW = 64, LMUL = 1
| v_reinterpret v0, e32, m4
| Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN == VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden from the programmer. Hence, I think it would be useful in developing portable software.
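On that common case: appended to the index-table sketch above, a quick check (again only a sketch, with illustrative VLEN/SLEN values) confirms that with SLEN == VLEN the cast table is the identity permutation, i.e., a register move, while SLEN < VLEN yields a genuine cross-section permutation:

    #include <stdio.h>

    /* Uses build_cast_indices() from the sketch above (same file). */
    int main(void)
    {
        unsigned char idx[32];               /* VLEN=256 -> 32 bytes */

        /* SLEN == VLEN: the e8->e16 cast table is the identity. */
        build_cast_indices(256, 256, 8, 16, idx);
        for (unsigned k = 0; k < 32; k++)
            if (idx[k] != k) {
                puts("not identity");
                return 1;
            }
        puts("SLEN==VLEN: cast is the identity (a register move)");

        /* SLEN < VLEN: a genuine cross-section byte permutation. */
        build_cast_indices(256, 128, 8, 16, idx);
        for (unsigned k = 0; k < 32; k++)
            printf("%u%c", idx[k], k == 31 ? '\n' : ' ');
        return 0;
    }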
| Best,
| Nick Knight