As some are not on GitHub, I posted this response to #434 here:
Observations:
- single SEW operations are agnostic to the underlying structure (as Krste noted in a recent doc revision)
- mixed SEW operations (widening & narrowing) have a substantial impact on contiguous SLEN=VLEN
- mixed SEW operations are predominantly SEW <--> 2*SEW
- by-2 interleaving of chunks in SLEN=VLEN at the SEW level aligns well with non-interleaved at 2*SEW
Postulate:
That software can anticipate its need for matching structures for widening/narrowing and the memory overlay model, and make a weighted choice.
I call the current interleave proposal SEW-level interleave (elements are apportioned on a SEW basis amongst the available SLEN chunks in round-robin fashion).
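To make the round-robin apportioning concrete, here is a minimal byte-offset model of SEW-level interleave as described above (my own sketch, not text from the spec or #421; widths are in bits):
```
#include <stdint.h>

/* Sketch: register byte offset of element i under SEW-level
 * round-robin interleave across the VLEN/SLEN chunks. */
static uint32_t sew_interleave_byte_offset(uint32_t i, uint32_t sew,
                                           uint32_t slen, uint32_t vlen)
{
    uint32_t chunks = vlen / slen;      /* number of SLEN chunks      */
    uint32_t chunk  = i % chunks;       /* chunk that receives elem i */
    uint32_t slot   = i / chunks;       /* position within the chunk  */
    return chunk * (slen / 8) + slot * (sew / 8);
}

/* With SLEN=VLEN, chunks == 1 and this degenerates to i * SEW/8,
 * i.e. the in-register layout matches the in-memory layout. */
```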
I thus propose a variant of #421 (Fractional vtype field vfill – Fractional Fill order and Fractional Instruction eLement Location):
INTRLV defines 4 interleave formats:
- SLEN<VLEN (SEW level interleave)
- SLEN=VLEN (proposed as extension, essentially no interleave)
- Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but the upper 1/2 of each SLEN chunk is a gap (undisturbed or agnostic fill).
- Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but the lower 1/2 of each SLEN chunk is a gap (undisturbed or agnostic fill).
A 2-bit vtype vintrlv field defines the application of these formats to various operations; the effect is determined by the kind of operation:
Load/Store will, depending upon the mode:
```
vintrlv level = 0 -- scramble/descramble SEW level encoding
vintrlv level = 3 -- transfer as if SLEN=VLEN (non-interleaved)
vintrlv level = 1 -- load (as if SLEN=VLEN) lower 1/2 of SLEN chunk
                     (upper undisturbed or agnostic filled)
vintrlv level = 2 -- load (as if SLEN=VLEN) upper 1/2 of SLEN chunk
                     (lower undisturbed or agnostic filled)
```
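As a concrete reading of where a load places memory element i under each level, here is a small byte-offset sketch (my interpretation of the four modes above, not text from #421; widths are in bits):
```
#include <stdint.h>

/* Sketch: register byte offset at which memory element i lands on a
 * load, for each vintrlv level (assumed interpretation of the modes
 * listed above). */
static uint32_t vintrlv_load_byte_offset(uint32_t level, uint32_t i,
                                         uint32_t sew, uint32_t slen,
                                         uint32_t vlen)
{
    uint32_t chunk_bytes = slen / 8;
    switch (level) {
    case 0: {                      /* SEW-level round-robin interleave */
        uint32_t chunks = vlen / slen;
        return (i % chunks) * chunk_bytes + (i / chunks) * (sew / 8);
    }
    case 3:                        /* as if SLEN=VLEN: memory order    */
        return i * (sew / 8);
    case 1:                        /* pack into lower half of chunks   */
    case 2: {                      /* pack into upper half of chunks   */
        uint32_t per_half = (slen / 2) / sew;   /* elements per half   */
        uint32_t base     = (level == 2) ? chunk_bytes / 2 : 0;
        return (i / per_half) * chunk_bytes + base
               + (i % per_half) * (sew / 8);
    }
    default:
        return 0;                  /* 2-bit field: levels 0..3 only    */
    }
}
```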
Single-width operations will work on either side of SLEN for vintrlv levels 1 and 2, but identically on all vl elements for vintrlv levels 0 and 3.
Widening operations can operate on either side of the SLEN chunks, providing a 2*SEW set of elements in an SLEN-length chunk (vintrlv levels 1 and 2).
Further, widening operations can operate with one source on one side and the other source on the other side of the SLEN chunks, providing a 2*SEW set of elements in an SLEN-length chunk (vintrlv level 3).
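The element-count arithmetic behind both statements can be checked with a small worked example (hypothetical parameters SLEN=64, SEW=16, chosen only for illustration): the SEW elements held in one half of an SLEN chunk widen to exactly one chunk's worth of 2*SEW elements, so no data needs to cross SLEN chunk boundaries.
```
#include <stdint.h>
#include <stdio.h>

/* Worked example with assumed SLEN=64, SEW=16: the two 16-bit
 * elements in one half of a 64-bit SLEN chunk widen (here a simple
 * zero-extension) to two 32-bit elements, i.e. exactly one full
 * 64-bit SLEN chunk. */
int main(void)
{
    uint16_t half_chunk[2] = { 0x1234, 0xBEEF };  /* one half-chunk  */
    uint32_t full_chunk[2];                       /* one whole chunk */

    for (int i = 0; i < 2; i++)
        full_chunk[i] = (uint32_t)half_chunk[i];  /* SEW -> 2*SEW    */

    printf("%08x %08x\n", (unsigned)full_chunk[0], (unsigned)full_chunk[1]);
    return 0;
}
```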
For further details please read #421.
I created a GitHub issue for this, #434; its text is repeated below.
Krste
Should SLEN=VLEN be an extension?
SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
wiring for wide datapaths such that bytes of different SEWs are laid
out differently in register bytes versus memory bytes, whereas when
SLEN=VLEN, the in-register format matches in-memory format for vectors
of all SEW.
Many vector routines can be written to be agnostic to SLEN, but some
routines can use the vector extension to manipulate data structures
that are not simple arrays of a single-width datatype (e.g., a network
packet). These routines can exploit SLEN=VLEN, and hence the fact that SEW
can be changed to access different element widths within the same vector
register value; many implementations will have SLEN=VLEN.
To support these kinds of routines portably on both SLEN<VLEN and
SLEN=VLEN machines, we could provide SEW "casting" operations that
internally rearrange in-register representations, e.g., converting a
vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
in the halfwords as they would if the vector was held in memory. For
SLEN=VLEN machines, all cast operations are a simple copy. However,
preserving compatibility between both types of machine incurs an
efficiency cost on the common SLEN=VLEN machines, and the "cast"
operation is not necessarily very efficient on the SLEN<VLEN machines
as it requires communication between the SLEN-wide sections, and
reloading the vector from memory with a different SEW might actually be more
efficient, depending on the microarchitecture.
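As one illustration of what such a cast means in terms of bytes (a minimal C model of the semantics, assuming a little-endian machine as on RISC-V; not a proposed implementation), reinterpreting N SEW=8 elements as N/2 SEW=16 elements simply regroups the bytes as they would appear in memory:
```
#include <stdint.h>
#include <string.h>

/* Sketch of the "cast" semantics: view the same bytes, in memory
 * order, at a wider SEW.  On an SLEN=VLEN machine the in-register
 * layout already matches memory order, so hardware could do this as a
 * plain register copy; on an SLEN<VLEN machine bytes would have to
 * move between SLEN sections (or round-trip through memory). */
static void cast_e8_to_e16(const uint8_t *src, uint16_t *dst, size_t n)
{
    /* n is the number of SEW=8 elements and must be even; memcpy
     * expresses "same bytes, wider elements" without aliasing issues
     * (little-endian assumed, as on RISC-V). */
    memcpy(dst, src, n);
}
```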
Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
this where available, avoiding needless casts. A downside would be
that this splits the software ecosystem if code that does not need to
depend on SLEN=VLEN inadvertently requires it. However, software
developers will be motivated to test for SLEN=VLEN to drop the need to
perform cast operations even without an extension, so this split will
likely happen anyway.
The above proposal presupposes that SLEN=VLEN support be part of
the base.
It also postulates that such casting operations are not
necessary, as they can be avoided by judicious use of the INTRLV
facilities.
I may be wrong, and such cast operations may be beneficial.
A second issue either way is whether we should add "cast"
operations. They are primarily useful for the SLEN<VLEN machines,
though they are difficult to implement efficiently there; the SLEN=VLEN
implementation is just a register-register copy. We could choose to
add the cast operations as another optional extension, which is my
preference at this time.
So a separate extension for cast operations is also my current
preference (if needed).
On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:
| Hi Krste,
| On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:
| Could consider later adding "cast" instructions that convert a vector
| of N SEW=8 elements into a vector of N/2 SEW=16 elements by
| concatenating the two bytes (and similar for other combinations of
| source and dest SEWs). These would be a simple move/copy on an
| SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
| with bytes crossing between SLEN sections (probably reusing the memory
| pipeline crossbar in an implementation, to store the source vector in
| its memory format, then load the destination vector in its register
| format). So vector is loaded once from memory as SEW=8, then cast
| into appropriate type to extract other fields. Misaligned words might
| need a slide before casting.
| I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete
| example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They
| currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.
| For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if
| SLEN==VLEN. For example,
| # SEW = 32, LMUL = 4
| v_reinterpret v0, e64, m1
| would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up
| to, or perhaps ignoring VL). And we would perform the inverse operation, "register-group fusion", as follows:
| # SEW = 64, LMUL = 1
| v_reinterpret v0, e32, m4
| Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN ==
| VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden
| from the programmer. Hence, I think it would be useful in developing portable software.
| Best,
| Nick Knight