Re: More thoughts on Git update (8a9fbce) Added fractional LMUL


David Horner
 



On 2020-04-27 1:51 p.m., Bill Huffman wrote:

Hi David,

I don't understand your observation about "mixed SEW operations (widening & narrowing)..." 

              "mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN"

I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. For widening operations, rather than the outputs aligning within a data lane, the results are shuffled up to their appropriate slot in the destination register, and into the vd+1 register if vl >= (VLMAX of a single physical register).

I understand this has a substantial impact on most implementations.
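A minimal Python sketch of the placement I am describing (VLEN=128 and SEW=8 are illustrative assumptions, not values from the thread):

```
# Model where widened results land when the machine behaves as if
# SLEN=VLEN: destination elements sit in memory order across vd and
# vd+1, regardless of the physical data-lane width.

VLEN = 128  # bits per vector register (illustrative assumption)
SEW = 8     # source element width in bits (illustrative assumption)

def widened_slot(i):
    """(register offset from vd, byte offset) of widened element i."""
    bit = i * 2 * SEW                # destination elements are 2*SEW wide
    return bit // VLEN, (bit % VLEN) // 8

vl = VLEN // SEW                     # one full source register of elements
for i in range(vl):
    reg, byte = widened_slot(i)
    print(f"element {i:2d} -> vd+{reg}, byte {byte:2d}")
# Elements 0..7 fill vd; elements 8..15 are shuffled up into vd+1.
```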

But we may have different interpretations of what Krste meant by SLEN=VLEN.
He floated an option that VLEN be reduced to SLEN, but my reading of this proposal doesn't jibe with that alternative.


The conditions required for impact seem to me much stronger.  Arbitrary widening & narrowing mixing of SEW is fine (unless SEW>SLEN, which would be, at best, unusual).  The conditions to even be able to observe SLEN in real code seem to involve using a vector to represent an irregular structure - and probably one with elements not naturally aligned.

And I'm afraid I'm not following you. I agreed with:

SEW>SLEN, which would be, at best, unusual

but I'm struggling with:

The conditions required for impact seem to me much stronger.  ---- Stronger than "lane misalignment"?


Arbitrary widening & narrowing mixing of SEW is fine   ---- In the model I describe above, or is there another interpretation I'm not fathoming?


The conditions to even be able to observe SLEN in real code seem to involve using a vector to represent an irregular structure
   ---- I think not just an irregular structure, but a composite structure (bytes within words, etc.).
 and probably one with elements not naturally aligned.   ---- Natural alignment perhaps simplifies things, but the issue exists even then.


(I realize an initial struggle I had was in interpreting your use of "vector" to mean the vector mechanism. Duh.)

     Bill

On 4/27/20 10:33 AM, David Horner wrote:

As some are not on GitHub, I posted this response to #434 here:

Observations:

- single SEW operations are agnostic to the underlying structure (as Krste noted in a recent doc revision)

- mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN

- mixed SEW operations are predominantly SEW <--> 2 * SEW

- by-2 interleaving amongst SLEN chunks at the SEW level aligns well with the non-interleaved layout at 2 * SEW (see the sketch below)


Postulate:

That software can anticipate its need for matching structures for widening/narrowing and for the memory overlay model, and can make a weighted choice.


I call the current interleave proposal SEW-level interleave (elements are apportioned on a SEW basis amongst the available SLEN chunks in round-robin fashion).
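A minimal Python sketch of this round-robin mapping, with illustrative VLEN=128 and SLEN=64; it also shows the by-2 alignment observation above, since the chunk index i mod nchunks is independent of SEW:

```
# SEW-level interleave: element i goes to chunk i % nchunks, at slot
# i // nchunks within that chunk (round-robin amongst SLEN chunks).

def chunk_and_byte(i, sew, vlen=128, slen=64):
    nchunks = vlen // slen
    chunk, slot = i % nchunks, i // nchunks
    return chunk, chunk * (slen // 8) + slot * (sew // 8)

for sew in (8, 16):
    print(f"SEW={sew}:",
          [chunk_and_byte(i, sew) for i in range(128 // sew)])

# The chunk index does not depend on SEW, so element i of a SEW source
# and element i of its 2*SEW widened result land in the same SLEN
# chunk: widening never has to cross a chunk boundary.
for i in range(8):
    assert chunk_and_byte(i, 8)[0] == chunk_and_byte(i, 16)[0]
```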


I thus propose a variant of #421 Fractional vtype field vfill – Fractional Fill order and Fractional Instruction eLement Location:


INTRLV defines 4 interleave formats:


- SLEN<VLEN (SEW-level interleave)

- SLEN=VLEN (proposed as an extension; essentially no interleave)

- Layout identical to “SLEN=VLEN” even though SLEN chunks exist, but the upper 1/2 of each SLEN chunk is a gap (undisturbed or agnostic fill).

- Layout identical to “SLEN=VLEN” even though SLEN chunks exist, but the lower 1/2 of each SLEN chunk is a gap (undisturbed or agnostic fill).


A 2-bit vtype vintrlv field defines the application of these formats to the various operations; the effect is determined by the kind of operation:




Load/Store will, depending upon the mode:

```
vintrlv level = 0 -- scramble/descramble SEW-level encoding

vintrlv level = 3 -- transfer as if SLEN=VLEN (non-interleaved)

vintrlv level = 1 -- load (as if SLEN=VLEN) lower 1/2 of SLEN chunk
                     (upper half undisturbed or agnostic-filled)

vintrlv level = 2 -- load (as if SLEN=VLEN) upper 1/2 of SLEN chunk
                     (lower half undisturbed or agnostic-filled)
```
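A Python sketch of these four levels as I intend them, modelling a register as a list of byte slots (None marks a gap; VLEN=128, SLEN=64, and byte-sized elements are illustrative assumptions):

```
# Simulate a unit-stride byte load under each vintrlv level.
# mem is a list of bytes; the result is a list of VLEN/8 register
# byte slots, with None marking gap bytes (undisturbed/agnostic).

VLEN, SLEN = 128, 64                   # illustrative sizes
NCHUNKS, CB = VLEN // SLEN, SLEN // 8  # chunk count, bytes per chunk

def load(mem, level):
    reg = [None] * (VLEN // 8)
    for i, b in enumerate(mem):
        if level == 0:    # SEW-level round-robin scramble
            c, s = i % NCHUNKS, i // NCHUNKS
            reg[c * CB + s] = b
        elif level == 3:  # as if SLEN=VLEN: plain memory order
            reg[i] = b
        else:             # 1: lower half of each chunk, 2: upper half
            c, s = i // (CB // 2), i % (CB // 2)
            half = 0 if level == 1 else CB // 2
            reg[c * CB + half + s] = b
    return reg

# Eight bytes: at levels 1/2 this fills every half-chunk exactly; at
# levels 0/3 it fills half the register and leaves the tail as gaps.
mem = list(range(8))
for lvl in (0, 3, 1, 2):
    print(f"level {lvl}: {load(mem, lvl)}")
```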


Single-width operations will work on either side of the SLEN chunk for vintrlv levels 1 and 2, but identically on all vl elements for vintrlv levels 0 and 3.


Widening operations can operate on either side of the SLEN chunks, producing a 2*SEW set of elements in an SLEN-length chunk (vintrlv levels 1 and 2).

Further, widening operations can operate with one source on one side and the other source on the other side of the SLEN chunks, producing a 2*SEW set of elements in an SLEN-length chunk (vintrlv level 3).
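To illustrate (again with illustrative VLEN=128, SLEN=64, SEW=8): a level-1 source makes widening chunk-local, and the results come out in the level-3, memory-order layout:

```
# Widening under vintrlv level 1: each SEW element sits in the lower
# half of its SLEN chunk, and its 2*SEW result fills the whole chunk,
# landing in plain memory order (the level-3 / SLEN=VLEN layout).

VLEN, SLEN, SEW = 128, 64, 8
CB = SLEN // 8                       # bytes per chunk
PER = (CB // 2) // (SEW // 8)        # elements per half-chunk

def src_byte(i):                     # level-1 source layout
    return (i // PER) * CB + (i % PER) * (SEW // 8)

def dst_byte(i):                     # widened destination, memory order
    return i * (2 * SEW // 8)

for i in range(VLEN // 8 // 2):      # vl = 8 byte elements
    s, d = src_byte(i), dst_byte(i)
    assert s // CB == d // CB        # source and result share a chunk
    print(f"element {i}: src byte {s} -> dst bytes {d},{d + 1}")
```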


For further details please read #421.




On 2020-04-27 10:02 a.m., krste@... wrote:
I created a github issue for this, #434 - text repeated below,
Krste

Should SLEN=VLEN be an extension?

SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
wiring for wide datapaths such that bytes of different SEWs are laid
out differently in register bytes versus memory bytes, whereas when
SLEN=VLEN, the in-register format matches in-memory format for vectors
of all SEW.

Many vector routines can be written to be agnostic to SLEN, but some
routines can use the vector extension to manipulate data structures
that are not simple arrays of a single-width datatype (e.g., a network
packet). These routines can exploit the fact that, with SLEN=VLEN, SEW can
be changed to access different element widths within the same vector
register value, and many implementations will have SLEN=VLEN.

To support these kinds of routines portably on both SLEN<VLEN and
SLEN=VLEN machines, we could provide SEW "casting" operations that
internally rearrange in-register representations, e.g., converting a
vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
in the halfwords as they would if the vector was held in memory. For
SLEN=VLEN machines, all cast operations are a simple copy. However,
preserving compatibility between both types of machine incurs an
efficiency cost on the common SLEN=VLEN machines, and the "cast"
operation is not necessarily very efficient on the SLEN<VLEN machines
as it requires communication between the SLEN-wide sections, and
reloading vector from memory with different SEW might actually be more
efficient depending on the microarchitecture.

Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
this where available, avoiding needless casts. A downside would be
that this splits the software ecosystem if code that does not need to
depend on SLEN=VLEN inadvertently requires it. However, software
developers will be motivated to test for SLEN=VLEN to drop need to
perform cast operations even without an extension, so this split will
likely happen anyway.
The above proposal presupposes that SLEN=VLEN support be part of the base.

It also postulates that such casting operations are not necessary as they can be avoided by judicious use of the INTRVL facilities.
I may be wrong, and such cast operations may be beneficial.

A second issue either way is whether we should add "cast"
operations. They are primarily useful for the SLEN<VLEN machines
though are difficult to implement efficiently there; the SLEN=VLEN
implementation is just a register-register copy. We could choose to
add the cast operations as another optional extension, which is my
preference at this time.
So a separate extension for cast operations is also my current preference (if needed).
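For concreteness, a sketch of what such a cast would have to do on an SLEN<VLEN machine, assuming the round-robin register layout described above: store in memory format, then reload at 2*SEW (all parameters illustrative):

```
# Cast a vector of N SEW=8 bytes to N/2 SEW=16 halfwords so the bytes
# appear in the halfwords as they would if the vector were in memory.
# Modelled as: descramble to memory order, then rescramble at 2*SEW.

VLEN, SLEN = 128, 64

def reg_byte(i, sew):                    # round-robin register layout
    nchunks = VLEN // SLEN
    return (i % nchunks) * (SLEN // 8) + (i // nchunks) * (sew // 8)

def cast_8_to_16(reg):
    n = VLEN // 8
    mem = [None] * n
    for i in range(n):                   # "store": register to memory order
        mem[i] = reg[reg_byte(i, 8)]
    out = [None] * n
    for j in range(n // 2):              # "load" pairs back as SEW=16
        base = reg_byte(j, 16)
        out[base], out[base + 1] = mem[2 * j], mem[2 * j + 1]
    return out

reg = [None] * (VLEN // 8)
for i in range(VLEN // 8):               # element i holds value i
    reg[reg_byte(i, 8)] = i
print(cast_8_to_16(reg))
# On an SLEN=VLEN machine, reg_byte(i, sew) == i * sew // 8 and the
# whole cast degenerates to a plain register copy.
```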



On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:
| Hi Krste,
| On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:

|     Could consider later adding "cast" instructions that convert a vector
|     of N SEW=8 elements into a vector of N/2 SEW=16 elements by
|     concatenating the two bytes (and similar for other combinations of
|     source and dest SEWs).  These would be a simple move/copy on an
|     SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
|     with bytes crossing between SLEN sections (probably reusing the memory
|     pipeline crossbar in an implementation, to store the source vector in
|     its memory format, then load the destination vector in its register
|     format).  So vector is loaded once from memory as SEW=8, then cast
|     into appropriate type to extract other fields.  Misaligned words might
|     need a slide before casting.

| I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete
| example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They
| currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.

| For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if
| SLEN==VLEN. For example,

| # SEW = 32, LMUL = 4
| v_reinterpret v0, e64, m1

| would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up
| to, or perhaps ignoring VL). And we would perform the inverse operation, "register-group fusion", as follows:

| # SEW = 64, LMUL = 1
| v_reinterpret v0, e32, m4

| Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN ==
| VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden
| from the programmer. Hence, I think it would be useful in developing portable software.

| Best,
| Nick Knight
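To make Nick's example concrete, a small sketch of the as-if-SLEN==VLEN reinterpretation, punning four e32 values as two e64 values through their little-endian in-memory image (values illustrative):

```
import struct

# reinterpret_cast semantics: re-read the same bytes, in memory order,
# at the new element width; no data movement in the SLEN==VLEN case.
e32 = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
raw = struct.pack("<4I", *e32)           # in-memory image, little-endian
e64 = list(struct.unpack("<2Q", raw))    # same bytes viewed as e64
print([hex(x) for x in e64])             # ['0x2222222211111111', ...]
# Inverse ("register-group fusion" direction): view e64 back as e32.
assert list(struct.unpack("<4I", struct.pack("<2Q", *e64))) == e32
```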

