Re: More thoughts on Git update (8a9fbce) Added fractional LMUL


David Horner
 

On 2020-04-27 3:16 p.m., krste@... wrote:
I meant the SLEN=VLEN "extension" to simply be an assertion about the
machine's static configuration. Software could then rely on
in-register format matching in-memory format.

Krste
Then I agree that the risk of software fragmentation is high with such an extension.
The reality is that some machines will indeed be SLEN=VLEN and thus risk some fragmentation.

I am indeed proposing a run-time mechanism to support "in-register format matching in-memory format" and
indefinite levels of widening/narrowing.

For what it's worth, I think the inconvenience of widening/narrowing beyond one level matters much less than maintaining the industry standard of matching the in-memory format.

If we have to decide on one, I believe we should toss the SEW-level interleave (as fond of it as I am).

On Mon, 27 Apr 2020 18:57:42 +0000, Bill Huffman <huffman@...> said:
| Sounds like maybe you're thinking that "widening of SLEN to VLEN" is a runtime setting or something like
| that. There will be no (or certainly few) machines where SLEN is variable, as the power/area cost would be
| too high. But maybe you meant something else.

| Bill

| On 4/27/20 11:50 AM, DSHORNER wrote:


| Mixed SEW operations (widening & narrowing) have a substantial impact on contiguous SLEN=VLEN.
| I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
| The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. Rather than the outputs
| aligning within a data lane, for widening operations the results are shuffled up to their
| appropriate slot in the destination register, and into the vd+1 register if vl >= VLMAX of a single
| physical register.
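|
| A minimal C sketch of the placement being described (the VLEN and SEW values are illustrative
| assumptions, not spec text):
| ```
| #include <stdio.h>
|
| /* Where widening-result element i lands when the machine behaves as if
|  * SLEN=VLEN: byte offset i*2*SEW in the destination group, spilling
|  * into vd+1 once past one physical register. Widths are in bytes. */
| static void widen_slot(unsigned i, unsigned vlen_b, unsigned sew_b,
|                        unsigned *reg, unsigned *byte)
| {
|     unsigned off = i * 2 * sew_b;   /* as-if-SLEN=VLEN byte offset */
|     *reg  = off / vlen_b;           /* 0 -> vd, 1 -> vd+1          */
|     *byte = off % vlen_b;           /* byte within that register   */
| }
|
| int main(void)
| {
|     /* VLEN=128b (16B), widening SEW=16b -> 32b results. */
|     for (unsigned i = 0; i < 8; i++) {
|         unsigned r, b;
|         widen_slot(i, 16, 2, &r, &b);
|         printf("result e%u -> vd+%u byte %2u\n", i, r, b);
|     }
|     return 0;
| }
| ```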

| The observation is that on machines that
| On 2020-04-27 1:51 p.m., Bill Huffman wrote:

| Hi David,
| I don't understand your observation about "mixed SEW operations (widening & narrowing)..."
| "mixed SEW operations (widening & narrowing) have substantial impact on contiguous
| SLEN=VLEN"
| I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
| The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. For widening operations,
| rather than the outputs aligning within a data lane, the results are shuffled up to their
| appropriate slot in the destination register, and into the vd+1 register if vl >= (VLMAX of a single
| physical register).
| I understand this is a substantial impact for most implementations.
| But we may have different interpretations of what Krste meant by SLEN=VLEN.
| He floated an option that VLEN be reduced to SLEN, but my reading of this proposal doesn't jibe with
| that alternative.

| The conditions required for impact seem to me much stronger. Arbitrary widening & narrowing
| mixing of SEW is fine (unless SEW>SLEN, which would be, at best, unusual). The conditions to
| even be able to observe SLEN in real code seem to involve using a vector to represent an
| irregular structure - and probably one with elements not naturally aligned.
| And I'm afraid I'm not following you. I agreed with:
| SEW>SLEN, which would be, at best, unusual
| but am struggling with:
| The conditions required for impact seem to me much stronger. -- Than "lane misalignment"?

| Arbitrary widening & narrowing mixing of SEW is fine --- in the model I describe above, or is
| there another interpretation I'm not fathoming?

| The conditions to even be able to observe SLEN in real code seem to involve using a vector to
| represent an irregular structure
| ---- I think not just an irregular structure, but a composite structure (bytes within words,
| etc.)
| and probably one with elements not naturally aligned. ---- natural alignment perhaps simplifies,
| but the issue exists even then.
| (I realize an initial struggle I had was in interpreting your use of "vector" to mean mechanism.
| duh.)

| Bill
| On 4/27/20 10:33 AM, David Horner wrote:
| As some are not on GitHub, I posted this response to #434 here:

| Observations:
| - single SEW operations are agnostic to the underlying structure (as Krste noted in a recent doc
| revision)
| - mixed SEW operations (widening & narrowing) have a substantial impact on contiguous SLEN=
| VLEN
| - mixed SEW operations are predominantly SEW <--> 2 * SEW
| - by-2 interleaved chunks in SLEN=VLEN at the SEW level align well with non-interleaved at 2 *
| SEW

| Postulate:
| That software can anticipate its need for matching structures for widening/narrowing versus the
| memory overlay model, and can make a weighted choice.

| I call the current interleave proposal SEW-level interleave (elements are apportioned on a
| SEW basis amongst the available SLEN chunks in round-robin fashion).
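|
| To make the round-robin apportionment concrete, here is a minimal C model of my reading of
| SEW-level interleave; the offset formula is an illustrative assumption, not normative text:
| ```
| #include <stdio.h>
|
| /* Sketch: element i is dealt round-robin across the VLEN/SLEN chunks;
|  * within a chunk, the elements it receives occupy successive SEW slots.
|  * All widths are in bytes and purely illustrative. */
| static unsigned reg_byte_offset(unsigned i, unsigned vlen_b,
|                                 unsigned slen_b, unsigned sew_b)
| {
|     unsigned chunks = vlen_b / slen_b;   /* SLEN chunks per register */
|     unsigned chunk  = i % chunks;        /* round-robin chunk choice */
|     unsigned slot   = i / chunks;        /* slot within that chunk   */
|     return chunk * slen_b + slot * sew_b;
| }
|
| int main(void)
| {
|     /* VLEN=256b (32B), SLEN=128b (16B), SEW=16b (2B). */
|     for (unsigned i = 0; i < 16; i++)
|         printf("e%-2u -> byte %2u\n", i, reg_byte_offset(i, 32, 16, 2));
|     return 0;
| }
| ```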

| I thus propose a variant of #421 "Fractional vtype field vfill – Fractional Fill order and
| Fractional Instruction eLement Location":

| INTRLV defines 4 interleave formats:

| - SLEN<VLEN (SEW-level interleave)
| - SLEN=VLEN (proposed as an extension; essentially no interleave)
| - Layout is identical to "SLEN=VLEN" even though SLEN chunks exist, but the upper 1/2 of each
| SLEN chunk is a gap (undisturbed or agnostic fill).
| - Layout is identical to "SLEN=VLEN" even though SLEN chunks exist, but the lower 1/2 of each
| SLEN chunk is a gap (undisturbed or agnostic fill).

| A 2-bit vtype vintrlv field defines the application of these formats to various operations;
| the effect is determined by the kind of operation:

| Loads/stores will, depending upon the mode:
| ```
| vintrlv level = 0 -- scramble/descramble SEW-level encoding
| vintrlv level = 3 -- transfer as if SLEN=VLEN (non-interleaved)
| vintrlv level = 1 -- load (as if SLEN=VLEN) lower 1/2 of SLEN chunk
| (upper undisturbed or agnostic filled)
| vintrlv level = 2 -- load (as if SLEN=VLEN) upper 1/2 of SLEN chunk
| (lower undisturbed or agnostic filled)
| ```
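|
| As a concreteness check, a hedged C sketch of where element i of a unit-stride load would land
| under each level; the formulas are my own illustrative reading of the four formats:
| ```
| #include <stdio.h>
|
| /* Register byte offset of element i under the proposed vintrlv levels.
|  * Level 0 stripes round-robin across SLEN chunks; levels 1/2 pack the
|  * lower/upper half of each chunk; level 3 is flat (as if SLEN=VLEN).
|  * Widths are in bytes; the encodings are assumptions for illustration. */
| static unsigned place(unsigned lvl, unsigned i, unsigned vlen_b,
|                       unsigned slen_b, unsigned sew_b)
| {
|     unsigned chunks = vlen_b / slen_b;
|     unsigned half   = slen_b / 2;
|     unsigned per    = half / sew_b;      /* elements per half-chunk */
|     switch (lvl) {
|     case 0:  return (i % chunks) * slen_b + (i / chunks) * sew_b;
|     case 1:  return (i / per) * slen_b + (i % per) * sew_b;
|     case 2:  return (i / per) * slen_b + half + (i % per) * sew_b;
|     default: return i * sew_b;           /* level 3 */
|     }
| }
|
| int main(void)
| {
|     for (unsigned lvl = 0; lvl < 4; lvl++)   /* VLEN=32B, SLEN=16B, SEW=2B */
|         for (unsigned i = 0; i < 8; i++)
|             printf("lvl %u: e%u -> byte %2u\n", lvl, i, place(lvl, i, 32, 16, 2));
|     return 0;
| }
| ```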

| Single-width operations will work on either side of SLEN for vintrlv levels 1 and 2, but
| identically on all vl elements for vintrlv levels 0 and 3.

| Widening operations can operate on either side of the SLEN chunks, providing a 2*SEW set of
| elements in an SLEN-length chunk (vintrlv levels 1 and 2).
| Further, widening operations can operate with one source on one side and the other source on the
| other side of the SLEN chunks, providing a 2*SEW set of elements in an SLEN-length chunk
| (vintrlv level 3).
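|
| For instance, under level 1 a widening add never crosses an SLEN boundary: each source packs
| its elements into the lower half of a chunk, and the doubled-width results exactly fill that
| chunk. A scalar C stand-in (SLEN=128 assumed, so 4 16-bit elements per half-chunk):
| ```
| #include <stdint.h>
| #include <stdio.h>
|
| #define PER_CHUNK 4   /* 16-bit elements in the lower half of a 128b chunk */
|
| /* One SLEN chunk of a widening add: 4 x u16 (+) 4 x u16 -> 4 x u32,
|  * i.e. the 2*SEW results exactly fill the 128-bit chunk. */
| static void widen_add_chunk(const uint16_t *a, const uint16_t *b, uint32_t *d)
| {
|     for (int i = 0; i < PER_CHUNK; i++)
|         d[i] = (uint32_t)a[i] + (uint32_t)b[i];
| }
|
| int main(void)
| {
|     uint16_t a[PER_CHUNK] = {1, 2, 3, 0xFFFF};
|     uint16_t b[PER_CHUNK] = {4, 5, 6, 1};
|     uint32_t d[PER_CHUNK];
|     widen_add_chunk(a, b, d);
|     for (int i = 0; i < PER_CHUNK; i++)
|         printf("d[%d] = %u\n", i, d[i]);   /* note 0xFFFF + 1 = 0x10000 */
|     return 0;
| }
| ```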

| For further details please read #421.

| On 2020-04-27 10:02 a.m., krste@... wrote:
| I created a github issue for this, #434 - text repeated below,
| Krste
| Should SLEN=VLEN be an extension?
| SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
| wiring for wide datapaths such that bytes of different SEWs are laid
| out differently in register bytes versus memory bytes, whereas when
| SLEN=VLEN, the in-register format matches in-memory format for vectors
| of all SEW.
| Many vector routines can be written to be agnostic to SLEN, but some
| routines can use the vector extension to manipulate data structures
| that are not simple arrays of a single-width datatype (e.g., a network
| packet). These routines can exploit SLEN=VLEN and hence that SEW can
| be changed to access different element widths within the same vector
| register value, and many implementations will have SLEN=VLEN.
| To support these kinds of routines portably on both SLEN<VLEN and
| SLEN=VLEN machines, we could provide SEW "casting" operations that
| internally rearrange in-register representations, e.g., converting a
| vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
| in the halfwords as they would if the vector was held in memory. For
| SLEN=VLEN machines, all cast operations are a simple copy. However,
| preserving compatibility between both types of machine incurs an
| efficiency cost on the common SLEN=VLEN machines, and the "cast"
| operation is not necessarily very efficient on the SLEN<VLEN machines
| as it requires communication between the SLEN-wide sections, and
| reloading the vector from memory with a different SEW might actually be more
| efficient depending on the microarchitecture.
| Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
| this where available, avoiding needless casts. A downside would be
| that this splits the software ecosystem if code that does not need to
| depend on SLEN=VLEN inadvertently requires it. However, software
| developers will be motivated to test for SLEN=VLEN to drop need to
| perform cast operations even without an extension, so this split will
| likely happen anyway.
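|
| For reference, a scalar C model of the semantics such a cast would pin down (little-endian
| assumed; the function name is illustrative). On an SLEN=VLEN machine this is a plain register
| copy; on SLEN<VLEN it implies the cross-section shuffle described above:
| ```
| #include <stdint.h>
| #include <stdio.h>
|
| /* Reinterpret n SEW=8 bytes as n/2 SEW=16 halfwords exactly as if the
|  * vector had round-tripped through little-endian memory. */
| static void cast_u8_to_u16(const uint8_t *src, uint16_t *dst, unsigned n)
| {
|     for (unsigned i = 0; i < n / 2; i++)
|         dst[i] = (uint16_t)src[2*i] | ((uint16_t)src[2*i + 1] << 8);
| }
|
| int main(void)
| {
|     uint8_t  bytes[8] = {1, 2, 3, 4, 5, 6, 7, 8};
|     uint16_t halves[4];
|     cast_u8_to_u16(bytes, halves, 8);
|     for (int i = 0; i < 4; i++)
|         printf("h%d = 0x%04x\n", i, halves[i]);   /* 0x0201, 0x0403, ... */
|     return 0;
| }
| ```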
| The above proposal presupposes that SLEN=VLEN support is part of the base.
| It also postulates that such casting operations are not necessary, as they can be avoided by
| judicious use of the INTRLV facilities.
| I may be wrong, and such cast operations may be beneficial.

| A second issue either way is whether we should add "cast"
| operations. They are primarily useful for the SLEN<VLEN machines
| though are difficult to implement efficiently there; the SLEN=VLEN
| implementation is just a register-register copy. We could choose to
| add the cast operations as another optional extension, which is my
| preference at this time.
| So a separate extension for cast operations is also my current preference (if needed).

| On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:
| | Hi Krste,
| | On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:
| | Could consider later adding "cast" instructions that convert a vector
| | of N SEW=8 elements into a vector of N/2 SEW=16 elements by
| | concatenating the two bytes (and similar for other combinations of
| | source and dest SEWs). These would be a simple move/copy on an
| | SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
| | with bytes crossing between SLEN sections (probably reusing the memory
| | pipeline crossbar in an implementation, to store the source vector in
| | its memory format, then load the destination vector in its register
| | format). So the vector is loaded once from memory as SEW=8, then cast
| | into the appropriate type to extract other fields. Misaligned words might
| | need a slide before casting.
| | I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete
| | example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They
| | currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.
| | For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if
| | SLEN==VLEN. For example,
| | # SEW = 32, LMUL = 4
| | v_reinterpret v0, e64, m1
| | would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up
| | to, or perhaps ignoring VL). And we would perform the inverse operation, "register-group fusion", as follows:
| | # SEW = 64, LMUL = 1
| | v_reinterpret v0, e32, m4
| | Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN ==
| | VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden
| | from the programmer. Hence, I think it would be useful in developing portable software.
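| |
| | A scalar C model of the two reinterpretations under the SLEN==VLEN view (little-endian;
| | the helper names are illustrative, not a proposed API):
| | ```
| | #include <stdint.h>
| |
| | /* "Fission": concatenate adjacent e32 pairs into e64 elements. */
| | static void fission_e32_to_e64(const uint32_t *src, uint64_t *dst, unsigned n32)
| | {
| |     for (unsigned i = 0; i < n32 / 2; i++)
| |         dst[i] = (uint64_t)src[2*i] | ((uint64_t)src[2*i + 1] << 32);
| | }
| |
| | /* "Fusion": the inverse, splitting each e64 back into two e32. */
| | static void fusion_e64_to_e32(const uint64_t *src, uint32_t *dst, unsigned n64)
| | {
| |     for (unsigned i = 0; i < n64; i++) {
| |         dst[2*i]     = (uint32_t)src[i];
| |         dst[2*i + 1] = (uint32_t)(src[i] >> 32);
| |     }
| | }
| |
| | int main(void)
| | {
| |     uint32_t w[4] = {0x1111, 0x2222, 0x3333, 0x4444};
| |     uint64_t d[2]; uint32_t back[4];
| |     fission_e32_to_e64(w, d, 4);
| |     fusion_e64_to_e32(d, back, 2);
| |     return back[0] == w[0] ? 0 : 1;   /* round-trip check */
| | }
| | ```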
| | Best,
| | Nick Knight

