On 2020-04-27 3:16 p.m., krste@... wrote:
| I meant the SLEN=VLEN "extension" to simply be an assertion about the machine's static configuration. Software could then rely on in-register format matching in-memory format.
Krste

Then I agree that the risk of software fragmentation is high with such an extension. The reality is that some machines will indeed be SLEN=VLEN and thus risk some fragmentation. I am indeed proposing a run-time mechanism for supporting "in-register format matching in-memory format" and indefinite levels of widening/narrowing. For what it's worth, I think convenience in widening/narrowing beyond one level is much less valuable than maintaining the industry standard of matching the in-memory format. If we have to decide on one, I believe we should toss the SEW-level interleave (as much as I am very fond of it).

On Mon, 27 Apr 2020 18:57:42 +0000, Bill Huffman <huffman@...> said:
| Sounds like maybe you're thinking that "widening of SLEN to VLEN" is a runtime setting or something like that. There will be no (certainly few) machines where SLEN is variable, as the power/area cost would be too high. But maybe you meant something else.
| Bill
| On 4/27/20 11:50 AM, DSHORNER wrote:
| mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN
| I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN. The machine behaves as if SLEN=VLEN, but the data lanes are not VLEN wide. Rather than outputs aligning within a data lane, for widening operations the results are shuffled up to their appropriate slot in the destination register, and into the vd+1 register if vl >= VLMAX of a single physical register.
| The observation is that on machines that
| On 2020-04-27 1:51 p.m., Bill Huffman wrote:
| Hi David,
| I don't understand your observation about "mixed SEW operations (widening & narrowing)..."
| "mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN"
| I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN. The machine behaves as if SLEN=VLEN, but the data lanes are not VLEN wide. For widening operations, rather than the outputs aligning within a data lane, the results are shuffled up to their appropriate slot in the destination register, and into the vd+1 register if vl >= (VLMAX of a single physical register).
| I understand this is a substantial impact for most implementations. But we may have different interpretations of what Krste meant by SLEN=VLEN. He floated an option that VLEN be reduced to SLEN, but my reading of this proposal doesn't jibe with that alternative.
| The conditions required for impact seem to me much stronger. Arbitrary widening & narrowing mixing of SEW is fine (unless SEW>SLEN, which would be, at best, unusual). The conditions to even be able to observe SLEN in real code seem to involve using a vector to represent an irregular structure - and probably one with elements not naturally aligned.
| And I'm afraid I'm not following you. I agreed with: SEW>SLEN would be, at best, unusual.
| I am struggling with:
| The conditions required for impact seem to me much stronger. -- Stronger than "lane misalignment"?
| Arbitrary widening & narrowing mixing of SEW is fine --- in the model I describe above, or is there another interpretation I'm not fathoming?
| The conditions to even be able to observe SLEN in real code seem to involve using a vector to represent an irregular structure
| ---- I think not just an irregular structure, but a composite structure (bytes within words, etc.)
| and probably one with elements not naturally aligned.
| ---- natural alignment perhaps simplifies, but the issue exists even then.
| (I realize an initial struggle I had was in interpreting your use of "vector" to mean mechanism. duh.)
| Bill
| On 4/27/20 10:33 AM, David Horner wrote:
| As some are not on GitHub, I posted this response to #434 here:
| Observations:
| - single-SEW operations are agnostic to underlying structure (as Krste noted in a recent doc revision)
| - mixed-SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN
| - mixed-SEW operations are predominantly SEW <--> 2*SEW
| - by-2 interleaving chunks in SLEN=VLEN at the SEW level align well with non-interleaved at 2*SEW (see the layout sketch below)
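| To make that last observation concrete, here is a minimal layout sketch. It is my illustration, not text from the thread or the spec, and it assumes VLEN=128, SLEN=64, SEW=8, with elements apportioned round-robin between the two SLEN chunks as described below:
| ```
| # In memory: bytes b0..b15. SEW-level interleave, two SLEN chunks:
| #   SEW=8:  chunk0 = b0 b2 b4 b6 b8 b10 b12 b14
| #           chunk1 = b1 b3 b5 b7 b9 b11 b13 b15
| #   SEW=16: chunk0 = h0 h2 h4 h6   (halfword h_i = bytes b2i, b2i+1)
| #           chunk1 = h1 h3 h5 h7
| # Widening element i (in chunk i mod 2) produces result i (also in
| # chunk i mod 2), so one level of widening never crosses chunks, but
| # the in-register byte order still differs from the in-memory order.
| ```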
| Postulate:
| Software can anticipate its need for matching structures for widening/narrowing versus the memory-overlay model, and make a weighted choice between them.
| I call the current interleave proposal SEW-level interleave (elements are apportioned on a SEW basis amongst the available SLEN chunks in round-robin fashion).
| I thus propose a variant of #421, Fractional vtype field vfill – Fractional Fill order and Fractional Instruction eLement Location:
| INTRLV defines 4 interleave formats:
| - SLEN<VLEN (SEW-level interleave)
| - SLEN=VLEN (proposed as an extension; essentially no interleave)
| - Layout identical to "SLEN=VLEN" even though SLEN chunks exist, but the upper 1/2 of each SLEN chunk is a gap (undisturbed or agnostic fill).
| - Layout identical to "SLEN=VLEN" even though SLEN chunks exist, but the lower 1/2 of each SLEN chunk is a gap (undisturbed or agnostic fill); both half-chunk formats are sketched below.
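| A byte-level sketch of the two half-chunk formats (again my illustration, assuming VLEN=128, SLEN=64, SEW=8, vl=8; "g" marks a gap byte):
| ```
| # Upper 1/2 of each chunk is a gap (data packed in lower halves):
| #   chunk0 = b0 b1 b2 b3 g g g g    chunk1 = b4 b5 b6 b7 g g g g
| # Lower 1/2 of each chunk is a gap (data packed in upper halves):
| #   chunk0 = g g g g b0 b1 b2 b3    chunk1 = g g g g b4 b5 b6 b7
| # Within the occupied halves, elements keep their memory order,
| # exactly as if SLEN=VLEN.
| ```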
| A 2-bit vtype vintrlv field defines the application of these formats to the various operations; the effect depends on the kind of operation:
| Loads/stores will, depending upon the mode:
| ```
| vintrlv level = 0 -- scramble/descramble SEW-level encoding
| vintrlv level = 3 -- transfer as if SLEN=VLEN (non-interleaved)
| vintrlv level = 1 -- load (as if SLEN=VLEN) lower 1/2 of SLEN chunk
|                      (upper undisturbed or agnostic filled)
| vintrlv level = 2 -- load (as if SLEN=VLEN) upper 1/2 of SLEN chunk
|                      (lower undisturbed or agnostic filled)
| ```
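| As a usage sketch (hypothetical: vintrlv is only a proposed field, no assembler syntax for setting it exists, and the load mnemonics are v1.0-style):
| ```
| # with vintrlv=3: in-register layout matches in-memory layout
| vsetvli t0, a0, e8, m1, ta, ma
| vle8.v  v0, (a1)        # plain non-interleaved load
| # with vintrlv=1: the same load fills only the lower half of each
| # SLEN chunk, leaving the upper half undisturbed or agnostic
| vle8.v  v1, (a1)
| ```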
| Single-width operations will work on either side of SLEN for vintrlv levels 1 and 2, but identically on all vl elements for vintrlv levels 0 and 3.
| Widening operations can operate on either side of the SLEN chunks, producing a 2*SEW set of elements in an SLEN-length chunk (vintrlv levels 1 and 2). Further, widening operations can operate with one source on one side and the other source on the other side of the SLEN chunks, producing a 2*SEW set of elements in an SLEN-length chunk (vintrlv level 3); a sketch of this case follows.
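| A hedged sketch of that level-3 case (my illustration; the vintrlv settings and their interaction with vsetvli are hypothetical):
| ```
| # Assume v0 was loaded with vintrlv=1 (lower half of each chunk)
| # and v1 with vintrlv=2 (upper half of each chunk).
| vsetvli  t0, a0, e8, m1, ta, ma
| vwmul.vv v2, v0, v1   # with vintrlv=3: the 16-bit products fill
|                       # whole SLEN chunks in memory order
| ```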
| For further details please read #421.
| On 2020-04-27 10:02 a.m., krste@... wrote:
| I created a github issue for this, #434 - text repeated below,
| Krste
| Should SLEN=VLEN be an extension?
| SLEN<VLEN introduces internal rearrangements to reduce cross-datapath wiring for wide datapaths, such that bytes of different SEWs are laid out differently in register bytes versus memory bytes, whereas when SLEN=VLEN, the in-register format matches the in-memory format for vectors of all SEW.
| Many vector routines can be written to be agnostic to SLEN, but some routines can use the vector extension to manipulate data structures that are not simple arrays of a single-width datatype (e.g., a network packet). These routines can exploit SLEN=VLEN, and hence that SEW can be changed to access different element widths within the same vector register value, and many implementations will have SLEN=VLEN.
| To support these kinds of routines portably on both SLEN<VLEN and SLEN=VLEN machines, we could provide SEW "casting" operations that internally rearrange in-register representations, e.g., converting a vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing in the halfwords as they would if the vector were held in memory. For SLEN=VLEN machines, all cast operations are a simple copy. However, preserving compatibility between both types of machine incurs an efficiency cost on the common SLEN=VLEN machines, and the "cast" operation is not necessarily very efficient on the SLEN<VLEN machines, as it requires communication between the SLEN-wide sections; reloading the vector from memory with a different SEW might actually be more efficient, depending on the microarchitecture.
| Making SLEN=VLEN an extension (Zveqs?) enables software to exploit this where available, avoiding needless casts. A downside would be that this splits the software ecosystem if code that does not need to depend on SLEN=VLEN inadvertently requires it. However, software developers will be motivated to test for SLEN=VLEN to drop the need to perform cast operations even without an extension, so this split will likely happen anyway.
| The above proposal presupposes that SLEN=VLEN support is part of the base. It also postulates that such casting operations are not necessary, as they can be avoided by judicious use of the INTRLV facilities. I may be wrong and such caste [sick] operations may be beneficial.
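| A minimal sketch of the memory round trip Krste mentions (my illustration, assuming v1.0-style mnemonics; a0 holding the element count and a1 a scratch buffer are hypothetical choices):
| ```
| # Portable "cast" of N SEW=8 elements to N/2 SEW=16 elements via
| # memory: the reload pairs the bytes exactly as the in-memory format
| # does, regardless of the machine's SLEN.
| vsetvli t0, a0, e8, m1, ta, ma     # N byte elements
| vse8.v  v0, (a1)                   # store in memory byte order
| srli    a0, a0, 1
| vsetvli t0, a0, e16, m1, ta, ma    # N/2 halfword elements
| vle16.v v8, (a1)                   # reload as halfwords
| ```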
| A second issue either way is whether we should add "cast" operations. They are primarily useful for the SLEN<VLEN machines, though they are difficult to implement efficiently there; the SLEN=VLEN implementation is just a register-register copy. We could choose to add the cast operations as another optional extension, which is my preference at this time.
| So a separate extension for cast operations is also my current preference (if needed).
| On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:
| | Hi Krste,
| | On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:
| | Could consider later adding "cast" instructions that convert a vector of N SEW=8 elements into a vector of N/2 SEW=16 elements by concatenating the two bytes (and similar for other combinations of source and dest SEWs). These would be a simple move/copy on an SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine with bytes crossing between SLEN sections (probably reusing the memory pipeline crossbar in an implementation, to store the source vector in its memory format, then load the destination vector in its register format). So the vector is loaded once from memory as SEW=8, then cast into the appropriate type to extract other fields. Misaligned words might need a slide before casting.
| | I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.
| | For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if SLEN == VLEN. For example,
| | # SEW = 32, LMUL = 4
| | v_reinterpret v0, e64, m1
| | would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up to, or perhaps ignoring, VL). And we would perform the inverse operation, "register-group fusion", as follows:
| | # SEW = 64, LMUL = 1
| | v_reinterpret v0, e32, m4
| | Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is that it optimizes for the (common?) case of SLEN == VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden from the programmer. Hence, I think it would be useful in developing portable software.
| | Best,
| | Nick Knight
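| For reference, the e32-to-e64 reinterpret Nick describes can be emulated today with the store/load sequence he alludes to (my sketch, assuming v1.0-style mnemonics; a0 = element count and a1 = scratch buffer are hypothetical, and I keep LMUL=4 on the reload so the full register group's bits are preserved):
| ```
| # Emulate "reinterpret e32,m4 as e64" via a memory round trip; on an
| # SLEN=VLEN machine the same effect needs only register moves.
| vsetvli t0, a0, e32, m4, ta, ma
| vse32.v v0, (a1)             # spill [v0..v3] in memory order
| srli    a0, a0, 1
| vsetvli t0, a0, e64, m4, ta, ma
| vle64.v v8, (a1)             # adjacent e32 pairs become e64 elements
| ```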