More thoughts on Git update (8a9fbce) Added fractional LMUL


David Horner
 

The aspect that will probably be most problematic for programmers is the loss of the memory-mapping paradigm.

Whereas adjacent bytes in memory lie in the same or adjacent words (ditto for halfwords and doublewords),
      once stored in vector registers this no longer holds when SLEN <= 1/4 VLEN.
Indeed, in memory consecutive bytes advance through halfwords, words and doublewords,
   but in vector registers with SLEN <= 1/2 VLEN they jump to consecutive SLEN chunks.

Due to this dependency of SLEN relative to VLEN,
    it is at least as hard to get one's mind around (to grok) as the various big-endian formats.
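
A minimal sketch of the mapping in question, assuming the round-robin striping of elements across SLEN-wide sections discussed in the draft spec (the function name and its defaults are illustrative only, not normative):

```python
# Model of where element i of an SEW-bit vector lands in a VLEN-bit register
# when elements are striped round-robin across SLEN-wide sections (all widths
# in bits).  A sketch only; not spec text.
def reg_byte_offset(i, sew=8, vlen=256, slen=128):
    sections = vlen // slen          # number of SLEN chunks per register
    section = i % sections           # elements are dealt out across chunks
    slot = i // sections             # position within the chunk
    return section * (slen // 8) + slot * (sew // 8)

# SLEN=VLEN keeps memory order; SLEN=VLEN/2 jumps between chunks:
print([reg_byte_offset(i, 8, 256, 256) for i in range(8)])   # [0, 1, 2, ..., 7]
print([reg_byte_offset(i, 8, 256, 128) for i in range(8)])   # [0, 16, 1, 17, ...]
```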

It may prove challenging to port code that assumes the memory-mapping model across overlapping registers of differing power-of-two widths.
I have no immediate solution.


Krste Asanovic
 

On Sun, 26 Apr 2020 00:47:34 -0400, "David Horner" <ds2horner@...> said:
| The aspect that will probably be most problematic for programmer is the
| loss of memory mapping paradigm.

| Whereas adjacent  bytes in memory are in the same or adjacent words
| (ditto for half words and doubles),
|       once stored in vector registers this will no longer hold when
| SLEN <= 1/4 VLEN.
| Indeed, in memory consecutive bytes advance through halfs, words and
| doubles,
|    but in vector registers with SLEN<= 1/2 VLEN, they jump to
| consecutive SLEN chunks.

Application programmers are not supposed to be using the underlying
mapping in their code. If they only access a vector register using
element indices, and never access values stored in a vector register
group with more than one SEW/LMUL setting, then SLEN and the
in-register element-to-byte mapping should be transparent to them.

Debuggers, emulators, and other tools that look at the register values
obviously have to parse the register layout, which is why we're
standardizing SLEN as a parameter, but application programmers
shouldn't have to worry about the data layout.

| Due to this SLEN relative to VLEN dependency,
|     it is at least as hard for one to get ones mind around (to grok)
| than the various big-endian formats.

| It may prove challenging to porting code that assumes the memory mapping
| model in overlapping registers of differing power of two widths .
| I have no immediate solution.

Do you have an example of such code? For which architecture?

I can see a case where vector operations are being used to accelerate
operations on in-memory multi-byte data structures, e.g., an IP packet
containing a mix of SEW=8,16,32 fields. A vector loaded with SEW=8 from an
in-memory structure can always be accessed using element byte indices
to obtain the same in-register byte mapping as for an in-memory data
structure, but it is not efficient to extract/operate on a multi-byte
field from the register in the presence of SLEN (whereas trivial when
SLEN=VLEN).

Could consider later adding "cast" instructions that convert a vector
of N SEW=8 elements into a vector of N/2 SEW=16 elements by
concatenating the two bytes (and similar for other combinations of
source and dest SEWs). These would be a simple move/copy on an
SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
with bytes crossing between SLEN sections (probably reusing the memory
pipeline crossbar in an implementation, to store the source vector in
its memory format, then load the destination vector in its register
format). So vector is loaded once from memory as SEW=8, then cast
into appropriate type to extract other fields. Misaligned words might
need a slide before casting.
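
A reference model of what such a cast would produce (just a sketch of the store-then-reload semantics described above; the helper name is mine, and little-endian order is assumed):

```python
# The cast result equals storing the vector at the old SEW and reloading it
# at the new SEW; a copy on an SLEN=VLEN machine, a permute when SLEN<VLEN.
# Illustrative model only.
def cast_sew(elems, old_sew, new_sew):
    mem = bytearray()
    for e in elems:                                   # memory image, element order
        mem += e.to_bytes(old_sew // 8, "little")
    w = new_sew // 8
    return [int.from_bytes(mem[i:i + w], "little") for i in range(0, len(mem), w)]

# Eight SEW=8 elements viewed as four SEW=16 elements:
print([hex(x) for x in cast_sew([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88], 8, 16)])
# ['0x2211', '0x4433', '0x6655', '0x8877']
```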

An alternative approach without adding new instructions would be to
just load the in-memory structure several times with different SEW
formats, or more simply, just use scalar instructions to process the
fields in the in-memory struct and only use the vector instructions to
shuffle structures around in memory.

Yet another alternative approach is to make SLEN<VLEN an architectural
option that software has to deal with, but that will fragment the
software ecosystem.

Krste




Nick Knight
 

Hi Krste,

On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:
Could consider later adding "cast" instructions that convert a vector
of N SEW=8 elements into a vector of N/2 SEW=16 elements by
concatenating the two bytes (and similar for other combinations of
source and dest SEWs).  These would be a simple move/copy on an
SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
with bytes crossing between SLEN sections (probably reusing the memory
pipeline crossbar in an implementation, to store the source vector in
its memory format, then load the destination vector in its register
format).  So vector is loaded once from memory as SEW=8, then cast
into appropriate type to extract other fields.  Misaligned words might
need a slide before casting.

I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.

For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if SLEN==VLEN. For example,

# SEW = 32, LMUL = 4
v_reinterpret v0, e64, m1

would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up to, or perhaps ignoring VL). And we would perform the inverse operation, "register-group fusion", as follows:

# SEW = 64, LMUL = 1
v_reinterpret v0, e32, m4

Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN == VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden from the programmer. Hence, I think it would be useful in developing portable software.
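
For illustration, a small model of the round trip described above (e32 viewed as e64, 64-bit arithmetic, then viewed as e32 again), taking the reinterpretation to mean "as if SLEN==VLEN"; the names and values are purely illustrative:

```python
def reinterpret(elems, old_sew, new_sew):
    # "As if SLEN==VLEN": the new view is just the memory image re-read at new_sew.
    mem = b"".join(e.to_bytes(old_sew // 8, "little") for e in elems)
    w = new_sew // 8
    return [int.from_bytes(mem[i:i + w], "little") for i in range(0, len(mem), w)]

v32 = [0xDEADBEEF, 0x00000001, 0x12345678, 0x00000000]    # e32, VL=4
v64 = reinterpret(v32, 32, 64)                            # e64, VL=2
v64 = [(x + 0x11) & (2**64 - 1) for x in v64]             # 64-bit arithmetic
print([hex(x) for x in reinterpret(v64, 64, 32)])         # back to e32, VL=4
```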

Best,
Nick Knight


Krste Asanovic
 

I created a github issue for this, #434 - text repeated below,
Krste

Should SLEN=VLEN be an extension?

SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
wiring for wide datapaths such that bytes of different SEWs are laid
out differently in register bytes versus memory bytes, whereas when
SLEN=VLEN, the in-register format matches in-memory format for vectors
of all SEW.

Many vector routines can be written to be agnostic to SLEN, but some
routines can use the vector extension to manipulate data structures
that are not simple arrays of a single-width datatype (e.g., a network
packet). These routines can exploit SLEN=VLEN and hence that SEW can
be changed to access different element widths within the same vector
register value, and many implementations will have SLEN=VLEN.

To support these kinds of routines portably on both SLEN<VLEN and
SLEN=VLEN machines, we could provide SEW "casting" operations that
internally rearrange in-register representations, e.g., converting a
vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
in the halfwords as they would if the vector was held in memory. For
SLEN=VLEN machines, all cast operations are a simple copy. However,
preserving compatibility between both types of machine incurs an
efficiency cost on the common SLEN=VLEN machines, and the "cast"
operation is not necessarily very efficient on the SLEN<VLEN machines
as it requires communication between the SLEN-wide sections, and
reloading vector from memory with different SEW might actually be more
efficient depending on the microarchitecture.

Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
this where available, avoiding needless casts. A downside would be
that this splits the software ecosystem if code that does not need to
depend on SLEN=VLEN inadvertently requires it. However, software
developers will be motivated to test for SLEN=VLEN to drop the need to
perform cast operations even without an extension, so this split will
likely happen anyway.

A second issue either way is whether we should add "cast"
operations. They are primarily useful for the SLEN<VLEN machines
though are difficult to implement efficiently there; the SLEN=VLEN
implementation is just a register-register copy. We could choose to
add the cast operations as another optional extension, which is my
preference at this time.









David Horner
 

As some are not on GitHub, I posted this response to #434 here:

Observations:

- single SEW operations are agnostic to the underlying structure (as Krste noted in a recent doc revision)

- mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN

- mixed SEW operations are predominantly SEW <--> 2 * SEW

- by-2 interleaving of chunks in SLEN=VLEN at the SEW level aligns well with a non-interleaved layout at 2 * SEW


Postulate:

That software can anticipate its need for matching structures for widening/narrowing and for the memory-overlay model, and make a weighted choice.


I call the current interleave proposal SEW-level interleave (elements are apportioned on a SEW basis amongst the available SLEN chunks in round-robin fashion).


I thus propose a variant of #421 (Fractional vtype field vfill – Fractional Fill order and Fractional Instruction eLement Location):


INTRLV defines 4 interleave formats:


- SLEN<VLEN (SEW level interleave)

- SLEN=VLEN (proposed as extension, essentially no interleave)

- Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but upper 1/2 of SLEN chunk is a gap (undisturbed or agnostic fill).

- Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but lower 1/2 of SLEN chunk is a gap (undisturbed or agnostic fill).


A 2-bit vtype vintrlv field defines the application of these formats to various operations; the effect is determined by the kind of operation:




Load/Store will, depending upon the mode:

```

vintrlv level = 0 -- scramble/descramble SEW-level encoding

vintrlv level = 3 -- transfer as if SLEN=VLEN (non-interleaved)

vintrlv level = 1 -- load (as if SLEN=VLEN) into lower 1/2 of SLEN chunk
                     (upper half undisturbed or agnostic filled)

vintrlv level = 2 -- load (as if SLEN=VLEN) into upper 1/2 of SLEN chunk
                     (lower half undisturbed or agnostic filled)

```


Single-width operations will work on either side of SLEN for vintrlv levels 1 and 2, but identically on all vl elements for vintrlv levels 0 and 3.


Widening operations can operate on either side of the SLEN chunks, providing a 2*SEW set of elements in an SLEN-length chunk (vintrlv levels 1 and 2).

Further, widening operations can operate with one source on one side and the other source on the other side of the SLEN chunks, providing a 2*SEW set of elements in an SLEN-length chunk (vintrlv level 3).
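
One possible reading of the four load layouts above, as a byte-offset model (a sketch only: the function, its parameters and the exact placement rules are my assumptions, not part of #421):

```python
# Byte offset at which element i of an SEW-bit load lands for each vintrlv
# level (widths in bits; level 0 assumes the round-robin SLEN striping).
def load_byte_offset(i, level, sew=8, vlen=256, slen=64):
    sections = vlen // slen
    if level == 0:                          # SEW-level interleave (current layout)
        return (i % sections) * (slen // 8) + (i // sections) * (sew // 8)
    if level == 3:                          # as if SLEN=VLEN: plain memory order
        return i * (sew // 8)
    per_half = (slen // 2) // sew           # levels 1 and 2: fill half of each chunk
    chunk, slot = divmod(i, per_half)
    half = 0 if level == 1 else slen // 16  # level 2 uses the upper half
    return chunk * (slen // 8) + half + slot * (sew // 8)

print([load_byte_offset(i, 1, sew=8,  vlen=256, slen=64) for i in range(8)])  # [0, 1, 2, 3, 8, 9, 10, 11]
print([load_byte_offset(i, 3, sew=16, vlen=256, slen=64) for i in range(8)])  # [0, 2, 4, 6, 8, 10, 12, 14]
```

With this reading, the 2*SEW results of a widening operation on a level-1/level-2 source would fill each SLEN chunk contiguously, lining up with the non-interleaved 2*SEW layout, which seems to be the alignment the observations above are pointing at.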


For further details please read #421.




On 2020-04-27 10:02 a.m., krste@... wrote:
I created a github issue for this, #434 - text repeated below,
Krste

Should SLEN=VLEN be an extension?

SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
wiring for wide datapaths such that bytes of different SEWs are laid
out differently in register bytes versus memory bytes, whereas when
SLEN=VLEN, the in-register format matches in-memory format for vectors
of all SEW.

Many vector routines can be written to be agnostic to SLEN, but some
routines can use the vector extension to manipulate data structures
that are not simple arrays of a single-width datatype (e.g., a network
packet). These routines can exploit SLEN=VLEN and hence that SEW can
be changed to access different element widths within same vector
register value, and many implementations will have SLEN=VLEN.

To support these kinds of routines portably on both SLEN<VLEN and
SLEN=VLEN machines, we could provide SEW "casting" operations that
internally rearrange in-register representations, e.g., converting a
vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
in the halfwords as they would if the vector was held in memory. For
SLEN=VLEN machines, all cast operations are a simple copy. However,
preserving compatibility between both types of machine incurs an
efficiency cost on the common SLEN=VLEN machines, and the "cast"
operation is not necessarily very efficient on the SLEN<VLEN machines
as it requires communication between the SLEN-wide sections, and
reloading vector from memory with different SEW might actually be more
efficient depending on the microarchitecture.

Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
this where available, avoiding needless casts. A downside would be
that this splits the software ecosystem if code that does not need to
depend on SLEN=VLEN inadvertently requires it. However, software
developers will be motivated to test for SLEN=VLEN to drop need to
perform cast operations even without an extension, so this split will
likely happen anyway.
The above proposal presupposes that SLEN=VLEN support be part of the base.

It also postulates that such casting operations are not necessary, as they can be avoided by judicious use of the INTRLV facilities.
I may be wrong and such cast operations may be beneficial.


A second issue either way is whether we should add "cast"
operations. They are primarily useful for the SLEN<VLEN machines
though are difficult to implement efficiently there; the SLEN=VLEN
implementation is just a register-register copy. We could choose to
add the cast operations as another optional extension, which is my
preference at this time.
So a separate extension for cast operations is also my current preference (if needed).









Bill Huffman
 

Hi David,

I don't understand your observation about "mixed SEW operations (widening & narrowing)..."  The conditions required for impact seem to me much stronger.  Arbitrary widening & narrowing mixing of SEW is fine (unless SEW>SLEN, which would be, at best, unusual).  The conditions to even be able to observe SLEN in real code seem to involve using a vector to represent an irregular structure - and probably one with elements not naturally aligned.

     Bill

On 4/27/20 10:33 AM, David Horner wrote:


as some are not on github, I posted this response to #434 here:

Observations:

- single SEW operations are agnostic to underlying structure (as Krte noted in recent doc revision)

- mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN

- mixed SEW operations are predominantly SEW <--> 2 * SEW

- by 2 interleaving chunks in SLEN=VLEN at SEW level align well with non-interleaved at 2 * SEW


Postulate:

That software can anticipate its need for a matching structures for widening/narrowing and memory overlay model and make a weighed choice.


I call the current interleave proposal SEW level interleave (elements are apportioned on a SEW basis amongst available SLEN chunks in a round robin fashion).


I thus propose a variant of #421 Fractional vtype field vfill – Fractional Fill order and Fractional Instruction eLement Location :


INTRLV defines 4 interleave formats:


- SLEN<VLEN (SEW level interleave)

- SLEN=VLEN (proposed as extension, essentially no interleave)

- Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but upper 1/2 of SLEN chunk is a gap (undisturbed or agnostic fill).

- Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but lower 1/2 of SLEN chunk is a gap (undisturbed or agnostic fill).


A 2bit vtype vintrlv field defines the application of these formats to various operations, the effect is determined by what kind of operation it is:




Load/Store will depending upon mode

```

vintrvl level = 0 -- scrample/descramble SEW level encoding

vintrvl level = 3 -- transfer as if SLEN=VLEN ( non-interleaved)

vintrvl level = 1 -- load (as if SLEN=VLEN) lower 1/2 of SLEN chunk

(upper undisturbed of agnostic filled)

vintrvl level = 2 -- load (as if SLEN=VLEN) upper1/2 of SLEN chunk

(lower undisturbed of agnostic filled)

```


Single width operations will work on either side of SLEN for vintrvl levels 1 and 2 , but identically on all vl elements for vintrvl level s 0 and 3


Widening operations can operate on either side of the SLEN chunks providing a 2*SEL set of elements in an SLEN length chunk (vintrvl levels 1 and 2).

Further, Widening operations can operate with one source on one side and the other on the other side of the SLEN chunks providing a 2*SEL set of elements in an SLEN length chunk. ( vintrvl levels 3).


For further details please read #421.




On 2020-04-27 10:02 a.m., krste@... wrote:
I created a github issue for this, #434 - text repeated below,
Krste

Should SLEN=VLEN be an extension?

SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
wiring for wide datapaths such that bytes of different SEWs are laid
out differently in register bytes versus memory bytes, whereas when
SLEN=VLEN, the in-register format matches in-memory format for vectors
of all SEW.

Many vector routines can be written to be agnostic to SLEN, but some
routines can use the vector extension to manipulate data structures
that are not simple arrays of a single-width datatype (e.g., a network
packet). These routines can exploit SLEN=VLEN and hence that SEW can
be changed to access different element widths within same vector
register value, and many implementations will have SLEN=VLEN.

To support these kinds of routines portably on both SLEN<VLEN and
SLEN=VLEN machines, we could provide SEW "casting" operations that
internally rearrange in-register representations, e.g., converting a
vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
in the halfwords as they would if the vector was held in memory. For
SLEN=VLEN machines, all cast operations are a simple copy. However,
preserving compatibility between both types of machine incurs an
efficiency cost on the common SLEN=VLEN machines, and the "cast"
operation is not necessarily very efficient on the SLEN<VLEN machines
as it requires communication between the SLEN-wide sections, and
reloading vector from memory with different SEW might actually be more
efficient depending on the microarchitecture.

Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
this where available, avoiding needless casts. A downside would be
that this splits the software ecosystem if code that does not need to
depend on SLEN=VLEN inadvertently requires it. However, software
developers will be motivated to test for SLEN=VLEN to drop need to
perform cast operations even without an extension, so this split will
likely happen anyway.
the above proposal presupposes the SLEN=VLEN support be part of the base.

It also postulates that such casting operations are not necessary as they can be avoided by judicious use of the INTRVL facilities.
I may be wrong and such caste [sick] operations may be beneficial.

A second issue either way is whether we should add "cast"
operations. They are primarily useful for the SLEN<VLEN machines
though are difficult to implement efficiently there; the SLEN=VLEN
implementation is just a register-register copy. We could choose to
add the cast operations as another optional extension, which is my
preference at this time.
So a separate extension for cast operations is also my current preference (if needed).





On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:
| Hi Krste,
| On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:

|     Could consider later adding "cast" instructions that convert a vector
|     of N SEW=8 elements into a vector of N/2 SEW=16 elements by
|     concatenating the two bytes (and similar for other combinations of
|     source and dest SEWs).  These would be a simple move/copy on an
|     SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
|     with bytes crossing between SLEN sections (probably reusing the memory
|     pipeline crossbar in an implementation, to store the source vector in
|     its memory format, then load the destination vector in its register
|     format).  So vector is loaded once from memory as SEW=8, then cast
|     into appropriate type to extract other fields.  Misaligned words might
|     need a slide before casting.

| I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete
| example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They
| currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.

| For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if
| SLEN==VLEN. For example,

| # SEW = 32, LMUL = 4
| v_reinterpret v0, e64, m1

| would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up
| to, or perhaps ignoring VL). And we would perform the inverse operation, "register-group fusion", as follows:

| # SEW = 64, LMUL = 1
| v_reinterpret v0, e32, m4

| Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN ==
| VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden
| from the programmer. Hence, I think it would be useful in developing portable software.

| Best,
| Nick Knight


Bill Huffman
 

On 4/27/20 7:02 AM, Krste Asanovic wrote:



I created a github issue for this, #434 - text repeated below,
Krste

Should SLEN=VLEN be an extension?

SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
wiring for wide datapaths such that bytes of different SEWs are laid
out differently in register bytes versus memory bytes, whereas when
SLEN=VLEN, the in-register format matches in-memory format for vectors
of all SEW.

Many vector routines can be written to be agnostic to SLEN, but some
routines can use the vector extension to manipulate data structures
that are not simple arrays of a single-width datatype (e.g., a network
packet). These routines can exploit SLEN=VLEN and hence that SEW can
be changed to access different element widths within same vector
register value, and many implementations will have SLEN=VLEN.

To support these kinds of routines portably on both SLEN<VLEN and
SLEN=VLEN machines, we could provide SEW "casting" operations that
internally rearrange in-register representations, e.g., converting a
vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
in the halfwords as they would if the vector was held in memory. For
SLEN=VLEN machines, all cast operations are a simple copy. However,
preserving compatibility between both types of machine incurs an
efficiency cost on the common SLEN=VLEN machines, and the "cast"
operation is not necessarily very efficient on the SLEN<VLEN machines
as it requires communication between the SLEN-wide sections, and
reloading vector from memory with different SEW might actually be more
efficient depending on the microarchitecture.

Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
this where available, avoiding needless casts. A downside would be
that this splits the software ecosystem if code that does not need to
depend on SLEN=VLEN inadvertently requires it. However, software
developers will be motivated to test for SLEN=VLEN to drop need to
perform cast operations even without an extension, so this split will
likely happen anyway.
It might be the case that the machines where SLEN=VLEN are the same
machines where it would be attractive to use vectors for such code -
machines where vectors provide larger registers and some parallelism,
rather than machines where vectors usually complete in one or a few
cycles and wouldn't deal well with irregular operations. That probably
increases the value of an extension.

On the other hand, adding casting operations would seem to decrease the
value of an extension (see below).


A second issue either way is whether we should add "cast"
operations. They are primarily useful for the SLEN<VLEN machines
though are difficult to implement efficiently there; the SLEN=VLEN
implementation is just a register-register copy. We could choose to
add the cast operations as another optional extension, which is my
preference at this time.
Where SLEN<VLEN, cast operations might be implemented as vector register
gather operations with element index values determined by SLEN, VLEN and
SEW. But where SLEN=VLEN, they would be moves. If we then add casts,
would an SLEN=VLEN extension still be valuable?
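
For concreteness, a sketch of how those gather indices could be computed (illustrative only; it assumes the round-robin striping model used earlier in the thread, with widths in bits):

```python
def reg_byte(mem_byte, sew, vlen, slen):
    # Register byte holding memory-image byte mem_byte for an SEW-bit vector.
    elem, inner = divmod(mem_byte, sew // 8)
    sections = vlen // slen
    return (elem % sections) * (slen // 8) + (elem // sections) * (sew // 8) + inner

def cast_gather(old_sew, new_sew, vlen, slen):
    # dst_register_byte -> src_register_byte permutation for the cast.
    perm = [0] * (vlen // 8)
    for m in range(vlen // 8):                 # walk the memory image
        perm[reg_byte(m, new_sew, vlen, slen)] = reg_byte(m, old_sew, vlen, slen)
    return perm

print(cast_gather(8, 16, vlen=128, slen=64))   # a real permutation when SLEN<VLEN
print(cast_gather(8, 16, vlen=128, slen=128))  # identity (a move) when SLEN=VLEN
```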

Bill











David Horner
 


mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN

I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. Rather than outputs aligning within a data lane, for widening operations the results are shuffled up to their appropriate slot in the destination register, and into the vd+1 register if vl >= VLMAX of a single physical register.


The observation is that on machines that

On 2020-04-27 1:51 p.m., Bill Huffman wrote:

Hi David,

I don't understand your observation about "mixed SEW operations (widening & narrowing)..." 

              "mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN"

I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. For widening operations, rather than the outputs aligning within a data lane,  the results are shuffled up to their appropriate slot in the destination register, and into the vd+1 register if vl >= (VLMAX of a single physical register).

 I understand this is a substantial impact for most implementations.

But we may have different interpretations of what Krste meant by SLEN=VLEN.
He floated an option that VLEN be reduced to SLEN, but my reading of this proposal doesn't jibe with that alternative.


The conditions required for impact seem to me much stronger.  Arbitrary widening & narrowing mixing of SEW is fine (unless SEW>SLEN, which would be, at best, unusual).  The conditions to even be able to observe SLEN in real code seem to involve using a vector to represent an irregular structure - and probably one with elements not naturally aligned.

And I'm afraid I'm not following you. I agreed with:

SEW>SLEN, which would be, at best, unusual

struggling with:

The conditions required for impact seem to me much stronger. -- Than "lane misalignment"?


Arbitrary widening & narrowing mixing of SEW is fine   --- in the model I describe above, or is there another interpretation I'm not fathoming?


The conditions to even be able to observe SLEN in real code seem to involve using a vector to represent an irregular structure
   ---- I think not just an irregular structure, but a composite structure (bytes within words, etc.)
 and probably one with elements not naturally aligned.  ---- natural alignment perhaps simplifies, but the issue exists even then.


(I realize an initial struggle I had was in interpreting your use of "vector" to mean mechanism. duh.)

     Bill

On 4/27/20 10:33 AM, David Horner wrote:

as some are not on github, I posted this response to #434 here:

Observations:

- single SEW operations are agnostic to underlying structure (as Krte noted in recent doc revision)

- mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN

- mixed SEW operations are predominantly SEW <--> 2 * SEW

- by 2 interleaving chunks in SLEN=VLEN at SEW level align well with non-interleaved at 2 * SEW


Postulate:

That software can anticipate its need for a matching structures for widening/narrowing and memory overlay model and make a weighed choice.


I call the current interleave proposal SEW level interleave (elements are apportioned on a SEW basis amongst available SLEN chunks in a round robin fashion).


I thus propose a variant of #421 Fractional vtype field vfill – Fractional Fill order and Fractional Instruction eLement Location :


INTRLV defines 4 interleave formats:


- SLEN<VLEN (SEW level interleave)

- SLEN=VLEN (proposed as extension, essentially no interleave)

- Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but upper 1/2 of SLEN chunk is a gap (undisturbed or agnostic fill).

- Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but lower 1/2 of SLEN chunk is a gap (undisturbed or agnostic fill).


A 2bit vtype vintrlv field defines the application of these formats to various operations, the effect is determined by what kind of operation it is:




Load/Store will depending upon mode

```

vintrvl level = 0 -- scrample/descramble SEW level encoding

vintrvl level = 3 -- transfer as if SLEN=VLEN ( non-interleaved)

vintrvl level = 1 -- load (as if SLEN=VLEN) lower 1/2 of SLEN chunk

(upper undisturbed of agnostic filled)

vintrvl level = 2 -- load (as if SLEN=VLEN) upper1/2 of SLEN chunk

(lower undisturbed of agnostic filled)

```


Single width operations will work on either side of SLEN for vintrvl levels 1 and 2 , but identically on all vl elements for vintrvl level s 0 and 3


Widening operations can operate on either side of the SLEN chunks providing a 2*SEL set of elements in an SLEN length chunk (vintrvl levels 1 and 2).

Further, Widening operations can operate with one source on one side and the other on the other side of the SLEN chunks providing a 2*SEL set of elements in an SLEN length chunk. ( vintrvl levels 3).


For further details please read #421.




On 2020-04-27 10:02 a.m., krste@... wrote:
I created a github issue for this, #434 - text repeated below,
Krste

Should SLEN=VLEN be an extension?

SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
wiring for wide datapaths such that bytes of different SEWs are laid
out differently in register bytes versus memory bytes, whereas when
SLEN=VLEN, the in-register format matches in-memory format for vectors
of all SEW.

Many vector routines can be written to be agnostic to SLEN, but some
routines can use the vector extension to manipulate data structures
that are not simple arrays of a single-width datatype (e.g., a network
packet). These routines can exploit SLEN=VLEN and hence that SEW can
be changed to access different element widths within same vector
register value, and many implementations will have SLEN=VLEN.

To support these kinds of routines portably on both SLEN<VLEN and
SLEN=VLEN machines, we could provide SEW "casting" operations that
internally rearrange in-register representations, e.g., converting a
vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
in the halfwords as they would if the vector was held in memory. For
SLEN=VLEN machines, all cast operations are a simple copy. However,
preserving compatibility between both types of machine incurs an
efficiency cost on the common SLEN=VLEN machines, and the "cast"
operation is not necessarily very efficient on the SLEN<VLEN machines
as it requires communication between the SLEN-wide sections, and
reloading vector from memory with different SEW might actually be more
efficient depending on the microarchitecture.

Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
this where available, avoiding needless casts. A downside would be
that this splits the software ecosystem if code that does not need to
depend on SLEN=VLEN inadvertently requires it. However, software
developers will be motivated to test for SLEN=VLEN to drop need to
perform cast operations even without an extension, so this split will
likely happen anyway.
the above proposal presupposes the SLEN=VLEN support be part of the base.

It also postulates that such casting operations are not necessary as they can be avoided by judicious use of the INTRVL facilities.
I may be wrong and such caste [sick] operations may be beneficial.

A second issue either way is whether we should add "cast"
operations. They are primarily useful for the SLEN<VLEN machines
though are difficult to implement efficiently there; the SLEN=VLEN
implementation is just a register-register copy. We could choose to
add the cast operations as another optional extension, which is my
preference at this time.
So a separate extension for cast operations is also my current preference (if needed).



On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:
| Hi Krste,
| On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:

|     Could consider later adding "cast" instructions that convert a vector
|     of N SEW=8 elements into a vector of N/2 SEW=16 elements by
|     concatenating the two bytes (and similar for other combinations of
|     source and dest SEWs).  These would be a simple move/copy on an
|     SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
|     with bytes crossing between SLEN sections (probably reusing the memory
|     pipeline crossbar in an implementation, to store the source vector in
|     its memory format, then load the destination vector in its register
|     format).  So vector is loaded once from memory as SEW=8, then cast
|     into appropriate type to extract other fields.  Misaligned words might
|     need a slide before casting.

| I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete
| example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They
| currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.

| For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if
| SLEN==VLEN. For example,

| # SEW = 32, LMUL = 4
| v_reinterpret v0, e64, m1

| would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up
| to, or perhaps ignoring VL). And we would perform the inverse operation, "register-group fusion", as follows:

| # SEW = 64, LMUL = 1
| v_reinterpret v0, e32, m4

| Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN ==
| VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden
| from the programmer. Hence, I think it would be useful in developing portable software.

| Best,
| Nick Knight



Bill Huffman
 

Sounds like maybe you're thinking that "widening of SLEN to VLEN" is a runtime setting or something like that.  There will be no (certainly few) machines where SLEN is variable as the power/area cost would be too high.  But maybe you meant something else.

     Bill

On 4/27/20 11:50 AM, DSHORNER wrote:



mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN

I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. Rather than outputs aligning within a data lane, for widening operations the results are shuffled up to their appropriate slot in the destination register, and into the vd+1 register if vl >= VLMAX of a single physical register.


The observation is that on machines that

On 2020-04-27 1:51 p.m., Bill Huffman wrote:

Hi David,

I don't understand your observation about "mixed SEW operations (widening & narrowing)..." 

              "mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN"

I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. For widening operations, rather than the outputs aligning within a data lane,  the results are shuffled up to their appropriate slot in the destination register, and into the vd+1 register if vl >= (VLMAX of a single physical register).

 I understand this is a substantial impact for most implementations.

But we may have different interpretations of what Krste meant by SLEN=VLEN.
He floated an option that VLEN be reduced to SLEN, but my reading of this proposal doesn't jive with that alternative.


The conditions required for impact seem to me much stronger.  Arbitrary widening & narrowing mixing of SEW is fine (unless SEW>SLEN, which would be, at best, unusual).  The conditions to even be able to observe SLEN in real code seem to involve using a vector to represent an irregular structure - and probably one with elements not naturally aligned.

And I'm afraid I'm not following you. I agreed with:

SEW>SLEN, which would be, at best, unusual

struggling with:

The conditions required for impact seem to me much stronger. -- Than "lane misalignment"?


Arbitrary widening & narrowing mixing of SEW is fine   --- in the model I describe above, or is there another interpretation I'm not fathoming.


The conditions to even be able to observe SLEN in real code seem to involve using a vector to represent an irregular structure
   ---- I think not just an irregular structure, but a composite structure (bytes within words, etc.)
 and probably one with elements not naturally aligned.  ---- natural alignment perhaps simplifies, but the issue exists even then.


(I realize an initial struggle I had was in interpreting your use of "vector" to mean mechanism. duh.)

     Bill

On 4/27/20 10:33 AM, David Horner wrote:

as some are not on github, I posted this response to #434 here:

Observations:

- single SEW operations are agnostic to underlying structure (as Krte noted in recent doc revision)

- mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN

- mixed SEW operations are predominantly SEW <--> 2 * SEW

- by 2 interleaving chunks in SLEN=VLEN at SEW level align well with non-interleaved at 2 * SEW


Postulate:

That software can anticipate its need for a matching structures for widening/narrowing and memory overlay model and make a weighed choice.


I call the current interleave proposal SEW level interleave (elements are apportioned on a SEW basis amongst available SLEN chunks in a round robin fashion).


I thus propose a variant of #421 Fractional vtype field vfill – Fractional Fill order and Fractional Instruction eLement Location :


INTRLV defines 4 interleave formats:


- SLEN<VLEN (SEW level interleave)

- SLEN=VLEN (proposed as extension, essentially no interleave)

- Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but upper 1/2 of SLEN chunk is a gap (undisturbed or agnostic fill).

- Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but lower 1/2 of SLEN chunk is a gap (undisturbed or agnostic fill).


A 2bit vtype vintrlv field defines the application of these formats to various operations, the effect is determined by what kind of operation it is:




Load/Store will depending upon mode

```

vintrvl level = 0 -- scrample/descramble SEW level encoding

vintrvl level = 3 -- transfer as if SLEN=VLEN ( non-interleaved)

vintrvl level = 1 -- load (as if SLEN=VLEN) lower 1/2 of SLEN chunk

(upper undisturbed of agnostic filled)

vintrvl level = 2 -- load (as if SLEN=VLEN) upper1/2 of SLEN chunk

(lower undisturbed of agnostic filled)

```


Single width operations will work on either side of SLEN for vintrvl levels 1 and 2 , but identically on all vl elements for vintrvl level s 0 and 3


Widening operations can operate on either side of the SLEN chunks providing a 2*SEL set of elements in an SLEN length chunk (vintrvl levels 1 and 2).

Further, Widening operations can operate with one source on one side and the other on the other side of the SLEN chunks providing a 2*SEL set of elements in an SLEN length chunk. ( vintrvl levels 3).


For further details please read #421.




On 2020-04-27 10:02 a.m., krste@... wrote:
I created a github issue for this, #434 - text repeated below,
Krste

Should SLEN=VLEN be an extension?

SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
wiring for wide datapaths such that bytes of different SEWs are laid
out differently in register bytes versus memory bytes, whereas when
SLEN=VLEN, the in-register format matches in-memory format for vectors
of all SEW.

Many vector routines can be written to be agnostic to SLEN, but some
routines can use the vector extension to manipulate data structures
that are not simple arrays of a single-width datatype (e.g., a network
packet). These routines can exploit SLEN=VLEN and hence that SEW can
be changed to access different element widths within same vector
register value, and many implementations will have SLEN=VLEN.

To support these kinds of routines portably on both SLEN<VLEN and
SLEN=VLEN machines, we could provide SEW "casting" operations that
internally rearrange in-register representations, e.g., converting a
vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
in the halfwords as they would if the vector was held in memory. For
SLEN=VLEN machines, all cast operations are a simple copy. However,
preserving compatibility between both types of machine incurs an
efficiency cost on the common SLEN=VLEN machines, and the "cast"
operation is not necessarily very efficient on the SLEN<VLEN machines
as it requires communication between the SLEN-wide sections, and
reloading vector from memory with different SEW might actually be more
efficient depending on the microarchitecture.

Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
this where available, avoiding needless casts. A downside would be
that this splits the software ecosystem if code that does not need to
depend on SLEN=VLEN inadvertently requires it. However, software
developers will be motivated to test for SLEN=VLEN to drop need to
perform cast operations even without an extension, so this split will
likely happen anyway.
the above proposal presupposes the SLEN=VLEN support be part of the base.

It also postulates that such casting operations are not necessary as they can be avoided by judicious use of the INTRVL facilities.
I may be wrong and such caste [sick] operations may be beneficial.

A second issue either way is whether we should add "cast"
operations. They are primarily useful for the SLEN<VLEN machines
though are difficult to implement efficiently there; the SLEN=VLEN
implementation is just a register-register copy. We could choose to
add the cast operations as another optional extension, which is my
preference at this time.
So a separate extension for cast operations is also my current preference (if needed).

On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:
| Hi Krste,
| On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:

|     Could consider later adding "cast" instructions that convert a vector
|     of N SEW=8 elements into a vector of N/2 SEW=16 elements by
|     concatenating the two bytes (and similar for other combinations of
|     source and dest SEWs).  These would be a simple move/copy on an
|     SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
|     with bytes crossing between SLEN sections (probably reusing the memory
|     pipeline crossbar in an implementation, to store the source vector in
|     its memory format, then load the destination vector in its register
|     format).  So vector is loaded once from memory as SEW=8, then cast
|     into appropriate type to extract other fields.  Misaligned words might
|     need a slide before casting.

| I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete
| example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They
| currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.

| For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if
| SLEN==VLEN. For example,

| # SEW = 32, LMUL = 4
| v_reinterpret v0, e64, m1

| would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up
| to, or perhaps ignoring VL). And we would perform the inverse operation, "register-group fusion", as follows:

| # SEW = 64, LMUL = 1
| v_reinterpret v0, e32, m4

| Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN ==
| VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden
| from the programmer. Hence, I think it would be useful in developing portable software.

| Best,
| Nick Knight



Krste Asanovic
 

I meant the SLEN=VLEN "extension" to simply be an assertion about the
machine's static configuration. Software could then rely on
in-register format matching in-memory format.

Krste

On Mon, 27 Apr 2020 18:57:42 +0000, Bill Huffman <huffman@...> said:
| Sounds like maybe you're thinking that "widening of SLEN to VLEN" is a runtime setting or something like
| that. There will be no (certainly few) machines where SLEN is variable as the power/area cost would be
| too high. But maybe you meant something else.

| Bill

| On 4/27/20 11:50 AM, DSHORNER wrote:


| mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN

| I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
| The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. Rather than outputs
| aligning within a data lane, for widening operations the results are shuffled up to their
| appropriate slot in the destination register, and into the vd+1 register if vl >= VLMAX of a single
| physical register.

| The observation is that on machines that

| On 2020-04-27 1:51 p.m., Bill Huffman wrote:

| Hi David,

| I don't understand your observation about "mixed SEW operations (widening & narrowing)..."

| "mixed SEW operations (widening & narrowing) have substantial impact on contiguous
| SLEN=VLEN"

| I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
| The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. For widening operations,
| rather than the outputs aligning within a data lane, the results are shuffled up to their
| appropriate slot in the destination register, and into the vd+1 register if vl >= (VLMAX of a single
| physical register).

| I understand this is a substantial impact for most implementations.

| But we may have different interpretations of what Krste meant by SLEN=VLEN.
| He floated an option that VLEN be reduced to SLEN, but my reading of this proposal doesn't jive with
| that alternative.

| The conditions required for impact seem to me much stronger. Arbitrary widening & narrowing
| mixing of SEW is fine (unless SEW>SLEN, which would be, at best, unusual). The conditions to
| even be able to observe SLEN in real code seem to involve using a vector to represent an
| irregular structure - and probably one with elements not naturally aligned.

| And I'm afraid I'm not following you. I agreed with:

| SEW>SLEN, which would be, at best, unusual

| struggling with:

| The conditions required for impact seem to me much stronger. -- Than "lane misalignment"?

| Arbitrary widening & narrowing mixing of SEW is fine --- in the model I describe above, or is
| there another interpretation I'm not fathoming.

| The conditions to even be able to observe SLEN in real code seem to involve using a vector to
| represent an irregular structure
| ---- I think not just an irregular structure, but a composite structure (bytes within words,
| etc.)
| and probably one with elements not naturally aligned. ---- natural alignment perhaps simplifies,
| but the issue exists even then.

| (I realize an initial struggle I had was in interpreting your use of "vector" to mean mechanism.
| duh.)

| Bill

| On 4/27/20 10:33 AM, David Horner wrote:


| as some are not on github, I posted this response to #434 here:

| Observations:

| - single SEW operations are agnostic to underlying structure (as Krte noted in recent doc
| revision)

| - mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=
| VLEN

| - mixed SEW operations are predominantly SEW <--> 2 * SEW

| - by 2 interleaving chunks in SLEN=VLEN at SEW level align well with non-interleaved at 2 *
| SEW

| Postulate:

| That software can anticipate its need for a matching structures for widening/narrowing and
| memory overlay model and make a weighed choice.

| I call the current interleave proposal SEW level interleave (elements are apportioned on a
| SEW basis amongst available SLEN chunks in a round robin fashion).

| I thus propose a variant of #421 Fractional vtype field vfill – Fractional Fill order and
| Fractional Instruction eLement Location :

| INTRLV defines 4 interleave formats:

| - SLEN<VLEN (SEW level interleave)

| - SLEN=VLEN (proposed as extension, essentially no interleave)

| - Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but upper 1/2 of SLEN
| chunk is a gap (undisturbed or agnostic fill).

| - Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but lower 1/2 of SLEN
| chunk is a gap (undisturbed or agnostic fill).

| A 2-bit vtype vintrlv field defines the application of these formats to various operations;
| the effect is determined by the kind of operation:

| Load/Store behavior will depend upon mode:

| ```

| vintrvl level = 0 -- scramble/descramble SEW level encoding

| vintrvl level = 3 -- transfer as if SLEN=VLEN (non-interleaved)

| vintrvl level = 1 -- load (as if SLEN=VLEN) lower 1/2 of SLEN chunk

| (upper undisturbed or agnostic filled)

| vintrvl level = 2 -- load (as if SLEN=VLEN) upper 1/2 of SLEN chunk

| (lower undisturbed or agnostic filled)

| ```

| Single width operations will work on either side of SLEN for vintrvl levels 1 and 2, but
| identically on all vl elements for vintrvl levels 0 and 3.

| Widening operations can operate on either side of the SLEN chunks, providing a 2*SEW set of
| elements in an SLEN-length chunk (vintrvl levels 1 and 2).

| Further, widening operations can operate with one source on one side and the other source on the
| other side of the SLEN chunks, providing a 2*SEW set of elements in an SLEN-length chunk
| (vintrvl level 3).

| For further details please read #421.

| On 2020-04-27 10:02 a.m., krste@... wrote:

| I created a github issue for this, #434 - text repeated below,
| Krste

| Should SLEN=VLEN be an extension?

| SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
| wiring for wide datapaths such that bytes of different SEWs are laid
| out differently in register bytes versus memory bytes, whereas when
| SLEN=VLEN, the in-register format matches in-memory format for vectors
| of all SEW.

| Many vector routines can be written to be agnostic to SLEN, but some
| routines can use the vector extension to manipulate data structures
| that are not simple arrays of a single-width datatype (e.g., a network
| packet). These routines can exploit SLEN=VLEN and hence that SEW can
| be changed to access different element widths within same vector
| register value, and many implementations will have SLEN=VLEN.

| To support these kinds of routines portably on both SLEN<VLEN and
| SLEN=VLEN machines, we could provide SEW "casting" operations that
| internally rearrange in-register representations, e.g., converting a
| vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
| in the halfwords as they would if the vector was held in memory. For
| SLEN=VLEN machines, all cast operations are a simple copy. However,
| preserving compatibility between both types of machine incurs an
| efficiency cost on the common SLEN=VLEN machines, and the "cast"
| operation is not necessarily very efficient on the SLEN<VLEN machines
| as it requires communication between the SLEN-wide sections, and
| reloading vector from memory with different SEW might actually be more
| efficient depending on the microarchitecture.

| Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
| this where available, avoiding needless casts. A downside would be
| that this splits the software ecosystem if code that does not need to
| depend on SLEN=VLEN inadvertently requires it. However, software
| developers will be motivated to test for SLEN=VLEN to drop need to
| perform cast operations even without an extension, so this split will
| likely happen anyway.

| The above proposal presupposes that SLEN=VLEN support be part of the base.

| It also postulates that such casting operations are not necessary, as they can be avoided by
| judicious use of the INTRVL facilities.
| I may be wrong and such cast operations may be beneficial.

| A second issue either way is whether we should add "cast"
| operations. They are primarily useful for the SLEN<VLEN machines
| though are difficult to implement efficiently there; the SLEN=VLEN
| implementation is just a register-register copy. We could choose to
| add the cast operations as another optional extension, which is my
| preference at this time.

| So a separate extension for cast operations is also my current preference (if needed).

| On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:

| | Hi Krste,
| | On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:

| | Could consider later adding "cast" instructions that convert a vector
| | of N SEW=8 elements into a vector of N/2 SEW=16 elements by
| | concatenating the two bytes (and similar for other combinations of
| | source and dest SEWs). These would be a simple move/copy on an
| | SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
| | with bytes crossing between SLEN sections (probably reusing the memory
| | pipeline crossbar in an implementation, to store the source vector in
| | its memory format, then load the destination vector in its register
| | format). So vector is loaded once from memory as SEW=8, then cast
| | into appropriate type to extract other fields. Misaligned words might
| | need a slide before casting.

| | I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete
| | example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They
| | currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.

| | For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if
| | SLEN==VLEN. For example,

| | # SEW = 32, LMUL = 4
| | v_reinterpret v0, e64, m1

| | would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up
| | to, or perhaps ignoring VL). And we would perform the inverse operation, "register-group fusion", as follows:

| | # SEW = 64, LMUL = 1
| | v_reinterpret v0, e32, m4

| | Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN ==
| | VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden
| | from the programmer. Hence, I think it would be useful in developing portable software.

| | Best,
| | Nick Knight

|


David Horner
 

On 2020-04-27 3:16 p.m., krste@... wrote:
I meant the SLEN=VLEN "extension" to simply be an assertion about the
machine's static configuration. Software could then rely on
in-register format matching in-memory format.

Krste
Then I agree that the risk of software fragmentation is high with such an extension.
The reality is that some machines will indeed be SLEN=VLEN and thus risk some fragmentation.

I am indeed proposing a run-time mechanism for supporting "in-register format matching in-memory format" and
indefinite levels of widening/narrowing.

For what it's worth, I think avoiding inconvenience in widening/narrowing beyond one level is much less valuable than maintaining the industry standard of matching the in-memory format.

If we have to decide on one, I believe we should toss the SEW-level interleave (as much as I am very fond of it).
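
For concreteness, here is a minimal C sketch (not spec text) of the element-index-to-register-byte mapping implied by the SEW-level round-robin interleave, compared with the SLEN=VLEN layout; the helper name and parameters are illustrative assumptions:

```
#include <stdio.h>

/* Sketch of where element i lands within a vector register, assuming the
 * SEW-level interleave apportions elements round-robin across SLEN chunks.
 * All names here are illustrative; widths are given in bits. */
static unsigned reg_byte_offset(unsigned i, unsigned vlen, unsigned slen,
                                unsigned sew)
{
    unsigned nchunks = vlen / slen;           /* SLEN chunks per register   */
    if (nchunks == 1)                         /* SLEN=VLEN: memory order    */
        return i * (sew / 8);
    unsigned chunk = i % nchunks;             /* round-robin across chunks  */
    unsigned slot  = i / nchunks;             /* position within the chunk  */
    return chunk * (slen / 8) + slot * (sew / 8);
}

int main(void)
{
    /* VLEN=256, SLEN=128, SEW=8: memory bytes 0,1,2,3 land at register byte
     * offsets 0,16,1,17 -- adjacency in memory is lost in the register,
     * whereas with SLEN=VLEN the offsets would simply be 0,1,2,3. */
    for (unsigned i = 0; i < 8; i++)
        printf("element %u -> register byte %2u\n",
               i, reg_byte_offset(i, 256, 128, 8));
    return 0;
}
```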

On Mon, 27 Apr 2020 18:57:42 +0000, Bill Huffman <huffman@...> said:
| Sounds like maybe you're thinking that "widening of SLEN to VLEN" is a runtime setting or something like
| that. There will be no (certainly few) machines where SLEN is variable as the power/area cost would be
| too high. But maybe you meant something else.

| Bill

| On 4/27/20 11:50 AM, DSHORNER wrote:

| EXTERNAL MAIL

| mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=VLEN
| I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
| The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. Rather than outputs
| aligning within a data lane, for widening operations the results are shuffled up to their
| appropriate slot in the destination register, and into the vd+1 register if vl >= VLMAX of a single
| physical register.

| The observation is that on machines that
| On 2020-04-27 1:51 p.m., Bill Huffman wrote:

| Hi David,
| I don't understand your observation about "mixed SEW operations (widening & narrowing)..."
| "mixed SEW operations (widening & narrowing) have substantial impact on contiguous
| SLEN=VLEN"
| I read the "SLEN=VLEN" extension as a logical/virtual widening of SLEN to VLEN.
| The machine behaves as if SLEN=VLEN but the data lanes are not VLEN wide. For widening operations,
| rather than the outputs aligning within a data lane, the results are shuffled up to their
| appropriate slot in the destination register, and into the vd+1 register if vl >= (VLMAX of a single
| physical register).
| I understand this is a substantial impact for most implementations.
| But we may have different interpretations of what Krste meant by SLEN=VLEN.
| He floated an option that VLEN be reduced to SLEN, but my reading of this proposal doesn't jibe with
| that alternative.

| The conditions required for impact seem to me much stronger. Arbitrary widening & narrowing
| mixing of SEW is fine (unless SEW>SLEN, which would be, at best, unusual). The conditions to
| even be able to observe SLEN in real code seem to involve using a vector to represent an
| irregular structure - and probably one with elements not naturally aligned.
| And I'm afraid I'm not following you. I agreed with:
SEW>SLEN, which would be, at best, unusual
| struggling with:
| The conditions required for impact seem to me much stronger. -- Than "lane misalignment"?

| Arbitrary widening & narrowing mixing of SEW is fine --- in the model I describe above, or is
| there another interpretation I'm not fathoming.

| The conditions to even be able to observe SLEN in real code seem to involve using a vector to
| represent an irregular structure
| ---- I think not just an irregular structure, but a composite structure (bytes within words,
| etc.)
| and probably one with elements not naturally aligned. ---- natural alignment perhaps simplifies,
| but the issue exists even then.
| (I realize an initial struggle I had was in interpreting your use of "vector" to mean mechanism.
| duh.)

| Bill
| On 4/27/20 10:33 AM, David Horner wrote:
| EXTERNAL MAIL
| as some are not on github, I posted this response to #434 here:

| Observations:
| - single SEW operations are agnostic to underlying structure (as Krste noted in recent doc
| revision)
| - mixed SEW operations (widening & narrowing) have substantial impact on contiguous SLEN=
| VLEN
| - mixed SEW operations are predominantly SEW <--> 2 * SEW
| - by-2 interleaving of chunks in SLEN=VLEN at the SEW level aligns well with non-interleaved at
| 2 * SEW

| Postulate:
| That software can anticipate its need for matching structures for widening/narrowing and the
| memory overlay model, and make a weighted choice.

| I call the current interleave proposal SEW level interleave (elements are apportioned on a
| SEW basis amongst available SLEN chunks in a round robin fashion).

| I thus propose a variant of #421 Fractional vtype field vfill – Fractional Fill order and
| Fractional Instruction eLement Location :

| INTRLV defines 4 interleave formats:

| - SLEN<VLEN (SEW level interleave)
| - SLEN=VLEN (proposed as extension, essentially no interleave)
| - Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but upper 1/2 of SLEN
| chunk is a gap (undisturbed or agnostic fill).
| - Layout is identical to “SLEN=VLEN” even though SLEN chunks exist, but lower 1/2 of SLEN
| chunk is a gap (undisturbed or agnostic fill).

| A 2-bit vtype vintrlv field defines the application of these formats to various operations;
| the effect is determined by the kind of operation:

| Load/Store behavior will depend upon mode:
| ```
| vintrvl level = 0 -- scramble/descramble SEW level encoding
| vintrvl level = 3 -- transfer as if SLEN=VLEN (non-interleaved)
| vintrvl level = 1 -- load (as if SLEN=VLEN) lower 1/2 of SLEN chunk
| (upper undisturbed or agnostic filled)
| vintrvl level = 2 -- load (as if SLEN=VLEN) upper 1/2 of SLEN chunk
| (lower undisturbed or agnostic filled)
| ```

| Single width operations will work on either side of SLEN for vintrvl levels 1 and 2, but
| identically on all vl elements for vintrvl levels 0 and 3.

| Widening operations can operate on either side of the SLEN chunks, providing a 2*SEW set of
| elements in an SLEN-length chunk (vintrvl levels 1 and 2).
| Further, widening operations can operate with one source on one side and the other source on the
| other side of the SLEN chunks, providing a 2*SEW set of elements in an SLEN-length chunk
| (vintrvl level 3).

| For further details please read #421.

| On 2020-04-27 10:02 a.m., krste@... wrote:
| I created a github issue for this, #434 - text repeated below,
| Krste
| Should SLEN=VLEN be an extension?
| SLEN<VLEN introduces internal rearrangements to reduce cross-datapath
| wiring for wide datapaths such that bytes of different SEWs are laid
| out differently in register bytes versus memory bytes, whereas when
| SLEN=VLEN, the in-register format matches in-memory format for vectors
| of all SEW.
| Many vector routines can be written to be agnostic to SLEN, but some
| routines can use the vector extension to manipulate data structures
| that are not simple arrays of a single-width datatype (e.g., a network
| packet). These routines can exploit SLEN=VLEN and hence that SEW can
| be changed to access different element widths within same vector
| register value, and many implementations will have SLEN=VLEN.
| To support these kinds of routines portably on both SLEN<VLEN and
| SLEN=VLEN machines, we could provide SEW "casting" operations that
| internally rearrange in-register representations, e.g., converting a
| vector of N SEW=8 bytes to N/2 SEW=16 halfwords with bytes appearing
| in the halfwords as they would if the vector was held in memory. For
| SLEN=VLEN machines, all cast operations are a simple copy. However,
| preserving compatibility between both types of machine incurs an
| efficiency cost on the common SLEN=VLEN machines, and the "cast"
| operation is not necessarily very efficient on the SLEN<VLEN machines
| as it requires communication between the SLEN-wide sections, and
| reloading vector from memory with different SEW might actually be more
| efficient depending on the microarchitecture.
| Making SLEN=VLEN an extension (Zveqs?) enables software to exploit
| this where available, avoiding needless casts. A downside would be
| that this splits the software ecosystem if code that does not need to
| depend on SLEN=VLEN inadvertently requires it. However, software
| developers will be motivated to test for SLEN=VLEN to drop need to
| perform cast operations even without an extension, so this split will
| likely happen anyway.
| The above proposal presupposes that SLEN=VLEN support be part of the base.
| It also postulates that such casting operations are not necessary, as they can be avoided by
| judicious use of the INTRVL facilities.
| I may be wrong and such cast operations may be beneficial.

| A second issue either way is whether we should add "cast"
| operations. They are primarily useful for the SLEN<VLEN machines
| though are difficult to implement efficiently there; the SLEN=VLEN
| implementation is just a register-register copy. We could choose to
| add the cast operations as another optional extension, which is my
| preference at this time.
| So a separate extension for cast operations is also my current preference (if needed).

| On Sun, 26 Apr 2020 13:27:39 -0700, Nick Knight <nick.knight@...> said:
| | Hi Krste,
| | On Sat, Apr 25, 2020 at 11:51 PM Krste Asanovic <krste@...> wrote:
| | Could consider later adding "cast" instructions that convert a vector
| | of N SEW=8 elements into a vector of N/2 SEW=16 elements by
| | concatenating the two bytes (and similar for other combinations of
| | source and dest SEWs). These would be a simple move/copy on an
| | SLEN=VLEN machine, but would perform a permute on an SLEN<VLEN machine
| | with bytes crossing between SLEN sections (probably reusing the memory
| | pipeline crossbar in an implementation, to store the source vector in
| | its memory format, then load the destination vector in its register
| | format). So vector is loaded once from memory as SEW=8, then cast
| | into appropriate type to extract other fields. Misaligned words might
| | need a slide before casting.
| | I have recently learned from my interactions with EPI/BSC folks that cryptographic routines make frequent use of such operations. For one concrete
| | example, they need to reinterpret an e32 vector as e64 (length VL/2) to perform 64-bit arithmetic, then reinterpret the result as e32. They
| | currently use SLEN-dependent type-punning; this example only seems to be problematic in the extremal case SLEN == 32.
| | For these types of problems, it would be useful to have a "reinterpret_cast" instruction, which changes SEW and LMUL on a register group as if
| | SLEN==VLEN. For example,
| | # SEW = 32, LMUL = 4
| | v_reinterpret v0, e64, m1
| | would perform "register-group fission" on [v0, v1, v2, v3], concatenating (logically) adjacent pairs of 32-bit elements into 64-bit elements (up
| | to, or perhaps ignoring VL). And we would perform the inverse operation, "register-group fusion", as follows:
| | # SEW = 64, LMUL = 1
| | v_reinterpret v0, e32, m4
| | Like you suggested, this is implementable by (sequences of) stores and loads; the advantage is it optimizes for the (common?) case of SLEN ==
| | VLEN. And there probably are optimizations for other combinations of SLEN, VLEN, SEW_{old,new}, and LMUL_{old,new}, which could also be hidden
| | from the programmer. Hence, I think it would be useful in developing portable software.
| | Best,
| | Nick Knight

|


Krste Asanovic
 

On Mon, 27 Apr 2020 18:14:39 +0000, Bill Huffman <huffman@...> said:
| On 4/27/20 7:02 AM, Krste Asanovic wrote:
[..]
|| I created a github issue for this, #434 - text repeated below,
|| Krste
||
|| Should SLEN=VLEN be an extension?
||
[...]

| It might be the case that the machines where SLEN=VLEN would be the same
| machines where it would be attractive to use vectors for such code -
| machines where vectors provided larger registers and some parallelism
| rather than machines where vectors usually complete in one or a few
| cycles and wouldn't deal well with irregular operations. That probably
| increases the value of an extension.

I think having vectors complete in one or a few cycles (shallow
temporal) is orthogonal to choice of SLEN=VLEN.

I think SLEN=VLEN is simply about how wide you want interactions
between arithmetic units. I'm guessing e.g. 128-256b wide datapaths
are probably OK with SLEN=VLEN, whereas 512b and up datapaths are
probably starting to see issues, independent of VLEN in either case.

| On the other hand, adding casting operations would seem to decrease the
| value of an extension (see below).

|| A second issue either way is whether we should add "cast"
|| operations. They are primarily useful for the SLEN<VLEN machines
|| though are difficult to implement efficiently there; the SLEN=VLEN
|| implementation is just a register-register copy. We could choose to
|| add the cast operations as another optional extension, which is my
|| preference at this time.

| Where SLEN<VLEN, cast operations might be implemented as vector register
| gather operations with element index values determined by SLEN, VLEN and
| SEW.

Agree this is a sensible implementation strategy, but pattern is
simpler than general vrgather, and can also implement as a store(src
SEW)+load(dest SEW) across memory crossbar given that you need to
materialize/parse in-memory formats there anyway.
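
As a rough illustration of that pattern (a sketch under example parameters, not a spec definition), the following computes the byte permutation such a cast would imply for SEW=8 to SEW=16 with VLEN=256 and SLEN=128; with SLEN=VLEN the same loop degenerates to the identity, i.e., a plain copy:

```
#include <stdio.h>

/* Reference-model sketch of the permutation a "cast" from SEW=8 to SEW=16
 * would require on an SLEN<VLEN machine: for each destination register byte,
 * which source register byte supplies it.  Names and parameters are
 * illustrative assumptions, not spec text. */

/* Register byte offset of element i under a round-robin SLEN interleave. */
static unsigned elem_byte(unsigned i, unsigned vlen, unsigned slen,
                          unsigned sew)
{
    unsigned nchunks = vlen / slen;
    unsigned chunk = i % nchunks, slot = i / nchunks;
    return chunk * (slen / 8) + slot * (sew / 8);
}

int main(void)
{
    const unsigned VLEN = 256, SLEN = 128;   /* example machine parameters */
    const unsigned SEWD = 16;                /* destination element width  */
    unsigned nbytes = VLEN / 8;

    /* Destination element j, byte k must hold memory byte m = 2*j + k,
     * which the SEW=8 source register keeps as (one-byte) element m. */
    for (unsigned j = 0; j < nbytes / (SEWD / 8); j++)
        for (unsigned k = 0; k < SEWD / 8; k++) {
            unsigned m   = j * (SEWD / 8) + k;
            unsigned src = elem_byte(m, VLEN, SLEN, 8);      /* source byte */
            unsigned dst = elem_byte(j, VLEN, SLEN, SEWD) + k;
            printf("dst byte %2u <- src byte %2u\n", dst, src);
        }
    /* With SLEN == VLEN (one chunk) dst always equals src: the cast is a
     * register-register copy, as noted above. */
    return 0;
}
```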

| But where SLEN=VLEN, they would be moves. If then, we add casts,
| would an SLEN=VLEN extension still be valuable?

Casting makes it possible to have a common interface, but given that
SLEN=VLEN will be common choice and it's easy for software to figure
this out, and there is a performance/complexity advantage to not using
the casts when SLEN=VLEN, I can't see mandating everyone use the
casting model as working in practice. Also, I don't believe casting
provides an efficient solution for all the use cases.

Now, a SLEN<VLEN machine could provide a configuration switch to turn
off all but the first SLEN partition (maybe what David was alluding to)
and then support the SLEN=VLEN extension albeit at reduced
performance.

And an SLEN=VLEN machine could implement the cast extension to run
software that used those at no penalty.

Krste


Bill Huffman
 

On 4/27/20 12:32 PM, krste@... wrote:
EXTERNAL MAIL



On Mon, 27 Apr 2020 18:14:39 +0000, Bill Huffman <huffman@...> said:
| On 4/27/20 7:02 AM, Krste Asanovic wrote:
[..]
|| I created a github issue for this, #434 - text repeated below,
|| Krste
||
|| Should SLEN=VLEN be an extension?
||
[...]

| It might be the case that the machines where SLEN=VLEN would be the same
| machines where it would be attractive to use vectors for such code -
| machines where vectors provided larger registers and some parallelism
| rather than machines where vectors usually complete in one or a few
| cycles and wouldn't deal well with irregular operations. That probably
| increases the value of an extension.

I think having vectors complete in one or a few cycles (shallow
temporal) is orthogonal to choice of SLEN=VLEN.

I think SLEN=VLEN is simply about how wide you want interactions
between arithmetic units. I'm guessing e.g. 128-256b wide datapaths
are probably OK with SLEN=VLEN, whereas 512b and up datapaths are
probably starting to see issues, independent of VLEN in either case.
Sorry, I didn't say what I meant very well. I agree that it's the width
that matters. Machines with short vector registers are likely to be
SLEN=VLEN even if they complete quickly.

In my experience 256b width is shaky and may well want SLEN=128.

In any case, I'm wondering if having cast instructions is better than an
extension. I think it avoids the potential fragmentation.


| On the other hand, adding casting operations would seem to decrease the
| value of an extension (see below).

|| A second issue either way is whether we should add "cast"
|| operations. They are primarily useful for the SLEN<VLEN machines
|| though are difficult to implement efficiently there; the SLEN=VLEN
|| implementation is just a register-register copy. We could choose to
|| add the cast operations as another optional extension, which is my
|| preference at this time.

| Where SLEN<VLEN, cast operations might be implemented as vector register
| gather operations with element index values determined by SLEN, VLEN and
| SEW.

Agree this is a sensible implementation strategy, but pattern is
simpler than general vrgather, and can also implement as a store(src
SEW)+load(dest SEW) across memory crossbar given that you need to
materialize/parse in-memory formats there anyway.
OK. That's also quite do-able. Physical layout and control issues
could make for either implementation, I think.


| But where SLEN=VLEN, they would be moves. If then, we add casts,
| would an SLEN=VLEN extension still be valuable?

Casting makes it possible to have a common interface, but given that
SLEN=VLEN will be common choice and it's easy for software to figure
this out, and there is a performance/complexity advantage to not using
the casts when SLEN=VLEN, I can't see mandating everyone use the
casting model as working in practice. Also, I don't believe casting
provides an efficient solution for all the use cases.

Now, a SLEN<VLEN machine could provide a configuration switch to turn
off all but the first SLEN partition (maybe what David was alluding to)
and then support the SLEN=VLEN extension albeit at reduced
performance.
Agreed. That's feasible. It might be set by vsetvl, but unchanged by
vsetvli, and implemented by reduction of VLMAX as you suggest. That
might be a reasonable tradeoff.

Maybe there's no cast and no extension. Only a bit that may reduce
performance, but makes SLEN=VLEN.

Bill


And an SLEN=VLEN machine could implement the cast extension to run
software that used those at no penalty.

Krste




David Horner
 

In trying to make SEW-level interleave work by augmenting the instruction set (including casting),
I have a few observations.
- arithmetic operators need to function at a given SEW and there is no in-memory form requirement.
- exploitation of the in-memory form within SEW * n elements consists substantially (if not completely)
        of SEW-level bit-wise operations (and/or/xor/move/load and shift) and
        SEW-level masking (not SEW * n).

Postulates:
That an algorithm's use of the in-memory form can be identified and provided only when needed.
That the need for such instructions will be neither statically nor dynamically frequent in common code.
That the gather of SEW-level elements to build a SEW * n result is not prohibitively expensive for the "only when needed" and infrequent parts of the algorithm.



If these are true, then we can provide augmented forms of the bitwise and shift instructions that
         source a SEW-level set of n consecutive elements and,
         if another vector source is needed, either
             another such SEW-level set of n consecutive elements or
             a SEW * n element, and
        store a SEW * n element with the operator applied to each SEW-level element in turn, under the mask at the SEW level.
The total number of SEW elements processed is determined by vl.
Let's say the value in vl is required to be a multiple of n, for now (a scalar sketch of one reading of these semantics follows below).


The two needed data elements are
-    n (the aggregate level for the target); let's call it inmemn.
              2 bits would appear necessary for XLEN=64,
                 with derived values of 1 (standard operation), 2, 4, 8 (allowing byte to double).
                 (However, 3 bits would additionally allow factors of 16 through 128, which might be useful for encryption.)

- a single-bit indicator of whether the second source is at level SEW or SEW * n; let's call it inmem2.


If these were incorporated into the bitwise/shift variants, the opcodes would grow by a minimum of 3 bits.
It would appear the vtype opcode compression method should be leveraged again.

These two "parameters", inmem2 and inmemn, could be included in vtype as a persistent modifier.
However, it is fully conceivable that most of the data massaging can be done at the SEW level, with only the last operation required to place it in a SEW * n destination.
This would be a good justification for allowing a transient form of the vmodinstr prefix (issue #423 - additional instructions to set vtype fields).


Note that neither of these changes the vl of these instructions. Further, the execution of these is expected to be infrequent (one of the postulates).
  Therefore this is a candidate for the alternate vmodtype instruction, rather than further vsetvli immediate bit use (also issue #423).
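
To make the intended semantics concrete, here is a scalar C sketch of one possible reading of the augmented AND with SEW=8, inmemn=4, and inmem2=1 (second source is a SEW*n element); the packing order and the masked-off behaviour are assumptions rather than settled parts of the proposal:

```
#include <stdint.h>
#include <stdio.h>

/* Scalar reference-model sketch of one reading of the augmented bitwise AND:
 * n consecutive SEW=8 source elements are each combined with the matching
 * SEW-sized slice of a SEW*n second source, and the results are packed into
 * a single SEW*n destination element in memory (SLEN=VLEN) order. */

#define N 4   /* inmemn = 4: SEW=8 gives a 32-bit (SEW*n) destination element */

static uint32_t augmented_and_group(const uint8_t src1[N], uint32_t src2,
                                    const uint8_t mask[N] /* SEW-level mask */)
{
    uint32_t result = 0;
    for (int k = 0; k < N; k++) {
        uint8_t slice = (uint8_t)(src2 >> (8 * k));   /* SEW slice of src2  */
        uint8_t val   = mask[k] ? (uint8_t)(src1[k] & slice)
                                : 0;  /* masked-off modeled as zero here; a
                                         real design would likely choose
                                         undisturbed or agnostic fill */
        result |= (uint32_t)val << (8 * k);           /* pack, memory order */
    }
    return result;
}

int main(void)
{
    /* One group only; vl would cover vl/n such groups. */
    uint8_t  src1[N] = {0xF0, 0x0F, 0xFF, 0x00};
    uint32_t src2    = 0x12345678;
    uint8_t  mask[N] = {1, 1, 1, 1};
    printf("result = 0x%08x\n", augmented_and_group(src1, src2, mask));
    return 0;
}
```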


On 2020-04-27 3:56 p.m., Bill Huffman wrote:

On 4/27/20 12:32 PM, krste@... wrote:
EXTERNAL MAIL



On Mon, 27 Apr 2020 18:14:39 +0000, Bill Huffman <huffman@...> said:
| On 4/27/20 7:02 AM, Krste Asanovic wrote:
[..]
|| I created a github issue for this, #434 - text repeated below,
|| Krste
||
|| Should SLEN=VLEN be an extension?
||
[...]

| It might be the case that the machines where SLEN=VLEN would be the same
| machines where it would be attractive to use vectors for such code -
| machines where vectors provided larger registers and some parallelism
| rather than machines where vectors usually complete in one or a few
| cycles and wouldn't deal well with irregular operations. That probably
| increases the value of an extension.

I think having vectors complete in one or a few cycles (shallow
temporal) is orthogonal to choice of SLEN=VLEN.

I think SLEN=VLEN is simply about how wide you want interactions
between arithmetic units. I'm guessing e.g. 128-256b wide datapaths
are probably OK with SLEN=VLEN, whereas 512b and up datapaths are
probably starting to see issues, independent of VLEN in either case.
Sorry, I didn't say what I meant very well. I agree that it's the width
that matters. Machines with short vector registers are likely to be
SLEN=VLEN even if they complete quickly.

In my experience 256b width is shaky and may well want SLEN=128.

In any case, I'm wondering if having cast instructions is better than an
extension. I think it avoids the potential fragmentation.

| On the other hand, adding casting operations would seem to decrease the
| value of an extension (see below).

|| A second issue either way is whether we should add "cast"
|| operations. They are primarily useful for the SLEN<VLEN machines
|| though are difficult to implement efficiently there; the SLEN=VLEN
|| implementation is just a register-register copy. We could choose to
|| add the cast operations as another optional extension, which is my
|| preference at this time.

| Where SLEN<VLEN, cast operations might be implemented as vector register
| gather operations with element index values determined by SLEN, VLEN and
| SEW.

Agree this is a sensible implementation strategy, but pattern is
simpler than general vrgather, and can also implement as a store(src
SEW)+load(dest SEW) across memory crossbar given that you need to
materialize/parse in-memory formats there anyway.
I expect the SEW-level sourcing of the augmented bitwise/shift could readily use this path via a passthrough to the execution units.

OK. That's also quite do-able. Physical layout and control issues
could make for either implementation, I think.

| But where SLEN=VLEN, they would be moves. If then, we add casts,
| would an SLEN=VLEN extension still be valuable?

Casting makes it possible to have a common interface, but given that
SLEN=VLEN will be common choice and it's easy for software to figure
this out, and there is a performance/complexity advantage to not using
the casts when SLEN=VLEN, I can't see mandating everyone use the
casting model as working in practice. Also, I don't believe casting
provides an efficient solution for all the use cases.

Now, a SLEN<VLEN machine could provide a configuration switch to turn
off all but the first SLEN partition (maybe what David was alluding to)
yes it was.
and then support the SLEN=VLEN extension albeit at reduced
performance.
Agreed. That's feasible. It might be set by vsetvl, but unchanged by
vsetvli, and implemented by reduction of VLMAX as you suggest. That
might be a reasonable tradeoff.
I don't expect a purchaser would accept that their 4096-bit vector accelerator is brain-damaged down to 256 bits by what are infrequent but "essential" in-memory-mapped transforms.
The high-end market would look elsewhere than RISC-V with such dumbed-down support.
If we are going to have an industry-bucking, non-standard internal register format, but provide in-memory format support, it had better be proficient.

Maybe there's no cast and no extension. Only a bit that may reduce
performance, but makes SLEN=VLEN.

Bill

And an SLEN=VLEN machine could implement the cast extension to run
software that used those at no penalty.
Because the augmented instructions are essential to the performance of the algorithm, there is even less penalty for SLEN=VLEN implementations.
That is, no extraneous moves.
The bits in vtype become no-ops, and the prefix becomes a nop that a linker could remove.

Krste



David Horner
 



On 2020-04-27 3:56 p.m., Bill Huffman wrote:

On 4/27/20 12:32 PM, krste@... wrote:
EXTERNAL MAIL



On Mon, 27 Apr 2020 18:14:39 +0000, Bill Huffman <huffman@...> said:
| On 4/27/20 7:02 AM, Krste Asanovic wrote:
[..]
|| I created a github issue for this, #434 - text repeated below,
|| Krste
||
|| Should SLEN=VLEN be an extension?
||
[...]

| It might be the case that the machines where SLEN=VLEN would be the same
| machines where it would be attractive to use vectors for such code -
| machines where vectors provided larger registers and some parallelism
| rather than machines where vectors usually complete in one or a few
| cycles and wouldn't deal well with irregular operations.  That probably
| increases the value of an extension.

I think having vectors complete in one or a few cycles (shallow
temporal) is orthogonal to choice of SLEN=VLEN.

I think SLEN=VLEN is simply about how wide you want interactions
between arithmetic units.  I'm guessing e.g. 128-256b wide datapaths
are probably OK with SLEN=VLEN, whereas 512b and up datapaths are
probably starting to see issues, independent of VLEN in either case.
Sorry, I didn't say what I meant very well.  I agree that it's the width 
that matters.  Machines with short vector registers are likely to be 
SLEN=VLEN even if they complete quickly.

In my experience 256b width is shaky and may well want SLEN=128.

In any case, I'm wondering if having cast instructions is better than an 
extension.  I think it avoids the potential fragmentation.

| On the other hand, adding casting operations would seem to decrease the
| value of an extension (see below).

|| A second issue either way is whether we should add "cast"
|| operations. They are primarily useful for the SLEN<VLEN machines
|| though are difficult to implement efficiently there; the SLEN=VLEN
|| implementation is just a register-register copy. We could choose to
|| add the cast operations as another optional extension, which is my
|| preference at this time.

| Where SLEN<VLEN, cast operations might be implemented as vector register
| gather operations with element index values determined by SLEN, VLEN and
| SEW.

Agree this is a sensible implementation strategy, but pattern is
simpler than general vrgather, and can also implement as a store(src
SEW)+load(dest SEW) across memory crossbar given that you need to
materialize/parse in-memory formats there anyway.
OK.  That's also quite do-able.  Physical layout and control issues 
could make for either implementation, I think.

| But where SLEN=VLEN, they would be moves.  If then, we add casts,
| would an SLEN=VLEN extension still be valuable?

Casting makes it possible to have a common interface, but given that
SLEN=VLEN will be common choice and it's easy for software to figure
this out, and there is a performance/complexity advantage to not using
the casts when SLEN=VLEN, I can't see mandating everyone use the
casting model as working in practice.  Also, I don't believe casting
provides an efficient solution for all the use cases.

Now, a SLEN<VLEN machine could provide a configuration switch to turn
off all but the first SLEN partition (maybe what David was alluding to)
and then support the SLEN=VLEN extension albeit at reduced
performance.
Agreed.  That's feasible.  It might be set by vsetvl, but unchanged by 
vsetvli, 
Perhaps you would be willing to comment on #410 Place stabler RVV control fields in bits [30:12] of vtype.
and implemented by reduction of VLMAX as you suggest.  That 
might be a reasonable tradeoff.
Reduction of VLMAX is not sufficient.
Within each SLEN chunk the existing data will already be "scrambled".
It would be possible to load SEW=SLEN data (or load whole register) to prep the data, avoiding scrambling.
But otherwise, the new _source_ data will need to be loaded under the new mode.

And it does not address register groups (of more than 1 physical register).
To both limit to one SLEN group AND reduce register groups to a single physical register is a double whammy.

Maybe there's no cast and no extension.  Only a bit that may reduce 
performance, but makes SLEN=VLEN.

      Bill

And an SLEN=VLEN machine could implement the cast extension to run
software that used those at no penalty.

Krste





Bill Huffman
 


On 4/29/20 6:40 AM, David Horner wrote:
EXTERNAL MAIL



On 2020-04-27 3:56 p.m., Bill Huffman wrote:
On 4/27/20 12:32 PM, krste@... wrote:
EXTERNAL MAIL



On Mon, 27 Apr 2020 18:14:39 +0000, Bill Huffman <huffman@...> said:
| On 4/27/20 7:02 AM, Krste Asanovic wrote:
[..]
|| I created a github issue for this, #434 - text repeated below,
|| Krste
||
|| Should SLEN=VLEN be an extension?
||
[...]

| It might be the case that the machines where SLEN=VLEN would be the same
| machines where it would be attractive to use vectors for such code -
| machines where vectors provided larger registers and some parallelism
| rather than machines where vectors usually complete in one or a few
| cycles and wouldn't deal well with irregular operations.  That probably
| increases the value of an extension.

I think having vectors complete in one or a few cycles (shallow
temporal) is orthogonal to choice of SLEN=VLEN.

I think SLEN=VLEN is simply about how wide you want interactions
between arithmetic units.  I'm guessing e.g. 128-256b wide datapaths
are probably OK with SLEN=VLEN, whereas 512b and up datapaths are
probably starting to see issues, independent of VLEN in either case.
Sorry, I didn't say what I meant very well.  I agree that it's the width 
that matters.  Machines with short vector registers are likely to be 
SLEN=VLEN even if they complete quickly.

In my experience 256b width is shaky and may well want SLEN=128.

In any case, I'm wondering if having cast instructions is better than an 
extension.  I think it avoids the potential fragmentation.

| On the other hand, adding casting operations would seem to decrease the
| value of an extension (see below).

|| A second issue either way is whether we should add "cast"
|| operations. They are primarily useful for the SLEN<VLEN machines
|| though are difficult to implement efficiently there; the SLEN=VLEN
|| implementation is just a register-register copy. We could choose to
|| add the cast operations as another optional extension, which is my
|| preference at this time.

| Where SLEN<VLEN, cast operations might be implemented as vector register
| gather operations with element index values determined by SLEN, VLEN and
| SEW.

Agree this is a sensible implementation strategy, but pattern is
simpler than general vrgather, and can also implement as a store(src
SEW)+load(dest SEW) across memory crossbar given that you need to
materialize/parse in-memory formats there anyway.
OK.  That's also quite do-able.  Physical layout and control issues 
could make for either implementation, I think.

| But where SLEN=VLEN, they would be moves.  If then, we add casts,
| would an SLEN=VLEN extension still be valuable?

Casting makes it possible to have a common interface, but given that
SLEN=VLEN will be common choice and it's easy for software to figure
this out, and there is a performance/complexity advantage to not using
the casts when SLEN=VLEN, I can't see mandating everyone use the
casting model as working in practice.  Also, I don't believe casting
provides an efficient solution for all the use cases.

Now, a SLEN<VLEN machine could provide a configuration switch to turn
off all but the first SLEN partition (maybe what David was alluding to)
and then support the SLEN=VLEN extension albeit at reduced
performance.
Agreed.  That's feasible.  It might be set by vsetvl, but unchanged by 
vsetvli, 
Perhaps you would be willing to comment on #410 Place stabler RVV control fields in bits [30:12] of vtype.
Perhaps it would be better added by someone who's actually advocating for the feature rather than by me who doesn't have the broader purpose.
and implemented by reduction of VLMAX as you suggest.  That 
might be a reasonable tradeoff.
Reduction of VLMAX is not sufficient.
Within each SLEN chunk the existing data will already be "scrambled".
It would be possible to load SEW=SLEN data (or load whole register) to prep the data, avoiding scrambling.
But otherwise, the new _source_ data will need to be loaded under the new mode.

And it does not address register groups (of more than 1 physical register).
To both limit to one SLEN group AND reduce register groups to a single physical register is a double whammy.

I didn't mean just to reduce VLMAX.  I'm assuming reducing VLMAX and operating the controls as if VLMAX were reduced - so the rest of the SIMD width isn't there and SLEN=VLEN.  Those controls include all LMUL and SEW values.

      Bill

Maybe there's no cast and no extension.  Only a bit that may reduce 
performance, but makes SLEN=VLEN.

      Bill

And an SLEN=VLEN machine could implement the cast extension to run
software that used those at no penalty.

Krste