Vector Task Group minutes 2020/5/15


Krste Asanovic
 

Date: 2020/5/15
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~20
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

# MLEN=1 change

The new layout of mask registers with fixed MLEN=1 was discussed. The
group was generally in favor of the change, though there is a proposal
in flight to rearrange bits to align with bytes. This might save some
wiring but could increase bits read/written for the mask in a
microarchitecture.

#434 SLEN=VLEN as optional extension

Most of the time was spent discussing the possible software
fragmentation from having code optimized for SLEN=VLEN versus
SLEN<VLEN, and how to avoid it. The group was keen to prevent such
fragmentation, so it will consider several options:

- providing cast instructions that are mandatory, so at least
SLEN<VLEN code runs correctly on SLEN=VLEN machines.

- consider a different data layout that could allow casting up to ELEN
(<=SLEN); however, these appear to result in an even greater variety
of layouts, or dynamic layouts

- invent a microarchitecture that can appear as SLEN=VLEN but
internally restricts datapath communication to within an SLEN-wide
section of the datapath, or prove this is impossible/expensive


# v0.9

The group agreed to declare the current version of the spec as 0.9,
representing a clear stable step for software and implementors.


Andrew Waterman
 



We have tagged version 0.9 on github, including HTML/PDF artifacts: https://github.com/riscv/riscv-v-spec/releases/tag/0.9









Bill Huffman
 

It seems like the function of a cast instruction is the same as storing to
memory (stride-1) with one SEW and loading back the same number of bytes
with another SEW. Is that a correct understanding?
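As a concrete model of that equivalence, here is a hypothetical Python sketch (names are illustrative, not from the spec) in which a "cast" is literally a unit-stride store at one SEW followed by a reload of the same bytes at another SEW; the byte stream in memory is unchanged, only the element grouping differs:

```python
def store_unit_stride(elements, sew_bytes):
    """Serialize elements (little-endian) into a flat byte buffer."""
    buf = bytearray()
    for e in elements:
        buf += e.to_bytes(sew_bytes, "little")
    return bytes(buf)

def load_unit_stride(buf, sew_bytes):
    """Reload the same bytes as elements of a (possibly different) SEW."""
    assert len(buf) % sew_bytes == 0
    return [int.from_bytes(buf[i:i + sew_bytes], "little")
            for i in range(0, len(buf), sew_bytes)]

def cast(elements, old_sew_bytes, new_sew_bytes):
    """'Cast' = store at the old SEW, load back the bytes at the new SEW."""
    return load_unit_stride(store_unit_stride(elements, old_sew_bytes),
                            new_sew_bytes)

# Example: four 16-bit elements regroup into two 32-bit elements.
assert cast([0x2211, 0x4433, 0x6655, 0x8877], 2, 4) == [0x44332211, 0x88776655]
```

Note this models only the in-memory byte semantics of the round trip, not any particular in-register layout.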

Bill








David Horner
 

I would agree that, by definition, this is a sufficient condition to obtain the instructions Krste was envisioning: instructions that also act as nops on an SLEN=VLEN machine.
That is a sufficient condition to address the byte mismatch of SLEN<VLEN.

However, is it necessary, given that it is a very expensive operation for SLEN<VLEN?

Are there casting instructions that are reasonably low cost on both SLEN= and SLEN< VLEN that create an intermediate state that works for both?

And if there are such operations, do you provide only them (and NOT the heavy-handed "as if written to memory and back")?
Can two such instructions do the full transition from SLEN<VLEN to SLEN=VLEN?
If so, is it sufficiently easy to recognize such a pair and fuse it into a nop on SLEN=VLEN systems?
Can applications alternatively rely on a linkage editor to nop them out?

I have no good solution (yet), as the guts of the range of microarchitectural tricks are not my forte.

But there are others who undoubtedly are mulling over such considerations.

It would not be a lose-win proposition but a limited win-win.

I look forward to Krste's proposals. I have been surprised before!








David Horner
 

By the way, this is similar to the "prefix" proposal that applies the transform to selected sources and/or destinations.

The cost is reduced because the "transform" occurs while the operation is executing:
  - on SLEN=VLEN machines these "prefixes" are nops
  - on SLEN<VLEN microarchitectures the specific combination of transform and operation can be optimized, especially for high-use combinations

The cost is further reduced for some combinations, as an intermediate state can be used for that particular op-and-transform pair, rather than storing a result that has to be usable by any operation.

To the extent that these cast operations can apply in this way, the benefit of the "prefix" over cast may be limited and may not warrant a "new" prefix mechanism like #423 (and its specific application, #456).

https://github.com/riscv/riscv-v-spec/issues/423
https://github.com/riscv/riscv-v-spec/issues/456









Bill Huffman
 

Hi David,

I wasn't at all trying to suggest that we could use the store/load
combination instead of defining cast instructions. I was only trying to
verify the required operation.

You said that it is a sufficient condition. If there's a less
constraining condition that is also sufficient, could you describe it? I'm
hoping either to prove there is no such layout or (possibly) show that
there is one.

Bill












David Horner
 

On 2020-05-16 2:25 p.m., Bill Huffman wrote:
> I wasn't at all trying to suggest that we could use the store/load
> combination instead of defining cast instructions.
Nor was I intending to imply that you believed these casting instructions would affect the memory system.
Although Krste and, I believe, Andrew did suggest the memory side of the register file could be leveraged to provide appropriate rearrangement with respect to fractional loads.

> I was only trying to verify the required operation.
Understood.
So I apologize that I didn't make that clearer in my response.

> You said that it is a sufficient condition. If there's a less
> constraining condition that is also sufficient, could you describe?
To some extent this is only conjecture on my part. A hopeful possibility, especially the intermediate format that occurs in two parts to provide the full transform.

Something that comes close is a requirement that the residual space on a casting instruction be indeterminate:
a prohibition of tail-undisturbed, and a variant of tail-agnostic that does not require a poison-value fill,
with the further constraint that vl cannot be changed for this "transformed" register (that is, while any cast is active).
Such an approach would allow a microarchitecture to tag the register that is "cast" with the transform it will need.
Then when the register is referenced, and only then, will the transform be applied, within that reference only.

This would be a win for a comparison of two registers at a SEW other than the one loaded, as:
  1) both registers could be so tagged, and then
  2) the comparison could occur in parallel across all SLENs, starting with the most significant parts at the loaded SEW sizes and advancing to lower significance only if required.
  3) Of course, the mask register is going to be in ordinal order and possibly closely aligned with the original SEW load structures, even if it isn't as well aligned with the newer SEW.

But if I can think of such things, then how much more the experts!

If these operations are indeed infrequent, then the impact can be deferred as in the above example; the general programming principle is to discard as little as necessary, with the expectation that it might be used later (optimistic vs. pessimistic).

Of course, such a deferral needs to address how to remove stale state, for example at a context switch.
At that time, a standard store will need to do in-memory formatting at the new SEW, so that's doable.
However, a full register store will not store exactly the in-register layout but the new SEW's representation, so deferral has a higher cost then.
But we are now talking about spilling to memory, so performance is expected to be limited anyway.

In any case, there are possibilities.

> I'm hoping either to prove there is not a layout or (possibly) show that
> there is one.
I'm not holding my breath, but I am very hopeful that such a viable compromise can be formulated.
Losing tail-undisturbed, for example, is a reasonable compromise. Does improved performance warrant the increased validation cost of an undefined tail?
Perhaps an undefined tail is not strictly necessary; the poison fill could also be deferred.

But starting to track changing vl lengths is problematic. A trap on vl change while a cast is "active" is going to be annoying to some programmers; worse, SLEN=VLEN implementations will need to enforce it (annoying microarchitecture engineers) if deferred transforms are to be reliable in the ecosystem.

As I mentioned, such a deferral solution is similar to the prefix approach.

We have to wait to see what will be proposed by Krste and co. As I said, I have been surprised before, and pleasantly.

Then it can be contrasted with other approaches, and tweaked, or edge cases addressed.
Until then, like you, I'm hoping.











Bill Huffman
 

I believe it is provably not possible for our vectors to have more than
two of the following properties:

  1. The datapath can be sliced into multiple slices to improve wiring
     such that corresponding elements of different sizes reside in the
     same slice.
  2. Memory accesses containing enough contiguous bytes to fill the
     datapath width corresponding to one vector can spread evenly
     across the slices when loading or storing a vector group of two,
     four, or eight vector registers.
  3. The operation corresponding to storing a register group at one
     element width and loading back the same number of bytes into a
     register group of the same size but with a different element width
     results in exactly the same register position for all bytes.

The SLEN solution we've had for some time allows for #1 and #2. We're
discussing requiring "cast" operations in place of having property #3.

I wonder whether we should look again at giving up property #2 instead.
It would cost additional logic in wide, sliced datapaths to keep up
memory bandwidth. But the damage might be less than requiring casts and
the potential of splitting the ecosystem?
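The trade-off can be probed mechanically. Below is a hypothetical byte-level model (illustrative names, not from the spec) of an SLEN-striped layout in which element i of a register lands in slice i % (VLEN/SLEN); it confirms that property #1 holds for such a layout while property #3 fails:

```python
VLEN, SLEN = 256, 128          # bits
P = VLEN // SLEN               # number of SLEN-wide slices

def byte_home(i, sew_bits):
    """(slice, in-register byte offset) of element i's first byte."""
    sew_b = sew_bits // 8
    slc, slot = i % P, i // P          # striped round-robin across slices
    return slc, slc * (SLEN // 8) + slot * sew_b

def mem_byte_to_reg_byte(b, sew_bits):
    """Register byte holding memory byte b after a unit-stride load."""
    sew_b = sew_bits // 8
    return byte_home(b // sew_b, sew_bits)[1] + b % sew_b

# Property #1 holds: an element's slice depends only on its index, not its SEW.
assert all(len({byte_home(i, sew)[0] for sew in (8, 16, 32, 64)}) == 1
           for i in range(VLEN // 64))

# Property #3 fails: the memory-byte -> register-byte map changes with SEW,
# so storing at one SEW and reloading at another permutes register bytes.
map8  = [mem_byte_to_reg_byte(b, 8)  for b in range(VLEN // 8)]
map32 = [mem_byte_to_reg_byte(b, 32) for b in range(VLEN // 8)]
assert map8 != map32
```

This sketch covers a single register (LMUL=1); the group cases in property #2 would need the memory-distribution rules added on top.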

Bill








David Horner
 

That may be.

On 2020-05-19 7:14 p.m., Bill Huffman wrote:
> I believe it is provably not possible for our vectors to have more than
> two of the following properties:
For me, the definitions contained in 1, 2, and 3 need to be more rigorously stated before I can agree that the constraints/behaviours described are provably inconsistent in aggregate.

> 1. The datapath can be sliced into multiple slices to improve wiring
>    such that corresponding elements of different sizes reside in the
>    same slice.
> 2. Memory accesses containing enough contiguous bytes to fill the
>    datapath width corresponding to one vector can spread evenly
>    across the slices when loading or storing a vector group of two,
>    four, or eight vector registers.
This one is particularly difficult for me to formalize.
When vl is the full group length (for LMUL of 2, 4, or 8), cache lines can be requested in an order such that, when they arrive, the corresponding segments can be filled.
So I'm not sure if the focus here is an efficiency concern?

> 3. The operation corresponding to storing a register group at one
>    element width and loading back the same number of bytes into a
>    register group of the same size but with a different element width
>    results in exactly the same register position for all bytes.
What we can definitely prove is that a specific design has specific characteristics and excludes others.
I agree that the current design has the characteristics you describe.

However, for #3, it appears to me that a facility that clusters elements smaller than a certain size still allows behaviours 1 and 2.
Further, for element lengths up to that cluster size, in-register order matches in-memory order.

> The SLEN solution we've had for some time allows for #1 and #2. We're
> discussing requiring "cast" operations in place of having property #3.
>
> I wonder whether we should look again at giving up property #2 instead.
I also agree with reconsidering #2.
> It would cost additional logic in wide, sliced datapaths to keep up
> memory bandwidth.
Here, I believe, is where you introduce implementation efficiency.
Once implementation design considerations are introduced, the proof becomes much more complex, compounded by objectives and technical tradeoffs, and less a matter of mathematical rigor.

> But the damage might be less than requiring casts and
> the potential of splitting the ecosystem?
I also agree with you that reconsidering #2 can lead to conceptually simpler designs that perhaps will result in less ecosystem fragmentation.
However, anticipating a community's response to even the smallest of changes is crystal-ball material.

There are many variations and approaches still open to us to address in-register and in-memory order agreement, and to address widening approaches (in particular, interleaving or striping with generalized SLEN parameters).

I'm still waiting on the proposed casting details. If that resolves all our concerns, great.

In the interim, I believe it may be a worthwhile exercise to consider equivalences of functionality.

Specifically: vertical striping vs. horizontal interleave for widening ops, and in-register vs. in-memory order for element-width alignment.
I hope that the more of these we identify, the easier it will be to compare them and evaluate trade-offs.

I also think it constructive to consider big-endian vs. little-endian, with concerns about granularity (inherent in big-endian, and obscured in little-endian, where aligned vs. unaligned is still relevant).











Krste Asanovic
 

I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
v4*n           13      12      11      10       3       2       1       0
v4*n+1         17      16      15      14       7       6       5       4
v4*n+2         1B      1A      19      18       B       A       9       8
v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background. We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates. That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
v4*n            7       5       3       1       6       4       2       0
v4*n+1          F       D       B       9       E       C       A       8
v4*n+2         17      15      13      11      16      14      12      10
v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually reverting to an older layout (Figure 7.2 in my
thesis). This has the property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units. For example, in the above 0.8 example, if
AVL is less than 17, only half the datapath is used. This is why I
don't agree with Bill's conclusion. I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting. The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.
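To make the two layouts concrete, here is a rough sketch (my own illustration, not normative text from the spec) of the element-index-to-byte mappings for the example parameters in the tables above:

```python
# Sketch of the two striped layouts above (VLEN=256b, SLEN=128b,
# SEW=32b, LMUL=4, as in the diagrams).  Each function returns
# (register offset within the group, byte offset within that register)
# for group-wide element index i.
VLEN, SLEN, SEW, LMUL = 256, 128, 32, 4
NPART = VLEN // SLEN          # SLEN partitions per register
EPR   = VLEN // SEW           # elements per register
EPP   = SLEN // SEW           # elements per partition per register

def v08_pos(i):
    # v0.8: contiguous elements fill one partition across the whole group
    part, q = divmod(i, LMUL * EPP)
    reg, slot = divmod(q, EPP)
    return reg, part * SLEN // 8 + slot * SEW // 8

def v09_pos(i):
    # v0.9: elements are dealt round-robin across the partitions
    reg, j = divmod(i, EPR)
    return reg, (j % NPART) * SLEN // 8 + (j // NPART) * SEW // 8
```

For example, element 0x13 lands in v4*n at byte offset 28 (bytes 1F-1C) under v0.8, whereas under v0.9 those same bytes hold element 7, matching the tables.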

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting. I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 words of ELEN bits
each. E.g., when ELEN=32, you can pack four contiguous bytes in ELEN, but
then require SLEN to have space for four ELEN words to avoid either
wires crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (the v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8=128b
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with a single-lane design, I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide the difference
between SLEN<VLEN and SLEN=VLEN machines. I think this is viable in a
complex microarchitecture. Basically, the microarchitecture tags the vector
register with the EEW used to write it. Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle. Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)
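A minimal toy model of the tagging idea (names invented for illustration; this is a thought experiment, not a real microarchitecture):

```python
# Toy model of EEW-tagging: each vector register remembers the EEW it
# was last written with; accesses at a different EEW insert cast uops.
class TaggedVRegFile:
    def __init__(self, nregs=32):
        self.eew = [None] * nregs   # EEW of last write; None = untagged
        self.uops = []              # cast micro-ops issued, for inspection

    def read(self, vd, eew):
        if self.eew[vd] not in (None, eew):
            # mismatch: insert a micro-op to cast bytes on the fly
            self.uops.append(("cast", vd, self.eew[vd], eew))

    def write(self, vd, eew, undisturbed=False):
        if undisturbed and self.eew[vd] not in (None, eew):
            # undisturbed elements keep the old layout, so they too
            # need an on-the-fly cast (read-modify-write if not renamed)
            self.uops.append(("cast", vd, self.eew[vd], eew))
        self.eew[vd] = eew
```

Repeatedly reading with a mismatched EEW keeps generating cast uops, which is exactly the pathology noted above.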

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting. The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing. This might be what David is referring to as
clustering.
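As a toy illustration of the idea (not the EDIV encoding itself): load the widest unit as a single element, then subdivide it, and the sub-elements come out in memory order:

```python
import struct

mem = bytes(range(8))              # 8 consecutive memory bytes: 0..7
words = struct.unpack("<2I", mem)  # load as two 32-bit elements
# subdivide each 32-bit element into four 8-bit sub-elements
subs = [[(w >> (8 * k)) & 0xFF for k in range(4)] for w in words]
# sub-elements within each element match memory byte order:
# subs == [[0, 1, 2, 3], [4, 5, 6, 7]]
```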

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up
matching memory format in registers in exchange for avoiding cross-datapath
interactions in mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste


On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@gmail.com> said:
| that may be


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me, the definitions contained in 1, 2, and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours described
| are provably inconsistent in aggregate.
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul (for lmul of 2, 4, or 8), cache lines can be
| requested in an order such that, as they arrive, the corresponding segments
| can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, it appears to me that a facility that clusters elements
| smaller than a certain size still allows behaviours 1 and 2.
| Further, for element lengths up to that cluster size, in-register order
| matches the in-memory order.
||
|| The SLEN solution we've had for some time allows for #1 and #2. We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree reconsidering #2
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here, I believe, is where you introduce implementation efficiency.
| Once implementation design considerations are introduced, the proof
| becomes much more complex,
| compounded by objectives and technical trade-offs, and becomes less a
| matter of mathematical rigor.

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that perhaps will result in less ecosystem fragmentation.
| However, anticipating a community's response to even the smallest of
| changes is crystal-ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim I believe it may be worthwhile exercises to consider
| equivalences of functionality.

| Specifically, vertical stripping vs horizontal interleave for widening
| ops, in-register vs in-memory order for element width alignment.
| I hope that the more we identify the easier it will be to compare them
| and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian
| with concerns about granularity (inherent in big endian and obscured
| with little-endian (aligned vs unaligned still relevant))



||
|| Bill
||


David Horner
 



On Tue, May 26, 2020, 04:38 , <krste@...> wrote:


----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting. 
Correct 
I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words. 
I don't understand the reason for this constraint.
E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).
Still not got it.

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b
CLSTR is not a count but a size.
When CLSTR is 32, this last row is
                7       5       3       1       6       4       2       0 SEW=ELEN=32b
if I understood your diagram correctly.

See #461. It is effectively what SLEN was under v0.8. But potentially configurable.

I'm doing this from my cell phone; I'll work it up better when I'm at my laptop.

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with single lane design I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)

The problem is determining when you would need to insert these micro-ops. That's what I had proposed as the flagging solution in response to Bill. Again, more when I'm at my laptop.

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
clustering.
It is an approach that incorporates the consecutive-byte clustering concept but is more limiting.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste




David Horner
 

For those not on GitHub, I posted this to #461:

I gather that what was missing from this were examples.
I prefer to consider CLSTR a dynamic parameter; some implementations will use a range of values.

However, for the sake of examples we can consider the cases where CLSTR=32.
So after a load:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                                  7 6 5 4                         3 2 1 0 SEW=8b

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b

VLEN=256b, SLEN=64, vl=13, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                        C         B A 9 8         7 6 5 4         3 2 1 0 SEW=8b

  

                7       3       6       2       5       1       4       0 SEW=32b

                        B               A               9       C       8  @ reg+1


By inspection unary and binary single SEW operations do not affect order.
However, for a widening operation, EEW=16 and 64 respectively which will yield:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b


                        3               1               2               0 SEW=64b

                        7               5               6               4  @ reg+1


Narrowing works in reverse.
When SLEN=VLEN, CLSTR is irrelevant and effectively infinite: there is no other SLEN group to advance to, so the current SLEN chunk has to be used (in the round-robin fashion).
Thank you for the template to use.
I don’t think SLEN = 1/4 VLEN has to be diagrammed.
And of course, store also works in reverse of load.
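The layout rule behind the tables above can be sketched as follows (this is my interpretation of the #461 CLSTR proposal, not normative: clusters of CLSTR bits stay byte-contiguous and are dealt round-robin across the SLEN partitions):

```python
# Byte offset of element i within the first register of the group,
# under the CLSTR layout as I understand it (all widths in bits).
def clstr_byte(i, sew, vlen=256, slen=128, clstr=32):
    nparts = vlen // slen
    per_cluster = clstr // sew            # elements per cluster
    cluster, sub = divmod(i, per_cluster)
    part, row = cluster % nparts, cluster // nparts
    return part * slen // 8 + row * clstr // 8 + sub * sew // 8
```

This reproduces the tables: with SLEN=128 and SEW=8b, element 4 lands at byte 0x10; with SEW=32b, element 1 lands at byte 0x10; and with SLEN=64 and SEW=8b, element C lands at byte 0x18.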



Bill Huffman
 

I appreciate your agreement with my analysis, Krste. However, I wasn't
drawing a conclusion. I lean toward the conclusion that we keep the
"new v0.9 scheme" below and live with casts, but I wasn't fully sure
and wanted to see where the discussion might go. I suspect that each of
the extra gates for memory access and the slower speed on short vectors
is sufficient by itself to argue pretty strongly against the "v0.8
scheme" - which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where
bytes are contiguous within ELEN." I worked out what I think are the
outlines of a proof that it's not possible, but I thought I'd suggest
what I did at a high level first and only try to make the proof more
rigorous if necessary.

Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:
EXTERNAL MAIL



I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

Byte 1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
v4*n 13 12 11 10 3 2 1 0
v4*n+1 17 16 15 14 7 6 5 4
v4*n+2 1B 1A 19 18 B A 9 8
v4*n+3 1F 1E 1D 1C F E D C

Some background. We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates. That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

Byte 1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
v4*n 7 5 3 1 6 4 2 0
v4*n+1 F D B 9 E C A 8
v4*n+2 17 15 13 11 16 14 12 10
v4*n+3 1F 1D 1B 19 1E 1C 1A 18

is actually reverting back to an older layout (Figure 7.2 in my
thesis). This has property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units. For example, in the above 0.8 example, if
AVL is less than 17, only half the datapath is used. This is why I
don't agree with Bill's conclusion. I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting. The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting. I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words. E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte 1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
7 6 5 4 3 2 1 0 SEW=8b
7 6 5 4 3 2 1 0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with single lane design I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines. I think this is viable in a
complex microarch. Basically, the microarchitecture tags the vector
register with the EEW used to write it. Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle. Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting. The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing. This might be what David is referring to as
clustering.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste


On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@gmail.com> said:
| that may be


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me the definitions contained in 1,2 and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours  described
| are provably inconsistent on aggregate..
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul, (for lmul 2,4 or 8)  then cache lines can be
| requested in an order such that when they arrive corresponding segments
| can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, I appears to me that a facility that clusters elements
| of smaller than a certain size still allows behaviours 1 and 2.
| Further,for element lengths up to that cluster size in-register order
| matches the in-memory order.
||
|| The SLEN solution we've had for some time allows for #1 and #2. We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree reconsidering #2
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here I believe is where you introduce efficacy in implementation.
| Once implementation design considerations are introduced the proof
| becomes much more complex;
| Compounded by objectives and technical tradeoffs and less a mathematics
| rigor issue .

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that perhaps will result in less eco fragmentation.
| However, anticipating a communities response to even the smallest of
| changes is crystal ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim I believe it may be a worthwhile exercise to consider
| equivalences of functionality.

| Specifically, vertical striping vs horizontal interleave for widening
| ops, and in-register vs in-memory order for element width alignment.
| I hope that the more of these we identify, the easier it will be to
| compare them and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian
| with concerns about granularity (inherent in big-endian, and obscured
| with little-endian, where aligned vs unaligned is still relevant).



||
|| Bill
||




Mr Grigorios Magklis
 

Hi all,

I was wondering if the group has considered before (and rejected) the following
register layout proposal.

In this scheme, there is no SLEN parameter, instead the layout is solely defined
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at the n-th byte of the j-th register (both values starting at 0), as follows:

  n = (i div LMUL)*SEW/8
  j = (i mod LMUL) when LMUL > 1, else j = 0

where 'div' is integer division, e.g., 7 div 4 = 1.
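If it helps, the mapping can be transcribed directly into a few lines of code. This is my own sketch (`element_position` is a made-up name): it implements the two formulas above for LMUL >= 1, plus the even spacing used for fractional LMUL.

```python
def element_position(i, sew, lmul):
    """Return (register j within the group, starting byte n) of element i."""
    if lmul >= 1:
        j = i % lmul if lmul > 1 else 0        # j = (i mod LMUL), else 0
        n = (i // lmul) * (sew // 8)           # n = (i div LMUL)*SEW/8
    else:
        j = 0                                  # fractional LMUL: one register
        n = i * int(sew / lmul) // 8           # elements spaced SEW/LMUL bits
    return j, n
```

For example, with SEW=8b and LMUL=2, element 3 lands in register 1 of the group at byte 1, which matches the SEW=8b, LMUL=2 example.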

As shown in the examples below, with this scheme, when LMUL=1 the register
layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1
elements are allocated "vertically" across the register group (similar to
SLEN=SEW), and when LMUL<1 elements are evenly spaced-out across the register
(similar to SLEN=SEW/LMUL):

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[4*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[4*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[4*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[4*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3

VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]    1C  18  14  10   C   8   4   0
v[4*n+1]  1D  19  15  11   D   9   5   1
v[4*n+2]  1E  1A  16  12   E   A   6   2
v[4*n+3]  1F  1B  17  13   F   B   7   3

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]         C       8       4       0
v[4*n+1]       D       9       5       1
v[4*n+2]       E       A       6       2
v[4*n+3]       F       B       7       3


VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       1       -       0


VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   -   -   1   -   -   -   0

VLEN=128b, SEW=32b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       -       -       0

The benefit of this scheme is that software always knows the layout of the elements
in the register just by programming SEW and LMUL, as the layout is now the same for
all implementations, and thus it can optimize code accordingly if it so wishes.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.

SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]         C       8       4       0
v[4*n+1]       D       9       5       1
v[4*n+2]       E       A       6       2
v[4*n+3]       F       B       7       3


SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[4*n]      1C    18    14    10     C     8     4     0
v[4*n+1]    1D    19    15    11     D     9     5     1
v[4*n+2]    1E    1A    16    12     E     A     6     2
v[4*n+3]    1F    1B    17    13     F     B     7     3
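The in-lane claim can also be checked mechanically. Here is my own sketch (all names mine, the proposed layout hard-coded): with SEW/LMUL held constant, the byte columns of element i at (SEW, LMUL) always fall inside the byte columns of the same element at (2*SEW, 2*LMUL), ignoring which register of the group holds it.

```python
from fractions import Fraction

def start_byte(i, sew, lmul):
    # Starting byte column of element i within its register, per the
    # proposed layout; the register index j is deliberately ignored.
    lmul = Fraction(lmul)
    if lmul > 1:
        return int((i // lmul) * sew // 8)     # n = (i div LMUL)*SEW/8
    return int(i * (sew / lmul) // 8)          # spaced SEW/LMUL bits apart

def widening_stays_in_lane(sew, lmul, vlen=128):
    # With SEW/LMUL constant, element i at (SEW, LMUL) must occupy byte
    # columns inside those of element i at (2*SEW, 2*LMUL).
    lmul = Fraction(lmul)
    for i in range(int(vlen * lmul / sew)):
        narrow = start_byte(i, sew, lmul)
        wide = start_byte(i, 2 * sew, 2 * lmul)
        if not (wide <= narrow and narrow + sew // 8 <= wide + 2 * sew // 8):
            return False
    return True
```

Running this over the VLEN=128b configurations shown above returns True for each SEW/LMUL ratio.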

Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
for software to eliminate any typecast instructions). When LMUL>1 or LMUL<1,
then loads and stores need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.

So, I think this scheme has the same benefits and similar implementation costs
to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that
SLEN=VLEN does not need to space-out elements when load/storing with LMUL<1),
but also similar implementation costs to the v0.9 SLEN<VLEN scheme for machines
with long vectors (which is the whole point of avoiding SLEN=VLEN), but with
the benefit that the layout is architected and we do not introduce
fragmentation to the ecosystem.

Thanks,
Grigorios Magklis

On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:

I appreciate your agreement with my analysis, Krste.  However, I wasn't 
drawing a conclusion.  I lean toward the conclusion that we keep the 
"new v0.9 scheme" below and live with casts.  But I wasn't fully sure 
and wanted to see where the discussion might go.  I suspect each of the 
extra gates for memory access and the slower speed of short vectors is 
sufficient by itself to argue pretty strongly against the "v0.8 scheme" 
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where 
bytes are contiguous within ELEN."  I worked out what I think are the 
outlines of a proof that it's not possible, but I thought I'd suggest 
what I did at a high level first and only try to make the proof more 
rigorous if necessary.

      Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:
EXTERNAL MAIL



I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

  Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n           13      12      11      10       3       2       1       0
   v4*n+1         17      16      15      14       7       6       5       4
   v4*n+2         1B      1A      19      18       B       A       9       8
   v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background.  We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates.  That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n            7       5       3       1       6       4       2       0
   v4*n+1          F       D       B       9       E       C       A       8
   v4*n+2         17      15      13      11      16      14      12      10
   v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually reverting back to an older layout (Figure 7.2 in my
thesis).  This has the property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units.  For example, in the above 0.8 example, if
AVL is less than 17, only half the datapath is used.  This is why I
don't agree with Bill's conclusion.  I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting.  The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.
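To make that concrete, here is my own sketch (names mine, not normative) of the v0.9 placement from the example above, and the cast as a pure byte permutation back to memory order:

```python
def slen_byte(e, sew, vlen=256, slen=128):
    # Byte offset of element e within its register under v0.9 SLEN striping:
    # elements rotate across SLEN partitions, filling slots within each.
    parts = vlen // slen
    return (e % parts) * (slen // 8) + (e // parts) * (sew // 8)

def cast_to_memory_order(reg, sew, vlen=256, slen=128):
    # The cast: permute register bytes so element e lands at byte e*SEW/8.
    out = bytearray(vlen // 8)
    for e in range(vlen // sew):
        src, dst, w = slen_byte(e, sew, vlen, slen), e * (sew // 8), sew // 8
        out[dst:dst + w] = reg[src:src + w]
    return bytes(out)
```

With VLEN=256b, SLEN=128b, SEW=32b this reproduces the table above (element 1 at byte 0x10, element 7 at byte 0x1C), and the permutation only ever moves bytes within one element group.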

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting.  I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words.  E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN, but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8=128b
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with single lane design I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)
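As a toy model of that idea (all names are mine and nothing below is from the spec): tag each vector register with the EEW that last wrote it, and count a cast micro-op whenever it is read with a different EEW.

```python
class VRegFile:
    def __init__(self, nregs=32):
        self.eew_tag = [None] * nregs   # EEW of the last write, per register
        self.cast_uops = 0              # on-the-fly casts generated so far

    def write(self, vreg, eew):
        self.eew_tag[vreg] = eew

    def read(self, vreg, eew):
        if self.eew_tag[vreg] not in (None, eew):
            self.cast_uops += 1         # insert a byte-rearranging cast uop
            self.eew_tag[vreg] = eew    # model an eager rewrite, so repeated
                                        # mismatched reads don't keep paying
```

Whether to rewrite eagerly (as modeled here) or only after repeated mismatches is exactly the policy question raised above.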

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
clustering.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste



Guy Lemieux
 

I support this scheme, but I would further add a restriction on loads/stores to support only LMUL=1 (no register groups). Instead, any data stored in a register group with LMUL!=1 must first be "cast" into registers with LMUL=1. To do this, special cast instructions would be required; likely this cast can be done in-place (same source and dest registers).

The key advantage of this new restriction is to remove all data shuffling from the interface between external memory and the register file — just transfer bytes in memory order. This vastly simplifies “basic” implementations by keeping data shuffling exclusively on the ALU side of the register file.
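For concreteness, here is my own toy model of such a cast (the instruction itself is hypothetical) for VLEN=128b (16 bytes per register), SEW=8b, LMUL=2 under the layout Grigorios proposes, where element i lives at byte i//2 of register i%2:

```python
def cast_group_to_memory_order(regs, vlenb=16):
    # Gather the interleaved LMUL=2 group into memory (element) order,
    # so a subsequent store is a plain byte copy.
    flat = bytearray(2 * vlenb)
    for i in range(2 * vlenb):
        flat[i] = regs[i % 2][i // 2]
    return [flat[:vlenb], flat[vlenb:]]
```

After the cast, register 0 holds elements 0..15 and register 1 holds elements 16..31, i.e., exactly the bytes a unit-stride store would emit.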

Guy



On Wed, May 27, 2020 at 4:59 AM Mr Grigorios Magklis <grigorios.magklis@...> wrote:
Hi all,

I was wondering if the group has considered before (and rejected) the following
register layout proposal.

In this scheme, there is no SLEN parameter, instead the layout is solely defined
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at the n-th byte of the j-th register (both values starting at 0), as follows:

  n = (i div LMUL)*SEW/8
  j = (i mod LMUL) when LMUL > 1, else j = 0

where 'div' is integer division, e.g., 7 div 4 = 1.

As shown in the examples below, with this scheme, when LMUL=1 the register
layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1
elements are allocated "vertically" across the register group (similar to
SLEN=SEW), and when LMUL<1 elements are evenly spaced-out across the register
(similar to SLEN=SEW/LMUL):

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[2*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[2*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[2*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3

VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    1C  18  14  10   C   8   4   0
v[2*n+1]  1D  19  15  11   D   9   5   1
v[2*n+2]  1E  1A  16  12   E   A   6   2
v[2*n+3]  1F  1B  17  13   F   B   7   3

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       1       -       0


VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   -   -   1   -   -   -   0

VLEN=128b, SEW=32b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       -       -       0

The benefit of this scheme is that software always knows the layout of the elements
in the register just by programming SEW and LMUL, as this is now the same for
all implementations, and thus can optimize code accordingly if it so wishes.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.

SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]      1C    18    14    10     C     8     4     0
v[2*n+1]    1D    19    15    11     D     9     5     1
v[2*n+2]    1E    1A    16    12     E     A     6     2
v[2*n+3]    1F    1B    17    13     F     B     7     3

Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
for software to eliminate any typecast instructions). When LMUL>1 or LMUL<1,
then loads and stores need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.

So, I think this scheme has the same benefits and similar implementation costs
to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that
SLEN=VLEN does not need to space-out elements when load/storing with LMUL<1),
but also similar implementation costs to the v0.9 SLEN<VLEN scheme for machines
with long vectors (which is the whole point of avoiding SLEN=VLEN), but with
the benefit that the layout is architected and we do not introduce
fragmentation to the ecosystem.

Thanks,
Grigorios Magklis

On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:

I appreciate your agreement with my analysis, Krste.  However, I wasn't 
drawing a conclusion.  I lean toward the conclusion that we keep the 
"new v0.9 scheme" below and live with casts.  But I wasn't fully sure 
and wanted to see where the discussion might go.  I suspect each of the 
extra gates for memory access and the slower speed of short vectors is 
sufficient by itself to argue pretty strongly against the "v0.8 scheme" 
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where 
bytes are contiguous within ELEN."  I worked out what I think are the 
outlines of a proof that it's not possible, but I thought I'd suggest 
what I did at a high level first and only try to make the proof more 
rigorous if necessary.

      Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:
EXTERNAL MAIL



I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

  Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n           13      12      11      10       3       2       1       0
   v4*n+1         17      16      15      14       7       6       5       4
   v4*n+2         1B      1A      19      18       B       A       9       8
   v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background.  We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates.  That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n            7       5       3       1       6       4       2       0
   v4*n+1          F       D       B       9       E       C       A       8
   v4*n+2         17      15      13      11      16      14      12      10
   v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually reverting back to an older layout (Figure 7.2 in my
thesis).  This has property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units.  For example, in the above 0.8 example, if
AVL is less than 17, only half the datapath is used.  This is why I
don't agree with Bill's conclusion.  I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting.  The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting.  I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words.  E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with a single-lane design, I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide the difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)
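A toy sketch of the tagging idea, with invented names throughout (this models only which micro-ops get inserted, not the actual byte rearrangement):

```python
class VRegFile:
    """Toy model: each vector register is tagged with the EEW that last
    wrote it; reading it at a different EEW inserts a cast micro-op."""
    def __init__(self, nregs=32):
        self.eew_tag = [None] * nregs   # EEW that last wrote each register
        self.uops = []                  # trace of issued micro-ops

    def write(self, vd, eew):
        self.eew_tag[vd] = eew
        self.uops.append(("write", vd, eew))

    def read(self, vs, eew):
        if self.eew_tag[vs] not in (None, eew):
            # On-the-fly cast: rearrange bytes from the stored EEW to
            # the requested EEW before the consuming op sees them.
            self.uops.append(("cast", vs, self.eew_tag[vs], eew))
        self.uops.append(("read", vs, eew))

rf = VRegFile()
rf.write(1, 32)
rf.read(1, 8)      # mismatched EEW: a cast uop is generated
rf.read(1, 8)      # still mismatched: another cast uop on every read
assert ("cast", 1, 32, 8) in rf.uops
assert sum(u[0] == "cast" for u in rf.uops) == 2
```

Note how repeated mismatched reads keep generating cast uops, which is exactly the pressure to eventually rewrite the register at the new EEW.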

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
clustering.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste


On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@...> said:

| that may be


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me the definitions contained in 1, 2 and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours described
| are provably inconsistent in aggregate.
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul (for lmul 2, 4, or 8), then cache lines can be
| requested in an order such that when they arrive corresponding segments
| can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, it appears to me that a facility that clusters elements
| smaller than a certain size still allows behaviours 1 and 2.
| Further, for element lengths up to that cluster size, in-register order
| matches the in-memory order.
||
|| The SLEN solution we've had for some time allows for #1 and #2.  We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree with reconsidering #2
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here I believe is where you introduce efficiency of implementation.
| Once implementation design considerations are introduced, the proof
| becomes much more complex,
| compounded by objectives and technical tradeoffs, and less a mathematical
| rigor issue.

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that perhaps will result in less eco fragmentation.
| However, anticipating a communities response to even the smallest of
| changes is crystal ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim I believe it may be worthwhile exercises to consider
| equivalences of functionality.

| Specifically, vertical striping vs horizontal interleave for widening
| ops, and in-register vs in-memory order for element width alignment.
| I hope that the more we identify the easier it will be to compare them
| and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian
| with concerns about granularity (inherent in big-endian and obscured
| with little-endian (aligned vs unaligned still relevant))



||
|| Bill
||


|






Guy Lemieux
 

As a follow-up, the main goal of LMUL>1 is to get better storage efficiency out of the register file, allowing for slightly higher compute unit utilization.

The memory system should not require LMUL>1 to get better bandwidth utilization. An advanced memory system can fuse back-to-back loads (or stores) to get improved bandwidth. Some memory systems may break up vector memory transfers into fixed-size quanta (eg, cache lines) anyways.

Restricting loads/stores to LMUL=1 therefore primarily impacts instruction issue bandwidth and executable size. These shouldn’t be huge drawbacks.

Guy



On Wed, May 27, 2020 at 6:56 AM Guy Lemieux via lists.riscv.org <glemieux=vectorblox.com@...> wrote:
I support this scheme, but I would further add a restriction on loads/stores to only support LMUL=1 (no register groups). Instead, any data stored in a register group with LMUL!=1 must first be “cast” into registers with LMUL=1. To do this, special cast instructions would be required; likely this cast can be done in-place (same source and dest registers).

The key advantage of this new restriction is to remove all data shuffling from the interface between external memory and the register file — just transfer bytes in memory order. This vastly simplifies “basic” implementations by keeping data shuffling exclusively on the ALU side of the register file.
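As a sketch of the point above (`load_lmul1` is a hypothetical helper, not a spec instruction): when in-register order equals memory order, an LMUL=1 unit-stride load reduces to a plain byte copy, with all shuffling kept on the ALU side of the register file.

```python
def load_lmul1(vreg, memory, base, vlen_bytes):
    # Under this restriction, the in-register byte order matches the
    # in-memory byte order, so a unit-stride load is just a byte copy:
    # no shuffling at the memory interface.
    vreg[:vlen_bytes] = memory[base:base + vlen_bytes]

mem = bytearray(range(32))   # toy memory image
v0 = bytearray(16)           # one VLEN=128b register
load_lmul1(v0, mem, 8, 16)
assert v0 == bytearray(range(8, 24))
```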

Guy



On Wed, May 27, 2020 at 4:59 AM Mr Grigorios Magklis <grigorios.magklis@...> wrote:
Hi all,

I was wondering if the group has considered before (and rejected) the following
register layout proposal.

In this scheme, there is no SLEN parameter, instead the layout is solely defined
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at the n-th byte of the j-th register (both values starting at 0), as follows:

  n = (i div LMUL)*SEW/8
  j = (i mod LMUL) when LMUL > 1, else j = 0

where 'div' is integer division, e.g., 7 div 4 = 1.
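The mapping can be cross-checked directly against the tables that follow (a sketch; `element_position` is a hypothetical helper name, and `Fraction` is used so fractional LMUL works with the same 'div'):

```python
from fractions import Fraction

def element_position(i, sew, lmul):
    """Return (register j within the group, byte offset n) of element i
    under the proposed layout: n = (i div LMUL)*SEW/8,
    j = i mod LMUL when LMUL > 1, else j = 0."""
    lmul = Fraction(lmul)
    n = int(Fraction(i) / lmul) * sew // 8
    j = i % int(lmul) if lmul > 1 else 0
    return j, n

# Matches the VLEN=128b, SEW=8b, LMUL=2 table: element 0x1E sits at
# byte F of the first register, element 3 at byte 1 of the second.
assert element_position(0x1E, 8, 2) == (0, 0xF)
assert element_position(3, 8, 2) == (1, 1)
# LMUL=1/2 spaces elements out: element 3 lands at byte 6.
assert element_position(3, 8, Fraction(1, 2)) == (0, 6)
```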

As shown in the examples below, with this scheme, when LMUL=1 the register
layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1
elements are allocated "vertically" across the register group (similar to
SLEN=SEW), and when LMUL<1 elements are evenly spaced-out across the register
(similar to SLEN=SEW/LMUL):

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[2*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[2*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[2*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3

VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    1C  18  14  10   C   8   4   0
v[2*n+1]  1D  19  15  11   D   9   5   1
v[2*n+2]  1E  1A  16  12   E   A   6   2
v[2*n+3]  1F  1B  17  13   F   B   7   3

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       1       -       0


VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   -   -   1   -   -   -   0

VLEN=128b, SEW=32b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       -       -       0

The benefit of this scheme is that software always knows the layout of the elements
in the register just by programming SEW and LMUL, as this is now the same for
all implementations, and thus can optimize code accordingly if it so wishes.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.

SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]      1C    18    14    10     C     8     4     0
v[2*n+1]    1D    19    15    11     D     9     5     1
v[2*n+2]    1E    1A    16    12     E     A     6     2
v[2*n+3]    1F    1B    17    13     F     B     7     3

Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
for software to eliminate any typecast instructions). When LMUL>1 or LMUL<1,
then loads and stores need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.

So, I think this scheme has the same benefits and similar implementation costs
to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that
SLEN=VLEN does not need to space-out elements when load/storing with LMUL<1),
but also similar implementation costs to the v0.9 SLEN<VLEN scheme for machines
with long vectors (which is the whole point of avoiding SLEN=VLEN), but with
the benefit that the layout is architected and we do not introduce
fragmentation to the ecosystem.

Thanks,
Grigorios Magklis

On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:

I appreciate your agreement with my analysis, Krste.  However, I wasn't 
drawing a conclusion.  I lean toward the conclusion that we keep the 
"new v0.9 scheme" below and live with casts.  But I wasn't fully sure 
and wanted to see where the discussion might go.  I suspect each of the 
extra gates for memory access and the slower speed of short vectors is 
sufficient by itself to argue pretty strongly against the "v0.8 scheme" 
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where 
bytes are contiguous within ELEN."  I worked out what I think are the 
outlines of a proof that it's not possible, but I thought I'd suggest 
what I did at a high level first and only try to make the proof more 
rigorous if necessary.

      Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:

I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

  Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n           13      12      11      10       3       2       1       0
   v4*n+1         17      16      15      14       7       6       5       4
   v4*n+2         1B      1A      19      18       B       A       9       8
   v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background.  We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates.  That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n            7       5       3       1       6       4       2       0
   v4*n+1          F       D       B       9       E       C       A       8
   v4*n+2         17      15      13      11      16      14      12      10
   v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually reverting back to an older layout (Figure 7.2 in my
thesis).  This has property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.


David Horner
 

for those not on Github I posted this to #461:

CLSTR can be considered a progressive SLEN=VLEN switch.

Rather than the all-or-nothing in-memory compatibility that the SLEN=VLEN switch provides,
clstr provides either a fixed or a variable degradation for widening operations on SEW<CLSTR.

With a fixed clstr, all operations run at peak performance except mixed-SEW widening/narrowing, and then only those with SEW<CLSTR.

In-memory and in-register formats align when SEW<=CLSTR.
Software is fully aware of the mapping, and already accommodates this behaviour for many existing architectures. (Analogous to big-endian vs little-endian in many aspects, although with big-endian all the bytes are present at each SEW level.)

The clstr parameter is potentially writable, and for the higher end machines it appears very reasonable that they would provide at least two settings for CLSTR, byte and XLEN.
This would provide in-memory alignment at XLEN for code that is not sure of its dependence on it, and an optimization for widening/narrowing at SEW<XLEN for code that is sure it does not depend on in-memory format for that section of code.

Because clstr is potentially writable, software can avoid performance penalties by other means as well, and leverage other potential structural benefits, turning a liability into a feature.

I have a pending proposal for exactly that idea: “It's not a bug, it's a feature.”
It enables clstr for SLEN=VLEN implementations also, and allows addressing of even/odd groupings for SLEN<VLEN, too.



On 2020-05-26 5:03 p.m., David Horner via lists.riscv.org wrote:
for those not on Github I posted this to #461:

I gather what was missing from this were examples.
I prefer to consider clstr as a dynamic parameter, that some implementations will use a range of values.

However, for the sake of examples we can consider the cases where CLSTR=32.
So after a load:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                                  7 6 5 4                         3 2 1 0 SEW=8b

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b

VLEN=256b, SLEN=64, vl=13, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                        C         B A 9 8         7 6 5 4         3 2 1 0 SEW=8b

  

                7       3       6       2       5       1       4       0 SEW=32b

                        B               A               9       C       8  @ reg+1


By inspection, unary and binary single-SEW operations do not affect order.
However, for a widening operation, EEW=16 and 64 respectively, which will yield:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b


                        3               1               2               0 SEW=64b

                        7               5               6               4  @ reg+1


Narrowing work in reverse.
When SLEN=VLEN, clstr is irrelevant and effectively infinite: there is no other SLEN group in which to advance, so the current SLEN chunk has to be used (in the round-robin fashion).
Thank you for the template to use.
I don’t think SLEN = 1/4 VLEN has to be diagrammed.
And of course, store also works in reverse of load.
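Reading the diagrams above as 'bytes contiguous within a CLSTR-bit chunk, chunks dealt round-robin across the VLEN/SLEN partitions', the byte position of an element can be sketched as follows (`clstr_byte` is a hypothetical helper inferred from the examples, not from any proposal text):

```python
def clstr_byte(e, sew, vlen=256, slen=128, clstr=32):
    """Byte offset of element e (sew bits wide) in the register group,
    assuming bytes are contiguous within CLSTR-bit chunks and chunks
    are dealt round-robin across the VLEN/SLEN partitions."""
    parts = vlen // slen
    chunk = (e * sew) // clstr        # which CLSTR-sized chunk holds e
    part = chunk % parts              # partition, round-robin per chunk
    slot = chunk // parts             # chunk slot within that partition
    return part * slen // 8 + slot * clstr // 8 + (e * sew // 8) % (clstr // 8)

# Matches the VLEN=256b, SLEN=128, CLSTR=32 diagrams above:
assert clstr_byte(4, 8) == 0x10    # SEW=8b: element 4 starts the upper half
assert clstr_byte(5, 16) == 0x6    # SEW=16b: element 5 at bytes 7:6
assert clstr_byte(7, 32) == 0x1C   # SEW=32b: element 7 at bytes 1F:1C
# And the SLEN=64 example: element C lands at byte 0x18.
assert clstr_byte(0xC, 8, slen=64) == 0x18
```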
On 2020-05-26 11:17 a.m., David Horner via lists.riscv.org wrote:

On Tue, May 26, 2020, 04:38 , <krste@...> wrote:

.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting. 
Correct 
I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words. 
I don't understand the reason for this constraint.
E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN,but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).
Still not got it.

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b
clstr is not a count but a size.
When CLSTR is 32 this last row is
                7       5       3       1       6       4       2       0 SEW=ELEN=32b
If I understood your diagram correctly.

See #461. It is effectively what SLEN was under v0.8. But potentially configurable.

I'm doing this from my cell phone. I'll work it up better when I'm at my laptop.



David Horner
 

This is v0.8 with SLEN=8.



On Wed, May 27, 2020, 07:59 Grigorios Magklis, <grigorios.magklis@...> wrote:
Hi all,

I was wondering if the group has considered before (and rejected) the following
register layout proposal.

In this scheme, there is no SLEN parameter, instead the layout is solely defined
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at the n-th byte of the j-th register (both values starting at 0), as follows:

  n = (i div LMUL)*SEW/8
  j = (i mod LMUL) when LMUL > 1, else j = 0

where 'div' is integer division, e.g., 7 div 4 = 1.
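As a sanity check, the formula can be expressed in a few lines of Python (an illustrative sketch only; `element_position` is a name I made up, and exact rational arithmetic is used so that fractional LMUL works too):

```python
from fractions import Fraction

def element_position(i, sew_bits, lmul):
    """Return (n, j): starting byte n within register j for element i,
    per the proposed layout formula (illustrative only)."""
    lmul = Fraction(lmul)
    n = int(Fraction(i) / lmul) * sew_bits // 8      # (i div LMUL) * SEW/8
    j = int(i % lmul.numerator) if lmul > 1 else 0   # i mod LMUL when LMUL > 1, else 0
    return n, j

# VLEN=128b, SEW=8b, LMUL=2: element 0x11 lands at byte 8 of the second register
print(element_position(0x11, 8, 2))  # → (8, 1)
```

This agrees with the VLEN=128b, SEW=8b, LMUL=2 example below, where byte 8 of v[2*n+1] holds element 0x11.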

As shown in the examples below, with this scheme, when LMUL=1 the register
layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1
elements are allocated "vertically" across the register group (similar to
SLEN=SEW), and when LMUL<1 elements are evenly spaced-out across the register
(similar to SLEN=SEW/LMUL):

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[2*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[2*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[2*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3

VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    1C  18  14  10   C   8   4   0
v[2*n+1]  1D  19  15  11   D   9   5   1
v[2*n+2]  1E  1A  16  12   E   A   6   2
v[2*n+3]  1F  1B  17  13   F   B   7   3

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       1       -       0


VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   -   -   1   -   -   -   0

VLEN=128b, SEW=32b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       -       -       0

The benefit of this scheme is that software always knows the layout of the elements
in the register just by programming SEW and LMUL, as this is now the same for
all implementations, and thus can optimize code accordingly if it so wishes.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.

SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0
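The in-lane claim can be checked numerically for the SEW/LMUL=32 examples above: under the proposed formula, an element's starting byte within the register depends only on i and the ratio SEW/LMUL, not on SEW itself (a quick illustrative check; `start_byte` is my own name for the helper):

```python
from fractions import Fraction

def start_byte(i, sew_bits, lmul):
    """Starting byte of element i within its register, per the proposed
    formula: (i div LMUL) * SEW/8, using exact rational arithmetic."""
    return int(Fraction(i) / Fraction(lmul)) * sew_bits // 8

# SEW/LMUL = 32 in all three configurations: element i starts at byte 4*i each time.
for sew, lmul in [(8, 0.25), (16, 0.5), (32, 1)]:
    print([start_byte(i, sew, lmul) for i in range(4)])  # → [0, 4, 8, 12]
```

So an 8-bit element and the 32-bit element it widens into occupy the same 32-bit byte range of the register, which is the max(src_EEW, dst_EEW) container property described above.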


SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         C       8       4       0
v[2*n+1]       D       9       5       1
v[2*n+2]       E       A       6       2
v[2*n+3]       F       B       7       3


SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]      1C    18    14    10     C     8     4     0
v[2*n+1]    1D    19    15    11     D     9     5     1
v[2*n+2]    1E    1A    16    12     E     A     6     2
v[2*n+3]    1F    1B    17    13     F     B     7     3

Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
for software to eliminate any typecast instructions). When LMUL>1 or LMUL<1,
then loads and stores need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.

So, I think this scheme has the same benefits and similar implementation costs
to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that
SLEN=VLEN does not need to space out elements when loading/storing with LMUL<1),
and similar implementation costs to the v0.9 SLEN<VLEN scheme for machines
with long vectors (which is the whole point of avoiding SLEN=VLEN), but with
the added benefit that the layout is architected, so we do not introduce
fragmentation into the ecosystem.

Thanks,
Grigorios Magklis

On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:

I appreciate your agreement with my analysis, Krste.  However, I wasn't 
drawing a conclusion.  I lean toward the conclusion that we keep the 
"new v0.9 scheme" below and live with casts.  But I wasn't fully sure 
and wanted to see where the discussion might go.  I suspect each of the 
extra gates for memory access and the slower speed of short vectors is 
sufficient by itself to argue pretty strongly against the "v0.8 scheme" 
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where 
bytes are contiguous within ELEN."  I worked out what I think are the 
outlines of a proof that it's not possible, but I thought I'd suggest 
what I did at a high level first and only try to make the proof more 
rigorous if necessary.

      Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:
EXTERNAL MAIL



I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

  Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n           13      12      11      10       3       2       1       0
   v4*n+1         17      16      15      14       7       6       5       4
   v4*n+2         1B      1A      19      18       B       A       9       8
   v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background.  We adopted this scheme in Hwacha which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates.  That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n            7       5       3       1       6       4       2       0
   v4*n+1          F       D       B       9       E       C       A       8
   v4*n+2         17      15      13      11      16      14      12      10
   v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually reverting back to an older layout (Figure 7.2 in my
thesis).  This has the property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units.  For example, in the above 0.8 example, if
AVL is less than 17, only half the datapath is used.  This is why I
don't agree with Bill's conclusion.  I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting.  The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting.  I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words.  E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN, but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with single lane design I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)
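The tagging idea can be sketched abstractly (purely illustrative; the class and micro-op names below are invented for this sketch, not taken from any spec or real design):

```python
class VRegTagTracker:
    """Track the EEW each vector register was last written with, and emit a
    cast micro-op when a read uses a different EEW (illustrative sketch)."""
    def __init__(self, nregs=32):
        self.eew_tag = [None] * nregs  # EEW of last write, None if never written

    def read(self, reg, eew):
        uops = []
        tag = self.eew_tag[reg]
        if tag is not None and tag != eew:
            # on-the-fly byte rearrangement from the written EEW to the read EEW
            uops.append(("cast", reg, tag, eew))
        uops.append(("read", reg, eew))
        return uops

    def write(self, reg, eew):
        self.eew_tag[reg] = eew

rf = VRegTagTracker()
rf.write(8, 32)        # v8 written with EEW=32
print(rf.read(8, 16))  # → [('cast', 8, 32, 16), ('read', 8, 16)]
```

As noted above, repeatedly reading with the mismatched EEW keeps generating cast uops, so a real design would eventually want to rewrite the register in the new EEW.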

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
clustering.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste


On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@...> said:

| that may be


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me the definitions contained in 1, 2 and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours described
| are provably inconsistent in aggregate.
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul, (for lmul 2,4 or 8)  then cache lines can be
| requested in an order such that when they arrive corresponding segments
| can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, it appears to me that a facility that clusters elements
| of smaller than a certain size still allows behaviours 1 and 2.
| Further, for element lengths up to that cluster size the in-register
| order matches the in-memory order.
||
|| The SLEN solution we've had for some time allows for #1 and #2.  We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree reconsidering #2
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here I believe is where you introduce implementation efficacy.
| Once implementation design considerations are introduced, the proof
| becomes much more complex: it is compounded by objectives and technical
| tradeoffs, and becomes less a matter of mathematical rigor.

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that perhaps will result in less ecosystem fragmentation.
| However, anticipating a community's response to even the smallest of
| changes is crystal-ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim I believe it may be worthwhile exercises to consider
| equivalences of functionality.

| Specifically, vertical stripping vs horizontal interleave for widening
| ops, in-register vs in-memory order for element width alignment.
| I hope that the more we identify the easier it will be to compare them
| and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian
| with concerns about granularity (inherent in big endian and obscured
| with little-endian (aligned vs unaligned still relevant))



||
|| Bill
||
|| On 5/15/20 11:55 AM, Krste Asanovic wrote:
||| EXTERNAL MAIL
|||
|||
|||
||| Date: 2020/5/15
||| Task Group: Vector Extension
||| Chair: Krste Asanovic
||| Co-Chair: Roger Espasa
||| Number of Attendees: ~20
||| Current issues on github: https://urldefense.com/v3/__https://github.com/riscv/riscv-v-spec__;!!EHscmS1ygiU1lA!W3LXrGwuFwNIJ12NX5xQnmMbzk4zgzIDO39xVFEgrQGQSggvT8Zg9M2ElNRv61w$
|||
||| Issues discussed:
|||
||| # MLEN=1 change
|||
||| The new layout of mask registers with fixed MLEN=1 was discussed.  The
||| group was generally in favor of the change, though there is a proposal
||| in flight to rearrange bits to align with bytes.  This might save some
||| wiring but could increase bits read/written for the mask in a
||| microarchitecture.
|||
||| #434 SLEN=VLEN as optional extension
|||
||| Most of the time was spent discussing the possible software
||| fragmentation from having code optimized for SLEN=VLEN versus
||| SLEN<VLEN, and how to avoid it.  The group was keen to prevent possible
||| fragmentation, so is going to consider several options:
|||
||| - providing cast instructions that are mandatory, so at least
||| SLEN<VLEN code runs correctly on SLEN=VLEN machines.
|||
||| - consider a different data layout that could allow casting up to ELEN
||| (<=SLEN), however these appear to result in even greater variety of
||| layouts or dynamic layouts
|||
||| - invent a microarchitecture that can appear as SLEN=VLEN but
||| internally restrict datapath communication within SLEN width of
||| datapath, or prove this is impossible/expensive
|||
|||
||| # v0.9
|||
||| The group agreed to declare the current version of the spec as 0.9,
||| representing a clear stable step for software and implementors.
|||
|||
|||
|||
|||
|||
|||
||
||


|






Guy Lemieux
 

The precise data layout pattern does not matter.

What matters is that a single distribution pattern is agreed upon to
avoid fragmenting the software ecosystem.

With my additional restriction, the load/store side of an
implementation is greatly simplified, allowing for simple
implementations.

The main drawback of my restriction is how to avoid the overhead of
the cast instruction in an aggressive implementation? The cast
instruction must rearrange data to translate between LMUL!=1 and
LMUL=1 data layouts; my proposal requires these casts to be executed
between any load/stores (which always assume LMUL=1) and compute
instructions which use LMUL!=1. I think this can sometimes be done for
"free" by carefully planning your compute instructions. For example, a
series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to
the same register group destination can be macro-op fused. I don't
think the same thing can be done for vst instructions, unless it
macro-op fuses a longer sequence consisting of cast / vst / clear
register group (or some other operation that overwrites the cast
destination, indicating the cast is superfluous and only used by the
stores).

Guy

On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@gmail.com> wrote:

This is v0.8 with SLEN=8.


Nick Knight
 

I appreciate this discussion about making things friendlier to software. I've always felt the constraints on SLEN-agnostic software to be a nuisance, albeit a mild one.

However, I do have a concern about code bloat from removing LMUL > 1 memory operations. This is purely subjective: I have not done any concrete analysis. But here is an example that illustrates my concern:

# C code:
# int8_t x[N]; for(int i = 0; i < N; ++i) ++x[i];
# keep N in a0 and &x[0] in a1

# "BEFORE" (original RVV code):
loop:
vsetvli t0, a0, e8,m8
vle8.v v0, (a1)
vadd.vi v0, v0, 1
vse8.v v0, (a1)
add a1, a1, t0
sub a0, a0, t0
bnez a0, loop

# "AFTER" removing LMUL > 1 loads/stores:
loop:
vsetvli t0, a0, e8,m8
mv t1, t0
mv a2, a1

# loads:
vsetvli t2, t1, e8,m1
vle8.v v0, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v1, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v2, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v3, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v4, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v5, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v6, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vle8.v v7, (a2)

# cast instructions ...

vsetvli x0, t0, e8,m8
vadd.vi v0, v0, 1

# more cast instructions ...
mv t1, t0
mv a2, a1

# stores:
vsetvli t2, t1, e8,m1
vse8.v v0, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v1, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v2, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v3, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v4, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v5, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v6, (a2)
add a2, a2, t2
sub t1, t1, t2
vsetvli t2, t1, e8,m1
vse8.v v7, (a2)

add a1, a1, t0
sub a0, a0, t0
bnez a0, loop

