
Re: What is the plan for rvv v1.0

Mark
 

If you are looking for expected dates, they are always in the spec status spreadsheet at:


If it is something else, please let us know specifically what you are looking for.

thanks
Mark


On Wed, Nov 25, 2020 at 12:51 AM <weiwei.wang@...> wrote:
Hi Krste and Andrew, 

What is the rough plan for rvv v1.0 release? I searched vector-ext mailing list but can’t find the info I want.

 

Thanks

Weiwei  


What is the plan for rvv v1.0

Wang Weiwei
 

Hi Krste and Andrew, 

What is the rough plan for rvv v1.0 release? I searched vector-ext mailing list but can’t find the info I want.

 

Thanks

Weiwei  


next vector meeting in 7 hours

Krste Asanovic
 

I think we'll be spending a chunk of time on mask layout and
implementation issues.

See you then,

Krste


Re: rename vfrece7/vfrsqrte7 to vfrec7 and vfrsqrt7

Andrew Waterman
 



On Sun, Nov 15, 2020 at 3:08 PM Krste Asanovic <krste@...> wrote:

This is issue #601.

It was pointed out that the *e7 (estimate to 7 bits) suffix on a mnemonic is
easily confused with e32 (element size 32) on other mnemonics.

This is probably one we can handle on the email thread. I'm in favor of the
change (we can keep the old names as aliases in the toolchain for now to avoid churn).

👍

Krste

rename vfrece7/vfrsqrte7 to vfrec7 and vfrsqrt7

Krste Asanovic
 

This is issue #601.

It was pointed out that the *e7 (estimate to 7 bits) suffix on a mnemonic is
easily confused with e32 (element size 32) on other mnemonics.

This is probably one we can handle on the email thread. I'm in favor of the
change (we can keep the old names as aliases in the toolchain for now to avoid churn).

Krste


Vector TG minutes 2020/11/13 meeting

Krste Asanovic
 

Date: 2020/11/13
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~22
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

# Removing half-precision FP as a mandate in the V extension

The group discussed a proposal to remove half-precision floating-point
operations from the mandated set for the V extension, moving FP16
support to an additional standard extension. BF16 support could be
added in the same way at a later date. For both 16-bit FP formats, the
group agreed to define a standard sub-extension supporting only
conversion up/down to/from FP32 (no integer conversions, and no .rod.
variant). This approach enables implementations to support either,
both, or neither of these 16-bit floating-point formats. The impact on
the software stack is expected to be small, given that half-precision
FP use is not yet widespread.

Relatedly, there is a discussion of what level of FP16 support should
be mandated in the next application-processor profile (RVA21). Three
options were considered:
1) no FP16 support mandate
2) RVA21 mandates FP16 convert operations only
3) V mandates FP16 convert operations only

This is to be discussed further, but the group seemed to favor option
1 or 2.
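Since the agreed sub-extension covers only conversions to/from FP32, the substantive design choice is the narrowing rounding. As a rough illustration (not from the minutes; the function names are ours), FP32-to-BF16 narrowing with round-to-nearest-even can be done directly on the bit patterns. This is a sketch that deliberately ignores NaN handling:

```python
import struct

def f32_to_bf16_rne(x: float) -> int:
    """Narrow an FP32 value to a BF16 bit pattern, round-to-nearest-even.
    Sketch only: NaN handling is deliberately omitted."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    lsb = (bits >> 16) & 1          # LSB of the kept bits (for ties-to-even)
    bits += 0x7FFF + lsb            # round away the 16 discarded bits
    return (bits >> 16) & 0xFFFF

def bf16_to_f32(h: int) -> float:
    """Widening BF16 to FP32 is exact: restore the 16 low zero bits."""
    return struct.unpack('<f', struct.pack('<I', (h & 0xFFFF) << 16))[0]
```

The widening direction is exact, which is one reason a conversion-only sub-extension is cheap to mandate.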

# Mask handling

There was initial discussion of concerns over mask handling for wide
spatial implementations and/or implementations with vector register
renaming.

One proposal was to not allow "tail undisturbed" on mask results, as
this complicates renamed implementations and those with specialized
mask handling while not being particularly useful to software. New
issue #602 was created for this proposal.


Half-Precision, BFloat16, and Other Float Encoding: Reference Model Recommendations from Task Group

Krste Asanovic
 

Because there is no "official" BF16 standard (beyond interchange
format) and because other vendors have made incompatible choices (so
no de-facto standard either), we will need to define the RISC-V BF16
arithmetic standard.

Some people are working towards this as part of the alternate-FP group
proposal; there are a few details that need some thought.

Once this is specified, there can be reference implementations
produced.

Krste

On Fri, 13 Nov 2020 09:11:18 -0800, "CDS" <cohen.steed@wdc.com> said:
| In support of Open Source Software and publicly released modeling schemes, does the Vector Task Group have a recommendation for arithmetic
| reference? The published ISSs can provide checking results from a heavy-lifting simulation perspective, but even they must rely on something to
| model and calculate arithmetic results. The IEEE-754 encodings are handled by some easily-found solutions - what about BFloat16 and other encodings?
|


Half-Precision, BFloat16, and Other Float Encoding: Reference Model Recommendations from Task Group

CDS
 

In support of Open Source Software and publicly released modeling schemes, does the Vector Task Group have a recommendation for arithmetic reference? The published ISSs can provide checking results from a heavy-lifting simulation perspective, but even they must rely on something to model and calculate arithmetic results. The IEEE-754 encodings are handled by some easily-found solutions - what about BFloat16 and other encodings?


Vector TG minutes from 2020/11/6 meeting

Krste Asanovic
 

Also, a reminder that we'll be meeting tomorrow (Friday, Nov 13) per the
calendar entry (7 hours from now),

Krste

Date: 2020/11/06
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~18
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

#592 "Heritage" - old prior art

We discussed how to add notes showing known older designs that had the
same features as those included in the spec, and spent some time
covering the details in the spec. We cannot discuss live patents in
meetings, but we can describe public-domain techniques from 20+ years
ago. The proposal is to accept pull requests with details incorporated
as NOTE comments for now. These should ideally be formatted differently
from other commentary, but this can be done later.

(#529) Whole register load/store misalignment exceptions

While we had previously agreed (#529) that whole register loads/stores
could report misalignment exceptions if the base address was not
aligned to the encoded hint EEW, we had not considered cases where
the machine does not support smaller EEWs. In particular, whole
register stores are always encoded with EEW=8.

We rejected allowing machines to report exceptions if not
VLEN-aligned, as this would complicate stack save/restore in ABIs with
smaller stack alignments.

We decided to stay with the current text, which allows misaligned
exceptions to be reported according to the greater of the smallest
supported EEW and the encoded EEW. Profiles can mandate support for
certain EEWs.
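The agreed rule amounts to a single max. A toy check (our own naming, not the spec's; EEW arguments in bits) for when an implementation may report a misaligned exception on a whole register access:

```python
def may_trap_misaligned(addr: int, encoded_eew: int,
                        smallest_supported_eew: int) -> bool:
    """True if a whole register access at addr may report a misalignment
    exception: alignment is checked against the greater of the smallest
    supported EEW and the encoded hint EEW (sketch; names are ours)."""
    req_bytes = max(encoded_eew, smallest_supported_eew) // 8
    return addr % req_bytes != 0
```

For example, a machine whose smallest supported EEW is 32 may trap an EEW=8-encoded whole register store at an address that is not 4-byte aligned.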


Re: vector strided stores when rs1=x0

Bill Huffman
 

This sounds right to me as well.  No use making a special case for strided stores with rs2=x0.

      Bill

On 11/9/20 12:04 PM, Nick Knight wrote:

I understand now. I'm on board iff the memory consistency model experts assent.

Re: vector strided stores when rs1=x0

Nick Knight
 

I understand now. I'm on board iff the memory consistency model experts assent.

On Mon, Nov 9, 2020 at 11:41 AM Krste Asanovic <krste@...> wrote:
There’s a comment about this in spec already.

But note that this would be in a case where you're relying on having multiple accesses in a non-deterministic order to one memory location, which is probably fraught for other reasons.

Krste

Re: vector strided stores when rs1=x0

Krste Asanovic
 

There’s a comment about this in spec already.

But note that this would be in a case where you're relying on having multiple accesses in a non-deterministic order to one memory location, which is probably fraught for other reasons.

Krste

On Nov 9, 2020, at 11:38 AM, Nick Knight <nick.knight@...> wrote:

Sorry, slightly off topic, but what was the rationale for 

When `rs2!=x0` and the value of `x[rs2]=0`, the implementation must perform one memory access for each active element (but these accesses will not be ordered).

I guess I'm thinking about the possibility of a toolchain relaxing `li x1, 0; inst x1` into `inst x0`.


Re: vector strided stores when rs1=x0

Nick Knight
 

Sorry, slightly off topic, but what was the rationale for

When `rs2!=x0` and the value of `x[rs2]=0`, the implementation must perform one memory access for each active element (but these accesses will not be ordered).

I guess I'm thinking about the possibility of a toolchain relaxing `li x1, 0; inst x1` into `inst x0`.


On Mon, Nov 9, 2020 at 10:09 AM Krste Asanovic <krste@...> wrote:
I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value),

Krste

Re: vector strided stores when rs1=x0

Krste Asanovic
 

I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value),

Krste

Re: vector strided stores when rs1=x0

Krste Asanovic
 


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


Re: vector strided stores when rs1=x0

Guy Lemieux
 

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO. (I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)

Guy


vector strided stores when rs1=x0

Krste Asanovic
 

Also on github as issue #595

In our earlier TG discussion in 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements. This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason not to also allow strided stores
(which are now unordered) to perform fewer memory operations (in
effect, picking a random active element to write back). The behavior
is indistinguishable from a possible legal execution under the prior
scheme, and it has the potential niche use of storing an element value
to memory when all elements are known to have the same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding for strided stores with
rs1=x0, but this would add some asymmetry. Software could then get a
similar effect by setting vl=1 before the store.

Krste
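To make the asymmetry in the thread concrete, here is a toy model (ours, not normative) of the relaxed strided-load semantics: rs2=x0 encodes stride 0 and lets the implementation collapse to a single access and splat, whereas rs2!=x0 with x[rs2]=0 must still make one (unordered) access per active element:

```python
def strided_access(mem, base, stride, vl, stride_is_x0=False):
    """Return (elements, memory_access_count) for a strided vector load.
    With rs2=x0 the implementation may perform one access and splat it;
    with a zero stride held in a real register, it must access memory
    once per active element (sketch; names are ours)."""
    if stride_is_x0:
        return [mem[base]] * vl, 1     # one access, splatted to vl elements
    return [mem[base + i * stride] for i in range(vl)], vl
```

The distinction matters exactly in the memory-mapped FIFO case Guy raises: with rs2!=x0 and x[rs2]=0, each element still generates a device access, just with no ordering guarantee.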


Re: Vector Byte Arrangement in Wide Implementations

Andrew Waterman
 



On Thu, Nov 5, 2020 at 11:05 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 10:51 PM, Bill Huffman wrote:


On 11/5/20 8:33 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 5:31 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 4:36 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 4:17 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 3:35 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.

If we use a PC to predict, we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size, not because they're actually related in current use.

If your per-vector-register predictor works well to begin with, I would think you could extend it with a valid bit that indicates whether to use the prediction or the encoded hint, and it would probably work OK.  Right?
How are you thinking that bit gets set/cleared?  The same store instruction is used whether or not the compiler will be able to put in a hint.

I was thinking of something along the lines of the Alpha 21264's store-wait predictor: every N cycles (with N probably somewhere in the range of [2^10, 2^14]), clear the valid bits.

Do you know of a reference for how the store-wait predictor works?  I can't find any reference to it in the 21264 hardware reference manual, though I found a reference (without description) in a paper.

I see it's called a "stWait table" in the hardware reference manual.

I see how that works for waiting loads.  I'm guessing that you're thinking of a PC-based table here that remembers that certain whole register loads fail on the hint and should use the previous store EEW instead.  I'll think about that.  Maybe it would work OK.

The key insight from the 21264 design is that the valid bits are cleared periodically.  I appreciate that you were willing to meet me more than halfway, but I actually think your original idea would suffice: per-vector-register state + valid bits + periodic clearing.  I think what you assumed I was suggesting would also work.  Deciding which is better is a matter of ISCAtecture :-)
 

     Bill

Maybe you're thinking of tracking whether it's better to go by the EEW of the whole register load hint or the EEW of the most recent whole register store.  If the one that's being used is wrong often enough, try the other one.




The saving grace is that the misprediction penalty isn't nearly as extreme as, say, a branch misprediction in an OOO superscalar.  If we can get rid of, say, 80% of the hiccups on interprocedure spills and fills (and of course 100% of intraprocedure ones) then the perf impact of the hiccups won't be a huge thing.

I'm hoping such a predictor is not needed for the reasons you say.  But I want to know that there is an "out" if it comes to that.

Gotcha.  And while I'm not 100% convinced of the efficacy of my particular proposal, I think it's sufficiently on the right track that clever microarchitects can devise a solution along those lines.

My immediate idea of a solution is that we have 3 bits of size hint for whole register loads.  Only one value of that field is used for whole register stores.  Let's use two values instead for the whole register store.  One hints that the following whole register load's hint will be correct.  The other whole register store opcode hints that the EEW being stored is more likely to be correct for the following whole register load.  Architecturally, they do the same thing, of course.

      Bill


reminder, Vector task group meeting Friday

Krste Asanovic
 

We'll meet per the calendar entry.

Agenda is to go over any remaining unsettled open issues,

Krste


Re: Vector Byte Arrangement in Wide Implementations

Bill Huffman
 


On 11/5/20 10:51 PM, Bill Huffman wrote:


On 11/5/20 8:33 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 5:31 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 4:36 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 4:17 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 3:35 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.
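A behavioral sketch of that PC-indexed, PC-tagged structure (my own modeling, not code from the thread; the entry count, tag derivation, and training policy are all invented for illustration):

```python
# Model of an "ignore-the-encoded-EEW" predictor: a PC-indexed,
# PC-tagged table where a hit means "ignore the EEW hint encoded in the
# whole-register load and use the memoized EEW instead".

class IgnoreHintPredictor:
    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}                   # index -> (tag, memoized_eew)

    def _index_tag(self, pc):
        idx = (pc >> 2) % self.entries    # drop instruction alignment bits
        tag = pc // (4 * self.entries)    # remaining upper PC bits
        return idx, tag

    def predict(self, pc, encoded_eew):
        """EEW to use for the whole-register load at `pc`."""
        idx, tag = self._index_tag(pc)
        entry = self.table.get(idx)
        if entry is not None and entry[0] == tag:
            return entry[1]               # hit: override the encoded hint
        return encoded_eew                # miss: trust the hint

    def train(self, pc, actual_eew, encoded_eew):
        """After the fill resolves, memoize the EEW when the hint was wrong."""
        idx, tag = self._index_tag(pc)
        if actual_eew != encoded_eew:
            self.table[idx] = (tag, actual_eew)
        elif (e := self.table.get(idx)) and e[0] == tag:
            del self.table[idx]           # hint reliable again; stop overriding
```

A fill site that repeatedly sees, say, e16 data under an e32 hint starts being arranged as e16; once the hint is right again, the entry is dropped.  Bill's objection below still applies: a shared fill site called with different register contents will thrash this table unless PC history is folded in.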

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.

If we use a PC to predict, we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size not because they're actually related in current use.

If your per-vector-register predictor works well to begin with, I would think you could extend it with a valid bit that indicates whether to use the prediction or the encoded hint, and it would probably work OK.  Right?
How are you thinking that bit gets set/cleared?  The same store instruction is used whether or not the compiler will be able to put in a hint.

I was thinking of something along the lines of the Alpha 21264's store-wait predictor: every N cycles (with N probably somewhere in the range of [2^10, 2^14]), clear the valid bits.

Do you know of a reference for how the store-wait predictor works?  I can't find any reference to it in the 21264 hardware reference manual, though I found a reference (without description) in a paper.

I see it's called a "stWait table" in the hardware reference manual.

I see how that works for waiting loads.  I'm guessing that you're thinking of a PC-based table here that remembers that certain whole register loads fail on the hint and should use the previous store EEW instead.  I'll think about that.  Maybe it would work OK.

     Bill

Maybe you're thinking of tracking whether it's better to go by the EEW of the whole register load hint or the EEW of the most recent whole register store.  If the one that's being used is wrong often enough, try the other one.




The saving grace is that the misprediction penalty isn't nearly as extreme as, say, a branch misprediction in an OOO superscalar.  If we can get rid of, say, 80% of the hiccups on interprocedural spills and fills (and of course 100% of intraprocedural ones) then the perf impact of the hiccups won't be a huge thing.

I'm hoping such a predictor is not needed for the reasons you say.  But I want to know that there is an "out" if it comes to that.

Gotcha.  And while I'm not 100% convinced of the efficacy of my particular proposal, I think it's sufficiently on the right track that clever microarchitects can devise a solution along those lines.

My immediate idea of a solution is that we have 3 bits of size hint for whole register loads.  Only one value of that field is used for whole register stores.  Let's use two values instead for the whole register store.  One hints that the following whole register load's hint will be correct.  The other whole register store opcode hints that the EEW being stored is more likely to be correct for the following whole register load.  Architecturally, they do the same thing, of course.
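The two-flavor store idea can be modeled behaviorally.  This is my own hypothetical sketch: the flavor names, encoding values, and per-register state are invented, and the two stores are (as stated above) architecturally identical.

```python
# Model of a whole-register store with two hint flavors: one says the
# next whole-register load's encoded hint will be right; the other says
# the EEW being stored is the better guess (a blind compiler spill).

STORE_TRUST_HINT  = 0   # compiler knows the EEW; believe the load's hint
STORE_TRUST_STORE = 1   # blind spill; believe the EEW live at the store

class TwoFlavorStorePredictor:
    def __init__(self, num_vregs=32):
        self.stored_eew = [None] * num_vregs
        self.trust_store = [False] * num_vregs

    def on_store(self, vreg, eew, flavor):
        """Record the spilled EEW and which source to trust on the fill."""
        self.stored_eew[vreg] = eew
        self.trust_store[vreg] = (flavor == STORE_TRUST_STORE)

    def fill_eew(self, vreg, encoded_hint):
        """EEW to arrange bytes for on the next whole-register fill."""
        if self.trust_store[vreg] and self.stored_eew[vreg] is not None:
            return self.stored_eew[vreg]
        return encoded_hint
```

The appeal is that no PC-indexed table is needed at all: the compiler statically marks which spills it couldn't annotate, and the hardware only has to carry one EEW and one trust bit per vector register.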

      Bill
