
Re: vector strided stores when rs1=x0

Nick Knight
 

I understand now. I'm on board iff the memory consistency model experts assent.

On Mon, Nov 9, 2020 at 11:41 AM Krste Asanovic <krste@...> wrote:
There’s a comment about this in spec already.

But note that this would be in a case where you're relying on having multiple accesses in a non-deterministic order to one memory location, which is probably fraught for other reasons.

Krste

On Nov 9, 2020, at 11:38 AM, Nick Knight <nick.knight@...> wrote:

Sorry, slightly off topic, but what was the rationale for 

When `rs2!=x0` and the value of `x[rs2]=0`, the implementation must perform one memory access for each active element (but these accesses will not be ordered).

I guess I'm thinking about the possibility of a toolchain relaxing `li x1, 0; inst x1` into `inst x0`.


On Mon, Nov 9, 2020 at 10:09 AM Krste Asanovic <krste@...> wrote:
I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value).

Krste

On Nov 9, 2020, at 9:12 AM, Krste Asanovic via lists.riscv.org <krste=berkeley.edu@...> wrote:


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in the 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason not to allow strided stores
(which are now unordered) to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of the prior
scheme, and it has a potential niche use: storing an element value to
memory when all elements are known to hold the same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by setting vl=1 before the store.

Krste









Re: vector strided stores when rs1=x0

Krste Asanovic
 

There’s a comment about this in spec already.

But note that this would be in a case where you're relying on having multiple accesses in a non-deterministic order to one memory location, which is probably fraught for other reasons.

Krste

On Nov 9, 2020, at 11:38 AM, Nick Knight <nick.knight@...> wrote:

Sorry, slightly off topic, but what was the rationale for 

When `rs2!=x0` and the value of `x[rs2]=0`, the implementation must perform one memory access for each active element (but these accesses will not be ordered).

I guess I'm thinking about the possibility of a toolchain relaxing `li x1, 0; inst x1` into `inst x0`.


On Mon, Nov 9, 2020 at 10:09 AM Krste Asanovic <krste@...> wrote:
I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value).

Krste

On Nov 9, 2020, at 9:12 AM, Krste Asanovic via lists.riscv.org <krste=berkeley.edu@...> wrote:


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in the 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason not to allow strided stores
(which are now unordered) to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of the prior
scheme, and it has a potential niche use: storing an element value to
memory when all elements are known to hold the same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by setting vl=1 before the store.

Krste









Re: vector strided stores when rs1=x0

Nick Knight
 

Sorry, slightly off topic, but what was the rationale for

When `rs2!=x0` and the value of `x[rs2]=0`, the implementation must perform one memory access for each active element (but these accesses will not be ordered).

I guess I'm thinking about the possibility of a toolchain relaxing `li x1, 0; inst x1` into `inst x0`.
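
For concreteness, a minimal sketch of the two encodings the quoted rule distinguishes, assuming v1.0 strided-load mnemonics (register and element-width choices are illustrative); such a relaxation would turn the first form into the second:

    # rs2 != x0 but x[rs2] = 0: one memory access per active element is
    # required (though the accesses are unordered).
    li       x1, 0
    vlse32.v v4, (a0), x1

    # rs2 = x0: the implementation may perform fewer accesses, so the two
    # forms are not interchangeable for, e.g., a memory-mapped FIFO.
    vlse32.v v4, (a0), x0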


On Mon, Nov 9, 2020 at 10:09 AM Krste Asanovic <krste@...> wrote:
I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value).

Krste

On Nov 9, 2020, at 9:12 AM, Krste Asanovic via lists.riscv.org <krste=berkeley.edu@...> wrote:


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in the 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason not to allow strided stores
(which are now unordered) to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of the prior
scheme, and it has a potential niche use: storing an element value to
memory when all elements are known to hold the same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by setting vl=1 before the store.

Krste








Re: vector strided stores when rs1=x0

Krste Asanovic
 

I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value).

Krste

On Nov 9, 2020, at 9:12 AM, Krste Asanovic via lists.riscv.org <krste=berkeley.edu@...> wrote:


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in the 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason not to allow strided stores
(which are now unordered) to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of the prior
scheme, and it has a potential niche use: storing an element value to
memory when all elements are known to hold the same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by setting vl=1 before the store.

Krste








Re: vector strided stores when rs1=x0

Krste Asanovic
 


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste
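
For reference, a minimal sketch of the ordered-index approach, assuming v1.0 mnemonics and a hypothetical memory-mapped FIFO at the address in a0 (register and element-width choices are illustrative):

    vsetvli    t0, a2, e32, m1, ta, ma
    vmv.v.i    v8, 0             # all-zero byte offsets: every element targets (a0)
    vsoxei32.v v4, (a0), v8      # ordered scatter: elements reach the FIFO in order
    vloxei32.v v2, (a0), v8      # ordered gather: elements read from the FIFO in order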

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in the 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason not to allow strided stores
(which are now unordered) to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of the prior
scheme, and it has a potential niche use: storing an element value to
memory when all elements are known to hold the same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by setting vl=1 before the store.

Krste







Re: vector strided stores when rs1=x0

Guy Lemieux
 

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO. (I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in the 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason not to allow strided stores
(which are now unordered) to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of the prior
scheme, and it has a potential niche use: storing an element value to
memory when all elements are known to hold the same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by setting vl=1 before the store.

Krste






vector strided stores when rs1=x0

Krste Asanovic
 

Also on github as issue #595

In our earlier TG discussion in the 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason not to allow strided stores
(which are now unordered) to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of the prior
scheme, and it has a potential niche use: storing an element value to
memory when all elements are known to hold the same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry. Software could then get a
similar effect by setting vl=1 before the store.

Krste
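
As an illustration, a minimal sketch of both effects described above, assuming v1.0 assembler syntax (the register choices and e32 element width are illustrative):

    # Splat a scalar memory value at (a0) into v4.  With x0 as the stride
    # register, the implementation may perform fewer than vl accesses.
    vsetvli  t0, a2, e32, m1, ta, ma
    vlse32.v v4, (a0), x0

    # "Store one element" effect for a strided store: set vl=1 first.
    li       t1, 1
    vsetvli  zero, t1, e32, m1, ta, ma
    vsse32.v v4, (a1), x0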


Re: Vector Byte Arrangement in Wide Implementations

Andrew Waterman
 



On Thu, Nov 5, 2020 at 11:05 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 10:51 PM, Bill Huffman wrote:


On 11/5/20 8:33 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 5:31 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 4:36 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 4:17 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 3:35 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.
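
For concreteness, a minimal sketch of the hinted spill/fill pattern from the second list item above, assuming the v1.0 whole-register load/store mnemonics and EEW=32 data (the frame handling and register choices are illustrative); the cross-unit problem is that the compiler cannot always know which EEW hint to encode on the reload:

    csrr      t0, vlenb          # a spill slot must hold one full vector register
    sub       sp, sp, t0
    vs1r.v    v8, (sp)           # spill: whole-register store, no EEW encoded
    # ... code that clobbers v8 ...
    vl1re32.v v8, (sp)           # fill: whole-register load carrying an EEW=32 hint
    add       sp, sp, t0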

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.

If we use a PC to predict, we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size, not because they're actually related in current use.

If your per-vector-register predictor works well to begin with, I would think you could extend it with a valid bit that indicates whether to use the prediction or the encoded hint, and it would probably work OK.  Right?
How are you thinking that bit gets set/cleared?  The same store instruction is used whether or not the compiler will be able to put in a hint.

I was thinking of something along the lines of the Alpha 21264's store-wait predictor: every N cycles (with N probably somewhere in the range of [2^10, 2^14]), clear the valid bits.

Do you know of a reference for how the store-wait predictor works?  I can't find any reference to it in the 21264 hardware reference manual, though I found a reference (without description) in a paper.

I see it's called a "stWait table" in the hardware reference manual.

I see how that works for waiting loads.  I'm guessing that you're thinking of a PC-based table here that remembers that certain whole register loads fail on the hint and should use the previous store EEW instead.  I'll think about that.  Maybe it would work OK.

The key insight from the 21264 design is that the valid bits are cleared periodically.  I appreciate that you were willing to meet me more than halfway, but I actually think your original idea would suffice: per-vector-register state + valid bits + periodic clearing.  I think what you assumed I was suggesting would also work.  Deciding which is better is a matter of ISCAtecture :-)
 

     Bill

Maybe you're thinking of tracking whether it's better to go by the EEW of the whole register load hint or the EEW of the most recent whole register store.  If the one that's being used is wrong often enough, try the other one.




The saving grace is that the misprediction penalty isn't nearly as extreme as, say, a branch misprediction in an OOO superscalar.  If we can get rid of, say, 80% of the hiccups on interprocedure spills and fills (and of course 100% of intraprocedure ones) then the perf impact of the hiccups won't be a huge thing.

I'm hoping such a predictor is not needed for the reasons you say.  But I want to know that there is an "out" if it comes to that.

Gotcha.  And while I'm not 100% convinced of the efficacy of my particular proposal, I think it's sufficiently on the right track that clever microarchitects can devise a solution along those lines.

My immediate idea of a solution is that we have 3 bits of size hints for whole loads.  Only one value of the same field is used for whole register stores.  Let's use two values instead for the whole register store.  One hints that the following whole register load hint will be correct.  The other whole register store opcode hints that the EEW being stored is more likely to be correct for the following whole register load.  Architecturally, they do the same thing, of course.

      Bill


reminder, Vector task group meeting Friday

Krste Asanovic
 

We'll meet per the calendar entry.

The agenda is to go over any remaining unsettled open issues.

Krste


Re: Vector Byte Arrangement in Wide Implementations

Bill Huffman
 


On 11/5/20 10:51 PM, Bill Huffman wrote:


On 11/5/20 8:33 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 5:31 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 4:36 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 4:17 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 3:35 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.

If we use a PC to predict, we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size, not because they're actually related in current use.

If your per-vector-register predictor works well to begin with, I would think you could extend it with a valid bit that indicates whether to use the prediction or the encoded hint, and it would probably work OK.  Right?
How are you thinking that bit gets set/cleared?  The same store instruction is used whether or not the compiler will be able to put in a hint.

I was thinking of something along the lines of the Alpha 21264's store-wait predictor: every N cycles (with N probably somewhere in the range of [2^10, 2^14]), clear the valid bits.

Do you know of a reference for how the store-wait predictor works?  I can't find any reference to it in the 21264 hardware reference manual, though I found a reference (without description) in a paper.

I see it's called a "stWait table" in the hardware reference manual.

I see how that works for waiting loads.  I'm guessing that you're thinking of a PC-based table here that remembers that certain whole register loads fail on the hint and should use the previous store EEW instead.  I'll think about that.  Maybe it would work OK.

     Bill

Maybe you're thinking of tracking whether it's better to go by the EEW of the whole register load hint or the EEW of the most recent whole register store.  If the one that's being used is wrong often enough, try the other one.




The saving grace is that the misprediction penalty isn't nearly as extreme as, say, a branch misprediction in an OOO superscalar.  If we can get rid of, say, 80% of the hiccups on interprocedure spills and fills (and of course 100% of intraprocedure ones) then the perf impact of the hiccups won't be a huge thing.

I'm hoping such a predictor is not needed for the reasons you say.  But I want to know that there is an "out" if it comes to that.

Gotcha.  And while I'm not 100% convinced of the efficacy of my particular proposal, I think it's sufficiently on the right track that clever microarchitects can devise a solution along those lines.

My immediate idea of a solution is that we have 3 bits of size hints for whole loads.  Only one value of the same field is used for whole register stores.  Let's use two values instead for the whole register store.  One hints that the following whole register load hint will be correct.  The other whole register store opcode hints that the EEW being stored is more likely to be correct for the following whole register load.  Architecturally, they do the same thing, of course.

      Bill


Re: Vector Byte Arrangement in Wide Implementations

Bill Huffman
 


On 11/5/20 8:33 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 5:31 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 4:36 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 4:17 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 3:35 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.

If we use a PC to predict, we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size, not because they're actually related in current use.

If your per-vector-register predictor works well to begin with, I would think you could extend it with a valid bit that indicates whether to use the prediction or the encoded hint, and it would probably work OK.  Right?
How are you thinking that bit gets set/cleared?  The same store instruction is used whether or not the compiler will be able to put in a hint.

I was thinking of something along the lines of the Alpha 21264's store-wait predictor: every N cycles (with N probably somewhere in the range of [2^10, 2^14]), clear the valid bits.

Do you know of a reference for how the store-wait predictor works?  I can't find any reference to it in the 21264 hardware reference manual, though I found a reference (without description) in a paper.

Maybe you're thinking of tracking whether it's better to go by the EEW of the whole register load hint or the EEW of the most recent whole register store.  If the one that's being used is wrong often enough, try the other one.




The saving grace is that the misprediction penalty isn't nearly as extreme as, say, a branch misprediction in an OOO superscalar.  If we can get rid of, say, 80% of the hiccups on interprocedure spills and fills (and of course 100% of intraprocedure ones) then the perf impact of the hiccups won't be a huge thing.

I'm hoping such a predictor is not needed for the reasons you say.  But I want to know that there is an "out" if it comes to that.

Gotcha.  And while I'm not 100% convinced of the efficacy of my particular proposal, I think it's sufficiently on the right track that clever microarchitects can devise a solution along those lines.

My immediate idea of a solution is that we have 3 bits of size hints for whole loads.  Only one value of the same field is used for whole register stores.  Let's use two values instead for the whole register store.  One hints that the following whole register load hint will be correct.  The other whole register store opcode hints that the EEW being stored is more likely to be correct for the following whole register load.  Architecturally, they do the same thing, of course.

      Bill


Re: Vector Byte Arrangement in Wide Implementations

Andrew Waterman
 



On Thu, Nov 5, 2020 at 5:31 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 4:36 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 4:17 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 3:35 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.

If we use a PC to predict, we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size, not because they're actually related in current use.

If your per-vector-register predictor works well to begin with, I would think you could extend it with a valid bit that indicates whether to use the prediction or the encoded hint, and it would probably work OK.  Right?
How are you thinking that bit gets set/cleared?  The same store instruction is used whether or not the compiler will be able to put in a hint.

I was thinking of something along the lines of the Alpha 21264's store-wait predictor: every N cycles (with N probably somewhere in the range of [2^10, 2^14]), clear the valid bits.



The saving grace is that the misprediction penalty isn't nearly as extreme as, say, a branch misprediction in an OOO superscalar.  If we can get rid of, say, 80% of the hiccups on interprocedure spills and fills (and of course 100% of intraprocedure ones) then the perf impact of the hiccups won't be a huge thing.

I'm hoping such a predictor is not needed for the reasons you say.  But I want to know that there is an "out" if it comes to that.

Gotcha.  And while I'm not 100% convinced of the efficacy of my particular proposal, I think it's sufficiently on the right track that clever microarchitects can devise a solution along those lines.

      Bill


Re: Vector Byte Arrangement in Wide Implementations

Bill Huffman
 


On 11/5/20 5:31 PM, Bill Huffman wrote:


On 11/5/20 4:36 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 4:17 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 3:35 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.

If we use a PC to predict, we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size, not because they're actually related in current use.

If your per-vector-register predictor works well to begin with, I would think you could extend it with a valid bit that indicates whether to use the prediction or the encoded hint, and it would probably work OK.  Right?
How are you thinking that bit gets set/cleared?  The same store instruction is used whether or not the compiler will be able to put in a hint.

The possibility of a second store (for which there's easy encoding space) would allow the bit to be set or cleared depending on which store instruction was used.  And that was what I was thinking might work with the suggestion above for an additional store instruction but no additional load instructions.

      Bill


The saving grace is that the misprediction penalty isn't nearly as extreme as, say, a branch misprediction in an OOO superscalar.  If we can get rid of, say, 80% of the hiccups on interprocedure spills and fills (and of course 100% of intraprocedure ones) then the perf impact of the hiccups won't be a huge thing.

I'm hoping such a predictor is not needed for the reasons you say.  But I want to know that there is an "out" if it comes to that.

      Bill


Re: Vector Byte Arrangement in Wide Implementations

Bill Huffman
 


On 11/5/20 4:36 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 4:17 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 3:35 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.

If we use a PC to predict, we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size, not because they're actually related in current use.

If your per-vector-register predictor works well to begin with, I would think you could extend it with a valid bit that indicates whether to use the prediction or the encoded hint, and it would probably work OK.  Right?
How are you thinking that bit gets set/cleared?  The same store instruction is used whether or not the compiler will be able to put in a hint.

The saving grace is that the misprediction penalty isn't nearly as extreme as, say, a branch misprediction in an OOO superscalar.  If we can get rid of, say, 80% of the hiccups on interprocedure spills and fills (and of course 100% of intraprocedure ones) then the perf impact of the hiccups won't be a huge thing.

I'm hoping such a predictor is not needed for the reasons you say.  But I want to know that there is an "out" if it comes to that.

      Bill


Re: Vector Byte Arrangement in Wide Implementations

Andrew Waterman
 



On Thu, Nov 5, 2020 at 4:17 PM Bill Huffman <huffman@...> wrote:


On 11/5/20 3:35 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.

If we use a PC to predict, we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size, not because they're actually related in current use.

If your per-vector-register predictor works well to begin with, I would think you could extend it with a valid bit that indicates whether to use the prediction or the encoded hint, and it would probably work OK.  Right?

The saving grace is that the misprediction penalty isn't nearly as extreme as, say, a branch misprediction in an OOO superscalar.  If we can get rid of, say, 80% of the hiccups on interprocedure spills and fills (and of course 100% of intraprocedure ones) then the perf impact of the hiccups won't be a huge thing.

      Bill

      Bill



Re: Vector Byte Arrangement in Wide Implementations

Bill Huffman
 


On 11/5/20 3:35 PM, Andrew Waterman wrote:



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.

If we use a PC to predict, we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size, not because they're actually related in current use.

      Bill

      Bill



Re: Vector Byte Arrangement in Wide Implementations

Andrew Waterman
 



On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.

      Bill



Vector Byte Arrangement in Wide Implementations

Bill Huffman
 

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use and the hiccup will be avoided.
    • Filling with the wrong size will be a compiler performance bug
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

      Bill



Re: [RISC-V] [tech] [RISC-V] [tech-*] STRATEGIC FEATURE COEXISTANCE was:([tech-fast-int] usefulness of PUSHINT/POPINT from [tech-code-size])

Guy Lemieux
 

Thanks Tim, I think that sums it up nicely.

I just wanted to put a pointer out to the original post that I made on isa-dev regarding opcode sharing / management:

It was very much motivated by the need to share the custom opcode blocks. Now it seems the official extensions will run out of opcode space too, so either we go 64b or we learn to share :-)

There may be some useful tidbits in the dialog that followed from that point forward.

Incidentally, for things like the vector extension, I think 64b encodings are inevitable. The opcode space is needed, particularly to allow orthogonality in all of the different operand modes. Plus, there is little harm — the majority of code is still scalar, and every 64b vector instruction replaces a handful of scalar operations so it is actually more compact than the scalar ISA already.

Ciao,
Guy



On Mon, Nov 2, 2020 at 5:58 PM Tim Vogt <tim.vogt@...> wrote:

This is definitely a subject ripe for broader discussion.  In the FPGA special interest group we’ve been working on a framework for combining separately authored custom function units for the last year or so, and one of the key debate topics has been how to manage the instruction encoding space.  As I read through this email thread, I can safely say that we’ve discussed all of the same topics and all of the same approaches listed here at one point or another.  For our particular scope, we’ve converged on a bank-switching style implementation using a CSR, but that’s a very tactical approach for our specific goals.

Since this is a common problem that keeps coming up in multiple contexts, it’s worth exploring whether we can come up with a more strategic framework and get consensus within the community, to avoid fragmentation and constant reinvention of the wheel.

--Tim Vogt

--
Tim Vogt
Distinguished Engineer
Lattice Semiconductor
Office: 503-268-8068


From: tech@... <tech@...> On Behalf Of David Horner via lists.riscv.org
Sent: Monday, October 26, 2020 9:16 AM
To: Allen Baum <allen.baum@...>
Cc: Robert Chyla <Robert.Chyla@...>; tech-code-size@...; Greg Favor <gfavor@...>; Tariq Kurd <tariq.kurd@...>; Bill Huffman <huffman@...>; tech-fast-int@...; jeremy.bennett@...; tech@...; tech-vector-ext@...; tech-p-ext@...; tech-bitmanip@... <tech-bitmanip@...>
Subject: Re: [RISC-V] [tech] [RISC-V] [tech-*] STRATEGIC FEATURE COEXISTANCE was:([tech-fast-int] usefulness of PUSHINT/POPINT from [tech-code-size])

On 2020-10-26 12:48 a.m., Allen Baum wrote:

Are we talking about something that is effectively bank switching the opcodes here?

That is one approach. It is a consideration that has recently been mentioned wrt misa.

Something like that was proposed very early on, using a CSR (like MISA maybe - the details are lost to me) to enable and disable them.

I remember Luke Kenneth Casson Leighton <lkcl@...> was in on the discussions.

A variety of CSR and related approaches were considered.

The specific issue that brought it up is if someone developed a custom extension, did a lot of work, and then some other extension came along that stepped on those opcodes - and the implementation wanted to use both of them.

The author thought it was pretty obvious this kind of thing was going to happen. I don't think that exact scenario will, but running out of standard 32b opcodes with ratified extensions might.

Exactly.

Also in lkcl's case the "vectorization" extension of all opcodes is [was proposed] of this nature.

We're already starting to look at the long tail - extensions that are specialized to specific workloads, but highly advantageous to them.

I'm guessing we will get to the point that these extensions will not have to coexist inside a single app, though - so a bank switching approach (non-user mode at the least, perhaps not within an app at all) could potentially work, but it sounds ugly to make the tools understand the configuration.

Agreed. Thus the uni-op-code approach, which can co-exist with any of these strategies but provides a framework to manage them (just as ASCII and EBCDIC extensions are comparably managed).

On Sat, Oct 24, 2020 at 8:23 AM ds2horner <ds2horner@...> wrote:

These are all important considerations.

However, what they have in common when considering Allen's question:

This discussion is bringing up an issue that needs wider discussion about extensions in general.

is that they are all tactical considerations in the context of our current framework of instruction space allocation. What we will find is that these trade-off considerations will reinforce the dilemma that Allen raises. How do we manage these conflicting "necessities/requirements" of different target environments?

I have hinted at it already: we need not only tactical analysis of feature tradeoffs in different domains but a strategic approach to support them.

The concern is nothing new. It has been raised, if only obliquely, many times prior on the [google] groups.riscv.org (-dev, -sw especially) and lists.riscv.org TG threads.

The vector group, especially, has grappled with it in the context of the current V encoding being a subset of a [hypothetical] 64 bit encoding.

Specific proposals have been mentioned, but there was then no political will or, perhaps more fairly, no common perception that there was a compelling reason to work systematically to address it. The [then] common thinking was that the 48 and 64 bit instruction spaces will be used as 32 and 16 bit are exhausted, and everyone will be happy. Well, that naive hope has not materialized, and many are envisioning clashes that will hurt RISC-V progress, either fragmentation or stagnation, as tactical approaches and considerations are implemented or debated.

Previously two major strategic approaches were hinted at, even if they were not outright proposed.

Hardware Support - this has been explicitly proposed in many flavours, and is currently in the minds of many.
     The idea is a mode shift analogous to
        arm's transition to thumb and back and
        intel's myriad of operating modes: real, protected, virtual, long and their disparate instantiations.
     I agree that implementations should have considerable freedom on how to provide hardware selectable functionality.
     However, a proposed framework to support that should be provided by RISV.org.
     Recent discussion and document tweaks about misa (Machine ISA register) suggest that this mechanism, though valuable, is inadequate as robust support for the explosion of features.
     An expanded framework will be necessary, perhaps along the lines of the two level performance counters definitions.
     The conflict with overlapping mappings of groups of instructions to the same encoding space is not easily addressed by this mechanism.

which leads us to

Software Support:

The Generalized Proposal:
All future extensions are not mapped to a fixed exclusive universal encoding, but rather to an appropriately sized [based initially off 32 isize] minor [22-bit], major [25-bit] or quadrant [30-bit] encoding, that is allocated to the appropriate instruction encoding at link/load time to match the hardware [or hardware dynamic configuration, as above].
This handles the green field encodings.
Each feature could have a default minor/major/quadrant encoding designation.

Brown field can also be managed, simply if the related co-encoded feature is present, with more complexity, and perhaps extensive opcode mapping if blended into other features' encodings.

An implementation method would be to have a fixed exclusive universal prefix for each feature. Each instruction would then be emitted by the compiler as a [prefix]:[instruction with default encoding] pair. If the initial prefixes are also nops [most of which are currently designated as hints], then the code would be executable on machines that use the default mapping without any link/load intervention [at lower performance, granted].
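
A purely hypothetical sketch of the emitted pair (none of these mnemonics exist; they are only meant to illustrate the [prefix]:[default encoding] idea):

    # Hypothetical assembly, for illustration only.
    feat.sel  FEAT_FOO          # fixed, exclusive per-feature prefix; encoded as a HINT, so it
                                # executes as a nop on hardware that uses the default mapping
    foo.op    a0, a1, a2        # the feature's instruction in its default minor/major opcode slot
    # On other hardware, the linker/loader rewrites the pair into whatever encoding slot
    # (or 64-bit form) that implementation assigns to the feature.
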
This approach is backward compatible for the other established extensions: most notably F, which consumes 7 major opcode spaces [and *only* 5 with Zfinx (Zifloat?)], and then AMO, which also consumes the majority of a major opcode.

This strategic change has a number of immediate and significant benefits:
  1) custom reserved major op codes effectively become unreserved as "standard" extensions can be mapped there also.
       The custom reserved nature will then only be the designated default allocation; "standard extensions" will not default to them.
  2) as mentioned above, if the prefix is a nop then link/load support is not needed for direct execution support [only efficiency].
  3) the transition to higher bit encodings can be simplified, as easily as the compiler emitting the designated prefix for that feature that encodes for 64 bit instructions.
So, two assigned fixed exclusive encodings per feature may be useful: one a 64-bit encoding and one a nop.

I do not intend to stifle any of the tactical discussions of co-usefulness of features and profile domains.
These are meaningful and useful considerations.

Rather, I hope that by having a framework for coexistence of features, those discussions can proceed in a more guided way; that discoveries can be incorporated into a framework-centric corpus of understanding of trade-offs and cooperative benefits of features/profiles.

On 2020-10-23 11:45 p.m., Robert Chyla wrote:

I agree with Greg's statements. For me 'code-size' is very important for small, deeply embedded/IoT-class small systems.

Work in other groups (bitmanip) will also benefit code size, but it is not the primary focus I think, as these will also improve code-speed.

Linux-like big processors usually have DDR RAM and code size is 'unlimited'.
It should not hurt, as code-size advances will benefit such big systems, but we should not forget about 'cheap to implement'='logic size' factors.

IMO 'code-size' and 'code-speed' will be pulling the same rug (ISA-space) in opposite directions. We must balance it properly - having the rug in one piece is IMO most important.

Regards,
/Robert

On 10/23/2020 5:11 PM, Greg Favor wrote:

It seems like a TG, probably through the statement of its charter, should clearly define what types or classes of systems it is focused on optimizing for (if there is an intended focus) and what types or classes of systems it does not expect to be appropriate for.  More concretely, it seems like there are a few TG's developing extensions oriented towards embedded real time systems and/or low-cost embedded systems.  These are extensions that would probably not be implemented in full-blown Linux-class systems.  Those extensions don't need to worry about being acceptable to such system designs, and can optimize for the requirements and constraints of their target class(es) of systems.

Unless I'm mistaken, this TG falls in that category.  And as long as the charter captures this, then the extension it produces can be properly evaluated against its goals and target system applications (and not be judged wrt other classes of systems).  And key trade-off considerations - like certain types of implementation approaches being acceptable or unacceptable for the target system applications - should probably be agreed upon early on.

Greg

On Fri, Oct 23, 2020 at 4:34 PM Allen Baum <allen.baum@...> wrote:

This discussion is bringing up an issue that needs wider discussion about extensions in general.
RISC-V is intended to be an architecture that supports an extremely wide range of implementations, ranging from very low gate count microcontrollers to high end superscalar out-of-order processors.
How do we evaluate an extension that only makes sense at one end or the other?

I don't expect a vector, or even hypervisor, extension in a low gate count system.
There are other extensions that are primarily aimed at specific application areas as well.

A micro sequenced (e.g. push/pop[int]) op might be fairly trivial to implement in a low gate count system (e.g. without VM, but with PMPs) and have significant savings in code size, power, and increased performance.
They may have none of those, or less significant, advantages in a high end implementation -- and/or might be very difficult or costly to implement in them (e.g. for TLB miss, interrupt, & exception handling).
(I am not claiming that these specific ops do, but just pretend there is one like that.)

Should we avoid defining instructions and extensions like that?
Or just allow that some extensions just don't make sense for some class of implementation?
Are there guidelines we can put in place to help make those decisions?
This same (not precisely the same) kind of issue is rearing its head in other places, e.g. range based CMOs.

--
Regards,
Robert Chyla,
Lead Engineer, Debug and Trace Probe Software
IAR Systems
1211 Flynn Rd, Unit 104
Camarillo, CA 93012 USA
Office: +1 805 383 3682 x104
E-mail: Robert.Chyla@... Website: www.iar.com



Re: Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression

lidawei14@...
 

Hi all,

If I use EDIV to compute SpMV y = A * x as size r * c blocks, I might have to load size r of y and size c of x; these are shorter than VL = r * c. Is there an efficient way to do this with the current support?

If I would like to use a mask to compress, for VL = 16, I can store a 16-bit value to memory and load it into a GPR, but then how can I transform it into a vector mask?
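
For reference, one possible sequence for that last step (a sketch only; there may be a cleaner idiom) is to place the 16-bit pattern in element 0 of v0, since mask bit i lives in bit i of the mask register:

    # Assumes the 16-bit mask pattern is in memory at (a0); register choices are arbitrary.
    lhu      t1, 0(a0)          # load the 16-bit pattern into a GPR
    li       t0, 1
    vsetvli  x0, t0, e16,m1     # vl = 1, SEW = 16 covers mask bits 0..15 in element 0
    vmv.s.x  v0, t1             # element 0 of v0 <- pattern, so mask bit i = bit i of the pattern
    # restore the vl/vtype needed by the masked computation before using v0.t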

Thank you,
Dawei
