Vector Memory Ordering


Bill Huffman
 

I think from this morning, we are considering:

  1. Ordered scatters are done truly in order
  2. Strided stores that overlap (including segmented ones) will trap as illegal
  3. All other vector loads and stores do their memory accesses in arbitrary order.
  4. A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
  5. All loads with vector sources must use a different register for the destination than any source (including mask).
  6. Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??

A few of the consequences of this are:

  • A gather with repeated elements can access the higher numbered elements first and lower ones later
  • A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
  • A stride-0 load from an "incrementing" location can return larger values at lower element numbers and smaller values at higher element numbers
  • When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
  • Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected. 
  • A stride-0 store to a FIFO will trap
  • A stride-0 load from a FIFO will pop an arbitrary number of entries (from 1 up to more than vl), and the popped entries are distributed arbitrarily among the result elements.
  • A non-idempotent memory location accessed by a vector load may be accessed multiple times.

We need to be sure software is OK with these characteristics, since "ordered channels" and non-idempotent regions can't be known at compile time.  Even strides can't always be known at compile time.  Will this plan reduce the amount of auto-vectorization that can be done?
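For illustration, a minimal C sketch (not from the original message) of the kind of loop a compiler might auto-vectorize into strided loads and stores.  Both the destination pointer and the stride are runtime values, so nothing at compile time rules out a zero or self-overlapping stride, or a destination in an "ordered channel" or non-idempotent region:

    #include <stddef.h>
    #include <stdint.h>

    /* dst, src, stride, and n are all runtime values.  If the compiler turns
     * this into strided vector loads/stores, it cannot prove at compile time
     * that stride != 0, that the store does not overlap itself, or that dst
     * does not point into an "ordered channel" / non-idempotent region. */
    void scale_column(int32_t *dst, const int32_t *src,
                      size_t stride, size_t n, int32_t k)
    {
        for (size_t i = 0; i < n; i++)
            dst[i * stride] = k * src[i];
    }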

Exception reporting still has issues:

  • Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
  • For loads and stores, watchpoints can fire more than once unless there is some representation of which elements are complete.
  • There may need to be a way to report a watchpoint on one element but restart on an earlier element
  • If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element?  Does that help anything if we do?

I'd like to see us relax the ordering of gathers and unordered scatters with respect to younger instructions in some way.  If we don't, younger scalar memory accesses will stall for some time, because address comparisons are much more difficult than for unit-stride or even strided accesses.

      Bill



Bill Huffman
 

Guy Lemieux commented:

I think 90%+ of implementations will choose to do ordered loads and stores even though unordered is permitted.

This means programmers will expect them to be ordered, and such software will not work properly on the remaining implementations. This compatibility problem is a concern.

I think the best way to combat this is to have 2 sets of instructions: ordered and unordered. The unordered instructions can simply do the ordered thing in simple implementations.

Ordered stores to a FIFO are a paradigm I was hoping to use for inter-processor communication.

I think compiler considerations are also important, but I don’t know the implications here.

Guy

Maybe what's below could be improved by saying that if the base address (in src1) was non-idempotent or an "ordered channel," the entire instruction would run in order.   If not, it would not.  We could allow stride of zero but not other overlapping strides for stores.  Having a later access come into a non-idempotent or ordered region would raise an exception.  That would provide for loads from and stores to a FIFO to work.  But it wouldn't provide for an instruction to "fall into" such a memory segment part way through.
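A rough sketch of the decision being described above (my illustration only; the pma_t type and the pma_of, element_address, access_element, and raise_trap helpers are invented names, not anything defined by the spec).  The PMA of the base address alone selects ordered versus unordered execution, and an element that later lands in an ordered or non-idempotent region when the base did not raises an exception:

    #include <stdint.h>

    typedef enum { PMA_NORMAL, PMA_ORDERED_CHANNEL, PMA_NON_IDEMPOTENT } pma_t;

    extern pma_t    pma_of(uint64_t addr);          /* invented: PMA lookup        */
    extern uint64_t element_address(unsigned i);    /* invented: per-element EA    */
    extern void     access_element(unsigned i);     /* invented: do the load/store */
    extern void     raise_trap(unsigned i);         /* invented: report exception  */

    void vector_mem_op(uint64_t base, unsigned vl)
    {
        /* Ordering is decided once, from the base address in src1. */
        int in_order = (pma_of(base) != PMA_NORMAL);

        for (unsigned i = 0; i < vl; i++) {         /* sequential when in_order;
                                                       otherwise any order is allowed */
            uint64_t ea = element_address(i);
            if (!in_order && pma_of(ea) != PMA_NORMAL) {
                raise_trap(i);                      /* "falling into" an ordered or
                                                       non-idempotent region traps   */
                return;
            }
            access_element(i);
        }
    }

Under this reading, a stride-0 load or store whose base is the FIFO address runs fully in order, which is what makes the FIFO case work.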

      Bill

On 9/4/20 10:18 AM, Bill Huffman wrote:




swallach
 

i have not been following this thread in lots of detail


could someone please explain why we need to differentiate  between ordered and unordered load/stores.

in the 6 or so vector systems i have been involved with, vector references bypassed the cache, and main memory was highly interleaved.

compilers could not care less.  one of the major performance optimizations was to eliminate power-of-2 strides (generally done manually).

at convey we even had an option to deploy a prime-number interleaved memory system. (the bsp did this first)

i saw one application, a major cfd code, that sorted references to a stencil-based reference pattern.  this was done to optimize performance for cache-based vector systems.

with the presence of HBM memory systems, and some clever memory controller design (that could use vector reference information), i am pretty sure ordered and unordered loads/stores will have the same implementation.  to be more specific, a memory design for GUPS would have this type of implementation.


i look forward to a better understanding


andrew@...
 



On Fri, Sep 4, 2020 at 10:19 AM Bill Huffman <huffman@...> wrote:

I think from this morning, we are considering:

  1. Ordered scatters are done truly in order
  2. Strided stores that overlap (including segmented ones) will trap as illegal
This is not terribly straightforward.

I'll assume that the trap would only be a function of the stride and element/segment width, rather than checking that two active elements actually overlap at runtime.  Even so, very large strides can foul this up.  Consider (in RV32) vssseg8e32 with vl=5 and stride=0xC0000004.  Elements 0 and 4 overlap!  (This phenomenon can also happen for non-segment strided stores by using a misaligned stride, which (for good reason) is a valid thing to do.)
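A quick check of that arithmetic (my sketch, not part of Andrew's mail), modeling RV32 address wrap with uint32_t: with vl=5 both segments are active, and segment 4 lands 16 bytes past segment 0, inside segment 0's 32-byte (8 fields x 4 bytes) footprint:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const uint32_t stride    = 0xC0000004u;  /* byte stride from the example        */
        const uint32_t seg_bytes = 8u * 4u;      /* vssseg8e32: 8 fields x 4 bytes each */
        const uint32_t base      = 0x10000000u;  /* arbitrary example base address      */

        uint32_t a0 = base + 0u * stride;        /* start of segment 0                  */
        uint32_t a4 = base + 4u * stride;        /* wraps modulo 2^32, as on RV32       */

        /* a4 - a0 == 0x10, i.e. segment 4 begins 16 bytes into segment 0. */
        printf("segment 0 @ 0x%08" PRIx32 ", segment 4 @ 0x%08" PRIx32 ", overlap: %s\n",
               a0, a4, (uint32_t)(a4 - a0) < seg_bytes ? "yes" : "no");
        return 0;
    }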

It's not an inexpensive computation in general.  It would be better, I think, to either make them in-order or to permit arbitrary reordering than to trap.

  3. All other vector loads and stores do their memory accesses in arbitrary order.
  4. A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
  5. All loads with vector sources must use a different register for the destination than any source (including mask).
Why is this change necessary?  Currently, depending on EEW, non-segment indexed loads are allowed to overlap the index register.  Are you suggesting that, not only can indexed loads access memory in arbitrary order, they can also write back imprecisely (past vstart), destroying the index register?

(Mask register overlap is already forbidden by the usual rules for different-EEW overlap.)
  6. Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??




David Horner
 


On 2020-09-04 1:18 p.m., Bill Huffman wrote:

I think from this morning, we are considering:

  1. Ordered scatters are done truly in order

I tentatively agree *if* "truly in order" means "as if writes were the equivalent scalar writes of the register elements processed in order from 0 to vl".

However, RVWMO should be assumed and no further strengthening of "order" should be stipulated.

Non-overlapping element writes are allowed to freely pass each other in RVWMO global memory order.
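For concreteness, a C sketch of that "as if" reference model for an ordered scatter (my illustration, not David's; byte offsets and 32-bit data elements assumed, and the function and parameter names are invented).  Overlapping writes resolve as if the equivalent scalar stores were issued in element order 0..vl-1, while RVWMO still governs what other harts may observe:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Reference semantics only: element stores applied in order 0..vl-1,
     * so later elements win when offsets collide.  Masked-off elements
     * perform no access at all. */
    void ordered_scatter_ref(uint8_t *base, const uint32_t *offsets,
                             const uint32_t *data, const uint8_t *mask,
                             size_t vl)
    {
        for (size_t i = 0; i < vl; i++) {
            if (mask != NULL && !mask[i])
                continue;
            memcpy(base + offsets[i], &data[i], sizeof data[i]);
        }
    }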


I would however prefer to weaken the constraint

as #528 recommends (in addition to a mnemonic change):

...  VSUXEI  ... stipulate that the net result of the operation allows multiple outcomes visible to the local hart; essentially depending upon which element is last written to overlapping memory locations.

...  [VSXEI] to stipulate that the data written to memory will be as if the active elements were first copied to an XLEN length buffer ... processing the elements in the order from 0 to vl, and then only the affected bytes written to memory in whatever order RVVWMO allows.

...

The ideas are that

  1. rather than suggesting VSUXEI is the exception to the rule, it indicates that the ordered VSOXEI is a more constrained alternative to the default processor model that allows concurrent execution wherever possible and
  2. VSOXEI makes no guarantee on what is globally visible beyond what the RVVWMO model allows
  3. VSUXEI is still constrained by the RVVWMO rules.



David Horner
 


On 2020-09-04 5:53 p.m., Andrew Waterman wrote:


On Fri, Sep 4, 2020 at 10:19 AM Bill Huffman <huffman@...> wrote:

I think from this morning, we are considering:

  1. Ordered scatters are done truly in order
Just a quick note here: if Ztso is active, then "truly in order" has global ordering too.
  2. Strided stores that overlap (including segmented ones) will trap as illegal
This is not terribly straightforward.

I'll assume that the trap would only be a function of the stride and element/segment width, rather than checking that two active elements actually overlap at runtime.
Much of the discussion was driven by consideration for the difficulty in runtime overlap detection and challenges to ensure "correct" function.
Even so, very large strides can foul this up.  Consider (in RV32) vssseg8e32 with vl=5 and stride=0xC0000004.  Elements 0 and 4 overlap! 

Thinking was

1) such a restriction would be addressed by the application falling back to vl=1 iterations

2) such a restriction could be relaxed later and thus defer addressing all permutations such as this. 

(This phenomenon can also happen for non-segment strided stores by using a misaligned stride,
Thinking was misaligned stride would also be restricted with the same fallback.
which (for good reason) is a valid thing to do.)

I believe misaligned stride could be very valuable for load, less valuable for store.

Would you please elaborate on the good reason(s) for misaligned stride?



It's not an inexpensive computation in general.  It would be better, I think, to either make them in-order or to permit arbitrary reordering than to trap.

I am in favour of arbitrary reordering, but I'd prefer to refer to it as parallel execution; memory ordering is a distinct issue and is often conflated with vector element processing sequences.


  3. All other vector loads and stores do their memory accesses in arbitrary order.
  4. A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
  5. All loads with vector sources must use a different register for the destination than any source (including mask).
Why is this change necessary?  Currently, depending on EEW, non-segment indexed loads are allowed to overlap the index register.  Are you suggesting that, not only can indexed loads access memory in arbitrary order, they can also write back imprecisely (past vstart),
yes. It was noted that this is already allowed by some instructions and explicitly by the relaxed (point 4) definition of precise vector traps.
destroying the index register?
potentially, so the restriction avoids the situation, seen in other instructions, where vrestart is desired but costly, difficult, or impossible to implement without it.

(Mask register overlap is already forbidden by the usual rules for different-EEW overlap.)


David Horner
 


On 2020-09-04 5:00 p.m., swallach wrote:

i have not been following this thread in lots of detail


could someone please explain why we need to differentiate  between ordered and unordered load/stores.

These issues discuss the need for order and reasons to have variants:

https://github.com/riscv/riscv-v-spec/issues/501 Unordered Indexed Load

https://github.com/riscv/riscv-v-spec/issues/502  Unordered Index Operation Memory Ordering Description

https://github.com/riscv/riscv-v-spec/issues/504  Further Ordering Relaxation of Unordered Indexed Stores/Loads

https://github.com/riscv/riscv-v-spec/issues/528  rename VSXEI<EEW> to VSOXEI<EEW>


in the 6 or so vector systems i have been involved with, vector references bypassed the cache, and main memory was highly interleaved.

And this part has not been as extensively discussed. RISC-V has a mechanism to characterize memory regions, including channels with "special" features.

The idea is that RISC-V vector should be all things to all needs, and failing that, it should address as many needs as practical.

One of those "needs" is memory coherence that agrees with the RVWMO model.

I am in favour of a vector-enhanced RVWMO, "RVVWMO", if it proves to be beneficial. Specifically, we propose a weaker RVVWMO for software to validate against, and advise that, to be safe, all future implementations only use RVWMO. This allows the proposed RVVWMO to be restrained later if issues do arise.


compilers  could not care less.
One objective is autovectorization, and for this case compilers definitely care.
 one of the major performance optimizations,  was to eliminate power of 2 strides. (generally manually done)

at convey we even had a option to  deploy a prime number of interleave memory system. (the bsp first did this)

i  saw one application,  a major cfd  code, that sorted references to a stencil based reference pattern.  this was done to optimize performance for cache based vector systems.

with the presence of HBM memory systems, and some cleaver memory controller design (that could be done with vector references information),  i am pretty sure ordered and unordered  loads/stores will have the same implementation. to be more specific,  a memory design for GUPS,  would have this  type of implementation


i    look forward to a better understanding
Thank you for all your inputs that have helped me get a better understanding.


mark
 

Bill, great lists.

can we start building a testcase list for these situations and others? maybe a github doc? i am sure these will be documented but I don't want to lose bullet lists like this and have to derive them from a doc later if we can avoid it.

Allen, does the compliance group have a proposed way (or format) to define testcases and input data sets? If not should we?

general note, I think we need to specifically add debug to the definition of done. I can imagine things like watchpoints affecting a lot of extensions in interesting ways.

Mark



On Fri, Sep 4, 2020 at 10:18 AM Bill Huffman <huffman@...> wrote:




Bill Huffman
 


On 9/4/20 10:25 PM, David Horner wrote:


On 2020-09-04 1:18 p.m., Bill Huffman wrote:

I think from this morning, we are considering:

  1. Ordered scatters are done truly in order

I tentatively agree *if* "truly in order" means "as if writes were the equivalent scalar writes of the register elements processed in order from 0 to vl".

However, RVWMO should be assumed and no further strengthening of "order" should be stipulated.

Non-overlapping element writes are allowed to freely pass each other in RVWMO global memory order.

Yes, that's what I intended.

      Bill





Allen Baum
 

If the memory ordering (in this case specifically memory access ordering, I think - correct me if I'm wrong) doesn't affect the final processor state (which would not be the case for a vector reduce that started reduction as the values became available, instead of in some specific order), I don't know that compliance tests can tell the difference.

Where this does affect processor state is exception reporting: a golden model which performs accesses in one order may not have the same exception state as an implementation that accesses in a different order. Ditto for watchpoints (are they considered architectural?). A possible way to fix this is to have the golden model access in a canonical order, and have implementations delay exception reporting until all accesses from logically previous addresses are complete (with any logically previous exception overriding one that was reported earlier).

Debug compliance testing will be very interesting....

Non-idempotent memory accesses will also affect final state, but I'd argue that is non-ISA and we won't be testing that.
Concurrent, overlapping accesses is another pain point.

On Tue, Sep 8, 2020 at 7:21 AM mark <markhimelstein@...> wrote:



Bill Huffman
 


On 9/8/20 1:47 PM, Allen Baum wrote:

If the memory ordering (in this case specifically memory access ordering, I think - correct me if I'm wrong) doesn't affect the final processor state (which would not be the case for a vector reduce that started reduction as the values became available, instead of in some specific order), I don't know that compliance tests can tell the difference.

Where this does affect processor state is exception reporting: a golden model which performs accesses in one order may not have the same exception state as an implementation that accesses in a different order. Ditto for watchpoints (are they considered architectural?). A possible way to fix this is to have the golden model access in a canonical order, and have implementations delay exception reporting until all accesses from logically previous addresses are complete (with any logically previous exception overriding one that was reported earlier).

From most points of view it's a fine method (delay exception reporting until all accesses from logically previous addresses are complete and override).  But it seems to require either that stores are done in order anyway, or that stores of vector elements can be done more than once (only until the instruction completes).  I think I'm OK with vector elements being stored more than once during the instruction, but I don't know that it will be accepted easily.  The other choice seems to be to do many stores in order anyway.  For reasonably small strides, it may be possible to do exception checks all at once, but for large strides and scatters, I don't expect it is.

      Bill





Allen Baum
 

You don't have to ensure previous stores (or loads) have completed - I misspoke. You have to ensure that they don't trap, which might be a much easier proposition. Implementations may typically do those checks in order anyway, and the checks only need to be done once per page (or PMA or PMP region, unfortunately).
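A sketch of that per-page hoisting for a strided store (my illustration only; check_store_ok is an invented stand-in for the MMU/PMP/PMA permission check).  The check repeats only when an element crosses into a new page, so small strides check a handful of pages while very large strides degenerate to roughly one check per element:

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12

    extern bool check_store_ok(uint64_t page_addr);   /* invented helper */

    /* Returns true if no element of the strided store would trap. */
    bool strided_store_checks(uint64_t base, int64_t stride,
                              unsigned elem_bytes, unsigned vl)
    {
        uint64_t last_page = UINT64_MAX;               /* sentinel: no page checked yet */
        for (unsigned i = 0; i < vl; i++) {
            uint64_t ea = base + (uint64_t)i * (uint64_t)stride;   /* modular, like HW */
            for (unsigned b = 0; b < elem_bytes; b++) {            /* element may span pages */
                uint64_t page = (ea + b) >> PAGE_SHIFT;
                if (page != last_page) {
                    if (!check_store_ok(page << PAGE_SHIFT))
                        return false;                  /* this element would trap */
                    last_page = page;
                }
            }
        }
        return true;
    }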

On Tue, Sep 8, 2020 at 2:08 PM Bill Huffman <huffman@...> wrote:





Bill Huffman
 

The problem is not that you have to ensure previous (in the scalar loop order sense) entries don't trap before you take the exception on the later one.  The problem is that you have to not complete a later store before checking for earlier traps.  That is unfortunate for all vector stores, but it is a disaster for scatters.  It's impossible to know scatters won't trap unless they're run through the MMU/PMP.  So you have to run all lower-numbered elements through the MMU/PMP before you allow a higher-numbered one to complete.  In an in-order implementation that means doing them in order; in an out-of-order implementation it is a significant constraint.

The problem, as I see it, is that reporting an exception on a higher-numbered element when a lower-numbered element has not yet been stored requires more state than vstart.  I think we either have to add exception state (not sure how) or allow stores to be done more than once.  This seems critical for scatters - at least unordered ones.  For strides larger than very small values, it is restrictive as well and we need to consider it, I think.  In that case, do we extend the same to unit or small strides as well?

      Bill


On 9/8/20 4:41 PM, Allen Baum wrote:




Guy Lemieux <guy.lemieux@...>
 

I'm less concerned with making the FIFO example work, and more concerned with ensuring programs are fully portable across all implementations.

Programmers don’t tend to read ISA specs and details. They try it, and infer behaviour from the implementations they have available to them. Hence, I believe there is a risk that many programmers will assume ordered stores in particular, but also ordered loads.

The ISA cost of providing 2 modes, ordered and unordered, is either 1 bit in the ISA encoding or a mode bit along the lines of setvl. The implementation cost is nearly zero. The benefit is fully portable programs without surprises.

Guy




Ken Dockser
 

Reviving this old thread with a question and a suggestion:

Question: What is the use case for supporting non-idempotent memory in a RISC-V Vector implementation as a part of a general-purpose rich-OS application processor? While I can certainly envision embedded applications that use non-idempotent memory, it seems unlikely that non-idempotent memory would be used when running arbitrary application code.

Suggestion: The current Vector specification's comments about supporting non-idempotent memory can easily mislead one into thinking that such support is required in all compliant implementations. We need an explicit  clarification in the specification along the lines of "Vector extension support for handling non-idempotent memory accesses is not required in implementations that prohibit or otherwise prevent (e.g., by trapping) such accesses." While my suggested sentence would likely benefit from some wordsmithing, I think that what I am trying to convey is essential in defining what is architecturally required.

Thanks,
Ken


Krste Asanovic
 

On Mon, 13 Dec 2021 12:09:54 -0800, "Ken Dockser" <kad@...> said:
| Reviving this old thread with a question and a suggestion:
| Question: What is the use case for supporting non-idempotent memory in a RISC-V Vector implementation as a part of a general-purpose
| rich-OS application processor? While I can certainly envision embedded applications that use non-idempotent memory, it seems unlikely
| that non-idempotent memory would be used when running arbitrary application code.

To reduce overhead, some embedded Linux systems allow user-mode code
to access devices directly, e.g., for dpdk networking. If anything,
there is a trend to support more user-level access to
devices/accelerators to reduce overhead, and to provide more isolation
between tasks (as opposed to shared device driver in kernel). Such
devices might have non-idempotent memory regions.

| Suggestion: The current Vector specification's comments about supporting non-idempotent memory can easily mislead one into thinking
| that such support is required in all compliant implementations. We need an explicit clarification in the specification along the
| lines of "Vector extension support for handling non-idempotent memory accesses is not required in implementations that prohibit or
| otherwise prevent (e.g., by trapping) such accesses." While my suggested sentence would likely benefit from some wordsmithing, I
| think that what I am trying to convey is essential in defining what is architecturally required.

We could add some non-normative text as a note to implementers, but
this allowance just follows from the general RISC-V concept that
certain memory address ranges only support certain operations (PMAs).

Calling this out as a special case in the spec could then require
repeating the statement throughout all the memory instructions for
consistency, to avoid questions about why some instructions have
optional support for some memory types versus others. We try to
factor out these concepts in the spec.

Krste


| Thanks,
| Ken
|