Re: Vector Memory Ordering


David Horner
 


On 2020-09-04 5:00 p.m., swallach wrote:

i have not been following this thread in lots of detail


could someone please explain why we need to differentiate  between ordered and unordered load/stores.

These issues discuss the need for order and reasons to have variants:

https://github.com/riscv/riscv-v-spec/issues/501 Unordered Indexed Load

https://github.com/riscv/riscv-v-spec/issues/502  Unordered Index Operation Memory Ordering Description

https://github.com/riscv/riscv-v-spec/issues/504  Further Ordering Relaxation of Unordered Indexed Stores/Loads

https://github.com/riscv/riscv-v-spec/issues/528  rename VSXEI<EEW> to VSOXEI<EEW>


in the 6 or so vector systems i have been involved with,  vector references bypassed the cache, and main memory was highly interleaved.

And this part has not been as extensively discussed. RISCV has a mechanism to characterize memory regions, including channels with "special" features.

The idea is that RISCV vector should be all things to all needs, and failing that, it should address as many needs as practical.

One of those "needs" is memory coherence that agrees with the RVWMO model.

I am in favour of a vector-enhanced RVWMO, "RVVWMO", if it proves to be beneficial. Specifically, we propose a weaker RVVWMO for software to validate against, and advise that all future-safe implementations use only RVWMO. This allows the proposed RVVWMO to be tightened if issues do arise.


compilers  could not care less.

One objective is autovectorization, and for this case compilers definitely care.

one of the major performance optimizations,  was to eliminate power of 2 strides. (generally manually done)

at Convey we even had an option to deploy a memory system interleaved across a prime number of banks. (the BSP first did this)
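
To illustrate why power-of-2 strides hurt on an interleaved memory, and why a prime number of banks helps, here is a minimal sketch. It assumes a simple word-interleaved model (bank = word address mod number of banks). The function name and the bank counts are made up for illustration and do not describe any particular machine.

```python
def banks_touched(stride_words, num_banks, num_accesses=64):
    """Count the distinct banks hit by a strided access stream under a
    simple word-interleaved model (bank = word_address % num_banks)."""
    return len({(i * stride_words) % num_banks for i in range(num_accesses)})

# Power-of-2 stride on a power-of-2 number of banks:
# every access lands in the same bank, so the stream serializes.
pow2_banks = banks_touched(stride_words=16, num_banks=16)

# Same stride on a prime number of banks:
# the accesses spread across all of the banks.
prime_banks = banks_touched(stride_words=16, num_banks=17)

print(pow2_banks, prime_banks)
```

Under this model the only strides that collapse onto one bank are multiples of the bank count, which is much rarer in practice when that count is prime.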

i  saw one application,  a major cfd  code, that sorted references to a stencil based reference pattern.  this was done to optimize performance for cache based vector systems.

with the presence of HBM memory systems, and some clever memory controller design (that could be done with vector reference information),  i am pretty sure ordered and unordered  loads/stores will have the same implementation. to be more specific,  a memory design for GUPS  would have this  type of implementation


i    look forward to a better understanding

Thank you for all your inputs that have helped me get a better understanding.










Guy Lemieux commented:


I think 90%+ of implementations will choose to do ordered loads and stores even though unordered is permitted.

This means programmers will expect them to be ordered, and such software will not work properly on the remaining implementations. This compatibility problem is a concern.

I think the best way to combat this is to have 2 sets of instructions: ordered and unordered. The unordered implementation can simply do the ordered thing in simple implementations.
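
A small model may make the compatibility concern concrete. The sketch below is only an illustration: `scatter` is a hypothetical helper, memory is a plain Python list, and "arbitrary order" is modelled by shuffling. With repeated indices, the ordered form always leaves the highest-numbered element's value in memory, while the unordered form may leave any of the competing values.

```python
import random

def scatter(memory, indices, values, ordered, rng=None):
    """Model an indexed store: memory[indices[i]] = values[i] per element.
    ordered=True writes in element order; ordered=False writes in an
    arbitrary (here: shuffled) order, as an unordered scatter is allowed to."""
    order = list(range(len(indices)))
    if not ordered:
        (rng or random).shuffle(order)
    for i in order:
        memory[indices[i]] = values[i]
    return memory

indices = [3, 3, 3, 5]          # elements 0..2 all target location 3
values = [10, 20, 30, 40]

# Ordered: element 2 is written last, so location 3 always ends up as 30.
ordered_result = scatter([0] * 8, indices, values, ordered=True)

# Unordered: location 3 may end up holding 10, 20 or 30.
possible = {
    scatter([0] * 8, indices, values, ordered=False, rng=random.Random(seed))[3]
    for seed in range(50)
}
print(ordered_result[3], sorted(possible))
```

Note that always writing in element order is one of the outcomes the unordered form permits, which is why a simple implementation can execute both instructions the same way.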

Ordered stores to a FIFO is a paradigm I was hoping to use for inter-processor communication.

I think compiler considerations are also important, but I don’t know the implications here.

Guy

Maybe what's below could be improved by saying that if the base address (in src1) falls in a non-idempotent region or an "ordered channel," the entire instruction would run in order.   If not, it would not.  We could allow a stride of zero, but not other overlapping strides, for stores.  A later access that lands in a non-idempotent or ordered region would raise an exception.  That would allow loads from and stores to a FIFO to work.  But it wouldn't allow an instruction to "fall into" such a memory segment part way through.
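
One possible reading of this rule, as a sketch only: the region names, the `region_of` lookup, and the `AccessFault` exception are all hypothetical, and the case where the base is in a special region but a later element leaves it is not addressed above, so the sketch simply keeps running in order.

```python
SPECIAL = {"non_idempotent", "ordered_channel"}

class AccessFault(Exception):
    """Hypothetical exception standing in for the trap described above."""

def classify(base, stride, vl, width, is_store, region_of):
    """Decide how a strided vector access would execute under the rule above.
    Returns "ordered" or "unordered", or raises AccessFault."""
    # Stores: stride 0 is allowed, other overlapping strides are not.
    if is_store and 0 < abs(stride) < width:
        raise AccessFault("overlapping non-zero stride on a store")
    base_special = region_of(base) in SPECIAL
    if not base_special:
        # An unordered instruction may not "fall into" a special region.
        for i in range(1, vl):
            if region_of(base + i * stride) in SPECIAL:
                raise AccessFault(f"element {i} enters a special region")
    return "ordered" if base_special else "unordered"

# Hypothetical memory map: an 8-byte FIFO window at 0x1000.
def region_of(addr):
    return "ordered_channel" if 0x1000 <= addr < 0x1008 else "main_memory"

print(classify(0x1000, 0, 4, 4, True, region_of))    # stride-0 store to the FIFO
print(classify(0x2000, 4, 4, 4, False, region_of))   # ordinary strided load
```

A load starting at 0x0FF8 with stride 4 would raise `AccessFault` in this model, because its third element reaches the FIFO window.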

      Bill

On 9/4/20 10:18 AM, Bill Huffman wrote:

I think from this morning, we are considering:

  1. Ordered scatters are done truly in order
  2. Strided stores that overlap (including segmented ones) will trap as illegal
  3. All other vector loads and stores do their memory accesses in arbitrary order.
  4. A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
  5. All loads with vector sources must use a different register for the destination than any source (including mask).
  6. Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??
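
For item 2, the overlap condition can be sketched as follows. This is an assumption about how the check might be expressed, not text from the spec: two elements of a strided access overlap exactly when the absolute stride is smaller than the footprint of one element (or of one whole segment, for segmented stores). The function and parameter names are hypothetical.

```python
def strided_store_overlaps(stride_bytes, eew_bytes, nfields=1, vl=2):
    """Return True if any two elements (or segments) of a strided store
    would overlap in memory.

    Element i starts at base + i * stride_bytes and covers
    eew_bytes * nfields bytes, so neighbouring elements overlap
    exactly when |stride| is smaller than that footprint."""
    if vl < 2:
        return False          # a single element cannot overlap itself
    footprint = eew_bytes * nfields
    return abs(stride_bytes) < footprint

print(strided_store_overlaps(0, 4))              # stride 0: overlaps
print(strided_store_overlaps(4, 4))              # unit stride: no overlap
print(strided_store_overlaps(8, 4, nfields=3))   # 12-byte segments, stride 8: overlaps
```

Since the stride comes from a scalar register, this is a run-time check in general.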

A few of the consequences of this are:

  • A gather with repeated elements can access the higher numbered elements first and lower ones later
  • A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
  • A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
  • When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
  • Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected.  
  • A stride-0 store to a FIFO will trap
  • A stride-0 load from a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
  • A non-idempotent memory location accessed by a vector load may be accessed multiple times.
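
The stride-0 FIFO case can be modelled with a toy simulation. Everything here is an assumption for illustration: the FIFO is a `deque`, arbitrary order is a shuffle, and the probabilities for reusing a value (item 4) or re-reading an element (item 6) are arbitrary.

```python
import random
from collections import deque

def stride0_load_from_fifo(fifo, vl, rng):
    """Model a stride-0 vector load whose address is a FIFO head.
    Elements are filled in arbitrary order; an element may reuse the
    previously loaded value (item 4) or be read twice (item 6)."""
    order = list(range(vl))
    rng.shuffle(order)
    result = [None] * vl
    last = None
    pops = 0
    for i in order:
        if last is not None and rng.random() < 0.3:
            result[i] = last                  # reuse an earlier loaded value
            continue
        for _ in range(rng.choice([1, 2])):   # occasionally replay the read
            last = fifo.popleft()
            pops += 1
        result[i] = last
    return result, pops

fifo = deque(range(100))                      # FIFO holding 0, 1, 2, ...
result, pops = stride0_load_from_fifo(fifo, vl=8, rng=random.Random(0))
print(result, pops)
```

In this model the number of entries consumed ranges from 1 to 2*vl, and the values are not monotonic in element number, which matches the consequences listed above.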

We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time.  Even strides can't always be known at compile time.  Will this plan reduce the amount of auto-vectorization that can be done?

Exception reporting still has issues:

  • Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
  • For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
  • There may need to be a way to report a watchpoint on one element but restart on an earlier element
  • If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element?  Does that help anything if we do?

I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way.  If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than for unit stride or even strided accesses.
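
To show why the comparison is harder for indexed accesses, here is a rough software analogy (not a hardware design). The helper names are hypothetical. A unit-stride access covers one contiguous range, so a younger scalar access needs a single range comparison; a gather or scatter needs a comparison against every element address.

```python
def overlaps(a_start, a_len, b_start, b_len):
    """True if byte ranges [a_start, a_start+a_len) and
    [b_start, b_start+b_len) intersect."""
    return a_start < b_start + b_len and b_start < a_start + a_len

def conflicts_unit_stride(scalar_addr, scalar_len, base, eew, vl):
    # One range comparison covers the whole vector access.
    return overlaps(scalar_addr, scalar_len, base, eew * vl)

def conflicts_indexed(scalar_addr, scalar_len, base, offsets, eew):
    # One comparison per element: the offsets are only known once the
    # index vector has been read.
    return any(overlaps(scalar_addr, scalar_len, base + off, eew)
               for off in offsets)

print(conflicts_unit_stride(0x1010, 4, 0x1000, 4, 8))            # inside the range
print(conflicts_indexed(0x1010, 4, 0x1000, [0, 0x40, 0x10], 4))  # hits one element
print(conflicts_indexed(0x1010, 4, 0x1000, [0, 0x40, 0x80], 4))  # misses all elements
```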

      Bill



