Re: Vector Memory Ordering


Andrew Waterman
 



On Fri, Sep 4, 2020 at 10:19 AM Bill Huffman <huffman@...> wrote:

I think from this morning, we are considering:

  1. Ordered scatters are done truly in order
  2. Strided stores that overlap (including segmented ones) will trap as illegal
This is not terribly straightforward.

I'll assume that the trap would only be a function of the stride and element/segment width, rather than checking that two active elements actually overlap at runtime.  Even so, very large strides can foul this up.  Consider (in RV32) vssseg8e32 with vl=5 and stride=0xC0000004.  Elements 0 and 4 overlap!  (This phenomenon can also happen for non-segment strided stores by using a misaligned stride, which (for good reason) is a valid thing to do.)

It's not an inexpensive computation in general.  It would be better, I think, to either make them in-order or to permit arbitrary reordering than to trap.

  1. All other vector loads and stores do their memory accesses in arbitrary order.
  2. A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
  3. All loads with vector sources must use a different register for the destination than any source (including mask).
Why is this change necessary?  Currently, depending on EEW, non-segment indexed loads are allowed to overlap the index register.  Are you suggesting that, not only can indexed loads access memory in arbitrary order, they can also write back imprecisely (past vstart), destroying the index register?

(Mask register overlap is already forbidden by the usual rules for different-EEW overlap.)
  1. Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??

A few of the consequences of this are:

  • A gather with repeated elements can access the higher numbered elements first and lower ones later
  • A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
  • A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
  • When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
  • Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected. 
  • A stride-0 store to a FIFO will trap
  • A stride-0 load to a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
  • A non-idempotent memory location accessed by a vector load may be accessed multiple times.

We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time.  Even strides can't always be known at compile time.  Will this plan reduce the amount of auto-vectorization that can be done?

Exception reporting still has issues:

  • Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
  • For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
  • There may need to be a way to report a watchpoint on one element but restart on an earlier element
  • If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element?  Does that help anything if we do?

I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way.  If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than for unit stride or even strided accesses.

      Bill


Join tech-vector-ext@lists.riscv.org to automatically receive all group messages.