Re: Vector Memory Ordering


David Horner
 


On 2020-09-04 5:53 p.m., Andrew Waterman wrote:


On Fri, Sep 4, 2020 at 10:19 AM Bill Huffman <huffman@...> wrote:

I think from this morning, we are considering:

  1. Ordered scatters are done truly in order
Just a quick note here: if Ztso is active, then "truly in order" implies global ordering too.
  1. Strided stores that overlap (including segmented ones) will trap as illegal
This is not terribly straightforward.

I'll assume that the trap would only be a function of the stride and element/segment width, rather than checking that two active elements actually overlap at runtime.
Much of the discussion was driven by the difficulty of runtime overlap detection and the challenge of ensuring "correct" function.
Even so, very large strides can foul this up.  Consider (in RV32) vssseg8e32 with vl=5 and stride=0xC0000004.  Elements 0 and 4 overlap! 
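
To make the wraparound case above concrete, here is a minimal Python sketch (not from the thread). It assumes RV32 addresses wrap modulo 2**32 and treats each segment as nfields * eew_bytes contiguous bytes; the function names and the base address 0x1000 are made up for illustration.

```python
XLEN_MASK = 0xFFFFFFFF  # assumption: RV32, so addresses wrap modulo 2**32


def segment_ranges(base, stride, vl, nfields, eew_bytes):
    """Return (start, length) of each segment, with RV32 address wraparound."""
    seg_bytes = nfields * eew_bytes
    return [((base + i * stride) & XLEN_MASK, seg_bytes) for i in range(vl)]


def overlaps(a, b):
    """True if two (start, length) byte ranges intersect modulo 2**32."""
    (sa, la), (sb, lb) = a, b
    # One range starts inside the other, measuring distance with wraparound.
    return ((sb - sa) & XLEN_MASK) < la or ((sa - sb) & XLEN_MASK) < lb


def overlapping_pairs(base, stride, vl, nfields, eew_bytes):
    """List every pair of element numbers whose segments share a byte."""
    segs = segment_ranges(base, stride, vl, nfields, eew_bytes)
    return [(i, j)
            for i in range(vl)
            for j in range(i + 1, vl)
            if overlaps(segs[i], segs[j])]


# vssseg8e32 (8 fields of 4 bytes), vl=5, stride=0xC0000004:
# 4 * stride wraps to 0x10, which is less than the 32-byte segment size.
pairs = overlapping_pairs(0x1000, 0xC0000004, vl=5, nfields=8, eew_bytes=4)
print(pairs)  # [(0, 4)]
```

The same check flags the misaligned non-segment case: with nfields=1, eew_bytes=4 and a stride of 2, elements 0 and 1 overlap. Note that this pairwise check is quadratic in vl, which is one way to see why doing it at runtime is unattractive.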

The thinking was:

1) such a restriction would be addressed by the application falling back to vl=1 iterations

2) such a restriction could be relaxed later, thus deferring the need to address every permutation such as this.

(This phenomenon can also happen for non-segment strided stores by using a misaligned stride,
The thinking was that a misaligned stride would also be restricted, with the same fallback.
which (for good reason) is a valid thing to do.)

I believe a misaligned stride could be very valuable for loads, less valuable for stores.

Would you please elaborate on the good reason(s) for misaligned stride?



It's not an inexpensive computation in general.  It would be better, I think, to either make them in-order or to permit arbitrary reordering than to trap.

I am in favour of arbitrary reordering, but I'd prefer to refer to it as parallel execution; memory ordering is a distinct issue that is often conflated with the sequence in which vector elements are processed.
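
As a rough illustration of why the element order only matters when stores overlap, here is a small Python sketch (not from the thread). It models memory as a bytearray and a strided store as a loop over elements in a caller-chosen order; the function name strided_store and the data values are hypothetical.

```python
def strided_store(mem, base, stride, values, eew_bytes, order):
    """Write each element as little-endian bytes, visiting elements in `order`."""
    for i in order:
        addr = base + i * stride
        mem[addr:addr + eew_bytes] = values[i].to_bytes(eew_bytes, "little")
    return mem


values = [0x11111111, 0x22222222, 0x33333333]
in_order = [0, 1, 2]
reversed_order = [2, 1, 0]

# Misaligned stride of 2 bytes with 4-byte elements: neighbours overlap,
# so the final memory contents depend on which element is written last.
a = strided_store(bytearray(16), 0, 2, values, 4, in_order)
b = strided_store(bytearray(16), 0, 2, values, 4, reversed_order)
overlap_differs = a != b

# Stride of 4 bytes: no element overlaps another, so the order is invisible.
c = strided_store(bytearray(16), 0, 4, values, 4, in_order)
d = strided_store(bytearray(16), 0, 4, values, 4, reversed_order)
disjoint_same = c == d

print(overlap_differs, disjoint_same)  # True True
```

This only models a single hart's view of its own stores; it says nothing about how other harts observe them, which is the separate memory-ordering question.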


  1. All other vector loads and stores do their memory accesses in arbitrary order.
  2. A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
  3. All loads with vector sources must use a different register for the destination than any source (including mask).
Why is this change necessary?  Currently, depending on EEW, non-segment indexed loads are allowed to overlap the index register.  Are you suggesting that, not only can indexed loads access memory in arbitrary order, they can also write back imprecisely (past vstart),
Yes. It was noted that this is already allowed by some instructions, and explicitly by the relaxed (point 4) definition of precise vector traps.
destroying the index register?
Potentially, so avoid the situation, as with other instructions where vrestart is desired but costly, difficult, or impossible to implement without the restriction.

(Mask register overlap is already forbidden by the usual rules for different-EEW overlap.)
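
The index-register concern can be sketched in Python as a toy model (not from the thread, and not how any implementation is required to behave). It assumes an indexed load may write elements past the faulting one before trapping; the names indexed_load, regs, and fault_elem are hypothetical, and memory is a plain list where mem[k] holds 100 + k.

```python
def indexed_load(mem, regs, vd, vidx, vl, vstart=0, order=None, fault_elem=None):
    """Toy indexed load: regs[vd][i] = mem[regs[vidx][i]] for i >= vstart.

    Elements are visited in `order`. Returns the faulting element number
    (the vstart to resume from), or None if the load completed.
    """
    order = range(vl) if order is None else order
    for i in order:
        if i < vstart:
            continue
        if i == fault_elem:
            return i  # trap; elements visited earlier are already written
        regs[vd][i] = mem[regs[vidx][i]]
    return None


mem = [100 + k for k in range(256)]
indices = [3, 1, 2, 0]
expected = [103, 101, 102, 100]

# Destination distinct from the index register: restart is harmless.
regs = {"v8": list(indices), "v4": [0, 0, 0, 0]}
vstart = indexed_load(mem, regs, "v4", "v8", 4, order=[0, 2, 3, 1], fault_elem=1)
indexed_load(mem, regs, "v4", "v8", 4, vstart=vstart)
distinct_ok = regs["v4"] == expected

# Destination overlaps the index register: elements 2 and 3 were written
# before the trap on element 1, so the restart reads data as indices.
regs = {"v8": list(indices)}
vstart = indexed_load(mem, regs, "v8", "v8", 4, order=[0, 2, 3, 1], fault_elem=1)
indexed_load(mem, regs, "v8", "v8", 4, vstart=vstart)
overlap_ok = regs["v8"] == expected

print(distinct_ok, overlap_ok)  # True False
```

In this model the restart after the trap re-executes elements 2 and 3 using already-loaded data as indices, which is the situation the proposed restriction is meant to rule out.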
