Re: Vector Memory Ordering

Bill Huffman

The problem is not that you have to ensure previous (in the scalar loop order sense) entries don't trap before you take the exception on the later one.  The problem is that you have to not complete a later store before checking for earlier traps.  That is unfortunate for all vector stores but it is a disaster for scatters.  It's impossible to know scatters won't trap unless they're run through the MMU/PMP.  So you have to run all lower numbered ones through the MMU/PMP before you allow a higher numbered one to complete.  Which in an in-order implementation is doing them in order and in an out-of-order implementation is a significant constraint.

The problem, as I see it, is that reporting an exception on a higher numbered element when a lower numbered element has not yet been stored requires more state than vstart.  I think we either have to add exception state (not sure how) or allow stores to be done more than once.  This seems critical for scatters - at least unordered ones.  For strides larger than very small numbers, it is restrictive as well and we need to consider it, I think.  In that case, do we extend the same to unit/small stride as well?


On 9/8/20 4:41 PM, Allen Baum wrote:


You don't have to ensure previous stores (or loads) have completed - I misspoke. You have to ensure that they don't trap, which might be a much easier proposition. Implementations may typically do those checks in order anyway, and the checks only need to be done once/page (or PMA or PMP region, unfortunately).

On Tue, Sep 8, 2020 at 2:08 PM Bill Huffman <huffman@...> wrote:

On 9/8/20 1:47 PM, Allen Baum wrote:

If the memory ordering (in this case specifically memory access ordering, I think - correct me if I'm wrong) doesn't affect the final processor state (which would not be the case for a vector reduce that started reduction as the values became available, instead of in some specific order), I don't know that compliance tests can tell the difference.

Where this does affect processor state is exception reporting: a golden model which accesses results in one order may not have the same exception state as an implementation that accesses in a different order. Ditto for watchpoints (are they considered architectural?). A possible way to fix this is to have the golden model access in a canonical order, and implementations delay exception reporting until all accesses from logically previous addresses are complete (and any logically previous exception override one that was reported earlier)

From most points of view it's a fine method (delay exception reporting until all accesses from logically previous addresses are complete and override).  But it seems to require either that stores are done in order anyway, or that stores of vector elements can be done more than once (only until the instruction completes).  I think I'm OK with vector elements being stored more than once during the instruction, but I don't know that it will be accepted easily.  The other choice seems to be to do many stores in order anyway.  For reasonably small strides, it may be possible to do exception checks all at once, but for large strides and scatters, I don't expect it is.


Debug compliance testing will be verry interesting....

Non-idempotent memory accesses will also affect final state, but I'd argue that is non-ISA and we won't be testing that.
Concurrent, overlapping accesses is another pain point.

On Tue, Sep 8, 2020 at 7:21 AM mark <markhimelstein@...> wrote:
Bill, great lists.

can we start building a testcase list for these situations and others? maybe a github doc? i am sure these will be documented but I don't want to lose bullet lists like this and have to derive them from a doc later if we can avoid it.

Allen, does the compliance group have a proposed way (or format) to define testcases and input data sets? If not should we?

general note, I think we need to specifically add debug to the definition of done. I can imagine things like watchpoints affecting a lot of extensions in interesting ways.


On Fri, Sep 4, 2020 at 10:18 AM Bill Huffman <huffman@...> wrote:

I think from this morning, we are considering:

  1. Ordered scatters are done truly in order
  2. Strided stores that overlap (including segmented ones) will trap as illegal
  3. All other vector loads and stores do their memory accesses in arbitrary order.
  4. A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
  5. All loads with vector sources must use a different register for the destination than any source (including mask).
  6. Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??

A few of the consequences of this are:

  • A gather with repeated elements can access the higher numbered elements first and lower ones later
  • A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
  • A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
  • When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
  • Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected. 
  • A stride-0 store to a FIFO will trap
  • A stride-0 load to a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
  • A non-idempotent memory location accessed by a vector load may be accessed multiple times.

We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time.  Even strides can't always be known at compile time.  Will this plan reduce the amount of auto-vectorization that can be done?

Exception reporting still has issues:

  • Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
  • For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
  • There may need to be a way to report a watchpoint on one element but restart on an earlier element
  • If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element?  Does that help anything if we do?

I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way.  If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than for unit stride or even strided accesses.


Join to automatically receive all group messages.