mem model - RISC-V Vector Extension TG Minutes 2020/6/26

David Horner

I offer some clarifications on the meeting minutes below.
I propose that we:
  1) present a relaxed RVV memory/process model, more relaxed than we believe current implementations require for optimal performance.
   Application and privileged code would use this model as the framework to code within.
  2) stipulate implementation constraints that are more restrictive than the memory/process model allows.
        This is already our current practice for such concerns, e.g. vd cannot overlap vs1/vs2 in many instructions.

Combined, these allow tightening of the memory/process model to reduce software complexity (in expected fringe cases) and
    relaxation of implementation constraints as technology advances permit.

This is also the basic idea behind the #364 proposal.

On 2020-06-26 11:05 p.m., Krste Asanovic wrote:
Date: 2020/6/26
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~15
Current issues on github:


## Make vector alignment/misalignment independent of scalar

RISC-V implementations may or may not support misaligned accesses on
scalar load/stores. The consensus was to separate specification of
support on vector from scalar (i.e., they could differ in their
support for misaligned accesses). This led to a broader discussion
regarding PMAs, which should separate memory attributes for scalar
versus vector accesses.
A) Some discussion also centered on what "alignment" means here.
## Unordered vector indexed stores report address exceptions in element order

After much discussion, the group was leaning to requiring that vector
indexed stores have to report address exceptions in element order,
but the spec of unordered stores was to be reviewed both for this
issue, and for possible impact on vector memory model.
B) The related issue of reporting load exceptions in order also needs to be addressed.

C) A related issue: emulation of RVV instructions that would not otherwise perform any memory operations,
  but whose emulation may need to spill registers in order to use RVV instructions efficiently.
     I mention this below, but further analysis and action are needed.

Both A and B are sub-issues of the RVV memory model, which RVWMO explicitly does not address.
Instructions in the RV128 base instruction set and in future ISA extensions such as V (vector)
and P (SIMD) may give rise to multiple memory operations. However, the memory model for
these extensions has not yet been formalized.
This is substantially because:

Memory consistency models supporting overlapping memory accesses of different widths simultaneously
remain an active area of academic research and are not yet fully understood. The
specifics of how memory accesses of different sizes interact under RVWMO are specified to the
best of our current abilities, but they are subject to revision should new issues be uncovered.
Although we address element alignment, we make no statement about the granularity of the underlying
load/store operations for non-AMO accesses.
I believe we should clarify this, specifically that implementations are free to:
    1) choose any granularity, e.g. cache-line size;
    2) mix granularities within a single vector load or store;
    3) decompose elements and re-order the resulting sub-element operations.
This:
    a) allows for reasonable (not overly constrained) emulation of a given vector instruction;
    b) unequivocally places RVV in the uncharted waters of mixed-size interaction, where it inherently is, given the nature of EEW.

Notably, from RVI:

A misaligned load or store instruction may be decomposed into a set of component memory operations
of any granularity. An FLD or FSD instruction for which XLEN<64 may also be decomposed
into a set of component memory operations of any granularity. The memory operations generated
by such instructions are not ordered with respect to each other in program order, but they are
ordered normally with respect to the memory operations generated by preceding and subsequent
instructions in program order.
and from RVV Memory Order:

Vector memory instructions appear to execute in program order on the local hart. Vector memory instructions
follow RVWMO at the instruction level, and element operations are ordered within the instruction as if
performed by an element-ordered sequence of syntactically independent scalar instructions. Vector
indexed-ordered stores write elements to memory in element order. Vector indexed-unordered stores do not
preserve element order for writes within a single vector store instruction.
1) Although we state "written in element order" for stores, no such guarantee is stated for loads.
        Without such a stipulation, or a mechanism to ensure synchronization across harts, the vector store guarantee is of no consequence/benefit to vector loads on other harts.
         I believe this is the appropriate default position.
            - Scalar ops with appropriate fences will see predictable vector stores.
            - Scalar processes should take a high-level lock (avoidance is also such a lock) against access to vector data while vector ops are in progress on that data on another, or the same, hart.
            - Vector processes, likewise, should take a high-level lock against the same vector data while vector ops are in progress on that data on another, or the same, hart.
This is a reasonable initial position.
The lock level can be lowered to include complete code vectorization as RVWMO evolves.

However, even avoidance is an insufficient lock if no fencing is stipulated.

I suggest:
    a) vector memory operations respect scalar-oriented fences relative to scalar memory operations.
    b) vector-to-vector memory operations are not constrained by ordinary memory fences, and a stronger fence formulation is required.
       i) I suggest we consider that only fence ops with both the Input/Read and Output/Write predecessor/successor bits set provide vector-to-vector ordering.

The one exception we should carve out is vector AMO ops, which are specifically designed for such interactions.

2) I believe the execution model should allow vector instructions to conceptually proceed in parallel, within some constraints.
    Specifically, the goal is to allow processing as if each element index were operated upon by an independent hart.
    I will detail this in a github issue.