Re: mem model - RISC-V Vector Extension TG Minutes 2020/6/26
David Horner
TL;DR:
I make clarifications on the meeting minutes. I propose we
1) present a relaxed RVV memory/process model, more relaxed than we believe current implementations require for optimal performance. Application and privileged code will use this as the framework to code within.
2) stipulate implementation constraints that are more restrictive than the memory/process model allows. This is already our current practice for such concerns, e.g. vd cannot overlap vs1/vs2 in many instructions.
Combined, this allows tightening of the memory/process model to reduce software complexity (in expected fringe cases) and relaxation of implementation constraints as technology advances allow. This is also the basic idea behind the #364 proposal.

On 2020-06-26 11:05 p.m., Krste Asanovic wrote:
Date: 2020/6/26

A) Some discussion also centered around what alignment is.

## Unordered vector indexed stores report address exceptions in

B) The related reporting of load exceptions in order also needs to be addressed.

C) A related issue: emulation of RVV instructions that might otherwise not perform any memory operations, but may need to spill registers to efficiently use RVV instructions in the emulation. I mention this below, but further analysis and action are needed.

Both A and B are sub-issues of the RVV memory model, which RVWMO explicitly does not address:
Instructions in the RV128 base instruction set and in future ISA extensions such as V (vector)
This is substantially because:
Memory consistency models supporting overlapping memory accesses of different widths simultaneously

Although we address element alignment, we make no statement for non-AMO operations on the granularity of the load/store ops. I believe we should clarify this, and specifically state that implementations are free to
1) choose any granularity, e.g. cache line size.
2) mix granularity in a single vector load or store
3) decompose elements and re-order sub-elements

This
a) allows for reasonable (not overly constrained) emulation of a given vector instruction.
b) unequivocally places RVV in the uncharted waters of mixed-size interaction, where it inherently is, given the nature of EEW.

Notably:
From RVI:
A misaligned load or store instruction may be decomposed into a set of component memory opera-
and from RVV Memory Order:
Vector memory instructions appear to execute in program order on the local hart. Vector memory instructions

1) Although we state "written in element order" for stores, no such guarantee is stated for loads. Without such a stipulation, or a mechanism to ensure synchronization across harts, the vector store guarantee is of no consequence/benefit to vector loads on other harts. I believe this is the appropriate default position.
- Scalar ops with appropriate fences will see predictable vector stores.
- Scalar processes should high-level lock (avoidance is also such a lock) against access of vector data while vector ops are in process on that data on another or the same hart.
- Vector processes, likewise, should high-level lock against the same vector data while vector ops are in process on that data on another or the same hart.
This is a reasonable initial position. The lock level can be lowered to include complete code vectorization as RVWMO evolves.

However, even avoidance is an insufficient lock if no fencing is stipulated. I suggest
a) vector memory operations, relative to scalar mem ops, respect scalar-oriented fences.
b) vector-to-vector memory ops are not constrained by mem fences, and a stronger fence formulation is required.
i) I suggest that we consider that only fence ops with both Input/Read and Output/Write predecessor/successor bits set provide vector-to-vector ordering.
The one exception we should carve out is vector AMO ops, which are specifically designed for such interactions.
2) I believe the execution model should allow vector instructions to conceptually proceed in parallel, within some constraints. Specifically, the goal is to allow the processing as if each element index were operated upon by independent harts. I will detail this in a GitHub issue.
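
To make the earlier granularity point concrete (implementations free to choose any granularity and to decompose elements and re-order sub-elements), here is a minimal Python sketch. It is a toy model only, not RVV semantics: the function names (component_ops, apply_ops, store_any_order) are hypothetical, a unit-stride little-endian store is assumed, and a single fixed granule per store is used for simplicity, so mixed granularity within one store is not modeled.

```python
import random


def component_ops(base, elements, eew_bytes, granule_bytes):
    """Split a unit-stride vector store into (address, bytes) component writes.

    granule_bytes may be smaller than eew_bytes (sub-element decomposition)
    or larger (e.g. a cache line covering several elements).
    """
    data = b"".join(e.to_bytes(eew_bytes, "little") for e in elements)
    ops = []
    for off in range(0, len(data), granule_bytes):
        ops.append((base + off, data[off:off + granule_bytes]))
    return ops


def apply_ops(memory, ops):
    """Apply component writes to a bytearray in the order given."""
    for addr, chunk in ops:
        memory[addr:addr + len(chunk)] = chunk
    return memory


def store_any_order(elements, eew_bytes, granule_bytes, seed):
    """Perform the store with its component writes in a shuffled order."""
    ops = component_ops(0, elements, eew_bytes, granule_bytes)
    random.Random(seed).shuffle(ops)  # assumption: any order is permitted
    memory = bytearray(len(elements) * eew_bytes)
    return bytes(apply_ops(memory, ops))
```

In this model the final memory image is the same for every granule size and every ordering of the component writes, which is what the local hart relies on. What differs is the set of intermediate states: an observer that samples memory after only some of the component writes can see a partially written element, which is the mixed-size interaction described above.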