Vector Task Group minutes 2020/5/15 - CLSTR for in-register to in-memory alignment
I made this into its own thread as well.
I think there is a parallel between the in-register/in-memory issue and memory consistency models and methods.
The complexity of consistency models and methods is enormous, and similarly we have a plethora of approaches to address mixed-SEW operations.
For memory consistency models, hardware support ranges from Strict Consistency at one extreme through to Slow Consistency at the other.
And there are the software models, from Sequential Consistency through to Alpha.
For RVV we have similar microarchitectural models on both the ALU and memory sides, with many hardware variations and tradeoffs.
These potentially affect the software-visible features and in-register byte orders, with analogous software memory models ranging from agnostic (v0.9 SLEN=VLEN), through various degrees of awareness, to casting operations (like acquire/release constraints) that address SLEN<VLEN anomalies (variations from SLEN=VLEN operation).
I want to attempt a comparison of CLSTR implementations, in conjunction with SLEN vs VLEN, against Sequential Consistency and the relaxed models RVTSO and RVWMO.
A) SLEN=VLEN is like a Sequential Consistency implementation.
Programs see the behaviour exactly as expected, but it doesn't scale well with increased parallelism.
B) SLEN<VLEN is like a relaxed memory model. In a relaxed memory model, disjoint (concurrent) code segments compete for a memory location. With SLEN<VLEN, disjoint segments of code (which could be in a cooperative process, but could also be running consecutively in another portion of code) assume that the in-register and in-memory structure is maintained. The code segments do not have a synchronized view of the data.
B1) SLEN<VLEN and CLSTR=XLEN resembles RVTSO. It is a limited relaxing that other architectures take for granted. It allows these other disjoint pieces of code ("concurrent cooperative processes" or consecutively executed) to see in-register = in-memory for all elements <= CLSTR, a reasonable behaviour. However, mixed-SEW operations will run slower for these in-memory order elements. As with RVTSO, the full freedom of relaxing other order constraints is missing, with the resultant performance loss.
B2) SLEN<VLEN and CLSTR=8 requires that all the code segments are aware of the in-register vs in-memory discrepancy (synchronization between all the components, i.e. awareness of the format, is required). This is like RVWMO, which requires special attention to all the relaxed constraints for all code segments that are "sharing" a memory location. But code runs the fastest in this configuration.
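To make the A / B1 / B2 mapping above concrete, here is a minimal Python sketch. It only encodes the analogy as stated in this post; it is not a model of any real hardware or of the spec. The names `VectorConfig`, `analogy` and `needs_layout_awareness` are hypothetical, CLSTR is assumed to be expressed in bits (so CLSTR=8 means byte clusters), and XLEN=64 is just an illustrative choice.

```python
from dataclasses import dataclass

XLEN = 64  # assumption: 64-bit scalar registers, for illustration only


@dataclass(frozen=True)
class VectorConfig:
    """Hypothetical description of one vector configuration (all sizes in bits)."""
    vlen: int   # vector register length
    slen: int   # striping length
    clstr: int  # cluster size, assumed to be expressed in bits


def analogy(cfg: VectorConfig, xlen: int = XLEN) -> str:
    """Return the memory-model analogy from the post for this configuration."""
    if cfg.slen > cfg.vlen:
        raise ValueError("SLEN must not exceed VLEN")
    if cfg.slen == cfg.vlen:
        return "A: like Sequential Consistency"
    if cfg.clstr == xlen:
        return "B1: like RVTSO"
    if cfg.clstr == 8:
        return "B2: like RVWMO"
    return "unclassified: not discussed in the post"


def needs_layout_awareness(cfg: VectorConfig, sew: int, xlen: int = XLEN) -> bool:
    """True if code using SEW-bit elements must know in-register != in-memory."""
    if cfg.slen == cfg.vlen:
        return False                # A: programs see exactly what they expect
    if cfg.clstr == xlen:
        return sew > cfg.clstr      # B1: elements <= CLSTR keep in-memory order
    return True                     # B2 (and anything else): assume awareness is needed


if __name__ == "__main__":
    for cfg in (VectorConfig(512, 512, 64),
                VectorConfig(512, 128, 64),
                VectorConfig(512, 128, 8)):
        print(cfg, "->", analogy(cfg), "| SEW=32 aware:", needs_layout_awareness(cfg, 32))
```

The conservative `return True` at the end reflects B2 as described: every code segment touching the data is assumed to need awareness of the format.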
I believe (A) is not required by the software industry, which often accepts the constraint that memory mapping of components over a certain minimum size, usually word size, are not guaranteed to be processed atomically nor stored contiguously for vector operations.
I believe B1 will be readily accepted as a limitation for ported code and first implementations of tool chains.
As a result, SLEN<VLEN implementations that provide both B1 and B2, selectable under program control, will be able to run the full body of code under B1.
BUT software developers/maintainers will be incentivized, for performance-critical code, to vet that code and enable B2.
The return on that investment is substantial.
Similarly, tool chains will thus be incentivized to detect code that can run in B2 and automate the process of increasing performance.
Of course HPC will from the start ensure critical sections can run in B2. HPC will either eliminate in-SEW manipulations of the bytes or provide transparent, B2-compatible means to do them, using the latter only if that is a clear win for performance.
On 2020-05-27 11:05 a.m., David Horner via lists.riscv.org wrote:
For those not on GitHub, I posted this to #461: