Vector Task Group minutes 2020/5/15 - CLSTR for in-register to in-memory alignment

David Horner

I made this into its own thread as well.

I think there is a parallel between the in-register/in-memory issue and memory consistency models/methods.

The complexity of consistency models and methods is enormous, and similarly we have a plethora of approaches to address mixed-SEW operations.

For memory consistency models, hardware support ranges from Strict Consistency at one extreme through to slow consistency.

And there are the software models, from Sequential Consistency through to Alpha.

For RVV we have similar microarchitectural models on both the ALU and memory sides, with many hardware variations and tradeoffs.
These potentially affect the software-visible features and in-register byte orders, with analogous software memory models ranging from agnostic (v0.9 SLEN=VLEN) through various levels of awareness, with casting operations (like acquire/release constraints) to address SLEN<VLEN anomalies (variations from SLEN=VLEN operation).

I want to attempt a comparison of CLSTR implementations, in conjunction with SLEN vs VLEN, with Sequential Consistency and the relaxed models RVTSO and RVWMO.

A) SLEN=VLEN is like a Sequential Consistency implementation.
        Programs see the behaviour exactly as expected, but it doesn't scale well with increased parallelism.

B) SLEN<VLEN is like a relaxed memory model. There, disjoint (concurrent) code segments compete for a memory location; here, disjoint segments of code (which could be in a cooperative process, but could also be running consecutively in another portion of code) assume that in-register and in-memory structure is maintained. The code segments do not have a synchronized view of the data.

B1) SLEN<VLEN and CLSTR=XLEN resembles RVTSO. It is a limited relaxing that other architectures take for granted. It allows these other disjoint pieces of code ("concurrent cooperative processes" or consecutively executed) to see in-register = in-memory for all elements <= CLSTR, a reasonable behaviour. However, mixed-SEW operations will run slower for these in-memory order elements. Like RVTSO, the full freedom of relaxing other order constraints is missing, with the resultant performance loss.

B2) SLEN<VLEN and CLSTR=8 requires that all the code segments are aware of the in-register vs in-memory discrepancy (synchronization between all the components, that is, awareness of the format, is required). This is like RVWMO, which requires special attention to all the relaxed constraints for all code segments that are "sharing" a memory location. But code runs the fastest in this configuration.

I believe (A) is not required by the software industry, which often accepts the constraint that memory-mapped components over a certain minimum size, usually word size, are guaranteed neither to be processed atomically nor to be stored contiguously for vector operations.

I believe B1 will be readily accepted as a limitation for ported code and first implementations of tool chains.

As a result, SLEN<VLEN implementations that provide both B1 and B2, selectable under program control, will be able to run the full body of code under B1.

But software developers/maintainers will be incentivized, for performance-critical code, to vet that code and enable B2: a substantial return on investment.
Similarly, tool chains will be incentivized to detect code that can run in B2 and automate the process of increasing performance.
Of course HPC will from the start ensure critical sections can run in B2. HPC will eliminate in-SEW manipulations of the bytes, or provide transparent B2-compatible means to do them, using those means only if that is a clear win for performance.

On 2020-05-27 11:05 a.m., David Horner wrote:

For those not on GitHub, I posted this to #461:

CLSTR can be considered a progressive SLEN=VLEN switch.

Rather than the all-or-nothing choice that the SLEN=VLEN switch provides for in-memory compatibility,
clstr provides either a fixed or variable degradation for widening operations on SEW<CLSTR.

With a fixed clstr, all operations run at peak performance except mixed-SEW widening/narrowing, and then only those with SEW<CLSTR.

In-memory and in-register formats align when SEW<=CLSTR.
Software is fully aware of the mapping, and already accommodates this behaviour for many existing architectures. (It is analogous to big-endian vs little-endian in many aspects, although with big-endian all the bytes are present at each SEW level.)
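
As a rough check of the claim that in-memory and in-register formats align when SEW<=CLSTR, here is a minimal Python sketch. The placement rule is inferred from the layout diagrams in this thread and is an assumption, not a quote from the proposal: units of max(SEW, CLSTR) bits are kept contiguous and dealt round-robin across the VLEN/SLEN sections. The name `register_image` and its parameters are hypothetical.

```python
def register_image(sew, vlen=256, slen=128, clstr=32):
    """For one register, list which memory byte lands in each register byte
    after a unit-stride load at the given SEW (all sizes in bits).

    Assumed rule (inferred from the diagrams, not from a spec): units of
    max(SEW, CLSTR) bits stay contiguous and are dealt round-robin across
    the VLEN/SLEN sections.
    """
    unit = max(sew, clstr) // 8        # bytes kept contiguous
    esize = sew // 8                   # bytes per element
    nsec = vlen // slen                # number of SLEN sections
    image = [None] * (vlen // 8)
    for m in range(vlen // 8):         # m = memory byte offset
        elem, b = divmod(m, esize)     # element index, byte within element
        u, within = divmod(elem, unit // esize)
        section, slot = u % nsec, u // nsec
        image[section * (slen // 8) + slot * unit + within * esize + b] = m
    return image


# With CLSTR=32 every SEW <= 32 produces the same register byte image,
# so no cast is needed when switching between those element widths.
same = register_image(8) == register_image(16) == register_image(32)
print("SEW 8/16/32 images identical:", same)
print("SEW 64 image differs:", register_image(64) != register_image(32))
```

Under the same assumed rule, setting `clstr=8` makes the SEW=8 and SEW=16 images differ, and setting `slen=256` (SLEN=VLEN) makes every image the plain memory order.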

The clstr parameter is potentially writable, and for the higher end machines it appears very reasonable that they would provide at least two settings for CLSTR, byte and XLEN.
This would provide in-memory alignment at XLEN for code that is not sure of its dependence on it, and an optimization for widening/narrowing at SEW<XLEN for code that is sure it does not depend on in-memory format for that section of code.

Because clstr is potentially writable, software can avoid performance penalties by other means as well, and leverage other potential structural benefits. Developers will turn a liability into a feature.

I have a pending proposal for exactly that idea, “It’s not a bug, it’s a feature”.
It enables clstr for SLEN=VLEN implementations as well, and allows addressing of even/odd groupings for SLEN<VLEN, too.

On 2020-05-26 5:03 p.m., David Horner wrote:
For those not on GitHub, I posted this to #461:

I gather what was missing from this were examples.
I prefer to consider clstr as a dynamic parameter, for which some implementations will support a range of values.

However, for the sake of examples we can consider the cases where CLSTR=32.
So after a load:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                                  7 6 5 4                         3 2 1 0 SEW=8b

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b

VLEN=256b, SLEN=64, vl=13, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                        C         B A 9 8         7 6 5 4         3 2 1 0 SEW=8b


                7       3       6       2       5       1       4       0 SEW=32b

                        B               A               9       C       8  @ reg+1
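
The two layouts above can be reproduced with a short Python sketch. The rule it uses is inferred from the diagrams and is an assumption rather than the proposal's wording: elements are grouped into units of max(SEW, CLSTR) bits, and the units are dealt round-robin across the VLEN/SLEN sections. The names `byte_offset` and `row` are hypothetical.

```python
def byte_offset(i, sew, vlen=256, slen=128, clstr=32):
    """Return (register number, byte offset of the lowest byte) for element i.

    Assumed rule (inferred from the diagrams): units of max(SEW, CLSTR) bits
    are kept contiguous and dealt round-robin across the SLEN sections.
    """
    unit = max(sew, clstr)               # bits kept contiguous
    per_unit = unit // sew               # elements per unit
    reg, k = divmod(i, vlen // sew)      # register number, index within it
    u, within = divmod(k, per_unit)      # unit index, position inside unit
    nsec = vlen // slen                  # number of SLEN sections
    section, slot = u % nsec, u // nsec
    return reg, (section * slen + slot * unit + within * sew) // 8


def row(sew, vl, reg=0, **cfg):
    """Element indices held in register `reg`, from highest byte to lowest,
    which is the left-to-right order used in the diagrams."""
    placed = {}
    for i in range(vl):
        r, off = byte_offset(i, sew, **cfg)
        if r == reg:
            placed[off] = i
    return [placed[off] for off in sorted(placed, reverse=True)]


# First diagram: VLEN=256b, SLEN=128, vl=8, CLSTR=32
for sew in (8, 16, 32):
    print(f"SEW={sew}b", row(sew, 8))

# Second diagram: VLEN=256b, SLEN=64, vl=13, CLSTR=32
print("SEW=32b", row(32, 13, slen=64))
print("@ reg+1", row(32, 13, reg=1, slen=64))
```

Elements C, B, A in the second diagram appear as 12, 11, 10 in the printed lists.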

By inspection, unary and binary single-SEW operations do not affect order.
However, a widening operation, with EEW=16 and EEW=64 respectively for the SEW=8b and SEW=32b rows above, will yield:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b

                        3               1               2               0 SEW=64b

                        7               5               6               4  @ reg+1
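
One way to see why only widening from SEW<CLSTR is degraded is to compare the SLEN section of each source element with the section of its widened result. The sketch below assumes the same inferred placement rule as the diagrams (units of max(SEW, CLSTR) bits dealt round-robin across sections), and it assumes that the penalty comes from elements changing section; that is a reading of the diagrams, not something stated in the thread. The function names are hypothetical.

```python
def section_of(i, sew, vlen=256, slen=128, clstr=32):
    """SLEN section holding element i at the given element width (bits).

    Assumed rule (inferred from the diagrams): units of max(SEW, CLSTR) bits
    are dealt round-robin across the VLEN/SLEN sections of each register.
    """
    unit = max(sew, clstr)
    k = i % (vlen // sew)                # index within its register
    u = k // (unit // sew)               # unit index within the register
    return u % (vlen // slen)


def crossing_elements(sew, vl, **cfg):
    """Elements whose 2*SEW-wide result sits in a different SLEN section
    than the SEW-wide source element."""
    return [i for i in range(vl)
            if section_of(i, sew, **cfg) != section_of(i, 2 * sew, **cfg)]


# VLEN=256b, SLEN=128, vl=8, CLSTR=32
print("8->16 :", crossing_elements(8, 8))    # SEW < CLSTR: some elements move
print("32->64:", crossing_elements(32, 8))   # SEW >= CLSTR: none move
print("8->16 with CLSTR=8:", crossing_elements(8, 8, clstr=8))
```

Under these assumptions, CLSTR=8 and SLEN=VLEN both give an empty list for every widening, while CLSTR=32 gives a non-empty list only for source widths below 32.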

Narrowing works in reverse.
When SLEN=VLEN, clstr is irrelevant and effectively infinite, as there is no other SLEN group into which to advance, so the current SLEN chunk has to be used (in round-robin fashion).
Thank you for the template to use.
I don’t think SLEN = 1/4 VLEN has to be diagrammed.
And of course, store also works in reverse of load.

On 2020-05-26 11:17 a.m., David Horner wrote:

On Tue, May 26, 2020, 04:38 , <krste@...> wrote:



I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting. 
I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words. 
I don't understand the reason for this constraint.
When ELEN=32, you can pack four contiguous bytes in ELEN, but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).
Still not got it.

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b
clstr is not a count but a size.
When CLSTR is 32 this last row is
                7       5       3       1       6       4       2       0 SEW=ELEN=32b
If I understood your diagram correctly.

See #461. It is effectively what SLEN was under v0.8. But potentially configurable.

I'm doing this from my cell phone. I'll work it up better when I'm at my laptop.