Re: Vector Task Group minutes 2020/5/15


David Horner
 

For those not on GitHub, I posted this to #461:

CLSTR can be considered a progressive SLEN=VLEN switch.

Rather than the all-or-nothing in-memory compatibility that the SLEN=VLEN switch provides,
clstr provides either a fixed or a variable degradation for widening operations on SEW<CLSTR.

With a fixed clstr, all operations run at peak performance except mixed-SEW widening/narrowing, and then only when SEW<CLSTR.

In-memory and in-register formats align when SEW<=CLSTR.
Software is fully aware of the mapping, and already accommodates this behaviour on many existing architectures. (It is analogous to big-endian vs. little-endian in many respects, although with big-endian all the bytes are present at each SEW level.)
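One way to read the alignment claim: for every SEW<=CLSTR, a given memory byte lands in the same register byte, so reinterpreting the register at another SEW<=CLSTR needs no data movement. The sketch below checks that reading. It assumes the layout rule implied by the diagrams later in this thread (runs of max(SEW, CLSTR) bits stay contiguous and are dealt round-robin across the VLEN/SLEN stripes); the function name and defaults are hypothetical, not part of any specification.

```python
def reg_byte_of_mem_byte(m, sew, vlen=256, slen=128, clstr=32):
    """Register byte holding memory byte m after a unit-stride load.

    Covers a single register only; all sizes are in bits. The rule is an
    assumption inferred from the diagrams in this thread.
    """
    i, b = divmod(m, sew // 8)               # element index, byte inside element
    unit = max(sew, clstr)                   # contiguous run, in bits
    u, within = divmod(i, unit // sew)       # which run, position inside the run
    slot, stripe = divmod(u, vlen // slen)   # runs are dealt round-robin to stripes
    return stripe * (slen // 8) + slot * (unit // 8) + within * (sew // 8) + b


# Memory-byte -> register-byte map for one 256-bit register, CLSTR=32.
maps = {sew: [reg_byte_of_mem_byte(m, sew) for m in range(32)]
        for sew in (8, 16, 32, 64)}

print(maps[8] == maps[16] == maps[32])   # same map for every SEW <= CLSTR
print(maps[64] == maps[32])              # SEW > CLSTR uses a different map
```

Under the same assumed rule, setting slen equal to vlen makes the map the identity for every SEW, which is the SLEN=VLEN in-memory format.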

The clstr parameter is potentially writable, and for higher-end machines it appears very reasonable that they would provide at least two settings for CLSTR: byte and XLEN.
This would provide in-memory alignment at XLEN for code that is not sure of its dependence on it, and an optimization for widening/narrowing at SEW<XLEN for code that is sure it does not depend on in-memory format for that section of code.

Because clstr is potentially writable, software can avoid performance penalties by other means as well, and leverage other potential structural benefits. This turns a liability into a feature.

I have a pending proposal for exactly that idea: “It's not a bug, it's a feature”.
It enables clstr for SLEN=VLEN implementations as well, and allows addressing of even/odd groupings for SLEN<VLEN, too.



On 2020-05-26 5:03 p.m., David Horner via lists.riscv.org wrote:

For those not on GitHub, I posted this to #461:

I gather what was missing from this were examples.
I prefer to consider clstr a dynamic parameter, for which some implementations will support a range of values.

However, for the sake of examples we can consider the cases where CLSTR=32.
So after a load:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                                  7 6 5 4                         3 2 1 0 SEW=8b

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b

VLEN=256b, SLEN=64, vl=13, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                        C         B A 9 8         7 6 5 4         3 2 1 0 SEW=8b

  

                7       3       6       2       5       1       4       0 SEW=32b

                        B               A               9       C       8  @ reg+1
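The diagrams above appear to follow one rule: runs of max(SEW, CLSTR) bits stay contiguous, and successive runs are dealt round-robin across the VLEN/SLEN stripes. The sketch below encodes that rule as I read it from the diagrams; it is an inference, not text from the spec or from #461, and the function names and default parameters are hypothetical.

```python
def reg_byte_offset(i, sew, vlen=256, slen=128, clstr=32):
    """Return (register offset, byte offset) of element i; sizes in bits.

    Assumed rule, inferred from the diagrams: runs of max(SEW, CLSTR) bits
    stay contiguous and are dealt round-robin across the VLEN/SLEN stripes.
    """
    unit = max(sew, clstr)                   # contiguous run, in bits
    u, within = divmod(i, unit // sew)       # which run, position inside the run
    reg, u = divmod(u, vlen // unit)         # runs past one register go to reg+1
    slot, stripe = divmod(u, vlen // slen)   # round-robin over the stripes
    byte = stripe * (slen // 8) + slot * (unit // 8) + within * (sew // 8)
    return reg, byte


def layout(vl, sew, **params):
    """(register offset, byte offset) for elements 0..vl-1."""
    return [reg_byte_offset(i, sew, **params) for i in range(vl)]


# Byte offsets for the first example (VLEN=256b, SLEN=128, vl=8, CLSTR=32).
for sew in (8, 16, 32):
    print(f"SEW={sew}b", [hex(b) for _, b in layout(8, sew)])

# Second example: elements 8..C at SEW=32b land in reg+1.
print(layout(13, 32, slen=64)[8:])
```

Each printed offset is the low byte of the element, which corresponds to the column where the element number sits in the diagrams.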


By inspection, unary and binary single-SEW operations do not affect order.
However, a widening operation gives EEW=16 and EEW=64 respectively, which will yield:

VLEN=256b, SLEN=128, vl=8, CLSTR=32

Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0

                            7   6   3   2                   5   4   1   0 SEW=16b

                7       5       3       1       6       4       2       0 SEW=32b


                        3               1               2               0 SEW=64b

                        7               5               6               4  @ reg+1
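To make the widening cost concrete, the sketch below lists the elements whose source stripe (at SEW) differs from their destination stripe (at 2*SEW). It assumes the layout rule inferred from the diagrams (runs of max(SEW, CLSTR) bits stay contiguous and are dealt round-robin across the VLEN/SLEN stripes); the helper names and defaults are hypothetical.

```python
def stripe_of(i, sew, vlen=256, slen=128, clstr=32):
    """Stripe (SLEN partition) holding element i; sizes in bits.

    Assumed rule, inferred from the diagrams in this thread.
    """
    unit = max(sew, clstr)                    # contiguous run, in bits
    u = (i * sew // unit) % (vlen // unit)    # run index within its register
    return u % (vlen // slen)                 # runs are dealt round-robin


def widening_crossers(vl, sew, **params):
    """Elements that change stripe when widened from SEW to 2*SEW."""
    return [i for i in range(vl)
            if stripe_of(i, sew, **params) != stripe_of(i, 2 * sew, **params)]


print(widening_crossers(8, 8))             # [2, 3, 4, 5]  SEW=8  < CLSTR=32
print(widening_crossers(8, 16))            # [1, 2, 5, 6]  SEW=16 < CLSTR=32
print(widening_crossers(8, 32))            # []            SEW=32 = CLSTR
print(widening_crossers(8, 8, clstr=8))    # []            CLSTR set to byte
```

Under this reading, only widening from SEW<CLSTR moves data between stripes, and setting CLSTR to byte removes the crossings at the cost of the shared in-memory layout.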


Narrowing works in reverse.
When SLEN=VLEN, clstr is irrelevant and effectively infinite, as there is no other SLEN group to advance to, so the current SLEN chunk has to be used (in round-robin fashion).
Thank you for the template to use.
I don’t think SLEN = 1/4 VLEN has to be diagrammed.
And of course, store also works in reverse of load.
     
On 2020-05-26 11:17 a.m., David Horner via lists.riscv.org wrote:

On Tue, May 26, 2020, 04:38 , <krste@...> wrote:

.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting.

Correct.

I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 * ELEN words.

I don't understand the reason for this constraint.

E.g.,
when ELEN=32, you can pack four contiguous bytes in ELEN, but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

I still don't get it.

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b
clstr is not a count but a size.
When CLSTR is 32, this last row is
                7       5       3       1       6       4       2       0 SEW=ELEN=32b
If I understood your diagram correctly.

See #461. It is effectively what SLEN was under v0.8. But potentially configurable.

I'm doing this from my cell phone. I'll work it up better when I'm at my laptop.

