For those not on GitHub, I posted this to #461:
CLSTR can be considered a progressive SLEN=VLEN switch.
Rather than the all-or-nothing in-memory compatibility that the
SLEN=VLEN switch provides, clstr provides either a fixed or
variable degradation for widening operations on SEW<CLSTR.
With a fixed clstr, all operations run at peak performance except
mixed-SEW widening/narrowing operations, and of those only the
ones with SEW<CLSTR.
In-memory and in-register formats align when SEW<=CLSTR.
Software is fully aware of the mapping, and already accommodates
this behaviour for many existing architectures. (This is analogous
to big-endian vs. little-endian in many respects, although with
big-endian all the bytes are present at each SEW level.)
The clstr parameter is potentially writable, and for the
higher-end machines it appears very reasonable that they would
provide at least two settings for CLSTR: byte and XLEN.
This would provide in-memory alignment at XLEN for code that is
not sure of its dependence on it, and an optimization for
widening/narrowing at SEW<XLEN for code that is sure it does
not depend on in-memory format for that section of code.
Because clstr is potentially writable, software can also avoid
performance penalties by other means, and leverage other
potential structural benefits. This turns a liability into a
feature.
I have a pending proposal for exactly that idea: “It’s not a bug,
it’s a feature”.
That proposal also enables clstr for SLEN=VLEN implementations, and
allows addressing of even/odd groupings for SLEN<VLEN, too.
On 2020-05-26 5:03 p.m., David Horner via lists.riscv.org wrote:
for those not on Github I posted this to #461:
I gather that what was missing from this was examples.
I prefer to consider clstr as a dynamic parameter, for which some
implementations will support a range of values.
However, for the sake of examples we can consider the case where
CLSTR=32.
So after a load:
VLEN=256b, SLEN=128, vl=8, CLSTR=32
Byte 1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                              7 6 5 4                         3 2 1 0 SEW=8b
                        7   6   3   2                   5   4   1   0 SEW=16b
            7       5       3       1       6       4       2       0 SEW=32b
VLEN=256b, SLEN=64, vl=13, CLSTR=32
Byte 1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                    C         B A 9 8         7 6 5 4         3 2 1 0 SEW=8b
            7       3       6       2       5       1       4       0 SEW=32b
                    B               A               9       C       8 @ reg+1
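A minimal Python sketch of the mapping that the two load diagrams above appear to follow. The rule is inferred from the diagrams, not taken from any spec text: elements are packed contiguously into clusters of max(SEW, CLSTR) bits, and successive clusters are dealt round-robin across the SLEN chunks, spilling into the next register when a chunk is full. The function name and default parameters are hypothetical.

```python
def element_byte_offset(i, sew, vlen=256, slen=128, clstr=32):
    """Return (register_index, byte_offset) of element i.

    Assumed rule (inferred from the diagrams): clusters of
    max(sew, clstr) bits, dealt round-robin across SLEN chunks.
    All sizes are in bits and are assumed to be powers of two.
    """
    unit = max(sew, clstr)          # cluster size in bits
    per_cluster = unit // sew       # elements packed into one cluster
    n_chunks = vlen // slen         # SLEN chunks per register
    slots = slen // unit            # clusters that fit in one SLEN chunk

    cluster, within = divmod(i, per_cluster)
    chunk = cluster % n_chunks      # round-robin over SLEN chunks
    reg, slot = divmod(cluster // n_chunks, slots)
    offset = (chunk * slen + slot * unit + within * sew) // 8
    return reg, offset


# SLEN=128, SEW=16b: byte offset of each of the 8 elements
print([hex(element_byte_offset(i, 16)[1]) for i in range(8)])
# → ['0x0', '0x2', '0x10', '0x12', '0x4', '0x6', '0x14', '0x16']
```

Under this rule, element 0xC of the SLEN=64, SEW=32b case lands at byte 4 of reg+1, which matches the last row of the second diagram.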
By inspection, unary and binary single-SEW operations do not affect
order.
However, a widening operation gives EEW=16 and 64 respectively,
which will yield:
VLEN=256b, SLEN=128, vl=8, CLSTR=32
Byte 1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                        7   6   3   2                   5   4   1   0 SEW=16b
            7       5       3       1       6       4       2       0 SEW=32b
                    3               1               2               0 SEW=64b
                    7               5               6               4 @ reg+1
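To make the cost of widening concrete, here is a small Python sketch that lists which elements change SLEN chunk when going from SEW to 2*SEW. It assumes the same layout rule the diagrams suggest (clusters of max(SEW, CLSTR) bits dealt round-robin across SLEN chunks); the helper names are hypothetical.

```python
def slen_chunk(i, sew, vlen=256, slen=128, clstr=32):
    """SLEN chunk holding element i, under the assumed cluster rule."""
    unit = max(sew, clstr)              # cluster size in bits
    cluster = i // (unit // sew)        # cluster that element i falls in
    return cluster % (vlen // slen)     # clusters go round-robin over chunks


def crossing_elements(sew, vl, **cfg):
    """Elements whose SLEN chunk differs between SEW and 2*SEW."""
    return [i for i in range(vl)
            if slen_chunk(i, sew, **cfg) != slen_chunk(i, 2 * sew, **cfg)]


print(crossing_elements(8, 8))    # → [2, 3, 4, 5]
print(crossing_elements(16, 8))   # → [1, 2, 5, 6]
print(crossing_elements(32, 8))   # → []
```

With CLSTR=32, only the widenings that start from SEW<CLSTR move elements between SLEN chunks; starting from SEW=32 nothing crosses.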
Narrowing works in reverse.
When SLEN=VLEN, clstr is irrelevant and effectively infinite, as
there is no other SLEN group into which to advance, so the current
SLEN chunk has to be used (in round-robin fashion).
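The SLEN=VLEN case can be checked with a short Python sketch. It assumes the layout rule inferred from the diagrams (clusters of max(SEW, CLSTR) bits dealt round-robin across SLEN chunks); with a single chunk the round-robin has nowhere else to go, so every CLSTR value should give plain memory order. The function names are hypothetical.

```python
def flat_byte_offset(i, sew, vlen, slen, clstr):
    """Byte offset of element i, counted across a register group."""
    unit = max(sew, clstr)                      # cluster size in bits
    n_chunks = vlen // slen                     # SLEN chunks per register
    cluster, within = divmod(i, unit // sew)
    chunk = cluster % n_chunks
    reg, slot = divmod(cluster // n_chunks, slen // unit)
    return reg * vlen // 8 + (chunk * slen + slot * unit + within * sew) // 8


def matches_memory_order(sew, clstr, vlen=256, vl=16):
    """True if, with SLEN=VLEN, element i sits at byte i*SEW/8."""
    return all(flat_byte_offset(i, sew, vlen, vlen, clstr) == i * sew // 8
               for i in range(vl))


print(all(matches_memory_order(sew, clstr)
          for sew in (8, 16, 32, 64)
          for clstr in (8, 32, 64)))
# → True
```

For contrast, with SLEN=128 and CLSTR=32 the same function places SEW=8b element 4 at byte 0x10 rather than byte 4.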
Thank you for the template to use.
I don’t think SLEN = 1/4 VLEN has to be diagrammed.
And of course, store also works in reverse of load.
@David-Horner
On 2020-05-26 11:17 a.m., David Horner via lists.riscv.org wrote:
On Tue, May 26, 2020, 04:38, <krste@...> wrote:
----------------------------------------------------------------------
I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that
to avoid casting.
Correct.
I don't think this can work.
First, SLEN has to be big enough to hold ELEN/8 * ELEN words.
I don't understand the reason for this constraint.
E.g., when ELEN=32, you can pack four contiguous bytes in ELEN, but
then require SLEN to have space for four ELEN words to avoid either
wires crossing SLEN partitions, or requiring multiple cycles to
compute small vectors (v0.8 design).
I still don't get it.
VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte 1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                              7 6 5 4                         3 2 1 0 SEW=8b
            7       6       5       4       3       2       1       0 SEW=ELEN=32b
clstr is not a count but a size.
When CLSTR is 32 this last row is
            7       5       3       1       6       4       2       0 SEW=ELEN=32b
if I understood your diagram correctly.
See #461.
It is effectively what SLEN was under v0.8, but potentially
configurable.
I'm doing this from my cell phone. I'll work it up better when I'm
at my laptop.