Hi all,
I was wondering whether the group has already considered (and rejected) the following
register layout proposal.
In this scheme there is no SLEN parameter; instead, the layout is defined solely
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at byte n of register j of the group (all indices starting at 0), as follows:
n = (i div LMUL)*SEW/8
j = (i mod LMUL) when LMUL > 1, else j = 0
where 'div' is integer division, e.g., 7 div 4 = 1. When LMUL < 1, i div LMUL is
simply i/LMUL, e.g., 3 div 1/2 = 6.
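For anyone who wants to experiment with the mapping, here is a minimal Python sketch of the two formulas above. The function name element_position and the use of Fraction to represent fractional LMUL are illustrative choices only, not part of the proposal or of any spec.

```python
from fractions import Fraction

def element_position(i, sew, lmul):
    """Return (j, n) for element i: register within the group, starting byte.

    sew is in bits; lmul may be an int or a Fraction such as Fraction(1, 2).
    """
    lmul = Fraction(lmul)
    # 'div' is floor division; for LMUL < 1 this is just i / LMUL.
    n = int(i // lmul) * sew // 8
    # Elements only spread across several registers when LMUL > 1.
    j = int(i % lmul) if lmul > 1 else 0
    return j, n
```

For example, element_position(7, 8, 4) places element 7 in the fourth register of the group (j = 3) at byte 1, which matches the SEW=8b, LMUL=4 example.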
As shown in the examples below, with this scheme the register layout is the same
as the memory layout when LMUL=1 (similar to SLEN=VLEN). When LMUL>1, elements
are allocated "vertically" across the register group (similar to SLEN=SEW), and
when LMUL<1, elements are evenly spaced out across the register (similar to
SLEN=SEW/LMUL):
VLEN=128b, SEW=8b, LMUL=1
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n] F E D C B A 9 8 7 6 5 4 3 2 1 0
VLEN=128b, SEW=16b, LMUL=1
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]   7   6   5   4   3   2   1   0
VLEN=128b, SEW=32b, LMUL=1
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       3       2       1       0
VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1
VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1
VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1
VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[4*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[4*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[4*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[4*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3
VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]    1C  18  14  10   C   8   4   0
v[4*n+1]  1D  19  15  11   D   9   5   1
v[4*n+2]  1E  1A  16  12   E   A   6   2
v[4*n+3]  1F  1B  17  13   F   B   7   3
VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]         C       8       4       0
v[4*n+1]       D       9       5       1
v[4*n+2]       E       A       6       2
v[4*n+3]       F       B       7       3
VLEN=128b, SEW=8b, LMUL=1/2
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n] - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0
VLEN=128b, SEW=16b, LMUL=1/2
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]   -   3   -   2   -   1   -   0
VLEN=128b, SEW=32b, LMUL=1/2
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -       1       -       0
VLEN=128b, SEW=8b, LMUL=1/4
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n] - - - 3 - - - 2 - - - 1 - - - 0
VLEN=128b, SEW=16b, LMUL=1/4
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]   -   -   -   1   -   -   -   0
VLEN=128b, SEW=32b, LMUL=1/4
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -       -       -       0
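The tables above can be regenerated for other VLEN/SEW/LMUL combinations with a short Python sketch along the following lines. The helper names layout and show, the "reg+j" row labels, and the choice to print one SEW-wide slot per cell are assumptions made for illustration only.

```python
from fractions import Fraction

def layout(vlen, sew, lmul):
    """Rows of one register group, most significant slot first.

    Each row is a list of SEW-wide slots holding an element index,
    or None where the slot is unused (possible when LMUL < 1).
    """
    lmul = Fraction(lmul)
    vlmax = int(vlen * lmul / sew)      # elements held by the group
    regs = max(1, int(lmul))            # registers in the group
    slots = vlen // sew                 # SEW-wide slots per register
    rows = [[None] * slots for _ in range(regs)]
    for i in range(vlmax):
        n = int(i // lmul) * sew // 8   # starting byte within the register
        j = int(i % lmul) if lmul > 1 else 0
        rows[j][n * 8 // sew] = i
    return [row[::-1] for row in rows]

def show(vlen, sew, lmul):
    """Print the group in the same orientation as the tables (byte 0 on the right)."""
    for j, row in enumerate(layout(vlen, sew, lmul)):
        cells = " ".join("-" if e is None else format(e, "X") for e in row)
        print(f"reg+{j}: {cells}")
```

Calling show(128, 16, Fraction(1, 2)) or show(128, 32, 4) should reproduce the corresponding examples, with element indices printed in hexadecimal as in the tables.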
The benefit of this scheme is that software always knows the layout of the elements
in the register just by programming SEW and LMUL, as this is now the same for
all implementations, and thus can optimize code accordingly if it so wishes.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.
SEW/LMUL=32:
VLEN=128b, SEW=8b, LMUL=1/4
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n] - - - 3 - - - 2 - - - 1 - - - 0
VLEN=128b, SEW=16b, LMUL=1/2
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]   -   3   -   2   -   1   -   0
VLEN=128b, SEW=32b, LMUL=1
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       3       2       1       0
SEW/LMUL=16:
VLEN=128b, SEW=8b, LMUL=1/2
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n] - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0
VLEN=128b, SEW=16b, LMUL=1
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]   7   6   5   4   3   2   1   0
VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1
SEW/LMUL=8:
VLEN=128b, SEW=8b, LMUL=1
Byte F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n] F E D C B A 9 8 7 6 5 4 3 2 1 0
VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1
VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]         C       8       4       0
v[4*n+1]       D       9       5       1
v[4*n+2]       E       A       6       2
v[4*n+3]       F       B       7       3
SEW/LMUL=4:
VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1
VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]    1C  18  14  10   C   8   4   0
v[4*n+1]  1D  19  15  11   D   9   5   1
v[4*n+2]  1E  1A  16  12   E   A   6   2
v[4*n+3]  1F  1B  17  13   F   B   7   3
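The in-lane claim illustrated by the groups above can also be checked mechanically. The sketch below is one possible reading of it: "in-lane" is interpreted as "element i of the source and element i of the destination start inside the same max(src_EEW, dst_EEW)-wide byte range of a register", ignoring which register of the group holds them. That interpretation, the function names, and the default VLEN of 128 are assumptions made for illustration.

```python
from fractions import Fraction

def start_byte(i, eew, ratio):
    """Starting byte of element i when EEW/LMUL == ratio (LMUL is derived)."""
    lmul = Fraction(eew, ratio)
    return int(i // lmul) * eew // 8

def stays_in_lane(src_eew, dst_eew, ratio, vlen=128):
    """True if every src/dst element pair shares one max(EEW)-wide byte lane."""
    lane_bytes = max(src_eew, dst_eew) // 8
    vl = vlen // ratio   # VLMAX = VLEN * LMUL / SEW = VLEN / ratio
    return all(
        start_byte(i, src_eew, ratio) // lane_bytes
        == start_byte(i, dst_eew, ratio) // lane_bytes
        for i in range(vl)
    )
```

For instance, stays_in_lane(8, 32, 8) compares the SEW=8b, LMUL=1 layout with the SEW=32b, LMUL=4 layout from the SEW/LMUL=8 group.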
Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
allow software to eliminate any typecast instructions). When LMUL>1 or LMUL<1,
loads and stores do need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.
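To make the shuffle concrete, the following Python sketch builds the byte mapping a unit-stride load of VLMAX elements would have to perform under this layout. It assumes little-endian byte order within each element and contiguous elements in memory; load_byte_map and is_identity are hypothetical helper names, not anything from the spec.

```python
from fractions import Fraction

def load_byte_map(vlen, sew, lmul):
    """Map memory byte offset -> (register in group, byte in register)."""
    lmul = Fraction(lmul)
    esize = sew // 8                       # element size in bytes
    vlmax = int(vlen * lmul / sew)
    mapping = {}
    for i in range(vlmax):
        n = int(i // lmul) * esize         # starting byte in the register
        j = int(i % lmul) if lmul > 1 else 0
        for b in range(esize):             # little-endian: byte b stays at offset b
            mapping[i * esize + b] = (j, n + b)
    return mapping

def is_identity(vlen, sew, lmul):
    """True if the load needs no byte shuffle at all."""
    return all(dst == (0, src)
               for src, dst in load_byte_map(vlen, sew, lmul).items())
```

With these definitions, is_identity holds for LMUL=1 at any SEW, while for LMUL=2 or LMUL=1/2 the map is a genuine permutation or spreading of bytes.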
So, I think this scheme has the same benefits and similar implementation costs
to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that
SLEN=VLEN does not need to space out elements when loading/storing with LMUL<1).
For machines with long vectors, its implementation costs are similar to those of
the v0.9 SLEN<VLEN scheme (which is the whole point of avoiding SLEN=VLEN), with
the added benefit that the layout is architected and we do not introduce
fragmentation to the ecosystem.
Thanks,
Grigorios Magklis