Vector Task Group minutes 2020/5/15 - V0.8 design with SLEN=8


David Horner
 

I have some suggestions for the reasons for moving from v0.8 vertical striping to v0.9 horizontal SLEN (interleave)

Under v0.8
A)   When vl < VLEN/SEW*LMUL, the top elements are not filled.
            This can lead to underutilization of the top lanes.
            Even when vl is 1/2 or less of the maximum, all registers in the group are referenced, so execution is slower and uses more power in the general case.

           Your proposal does not

B)       When LMUL>1, SLEN determines the in-memory to in-register alignment.
            As SLEN is usually greater than or equal to XLEN, this is usually manageable by compilers.
                Indeed, it has been proposed as a "poor man's SLEN shuffle"

C)       Various aspects were tied to SEW/LMUL ratio.
             Notably mask alignment, but the ratio also had to be held constant to keep VLMAX unchanged and thus vl unaffected.

D)       Vertical striping, as a means to facilitate mixed-SEW operations, forces a different structure for each of LMUL=2, 4 and 8.
            This complex model is required even for simple machines, and only power-of-2 values of LMUL are possible.
           
E)         Fractional LMUL does not emerge easily out of this design, especially as it relates to SLEN. 

With v0.9 and SLEN=VLEN, all these characteristics could be eliminated, providing a very simple model for simple implementations.
            Each level of LMUL has the same "format". LMULs of 3, 5, 6 and 7 are possible in software and potentially with hardware support.

      As we know, higher-performance implementations that need SLEN<VLEN are not so simple.
            But the fundamental format at each LMUL level is the same, allowing LMULs of 3, 5, 6 and 7.

I expect there are other aspects that drove the decision, and Krste may be at liberty to share them.

           
Unfortunately, v0.8 with SLEN=8 doesn't solve any of these problems, and it exacerbates item B, making it as problematic as in v0.9 with SLEN<VLEN.


On 2020-05-27 1:13 p.m., David Horner via lists.riscv.org wrote:
This is v0.8 with SLEN=8.



On Wed, May 27, 2020, 07:59 Grigorios Magklis, <grigorios.magklis@...> wrote:
Hi all,

I was wondering if the group has considered before (and rejected) the following
register layout proposal.

In this scheme, there is no SLEN parameter, instead the layout is solely defined
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at the n-th byte of the j-th register (both values starting at 0), as follows:

  n = (i div LMUL)*SEW/8
  j = (i mod LMUL) when LMUL > 1, else j = 0

where 'div' is integer division, e.g., 7 div 4 = 1.
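
The mapping above can be sketched in a few lines of Python. This is only an illustration of the two formulas as written in this message; the function name `element_position` is hypothetical, and LMUL is represented as a `Fraction` so that fractional values such as 1/2 divide exactly.

```python
from fractions import Fraction

def element_position(i, sew, lmul):
    """Return (j, n): register index within the group and starting byte
    for element i, following n = (i div LMUL)*SEW/8 and
    j = (i mod LMUL) when LMUL > 1, else 0.
    sew is in bits; lmul may be an int or a Fraction such as Fraction(1, 2)."""
    lmul = Fraction(lmul)
    # floor(i / LMUL), computed exactly for integer and fractional LMUL
    i_div_lmul = (i * lmul.denominator) // lmul.numerator
    n = i_div_lmul * (sew // 8)
    # for LMUL > 1 the value is an integer, so the numerator is LMUL itself
    j = i % lmul.numerator if lmul > 1 else 0
    return j, n

print(element_position(2, 8, 2))                # (0, 1)
print(element_position(7, 32, 4))               # (3, 4)
print(element_position(1, 16, Fraction(1, 2)))  # (0, 4)
```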

As shown in the examples below, with this scheme, when LMUL=1 the register
layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1
elements are allocated "vertically" across the register group (similar to
SLEN=SEW), and when LMUL<1 elements are evenly spaced-out across the register
(similar to SLEN=SEW/LMUL):

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[4*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[4*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[4*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[4*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3

VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]    1C  18  14  10   C   8   4   0
v[4*n+1]  1D  19  15  11   D   9   5   1
v[4*n+2]  1E  1A  16  12   E   A   6   2
v[4*n+3]  1F  1B  17  13   F   B   7   3

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]         C       8       4       0
v[4*n+1]       D       9       5       1
v[4*n+2]       E       A       6       2
v[4*n+3]       F       B       7       3


VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       1       -       0


VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   -   -   1   -   -   -   0

VLEN=128b, SEW=32b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       -       -       0
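
The tables above can be regenerated mechanically from the two formulas. The following is a minimal sketch, assuming VLMAX = VLEN*LMUL/SEW and a hypothetical helper name `register_group_layout`; it returns, for each register in the group, a list indexed by byte number (byte 0 first) holding the element index stored there, or `None` for an unused byte (shown as `-` in the tables).

```python
from fractions import Fraction

def register_group_layout(vlen, sew, lmul):
    """Build the byte-to-element map for one register group.
    vlen and sew are in bits; lmul may be an int or a Fraction."""
    lmul = Fraction(lmul)
    nregs = max(1, lmul.numerator // lmul.denominator)  # 1 register when LMUL <= 1
    bytes_per_reg = vlen // 8
    vlmax = int(vlen * lmul / sew)                      # assumed VLMAX definition
    regs = [[None] * bytes_per_reg for _ in range(nregs)]
    for i in range(vlmax):
        n = ((i * lmul.denominator) // lmul.numerator) * (sew // 8)
        j = i % lmul.numerator if lmul > 1 else 0
        for b in range(sew // 8):                       # mark every byte of element i
            regs[j][n + b] = i
    return regs

# VLEN=128b, SEW=8b, LMUL=2: even elements in the first register, odd in the second
layout = register_group_layout(128, 8, 2)
print(layout[0][:4])  # [0, 2, 4, 6]
print(layout[1][:4])  # [1, 3, 5, 7]
```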

The benefit of this scheme is that software always knows the layout of the elements
in the register just by programming SEW and LMUL, as this is now the same for
all implementations, and thus can optimize code accordingly if it so wishes.

Can you explain the kind of code optimizations that are not as easy with v0.8 when SLEN is determined at run time?

Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.

SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]         C       8       4       0
v[4*n+1]       D       9       5       1
v[4*n+2]       E       A       6       2
v[4*n+3]       F       B       7       3


SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[4*n]      1C    18    14    10     C     8     4     0
v[4*n+1]    1D    19    15    11     D     9     5     1
v[4*n+2]    1E    1A    16    12     E     A     6     2
v[4*n+3]    1F    1B    17    13     F     B     7     3
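
The in-lane claim can be checked by brute force. The sketch below is an illustration only: the helper names are hypothetical, and "in-lane" is interpreted as element i of the source and element i of the destination starting inside the same max(src_EEW, dst_EEW)-wide container, counted by byte offset within a register.

```python
from fractions import Fraction

def start_byte(i, sew, lmul):
    """Starting byte n = (i div LMUL)*SEW/8 of element i within its register."""
    lmul = Fraction(lmul)
    return ((i * lmul.denominator) // lmul.numerator) * (sew // 8)

def stays_in_lane(src_sew, src_lmul, dst_sew, dst_lmul, vlen=128):
    """True if every element i of source and destination falls in the same
    container of width max(src_sew, dst_sew) bits."""
    container = max(src_sew, dst_sew) // 8
    # compare only the elements present in both operands
    vlmax = min(int(vlen * Fraction(src_lmul) / src_sew),
                int(vlen * Fraction(dst_lmul) / dst_sew))
    return all(
        start_byte(i, src_sew, src_lmul) // container
        == start_byte(i, dst_sew, dst_lmul) // container
        for i in range(vlmax)
    )

# SEW/LMUL held at 16 while widening from 8b to 32b
print(stays_in_lane(8, Fraction(1, 2), 32, 2))  # True
# SEW/LMUL changes from 8 to 16: elements drift out of lane
print(stays_in_lane(8, 1, 16, 1))               # False
```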

Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
for software to eliminate any typecast instructions).

LMUL=1 was always easy. It is LMUL!=1 that is harder for v0.8.

When LMUL>1 or LMUL<1,
then loads and stores need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.

I tend to agree with you.
But there can be many implementations with SLEN<VLEN, whereas by design there are no implementations with only LMUL=1.
Restricting LMUL <= 1 is still a major blow to the architecture.

So, I think this scheme has the same benefits and similar implementation costs
to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that
SLEN=VLEN does not need to space-out elements when load/storing with LMUL<1),
but also similar implementation costs to the v0.9 SLEN<VLEN scheme for machines
with long vectors (which is the whole point of avoiding SLEN=VLEN),

Costs may indeed be similar, but a solution has not yet been found.
Further, the burden of managing the mismatch between in-register and in-memory layouts is a big risk for acceptance of either v0.8 or v0.9.

but with
the benefit that the layout is architected and we do not introduce
fragmentation to the ecosystem.

I hope (and believe) we can find a solution for v0.9 that will not fragment the software ecosystem.

I greatly appreciate your contribution, but for the reasons I mentioned at the top, I think v0.9 is a better base from which to move forward.
I may have missed something very obvious; perhaps it lies in the kind of code optimizations you allude to above, which I have not understood.
Thanks again.

Thanks,
Grigorios Magklis

On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:

I appreciate your agreement with my analysis, Krste.  However, I wasn't 
drawing a conclusion.  I lean toward the conclusion that we keep the 
"new v0.9 scheme" below and live with casts.  But I wasn't fully sure 
and wanted to see where the discussion might go.  I suspect each of the 
extra gates for memory access and the slower speed of short vectors is 
sufficient by itself to argue pretty strongly against the "v0.8 scheme" 
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where 
bytes are contiguous within ELEN."  I worked out what I think are the 
outlines of a proof that it's not possible, but I thought I'd suggest 
what I did at a high level first and only try to make the proof more 
rigorous if necessary.

      Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:

I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

  Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4




Bruce Hoult
 

On Fri, May 29, 2020 at 10:27 AM David Horner <ds2horner@...> wrote:
I have some suggestions for the reasons for moving from v0.8 vertical striping to v0.9 horizontal SLEN (interleave)

Under  v0.8
A)   when vl < VLEN/SEW*LMUL the top elements are not filled.
            This can lead to under utilization of the top lanes.
            Even though vl is 1/2 or less of the max,  all registers in the group are referenced, and hence slower and more power use in the general case.

I seem to recall that at some point LMUL was only a suggestion and that if the requested vl was short (e.g. the last strip-mining loop on a long application vector) the vsetvl[i] instruction was free to reduce the requested LMUL.

Maybe that was back in 0.7, but I think it should still work with type punning as long as vl*EW is always the same (which it has to be anyway).