Vector Task Group minutes 2020/5/15 - V0.8 design with SLEN=8
David Horner
I have some suggestions as to the reasons for moving from the v0.8 vertical striping to the v0.9 horizontal SLEN (interleave) layout.
Under v0.8
A) When vl < VLEN/SEW*LMUL, the top elements are not filled.
This can lead to under-utilization of the top lanes.
Even when vl is half or less of the maximum, all registers in the group are referenced, and hence operations are slower and consume more power in the general case.
Your proposal does not address this.
B) When LMUL>1, SLEN determines the in-memory to in-register alignment.
As SLEN is usually greater than or equal to XLEN, this is usually manageable by compilers.
Indeed, it has been proposed as a "poor man's SLEN shuffle".
C) Various aspects were tied to the SEW/LMUL ratio.
Notably mask alignment, but the ratio was also required to keep VLMAX unchanged and thus vl unaffected.
D) The vertical striping as a means to facilitate mixed-SEW operations forces a different structure for each of LMUL=2, 4 and 8.
Even simple machines must implement this complex model, and only powers of 2 are possible for LMUL.
E) Fractional LMUL does not emerge easily out of this design, especially as it relates to SLEN.
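Point A can be made concrete with a toy model (my own sketch, not from the thread; the function names and the simplified striping rules are assumptions): under vertical striping, element i lands in register i mod LMUL, so even a half-length vl touches every register of the group, while a horizontal SLEN=VLEN layout fills registers in order and leaves the top ones untouched.

```python
import math

# Toy model of point A (my illustration, not from the spec).
# Vertical striping round-robins elements across the register group,
# so even a short vl touches every register; a horizontal SLEN=VLEN
# layout fills registers in order, leaving the top ones idle.

def regs_touched_vertical(vl, lmul):
    # element i lives in register i % lmul
    return min(vl, lmul)

def regs_touched_horizontal(vl, vlen_bits, sew_bits):
    # element i lives in register i // (elements per register)
    return math.ceil(vl / (vlen_bits // sew_bits))

# VLEN=128, SEW=32, LMUL=4 gives VLMAX=16; a half-length vl=8
# touches all 4 registers vertically but only 2 horizontally.
print(regs_touched_vertical(8, 4))            # 4
print(regs_touched_horizontal(8, 128, 32))    # 2
```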
With v0.9 and SLEN=VLEN, all these characteristics could be eliminated, providing a very simple model for simple implementations.
Each level of LMUL has the same "format"; LMULs of 3, 5, 6 and 7 are possible in software and potentially with hardware support.
As we know, higher-performance implementations that need to invoke SLEN<VLEN are not so simple.
But the fundamental format at each LMUL level is the same, still allowing LMULs of 3, 5, 6 and 7.
I expect there are other aspects that drove the decision, and Krste may be at liberty to share them.
Unfortunately, v0.8 with SLEN=8 doesn't solve any of these problems, and exacerbates item B, making it as problematic as v0.9 with SLEN<VLEN.
But there can be many implementations with SLEN<VLEN, yet by design no implementations with only LMUL=1.
Restricting LMUL to <= 1 is still a major blow to the architecture.
Further, the burden of managing the in-register to in-memory mismatch is a big risk for acceptance in either v0.8 or v0.9.
I greatly appreciate your contribution, but for the reasons I mentioned at top, I think v0.9 is a better base to try to move forward.
I may have missed something very obvious, and perhaps it resides in my missing the kind of code optimizations you allude to above.
Thanks again.
On 2020-05-27 1:13 p.m., David Horner via lists.riscv.org wrote:
Can you explain the kind of code optimizations that are not as easy with v0.8 when SLEN is determined at run time?
This is v0.8 with SLEN=8.
On Wed, May 27, 2020, 07:59 Grigorios Magklis, <grigorios.magklis@...> wrote:
Hi all,
I was wondering if the group has considered before (and rejected) the following register layout proposal.
In this scheme, there is no SLEN parameter; instead the layout is solely defined by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register group starts at the n-th byte of the j-th register (both values starting at 0), as follows:
n = (i div LMUL)*SEW/8
j = (i mod LMUL) when LMUL > 1, else j = 0
where 'div' is integer division, e.g., 7 div 4 = 1.
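The two formulas can be expressed as a small executable sketch (my reading of the proposal; the function name is mine, and the fractional-LMUL spacing follows the LMUL<1 tables below):

```python
def element_position(i, sew_bits, lmul):
    """Return (n, j): byte offset n within register j for element i
    of the group, per the formulas above. lmul may be fractional."""
    if lmul > 1:
        n = (i // lmul) * sew_bits // 8   # n = (i div LMUL)*SEW/8
        j = i % lmul                      # j = i mod LMUL
    else:
        # LMUL <= 1: single register, elements spaced SEW/LMUL bits apart
        n = int(i / lmul) * sew_bits // 8
        j = 0
    return n, j

# Element 7 of a SEW=16b, LMUL=2 group: byte 6 of the odd register.
print(element_position(7, 16, 2))    # (6, 1)
# Element 3 at SEW=8b, LMUL=1/4: byte 0xC of the single register.
print(element_position(3, 8, 0.25))  # (12, 0)
```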
As shown in the examples below, with this scheme, when LMUL=1 the register layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1 elements are allocated "vertically" across the register group (similar to SLEN=SEW), and when LMUL<1 elements are evenly spaced out across the register (similar to SLEN=SEW/LMUL):
VLEN=128b, SEW=8b, LMUL=1
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      7 6 5 4 3 2 1 0

VLEN=128b, SEW=32b, LMUL=1
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      3 2 1 0

VLEN=128b, SEW=8b, LMUL=2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    1E 1C 1A 18 16 14 12 10 E C A 8 6 4 2 0
v[2*n+1]  1F 1D 1B 19 17 15 13 11 F D B 9 7 5 3 1

VLEN=128b, SEW=16b, LMUL=2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    E C A 8 6 4 2 0
v[2*n+1]  F D B 9 7 5 3 1

VLEN=128b, SEW=32b, LMUL=2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    6 4 2 0
v[2*n+1]  7 5 3 1

VLEN=128b, SEW=8b, LMUL=4
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]    3C 38 34 30 2C 28 24 20 1C 18 14 10 C 8 4 0
v[4*n+1]  3D 39 35 31 2D 29 25 21 1D 19 15 11 D 9 5 1
v[4*n+2]  3E 3A 36 32 2E 2A 26 22 1E 1A 16 12 E A 6 2
v[4*n+3]  3F 3B 37 33 2F 2B 27 23 1F 1B 17 13 F B 7 3

VLEN=128b, SEW=16b, LMUL=4
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]    1C 18 14 10 C 8 4 0
v[4*n+1]  1D 19 15 11 D 9 5 1
v[4*n+2]  1E 1A 16 12 E A 6 2
v[4*n+3]  1F 1B 17 13 F B 7 3

VLEN=128b, SEW=32b, LMUL=4
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]    C 8 4 0
v[4*n+1]  D 9 5 1
v[4*n+2]  E A 6 2
v[4*n+3]  F B 7 3

VLEN=128b, SEW=8b, LMUL=1/2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      - 3 - 2 - 1 - 0

VLEN=128b, SEW=32b, LMUL=1/2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      - 1 - 0

VLEN=128b, SEW=8b, LMUL=1/4
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      - - - 1 - - - 0

VLEN=128b, SEW=32b, LMUL=1/4
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      - - - 0
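The tables can be reconstructed mechanically; this is a small self-contained script of mine (the function name and the list-of-slots representation are my own choices), applying the same per-element mapping:

```python
def layout(vlen, sew, lmul):
    """Element index (hex string) occupying each SEW-wide slot of each
    register in the group; '-' marks an unused slot. Slot 0 is the
    lowest-addressed slot, so rows read left-to-right from byte 0."""
    nregs = max(int(lmul), 1)          # registers in the group
    slots = vlen // sew                # SEW-wide slots per register
    regs = [['-'] * slots for _ in range(nregs)]
    for i in range(int(vlen // sew * lmul)):   # VLMAX elements
        if lmul > 1:
            j, slot = i % nregs, i // nregs    # stripe across the group
        else:
            j, slot = 0, int(i / lmul)         # spaced out when LMUL < 1
        regs[j][slot] = format(i, 'X')
    return regs

# SEW=32b, LMUL=2 on VLEN=128b: even elements in v[2*n], odd in v[2*n+1]
print(layout(128, 32, 2))    # [['0', '2', '4', '6'], ['1', '3', '5', '7']]
# SEW=32b, LMUL=1/2: two elements spaced out across one register
print(layout(128, 32, 0.5))  # [['0', '-', '1', '-']]
```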
The benefit of this scheme is that software always knows the layout of the elements in the register just by programming SEW and LMUL, as this is now the same for all implementations, and thus can optimize code accordingly if it so wishes.
LMUL=1 was always easy. It is the LMUL!=1 case that is harder for v0.8.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown below.
SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      - 3 - 2 - 1 - 0

VLEN=128b, SEW=32b, LMUL=1
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      3 2 1 0

SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      7 6 5 4 3 2 1 0

VLEN=128b, SEW=32b, LMUL=2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    6 4 2 0
v[2*n+1]  7 5 3 1

SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]      F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    E C A 8 6 4 2 0
v[2*n+1]  F D B 9 7 5 3 1

VLEN=128b, SEW=32b, LMUL=4
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]    C 8 4 0
v[4*n+1]  D 9 5 1
v[4*n+2]  E A 6 2
v[4*n+3]  F B 7 3

SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]    1E 1C 1A 18 16 14 12 10 E C A 8 6 4 2 0
v[2*n+1]  1F 1D 1B 19 17 15 13 11 F D B 9 7 5 3 1

VLEN=128b, SEW=16b, LMUL=4
Byte      F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]    1C 18 14 10 C 8 4 0
v[4*n+1]  1D 19 15 11 D 9 5 1
v[4*n+2]  1E 1A 16 12 E A 6 2
v[4*n+3]  1F 1B 17 13 F B 7 3
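The stay-in-lane claim can be checked mechanically. This sketch is my own (assuming ELEN=32 and the byte-offset formula n = (i div LMUL)*SEW/8 from above); it verifies that with SEW/LMUL held constant, every element starts in the same ELEN-wide lane of its register across the different width configurations:

```python
ELEN = 32  # assumed maximum element width for this check

def lane(i, sew, lmul):
    """ELEN-wide lane (within a register) holding element i's first byte."""
    n = int(i // lmul) * sew // 8   # byte offset: n = (i div LMUL)*SEW/8
    return n * 8 // ELEN

# SEW/LMUL = 16 in all three configurations (cf. the tables above):
# element i always starts in the same 32-bit lane of its register.
for i in range(8):
    assert lane(i, 8, 0.5) == lane(i, 16, 1) == lane(i, 32, 2)

print(lane(6, 32, 2))  # 3
```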
Additionally, because the in-register layout is the same as the in-memory layout when LMUL=1, there is no need to shuffle bytes when moving data in and out of memory, which may allow the implementation to optimize this case (and for software to eliminate any typecast instructions).
I tend to agree with you.
When LMUL>1 or LMUL<1, then loads and stores need to shuffle bytes around, but I think the cost of this is similar to what v0.9 requires with SLEN<VLEN.
But there can be many implementations with SLEN<VLEN but by design no implementations with only LMUL=1.
Restricting LMUL <= 1 is still a major blow to the architecture.
The costs may indeed be similar, but a solution is not yet settled.
So, I think this scheme has the same benefits and similar implementation costs to the v0.9 SLEN=VLEN scheme for machines with short vectors (except that SLEN=VLEN does not need to space out elements when loading/storing with LMUL<1), but also similar implementation costs to the v0.9 SLEN<VLEN scheme for machines with long vectors (which is the whole point of avoiding SLEN=VLEN),
Further, the burden of managing the in-register to in-memory mismatch is a big risk for acceptance in either v0.8 or v0.9.
I hope (and believe) we can find a solution for v0.9 that will not fragment the software ecosystem.
but with the benefit that the layout is architected and we do not introduce fragmentation to the ecosystem.
I greatly appreciate your contribution, but for the reasons I mentioned at top, I think v0.9 is a better base to try to move forward.
I may have missed something very obvious, and perhaps it resides in my missing the kind of code optimizations you allude to above.
Thanks again.
Thanks,
Grigorios Magklis
On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:
I appreciate your agreement with my analysis, Krste. However, I wasn't
drawing a conclusion. I lean toward the conclusion that we keep the
"new v0.9 scheme" below and live with casts. But I wasn't fully sure
and wanted to see where the discussion might go. I suspect each of the
extra gates for memory access and the slower speed of short vectors is
sufficient by itself to argue pretty strongly against the "v0.8 scheme"
- which is the only one I can see that might have the desired properties.
I also agree that I don't think it's possible to find "a design where
bytes are contiguous within ELEN." I worked out what I think are the
outlines of a proof that it's not possible, but I thought I'd suggest
what I did at a high level first and only try to make the proof more
rigorous if necessary.
Bill
On 5/26/20 1:37 AM, Krste Asanovic wrote:
I think Bill is right in his analysis, but I disagree with his
conclusion.
The scheme Bill is describing is basically the v0.8 scheme:
Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4
On Fri, May 29, 2020 at 10:27 AM David Horner <ds2horner@...> wrote:
I have some suggestions as to the reasons for moving from the v0.8 vertical striping to the v0.9 horizontal SLEN (interleave) layout.
Under v0.8
A) When vl < VLEN/SEW*LMUL, the top elements are not filled.
This can lead to under-utilization of the top lanes.
Even when vl is half or less of the maximum, all registers in the group are referenced, and hence operations are slower and consume more power in the general case.
I seem to recall that at some point LMUL was only a suggestion and that if the requested vl was short (e.g. the last strip-mining loop on a long application vector) the vsetvl[i] instruction was free to reduce the requested LMUL.
Maybe that was back in 0.7, but I think it should still work with type punning as long as vl*EW is always the same (which it has to be anyway).