Vector Task Group minutes 2020/5/15 - V0.8 design with SLEN=8
I have some suggestions as to the reasons for moving from v0.8 vertical striping to v0.9 horizontal SLEN (interleave).
A) When vl < VLEN/SEW*LMUL, the top elements are not filled.
This can lead to underutilization of the top lanes.
Even though vl is 1/2 or less of the max, all registers in the group are referenced, which in the general case is slower and uses more power.
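To put numbers on item A, here is a minimal sketch that only evaluates the VLEN/SEW*LMUL bound quoted above. The function names and the example machine (VLEN=128, SEW=32, LMUL=4) are my own assumptions for illustration; it does not model which registers or lanes are actually touched.

```python
def vlmax(vlen: int, sew: int, lmul: int) -> int:
    """Maximum element count of a register group: VLEN/SEW*LMUL."""
    return vlen // sew * lmul


def unfilled_slots(vl: int, vlen: int, sew: int, lmul: int) -> int:
    """Element slots at the top of the group that the current vl leaves empty."""
    limit = vlmax(vlen, sew, lmul)
    if not 0 <= vl <= limit:
        raise ValueError("vl must lie in [0, VLMAX]")
    return limit - vl


# Hypothetical machine: VLEN=128 bits, SEW=32 bits, LMUL=4.
print(vlmax(128, 32, 4))              # 16 element slots in the group
print(unfilled_slots(5, 128, 32, 4))  # 11 slots left empty when vl=5
```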
Your proposal does not
B) When LMUL>1, SLEN determines the in-memory to in-register alignment.
As SLEN is usually greater than or equal to XLEN, this is usually manageable by compilers.
Indeed, it has been proposed as a "poor man's SLEN shuffle".
C) Various aspects were tied to the SEW/LMUL ratio.
Notably mask alignment, but a constant ratio was also required to keep VLMAX unchanged and thus vl unaffected.
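The VLMAX half of item C follows directly from VLMAX = VLEN/SEW*LMUL: any two settings with the same SEW/LMUL ratio give the same VLMAX. A small sketch, assuming a hypothetical VLEN of 256 bits and integer LMUL only (mask alignment is not modelled here):

```python
from fractions import Fraction

VLEN = 256  # hypothetical register width in bits, chosen only for illustration


def vlmax(vlen: int, sew: int, lmul: int) -> int:
    """Maximum element count of a register group: VLEN/SEW*LMUL."""
    return vlen // sew * lmul


# (SEW, LMUL) pairs that all share the ratio SEW/LMUL = 8.
settings = [(8, 1), (16, 2), (32, 4), (64, 8)]

vlmax_by_setting = {}
for sew, lmul in settings:
    vlmax_by_setting[(sew, lmul)] = vlmax(VLEN, sew, lmul)
    print(f"SEW={sew:2d} LMUL={lmul} ratio={Fraction(sew, lmul)} "
          f"VLMAX={vlmax_by_setting[(sew, lmul)]}")

# Every setting reports the same VLMAX, so vl need not change between them.
same_vlmax = len(set(vlmax_by_setting.values())) == 1
print(same_vlmax)
```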
D) The vertical striping, as a means to facilitate mixed-SEW operations, forces a different structure for each of LMUL=2, 4 and 8.
Even simple machines require this complex model, and only powers of 2 are possible for LMUL.
E) Fractional LMUL does not emerge easily out of this design, especially as it relates to SLEN.
With v0.9 and SLEN=VLEN all these characteristics could be eliminated, providing a very simple model for simple implementations.
Each level of LMUL has the same "format". LMULs of 3,5,6 and 7 are possible in software and potentially with hardware support.
As we know, higher-performance implementations that need to invoke SLEN<VLEN are not so simple.
But the fundamental format at each LMUL level is the same allowing LMULs of 3,5,6 and 7.
I expect there are other aspects that drove the decision, and Krste may be at liberty to share them.
Unfortunately, v0.8 with SLEN=8 doesn't solve any of these problems, and exacerbates item B, making it as problematic as in v0.9 with SLEN<VLEN.
On 2020-05-27 1:13 p.m., David Horner via lists.riscv.org wrote:
Can you explain the kind of code optimizations that are not as easy with V0.8 when SLEN is determined at run time?
LMUL=1 was always easy. It is LMUL!=1 that is harder for v0.8.
I tend to agree with you.
But there can be many implementations with SLEN<VLEN, whereas by design there are no implementations with only LMUL=1.
Restricting LMUL <= 1 is still a major blow to the architecture.
Costs may indeed be similar, but the issue is not yet resolved.
Further, the burden of managing the mismatch between the in-register and in-memory layouts is a big risk for acceptance in either v0.8 or v0.9.
I hope (and believe) we can find a solution for v0.9 that will not fragment the software ecosystem.
I greatly appreciate your contribution, but for the reasons I mentioned at top, I think v0.9 is a better base to try to move forward.
I may have missed something very obvious, and perhaps it resides in my missing the kind of code optimizations you allude to above.
On Fri, May 29, 2020 at 10:27 AM David Horner <ds2horner@...> wrote:
I seem to recall that at some point LMUL was only a suggestion and that if the requested vl was short (e.g. the last strip-mining loop on a long application vector) the vsetvl[i] instruction was free to reduce the requested LMUL.
Maybe that was back in 0.7, but I think it should still work with type punning as long as vl*EW is always the same (which it has to be anyway).
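As a rough sketch of the behaviour recalled above: given a requested vl, pick the smallest power-of-2 LMUL whose VLMAX still covers it, never exceeding the LMUL that was asked for. The function name and the example VLEN/SEW are hypothetical, and I am not claiming any spec version defines vsetvl[i] this way.

```python
def smallest_lmul(vl: int, vlen: int, sew: int, requested_lmul: int) -> int:
    """Smallest power-of-2 LMUL (at most requested_lmul) with VLMAX >= vl.

    Hypothetical helper: if vl exceeds VLMAX even at requested_lmul,
    requested_lmul is returned and the caller would clamp vl as usual.
    """
    lmul = 1
    while lmul < requested_lmul and vlen // sew * lmul < vl:
        lmul *= 2
    return lmul


# Hypothetical machine: VLEN=128 bits, SEW=32 bits, software asks for LMUL=8.
print(smallest_lmul(32, 128, 32, 8))  # 8: a full strip needs the whole group
print(smallest_lmul(10, 128, 32, 8))  # 4: 10 elements fit in four registers
print(smallest_lmul(3, 128, 32, 8))   # 1: a short tail fits in one register
```

In the last strip-mining iteration this would let a short tail touch a single register rather than the whole group.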