Vector Task Group minutes 2020/5/15 - V0.8 design with SLEN=8
On Fri, May 29, 2020 at 10:27 AM David Horner <ds2horner@...> wrote:
I seem to recall that at some point LMUL was only a suggestion and that if the requested vl was short (e.g. the last strip-mining loop on a long application vector) the vsetvl[i] instruction was free to reduce the requested LMUL. Maybe that was back in 0.7, but I think it should still work with type punning as long as vl*EW is always the same (which it has to be anyway).
David Horner
I have some suggestions for the reasons for moving from v0.8 vertical striping to v0.9 horizontal SLEN (interleave).
Under v0.8:
A) When vl < VLEN/SEW*LMUL the top elements are not filled. This can lead to under-utilization of the top lanes. Even though vl is 1/2 or less of the max, all registers in the group are referenced, and hence the operation is slower and uses more power in the general case. Your proposal does not
B) When LMUL>1, SLEN determines the in-memory to in-register alignment. As SLEN is usually greater than or equal to XLEN, this is usually manageable by compilers. Indeed, it has been proposed as a "poor man's SLEN shuffle".
C) Various aspects were tied to the SEW/LMUL ratio. Notably mask alignment, but the ratio was also required to keep VLMAX unchanged and thus vl unaffected.
D) Vertical striping as a means to facilitate mixed-SEW operations forces a different structure for each of LMUL=2, 4 and 8. Even for simple machines this complex model is required. And only powers of 2 for LMUL are possible.
E) Fractional LMUL does not emerge easily out of this design, especially as it relates to SLEN.
With v0.9 and SLEN=VLEN all these characteristics could be eliminated, providing a very simple model for simple implementations. Each level of LMUL has the same "format". LMULs of 3, 5, 6 and 7 are possible in software and potentially with hardware support.
As we know, higher-performance implementations that need to invoke SLEN<VLEN are not so simple. But the fundamental format at each LMUL level is the same, allowing LMULs of 3, 5, 6 and 7.
I expect there are other aspects that drove the decision, and Krste may be at liberty to share them.
Unfortunately, v0.8 with SLEN=8 doesn't solve any of these problems, and exacerbates item B to make it as problematic as in v0.9 SLEN<VLEN.
On 2020-05-27 1:13 p.m., David Horner via lists.riscv.org wrote:
Can you explain the kind of code optimizations that are not as easy with V0.8 when SLEN is determined at run time?
LMUL=1 was always easy. It is the LMUL!=1 case that is harder for v0.8.
I tend to agree with you. But there can be many implementations with SLEN<VLEN, but by design no implementations with only LMUL=1. Restricting LMUL <= 1 is still a major blow to the architecture.
Costs may indeed be similar, but a solution is not yet resolved. Further, the burden of managing the in-register to in-memory mismatch is a big risk for acceptance in either v0.8 or v0.9.
I hope (and believe) we can find a solution for v0.9 that will not fragment the software ecosystem. I greatly appreciate your contribution, but for the reasons I mentioned at top, I think v0.9 is a better base to try to move forward.
I may have missed something very obvious, and perhaps it resides in my missing the kind of code optimizations you allude to above.
Thanks again.
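For readers following the arithmetic behind items A and C, here is a minimal Python sketch of the VLMAX relation (VLEN/SEW*LMUL) and a strip-mining loop. It is only an illustration: the function names, VLEN=128, and the 37-element application vector are assumptions, and the loop assumes the simplest vsetvl behaviour, vl = min(remaining, VLMAX), which an implementation is not obliged to follow exactly.

```python
def vlmax(vlen, sew, lmul):
    """Maximum element count of a register group: VLMAX = VLEN / SEW * LMUL."""
    return vlen // sew * lmul


def strip_mine(avl, vlen, sew, lmul):
    """Return the vl granted on each pass of a strip-mining loop.

    Assumption: vsetvl grants vl = min(remaining, VLMAX) on every pass.
    """
    limit = vlmax(vlen, sew, lmul)
    passes = []
    remaining = avl
    while remaining > 0:
        vl = min(remaining, limit)
        passes.append(vl)
        remaining -= vl
    return passes


VLEN = 128  # hypothetical implementation parameter, in bits

# Item C: holding the SEW/LMUL ratio constant leaves VLMAX (and so vl) unchanged.
same_ratio = [vlmax(VLEN, sew, lmul) for sew, lmul in [(8, 1), (16, 2), (32, 4), (64, 8)]]
print(same_ratio)  # [16, 16, 16, 16]

# Item A: the last pass over a 37-element vector has vl well below VLMAX.
print(strip_mine(37, VLEN, 32, 4))  # [16, 16, 5]
```

The final pass (vl=5 against VLMAX=16) is the short-vl case described in item A, where under v0.8 the whole register group is still referenced.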