Re: Vector Task Group minutes 2020/5/15 - precise layout not matter
On 2020-05-27 7:58 p.m., Guy Lemieux wrote:
On Wed, May 27, 2020 at 10:24 AM Guy Lemieux <glemieux@...> wrote:
> The precise data layout pattern does not matter. What matters is that a single distribution pattern is agreed upon to avoid fragmenting the software ecosystem.
I believe this can be weakened to require: selectable distribution patterns that are sufficiently compatible that they avoid fragmenting the software ecosystem.
On 2020-05-27 10:29 a.m., Guy Lemieux wrote:
> As a follow-up, the main goal of LMUL>1 is to get better storage efficiency out of the register file, allowing for slightly higher compute unit utilization.
> The memory system should not require LMUL>1 to get better bandwidth utilization. An advanced memory system can fuse back-to-back loads (or stores) to get improved bandwidth. Some memory systems may break up vector memory transfers into fixed-size quanta (eg, cache lines) anyways.
> Restricting LMUL=1 for loads/stores therefore primarily impacts instruction issue bandwidth and executable size. These shouldn’t be huge drawbacks.
> On Wed, May 27, 2020 at 6:56 AM Guy Lemieux via lists.riscv.org <glemieux=vectorblox.com@...> wrote:
> I support this scheme, but I would further add a restriction on loads/stores to only support LMUL=1 (no register groups). Instead, any data stored in a register group with LMUL!=1 must first be “cast” into registers with LMUL=1. To do this, special cast instructions would be required; likely this cast can be done in-place (same source and dest registers).
I am confused, as it appears you have backed off on this with your response to Nick on 2020-05-27, 7:58 p.m.
There the load can target multiple registers in the group.
My most puzzling concern is: what will these cast instructions do?
Presumably apply some mapping from in-memory to an internal in-register format?
But which one? Does the precise layout not really matter?
The different layouts considered each have advantages and disadvantages.
From a simplicity and software fragmentation perspective the v0.9 SLEN=VLEN is perfect.
SLEN can be completely ignored as it matches VLEN. The in-register format matches the in-memory format, even for LMUL>1.
Sweet. But as VLEN increases, the performance impacts increase non-linearly and substantially for an important target group.
So if high-performance/large-VLEN implementations are going to happen, some accommodation must occur.
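A minimal sketch of the layout difference, assuming the LMUL=1 striping rule from the v0.9 draft for SEW<=SLEN: the register is split into VLEN/SLEN sections and elements are dealt round-robin across those sections. The function names are hypothetical and this is only an illustration of that assumed rule, not spec text.

```python
def register_byte_offset(i, sew, vlen, slen):
    """Byte offset of element i inside one vector register (LMUL=1).

    Assumed rule: VLEN/SLEN sections, element i goes to section
    i % nsec, at slot i // nsec within that section. Requires SEW <= SLEN.
    """
    nsec = vlen // slen          # number of SLEN-wide sections
    section = i % nsec           # round-robin across sections
    slot = i // nsec             # position inside the chosen section
    return (section * slen + slot * sew) // 8


def layout(sew, vlen, slen):
    """Byte offsets of elements 0..VLEN/SEW-1, in element order."""
    return [register_byte_offset(i, sew, vlen, slen)
            for i in range(vlen // sew)]


# SLEN=VLEN: element i sits at byte i*SEW/8, the same as memory order.
print(layout(32, 256, 256))
# SLEN<VLEN: consecutive elements land in different sections.
print(layout(32, 256, 128))
```

With SLEN=VLEN the offsets are simply increasing, which is the "in-register matches in-memory" property; with SLEN<VLEN they are not, which is where a shuffle between memory and the register file comes from.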
> The key advantage of this new restriction is to remove all data shuffling from the interface between external memory and the register file — just transfer bytes in memory order. This vastly simplifies “basic” implementations by keeping data shuffling exclusively on the ALU side of the register file.
I think this is where you envision cast instructions providing the solution, but a restricted set of cast instructions that
- only operate on LMUL=1 structures
- are capable of occurring (if not restricted to occurring) on the ALU side rather than the memory side.
I don't believe either of these restrictions is necessary for the v0.9 design when SLEN=VLEN, as the LMUL>1 structure is identical to LMUL=1
and microarchitectures can fuse a following cast (if, as you suggest, it can be done in-place) to operate on the memory side if they so choose.
> With my additional restriction, the load/store side of an implementation is greatly simplified, allowing for simple implementations. The main drawback of my restriction is how to avoid the overhead of the cast instruction in an aggressive implementation? The cast instruction must rearrange data to translate between LMUL!=1 and LMUL=1 data layouts; my proposal requires these casts to be executed between any load/stores (which always assume LMUL=1) and compute instructions which use LMUL!=1.
I expect you envision these operators to be elective, only included where needed so performance is not affected adversely.
If so, I believe the fragmentation concern is reintroduced.
If not, then unique characteristics of the cast instructions are required as you allude, and I suspect target format will be important for that to happen.
> I think this can sometimes be done for "free" by carefully planning your compute instructions. For example, a series of vld instructions with LMUL=1 followed by a cast to LMUL>1 to the same register group destination can be macro-op fused.
So far there has been no cast magic-bullet.
Although I agree deferring casting (however it is done) to the ultimate operation is of considerable potential benefit, all of this is quite heavy-handed.
As it is not required for the simple v0.9 SLEN=VLEN model, industry adoption is certainly not assured.
> I don't think the same thing can be done for vst instructions, unless it macro-op fuses a longer sequence consisting of cast / vst / clear register group (or some other operation that overwrites the cast destination, indicating the cast is superfluous and only used by the stores).
I think you have done a good job at illustrating the complexity of optimizing for mixed SEW operators and especially the casting approach to "fix it".
Therefore, I am trending away from a cast instruction solution.
I have not delivered my promised comparison of alternatives.
The considerations you have provided have helped me to come closer to a formulation of it.
To me, CLSTR looks most promising.
But only as these alternatives are championed, discussed and analyzed do we get a better idea of the nature of this mixed SEW beast.
Guy
On Wed, May 27, 2020 at 10:13 AM David Horner <ds2horner@...> wrote:
> This is v0.8 with SLEN=8.
On 2020-05-27 7:58 p.m., Guy Lemieux wrote:
> Nick, thanks for that code snippet, it's really insightful.
> I have a few comments:
> a) this is for LMUL=8, the worst-case (most code bloat)
> b) this would be automatically generated by a compiler, so visuals are not meaningful, though code storage may be an issue
> c) the repetition of vsetvli and sub instructions is not needed; programmer may assume that all vector registers are equal in size
> d) the vsetvli / add / sub instructions have minimal runtime due to behaving like scalar operations
> e) the repetition 8 times (or whatever LMUL you want) for vle/vse and the add can be mimicked by a change in the ISA to handle a set of registers in a register group automatically, eg:
> instead of this 8 times for v0 to v7:
>     vle8.v v0, (a2)
>     add a2, a2, t2
> we can allow vle to operate on register groups, but done one register at a time in sequence, by doing this just once:
>     vle8.v v0, (a2), m8 // does 8 independent loads, loading to v0 from address a2, v1 from address a2+vl, v2 from address a2+2*vl, etc
> That way ALL of the code bloat is now gone.
It would appear your premise of applying casts only to LMUL=1 is also gone.
I may be completely wrong, so I would greatly appreciate your expounding on this.
How does this apply for v0.9 with SLEN=VLEN or SLEN<VLEN?
Does it only apply to v0.8?
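
For reference, the register-group load quoted above (one "vle8.v v0, (a2), m8" standing in for eight vle8.v / add pairs) can be modelled in a few lines of Python. This is only a sketch under stated assumptions: SEW=8 so that vl elements equal vl bytes, t2 holds vl, and memory is a flat byte string. The function names are hypothetical, and no register layout (SLEN) is modelled; each register simply receives its bytes in memory order.

```python
def unrolled_load(memory, base, vl_bytes, lmul):
    """Model of LMUL repetitions of: vle8.v vK, (a2) ; add a2, a2, t2."""
    regs = []
    addr = base                                        # a2
    for _ in range(lmul):
        regs.append(bytes(memory[addr:addr + vl_bytes]))  # vle8.v vK, (a2)
        addr += vl_bytes                               # add a2, a2, t2 (t2 = vl, assumed)
    return regs


def grouped_load(memory, base, vl_bytes, lmul):
    """Model of the proposed single instruction: register k of the group
    is loaded independently from address base + k*vl."""
    return [bytes(memory[base + k * vl_bytes: base + (k + 1) * vl_bytes])
            for k in range(lmul)]


memory = bytes(range(64))
# Both forms fill v0..v7 with the same bytes, in memory order.
print(unrolled_load(memory, 0, 8, 8) == grouped_load(memory, 0, 8, 8))
```

In this model the two forms produce identical register contents; the difference is only the instruction count (2*LMUL instructions versus one), which is the code-bloat point. It does not answer how the bytes would be arranged within each register when SLEN<VLEN.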