Re: Issue categorization - #460
David Horner
On 2020-07-01 4:33 a.m.,
krste@... wrote:
No problem, happy you did.
The very special nature and functionality of vsetvli justifies considering special encoding. All the following are valid considerations, none of which has been quantified.

I see no impact to the ABI itself. The ABI puts usage pressure and constraints on some registers; however, each encoding allows for 4 registers, of which at least 2 do not uniquely participate in the ABI (e.g. are user saved, and many are defined as temp).

An assembler macro could readily generate candidates and choose a default, so that is not overly onerous for assembler programmers. Tracking the used register is more challenging, but a symbolic register from a reserved set of temps would be one way of diminishing the intellectual challenge. Assembler programmers mastered 8086 and successors, so this is trivial in comparison.

As for compilers, the cost is much less than the perverse manipulations needed to continue to support the irregular register behaviours of 8086, 286 and 386. The application programmers using the compilers will be isolated from the impacts regarding code generation. Application debugging is not impacted, as the decision to allocate the registers used is made statically at compile time.

This is the most challenging to quantify. How far is far enough? What latent demand have we missed or misjudged?

I agree there are many other ways. They should also be quantified and compared to determine the appropriate approach. Specifically, you champion one below that, with tweaks, looks very promising.

A very important factor, I agree. Many technically superior approaches have been abandoned for this reason. I want to ensure we make an informed decision on the best long term approaches, whether to abandon for short term advantage or not.

I concur and have been an advocate of their inclusion as a feature and in vsetvli.

I also agree. I also believe the frequency of use for workloads in specific domains justifies inclusion in an immediate format.
However, I cannot quantify this, nor assess the tipping point at which inclusion into vtype vs vcsr applies.

I anticipate the vma/vta are not directly coupled to vsew/vlmul. Indeed, I anticipate that true need for tail unchanged is rare. At the other end of the spectrum, the overhead to toggle tail unchanged/agnostic for HPC performance gain is sufficiently low that it is warranted even including the recalculation of vl. There is a continuum through which there are tradeoffs in changing vsew/vlmul and leaving vta/vma unchanged.

There appear to be performance benefits to switching vta and vma as immediates in an instruction that does not update vl. Immediates provide the microarch benefit of instruction decode look ahead (over the in-register value of vsetvl). These, too, need consideration.

For me the most compelling reason for vma/vta inclusion in V1.0 is early visibility, acknowledgement and acceptance of the feature. For that same reason I leaned towards including them in the immediate of vsetvli. Now that the feature is already noted and accepted within the ecosystem, I lean towards reserving vsetvli immediates for those vtype elements that change vl.

A tradeoff, and thus no hard numbers for answers. In other words, YMMV. It depends upon the specific algorithm. My expectation is that most algorithms (even after maturity, as we can expect initial use to be simplistic) will be locally simple if not trivial, which will lend themselves to very long vl optimization.

My guess is LMUL of 6 is likely an important sweet spot for complex processes.

Not really. Although LMUL of 6 pairs with 3 and provides optimal correlation, using the technique on AVL below, LMUL of 5 can partially use LMUL of 3 with an overall space efficiency of 7/8ths, and LMUL of 7 can use LMUL of 4 with an efficiency of 10/11ths. Even as second class citizens they can make significant contributions to the ecosystem.
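To make the coupling argument above concrete, here is a minimal Python sketch. It assumes the vector spec's VLMAX = LMUL * VLEN / SEW and the simplest permitted rule vl = min(AVL, VLMAX); implementations are allowed other choices when AVL exceeds VLMAX. The function names and the VLEN of 128 bits are illustrative assumptions, and the non-power-of-two LMUL values are the hypothetical ones under discussion, not part of the current LMUL = 2**n model.

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul):
    """VLMAX = LMUL * VLEN / SEW, truncated to whole elements."""
    return int(Fraction(lmul) * vlen_bits / sew_bits)


def set_vl(avl, vlen_bits, sew_bits, lmul, vta=0, vma=0):
    """Simplified vl selection: vl = min(AVL, VLMAX).

    vta and vma are accepted but deliberately unused: they change how
    tail and masked-off elements are treated, not how many elements
    are active.
    """
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


if __name__ == "__main__":
    VLEN = 128  # assumed vector register width in bits
    for lmul in (3, 6, 8):  # 3 and 6 are hypothetical settings
        print(lmul, vlmax(VLEN, 32, lmul))
```

Under these assumptions, changing vsew or vlmul changes VLMAX and therefore possibly vl, while toggling vta or vma leaves vl untouched, which is the distinction drawn above between vtype elements that change vl and those that do not.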
Agreed, until you encounter the corner cases, especially changing from effective LMUL of 6 to 3, which, again, is what I'd consider the sweet spot for the feature.

How useful LMUL=3, 5, 6 and 7 will be is hard to assess. As it can be emulated (in many use cases conveniently and effectively) in the current LMUL = 2**n model, it is therefore not, on its own, a compelling reason to adopt rd/rs1 encoding. However, the issue is whether collectively the reasons justify preemptive inclusion or, at minimum, positioning.

As per usual, you have presented a masterful solution. This is essentially the non-transient variation of #423.

I'd suggest that
1) vsettyi (set vtype imm) be its name
2) only 15 bits be provided, with bits 29 through 25 encoded as zeros and the other values reserved (see the #456 transient solution) (we can transparently expand through bits 25 to 29 if that proves to be the best use of the bits)
3) vta and vma be moved to this instruction rather than remain in vsetvli. (I suspect a variant of ediv will modify vl and thus be most appropriate for vsetvli.)

With this in place, rd/rs1 encoding of immediates is much less compelling (almost dead).