Re: Issue categorization - #460

David Horner

On 2020-07-01 4:33 a.m., krste@... wrote:

Forgive top posting, but the email was long and I want to bring key
points forward for others.
No problem, happy you did.
#458/460 propose restricting allowed register numbers in a vsetvli to
give more bits to future vsetvli uses.  I just cannot see why we do
this for several reasons:

The very special nature and functionality of vsetvli justifies considering special encoding.
All the following are valid considerations.
None of them has been quantified.
1) the impact on ABI and compilers and assembly programmers by restricting registers in this way
    I see no impact to the ABI itself.
        The ABI puts usage pressure and constraints on some registers,
             however each encoding allows for 4 registers
             of which at least 2 do not uniquely participate in the ABI.
               (e.g. are user saved, and many are defined as temp).

     An assembler macro could readily generate candidates and choose a default.
          That therefore is not overly onerous for assembler programmers.
          Tracking the used register is more challenging.
          But a symbolic register from a reserved set of temps would be one way of diminishing the intellectual challenge. 
          Assembler programmers mastered 8086 and successors, so this is trivial in comparison.

    As for compilers, the cost is much less than the perverse manipulations needed to continue to support the irregular register behaviours of 8086, 286 and 386.
          The application programmers using the compilers will be isolated from the impacts regarding code generation.

    Application debugging is not impacted as the decision to allocate the registers used is made statically at compile time.

2) we are far from using all encoding bits in vsetvli yet
  This is the most challenging to quantify. How far is far enough? What latent demand have we missed or misjudged?
3) there are many other cleaner ways to increase configuration space when we need it
 I agree there are many other ways.
 They should also be quantified and compared to determine the appropriate approach.
 Specifically you champion one below, that with tweaks looks very promising.
4) momentum, people have been working with current scheme and this is a major shift
A very important factor, I agree.
Many technically superior approaches have been abandoned for this reason.
I want to ensure we make an informed decision on the best long term approaches, whether to abandon for short term advantage or not.
The only two bits under debate are ta and ma.

I view ta/ma as a real and immediate need versus potential future
hypothetical needs. 
I concur and have been an advocate of their inclusion as a feature and in vsetvli.
 In some cases, these will be varied inside inner
loop, e.g., when stripmining through reductions with a vector of
partial sums, or when performing various convolution operations that
edit vector registers.  In other cases, they will be relatively
static, when agnostic is always fine.
I also agree.
I also believe the frequency of use for work loads in specific domains justifies inclusion in an immediate format.
However, I cannot quantify this, nor assess the tipping point at which inclusion into vtype vs vcsr applies.

I anticipate the vma/vta are not directly coupled to vsew/vlmul.
Indeed, I anticipate that true need for tail unchanged is rare.
At the other end of the spectrum, the overhead to toggle tail unchanged/agnostic for an HPC performance gain is sufficiently low that it is warranted even when it includes the recalculation of vl.

There is a continuum through which there are tradeoffs in changing vsew/vlmul and leaving vta/vma unchanged.
There appear to be performance benefits to switching vta and vma as immediates in an instruction that does not update vl.
Immediates provide the microarch benefit of instruction decode look ahead (over the in-register value of vsetvl).
These, too, need consideration.

For me the most compelling reason for vma/vta inclusion in V1.0 is early visibility, acknowledgement and acceptance of the feature.
For that same reason I leaned towards including them in the immediate of vsetvli.
Now that the feature is already noted and accepted within the ecosystem, I lean towards reserving vsetvli immediates for those vtype elements that change vl.
For the specific use case given in #460, I view non-power-of-2 LMUL as
possibly interesting but not in same importance category as supporting
renaming with mask agnostic.  For example, how much is really gained
versus simply rounding down to next power-of-2 LMUL?  The largest gain
is from 4 to 7, but at LMUL=4, any reasonable hardware should already
be very efficient and there will then be more registers available for
live values.
A tradeoff and thus no hard numbers for answers.
In other words, YMMV. It depends upon the specific algorithm.
My expectation is that most algorithms
(even after maturity, as we can expect initial use to be simplistic)
will be locally simple if not trivial, which will lend themselves to very long vl optimization.

My guess is LMUL of 6 is likely an important sweet spot for complex processes.
 There are also issues about widening/narrowing, etc.,
when EMUL is not a power-of-two, implying we need EMUL 2.5, 3.5 for
example, or don't support those operations for those LMULs so they're
2nd-class citizens.
Not really. LMUL of 6 pairs with 3 and provides optimal correlation.
Using the technique on AVL below,
LMUL of 5 can partially use LMUL of 3 with an overall space efficiency of 7/8ths, and
LMUL of 7 can use LMUL of 4 with an efficiency of 10/11ths.
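
The email does not spell out how the 7/8 and 10/11 figures are derived. One accounting that reproduces them (an assumption on my part, not a stated method) is to reserve ceil(LMUL/2) whole registers for the narrow group, count only floor(LMUL/2) of them as fully used, and divide used registers by reserved registers. The function name pairing_efficiency is hypothetical.

```python
from fractions import Fraction


def pairing_efficiency(wide_lmul: int) -> Fraction:
    """Used/reserved register ratio for a wide group paired with a half-size group.

    Assumed accounting (not taken from the email): the narrow group reserves
    ceil(wide_lmul / 2) registers but only floor(wide_lmul / 2) count as used.
    """
    narrow_reserved = (wide_lmul + 1) // 2  # whole registers set aside
    narrow_used = wide_lmul // 2            # whole registers counted as used
    return Fraction(wide_lmul + narrow_used, wide_lmul + narrow_reserved)


if __name__ == "__main__":
    for lmul in (5, 6, 7):
        print(f"LMUL={lmul}: efficiency {pairing_efficiency(lmul)}")
```

Under this accounting LMUL of 6 with 3 comes out at exactly 1, which matches the "optimal correlation" remark.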

Even as second-class citizens they can make significant contributions to the ecosystem.
  Also, most of the same effect can be had with an
additional Bitmanip "minu" instruction on AVL that's only needed once
per inner loop with some setup outside loop.
Agreed, until you encounter the corner cases,
especially changing from effective LMUL of 6 to 3,
again, what I'd consider the sweet spot for the feature.
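
A minimal sketch of the "minu on AVL" emulation discussed above, written as a Python model rather than assembly. The cap is computed once outside the loop and the minimum is taken once per iteration. The helper names, the default VLEN=128 and SEW=32, and the simplified rule vl = min(AVL, VLMAX) are assumptions for illustration only; the corner cases mentioned above (such as moving from an effective LMUL of 6 to 3) are not modelled.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int) -> int:
    """Elements per register group: VLEN * LMUL / SEW."""
    return vlen_bits * lmul // sew_bits


def stripmine(avl: int, vlen_bits: int = 128, sew_bits: int = 32,
              effective_lmul: int = 6) -> list:
    """Return the vl chosen on each pass when emulating a non-power-of-2 LMUL."""
    hw_vlmax = vlmax(vlen_bits, sew_bits, 8)          # hardware runs at LMUL=8
    cap = vlmax(vlen_bits, sew_bits, effective_lmul)  # setup outside the loop
    chunks = []
    while avl > 0:
        requested = min(avl, cap)       # stands in for the minu on AVL
        vl = min(requested, hw_vlmax)   # simplified model of vsetvli
        chunks.append(vl)
        avl -= vl
    return chunks


if __name__ == "__main__":
    print(stripmine(100))
```

With these assumed parameters the cap is 24 elements, so no pass ever touches more than six of the eight registers in the LMUL=8 group.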

How useful LMUL=3,5,6 and 7 will be is hard to assess.
As it can be emulated (in many use cases conveniently and
effectively) in the current LMUL = 2**n model, it is therefore not,
on its own, a compelling reason to adopt rd/rs1 encoding.

However, the issue is whether collectively the reasons justify preemptive inclusion
 or at minimum, positioning.
To expand the set of custom types, where we usually don't need to
encode vl setting at same time but rely on existing vl setting (as
with vsetvli x0, x0, imm encoding), we could add a new vsetvlx
encoding based on vsetvl but with bit 30 set.
31 30         25 24      20 19      15 14   12 11      7 6     0
0 |        zimm[10:0]      |    rs1   | 1 1 1 |    rd   |1010111| vsetvli
1 |   000000    |   rs2    |    rs1   | 1 1 1 |    rd   |1010111| vsetvl
1 |   1xxxxx    |  xxxxx   |  xxxxx   | 1 1 1 |  xxxxx  |1010111| vsetvlx

This gives 20 bits to change vtype with new vl setting based on
existing vl (or where there may be some fixed known relationships
based on the special data types or shape, where software doesn't need
to see new vl value returned in rd but knows how to count based on other vl
values).  Note that existing immediate fields in vsetvli can live in
same bit positions in this instruction.
As per usual, you have presented a masterful solution.
This is essentially the non-transient variation of #423

I'd suggest that
    1) vsettyi (set vtype imm) as its name
    2) only 15 bits be provided, with bits 29 through 25 encoded as zeros and the other values reserved
            (see #456 transient solution)
            (we can transparently expand through bits 25 to 29 if that proves to be the best use of the bits)
    3) vta and vma be moved to this instruction rather than remain in vsetvli.
        (I suspect a variant of ediv will modify vl and thus be most appropriate for vsetvli.)
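
To make the bit accounting concrete, here is a rough sketch of a classifier for the three encodings in the table above, plus a check for the 15-bit subset suggested in point 2. It uses the vsetvlx name from the table; the function names, the "reserved" label for the remaining bit-30-clear patterns, and the order in which the free fields are concatenated are all assumptions for illustration, not part of either proposal.

```python
OPCODE_V = 0b1010111   # bits 6:0
FUNCT3_CFG = 0b111     # bits 14:12


def classify(insn: int) -> str:
    """Classify a 32-bit word against the vsetvli / vsetvl / vsetvlx table."""
    if (insn & 0x7F) != OPCODE_V or ((insn >> 12) & 0x7) != FUNCT3_CFG:
        return "other"
    if ((insn >> 31) & 1) == 0:
        return "vsetvli"
    bits_30_25 = (insn >> 25) & 0x3F
    if bits_30_25 == 0:
        return "vsetvl"
    if bits_30_25 & 0x20:          # bit 30 set
        return "vsetvlx"
    return "reserved"              # assumed label for unassigned patterns


def vsetvlx_payload(insn: int) -> int:
    """Gather the 20 free bits: 29:25, 24:20, 19:15 and 11:7 (order assumed)."""
    b29_25 = (insn >> 25) & 0x1F
    b24_20 = (insn >> 20) & 0x1F
    b19_15 = (insn >> 15) & 0x1F
    b11_7 = (insn >> 7) & 0x1F
    return (b29_25 << 15) | (b24_20 << 10) | (b19_15 << 5) | b11_7


def fits_15_bit_subset(insn: int) -> bool:
    """True if the word is vsetvlx with bits 29:25 zero (the 15-bit variant)."""
    return classify(insn) == "vsetvlx" and ((insn >> 25) & 0x1F) == 0
```

Four 5-bit fields give the 20 bits quoted above; forcing bits 29:25 to zero leaves the 15 bits of the narrower variant while keeping the remaining patterns available for later expansion.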

With this in place, rd/rs1 encoding of immediates is much less compelling (almost dead).



On Wed, 1 Jul 2020 03:02:18 -0400, DSHORNER <ds2horner@...> said:
