Re: Issue #365 vsetvl{i} x0, x0 instruction forms


On Wed, Jul 22, 2020 at 11:19 PM David Horner <ds2horner@...> wrote:

Point of agreement #1 - x0,x0 variant should not change vl.
I believe we are also in agreement on

#2 - if vl would change because of a SEW/LMUL change vill should be set.

Outstanding questions:

#3) If vill is set should vl remain unchanged? (I vote for yes).

Other vsetvl[i] instructions that set vill=1 also set vl=0.  Deviating from that course would be needlessly painful and not especially beneficial.

#4) Should a potential change of vl set vill? Currently that condition is
equivalent to a SEW/LMUL ratio change.

     4a) in all cases? even if vl is zero? even if vl is 1? (this rule
has fringe cases).
     4b) what do we do when another vtype parameter is added that also
would potentially change vl?
             What is the likely formulation of such an algorithm?
             In general something comparable to a simple ratio would be
             I believe this SEW/LMUL formulation is not future proof.

#5) Why not define the x0,x0 variant that doesn't change vl as
succeeding if vl doesn't change?
        Only setting vill if the resultant new-vl does not match the
previous vl.
           (Point #3 is still relevant, but there are no longer any
corner cases as in 4a and 4b).
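
The difference between #4 and #5 can be sketched as two predicates. This is a minimal illustration, not spec text. It assumes vsew = log2(SEW/8), vlmul = log2(LMUL) as a signed value, VLMAX = LMUL*VLEN/SEW, and that the x0,x0 form uses the current vl as its AVL. All function names are hypothetical.

```python
def vlmax(vlen: int, vsew: int, vlmul: int) -> int:
    """VLMAX = LMUL * VLEN / SEW, with SEW = 8 << vsew and LMUL = 2**vlmul."""
    sew = 8 << vsew
    scaled = vlen << vlmul if vlmul >= 0 else vlen >> -vlmul
    return scaled // sew


def vill_by_ratio(old_vsew: int, old_vlmul: int, new_vsew: int, new_vlmul: int) -> bool:
    """#4: set vill when the SEW/LMUL ratio changes (vl is never read)."""
    return (new_vsew - new_vlmul) != (old_vsew - old_vlmul)


def vill_by_vl(vl: int, vlen: int, new_vsew: int, new_vlmul: int) -> bool:
    """#5: set vill only when the current vl no longer fits under the new VLMAX."""
    return vl > vlmax(vlen, new_vsew, new_vlmul)


# VLEN=128, vl=2, moving from SEW=32/LMUL=1 to SEW=64/LMUL=1.
print(vill_by_ratio(2, 0, 3, 0))
print(vill_by_vl(2, 128, 3, 0))
```

In this case the ratio changes, so the first predicate is true, while vl=2 still fits the new VLMAX of 2, so the second is false. That is the situation where the two rules disagree.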

Krste below expresses some reasons that lean towards SEW/LMUL invariance,
rather than vl invariance, being the determinant for setting vill.

Specifically, comparing vl to the new-vl requires reading the old vl, and
that is potentially expensive. Why not avoid the read of vl altogether?

One approach is based on #4.
Instead, read the previous (current?) vlmul and vsew, calculate the ratio,
compare it with the new ratio, and set vill if they differ.
We can avoid the vlmul/vsew read by retaining the current SEW/LMUL values
(or ratio) locally (only 6 bits for vsew and vlmul)
and comparing that to the new SEW/LMUL ratio.
This is quite efficient.

What of advocating for #5 - what is the overhead here?

A simplistic approach can read vl and push it through the existing
   except when the calculated MAXVL exceeds the calculated vl set vill
otherwise leave the current vl alone (or overwrite it with itself,
For simple designs there is a simple implementation that can further be
optimized by setting vill  on a slow path.

Alternatively we can use the SEW/LMUL optimization approach:
We can store the vl info locally.
For the standard V minimum this takes log2(VLEN*8) bits:
log2(128) + log2(8) = 7 + 3 = 10, with an additional bit per doubling of VLEN.
We compare the calculated vl with that.
This compares favourably to #4 optimization.

But we can do better than that.
We only need compare calculated MAXVL
(comparable computational cost to SEW/LMUL ratio)
which is normally done anyway (so can leverage existing circuitry)
and compare that to locally stored vl information.
MAXVL varies from 1 (in the worst case) to VLEN*8.
As MAXVL is always a power of 2 the number of bits to store is
log2(log2(VLEN*8)) or 4 bits for up to VLEN=2K.
Thus 4 bits for the locally saved vl information which is the minimal
MAXVL for current vl.
(V minimum is ELEN=64 and VLEN=128, which is among the cases for which 3
bits suffice)
I'm not a circuit guru, but "MINVL" from vl is inexpensive to calculate,
  especially as it also does not need to be on the critical path for
non-x0,x0 variants,
   which are the only ones that need to store vl info locally.
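
The exponent comparison described above can be sketched as follows. This is a rough illustration under stated assumptions, not a circuit description: VLMAX is taken to be a power of two of at least 1 (configurations where it would be smaller are assumed to be flagged elsewhere), vsew = log2(SEW/8), vlmul = log2(LMUL) as a signed value, and the function names are hypothetical.

```python
def min_maxvl_log2(vl: int) -> int:
    """Smallest k with 2**k >= vl; this is the stored "MINVL" exponent."""
    return max(vl - 1, 0).bit_length()


def maxvl_log2(vlen_log2: int, vsew: int, vlmul: int) -> int:
    """log2(VLMAX) = log2(VLEN) + log2(LMUL) - log2(SEW), with SEW = 8 << vsew."""
    return vlen_log2 + vlmul - (vsew + 3)


def x0x0_sets_vill(stored_k: int, vlen_log2: int, vsew: int, vlmul: int) -> bool:
    """Compare two small exponents instead of reading the full vl."""
    return maxvl_log2(vlen_log2, vsew, vlmul) < stored_k


# VLEN=128 (log2 = 7); vl=3 was set earlier, so k=2 is stored (2**2 = 4 >= 3).
k = min_maxvl_log2(3)
print(x0x0_sets_vill(k, 7, 3, 0))  # new SEW=64, LMUL=1, so VLMAX=2
print(x0x0_sets_vill(k, 7, 3, 1))  # new SEW=64, LMUL=2, so VLMAX=4
```

The first call reports that vill would be set (vl=3 does not fit in VLMAX=2) and the second that it would not. Because VLMAX is a power of two, vl <= VLMAX holds exactly when the stored exponent is at most log2(VLMAX), so the small comparison gives the same answer as comparing against vl itself.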

It would appear that #5 is a net win for circuitry and a better
formulation of vl unchanged requirements.

It's not just about the cost of the comparators; it's also about avoiding the RAW hazard on the previous value of VL.

The RAW hazard on the previous value of vtype in Krste's proposal is less of a concern, since the previous vtype will usually have been supplied by an immediate operand.  Optimizing for this case, it's straightforward for renamed implementations to maintain a speculative copy of the vtype register in the decode stage.  The same doesn't work for vl, which in most cases was most recently sourced from a register operand.

#5 now has my vote.

I provide further analysis within the replies below.

On 2020-07-22 8:21 p.m., Bill Huffman wrote:
> I agree with Krste's support for Guy's proposal here.
thanks for the response.
>   Loops with
> multiple element widths are likely to have more non-vl-changing
> instructions than vl-changing instructions.  Knowing this from the
> instruction without having to track the sequence involved is likely to
> pay benefits in implementation.
An important and valid point that I also support.
>         Bill
> On 7/22/20 4:35 PM, Krste Asanovic wrote:
>> The main issue is whether the current PoR has any useful purpose when
>> vl changes.
I disagree with characterizing this as the main issue.
I agree that it is an important consideration.

The pivotal question as I see it is, what action the instruction should
take when vl would change.
PoR says change it, as any other vsetvl variant would.
>>    I don't subscribe to "field of dreams" approach.  I tried
>> to find some scenarios hoping there would be some useful cases, but
>> struggled to come up with anything substantial with current PoR.
>> There are certainly some possible alternate vl-changing behaviors that
>> could be useful, but those would be a different instruction.  Unless
>> there is a clear use, the additional vl-modifying behavior in PoR
>> cannot really be stated as a positive but only a curiosity.
Until such a use is discovered.

I don't disagree that it is an important consideration, only that it is
If explicitly disallowing the "apparently useless" behaviour itself
causes substantial cost, we can live with a meaningless instruction
RVI frequently allows formulations that lack a clear and compelling
use case because
  they are an artifact of a generally useful operation, and to exclude them
would increase overhead (instruction decode, etc.)
e.g. bne rs1,rs2,-2, branch to within the same instruction which,
depending upon rs1/rs2 values can be a C.BNEZ infinite loop if a
specific register (x8 through x15) is non-zero.
The same could hold true here. In my opinion, this is substantially why
the current PoR was adopted (this was the main part of my reasoning).

Bill expresses succinctly:

> Loops with multiple element widths are likely to have more non-vl-changing
> instructions than vl-changing instructions.

It is precisely due to the nature of its expected (lack of) use,
that in other situations we would disregard the low use and esoteric
case as harmless.
Consider the reluctance to reserve RVV simm5/rs1=0 formulations that
match an existing simpler instruction.
However, in this case I agree that the formulation x0,x0 is valuable to
use effectively, solely because vsetvli is so important.
Even as a secondary consideration, lack of usefulness is disturbing for
a dominant feature.
>> On the negative side, a microarchitecture will have to assume
>> vl will be read and written by this instruction, even if it almost
>> never changes.
That is PoR and I believe there is now general agreement (3 to zero so
far) that changing vl is not the desired behaviour.
So, Point of agreement #1 - x0,x0 variant  should not change vl.

>>   Even for simple machines, this will probably cause
>> some extra flops to be clocked.
Let's put this into perspective - all other vsetvl variants write vl,
that is the primary purpose, it is explicitly in the name.
We are proposing an optimization for what we anticipate (reasonably) to
be a commonly used case, as Bill stated.
The potential is to save some flops by avoiding the write (and delays
caused by its cascade/flow/synch effects) .

>>   For machines with renaming, it can
>> require a new physical vl is allocated early in machine even if vl
>> rarely changes.  There might be microarch techniques to recycle vl
>> regs more quickly once known not to change, but would be much simpler
>> not to have to deal with this.
Agreed.  Another check for the Point of agreement #1
>> The (certification)verification cost alone is a big
>> negative for a feature that could be rarely/never used.
Agreed.  Another check for the Point of agreement #1
>> These instructions will likely be common in loops dealing with
>> multiple element widths (a common loop will have only one vsetvli that
>> changes vl and potentially many that manipulate SEW/LMUL), and so
>> optimizing their implementation is important.  Having a hardware
>> instruction that is "change vtype but not vl, or error" is clearly
>> useful I think.
Agreed.  The above argument restated as for the Point of agreement #1
>> The dynamic debug aspect, I agree is relatively minor, but given the
>> prevalence of "change vtype but not vl" instructions, it is only a
>> positive that bugs are caught even if not always with clear
>> determination of problematic instruction (though I guess it will very
>> rare that the bug will be difficult to find even if only trap on use).
Expressed as less persuasive, but at least a fractional check for the
Point of agreement #1.

>> Even though I view dynamic debug as a minor benefit, I think even that
>> minor concrete benefit outweighs the unknown abstract benefit of
>> "change vl" behavior, unless there are some great use cases for the
>> existing PoR scheme that we've missed.
I agree.

But we cross a line to believe the objective is "that bugs are caught".
What bug is it that we believe we can design hardware to catch?

As a database analyst, I told the application developers with whom I worked
    that their compiled and running program was not "wrong".
It was doing just fine exactly what they directed it to do.
It was the perfect program for a problem other than the one they wanted
to solve.

Ditto for bugs. Behaviour that one programmer wants to avoid another may

We cannot solve bugs in hardware. CICS attempts to do so are infamous.
All we can do is provide operations that do exactly as they are
stipulated, ideally with no corner cases, with a simple conceptual

Enforcing a perceived good software/"expected use" policy is rarely
directly achievable or desirable.
Keeping the SEW/LMUL ratio invariant is a policy/"expected use case".
I contend there are deliberate exceptions to this policy, or, in the
alternative,  at minimum the policy has a limited domain.
If there are exceptions or the domain is limited it is not a good
characteristic to enforce, even as a special instruction formulation.

Rather, a better characteristic to enforce is vl invariance in a
special formulation.
It follows from the instruction formulation in which no explicit AVL is
supplied (X0).
It is the underlying characteristic in the checks above and below.
>> But again, the implementation saving from not having to worry about
>> dynamic vl changes for these instructions to me far outweighs the
>> other issues.

>> Krste
>>>>>>> On Wed, 22 Jul 2020 09:02:03 -0400, "David Horner" <ds2horner@...> said:
>> | I wholeheartedly agree with resolving on the mailing list.
>> | This should be the rule not exception.
>> | On 2020-07-21 11:58 p.m., Krste Asanovic wrote:
>> || I want to bring this to group's attention as I think I've convinced
>> || myself that Guy's suggestion is the correct path to follow, i.e.,
>> ||
>> || vsetvli x0, x0, imm
>> ||
>> || will raise vill if the new SEW'/LMUL' ratio is not the same as the old
>> || SEW/LMUL implying vl might change.  Similarly for vsetvl version.
>> | My considerations for allowing vl to change were
>> | a) having a compelling reason to change PoR.
>> |       vsetvl[i]is extremely important to RVV success.
>> |       It deserves deep scrutiny.
>> |       Challenging each and every change,
>> |      as well as proposing any plausible enhancement
>> |      are equally important to get this feature,
>> |      more so than others, right(tm).
>> | b) tracking assemblers and compilers could present warnings.
>> |       Part of my support was my bias towards encouraging vl tracking
>> | support.
>> |       Tracking vl in code has substantial benefits beyond a replacement
>> | for this x0,x0 behaviour.
>> |       I believe RVV success and adoption will be substantially hampered
>> | without it.
>> |       It is however specific to RVV.
>> |       So marginal hardware support that appears to mitigate a need for
>> | vl tracking gets a check in the negative column.
>> | c) A perceived simplicity of PoR for minimal designs.
>> |       I am biased toward ensuring simple machines can efficiently
>> | support RVV.
>> |       Initial uptake is likely to be in the application/HPC domain, but
>> |       I believe that ultimately IOT machines will benefit from RVV
>> |           if we continue to emphasize simplicity in the design.
>> | d) Setting vill is excellent as a means to avoid trap behaviour.
>> |       however it requires explicit check after vtype setting ops.
>> |       Opportunistic approaches will rely on the subsequent fault.
>> |       This situation is theoretically impossible to statically backward
>> | trace.
>> |       A given RVV data instruction could be branched to from anywhere,
>> |           conditional execution could have executed any vsetvl instruction
>> |           with virtually any rs1 value.
>> |       This biases me away from setting vill, in the x0,x0 case setting
>> | vl avoid vill set.
>> |       However, in practice branching into a loop will be errant
>> | behaviour and
>> |          RVV data instructions will be paired with a vsetvli instruction.
>> |       My paranoia causes me to weigh this too heavily at times. (.... reweighing
>> | risks)
>> | e)  in the x0,x0 formulation, vsetvli cannot determine from immediate
>> | parsing alone vill state.
>> |       we have strived to ensure the immediate format will meet virtually
>> | all in loop use cases.
>> |       Ideally, vsetvl is reserved for context switch (and custom)
>> | situations.
>> |       I considered x0,x0 a punt to vsetvl (potentially slow) path to
>> | allow for the immediate form optimization
>> |       (i.e. no vill setting considerations after parse) .
>> |       However, reweighing the benefit of retaining vl and requiring a
>> | late setting of vill.
>> |       Given vill setting can always be performed on a slow path
>> |       with little real impact to normal code ....  reweighing risks.
>> || Apart from the debugging motivation that Guy presented,
>> | see my point d.
>> || I would add
>> || that this definition effectively removes any read or write of vl from
>> || the instruction, possibly removing hazards and simplifying dependency
>> || tracking and relieving an OoO machine from providing a new rename
>> || register for vl (might still need for vtype).
>> | this does not talk to my point c.
>> ||
>> || I could not find any non-esoteric use for the vl-trimming behavior of
>> || the current PoR for larger SEW/LMUL,
>> | I've found coders and compiler writers collectively more ingenious than I,
>> |   not only more eyes in free software but a spectrum of inner-eye
>> | perceptions and mindsets.
>> | So although relevant to the discussion, in the negative it is not
>> | compelling as a benefit.
>> || so given these benefits I move we
>> || adopt the "sets vill for non-iso SEW/LMUL" meaning.
>> || The circuit has
>> || to calculate (vsew_new-vlmul_new)!=(vsew_old-vlmul_old) to determine
>> || vill, but now never needs to read
>> | I disagree with this behaviour. Increasing VLMAX does not invalidate
>> | current vl and should not "raise an exception" even indirectly.
>> | If we need a warning, let assemblers/compilers do it (note b above).
>> | I also disagree that we always set vill if VLMAX reduced but vl is still
>> | < newVLMAX.
>> | Only if the ratio changes do we need to read vl, so in the frequent case
>> | I agree vl read can be avoided.
>> | To avoid a vl read
>> || or write vl.
>> | My principle is hardware should not attempt to debug or correct software.
>> | Although hardware developers may believe a specific
>> | validation/verification facility will be useful to programmers (SEW/LMUL
>> | invariance checking)
>> | such "policy" should not be imposed but rather a means to electively
>> | support such a policy be provided.
>> | Setting vill when original vl cannot be maintained is valid, enforcing
>> | an invariance policy is not.
>> || ...
>> || As a general optimization guide, software should endeavor to use this
>> || form instead of passing in AVL to avoid the vl update when not
>> || necessary.
>> | I agree.
>> | This is what was envisioned by providing x0,x0.
>> | Further, this encoding implies an intent which makes code clearer.
>> | Someone doing tricks needs to add a comment.
>> | I'm leaning to accepting the proposal as I amended.
>> ||
>> || I hope this is one we can resolve on the mailing list to save time in
>> || the next meeting.
>> | as do I.
>> ||
>> || Krste
>> ||
>> ||
>> ||
>> ||
>> |
