Re: Issue #365 vsetvl{i} x0, x0 instruction forms

David Horner


Point of agreement #1 - x0,x0 variant should not change vl.
I believe we are also in agreement on

#2 - if vl would change because of a SEW/LMUL change vill should be set.

Outstanding questions:

#3) If vill is set should vl remain unchanged? (I vote for yes).

#4) Should a potential change of vl set vill? Currently that condition is equivalent to a SEW/LMUL ratio change.

    4a) in all cases? even if vl is zero? even if vl is 1? (this rule has fringe cases).
    4b) what do we do when another vtype parameter is added that also would potentially change vl?
            What is the likely formulation of such an algorithm?
            In general something comparable to a simple ratio would be inadequate.
            I believe this SEW/LMUL formulation is not future proof.

#5) Why not define the x0,x0 variant, which doesn't change vl, as succeeding if vl would not change?
       Only setting vill if the resultant new-vl does not match the previous vl.
          (Point #3 is still relevant, but there are no longer any corner cases as in 4a and 4b).

Krste below expresses some reasons that lean towards SEW/LMUL invariance, rather than vl invariance, being the determinant for setting vill.

Specifically, comparing vl to the new-vl requires reading the old vl, and that is potentially expensive. Why not avoid the read of vl altogether?

One approach is based on #4.
Instead, read the previous (current?) vlmul and vsew, calculate the ratio, compare it with the new ratio, and set vill if they differ.
We can avoid the vlmul/vsew read by retaining the current SEW/LMUL values (or ratio)
(these can be stored locally, only 6 bits for vsew and vlmul)
and comparing that to the new SEW/LMUL ratio.
This is quite efficient.
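
As a rough illustration of the #4 check, here is a minimal Python sketch. The function names are hypothetical, and I assume the inputs are the signed base-2 logs of SEW and LMUL rather than the raw vtype field encodings. The comparison of (vsew - vlmul) follows the formulation Krste gives in the quoted text further down.

```python
# Sketch only: names and input encoding are assumptions, not spec text.
# log2_sew and log2_lmul are signed base-2 logs, e.g. SEW=32 -> 5, LMUL=1/2 -> -1.

def ratio_key(log2_sew, log2_lmul):
    """Return log2(SEW/LMUL). Equal keys imply equal VLMAX for a given VLEN."""
    return log2_sew - log2_lmul


def x0x0_sets_vill_by_ratio(saved_key, new_log2_sew, new_log2_lmul):
    """Rule #4: set vill whenever the SEW/LMUL ratio changes. vl is never read."""
    return ratio_key(new_log2_sew, new_log2_lmul) != saved_key
```

Only the saved key needs to be kept locally. Note that this check fires even when the current vl would still fit under the new setting, which is the fringe case raised in 4a.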

What of advocating for #5 - what is the overhead here?

A simplistic approach can read vl and push it through the existing circuitry,
  except when the calculated MAXVL exceeds the calculated vl set vill
otherwise leave the current vl alone (or overwrite it with itself, whichever).
For simple designs there is a simple implementation that can further be optimized by setting vill  on a slow path.
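
To make the intended #5 behaviour concrete, here is a minimal Python sketch under stated assumptions: VLEN=128 is just an example value, the function names are hypothetical, and the inputs are the signed base-2 logs of SEW and LMUL. It follows the definition in #5 (set vill only if the resulting vl would not match the previous vl) together with #3 (vl stays unchanged either way).

```python
VLEN = 128  # assumed vector register length in bits, for illustration only


def vlmax(log2_sew, log2_lmul, vlen=VLEN):
    """VLMAX = VLEN * LMUL / SEW, computed with shifts on the base-2 logs."""
    shift = log2_lmul - log2_sew
    return vlen << shift if shift >= 0 else vlen >> -shift


def vsetvli_x0_x0(vl, log2_sew, log2_lmul, vlen=VLEN):
    """Return (vl, vill). vl is never modified (points #1 and #3)."""
    # Push the old vl through the ordinary vl computation, as if AVL = old vl.
    new_vl = min(vl, vlmax(log2_sew, log2_lmul, vlen))
    # Rule #5: flag vill only when that result differs from the old vl.
    return vl, new_vl != vl
```

With these assumptions, vl=4 at SEW=32, LMUL=1 is flagged when moving to SEW=64, LMUL=1 (VLMAX drops to 2), while vl=1 or vl=0 is not flagged by the same change even though the SEW/LMUL ratio differs.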

Alternatively we can use the SEW/LMUL optimization approach:
We can store the vl info locally.
For the standard V minimum this is log2(VLEN*8) bits: log2(128) + log2(8) = 7 + 3 = 10 bits,
with an additional bit per doubling of VLEN.
We compare the calculated vl with that.
This compares favourably to #4 optimization.

But we can do better than that.
We only need compare calculated MAXVL
(comparable computational cost to SEW/LMUL ratio)
which is normally done anyway (so can leverage existing circuitry)
and compare that to locally stored vl information.
MAXVL varies from 1 (in the worst case) to VLEN*8.
As MAXVL is always a power of 2 the number of bits to store is log2(log2(VLEN*8)) or 4 bits for up to VLEN=2K.
Thus 4 bits for the locally saved vl information which is the minimal MAXVL for current vl.
(V minimum is ELEN=64 and VLEN=128, which is among the cases for which 3 bits suffice.)
I'm not a circuit guru, but "MINVL" from vl is inexpensive to calculate,
 especially as it also does not need to be on the critical path for the non-x0,x0 variants,
  which are the only ones that need to store vl info locally.
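
A minimal Python sketch of this "MINVL" idea follows, again with hypothetical names. I assume VLMAX is always a power of two and at least 1, and that only its exponent is stored and compared.

```python
def min_vlmax_log2(vl):
    """Exponent of the smallest power-of-two VLMAX that can still hold vl.

    Computed off the critical path by the vl-writing vsetvl variants.
    vl = 0 and vl = 1 both fit under VLMAX = 1, i.e. exponent 0.
    """
    return (max(vl, 1) - 1).bit_length()


def x0x0_sets_vill_by_minvl(saved_exp, new_vlmax_log2):
    """Rule #5 without reading vl: vill iff the new VLMAX is below the saved minimum."""
    return new_vlmax_log2 < saved_exp
```

For example, vl=3 gives a saved exponent of 2 (smallest VLMAX is 4), so a new VLMAX of 2 sets vill while a new VLMAX of 4 or more does not.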

It would appear that #5 is a net win for circuitry and a better formulation of vl unchanged requirements.

#5 now has my vote.

I provide further analysis within the replies below.

On 2020-07-22 8:21 p.m., Bill Huffman wrote:
I agree with Krste's support for Guy's proposal here.
Thanks for the response.
Loops with
multiple element widths are likely to have more non-vl-changing
instructions than vl-changing instructions. Knowing this from the
instruction without having to track the sequence involved is likely to
pay benefits in implementation.
An important and valid point that I also support.

On 7/22/20 4:35 PM, Krste Asanovic wrote:

The main issue is whether the current PoR has any useful purpose when
vl changes.
I disagree with characterizing this as the main issue.
I agree that it is an important consideration.

The pivotal question as I see it is, what action the instruction should take when vl would change.
PoR says change it, as any other vsetvl variant would.
I don't subscribe to "field of dreams" approach. I tried
to find some scenarios hoping there would be some useful cases, but
struggled to come up with anything substantial with current PoR.
There are certainly some possible alternate vl-changing behaviors that
could be useful, but those would be a different instruction. Unless
there is a clear use, the additional vl-modifying behavior in PoR
cannot really be stated as a positive but only a curiosity.
Until such a use is discovered.

I don't disagree that it is an important consideration, only that it is secondary.
If explicitly disallowing the "apparently useless" behaviour itself causes substantial cost, we can live with a meaningless instruction formulation.
RVI frequently allows formulations that lack a clear and compelling use case because
 they are an artifact of a generally useful operation, and excluding them would increase overhead (instruction decode, etc.),
e.g. bne rs1,rs2,-2, a branch to within the same instruction which, depending upon the rs1/rs2 values, can be a C.BNEZ infinite loop if a specific register (x8 through x15) is non-zero.
The same could hold true here. In my opinion, this is substantially why the current PoR was adopted (this was the main part of my reasoning).

Bill expresses succinctly:

Loops with multiple element widths are likely to have more non-vl-changing
instructions than vl-changing instructions.
It is precisely due to the nature of its expected (lack of) use,
that in other situations we would disregard the low use and esoteric case as harmless.
Consider the reluctance to reserve RVV simm5/rs1=0 formulations that match an existing simpler instruction.
However, in this case I agree that the formulation x0,x0 is valuable to use effectively, solely because vsetvli is so important.
Even as a secondary consideration, lack of usefulness is disturbing for a dominant feature.
On the negative side, a microarchitecture will have to assume
vl will be read and written by this instruction, even if it almost
never changes.
That is PoR and I believe there is now general agreement (3 to zero so far) that changing vl is not the desired behaviour.
So, Point of agreement #1 - x0,x0 variant  should not change vl.

Even for simple machines, this will probably cause
some extra flops to be clocked.
Let's put this into perspective - all other vsetvl variants write vl, that is the primary purpose, it is explicitly in the name.
We are proposing an optimization for what we anticipate (reasonably) to be a common use case, as Bill stated.
The potential is to save some flops by avoiding the write (and delays caused by its cascade/flow/synch effects).

For machines with renaming, it can
require a new physical vl is allocated early in machine even if vl
rarely changes. There might be microarch techniques to recycle vl
regs more quickly once known not to change, but would be much simpler
not to have to deal with this.
Agreed.  Another check for the Point of agreement #1
The (certification)verification cost alone is a big
negative for a feature that could be rarely/never used.
Agreed.  Another check for the Point of agreement #1

These instructions will likely be common in loops dealing with
multiple element widths (a common loop will have only one vsetvli that
changes vl and potentially many that manipulate SEW/LMUL), and so
optimizing their implementation is important. Having a hardware
instruction that is "change vtype but not vl, or error" is clearly
useful I think.
Agreed.  The above argument restated as for the Point of agreement #1
The dynamic debug aspect, I agree is relatively minor, but given the
prevalence of "change vtype but not vl" instructions, it is only a
positive that bugs are caught even if not always with clear
determination of problematic instruction (though I guess it will very
rare that the bug will be difficult to find even if only trap on use).
Expressed as not as persuasive, but at least a fraction check for the Point of agreement #1.

Even though I view dynamic debug as a minor benefit, I think even that
minor concrete benefit outweighs the unknown abstract benefit of
"change vl" behavior, unless there are some great use cases for the
existing PoR scheme that we've missed.
I agree.

But we cross a line to believe the objective is "that bugs are caught".
What bug is it that we believe we can design hardware to catch?

As a database analyst, I told the application developers with whom I worked
   that their compiled and running program was not "wrong".
It was doing just fine exactly what they directed it to do.
It was the perfect program for a problem other than the one they wanted to solve.

Ditto for bugs. Behaviour that one programmer wants to avoid, another may intend.

We cannot solve bugs in hardware. CICS attempts to do so are infamous.
All we can do is provide operations that do exactly as they are stipulated, ideally with no corner cases, with a simple conceptual definition.

Enforcing a perceived good software/"expected use" policy is rarely directly achievable or desirable.
Keeping the SEW/LMUL ratio invariant is a policy/"expected use case".
I contend there are deliberate exceptions to this policy, or, in the alternative,  at minimum the policy has a limited domain.
If there are exceptions or the domain is limited it is not a good characteristic to enforce, even as a special instruction formulation.

Rather, a better characteristic to enforce is vl invariance in a special formulation.
It follows from the instruction formulation in which no explicit AVL is supplied (X0).
It is the underlying characteristic in the checks above and below.

But again, the implementation saving from not having to worry about
dynamic vl changes for these instructions to me far outweighs the
other issues.

On Wed, 22 Jul 2020 09:02:03 -0400, "David Horner" <ds2horner@...> said:
| I wholeheartedly agree with resolving on the mailing list.
| This should be the rule not exception.

| On 2020-07-21 11:58 p.m., Krste Asanovic wrote:
|| I want to bring this to group's attention as I think I've convinced
|| myself that Guy's suggestion is the correct path to follow, i.e.,
|| vsetvli x0, x0, imm
|| will raise vill if the new SEW'/LMUL' ratio is not the same as the old
|| SEW/LMUL implying vl might change. Similarly for vsetvl version.
| My considerations for allowing vl to change were

| a) having a compelling reason to change PoR.
|      vsetvl[i] is extremely important to RVV success.
|      It deserves deep scrutiny.
|      Challenging each and every change,
|     as well as proposing any plausible enhancement
|     are equally important to get this feature,
|     more so than others, right(tm).

| b) tracking assemblers and compilers could present warnings.
|      Part of my support was my bias towards encouraging vl tracking
| support.
|      Tracking vl in code has substantial benefits beyond a replacement
| for this x0,x0 behaviour.
|      I believe RVV success and adoption will be substantially hampered
| without it.
|      It is however specific to RVV.
|      So marginal hardware support that appears to mitigate a need for
| vl tracking gets a check in the negative column.

| c) A perceived simplicity of PoR for minimal designs.
|      I am biased toward ensuring simple machines can efficiently
| support RVV.

|      Initial uptake is likely to be in the application/HPC domain, but
|      I believe that ultimately IOT machines will benefit from RVV
|          if we continue to emphasize simplicity in the design.

| d) Setting vill is excellent as a means to avoid trap behaviour.
|      However, it requires an explicit check after vtype setting ops.
|      Opportunistic approaches will rely on the subsequent fault.
|      This situation is theoretically impossible to statically backward
| trace.
|      A given RVV data instruction could be branched to from anywhere,
|          conditional execution could have executed any vsetvl instruction
|          with virtually any rs1 value.
|      This biases me away from setting vill; in the x0,x0 case setting
| vl avoids setting vill.
|      However, in practice branching into a loop will be errant
| behaviour and
|         RVV data instructions will be paired with a vsetvli instruction.
|      My paranoia causes me to weigh this too heavily at times. (.... reweighing
| risks)

| e)  in the x0,x0 formulation, vsetvli cannot determine from immediate
| parsing alone vill state.
|      we have strived to ensure the immediate format will meet virtually
| all in loop use cases.
|      Ideally, vsetvl is reserved for context switch (and custom)
| situations.
|      I considered x0,x0 a punt to vsetvl (potentially slow) path to
| allow for the immediate form optimization
|      (i.e. no vill setting considerations after parse) .
|      However, reweighing the benefit of retaining vl and requiring a
| late setting of vill.
|      Given vill setting can always be performed on a slow path
|      with little real impact to normal code ....  reweighing risks.

|| Apart from the debugging motivation that Guy presented,
| see my point d.
|| I would add
|| that this definition effectively removes any read or write of vl from
|| the instruction, possibly removing hazards and simplifying dependency
|| tracking and relieving an OoO machine from providing a new rename
|| register for vl (might still need for vtype).
| this does not talk to my point c.
|| I could not find any non-esoteric use for the vl-trimming behavior of
|| the current PoR for larger SEW/LMUL,
| I've found coders and compiler writers collectively more ingenious than I,
|  not only more eyes in free software but a spectrum of inner-eye
| perceptions and mindsets.

| So although relevant to the discussion, in the negative it is not
| compelling as a benefit.
|| so given these benefits I move we
|| adopt the "sets vill for non-iso SEW/LMUL" meaning.

|| The circuit has
|| to calculate (vsew_new-vlmul_new)!=(vsew_old-vlmul_old) to determine
|| vill, but now never needs to read
| I disagree with this behaviour. Increasing VLMAX does not invalidate
| current vl and should not "raise an exception", even indirectly.
| If we need a warning, let assemblers/compilers do it (note b above).

| I also disagree that we always set vill if VLMAX reduced but vl is still
| < newVLMAX.
| Only if the ratio changes do we need to read vl, so in the frequent case
| I agree vl read can be avoided.
| To avoid a vl read
|| or write vl.
| My principle is hardware should not attempt to debug or correct software.
| Although hardware developers may believe a specific
| validation/verification facility will be useful to programmers (SEW/LMUL
| invariance checking)
| such "policy" should not be imposed but rather a means to electively
| support such a policy be provided.
| Setting vill when original vl cannot be maintained is valid, enforcing
| an invariance policy is not.
|| ...
|| As a general optimization guide, software should endeavor to use this
|| form instead of passing in AVL to avoid the vl update when not
|| necessary.
| I agree.
| This is what was envisioned by providing x0,x0.
| Further, this encoding implies an intent which makes code clearer.
| Someone doing tricks needs to add a comment.

| I'm leaning to accepting the proposal as I amended.
|| I hope this is one we can resolve on the mailing list to save time in
|| the next meeting.
| as do I.
|| Krste

