Issue #365 vsetvl{i} x0, x0 instruction forms


Krste Asanovic
 

I want to bring this to group's attention as I think I've convinced
myself that Guy's suggestion is the correct path to follow, i.e.,

vsetvli x0, x0, imm

will raise vill if the new SEW'/LMUL' ratio is not the same as the old
SEW/LMUL implying vl might change. Similarly for vsetvl version.

Apart from the debugging motivation that Guy presented, I would add
that this definition effectively removes any read or write of vl from
the instruction, possibly removing hazards and simplifying dependency
tracking and relieving an OoO machine from providing a new rename
register for vl (might still need for vtype).

I could not find any non-esoteric use for the vl-trimming behavior of
the current PoR for larger SEW/LMUL, so given these benefits I move we
adopt the "sets vill for non-iso SEW/LMUL" meaning. The circuit has
to calculate (vsew_new-vlmul_new)!=(vsew_old-vlmul_old) to determine
vill, but now never needs to read or write vl.
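
The check described above can be modelled in a few lines. This is only a sketch in Python, not spec text: it assumes vsew and vlmul are treated as signed log2 exponents (SEW = 8 * 2**vsew, LMUL = 2**vlmul) rather than as raw vtype bit fields, and the function names are invented for illustration.

```python
def sew_lmul_ratio_log2(vsew: int, vlmul: int) -> int:
    """log2(SEW/LMUL) up to a constant, assuming SEW = 8 * 2**vsew, LMUL = 2**vlmul."""
    return vsew - vlmul


def sets_vill(vsew_old: int, vlmul_old: int, vsew_new: int, vlmul_new: int) -> bool:
    """Proposed x0,x0 rule: vill is set when the SEW/LMUL ratio changes.

    vl is neither read nor written by this check.
    """
    return sew_lmul_ratio_log2(vsew_new, vlmul_new) != sew_lmul_ratio_log2(vsew_old, vlmul_old)


# SEW=32, LMUL=1 -> SEW=16, LMUL=1/2: the ratio stays at 32.
print(sets_vill(2, 0, 1, -1))
# SEW=32, LMUL=1 -> SEW=8, LMUL=1: the ratio drops from 32 to 8.
print(sets_vill(2, 0, 0, 0))
```

The first call models a vtype change that keeps vl valid; the second models the non-iso case that would set vill under the proposal.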

I had thought we might be able to reuse the encoding space by
declaring the non-iso SEW/LMUL case "reserved", but this does not
really enable forwards compatibility without mandating a
data-dependent trap as David noted.

As a general optimization guide, software should endeavor to use this
form instead of passing in AVL to avoid the vl update when not
necessary.

I hope this is one we can resolve on the mailing list to save time in
the next meeting.

Krste


David Horner
 

I wholeheartedly agree with resolving on the mailing list.
This should be the rule not exception.


On 2020-07-21 11:58 p.m., Krste Asanovic wrote:
| I want to bring this to group's attention as I think I've convinced
| myself that Guy's suggestion is the correct path to follow, i.e.,
|
| vsetvli x0, x0, imm
|
| will raise vill if the new SEW'/LMUL' ratio is not the same as the old
| SEW/LMUL implying vl might change. Similarly for vsetvl version.
My considerations for allowing vl to change were:

a) Having a compelling reason to change PoR.
     vsetvl[i] is extremely important to RVV success.
     It deserves deep scrutiny.
     Challenging each and every change,
     as well as proposing any plausible enhancement,
     are equally important to get this feature,
     more so than others, right(tm).

b) Tracking assemblers and compilers could present warnings.
     Part of my support was my bias towards encouraging vl tracking support.
     Tracking vl in code has substantial benefits beyond a replacement for this x0,x0 behaviour.
     I believe RVV success and adoption will be substantially hampered without it.
     It is however specific to RVV.
     So marginal hardware support that appears to mitigate a need for vl tracking gets a check in the negative column.

c) A perceived simplicity of PoR for minimal designs.
     I am biased toward ensuring simple machines can efficiently support RVV.

     Initial uptake is likely to be in the application/HPC domain, but
     I believe that ultimately IoT machines will benefit from RVV
     if we continue to emphasize simplicity in the design.

d) Setting vill is excellent as a means to avoid trap behaviour.
     However, it requires an explicit check after vtype-setting ops.
     Opportunistic approaches will rely on the subsequent fault.
     This situation is theoretically impossible to statically backward trace.
     A given RVV data instruction could be branched to from anywhere;
     conditional execution could have executed any vsetvl instruction
     with virtually any rs1 value.
     This biases me away from setting vill; in the x0,x0 case, setting vl avoids setting vill.
     However, in practice branching into a loop will be errant behaviour and
     RVV data instructions will be paired with a vsetvli instruction.
     My paranoia causes me to weigh this too heavily at times. (.... reweighing risks)


e) In the x0,x0 formulation, vsetvli cannot determine the vill state from immediate parsing alone.
     We have striven to ensure the immediate format will meet virtually all in-loop use cases.
     Ideally, vsetvl is reserved for context switch (and custom) situations.
     I considered x0,x0 a punt to the vsetvl (potentially slow) path to allow for the immediate-form optimization
     (i.e. no vill setting considerations after parse).
     However, I am reweighing the benefit of retaining vl against requiring a late setting of vill,
     given vill setting can always be performed on a slow path
     with little real impact to normal code ....  reweighing risks.

| Apart from the debugging motivation that Guy presented,
See my point d.
| I would add
| that this definition effectively removes any read or write of vl from
| the instruction, possibly removing hazards and simplifying dependency
| tracking and relieving an OoO machine from providing a new rename
| register for vl (might still need for vtype).
This does not speak to my point c.

| I could not find any non-esoteric use for the vl-trimming behavior of
| the current PoR for larger SEW/LMUL,
I've found coders and compiler writers collectively more ingenious than I;
there are not only more eyes in free software but a spectrum of inner-eye perceptions and mindsets.

So although relevant to the discussion, in the negative it is not compelling as a benefit.
| so given these benefits I move we
| adopt the "sets vill for non-iso SEW/LMUL" meaning.
| The circuit has
| to calculate (vsew_new-vlmul_new)!=(vsew_old-vlmul_old) to determine
| vill, but now never needs to read
I disagree with this behaviour. Increasing VLMAX does not invalidate the current vl and should not "raise an exception", even indirectly.
If we need a warning, let assemblers/compilers do it; see point b above.

I also disagree that we always set vill if VLMAX is reduced but vl is still < newVLMAX.
Only if the ratio changes do we need to read vl, so in the frequent case I agree the vl read can be avoided.
To avoid a vl read
| or write vl.
My principle is that hardware should not attempt to debug or correct software.
Although hardware developers may believe a specific validation/verification facility will be useful to programmers (SEW/LMUL invariance checking),
such "policy" should not be imposed; rather, a means to electively support such a policy should be provided.
Setting vill when the original vl cannot be maintained is valid; enforcing an invariance policy is not.
| ...
| As a general optimization guide, software should endeavor to use this
| form instead of passing in AVL to avoid the vl update when not
| necessary.
I agree.
This is what was envisioned by providing x0,x0.
Further, this encoding implies an intent which makes code clearer.
Someone doing tricks needs to add a comment.

I'm leaning to accepting the proposal as I amended.

| I hope this is one we can resolve on the mailing list to save time in
| the next meeting.
As do I.

| Krste



Krste Asanovic
 

The main issue is whether the current PoR has any useful purpose when
vl changes. I don't subscribe to "field of dreams" approach. I tried
to find some scenarios hoping there would be some useful cases, but
struggled to come up with anything substantial with current PoR.
There are certainly some possible alternate vl-changing behaviors that
could be useful, but those would be a different instruction. Unless
there is a clear use, the additional vl-modifying behavior in PoR
cannot really be stated as a positive but only a curiosity.

On the negative side, a microarchitecture will have to assume
vl will be read and written by this instruction, even if it almost
never changes. Even for simple machines, this will probably cause
some extra flops to be clocked. For machines with renaming, it can
require a new physical vl is allocated early in machine even if vl
rarely changes. There might be microarch techniques to recycle vl
regs more quickly once known not to change, but would be much simpler
not to have to deal with this. The verification cost alone is a big
negative for a feature that could be rarely/never used.

These instructions will likely be common in loops dealing with
multiple element widths (a common loop will have only one vsetvli that
changes vl and potentially many that manipulate SEW/LMUL), and so
optimizing their implementation is important. Having a hardware
instruction that is "change vtype but not vl, or error" is clearly
useful I think.

The dynamic debug aspect, I agree is relatively minor, but given the
prevalence of "change vtype but not vl" instructions, it is only a
positive that bugs are caught even if not always with clear
determination of problematic instruction (though I guess it will very
rare that the bug will be difficult to find even if only trap on use).

Even though I view dynamic debug as a minor benefit, I think even that
minor concrete benefit outweighs the unknown abstract benefit of
"change vl" behavior, unless there are some great use cases for the
existing PoR scheme that we've missed.

But again, the implementation saving from not having to worry about
dynamic vl changes for these instructions to me far outweighs the
other issues.

Krste




Bill Huffman
 

I agree with Krste's support for Guy's proposal here. Loops with
multiple element widths are likely to have more non-vl-changing
instructions than vl-changing instructions. Knowing this from the
instruction without having to track the sequence involved is likely to
pay benefits in implementation.

Bill




David Horner
 

TL;DR;

Point of agreement #1 - x0,x0 variant should not change vl.
I believe we are also in agreement on

#2 - if vl would change because of a SEW/LMUL change vill should be set.

Outstanding questions:

#3) If vill is set, should vl remain unchanged? (I vote yes.)

#4) Should a potential change of vl set vill? Currently that condition is equivalent to a SEW/LMUL ratio change.

    4a) In all cases? Even if vl is zero? Even if vl is 1? (This rule has fringe cases.)
    4b) What do we do when another vtype parameter is added that also would potentially change vl?
            What is the likely formulation of such an algorithm?
            In general something comparable to a simple ratio would be inadequate.
            I believe this SEW/LMUL formulation is not future proof.

#5) Why not define the x0,x0 variant that doesn't change vl as succeeding if vl doesn't change,
       only setting vill if the resultant new-vl does not match the previous vl?
          (Point #3 is still relevant, but there are no longer any corner cases as in 4a and 4b.)
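
To make the difference between #4 and #5 concrete, here is a small Python model of the two rules. It is a sketch under stated assumptions, not spec behaviour: VLMAX is taken as VLEN * LMUL / SEW, vsew and vlmul are treated as signed log2 exponents (SEW = 8 * 2**vsew, LMUL = 2**vlmul), and the x0,x0 form is assumed to otherwise produce min(old vl, new VLMAX). All function names are invented for the example.

```python
def vlmax(vlen: int, vsew: int, vlmul: int) -> int:
    """VLMAX = VLEN * LMUL / SEW, with SEW = 8 * 2**vsew and LMUL = 2**vlmul."""
    shift = (vsew + 3) - vlmul
    return vlen >> shift if shift >= 0 else vlen << -shift


def vill_by_ratio(vsew_old: int, vlmul_old: int, vsew_new: int, vlmul_new: int) -> bool:
    """#4: set vill whenever the SEW/LMUL ratio changes (vl is never read)."""
    return (vsew_new - vlmul_new) != (vsew_old - vlmul_old)


def vill_by_vl(vl: int, vlen: int, vsew_new: int, vlmul_new: int) -> bool:
    """#5: set vill only if the current vl would not survive the vtype change."""
    return min(vl, vlmax(vlen, vsew_new, vlmul_new)) != vl


VLEN = 128
old = (2, 0)  # SEW=32, LMUL=1, so VLMAX=4
# New settings: SEW=64/LMUL=1, SEW=8/LMUL=1, SEW=16/LMUL=1/2
for vl in (0, 1, 2, 4):
    for new in ((3, 0), (0, 0), (1, -1)):
        print(vl, new, vill_by_ratio(*old, *new), vill_by_vl(vl, VLEN, *new))
```

In this model the two rules differ in the fringe cases of 4a: for instance, with vl = 0 or 1 the ratio rule still flags a move to SEW=64, while the vl rule does not.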

Krste below expresses some reasons that lean towards SEW/LMUL invariance, rather than vl invariance, being the determinant for setting vill.

Specifically, comparing vl to the new-vl requires reading the old vl, and that is potentially expensive. Why not avoid the read of vl altogether?

One approach is based on #4.
Instead, read the previous (current?) vlmul and vsew, calculate the ratio, compare it with the new ratio, and set vill if different.
We can avoid the vlmul/vsew read by retaining the current SEW/LMUL values (or ratio)
(these can be stored locally, only 6 bits for vsew and vlmul)
and comparing that to the new SEW/LMUL ratio.
Quite efficient.

What of advocating for #5 - what is the overhead here?

A simplistic approach can read vl and push it through the existing circuitry,
  except that when the vl that was read exceeds the calculated MAXVL, set vill;
otherwise leave the current vl alone (or overwrite it with itself, whichever).
For simple designs there is a simple implementation that can be further optimized by setting vill on a slow path.

Alternatively we can use the SEW/LMUL optimization approach:
We can store the vl info locally.
For the standard V minimum this is log2(VLEN*8) = log2(128) + log2(8) = 7 + 3 = 10 bits,
with an additional bit per doubling of VLEN.
We compare the calculated vl with that.
This compares favourably to the #4 optimization.

But we can do better than that.
We only need to take the calculated MAXVL
(comparable computational cost to the SEW/LMUL ratio),
which is normally calculated anyway (so we can leverage existing circuitry),
and compare that to locally stored vl information.
MAXVL varies from 1 (in the worst case) to VLEN*8.
As MAXVL is always a power of 2, the number of bits to store is log2(log2(VLEN*8)), or 4 bits for up to VLEN=2K.
Thus 4 bits suffice for the locally saved vl information, which is the minimal MAXVL for the current vl.
(V minimum is ELEN=64 and VLEN=128, which is among the cases for which 3 bits suffice.)
I'm not a circuit guru, but "MINVL" from vl is inexpensive to calculate,
 especially as it also does not need to be on the critical path for the non-x0,x0 variants,
  which are the only ones that need to store vl info locally.
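
The "MINVL" comparison can be sketched as below. This is a Python model, not a circuit: the names minvl_log2 and vl_survives are made up, treating vl = 0 like vl = 1 is an assumption of the sketch, and it relies on MAXVL always being a power of two.

```python
def minvl_log2(vl: int) -> int:
    """log2 of the smallest power-of-two MAXVL that can hold vl.

    Assumption of this sketch: vl = 0 is treated like vl = 1 (exponent 0).
    """
    return (vl - 1).bit_length() if vl > 1 else 0


def vl_survives(stored_minvl_log2: int, new_maxvl_log2: int) -> bool:
    """True if the current vl still fits under the new MAXVL, using only the stored exponent."""
    return stored_minvl_log2 <= new_maxvl_log2


# Cross-check the exponent comparison against the direct comparison vl <= MAXVL.
for vl in range(0, 65):
    for e in range(0, 8):
        assert vl_survives(minvl_log2(vl), e) == (vl <= (1 << e))
```

The stored exponent would be updated only by the vsetvl forms that write vl; the x0,x0 form would only compare it against the exponent of the new MAXVL.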

It would appear that #5 is a net win for circuitry and a better formulation of vl unchanged requirements.

#5 now has my vote.

I provide further analysis within the replies below.

On 2020-07-22 8:21 p.m., Bill Huffman wrote:
| I agree with Krste's support for Guy's proposal here.
Thanks for the response.
| Loops with
| multiple element widths are likely to have more non-vl-changing
| instructions than vl-changing instructions. Knowing this from the
| instruction without having to track the sequence involved is likely to
| pay benefits in implementation.
An important and valid point that I also support.
| Bill
|
| On 7/22/20 4:35 PM, Krste Asanovic wrote:
|| The main issue is whether the current PoR has any useful purpose when
|| vl changes.
I disagree with characterizing this as the main issue.
I agree that it is an important consideration.

The pivotal question, as I see it, is what action the instruction should take when vl would change.
PoR says change it, as any other vsetvl variant would.
|| I don't subscribe to "field of dreams" approach. I tried
|| to find some scenarios hoping there would be some useful cases, but
|| struggled to come up with anything substantial with current PoR.
|| There are certainly some possible alternate vl-changing behaviors that
|| could be useful, but those would be a different instruction. Unless
|| there is a clear use, the additional vl-modifying behavior in PoR
|| cannot really be stated as a positive but only a curiosity.
Until such a use is discovered.

I don't disagree that it is an important consideration, only that it is secondary.
If explicitly disallowing the "apparently useless" behaviour itself causes substantial cost, we can live with a meaningless instruction formulation.
RVI frequently allows formulations that lack a clear and compelling use case because
 they are an artifact of the generally useful operation, and excluding them would increase overhead (instruction decode, etc.),
e.g. bne rs1,rs2,-2, a branch to within the same instruction, which, depending upon the rs1/rs2 values, can be a C.BNEZ infinite loop if a specific register (x8 through x15) is non-zero.
The same could hold true here. In my opinion, this is substantially why the current PoR was adopted (it was the main part of my reasoning).

Bill expresses succinctly:

| Loops with multiple element widths are likely to have more non-vl-changing
| instructions than vl-changing instructions.
It is precisely due to the nature of its expected (lack of) use
that in other situations we would disregard the low-use and esoteric case as harmless.
Consider the reluctance to reserve RVV simm5/rs1=0 formulations that match an existing simpler instruction.
However, in this case I agree that the formulation x0,x0 is valuable to use effectively, solely because vsetvli is so important.
Even as a secondary consideration, lack of usefulness is disturbing for a dominant feature.
|| On the negative side, a microarchitecture will have to assume
|| vl will be read and written by this instruction, even if it almost
|| never changes.
That is PoR, and I believe there is now general agreement (3 to zero so far) that changing vl is not the desired behaviour.
So, Point of agreement #1 - x0,x0 variant should not change vl.

|| Even for simple machines, this will probably cause
|| some extra flops to be clocked.
Let's put this into perspective: all other vsetvl variants write vl; that is the primary purpose, it is explicitly in the name.
We are proposing an optimization for what we anticipate (reasonably) to be a common use case, as Bill stated.
The potential is to save some flops by avoiding the write (and delays caused by its cascade/flow/synch effects).

|| For machines with renaming, it can
|| require a new physical vl is allocated early in machine even if vl
|| rarely changes. There might be microarch techniques to recycle vl
|| regs more quickly once known not to change, but would be much simpler
|| not to have to deal with this.
Agreed. Another check for Point of agreement #1.
|| The (certification) verification cost alone is a big
|| negative for a feature that could be rarely/never used.
Agreed. Another check for Point of agreement #1.

|| These instructions will likely be common in loops dealing with
|| multiple element widths (a common loop will have only one vsetvli that
|| changes vl and potentially many that manipulate SEW/LMUL), and so
|| optimizing their implementation is important. Having a hardware
|| instruction that is "change vtype but not vl, or error" is clearly
|| useful I think.
Agreed. The above argument restated for Point of agreement #1.
|| The dynamic debug aspect, I agree is relatively minor, but given the
|| prevalence of "change vtype but not vl" instructions, it is only a
|| positive that bugs are caught even if not always with clear
|| determination of problematic instruction (though I guess it will very
|| rare that the bug will be difficult to find even if only trap on use).
Expressed as not as persuasive, but at least a fraction of a check for Point of agreement #1.





Even though I view dynamic debug as a minor benefit, I think even that
minor concrete benefit outweighs the unknown abstract benefit of
"change vl" behavior, unless there are some great use cases for the
existing PoR scheme that we've missed.
I agree.

But we cross a line to believe the objective is "that bugs are caught".
What bug is it that we believe we can design hardware to catch?

As a database analyst, I told the application developers with whom I worked
   that their compiled and running program was not "wrong".
It was doing just fine exactly what they directed it to do.
It was the perfect program for a problem other than the one they wanted to solve.

Ditto for bugs. Behaviour that one programmer wants to avoid, another may intend.

We cannot solve bugs in hardware. CICS attempts to do so are infamous.
All we can do is provide operations that do exactly as they are stipulated, ideally with no corner cases, with a simple conceptual definition.

Enforcing a perceived good software/"expected use" policy is rarely directly achievable or desirable.
Keeping the SEW/LMUL ratio invariant is a policy/"expected use case".
I contend there are deliberate exceptions to this policy, or, in the alternative,  at minimum the policy has a limited domain.
If there are exceptions or the domain is limited it is not a good characteristic to enforce, even as a special instruction formulation.

Rather, a better characteristic to enforce is vl invariance in a special formulation.
It follows from the instruction formulation in which no explicit AVL is supplied (X0).
It is the underlying characteristic in the checks above and below.

But again, the implementation saving from not having to worry about
dynamic vl changes for these instructions to me far outweighs the
other issues.
Krste


On Wed, 22 Jul 2020 09:02:03 -0400, "David Horner" <ds2horner@...> said:
| I wholeheartedly agree with resolving on the mailing list.
| This should be the rule not exception.


| On 2020-07-21 11:58 p.m., Krste Asanovic wrote:
|| I want to bring this to group's attention as I think I've convinced
|| myself that Guy's suggestion is the correct path to follow, i.e.,
||
|| vsetvli x0, x0, imm
||
|| will raise vill if the new SEW'/LMUL' ratio is not the same as the old
|| SEW/LMUL implying vl might change. Similarly for vsetvl version.
| My considerations for allowing vl to change were

| a) having a compelling reason to change PoR.
|      vsetvl[i]is extremely important to RVV success.
|      It deserves deep scrutiny.
|      Challenging each and every change,
|     as well as proposing any plausible enhancement
|     are equally important to get this feature,
|     more so than others, right(tm).

| b) tracking assemblers and compilers could present warnings.
|      Part of my support was my bias towards encouraging vl tracking
| support.
|      Tracking vl in code has substantial benefits beyond a replacement
| for this x0,x0 behaviour.
|      I believe RVV success and adoption will be substantially hampered
| without it.I believe that ultimately IOT machines will benefit from RVV
|          if we continue to emphasize  simplicity  the design.
|      It is however specific to RVV.
|      So marginal hardware support that appears to mitigate a need for
| vl tracking gets a check in the negative column.

| c) A perceived simplicity of PoR for minimal designs.
|      I am biased toward ensuring simple machines can efficiently
| support for RVV.

|      Initial uptake is likely to be in the application/HPC domain, but
|      I believe that ultimately IOT machines will benefit from RVV
|          if we continue to emphasize  simplicity  the design.

| d) Setting vill is excellent as a means to avoid trap behaviour.
|      however it requires explicit check after vtype setting ops.
|      Opportunistic approaches will rely on the subsequent fault.
|      This situation is theoretically impossible to statically backward
| trace.
|      A given RVV data instruction could be branched to from anywhere,
|          conditional execution could have executed any vsetvl instruction
|          with virtually any rs1 value.
|      This biases me away from setting vill, in the x0,x0 case setting
| vl avoid vill set.
|      However, in practice branching into a loop will be errant
| behaviour and
|         RVV data instructions will be paired with a vsetvli instruction.
|      My paranoia causes me this too heavily at times. (.... reweighing
| risks)


| e)  in the x0,x0 formulation, vsetvli cannot determine from immediate
| parsing alone vill state.
|      we have strived to ensure the immediate format will meet virtually
| all in loop use cases.
|      Ideally, vsetvl is reserved for context switch (and custom)
| situations.
|      I considered x0,x0 a punt to vsetvl (potentially slow) path to
| allow for the immediate form optimization
|      (i.e. no vill setting considerations after parse) .
|      However, reweighing the benefit of retaining vl and requiring a
| late setting of vill.
|      Given vill setting can always be performed on a slow path
|      with little real impact to normal code ....  reweighing risks.


|| Apart from the debugging motivation that Guy presented,
| see my point d.
|| I would add
|| that this definition effectively removes any read or write of vl from
|| the instruction, possibly removing hazards and simplifying dependency
|| tracking and relieving an OoO machine from providing a new rename
|| register for vl (might still need for vtype).
| this does not talk to my point c.
||
|| I could not find any non-esoteric use for the vl-trimming behavior of
|| the current PoR for larger SEW/LMUL,
| I've found coders and compiler writers collectively more ingenious than I,
|  not only more eyes in free software but a spectrum of inner-eye
| perceptions and mindsets.

| So although relevant to the discussion, in the negative it is not
| compelling as a benefit.
|| so given these benefits I move we
|| adopt the "sets vill for non-iso SEW/LMUL" meaning.

|| The circuit has
|| to calculate (vsew_new-vlmul_new)!=(vsew_old-vlmul_old) to determine
|| vill, but now never needs to read
| I disagree with this behaviour. increasing VLMAX does not invalidate
| current vl and should not" raise an exception" even indirectly.
| If we are needing a warning , let assembler/compilers do it note b above.

| I also disagree that we always set vill if VLMAX reduced but vl is still
| < newVLMAX.
| Only if the ratio changes do we need to read vl, so in the frequent case
| I agree vl read can be avoided.
| To avoid a vl read
|| or write vl.
| My principle is hardware should not attempt to debug or correct software.
| Although hardware developers may believe a specific
| validation/verification facility will be useful to programmers (SEW/LMUL
| in-variance checking)
| such "policy" should not be imposed but rather a means to electively
| support such a policy be provided.
| Setting vill when original vl cannot be maintained is valid, enforcing
| an invariance policy is not.
|| ...
|| As a general optimization guide, software should endeavor to use this
|| form instead of passing in AVL to avoid the vl update when not
|| necessary.
| I agree.
| This is what was envisioned by providing x0,x0.
| Further, this encoding implies an intent which makes code clearer.
| Someone doing tricks needs to add a comment.

| I'm leaning to accepting the proposal as I amended.
||
|| I hope this is one we can resolve on the mailing list to save time in
|| the next meeting.
| as do I.
||
|| Krste


David Horner
 

Messed up when I was trying to simplify the text.

On 2020-07-23 2:19 a.m., David Horner via lists.riscv.org wrote:

What of advocating for #5 - what is the overhead here?

A simplistic approach can read vl and push it through the existing circuitry,

  except when the calculated MAXVL is less than the provided vl set vill

otherwise leave the current vl alone (or overwrite it with itself, whichever).
For simple designs there is a simple implementation that can further be optimized by setting vill  on a slow path.
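
As a rough illustration of the simplistic #5 check described above, here is a small Python sketch. It is not taken from the spec: the function names and the choice to pass LMUL as a log2 value (so that fractional LMUL can be expressed) are assumptions made for the example.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul_log2: int) -> int:
    """VLMAX = VLEN * LMUL / SEW, with LMUL given as log2 (e.g. -1 for LMUL=1/2)."""
    if lmul_log2 >= 0:
        return (vlen_bits << lmul_log2) // sew_bits
    return (vlen_bits >> -lmul_log2) // sew_bits


def x0x0_sets_vill_if_vl_cannot_be_kept(vl: int, vlen_bits: int,
                                        new_sew_bits: int, new_lmul_log2: int) -> bool:
    """Hypothetical #5 rule: set vill only when the new VLMAX is below the current vl."""
    return vlmax(vlen_bits, new_sew_bits, new_lmul_log2) < vl
```

With VLEN=128 and vl=4 (as set under SEW=32, LMUL=1), moving to SEW=64, LMUL=1 gives VLMAX=2 and would set vill, while moving to SEW=16, LMUL=1 gives VLMAX=8 and would leave both vl and vill alone even though the SEW/LMUL ratio changed.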


Andrew Waterman
 



On Wed, Jul 22, 2020 at 11:19 PM David Horner <ds2horner@...> wrote:
TL;DR;

Point of agreement #1 - x0,x0 variant should not change vl.
I believe we are also in agreement on

#2 - if vl would change because of a SEW/LMUL change vill should be set.

Outstanding questions:

#3) If vill is set should vl remain unchanged? (I vote for yes).

Other vsetvl[i] instructions that set vill=1 also set vl=0.  Deviating from that course would be needlessly painful and not especially beneficial.


#4) Should potential change of vl  set vill? Currently that condition is
equivalent to a SEW/LMUL ratio change.

     4a) in all cases? even if vl is zero? even if vl is 1? (this rule
has fringe cases).
     4b) what do we do when another vtype parameter is added that also
would potentially change vl?
             What is the likely formulation of such an algorithm?
             In general something comparable to a simple ratio would be
inadequate.
             I believe this SEW/LMUL formulation is not future proof.

#5) Why not define the x0,x0 variant that doesn't change vl as
succeeding if vl doesn't change?
        Only setting vill if the resultant new-vl does not match the
previous vl.
           (Point #3 is still relevant, but there are no longer any
corner cases as in 4a and 4b).

Krste below expresses some reasons that lean towards SEW/LMUL invariance
rather than vl invariance be the determinant for setting vill.

Specifically, comparing vl to the new-vl requires reading the old vl and
that is potentially expensive, why not avoid the read of vl altogether?

One approach is based on #4.
Instead read previous(current?) vlmul and vsew, calculate ratio, compare
with new ratio and set vill if different.
We can avoid vlmul/vsew read by retaining the current SEW/LMUL values
(or ratio)
(can be stored locally, only 6 bits for vsew and vlmul)
and compare that to the new SEW/LMUL ratio.
quite efficient.
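
A minimal Python sketch of the #4 style ratio comparison may help here. It assumes SEW = 8 << vsew and that LMUL is available as a signed log2 value; the function names are made up for the example and are not from the spec.

```python
def sew_lmul_ratio_log2(vsew: int, vlmul_log2: int) -> int:
    """log2(SEW/LMUL), assuming SEW = 8 << vsew and LMUL = 2**vlmul_log2."""
    return (3 + vsew) - vlmul_log2


def x0x0_sets_vill_on_ratio_change(old_vsew: int, old_vlmul_log2: int,
                                   new_vsew: int, new_vlmul_log2: int) -> bool:
    """Hypothetical #4 rule: set vill whenever the SEW/LMUL ratio differs.

    Note that vl is never read; only the old and new vtype fields are compared.
    """
    return (sew_lmul_ratio_log2(new_vsew, new_vlmul_log2)
            != sew_lmul_ratio_log2(old_vsew, old_vlmul_log2))
```

For example, going from SEW=32, LMUL=1 to SEW=16, LMUL=1/2 keeps the ratio at 32 and passes, while going to SEW=64, LMUL=1 changes it to 64 and would set vill.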

What of advocating for #5 - what is the overhead here?

A simplistic approach can read vl and push it through the existing
circuitry,
   except when the calculated MAXVL exceeds the calculated vl set vill
otherwise leave the current vl alone (or overwrite it with itself,
whichever).
For simple designs there is a simple implementation that can further be
optimized by setting vill  on a slow path.

Alternatively we can use the SEW/LMUL optimization approach:
We can store the vl info locally.
For the standard V minimum this is log2(VLEN*8) = log2(128) + log2(8) = 7 + 3 = 10
bits,
with an additional bit per doubling of VLEN.
We compare the calculated vl with that.
This compares favourably to #4 optimization.

But we can do better than that.
We only need compare calculated MAXVL
(comparable computational cost to SEW/LMUL ratio)
which is normally done anyway (so can leverage existing circuitry)
and compare that to locally stored vl information.
MAXVL varies from 1 (in the worst case) to VLEN*8.
As MAXVL is always a power of 2 the number of bits to store is
log2(log2(VLEN*8)) or 4 bits for up to VLEN=2K.
Thus 4 bits for the locally saved vl information which is the minimal
MAXVL for current vl.
(V minimum is ELEN=64 and VLEN=128, which is among the cases for which 3
bits suffice)
I'm not a circuit guru, but "MINVL" is inexpensive to calculate from vl,
  especially as it does not need to be on the critical path for the
non-x0,x0 variants,
   which are the only ones that need to store vl info locally.

It would appear that #5 is a net win for circuitry and a better
formulation of vl unchanged requirements.
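
The "MINVL" idea above can be sketched in Python as follows. This is only an illustration under stated assumptions: VLEN and SEW are powers of two, LMUL is passed as a log2 value, and all function names are hypothetical. The stored tag is the log2 of the smallest power-of-two VLMAX that still holds the current vl, and the x0,x0 check compares it against log2 of the new VLMAX.

```python
def minvl_tag(vl: int) -> int:
    """log2 of the smallest power-of-two VLMAX that can hold vl (0 for vl <= 1)."""
    return max(vl - 1, 0).bit_length()


def vlmax_log2(vlen_bits: int, sew_bits: int, lmul_log2: int) -> int:
    """log2(VLEN * LMUL / SEW); VLEN and SEW are assumed to be powers of two."""
    return (vlen_bits.bit_length() - 1) + lmul_log2 - (sew_bits.bit_length() - 1)


def x0x0_sets_vill_from_tag(tag: int, vlen_bits: int,
                            new_sew_bits: int, new_lmul_log2: int) -> bool:
    """Set vill when the new VLMAX is too small for the vl that produced the tag."""
    # For VLEN=2048 the largest value compared here is log2(2048*8) = 14,
    # which fits in the 4 bits mentioned above.
    return vlmax_log2(vlen_bits, new_sew_bits, new_lmul_log2) < tag
```

Because VLMAX is a power of two, comparing the small tag gives the same answer as comparing the full vl against the new VLMAX, so the full-width vl value need not be read by the x0,x0 form.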

It's not just about the cost of the comparators; it's also about avoiding the RAW hazard on the previous value of VL.

The RAW hazard on the previous value of vtype in Krste's proposal is less of a concern, since the previous vtype will usually have been supplied by an immediate operand.  Optimizing for this case, it's straightforward for renamed implementations to maintain a speculative copy of the vtype register in the decode stage.  The same doesn't work for vl, which in most cases was most recently sourced from a register operand.
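
To make the decode-stage argument concrete, here is a toy Python model. It is a sketch under assumptions, not a description of any real implementation: the class name, the "ok"/"vill"/"wait" return values, and the conservative handling after a failed check are all invented for the example. It tracks log2(SEW/LMUL) whenever vtype came from an immediate and gives up when vtype came from a register.

```python
from typing import Optional


class DecodeVtypeTracker:
    """Toy decode-stage model of a speculative copy of the SEW/LMUL ratio."""

    def __init__(self) -> None:
        self.ratio_log2: Optional[int] = None  # None: not known at decode

    def vsetvli(self, vsew: int, vlmul_log2: int, x0_x0: bool = False) -> str:
        new_ratio = (3 + vsew) - vlmul_log2  # assumes SEW = 8 << vsew
        if not x0_x0:
            # vtype comes from the immediate, so decode knows the new ratio.
            self.ratio_log2 = new_ratio
            return "ok"
        if self.ratio_log2 is None:
            # Previous vtype came from a register: wait for the real value.
            return "wait"
        if new_ratio != self.ratio_log2:
            # Ratio check fails; be conservative about what decode knows next.
            self.ratio_log2 = None
            return "vill"
        return "ok"

    def vsetvl(self) -> None:
        # vtype is sourced from a register operand, unknown at decode.
        self.ratio_log2 = None
```

In this model the x0,x0 form resolves at decode in the common case of back-to-back vsetvli instructions, and only falls back to waiting after a vsetvl. No equivalent shortcut exists for vl, since its value normally depends on a register operand.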


#5 now has my vote.

I provide further analysis within the replies below.

On 2020-07-22 8:21 p.m., Bill Huffman wrote:
> I agree with Krste's support for Guy's proposal here.
thanks for the response.
>   Loops with
> multiple element widths are likely to have more non-vl-changing
> instructions than vl-changing instructions.  Knowing this from the
> instruction without having to track the sequence involved is likely to
> pay benefits in implementation.
An important and valid point that I also support.
>         Bill
>
> On 7/22/20 4:35 PM, Krste Asanovic wrote:
>> The main issue is whether the current PoR has any useful purpose when
>> vl changes.
I disagree with characterizing this as the main issue.
I agree that it is an important consideration.

The pivotal question as I see it is, what action the instruction should
take when vl would change.
PoR says change it, as any other vsetvl variant would.
>>    I don't subscribe to "field of dreams" approach.  I tried
>> to find some scenarios hoping there would be some useful cases, but
>> struggled to come up with anything substantial with current PoR.
>> There are certainly some possible alternate vl-changing behaviors that
>> could be useful, but those would be a different instruction.  Unless
>> there is a clear use, the additional vl-modifying behavior in PoR
>> cannot really be stated as a positive but only a curiosity.
Until such a use is discovered.

I don't disagree that it is an important consideration, only that it is
secondary.
If explicitly disallowing the "apparently useless" behaviour itself
causes substantial cost, we can live with a meaningless instruction
formulation.
RVI frequently allows formulations that lack a clear and compelling
use case because
  they are artifacts of a generally useful operation, and excluding them
would increase overhead (instruction decode, etc.)
e.g. bne rs1,rs2,-2, branch to within the same instruction which,
depending upon rs1/rs2 values can be a C.BNEZ infinite loop if a
specific register (x8 through x15) is non-zero.
The same could hold true here. In my opinion, this is substantially why
(this was main part of my reasoning),  the current PoR was adopted.

Bill expresses succinctly:

> Loops with multiple element widths are likely to have more non-vl-changing
> instructions than vl-changing instructions.

It is precisely due to the nature of its expected (lack of) use,
that in other situations we would disregard the low use and esoteric
case as harmless.
Consider the reluctance to reserve RVV simm5/rs1=0 formulations that
match an existing simpler instruction.
However, in this case I agree that the formulation x0,x0 is valuable to
use effectively, solely because vsetvli is so important.
Even as a secondary consideration, lack of usefulness is disturbing for
a dominant feature.
>> On the negative side, a microarchitecture will have to assume
>> vl will be read and written by this instruction, even if it almost
>> never changes.
That is PoR and I believe there is now general agreement (3 to zero so
far) that changing vl is not the desired behaviour.
So, Point of agreement #1 - x0,x0 variant  should not change vl.

>>   Even for simple machines, this will probably cause
>> some extra flops to be clocked.
Let's put this into perspective - all other vsetvl variants write vl,
that is the primary purpose, it is explicitly in the name.
We are proposing an optimization for what we anticipate (reasonably) to
be a common use case, as Bill stated.
The potential is to save some flops by avoiding the write (and the delays
caused by its cascade/flow/sync effects).

>>   For machines with renaming, it can
>> require a new physical vl is allocated early in machine even if vl
>> rarely changes.  There might be microarch techniques to recycle vl
>> regs more quickly once known not to change, but would be much simpler
>> not to have to deal with this.
Agreed.  Another check for the Point of agreement #1
>> The (certification)verification cost alone is a big
>> negative for a feature that could be rarely/never used.
Agreed.  Another check for the Point of agreement #1
>>
>> These instructions will likely be common in loops dealing with
>> multiple element widths (a common loop will have only one vsetvli that
>> changes vl and potentially many that manipulate SEW/LMUL), and so
>> optimizing their implementation is important.  Having a hardware
>> instruction that is "change vtype but not vl, or error" is clearly
>> useful I think.
Agreed.  The above argument restated as for the Point of agreement #1
>> The dynamic debug aspect, I agree is relatively minor, but given the
>> prevalence of "change vtype but not vl" instructions, it is only a
>> positive that bugs are caught even if not always with clear
>> determination of problematic instruction (though I guess it will very
>> rare that the bug will be difficult to find even if only trap on use).
Expressed as not as persuasive, but at least a fraction check for the
Point of agreement #1.




>>
>> Even though I view dynamic debug as a minor benefit, I think even that
>> minor concrete benefit outweighs the unknown abstract benefit of
>> "change vl" behavior, unless there are some great use cases for the
>> existing PoR scheme that we've missed.
I agree.

But we cross a line to believe the objective is "that bugs are caught".
What bug is it that we believe we can design hardware to catch?

As a database analyst, I told the application developers with whom I worked
    that their compiled and running program was not "wrong".
It was doing just fine exactly what they directed it to do.
It was the perfect program for a problem other than the one they wanted
to solve.

Ditto for bugs. Behaviour that one programmer wants to avoid, another may
intend.

We cannot solve bugs in hardware. CICS attempts to do so are infamous.
All we can do is provide operations that do exactly as they are
stipulated, ideally with no corner cases, with a simple conceptual
definition.

Enforcing a perceived good software/"expected use" policy is rarely
directly achievable or desirable.
Keeping the SEW/LMUL ratio invariant is a policy/"expected use case".
I contend there are deliberate exceptions to this policy, or, in the
alternative,  at minimum the policy has a limited domain.
If there are exceptions or the domain is limited it is not a good
characteristic to enforce, even as a special instruction formulation.

Rather, a better characteristic to enforce is vl invariance in a
special formulation.
It follows from the instruction formulation in which no explicit AVL is
supplied (X0).
It is the underlying characteristic in the checks above and below.
>>
>> But again, the implementation saving from not having to worry about
>> dynamic vl changes for these instructions to me far outweighs the
>> other issues.

>> Krste
>>
>>
>>>>>>> On Wed, 22 Jul 2020 09:02:03 -0400, "David Horner" <ds2horner@...> said:
>> | I wholeheartedly agree with resolving on the mailing list.
>> | This should be the rule not exception.
>>
>>
>> | On 2020-07-21 11:58 p.m., Krste Asanovic wrote:
>> || I want to bring this to group's attention as I think I've convinced
>> || myself that Guy's suggestion is the correct path to follow, i.e.,
>> ||
>> || vsetvli x0, x0, imm
>> ||
>> || will raise vill if the new SEW'/LMUL' ratio is not the same as the old
>> || SEW/LMUL implying vl might change.  Similarly for vsetvl version.
>> | My considerations for allowing vl to change were
>>
>> | a) having a compelling reason to change PoR.
>> |       vsetvl[i]is extremely important to RVV success.
>> |       It deserves deep scrutiny.
>> |       Challenging each and every change,
>> |      as well as proposing any plausible enhancement
>> |      are equally important to get this feature,
>> |      more so than others, right(tm).
>>
>> | b) tracking assemblers and compilers could present warnings.
>> |       Part of my support was my bias towards encouraging vl tracking
>> | support.
>> |       Tracking vl in code has substantial benefits beyond a replacement
>> | for this x0,x0 behaviour.
>> |       I believe RVV success and adoption will be substantially hampered
>> | without it.I believe that ultimately IOT machines will benefit from RVV
>> |           if we continue to emphasize  simplicity  the design.
>> |       It is however specific to RVV.
>> |       So marginal hardware support that appears to mitigate a need for
>> | vl tracking gets a check in the negative column.
>>
>> | c) A perceived simplicity of PoR for minimal designs.
>> |       I am biased toward ensuring simple machines can efficiently
>> | support for RVV.
>>
>> |       Initial uptake is likely to be in the application/HPC domain, but
>> |       I believe that ultimately IOT machines will benefit from RVV
>> |           if we continue to emphasize  simplicity  the design.
>>
>> | d) Setting vill is excellent as a means to avoid trap behaviour.
>> |       however it requires explicit check after vtype setting ops.
>> |       Opportunistic approaches will rely on the subsequent fault.
>> |       This situation is theoretically impossible to statically backward
>> | trace.
>> |       A given RVV data instruction could be branched to from anywhere,
>> |           conditional execution could have executed any vsetvl instruction
>> |           with virtually any rs1 value.
>> |       This biases me away from setting vill, in the x0,x0 case setting
>> | vl avoid vill set.
>> |       However, in practice branching into a loop will be errant
>> | behaviour and
>> |          RVV data instructions will be paired with a vsetvli instruction.
>> |       My paranoia causes me this too heavily at times. (.... reweighing
>> | risks)
>>
>>
>> | e)  in the x0,x0 formulation, vsetvli cannot determine from immediate
>> | parsing alone vill state.
>> |       we have strived to ensure the immediate format will meet virtually
>> | all in loop use cases.
>> |       Ideally, vsetvl is reserved for context switch (and custom)
>> | situations.
>> |       I considered x0,x0 a punt to vsetvl (potentially slow) path to
>> | allow for the immediate form optimization
>> |       (i.e. no vill setting considerations after parse) .
>> |       However, reweighing the benefit of retaining vl and requiring a
>> | late setting of vill.
>> |       Given vill setting can always be performed on a slow path
>> |       with little real impact to normal code ....  reweighing risks.
>>
>>
>> || Apart from the debugging motivation that Guy presented,
>> | see my point d.
>> || I would add
>> || that this definition effectively removes any read or write of vl from
>> || the instruction, possibly removing hazards and simplifying dependency
>> || tracking and relieving an OoO machine from providing a new rename
>> || register for vl (might still need for vtype).
>> | this does not talk to my point c.
>> ||
>> || I could not find any non-esoteric use for the vl-trimming behavior of
>> || the current PoR for larger SEW/LMUL,
>> | I've found coders and compiler writers collectively more ingenious than I,
>> |   not only more eyes in free software but a spectrum of inner-eye
>> | perceptions and mindsets.
>>
>> | So although relevant to the discussion, in the negative it is not
>> | compelling as a benefit.
>> || so given these benefits I move we
>> || adopt the "sets vill for non-iso SEW/LMUL" meaning.
>>
>> || The circuit has
>> || to calculate (vsew_new-vlmul_new)!=(vsew_old-vlmul_old) to determine
>> || vill, but now never needs to read
>> | I disagree with this behaviour. increasing VLMAX does not invalidate
>> | current vl and should not" raise an exception" even indirectly.
>> | If we are needing a warning , let assembler/compilers do it note b above.
>>
>> | I also disagree that we always set vill if VLMAX reduced but vl is still
>> | < newVLMAX.
>> | Only if the ratio changes do we need to read vl, so in the frequent case
>> | I agree vl read can be avoided.
>> | To avoid a vl read
>> || or write vl.
>> | My principle is hardware should not attempt to debug or correct software.
>> | Although hardware developers may believe a specific
>> | validation/verification facility will be useful to programmers (SEW/LMUL
>> | in-variance checking)
>> | such "policy" should not be imposed but rather a means to electively
>> | support such a policy be provided.
>> | Setting vill when original vl cannot be maintained is valid, enforcing
>> | an invariance policy is not.
>> || ...
>> || As a general optimization guide, software should endeavor to use this
>> || form instead of passing in AVL to avoid the vl update when not
>> || necessary.
>> | I agree.
>> | This is what was envisioned by providing x0,x0.
>> | Further, this encoding implies an intent which makes code clearer.
>> | Someone doing tricks needs to add a comment.
>>
>> | I'm leaning to accepting the proposal as I amended.
>> ||
>> || I hope this is one we can resolve on the mailing list to save time in
>> || the next meeting.
>> | as do I.
>> ||
>> || Krste
>> ||
>> ||
>> ||
>> ||
>>
>>
>> |
>>
>>





Krste Asanovic
 

On Wed, 22 Jul 2020 23:37:02 -0700, Andrew Waterman <andrew@...> said:
| On Wed, Jul 22, 2020 at 11:19 PM David Horner <ds2horner@...> wrote:

| #3) If vill is set should vl remain unchanged? (I vote for yes).

| Other vsetvl[i] instructions that set vill=1 also set vl=0.  Deviating from that course would be needlessly painful and not especially beneficial.

It does add a non-orthogonality, but it is certainly beneficial in
renamed machines to know that vl is never changed by the instruction.

Krste


Andrew Waterman
 



On Wed, Jul 22, 2020 at 11:42 PM <krste@...> wrote:

>>>>> On Wed, 22 Jul 2020 23:37:02 -0700, Andrew Waterman <andrew@...> said:

| On Wed, Jul 22, 2020 at 11:19 PM David Horner <ds2horner@...> wrote:

|     #3) If vill is set should vl remain unchanged? (I vote for yes).

| Other vsetvl[i] instructions that set vill=1 also set vl=0.  Deviating from that course would be needlessly painful and not especially beneficial.

It does add a non-orthogonality, but it is certainly beneficial in
renamed machines to know that vl is never changed by the instruction.

Disagreed. It’s fine to treat vsetvl instructions that set vill as pipeline flushes. Uarch can therefore assume vl isn’t changed.



Krste


David Horner
 

see the rest of the thread for more context.

On 2020-07-23 2:37 a.m., Andrew Waterman wrote:
It would appear that #5 is a net win for circuitry and a better
formulation of vl unchanged requirements.

It's not just about the cost of the comparators; it's also about avoiding the RAW hazard on the previous value of VL.

The RAW hazard on the previous value of vtype in Krste's proposal is less of a concern, since the previous vtype will usually have been supplied by an immediate operand.  Optimizing for this case, it's straightforward for renamed implementations to maintain a speculative copy of the vtype register in the decode stage.  The same doesn't work for vl, which in most cases was most recently sourced from a register operand.

To clarify for the list:
The RAW (Read after Write) hazard already exists for all vl consumers, specifically all RVV data operations and vl csr read.
PoR rules are crafted so that substantial validation can occur without knowing vl. 
   (e.g. register group alignment given lmul and vr1/vr2/vd )
Nevertheless, aggressive ooo will have to carry a tentative vl value for at least sets of RVV instructions.
If that value has changed, in-flight ops will potentially need to be rolled back/synched to a checkpoint, the new vl supplied and execution rescheduled/resumed.


A) The x0,x0 formulation potentially adds this vsetvli variant to those instructions that consume vl.
B) The desire is that this variant can also be eliminated as a writer of vl, which could create the RAW hazard.

Point of agreement # 1 plus #3 guarantees B.
So, as Krste mentions, for some loss of orthogonality we get a guaranteed vl RAW threat avoidance.


Krste's proposal (check SEW/LMUL invariance) handles the majority of the use cases, trading a vl RAW concern for a vsew/lmul RAW concern.
Unfortunately, vsew/lmul RAW hazards also arise from vsetvl register values.
Fortunately these are infrequent so a brute force stall on vsetvl and/or quiesce might be appropriate.

Quiesce is not an appropriate default remedy for vsetvli x0,x0.
However, quiesce for failed  SEW/LMUL invariance check is very appropriate as it is anticipated to be very rare indeed.
(rare to the point, apparently, that some believe it should not be allowed).

My points to this are
a) ooo exceptions are hardly rare and the mechanism to invoke a failsafe is well understood and triggered in many scenarios.
       Handling this x0,x0 case could be an additional hardship, yes, but not uniquely so nor especially arduous.
b) A full quiesce is not required, and the x0,x0 stall waiting for updated vl can be avoided in virtually all cases by the SEW/LMUL check 
c) the stall on x0,x0 can be asynchronous with other downstream processing iff #1 and #3 are both approved.
      i.e. the x0,x0 instruction will not introduce any further hazard than is already present for concurrent processes that consume vl.
d) setting vill also affects downstream (and perhaps concurrently inflight RVV data operations/instructions).
      condition a) has to be available in any case,
       it is only that the extremely rare condition of vl mismatch will be deferred
       potentially invoking more rollback or limiting sync opportunities.
e) even this can be mitigated by tagging vsetvli with the minimal bits required for "VLMIN" as a speculative copy.
f) even if the recovery from SEW/LMUL or vl mismatch is abysmal,
   (i.e. full checkpoint, roll back and quiesce),
   the application can avoid this by using the standard formulation of AVL in rs1 on such machines. 
g) In conclusion, not setting vill when vl does not change, even if the SEW/LMUL ratio does, need not materially introduce or exacerbate any RAW hazards.

My vote is still with #5.
It is consistent with past practice of deference to simplicity of architecture at the potential expense of (ooo) microarchitecture.
Especially as in this case where  multiple reasonable approaches to mitigate the RAW hazards, at low in-practice performance cost, are possible.










David Horner
 



On 2020-07-23 6:27 a.m., Andrew Waterman wrote:


On Wed, Jul 22, 2020 at 11:42 PM <krste@...> wrote:

>>>>> On Wed, 22 Jul 2020 23:37:02 -0700, Andrew Waterman <andrew@...> said:

| On Wed, Jul 22, 2020 at 11:19 PM David Horner <ds2horner@...> wrote:

|     #3) If vill is set should vl remain unchanged? (I vote for yes).

| Other vsetvl[i] instructions that set vill=1 also set vl=0.  Deviating from that course would be needlessly painful and not especially beneficial.

It does add a non-orthogonality, but it is certainly beneficial in
renamed machines to know that vl is never changed by the instruction.

Disagreed. It’s fine to treat vsetvl instructions that set vill as pipeline flushes. Uarch can therefore assume vl isn’t changed.
In my response to your prior post I stated that #1 and #3 are needed to guarantee vl invariance in speculative cases.
I agree with you that such a guarantee is not needed, as assumption is adequate to speculatively proceed.
So, ignore my "iff"s and "buts" about #3 for aggressive ooo.


I will think through diagnostic and recovery value.
And if there is any potential benefit to other Uarch than aggressive ooo.

Thanks.



Krste


David Horner
 

On 2020-07-23 2:42 a.m., krste@... wrote:
On Wed, 22 Jul 2020 23:37:02 -0700, Andrew Waterman <andrew@...> said:
| On Wed, Jul 22, 2020 at 11:19 PM David Horner <ds2horner@...> wrote:

| #3) If vill is set should vl remain unchanged? (I vote for yes).
Not a hill for me to die on, but I believe vsetvli x0,x0 is sufficiently important that even this aspect should be fully vetted.

| Other vsetvl[i] instructions that set vill=1 also set vl=0.
Other vsetvl[i] instructions are essentially different beasts than this variant.
The precedent is not particularly instructive nor persuasive.
Had we anticipated that the potentially dominant instruction for updating vtype fields would be vl invariant
we would have leaned towards error identification leaving vl alone.


Deviating from that course would be needlessly painful
Is this pain predominantly due to existing behaviour entrenched in existing designs, verification, tool chains and documentation?

A) documentation will in any case, now, have to change for this revised behaviour.
    Although updating documentation can be a pain, the incremental cost for either decision on #3 appears minimal to me...

B) I opine that current correctly formulated software behaves thusly:
     The current software tool chain is indifferent to whether vl is zeroed or retained.
     If software checks vill after vsetvl[i] and finds vill set, it ignores vl.
     If software does not check vill, it waits for an exception to be triggered.
     Exception handling code does not check for zero vl, but currently assumes zero.
          So it ensures that any restart sequence includes a vsetvl instruction to establish a correct vl.
          But rather, in most cases, it just reports the failure.
     The current debug code, similarly, assumes vl will be zero for the purpose of directing its support.
     But if it reports vl, it renders the result of a read from the vl csr even if vill is set.
    If this accurately reflects the current software state (directed as it is by the current vl=0 when vill is set),
    then allowing vl to remain unchanged when vill is set has minimal impact.
    Along with the aforementioned documentation update,
    current tool chain is not disrupted,
    but future enhancements can leverage the additional information.
    (Simplifying and augmenting debugging, allowing comprehensive error recovery determination, and simplifying the recovery code sequences).


C) A similar opinion for verification:
    Current verification check is always for zero when vill set.
    I suspect current verification and validation is minimal and cross dependency checks are few.
    Further, vsetvli error setting behaviour is simple and independent of complex system states.
    i.e. the changes are localized and few to support vl unchanged.
    As a result, changing the body of existing code should be inexpensive.

D) existing designs
    It goes without saying that committed hardware is least flexible.
    However, it is because entrenchment can so easily occur each version has a substantial warning:
Once the draft label is removed, version 0.x is intended to be stable enough to begin developing toolchains, functional simulators, and initial implementations,
  We haven't removed that label yet.

  For simple in-order implementations the change from clearing vl coincident with setting vill, to leaving vl alone is less than trivial; it removes some circuitry.
  For aggressive ooo on the other hand resetting vl on a vill triggered flush is much simpler than
    obtaining the correct vl and ensuring it is present in the csr and
    any other place resumption code might be expecting it.
  In the general case, additional/further roll-back/recover to further checkpoints may be needed to do so.

   So I expect it is not just entrenchment but a potential real cost to aggressive ooo.
   I don't believe massively parallel spatial and temporal designs are as adversely affected.
       Krste, Andrew and others can speak to that.


and not especially beneficial.
 I interleaved situations in which I believe retaining vl is beneficial.
 As implied by my examples, it is not just for x0,x0 that the previous vl could be beneficial.
 But the x0,x0 case does benefit more.
     Having undisturbed vl when the program specifically asked for undisturbed vl by using this explicit formulation reduces confusion in programmers.
     It is the no-surprises promise.
     The other forms arguably were asking for a new vl value, so giving zero when the machine says I cannot do what you are asking is not a surprise.
     However, it is comparably unsurprising that the machine would leave the vl value alone if it cannot give you a valid new value.

  Either alternative, undisturbed or zeroed (or in the case of ma and ta, ones), is acceptable to programmers and intuitive.
     They are common outcomes for well behaved instructions.
     Programmers are less accommodating to "indeterminate" results. RV and RVV have done well to avoid such.



Notably, in the privileged spec when two distinct and competing results are possible, both have been allowed.
   cf. allowing both zero and populated for misa csr. ditto for mtval.

I believe it is equally possible to allow both zero or undisturbed vl when vill is set.
Simplicity where desired, and cost/overhead tradeoffs that depend on the Uarch, are precisely why both are allowed.


But, one might argue, RVV is a non-privileged spec!
Such indeterminate state is repugnant for user state.

The counter argument is that RVV has Uarch visible state, especially egregious is the vstart settings.
Not only is its behaviour implementation defined, user mode can set vstart to cause unpredictable results.

The csr vl has been reined in substantially in contrast.
It is not directly writable.
The resultant values from vsetvl[i] in all its variants are well defined/discoverable/predictable.

Allowing undisturbed and zeroed vl is no more challenging for userland to comprehend than the ma/ta machine dependent behaviour.

My vote continues to be #3, leave vl undisturbed.
     I would agree to making vl undisturbed whenever vill is set.
     I would begrudgingly agree to allowing both zeroed and undisturbed vl, if this is the consistent behaviour on a machine whenever vill is set.
     I would reluctantly concede to allowing both to occur depending upon whether x0,x0 or other vsetvl[i] formulations are at play.
     I do not support allowing both to occur in the same hart, one or the other dependent upon some internal Uarch state.


Krste Asanovic
 

On Thu, 23 Jul 2020 03:27:03 -0700, Andrew Waterman <andrew@...> said:
| On Wed, Jul 22, 2020 at 11:42 PM <krste@...> wrote:
|||||| On Wed, 22 Jul 2020 23:37:02 -0700, Andrew Waterman <andrew@...> said:
| | On Wed, Jul 22, 2020 at 11:19 PM David Horner <ds2horner@...> wrote:
| |     #3) If vill is set should vl remain unchanged? (I vote for yes).
| | Other vsetvl[i] instructions that set vill=1 also set vl=0.  Deviating from that course would be needlessly
| painful and not especially beneficial.
| It does add a non-orthogonality, but it is certainly beneficial in
| renamed machines to know that vl is never changed by the instruction.

| Disagreed. It’s fine to treat vsetvl instructions that set vill as pipeline flushes. Uarch can therefore assume vl
| isn’t changed.

There is still a non-trivial complexity/verification cost here versus
never changing vl.

But I think there's another detail we've been overlooking that we need
to consider.

The vsetvl variant with vtype as register operand is used to restore
vector register state after a context swap. It is not currently
clearly specified, but in the case that the restored vtype value has
vill bit set, the current text implies vl should be cleared.

Section 6.1
"If the vtype setting is not supported by the implementation, then the
vill bit is set in vtype, the remaining bits in vtype are set to zero,
and the vl register is also set to zero."

If we are to allow vl to be set to any value when new vtype.vill=1,
then we have to define a rule for how the source vl value affects the vl
CSR. This would simply be "truncate to number of supported vl bits",
though we need to consider the (small) cost of implementing this rule
correctly when the priv architecture supports emulating shorter VLEN
in lower privileged levels.

I think there are two orthogonal decisions to take:

"vill=1 on SEW/LMUL change" or "vill=1 on vl change" during vsetvl{i} x0, x0
------------------------------------------------------------------------

a) "vill=1 on vl change"

b) "vill=1 on SEW/LMUL change"

Proposal a) The "vill=1 on vl change" form supports additional
functionality. The implied read of vl is a RAW dependency that
microarchitectures have to either resolve ahead of execution, or
speculate that vl doesn't change and flush on mispredict. It's not
clear to me when this additional functionality is useful, as it
overlaps with fractional LMUL functionality, but possibly when it is
known application vectors would fit into non-power-of-2 vector register
groups.

Proposal b) The "vill=1 on SEW/LMUL change" form avoids a read of current vl
but limits vtype changes to constant SEW/LMUL ratios.

The current plan of record is option z)
z) "vill=1 only on bad new SEW/LMUL", which allows vl to change without
reporting vill.
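
As an illustration only, the three candidate rules for vsetvl{i} x0, x0 can be modelled in a few lines of Python. The function names, the policy letters passed as strings, and VLEN = 128 are assumptions made for this sketch. It also assumes that under a) vl would change exactly when the current vl exceeds the new VLMAX, and that the new SEW/LMUL pair is itself supported.

```python
from fractions import Fraction

VLEN = 128  # assumed vector register length in bits (illustrative only)


def vlmax(sew, lmul, vlen=VLEN):
    """VLMAX = VLEN * LMUL / SEW, truncated to an integer."""
    return int(Fraction(vlen) * Fraction(lmul) / sew)


def sets_vill(policy, old_sew, old_lmul, new_sew, new_lmul, vl, vlen=VLEN):
    """Return True if vsetvl{i} x0, x0 would set vill under the given policy."""
    if policy == "a":
        # "vill=1 on vl change": the current vl must be read.
        return vl > vlmax(new_sew, new_lmul, vlen)
    if policy == "b":
        # "vill=1 on SEW/LMUL change": only the ratios are compared, vl is ignored.
        old_ratio = Fraction(old_sew) / Fraction(old_lmul)
        new_ratio = Fraction(new_sew) / Fraction(new_lmul)
        return new_ratio != old_ratio
    if policy == "z":
        # Plan of record: vl may be trimmed without reporting vill.
        return False
    raise ValueError(policy)
```

For example, going from SEW=32, LMUL=1 to SEW=64, LMUL=1 with vl=2 changes the ratio but not vl, so b) reports vill while a) and z) do not.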


vl zeroing on vill
------------------

c) any time vill is set, vl is zeroed.

d) vl never changes even if vtype.vill is set in vsetvl{i} x0, x0. "vsetvl
rd, xavl, xnewvtype" form writes vl with LSBs of xavl when
xnewvtype.vill=1 (otherwise as before). These instruction forms could
be renamed to "vsetvtype{i}" to make this distinction clearer.

Proposal c) is simpler conceptually. In particular, if a SEW/LMUL
configuration is not supported, then no matter which instruction form
is used to set vtype, vill will be set and vl zeroed. But a) requires
uarchs also zero vl for requested vl changes on "vsetvl{i} x0, x0".

Proposal d) adds a little complexity to vsetvl form of instruction,
but there is already a path to write vl from xavl, so vill=1 case would
be same as VLMAX>=xavl case (I think this means the "emulate shorter
VLEN machine" mechanism drops out of the same path). The conceptual
complexity (admittedly, with not much practical impact) is that a bad
SEW/LMUL setting can set vill but not change vl depending on form, but
changing assembly instruction name should make this easier to explain.
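
A rough sketch of what c) and d) would leave in vl once vill has been set might look like the following. The function name, the form labels "x0,x0" and "restore", and the vl_bits parameter (a stand-in for "number of supported vl bits") are hypothetical and chosen only for this example.

```python
def vl_after_vill(policy, form, old_vl, xavl=0, vl_bits=8):
    """vl value after a vsetvl{i} that ends with vtype.vill = 1.

    policy:  "c" or "d" from the list above.
    form:    "x0,x0" for vsetvl{i} x0, x0,
             "restore" for vsetvl rd, xavl, xnewvtype.
    vl_bits: hypothetical number of implemented vl bits.
    """
    if policy == "c":
        # Any time vill is set, vl is zeroed.
        return 0
    if policy == "d":
        if form == "x0,x0":
            # vl never changes in this form.
            return old_vl
        if form == "restore":
            # vl takes the LSBs of xavl, truncated to the supported width.
            return xavl & ((1 << vl_bits) - 1)
    raise ValueError((policy, form))
```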

Combinations
------------

a) & c) vl read / vl written

a) & d) vl read / vl not written

b) & c) vl not read / vl written

b) & d) vl not read / vl not written
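
The read/write properties of each pairing follow mechanically from the definitions: a) needs the current vl while b) does not, and c) zeroes vl while d) leaves it alone in the x0, x0 form. A small enumeration, with table and function names invented for this sketch, makes that derivation explicit:

```python
from itertools import product

# Whether vsetvl{i} x0, x0 must read the current vl under each vill rule.
READS_VL = {"a": True, "b": False}
# Whether vsetvl{i} x0, x0 can write vl under each zeroing rule.
WRITES_VL = {"c": True, "d": False}


def combinations():
    """Return (vill rule, zeroing rule, reads vl, writes vl) for each pairing."""
    return [
        (first, second, READS_VL[first], WRITES_VL[second])
        for first, second in product("ab", "cd")
    ]


for first, second, reads, writes in combinations():
    read_text = "read" if reads else "not read"
    write_text = "written" if writes else "not written"
    print(f"{first}) & {second}) vl {read_text} / vl {write_text}")
```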



Krste


David Horner
 

On 2020-07-24 6:11 a.m., krste@... wrote:
The vsetvl variant with vtype as register operand is used to restore
vector register state after a context swap. It is not currently
clearly specified, but in the case that the restored vtype value has
vill bit set, the current text implies vl should be cleared.

Section 6.1
"If the vtype setting is not supported by the implementation, then the
vill bit is set in vtype, the remaining bits in vtype are set to zero,
and the vl register is also set to zero."

If we are to allow vl to be set to any value when new vtype.vill=1,
then we have to define a rule for how the source vl value affects the vl
CSR.
Is this in the x0,x0 case? I see this as the only case that needs to be considered.
The EE does not have to both set vill and establish a saved vl value in the same instruction.
A sequence of vsetvl instructions may be necessary to end up with the saved state.
The EE should verify/validate after a "restore sequence" that the saved state is established,
 perhaps remigrating to the original hart if unsuccessful.

Or am I missing the point?
This would simply be "truncate to number of supported vl bits",
providing a rs1 with the desired value also allows truncation to "supported vl bits".
I don't know that any special behaviour needs to be defined.
Again I could be missing the point.

though we need to consider the (small) cost of implementing this rule
correctly when the priv architecture supports emulating shorter VLEN
in lower privileged levels.
ditto.


David Horner
 

On 2020-07-24 6:11 a.m., krste@... wrote:
I think there are two orthogonal decisions to take:

"vill=1 on SEW/LMUL change" or "vill=1 on vl change" during vsetvl{i} x0, x0
------------------------------------------------------------------------
To be clear, this is SEW/LMUL ratio change, correct?
All other values being valid and the "SEW and LMUL" combination itself being valid.

Providing an invalid SEW and LMUL combination will set vill for all vsetvl variants.
We still need to determine what should happen with vl (cleared or unchanged) in the overall context of the final resolution of x0,x0.
This too may be an orthogonal consideration ....