Issue categorization - #460
minor typos; substantial correction:
On 2020-06-29 10:12 a.m., David Horner via lists.riscv.org wrote:
Although I agree that the proposal itself can be implemented in a manner consistent with the current vsetvli definition,
also throughout the email
vd should be rd
vs1 should be rs1
And vs2 is completely bogus.
Sorry I didn't catch this sooner.
Can you focus on what would not be possible if we ratified current proposal.
Remember EDIV is not in 1.0 and Vlmul=100 is reserved
On 2020-06-29 10:36 a.m., Krste Asanovic wrote:
Can you focus on what would not be possible if we ratified current proposal.
It is a planned extension that, even if not implemented as currently proposed, will likely consume at least 2 bits in vtype.
Do you envision a possibility that it will consume no vtype bits?
The reality is that the 32-bit encoding is highly dependent upon
vtype to reduce opcode space by providing common fields.
Of course, the ideal candidates for vtype inclusion are modifiers that will remain constant over many vector instructions.
vlmul and vsew fit the requirements.
We don't know yet if vma and vta are going to be sufficiently variable and sufficiently constant over op sequences to be useful in vtype.
But there they are using up 2 bits for an anticipated performance (and in my mind "software precision/accuracy") benefit.
I agree that their inclusion in V1.0 was necessary for their inclusion into the full ecosystem.
However, they are an example of how quickly bits can be consumed by anticipated vs proven need.
Post V1.0 we will be more cautious in inclusion into vtype.
Fortunately, the custom space will allow explicit testing of the relative merit of various extensions vying for inclusion in vsetvli.
However, the fewer bits available in the base vsetvli the more likely we will have to reject equally useful opcode extensions from concurrent use and the most efficient code sequences.
I expect it will be useful for encoding parameters that do not require lmul, e.g. one related to mask ops.
But even trying to think of a potential use for ops that don't use lmul (or always use a default lmul value of, say, 8) is challenging.
So, one might be tempted to think that it would then be useful as an extension flag.
However, the very ubiquitous nature of lmul would mean that the
extension would also encode lmul using 3 bits.
Thus the benefit is lost: for an alternate encoding that provides n more bits (such as #460, which provides 6), the net increase is only n-2 bits.
Not a winning solution in the anticipated "lmul is required" encoding space.
For 1.0, we are just trying to fix vsew, vlmul, vma, and vta (and also vill in vtype, but that’s out of vsetvli immediate range).
I think it’s clear that vma and vta are not going to change very often in many code sequences, and agnostic provides significant PPA benefit for renamed register machines, especially with long vectors.
I can’t see what you are trying to propose that would affect the 1.0 spec? Are you saying that vsew, vlmul, vma, vta should not be in the vsetvli immediate space?
I think I understand how I confused the situation.
Issue #458 introduced the idea of using rd and rs1 values to encode more bits for vsetvli.
I proposed that this become the only vsetvli format.
Krste countered that the current format could be expanded later if needed to adopt the new format as long as a field encoding was otherwise unused.
I agreed that this was technically possible. But I did not raise a concern that this would have potential negative consequences.
In the meantime, I opened #460, which in addition to the rd and rs1 encoding, avoided using a bit within vtype to allow for vl calculation based on lmul of 3,5,6 or 7.
In my mind, #460 raised all the concerns and considerations present in #458.
Further, it provided additional support for the rd/rs1 format by using the novel encoding in a unique way.
As a result I closed #458 to have all the relevant discussion tracked on #460.
It could, however, easily have been inferred that I closed #458 because the "escape mechanism" was perfect.
The closing comments in #458 however explicitly recommend the concern be revisited as V1.0 approaches.
If the "resolve for v1.0" label had been available then I likely would have suggested it for #458 and definitely for #460.
The intro in #460 also implies the need to give early consideration to this format:
On 2020-06-30 11:12 a.m., Krste Asanovic wrote:
| For 1.0, we are just trying to fix vsew, vlmul, vma, and vta (and also vill in vtype, but that’s out of vsetvli immediate range).
| I think it’s clear that vma and vta are not going to change very often in many code sequences,
If this is indeed true, then this makes the fields candidates for vtype fields that are only set by vsetvl (those in range [XLEN-2:11])
| and agnostic provides significant PPA benefit for renamed register machines, especially with long vectors.
I agree they likely have merit, I advocated for their inclusion in vtype, and in vsetvli.
I am proposing that we seriously consider the consequences of providing a vsetvli instruction that has so limited an immediate field.
There are alternatives; #458 and #460 are two that increase functionality (complete lmul range) and immediate bit encoding (by up to 6 bits).
Using vlmul = "100" for vsetvl opcode decoding rather than the immediate sign bit [ bit 31] is another low cost approach that recovers a bit.
And of course there are other alternatives.
| Are you saying that vsew, vlmul, vma, vta should not be in the vsetvli immediate space?
As reasoned above, vma and vta are candidates to be removed.
Conversely, vsew and vlmul are prime candidates for inclusion in the vsetvli immediate space because:
they are essential to the "set vl" function, and
they are common modifiers to base operations (as in the expected 64bit op-code space) and
they are often used in conjunction with one another and
many code examples show sew/lmul variation within typical loops.
This is another aspect that needs to form part of the reasoning about the sufficiency of vsetvli immediate space:
Pressure on the immediate form of the instruction would be drastically diminished if
only those fields that definitively provide an appreciable benefit to code efficiency are included.
(in particular, if the field can be hoisted from the loop it is not a good candidate).
To me, the combination of removing vma and vta from the immediate and
using lmul="100" for vsetvl encoding
removes sufficient pressure that
an immediate bit could be used to expand lmul to 3,5,6 and 7 and
still provide for judicious inclusion of warranted future immediates
for years, without invoking the rd/rs1 encoding.
However, switching to rd/rs1 encoding does provide a substantial margin for error and neatly addresses the lmul=3,5,6 and 7 concern.
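The bit budget behind this argument can be tallied in a few lines. This is only a sketch: the field widths come from the thread (11 immediate bits, 3 each for vsew and vlmul, 1 each for vta and vma), while the name `lmul_ext` and the reading that using vlmul="100" to select vsetvl frees bit 31 for the immediate are assumptions made for illustration.

```python
IMM_BITS = 11  # zimm[10:0] of vsetvli, as stated in the thread


def free_bits(total_bits, fields):
    """Immediate bits left over after the named fields are allocated."""
    return total_bits - sum(fields.values())


# Allocation under discussion for 1.0.
current = {"vsew": 3, "vlmul": 3, "vta": 1, "vma": 1}

# Alternative described above (assumed reading): vta/vma leave the
# immediate, one extra bit (hypothetical name lmul_ext) extends lmul to
# 3, 5, 6 and 7, and bit 31 is recovered for the immediate.
alternative = {"vsew": 3, "vlmul": 3, "lmul_ext": 1}

current_free = free_bits(IMM_BITS, current)
alternative_free = free_bits(IMM_BITS + 1, alternative)
print(current_free, alternative_free)  # prints "3 5"
```

Under these assumptions the alternative leaves 5 unallocated immediate bits instead of 3, before any rd/rs1 encoding is considered.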
Forgive top posting, but the email was long and I want to bring key
points forward for others.
#458/460 propose restricting allowed register numbers in a vsetvli to
give more bits to future vsetvli uses. I just cannot see why we do
this for several reasons:
1) the impact on ABI and compilers and assembly programmers by restricting registers in this way
2) we are far from using all encoding bits in vsetvli yet
3) there are many other cleaner ways to increase configuration space when we need it
4) momentum, people have been working with current scheme and this is a major shift
The only two bits under debate are ta and ma.
I view ta/ma as a real and immediate need versus potential future
hypothetical needs. In some cases, these will be varied inside inner
loop, e.g., when stripmining through reductions with a vector of
partial sums, or when performing various convolution operations that
edit vector registers. In other cases, they will be relatively
static, when agnostic is always fine.
For the specific use case given in #460, I view non-power-of-2 LMUL as
possibly interesting but not in same importance category as supporting
renaming with mask agnostic. For example, how much is really gained
versus simply rounding down to next power-of-2 LMUL? The largest gain
is from 4 to 7, but at LMUL=4, any reasonable hardware should already
be very efficient and there will then be more registers available for
live values. There are also issues about widening/narrowing, etc.,
when EMUL is not a power-of-two, implying we need EMUL 2.5, 3.5 for
example, or don't support those operations for those LMULs so they're
2nd-class citizens. Also, most of the same effect can be had with an
additional Bitmanip "minu" instruction on AVL that's only needed once
per inner loop with some setup outside loop.
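One reading of the "minu on AVL" remark, sketched in Python: clamp AVL to the capacity of the desired non-power-of-2 LMUL, then request the next larger power-of-2 LMUL. The model below assumes the simplified rule vl = min(AVL, VLMAX) with VLMAX = LMUL*VLEN/SEW; the actual vl rule allows other choices for some AVL values, and all function names here are hypothetical.

```python
def vlmax(vlen_bits, sew_bits, lmul):
    """Elements that fit in a group of lmul vector registers."""
    return lmul * vlen_bits // sew_bits


def set_vl(avl, vlen_bits, sew_bits, lmul):
    # Simplified model of vsetvli: vl = min(AVL, VLMAX).
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


def emulated_vl(avl, vlen_bits, sew_bits, target_lmul):
    """vl for a non-power-of-2 target LMUL using only power-of-2 LMUL."""
    hw_lmul = 1
    while hw_lmul < target_lmul:      # round up to the next power of two
        hw_lmul *= 2
    cap = vlmax(vlen_bits, sew_bits, target_lmul)  # setup outside the loop
    clamped = min(avl, cap)                         # the "minu" step
    return set_vl(clamped, vlen_bits, sew_bits, hw_lmul)


# VLEN=128, SEW=32: four elements per register.
print(emulated_vl(100, 128, 32, 3))  # prints 12 (three registers' worth)
print(emulated_vl(5, 128, 32, 3))    # prints 5
```

The cost in this sketch is the register group itself: the hardware LMUL is still 4, so the fourth register of each group is reserved even though it holds no active elements.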
To expand the set of custom types, where we usually don't need to
encode vl setting at same time but rely on existing vl setting (as
with vsetvli x0, x0, imm encoding), we could add a new vsetvlx
encoding based on vsetvl but with bit 30 set.
31 | 30    25 | 24   20 | 19   15 | 14 12 | 11    7 | 6     0 |
 0 |     zimm[10:0]     |   rs1   | 1 1 1 |   rd    | 1010111 | vsetvli
 1 |  000000  |   rs2   |   rs1   | 1 1 1 |   rd    | 1010111 | vsetvl
 1 |  1xxxxx  |  xxxxx  |  xxxxx  | 1 1 1 |  xxxxx  | 1010111 | vsetvlx
This gives 20 bits to change vtype with new vl setting based on
existing vl (or where there may be some fixed known relationships
based on the special data types or shape, where software doesn't need
to see new vl value returned in rd but knows how to count based on other vl
values). Note that existing immediate fields in vsetvli can live in
same bit positions in this instruction.
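As a rough illustration of how a decoder could tell the three formats apart, the sketch below classifies a 32-bit instruction word using only the bit positions shown in the table. vsetvlx is the format proposed in this message, not a ratified instruction, and the function names and the "reserved" result are assumptions of the sketch.

```python
OPCODE_V = 0b1010111   # bits 6:0, shared by all three encodings
FUNCT3_CFG = 0b111     # bits 14:12 for the configuration instructions


def classify(insn):
    """Name the configuration format used by a 32-bit instruction word."""
    if (insn & 0x7F) != OPCODE_V or ((insn >> 12) & 0b111) != FUNCT3_CFG:
        return "other"
    if ((insn >> 31) & 1) == 0:
        return "vsetvli"               # bit 31 clear: zimm[10:0] in 30:20
    if ((insn >> 25) & 0x3F) == 0:
        return "vsetvl"                # bit 31 set, bits 30:25 all zero
    if ((insn >> 30) & 1) == 1:
        return "vsetvlx"               # bits 31:30 = 11, proposed format
    return "reserved"                  # assumption: not covered by the table


def vsetvlx_payload(insn):
    """Pack the 20 free bits (29:15 and 11:7) into one integer."""
    high = (insn >> 15) & 0x7FFF       # bits 29:15, 15 bits
    low = (insn >> 7) & 0x1F           # bits 11:7, 5 bits
    return (high << 5) | low


base = (FUNCT3_CFG << 12) | OPCODE_V
print(classify(base))                  # prints "vsetvli"
print(classify((0b11 << 30) | base))   # prints "vsetvlx"
```

The payload width matches the 20 bits quoted in the text: 15 bits from positions 29:15 plus 5 bits from the rd field.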
On Wed, 1 Jul 2020 03:02:18 -0400, DSHORNER <ds2horner@...> said:
| I think I understand how I confused the situation.
| Issue #458 introduced idea of using rd and rs1 values to encode more
| bits for vsetvli.
| I proposed that this become the only vsetvli format.
| Krste countered that the current format could be expanded later if
| needed to adopt the new format as long as a field encoding was otherwise
| unused.
| I agreed that this was technically possible. But I did not raise a
| concern that this would have potential negative consequences.
| In the meantime, I opened #460, which in addition to the rd and rs1
| encoding, avoided using a bit within vtype to allow for vl calculation
| based on lmul of 3,5,6 or 7.
| In my mind, #460 raised all the concerns and considerations present in #458.
| Further, it provided additional support for the rd/rs1 format by using
| the novel encoding in a unique way.
| As a result I closed #458 to have all the relevant discussion tracked on
| #460.
| It could, however, easily have been inferred that I closed #458 because
| the "escape mechanism" was perfect.
| The closing comments in #458 however explicitly recommend the concern be
| revisited as V1.0 approaches.
|| As we approach v1.0 we should evaluate if we will inevitably exceed 10
|| bits encoding using only immediate bits, and if so reconsider:
|| a) whether the recovered 6 bits of rs1 and rd encoding will be
|| sufficient for the life time of ILEN=32
|| b) we want to support two distinct and competing encodings (for
|| perhaps overlapping settings)
|| c) if the expanded format will in effect supersede the original
|| encoding, and thus result in dead weight of a low use format to be
|| supported in perpetuity.
|| Note, If we defer until we only have one bit available we will use
|| that one bit for selecting a mode that we could have chosen from the
|| beginning without that loss of bit. This will also weigh into the
|| considerations above.
| If the "resolve for v1.0" label had been available then I likely would
| have suggested it for #458 and definitely for #460.
| The intro in #460 also implies the need to give early consideration to
| this format:
|| There are limited immediate bits in the vsetvli instruction.
|| Early use of bits will become entrenched in the design. Misuse cannot
|| be corrected later.
|| Extensions to RVV will undoubtedly wish a single mechanism to both set
|| vtype and establish the appropriate vl. Extensions such as complex and
|| quaternion numbers affect vl calculation, and can leverage all of
|| integer, fixed-point and especially float data formats for
|| add/subtract, multiply, divide (reciprocals) and share conjugate and
|| There may be many other such data-types.
|| It is fully possible that the proliferation of modes and datatypes
|| will exhaust the currently remaining 3 bits. See**
|| SEW and LMUL are essential opcode modifiers. However, together they
|| use 8 [incorrect it is 6] of the 11 available immediate bits in
|| vsetvl, even though a dense encoding is used. This is undesirable.
|| Finalizing this encoding will entrench other bits in the instruction
|| making them unavailable for future use via innovative encoding.
|| An alternate encoding specifically for LMUL is here presented.
|| (Whereas SEW could be similarly encoded, LMUL is proposed as it
|| appears the most constrained. See*** )
| On 2020-06-30 11:12 a.m., Krste Asanovic wrote:
|| For 1.0, we are just trying to fix vsew, vlmul, vma, and vta (and also
|| vill in vtype, but that’s out of vsetvli immediate range).
|| I think it’s clear that vma and vta are not going to change very often
|| in many code sequences,
| If this is indeed true, then this makes the fields candidates for vtype
| fields that are only set by vsetvl (those in range [XLEN-2:11])
|| and agnostic provides significant PPA benefit for renamed register
|| machines, especially with long vectors.
| I agree they likely have merit, I advocated for their inclusion in
| vtype, and in vsetvli.
|| I can’t see what you are trying to propose that would affect the 1.0
|| spec?
| I am proposing that we seriously consider the consequences of providing
| a vsetvli instruction that has so limited an immediate field.
| There are alternatives; #458 and #460 are two that increase
| functionality (complete lmul range) and immediate bit encoding (by up to
| 6 bits).
| Using vlmul = "100" for vsetvl opcode decoding rather than the immediate
| sign bit [ bit 31] is another low cost approach that recovers a bit.
| And of course there are other alternatives.
|| Are you saying that vsew, vlmul, vma, vta should not be in the vsetvli
|| immediate space?
| As reasoned above, vma and vta are candidates to be removed.
| Conversely, vsew and vlmul are prime candidates for inclusion in the
| vsetvli immediate space because:
| they are essential to the "set vl" function, and
| they are common modifiers to base operations (as in the expected
| 64bit op-code space) and
| they are often used in conjunction with one another and
| many code examples show sew/lmul variation within typical loops.
| This is another aspect that needs to form part of the reasoning about
| the sufficiency of vsetvli immediate space:
| Pressure on the immediate form of the instruction would be drastically
| diminished if
| only those fields that definitively provide an appreciable benefit
| to code efficiency are included.
| (in particular, if the field can be hoisted from the loop it is
| not a good candidate).
| To me, the combination of removing vma and vta from the immediate and
| using lmul="100" for vsetvl encoding
| removes sufficient pressure that
| an immediate bit could be used to expand lmul to 3,5,6 and 7 and
| still provide for judicious inclusion of warranted future immediates
| for years, without invoking the rd/rs1 encoding.
| However, switching to rd/rs1 encoding does provide a substantial margin
| for error and neatly addresses the lmul=3,5,6 and 7 concern.
On 2020-07-01 4:33 a.m., krste@... wrote:
No problem, happy you did.
The very special nature and functionality of vsetvli justifies considering special encoding.
All of the following are valid considerations, none of which have been quantified.
I see no impact to the ABI itself.
The ABI puts usage pressure and constraints on some registers,
however each encoding allows for 4 registers
of which at least 2 do not uniquely participate in the ABI.
(e.g. are user saved, and many are defined as temp).
An assembler macro could readily generate candidates and choose a default.
That therefore is not overly onerous for assembler programmers.
Tracking the used register is more challenging.
But a symbolic register from a reserved set of temps would be one way of diminishing the intellectual challenge.
Assembler programmers mastered 8086 and successors, so this is trivial in comparison.
As for compilers, the cost is much less than the perverse manipulations needed to continue to support the irregular register behaviours of 8086, 286 and 386.
The application programmers using the compilers will be isolated from the impacts regarding code generation.
Application debugging is not impacted as the decision to allocate the registers used is made statically at compile time.
This is the most challenging to quantify. How far is far enough? What latent demand have we missed or misjudged?
I agree there are many other ways.
They should also be quantified and compared to determine the appropriate approach.
Specifically you champion one below, that with tweaks looks very promising.
A very important factor, I agree.
Many technically superior approaches have been abandoned for this reason.
I want to ensure we make an informed decision on the best long term approaches, whether to abandon for short term advantage or not.
I concur and have been an advocate of their inclusion as a feature and in vsetvli.
I also agree.
I also believe the frequency of use for work loads in specific domains justifies inclusion in an immediate format.
However, I cannot quantify this, nor assess the tipping point at which inclusion into vtype vs vcsr applies.
I anticipate the vma/vta are not directly coupled to vsew/vlmul.
Indeed, I anticipate that true need for tail unchanged is rare.
At the other end of the spectrum, the overhead to toggle tail.unchanged/agnostic for HPC performance gain is sufficiently low that even including the recalculation of vl it is warranted.
There is a continuum through which there are tradeoffs in changing vsew/vlmul and leaving vta/vma unchanged.
There appear to be performance benefits to switching vta and vma as immediates in an instruction that does not update vl.
Immediates provide the microarch benefit of instruction decode look ahead (over in-register value vsetvl).
These, too, need consideration.
For me the most compelling reason for vma/vta inclusion in V1.0 is early visibility, acknowledgement and acceptance of the feature.
For that same reason I leaned towards including them in the immediate of vsetvli.
Now that the feature is already noted and accepted within the eco system, I lean towards reserving vsetvli immediates for those vtype elements that change vl.
A tradeoff and thus no hard numbers for answers.
In other words, YMMV. It depends upon the specific algorithm.
My expectation is that most algorithms
( even after maturity, as we can expect initial use to be simplistic)
will be locally simple if not trivial; which will lend themselves to very long vl optimization.
My guess is LMUL of 6 is likely an important sweet spot for complex processes.
Not really, although LMUL of 6 pairs with 3, and provides optimal correlation,
using the technique on AVL below,
LMUL of 5 can partially use LMUL of 3 with an overall space efficiency of 7/8ths, and
LMUL of 7 use LMUL 4 with efficiency of 10/11ths.
Even as second class citizens they can make significant contributions to the eco-system.
Agreed, until you encounter the corner cases,
especially changing from effective LMUL of 6 to 3,
again, what I'd consider the sweet spot for the feature.
How useful LMUL=3,5,6 and 7 will be is hard to assess.
As it can be emulated, ( in many use cases conveniently and
effectively) in the current LMUL = 2**n model, it is therefore not,
on its own, a compelling reason to adopt rd/rs1 encoding.
However, the issue is whether collectively the reasons justify preemptive inclusion
or at minimum, positioning.
As per usual, you have presented a masterful solution.
This is essentially the non-transient variation of #423
I'd suggest that
1) vsettyi (set vtype imm) as its name
2) only 15 bits be provided with 29 through 25 encoded as zeros, the other values reserved
(see #456 transient solution)
(we can transparently expand through bits 25 to 29 if that proves to be the best use of the bits)
3) vta and vma be moved to this instruction rather than remain in vsetvli.
(I suspect a variant of ediv will modify vl and thus be most appropriate for vsetvli.)
With this in place, rd/rs1 encoding of immediates is much less compelling (almost dead).