Forgive top posting, but the email was long and I want to bring key
points forward for others.
#458/460 propose restricting the allowed register numbers in a vsetvli to
give more bits to future vsetvli uses. I cannot see why we would do
this, for several reasons:
1) the impact on ABI and compilers and assembly programmers by restricting registers in this way
2) we are far from using all encoding bits in vsetvli yet
3) there are many other cleaner ways to increase configuration space when we need it
4) momentum, people have been working with current scheme and this is a major shift
The only two bits under debate are ta and ma.
I view ta/ma as a real and immediate need versus potential future
hypothetical needs. In some cases, these will be varied inside the inner
loop, e.g., when stripmining through reductions with a vector of
partial sums, or when performing various convolution operations that
edit vector registers. In other cases, they will be relatively
static, when agnostic is always fine.
For the specific use case given in #460, I view non-power-of-2 LMUL as
possibly interesting but not in the same importance category as supporting
renaming with mask agnostic. For example, how much is really gained
versus simply rounding down to next power-of-2 LMUL? The largest gain
is from 4 to 7, but at LMUL=4, any reasonable hardware should already
be very efficient and there will then be more registers available for
live values. There are also issues about widening/narrowing, etc.,
when EMUL is not a power-of-two, implying we need EMUL 2.5, 3.5 for
example, or don't support those operations for those LMULs so they're
2nd-class citizens. Also, most of the same effect can be had with an
additional Bitmanip "minu" instruction on AVL that's only needed once
per inner loop with some setup outside loop.
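As a rough illustration of the "minu" point, here is a small Python model (not RVV code) of the stripmining arithmetic. The helper names and the simplified rule vl = min(AVL, VLMAX) are assumptions made for the sketch; the cap is computed once outside the loop and the min is applied once per iteration before a vsetvli at LMUL=8.

```python
def vlmax(vlen_bits, sew_bits, lmul):
    """Elements per register group: VLEN * LMUL / SEW."""
    return vlen_bits * lmul // sew_bits


def vsetvli_vl(avl, vlen_bits, sew_bits, lmul):
    """Simplified model (assumption): vl = min(AVL, VLMAX)."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


def strip_lengths(n, vlen_bits, sew_bits, lmul_cap):
    """vl per iteration when emulating a non-power-of-2 LMUL on top of LMUL=8."""
    cap = vlmax(vlen_bits, sew_bits, lmul_cap)  # setup outside the loop
    lengths = []
    remaining = n
    while remaining > 0:
        avl = min(remaining, cap)  # the one extra "minu" per iteration
        vl = vsetvli_vl(avl, vlen_bits, sew_bits, 8)
        lengths.append(vl)
        remaining -= vl
    return lengths


if __name__ == "__main__":
    # VLEN=128, SEW=32: a cap of "LMUL=5" is 20 elements, while LMUL=8 allows 32.
    print(strip_lengths(50, 128, 32, 5))
```

Because the clamped AVL never exceeds VLMAX at LMUL=8, the vl returned equals the clamped value, which is the vl a hypothetical LMUL=5 setting would have produced.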
To expand the set of custom types, where we usually don't need to
encode vl setting at same time but rely on existing vl setting (as
with vsetvli x0, x0, imm encoding), we could add a new vsetvlx
encoding based on vsetvl but with bit 30 set.
 31 30         25 24      20 19      15 14   12 11      7 6     0
 0 |        zimm[10:0]      |    rs1   | 1 1 1 |    rd   |1010111| vsetvli
 1 |   000000    |   rs2    |    rs1   | 1 1 1 |    rd   |1010111| vsetvl
 1 |   1xxxxx    |  xxxxx   |   xxxxx  | 1 1 1 |  xxxxx  |1010111| vsetvlx
This gives 20 bits to change vtype with new vl setting based on
existing vl (or where there may be some fixed known relationships
based on the special data types or shape, where software doesn't need
to see new vl value returned in rd but knows how to count based on other vl
values). Note that existing immediate fields in vsetvli can live in
same bit positions in this instruction.
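To make the decode concrete, here is a minimal Python sketch of how the three forms could be told apart from bits 31 and 30, following the layout above. The function names are hypothetical, and the order in which the 20 free vsetvlx bits are packed is an assumption for illustration only; nothing here is a ratified encoding.

```python
OPCODE_V = 0b1010111   # bits 6:0
FUNCT3_CFG = 0b111     # bits 14:12


def classify(insn):
    """Classify a 32-bit word as vsetvli, vsetvl, the proposed vsetvlx, or other."""
    if (insn & 0x7F) != OPCODE_V or ((insn >> 12) & 0x7) != FUNCT3_CFG:
        return "other"
    if ((insn >> 31) & 1) == 0:
        return "vsetvli"
    if (insn >> 30) & 1:
        return "vsetvlx"            # bit 31 = 1, bit 30 = 1 (proposed)
    if ((insn >> 25) & 0x3F) == 0:
        return "vsetvl"             # bit 31 = 1, bits 30:25 = 000000
    return "reserved"


def vsetvlx_payload(insn):
    """Gather the 20 free bits: [29:15] (15 bits) and [11:7] (5 bits).

    The packing order is an assumption made for this sketch.
    """
    hi = (insn >> 15) & 0x7FFF
    lo = (insn >> 7) & 0x1F
    return (hi << 5) | lo


if __name__ == "__main__":
    base = (FUNCT3_CFG << 12) | OPCODE_V
    print(classify(base))
    print(classify((1 << 31) | base))
    print(classify((1 << 31) | (1 << 30) | base))
```

The free bits are 5 each in positions 29:25, 24:20, 19:15 and 11:7, which is where the count of 20 comes from.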
Krste
On Wed, 1 Jul 2020 03:02:18 -0400, DSHORNER <ds2horner@...> said:
| I think I understand how I confused the situation.
| Issue #458 introduced idea of using rd and rs1 values to encode more
| bits for vsetvli.
| I proposed that this become the only vsetvli format.
| Krste countered that the current format could be expanded later if
| needed to adopt the new format as long as a field encoding was otherwise
| unused.
| I agreed that this was technically possible. But I did not raise a
| concern that this would have potential negative consequences.
| In the meantime, I opened #460, which in addition to the rd and rs1
| encoding, avoided using a bit within vtype to allow for vl calculation
| based on lmul of 3,5,6 or 7.
| In my mind, #460 raised all the concerns and considerations present in #458.
| Further, it provided additional support for the rd/rs1 format by using
| the novel encoding in a unique way.
| As a result I closed #458 to have all the relevant discussion tracked on
| #460.
| It could, however, easily have been inferred that I closed #458 because
| the "escape mechanism" was perfect.
| The closing comments in #458 however explicitly recommend the concern be
| revisited as V1.0 approaches.
||
|| As we approach v1.0 we should evaluate if we will inevitably exceed 10
|| bits encoding using only immediate bits, and if so reconsider:
|| a) whether the recovered 6 bits of rs1 and rd encoding will be
|| sufficient for the life time of ILEN=32
|| b) we want to support two distinct and competing encodings (for
|| perhaps overlapping settings)
|| c) if the expanded format will in effect supersede the original
|| encoding, and thus result in dead weight of a low use format to be
|| supported in perpetuity.
||
|| Note, If we defer until we only have one bit available we will use
|| that one bit for selecting a mode that we could have chosen from the
|| beginning without that loss of bit. This will also weigh into the
|| considerations above.
||
| If the "resolve for v1.0" label had been available then I likely would
| have suggested it for #458 and definitely for #460.
| The intro in #460 also implies the need to give early consideration to
| this format:
|| A1.
|| There are limited immediate bits in the vsetvli instruction.
|| Early use of bits will become entrenched in the design. Misuse cannot
|| be corrected later.
|| Extensions to RVV will undoubtedly wish a single mechanism to both set
|| vtype and establish the appropriate vl. Extensions such as complex and
|| quaternion numbers affect vl calculation, and can leverage all of
|| integer, fixed-point and especially float data formats for
|| add/subtract, multiply, divide (reciprocals) and share conjugate and
|| norm.
|| There may be many other such data-types.
|| It is fully possible that the proliferation of modes and datatypes
|| will exhaust the currently remaining 3 bits. See**
||
|| A2.
|| SEW and LMUL are essential opcode modifiers. However, together they
|| use 8 [incorrect it is 6] of the 11 available immediate bits in
|| vsetvl, even though a dense encoding is used. This is undesirable.
|| Finalizing this encoding will entrench other bits in the instruction
|| making them unavailable for future use via innovative encoding.
||
|| A3.
|| An alternate encoding specifically for LMUL is here presented.
|| (Whereas SEW could be similarly encoded, LMUL is proposed as it
|| appears the most constrained. See*** )
||
| On 2020-06-30 11:12 a.m., Krste Asanovic wrote:
|| For 1.0, we are just trying to fix vsew, vlmul, vma, and vta (and also
|| vill in vtype, but that’s out of vsetvli immediate range).
||
|| I think it’s clear that vma and vta are not going to change very often
|| in many code sequences,
| If this is indeed true, then this makes the fields candidates for vtype
| fields that are only set by vsetvl (those in range [XLEN-2:11])
|| and agnostic provides significant PPA benefit for renamed register
|| machines, especially with long vectors.
| I agree they likely have merit, I advocated for their inclusion in
| vtype, and in vsetvli.
||
|| I can’t see what you are trying to propose that would affect the 1.0
|| spec?
| I am proposing that we seriously consider the consequences of providing
| a vsetvli instruction that has so limited an immediate field.
| There are alternatives; #458 and #460 are two that increase
| functionality (complete lmul range) and immediate bit encoding (by up to
| 6 bits).
| Using vlmul = "100" for vsetvl opcode decoding rather than the immediate
| sign bit [ bit 31] is another low cost approach that recovers a bit.
| And of course there are other alternatives.
|| Are you saying that vsew, vlmul, vma, vta should not be in the vsetvli
|| immediate space?
| As reasoned above, vma and vta are candidates to be removed.
| Conversely, vsew and vlmul are prime candidates for inclusion in the
| vsetvli immediate space because:
| they are essential to the "set vl" function, and
| they are common modifiers to base operations (as in the expected
| 64bit op-code space) and
| they are often used in conjunction with one another and
| many code examples show sew/lmul variation within typical loops.
| This is another aspect that needs to form part of the reasoning about
| the sufficiency of vsetvli immediate space:
| Pressure on the immediate form of the instruction would be drastically
| diminished if
| only those fields that definitively provide an appreciable benefit
| to code efficiency are included.
| (in particular, if the field can be hoisted from the loop it is
| not a good candidate).
| To me, the combination of removing vma and vta from the immediate and
| using lmul="100" for vsetvl encoding
| removes sufficient pressure that
| an immediate bit could be used to expand lmul to 3,5,6 and 7 and
| still provide for judicious inclusion of warranted future immediates
| for years, without invoking the rd/rs1 encoding.
| However, switching to rd/rs1 encoding does provide a substantial margin
| for error and neatly addresses the lmul=3,5,6 and 7 concern.
||
|| Krste
||