RISC-V Vector Extension post-public review updates - 32bit opcode decision

David Horner

On 2021-11-16 12:15 p.m., Bill Huffman wrote:



From: Grigorios Magklis <grigorios.magklis@...>
Sent: Tuesday, November 16, 2021 12:03 PM
To: Bill Huffman <huffman@...>; Krste Asanovic <krste@...>; ghost <ghost@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] RISC-V Vector Extension post-public review updates




On Nov 16, 2021, at 17:31, Bill Huffman <huffman@...> wrote:




-----Original Message-----
From: tech-vector-ext@... <tech-vector-ext@...> On Behalf Of Krste Asanovic
Sent: Tuesday, November 16, 2021 11:13 AM
To: ghost <ghost@...>
Cc: krste@...; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] RISC-V Vector Extension post-public review updates





>>>>> On Tue, 16 Nov 2021 07:36:40 -0800 (PST), "ghost" <ghost@...> said:


|| 1) Mandate all implementations raise an illegal exception in this

|| case.  This is my preferred route, as this would be a minor errata

|| for existing implementations (doesn't affect software), and we would

|| not reuse this state/encoding for other purposes.


|| 2) Allow either correct execution or illegal exception (as with

|| misaligned).


|| 3) Consider "reserved", implying implementations that support it are

|| non-conforming unless we later go with 2).


|| I'm assuming we're going to push to ratify 1) unless I hear strong

|| objections.


| I agree that #1 is the least unfortunate of the alternatives, but I

| want to raise a flag because I think there are larger considerations.


| AFAIK, the vector extensions are unique among proposed non-privileged

| extensions in their extensive functional dependency on machine state

| other than the instruction.

Yes, absolutely. Many vector models historically have been co-processors with their own internal status.

RVV integration is also a major accomplishment.


The task group had a strong consensus

I was part of that. However, a consensus within a TG is neither a justification nor a rationale.

The ARC has been tasked with that kind of architectural decision, and to date they have been silent.

We can infer that silence from the ARC is consent. [A motivation for me to speak up.]

in retaining a 32-bit encoding for the vector extension, which led to the separate control state.

The desire to stick with 32-bit encoding was not only to avoid adding a new instruction length,

Not that we should minimize the impact of a new instruction length: additional ratification issues, toolchain changes, alignment issues and parceling,

not to mention decode complexity/cost, about which some on the ARC are hyperventilating.

but also to reduce static and dynamic code size.

Agreed. Instructions longer than 32 bits come with a substantial cost. Usage patterns are paramount in making this decision.

The current understanding is that typical target applications will readily amortize vtype settings over multiple operations.

Explicitly providing element length information in the load/store reduces the number of vtype transitions in many use cases.

It should be noted that fixed-instruction-width RISC vector architectures (ARM SVE2, IBM VMX) have had to adopt a prefix model to accommodate vector encodings, with similar concerns about intermediate control state

The TG has considered "transient" config settings in vtype to eliminate the need to explicitly flip-flop between vtype states.

It remains a post-v1.0 "feature", with the design retaining vtype as the sole state location for its information.

(variable-length ISAs just have very long vector instruction encoding).

Yet RISC-V ostensibly has variable-length encoding.

With obvious bias, I believe the RISC-V solution is cleaner than these others in this regard.

As do I, especially in encapsulating most persistent control [vs. data] information in vtype.

Where the design can be faulted is in not saving vcsr in vtype to minimize context-switch concerns.

vstart is essentially transient information that well-behaved applications should ignore.

However, a common opportunity to context switch is when waiting for a resource, so vstart must still be part of the context-switch information.
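For readers less familiar with what vtype actually holds, here is a minimal Python sketch of the control fields discussed above. The bit positions (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7) are taken from the ratified v1.0 layout; the function names `decode_vtype` and `vlmax` are hypothetical and only for illustration, and the vill bit is not modeled.

```python
from fractions import Fraction


def decode_vtype(vtype: int) -> dict:
    """Split the low 8 bits of a vtype value into its v1.0 fields (sketch)."""
    vlmul = vtype & 0b111          # bits 2:0, signed power-of-two exponent
    vsew = (vtype >> 3) & 0b111    # bits 5:3, SEW = 8 << vsew
    if vlmul == 0b100 or vsew > 0b011:
        raise ValueError("reserved vlmul/vsew encoding")
    # 0..3 mean LMUL = 1, 2, 4, 8; 5..7 mean LMUL = 1/8, 1/4, 1/2
    exponent = vlmul if vlmul < 4 else vlmul - 8
    return {
        "sew": 8 << vsew,
        "lmul": Fraction(2) ** exponent,
        "vta": bool((vtype >> 6) & 1),   # tail agnostic
        "vma": bool((vtype >> 7) & 1),   # mask agnostic
    }


def vlmax(vlen_bits: int, fields: dict) -> int:
    """VLMAX = LMUL * VLEN / SEW for the decoded fields."""
    return int(vlen_bits * fields["lmul"] / fields["sew"])
```

With an assumed VLEN of 128 bits, `vlmax(128, decode_vtype(0xD1))` corresponds to SEW=32, LMUL=2, i.e. 8 elements per register group. Note that vcsr (vxrm, vxsat) and vstart live outside this word, which is the point being made about context-switch state.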


| Avoiding this kind of dependency seems to have been a consistent and

| important goal (one of many, of course) in previous designs.

| For example, including a rounding mode in every floating point

| instruction, even the FMA group, multiplied the number of code points

| for these instructions by 8, even though it is not clear (at least to

| me) how important the use cases are.  (IMO this might tend to support

| ds2horner's proposal to use 48- or 64-bit instructions for some of the

| vector capability, but that is off topic for the present discussion;

I am obviously making this concern a new thread.

Basically, I am hoping these points will be the salient ones for a response to the Public Review question I raised.

| and I can see a counter-argument that using machine state simplifies

| pipelining setup that might depend on that state.)


A longer 64-bit encoding was always planned for the vector extension as it is clear that the set of desired instruction types could not fit in 32 bits.

vtype is extensible, another of the reasons that this design is superb.

For example, data-type overriding could substitute complex float for the relevant integer ops, allowing it and real float to coexist through a section of code.

The main simplification from using the separate control state was in avoiding the longer instruction width, not in pipelining, which it actually complicates.


I think the concern might be unprivileged instructions depending on unprivileged state, which is much less common.  I think the vector situation is different than, for example, round mode.  The difference for vectors is that the added state is used for every vector instruction.  It’s part of executing vectors that the state is set.  A restart point is required to have strided or indexed memory operations and an MMU.  A length is required if we wish to avoid special code to handle vector lengths that are not a multiple of the hardware lengths.  We can’t avoid some of this state even with 48-/64-bit instructions.  We would probably avoid SEW and LMUL with longer vector instructions, but since length has to be set for all vector instructions in some way, setting SEW and LMUL isn’t as big an issue as setting round mode for floating-point operations.
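As a rough illustration of why the length state removes the need for special tail code, here is a small Python model of a strip-mined loop. It is only a sketch: `vsetvl_model` and `strip_mine_add` are hypothetical names, VLEN=128 is an assumed hardware parameter, and `min(avl, vlmax)` is just the simplest vl choice the spec permits, not the only one.

```python
def vsetvl_model(avl: int, vlmax: int) -> int:
    """Model of the vl returned by a vsetvl-style instruction (simplest legal choice)."""
    return min(avl, vlmax)


def strip_mine_add(a, b, vlen_bits=128, sew=32, lmul=1):
    """Element-wise add of two equal-length lists, processed vl elements at a time."""
    vlmax = vlen_bits * lmul // sew
    out, vls = [], []
    i, n = 0, len(a)
    while i < n:
        # One configuration step per iteration; SEW and LMUL ride along with the length.
        vl = vsetvl_model(n - i, vlmax)
        vls.append(vl)
        # Every "vector instruction" in the body reuses the same vl and vtype.
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out, vls
```

For 10 elements with these assumed parameters the loop runs with vl values 4, 4, 2: the final short iteration is handled by the same code path, and the single configuration step is amortized over however many vector operations the loop body contains.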





What is the thinking for when we go to >32-bit encodings with respect to vtype and masks? I assume that the longer encoding could encode SEW (and LMUL?) as an override of vtype. What about masks though? If we enable more than one mask (m0…mN) in 48-bit/64-bit encodings, and we want to mix 32-bit and 48-bit/64-bit instructions in the same code, do we still specify that e.g. m0==v0 or do we need to explicitly copy v0 to e.g. m0 before it can be used with 48-bit/64-bit instructions (and vice versa when switching from 48-bit/64-bit instructions to 32-bit instructions)?

The salient point of coexistence is probably why we will expand within the 32-bit opcode space for the foreseeable future.

It would be nice if we could reclaim v0 (actually v0 through v7 for LMUL=8) from being a mask to being able to hold data,

The mask designation could be in vtype while still using 32-bit instruction encoding.


*and* not to have to force the whole code/loop body to use 48-bit/64-bit instructions in order to do this.




I don’t think there’s any agreement at this point on what goes into a longer instruction, but there are a number of candidates, including at least:

  • LMUL
  • SEW
  • VMA and VTA bits
  • Register specifier for the mask register
  • Additional registers – perhaps 128 instead of 32

Additional register designations (64 or 128) are the most likely motivator for >32-bit instructions.

However, I can imagine a windowing mode in which unaligned registers in different LMUL>1 groups map above the base 32 registers.

Even without modifying vtype this is possible, and with vtype complex windowing is possible.

  • Possibly a fourth register specifier (not counting mask).


If I’m counting correctly, that’s already 28 additional bits.  That’s in the range of the maximum that can be put into a 64-bit instruction set.  There are probably more candidates and discussion about which ones to include will certainly be needed. 😊
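The message does not give the breakdown behind the figure of 28, so the following Python tally is only one plausible reconstruction, under loudly stated assumptions: 128 vector registers (7-bit specifiers), three existing 5-bit register specifiers that each grow by 2 bits, and the existing vm bit not subtracted. The function name `extra_bits` is hypothetical.

```python
def extra_bits(num_regs: int = 128, base_regs: int = 32, existing_specifiers: int = 3) -> dict:
    """Tally the additional encoding bits for the candidate fields (assumed breakdown)."""
    wide = (num_regs - 1).bit_length()    # 7 bits to name one of 128 registers
    narrow = (base_regs - 1).bit_length() # 5 bits to name one of 32 registers
    return {
        "LMUL": 3,                         # same width as vtype.vlmul
        "SEW": 3,                          # same width as vtype.vsew
        "VMA and VTA": 2,                  # one bit each
        "mask register specifier": wide,
        "widen existing specifiers": existing_specifiers * (wide - narrow),
        "fourth register specifier": wide,
    }


if __name__ == "__main__":
    tally = extra_bits()
    for name, bits in tally.items():
        print(f"{name:28s} {bits:2d}")
    print(f"{'total':28s} {sum(tally.values()):2d}")
```

Under these assumptions the fields sum to 3 + 3 + 2 + 7 + 6 + 7 = 28, which matches the count quoted above; a different register count or a narrower mask field would change the total.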



For me, the most compelling justification for using 32-bit opcodes is the intentional design to provide vector functionality to minimal systems.

The design is not just for supercomputers; the vision is that such an integrated vector feature can be used to auto-vectorize standard code logic.

To be amenable to the lowest of the low.

It is for this accomplishment, above all others, that I am most grateful to the TG.

Thank you all.