On 2021-11-16 12:15 p.m., Bill Huffman
On Nov 16, 2021, at 17:31, Bill Huffman <huffman@...>
On Behalf Of Krste Asanovic
Sent: Tuesday, November 16, 2021 11:13 AM
To: ghost <ghost@...>
Subject: Re: [RISC-V] [tech-vector-ext] RISC-V Vector Extension post-public review updates
On Tue, 16 Nov 2021 07:36:40 -0800 (PST), "ghost" <ghost@...>
1) Mandate all implementations raise an illegal exception
case. This is my preferred route, as this would be a
existing implementations (doesn't affect software),
and we would
reuse this state/encoding for other purposes.
2) Allow either correct execution or illegal exception
3) Consider "reserved", implying implementations that
support it are non-conforming unless we later go with 2).
assuming we're going to push to ratify 1) unless I
agree that #1 is the least unfortunate of the
alternatives, but I
to raise a flag because I think there are larger
AFAIK, the vector extensions are unique among proposed
extensions in their extensive functional dependency on
other than the instruction.
Yes, absolutely. Many vector models historically have been
co-processors with their own internal status.
RVV integration is also a major accomplishment.
task group had a strong consensus
I was a part of that. However, a consensus within a TG does not
make a justification nor provide a rationale.
The ARC has been tasked with that kind of architectural decision,
and to date they have been silent.
We can infer that silence from the ARC is consent. [A motivation
for me to speak up.]
retaining a 32-bit encoding for the vector extension,
which led to the separate control state.
desire to stick with 32-bit encoding was not only to
avoid adding a new instruction length,
Not that we should minimize the impact from a new instruction
length to additional ratification issues, tool chain, alignment
issues and parceling,
not to mention decode complexities/cost about which some on ARC
also to reduce static and dynamic code size.
Agreed. >32-bit instructions come with a substantial cost.
Usage patterns are paramount to making this decision.
The current understanding is that typical target applications
will readily amortize vtype settings over multiple operations.
Explicitly providing element-width information in the load/store
instructions reduces the number of vtype transitions in many use cases.
It should be noted that fixed-instruction-width RISC
vector architectures (ARM SVE2, IBM VMX) have had to
adopt a prefix model to accommodate vector encodings,
with similar concerns about intermediate control state
The TG has considered "transient" config settings in vtype to
eliminate the need to explicitly flip-flop between vtype states.
It remains a post v1.0 "feature", with the design retaining vtype
as the sole state location for its information.
(variable-length ISAs just have very long vector
Yet RISC-V ostensibly has variable-length encoding.
obvious bias, I believe the RISC-V solution is cleaner
than these others in this regard.
As do I, especially in encapsulating most persistent control [vs.
data] information in vtype.
Where the design can be faulted is in not saving vcsr in vtype to
minimize context-switch concerns.
vstart is essentially transient information that well-behaved
applications should ignore.
However, a common opportunity to context switch is when waiting
for a resource, and so vstart would be part of the context-switch information.
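
To make the vtype discussion concrete, here is a minimal sketch of how the
vtype fields could be unpacked. It assumes the field layout of the ratified
v1.0 specification (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in
bit 7, vill in the top bit). The function name decode_vtype and the dictionary
it returns are illustrative only and are not part of any specification or
toolchain.

```python
from fractions import Fraction

# vlmul encoding -> LMUL as laid out in RVV v1.0; 0b100 is reserved.
_LMUL = {
    0b000: Fraction(1), 0b001: Fraction(2),
    0b010: Fraction(4), 0b011: Fraction(8),
    0b101: Fraction(1, 8), 0b110: Fraction(1, 4), 0b111: Fraction(1, 2),
}

def decode_vtype(vtype, xlen=64):
    """Unpack a vtype value into its fields (hypothetical helper)."""
    vill = (vtype >> (xlen - 1)) & 1   # illegal-configuration flag
    vlmul = vtype & 0b111              # register grouping
    vsew = (vtype >> 3) & 0b111        # selected element width
    vta = (vtype >> 6) & 1             # tail agnostic
    vma = (vtype >> 7) & 1             # mask agnostic
    # Reserved vsew/vlmul encodings are treated here like vill.
    if vill or vsew > 0b011 or vlmul == 0b100:
        return {"vill": True}
    return {
        "vill": False,
        "sew": 8 << vsew,              # 8, 16, 32 or 64 bits
        "lmul": _LMUL[vlmul],
        "vta": bool(vta),
        "vma": bool(vma),
    }
```

For instance, decode_vtype(0xD1) describes SEW=32, LMUL=2 with both agnostic
bits set. Everything a vector instruction needs beyond its own opcode, apart
from vl, vstart and vcsr, lives in this one word.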
Avoiding this kind of dependency seems to have been an
important goal (one of many, of course) in previous
example, including a rounding mode in every floating
instruction, even the FMA group, multiplied the number
of code points
these instructions by 8, even though it is not clear
(at least to
how important the use cases are. (IMO this might tend
ds2horner's proposal to use 48- or 64-bit instructions
for some of the
vector capability, but that is off topic for the
I am obviously making this concern a new thread.
Basically, I am hoping these points will be the salient ones for
a response to the Public Review question I raised.
I can see a counter-argument that using machine state
pipelining setup that might depend on that state.)
longer 64-bit encoding was always planned for the
vector extension as it is clear that the set of
desired instruction types could not fit in 32 bits.
vtype is extensible, another of the reasons that this design is
For example, a data-type override could substitute complex-float
operations for the relevant integer ops, allowing complex float and
real float to coexist through a section of code.
main simplification from using the separate control
state was in avoiding the longer instruction width,
not in pipelining, which it actually complicates.
the concern might be unprivileged instructions
depending on unprivileged state, which is much less
common. I think the vector situation is different
than, for example, round mode. The difference for
vectors is that the added state is used for every
vector instruction. It’s part of executing vectors
that the state is set. A restart point is required to
have strided or indexed memory operations and an MMU.
A length is required if we wish to avoid special code
to handle vector lengths that are not a multiple of
the hardware lengths. We can’t avoid some of this
state even with 48-/64-bit instructions. We would
probably avoid SEW and LMUL with longer vector
instructions, but since length has to be set for all
vector instructions in some way, setting SEW and LMUL
isn’t as big an issue as setting round mode for
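
The point about length can be illustrated with a small strip-mining model.
This is only a sketch under stated assumptions: vsetvl and vector_add are
hypothetical Python stand-ins, not real intrinsics; VLEN is assumed to be 128
bits; and vl is simplified to min(AVL, VLMAX), although the specification
permits other choices when AVL falls between VLMAX and 2*VLMAX.

```python
def vsetvl(avl, vlmax):
    """Simplified model of vsetvl: grant at most VLMAX elements."""
    return min(avl, vlmax)

def vector_add(a, b, vlen_bits=128, sew=32, lmul=1):
    """Strip-mined element-wise add; returns (result, vl used per pass)."""
    vlmax = vlen_bits // sew * lmul    # elements per register group
    out, chunks = [], []
    i, n = 0, len(a)
    while i < n:
        vl = vsetvl(n - i, vlmax)      # the last pass simply gets a shorter vl
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        chunks.append(vl)
        i += vl
    return out, chunks
```

With ten elements and VLMAX of 4 the loop runs with vl of 4, 4 and 2, so the
same loop body handles the remainder without separate tail code. SEW and LMUL
are fixed once outside the element loop, which is the amortization argument
made above.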
What is the
thinking for when we go to >32-bit encodings with
respect to vtype and masks? I assume that the longer
encoding could encode SEW (and LMUL?) as an override of
vtype. What about masks though? If we enable more than one
mask (m0…mN) in 48-bit/64-bit encodings, and we want to
mix 32-bit and 48-bit/64-bit instructions in the same
code, do we still specify that e.g. m0==v0 or do we need
to explicitly copy v0 to e.g. m0 before it can be used
with 48-bit/64-bit instructions (and vice versa when
switching from 48-bit/64-bit instructions to 32-bit
The salient point of coexistence is probably why we will expand
within the 32-bit opcode space for the foreseeable future.
It would be
nice if we could reclaim v0 (actually v0 through v7 for
LMUL=8) from being a mask to being able to hold data,
The mask designation could be in vtype while still using 32-bit
*and* not to
have to force the whole code/loop body to use
48-bit/64-bit instructions in order to do this.
I don’t think there’s any agreement at
this point on what goes into a longer instruction, but there
are a number of candidates, including at least:
- VMA and
specifier for the mask register
registers – perhaps 128 instead of 32
Additional register designations (64 or 128) are the most likely
motivator for >32-bit instructions.
However, I can imagine a windowing mode in which unaligned
register groups at LMUL>1 map above the base 32 registers.
Even without modifying vtype this is possible, and with vtype
more complex windowing is possible.
- Possibly a
fourth register specifier (not counting mask).
If I’m counting correctly, that’s already
28 additional bits. That’s in the range of the maximum that
can be put into a 64-bit instruction set. There are
probably more candidates and discussion about which ones to
include will certainly be needed. 😊
For me, the most compelling justification for using 32-bit opcodes
is the intentional design to provide vector functionality to
The design is not just for supercomputers; the vision is
that such an integrated vector feature can be used to
auto-vectorize standard code logic.
To be amenable to the lowest of the low.
It is for this accomplishment above all others that I am most
grateful to the TG.
Thank you all.