[tech-vector-ext] Some proposals


Guy Lemieux
 

On Sun, Mar 8, 2020 at 1:42 AM Krste Asanovic <krste@...> wrote:


It doesn't look like all these issues were addressed either on mailing
list or on github tracker. In general, it is much better to split
feedback into individual topics. I'm responding all-in-one below, but
if you want to continue any of these threads, please add to github
tracker as independent issues.

Several of these issues might be addressed with anticipated longer
64-bit vector instructions (ILEN=64).

On Fri, 22 Nov 2019 16:17:33 +0100, Dr. Adrián Cristal <adrian.cristal@...> said:
| Dear all,
| We have been involved in RTL design of an accelerator featuring the 0.7.1 standard; in parallel, we were running some vectorized HPC kernels on
| a simulator platform which mimics the accelerator. We have come across the following:

| 1. With respect to new proposed spill instructions, the cost could be high in case we need to spill few elements of a large vector. Instead we
| propose the following: spill v1, rs1, [rs2], recoverspill v2,rs2 and unspill rs1.

| The semantic is the following: spill v1, rs1, [rs2], will store v1 in the address rs1 up to rs2 elements. There is not a warranty that the
| values are stored, but there is the warranty that if they are not stored they will be completely recovered by the recoverspill v2, rs2 (
| otherwise recoverspill v2,rs2 will read the values from memory if they at some time were written to memory). The unspill rs1, will disable the
| capability to recoverspill operation at address rs1. In the case of OoO processor, for spill it can delay the free of the vector physical
| register on spill and assign to the logical register again on recoverspill. So the cost will be much less, but it can be saved if the processor
| needs more physical registers. For in-order implementations, it will save the registers on memory.
On a context switch, the underlying physical registers that hold spill
values need to be saved/restored to memory, as well as the associated
rs1 values. This means we need extra instructions to iterate through
the physical registers, and their associated rs1 values, to also save
as part of the context switch. Not impossible, but it is more complex
than just adding the 3 proposed instructions. Instead, you could just
force the spill to memory of all tracked registers, but then you run
into the delayed memory page faults etc brought up by Krste.

Instead, at the system level you could have a tightly-coupled memory
(TCM) as an addressable scratchpad where vector registers get written
during a spill? (This TCM could be used for any purpose.)


| 4- If a register is defined with a VL, then the values after VL (between VL+1 to VLMAX) could be set to “undefined” or 0, it would benefit
| implementations with longer vectors considerably in an OoO processor. The current text states that the previous values should be preserved.

Ongoing discussion.

| 5- Proposal to add some semantic information to the vectors. In particular it is important to know the VL and the SEW at the time vector was
| created, so when the context is saved we can have a special load/store operation that will only use the minimal storage that is needed in
| memory (plus the length and sew). Also this allows the system to know that after the VL, we do not have to preserve any element.
Except that all values after VL do have to be stored on a context
switch. Just because a VL was used to modify head elements of a
vector, it doesn't mean the tail elements can be ignored (under the
current tail-undisturbed model).
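
To make the tail-undisturbed point concrete, here is a minimal scalar sketch in Python (not RVV code). The names `VLMAX` and `vadd_tail_undisturbed` and the register contents are illustrative assumptions, not anything from the spec text:

```python
# Scalar model of a vector add under the tail-undisturbed policy.
# VLMAX and the register contents below are illustrative assumptions.
VLMAX = 8

def vadd_tail_undisturbed(vd, vs1, vs2, vl):
    """Write vs1[i] + vs2[i] into elements 0..vl-1 of vd; keep the tail."""
    out = list(vd)              # tail elements keep the old vd values
    for i in range(vl):
        out[i] = vs1[i] + vs2[i]
    return out

old_vd = [100, 101, 102, 103, 104, 105, 106, 107]
a = [1] * VLMAX
b = [2] * VLMAX
new_vd = vadd_tail_undisturbed(old_vd, a, b, vl=3)
# Elements 3..7 still hold live data from before the add, so a context
# switch that saved only the first vl elements would lose them.
```

In this model the last five elements of `new_vd` are the old values, which is why recording VL alone does not tell the save routine how much of the register is still live.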

Guy


Krste Asanovic
 

It doesn't look like all these issues were addressed either on mailing
list or on github tracker. In general, it is much better to split
feedback into individual topics. I'm responding all-in-one below, but
if you want to continue any of these threads, please add to github
tracker as independent issues.

Several of these issues might be addressed with anticipated longer
64-bit vector instructions (ILEN=64).

On Fri, 22 Nov 2019 16:17:33 +0100, Dr. Adrián Cristal <adrian.cristal@...> said:
| Dear all,
| We have been involved in RTL design of an accelerator featuring the 0.7.1 standard; in parallel, we were running some vectorized HPC kernels on
| a simulator platform which mimics the accelerator. We have come across the following:

| 1. With respect to new proposed spill instructions, the cost could be high in case we need to spill few elements of a large vector.  Instead we
| propose the following: spill v1, rs1, [rs2], recoverspill v2,rs2 and unspill rs1.

| The semantic is the following: spill v1, rs1, [rs2], will store v1 in the address rs1 up to rs2 elements. There is not a warranty that the
| values are stored, but there is the warranty that if they are not stored they will be completely recovered by the recoverspill v2, rs2 (
| otherwise recoverspill v2,rs2 will read the values from memory if they at some time were written to memory). The unspill rs1, will disable the
| capability to recoverspill operation at address rs1. In the case of OoO processor, for spill it can delay the free of the vector physical
| register on spill and assign to the logical register again on recoverspill. So the cost will be much less, but it can be saved if the processor
| needs more physical registers. For in-order implementations, it will save the registers on memory.

This seems like a very difficult to implement optimization with
strange effects on memory model and page fault/trap handling if memory
operation is deferred to later in program execution. What happens on
context switches if physical register is pushed out in different
context?

Having more arch vector registers should reduce spill frequency, and
this is one part of 64-bit instruction proposal.

| 2. At this moment the mask is only for true values, so if we have an if then else structure first we have to do the if, and after we have to
| negate the mask (in our case an expensive operation, up to 32
| cycles) and, then execute the else.

This is functionality intended for 64-bit encoding. You should be
able to chain and/or fuse this mask negation and run in parallel in
mask unit with 32-bit encoding.
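
For reference, the if/then/else pattern under discussion can be sketched as a scalar Python model. The helper names `masked_op` and `mask_not` are hypothetical; in RVV the negation step would be a mask-register logical (for example vmnand.mm with both sources equal), and the example branch bodies are made up for illustration:

```python
def masked_op(vd, mask, fn, vs):
    """Apply fn to vs[i] where mask[i] is set; leave vd[i] unchanged elsewhere."""
    return [fn(x) if m else d for d, m, x in zip(vd, mask, vs)]

def mask_not(mask):
    """Model of negating the mask register between the two branches."""
    return [not m for m in mask]

x = [-2, 5, 0, -7, 3]
mask = [v < 0 for v in x]                    # if (x[i] < 0)
y = [0] * len(x)
y = masked_op(y, mask, lambda v: -v, x)      # then: y[i] = -x[i]
y = masked_op(y, mask_not(mask), lambda v: 2 * v, x)   # else: y[i] = 2 * x[i]
```

The negation is a separate step only because masked instructions execute where the mask bit is true; an encoding with a mask-polarity bit would let the else branch reuse `mask` directly.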

| It would be great to add a CSR which controls
| the functionality of the masked operations, it could be beneficial in many use cases. Also using the same mechanism, it may be possible to
| add other mask-related functionality that can simplify some algorithms. This functionality could be to set it to 0 or constant, or sign
| exchange, set positive or set negative. Another option that we can manage is to determine which bit is the mask bit of the register v0 (when
| LMUL=1), this can be convenient to set up it to sign bit instead of LSB, for example if we want to calculate the abs value or the RELU
| operation.

FP ABS (vfsgnjx.vv vd, vs2, vs2), and integer/FP RELU (vmax.vx vd, vs2,
x0, vfmax.vf vd, vs2, f0 with f0=0.0) are already directly supported.
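
A small Python model shows why these work. Sign-injection-XOR with both sources equal always clears the sign bit, and max against zero is RELU. The helpers below are illustrative, assume single-precision elements, and do not model the NaN handling of vfmax:

```python
import struct

SIGN = 0x80000000
MAG = 0x7FFFFFFF

def f32_bits(x):
    """Reinterpret a float as its IEEE 754 single-precision bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b):
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def fsgnjx(a, b):
    """Magnitude of a, with sign = sign(a) XOR sign(b)."""
    ab, bb = f32_bits(a), f32_bits(b)
    return bits_f32((ab & MAG) | ((ab ^ bb) & SIGN))

def vfabs(vs):
    """Model of sign-injection-XOR with vs as both sources: sign XOR sign = 0."""
    return [fsgnjx(x, x) for x in vs]

def vrelu(vs, zero=0):
    """Model of an element-wise max against a zero scalar operand."""
    return [max(x, zero) for x in vs]

abs_out = vfabs([-1.5, 2.0, -0.0])
relu_out = vrelu([-3, 0, 4])
```

Neither operation needs the mask register at all, which is the point: the sign bit is consumed directly by the arithmetic instruction.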

| 3- A vrscatter instruction could be useful (vdest[vsrc2[i]] = vsrc1[i]). The implementation cost could be less than vrgather since in the
| vrgather case it is needed to send the index and then the value, however for the vrscatter both index and the value could be sent together, for
| our implementation this is faster than vrgather.

I don't think vrscatter can replace vrgather, except for strict
permutes. Vrscatter I believe is strictly less powerful. The data
movement from a single vrscatter can be replaced by a single vrgather,
but the converse is not true.

The vrscatter instruction might be useful in some cases, but has
complex cases to consider in specification: a) multiple writes to same
element, ordered either by element ordering (which will be expensive
with multiple lanes), or by allowing unordered scatters; b) no writes
to some elements, independent of masking.
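
The asymmetry can be seen in a scalar Python sketch. `vrscatter` here is a hypothetical model of the proposed instruction using element-order writes (one of the two ordering options mentioned above), and the data values are made up:

```python
def vrgather(vs2, index):
    """vd[i] = vs2[index[i]]; out-of-range indices read as 0."""
    n = len(vs2)
    return [vs2[j] if 0 <= j < n else 0 for j in index]

def vrscatter(vd, vs1, index):
    """vd[index[i]] = vs1[i], applied in element order (an assumed rule)."""
    out = list(vd)
    for i, j in enumerate(index):
        if 0 <= j < len(out):
            out[j] = vs1[i]     # a later element overwrites an earlier one
    return out

src = [10, 11, 12, 13]

# Strict permutation: scatter by perm equals gather by the inverse of perm.
perm = [2, 0, 3, 1]
inv = [1, 3, 0, 2]
scattered = vrscatter([0] * 4, src, perm)
gathered = vrgather(src, inv)

# Broadcast: a single gather can replicate one element everywhere; a single
# scatter cannot, since each source element is written to at most one place.
bcast = vrgather(src, [0, 0, 0, 0])

# Colliding indices: the result depends on write order, and elements 1 and 3
# of the destination are never written at all.
collide = vrscatter([-1] * 4, src, [0, 0, 2, 2])
```

With unordered scatters, `collide[0]` could legally be either 10 or 11, which is the specification issue in case (a); the untouched -1 entries correspond to case (b).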

| 4- If a register is defined with a VL, then the values after VL (between VL+1 to VLMAX) could be set to “undefined” or 0, it would benefit
| implementations with longer vectors considerably in an OoO processor. The current text states that the previous values should be preserved.

Ongoing discussion.

| 5- Proposal to add some semantic information to the vectors. In particular it is important to know the VL and the SEW at the time vector was
| created, so when the context is saved we can have a special load/store operation that will only use the minimal storage that is needed in
| memory (plus the length and sew). Also this allows the system to know that after the VL, we do not have to preserve any element.

I had a brief discussion of some save/restore optimization techniques
in Section 5.1.5 of my thesis, and VS bits are at least a start there
to reduce amount of save/restore if not the size of the memory buffer
(though most OS will need memory space for full hart register context
anyway). Some form of microarchitectural tracking can be done,
observing undisturbed (VL=VLMAX) or agnostic (VL) options of valid
length.

I'd guess one could possibly have some custom visible state (visible where
privilege allows) recording active vector length in bytes to
save/restore. But for these techniques, hardware can't always do the
right thing on vector save/restore, so software needs to know that it's
worth interrogating the custom state to figure out what to do. But
this could penalize simple implementations where software is OK doing
naive save/restore and doesn't want to spend cycles checking.

Regards,
Krste


| Best regards, Adrian and Osman
