Re: Vector Task Group minutes 2020/12/04
I am totally in agreement with Krste. Adding the mask load/store is an improvement, but adding the new mask registers is too disruptive and increases area.
From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
Sent: Thursday, December 17, 2020 1:43 AM
To: Grant Martin <gmartin15@...>
Cc: Steven Wallach <steven.wallach@...>; Roger Espasa <roger.espasa@...>; Alex Solomatnikov <sols@...>; Bill Huffman <huffman@...>; Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/12/04
I'm not contemplating changing the mask design (yet again) at this point in the process. I don't see any great advantage to any of this last round of proposals, as they all have significant downsides for some part of the implementation space. The current design, like any real design, is not perfect, but does balance a lot of competing concerns coming from different design points.
@sols: The mask load instructions are being added to allow a microarchitecture to see all common mask writes, enabling complex microarchitectures to perform mask optimizations, in particular for wide datapaths and for renamed registers.
Without renaming, and without deep temporal registers, having v0 be the only mask source reduces the cost of the mask read port.
@swallach: The mask logical operations can be fused with masked operations in more complex machines to reduce the software cost of only allowing v0 to be the mask.
@sols,lidawei: Adding more dedicated mask register state increases cost/complexity for all machines. Long LMUL needs a lot of bits to hold the mask. Dropping longer LMUL would reduce efficiency of simple machines.
@roger: Using x registers for masks breaks vector-length agnostic goal and would limit LMUL.
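To put rough numbers on the mask-size and x-register points, here is a minimal Python sketch. It assumes one mask bit per element and VLMAX = LMUL * VLEN / SEW, as in the RVV 1.0 layout; the function names and the XLEN constant are illustrative only and do not come from this thread.

```python
from fractions import Fraction

XLEN = 64  # assumed scalar (x) register width in bits, i.e. RV64

def vlmax(vlen, sew, lmul):
    """Maximum number of elements in one register group."""
    return int(Fraction(lmul) * vlen / sew)

def mask_bits(vlen, sew, lmul):
    """Bits needed to mask one register group, at one bit per element."""
    return vlmax(vlen, sew, lmul)

def fits_in_x_register(vlen, sew, lmul):
    """True if a full mask for this configuration fits in one x register."""
    return mask_bits(vlen, sew, lmul) <= XLEN

vlen = 512
print(mask_bits(vlen, sew=8, lmul=8))           # 512, the worst case equals VLEN
print(mask_bits(vlen, sew=32, lmul=1))          # 16
print(fits_in_x_register(vlen, sew=8, lmul=8))  # False
```

Under these assumptions the worst-case mask (SEW=8, LMUL=8) is exactly VLEN bits, so a dedicated file of N mask registers sized for that case would add N * VLEN bits of state, while a 64-bit x register caps LMUL * VLEN / SEW at 64 elements.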
@lidawei: Fractional LMUL helps with the case where you want widening operations and lots of mask registers. If uarch utilization is low with lower LMUL, then one solution is to increase VLEN for the same physical datapath width.
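The VLEN suggestion can be illustrated with a deliberately simplified utilization model. DLEN (the physical datapath width in bits) and the assumption that each instruction occupies whole datapath passes are choices made for this sketch, not something stated in the thread.

```python
import math
from fractions import Fraction

def datapath_utilization(vlen, lmul, dlen):
    """Fraction of datapath bits doing useful work for one vector instruction.

    Simplified model: the instruction touches LMUL * VLEN bits and occupies
    a whole number of DLEN-bit datapath passes.
    """
    bits = Fraction(lmul) * vlen        # bits touched by one instruction
    passes = math.ceil(bits / dlen)     # whole datapath passes needed
    return bits / (passes * dlen)

half = Fraction(1, 2)
print(datapath_utilization(vlen=128, lmul=half, dlen=128))  # 1/2
print(datapath_utilization(vlen=256, lmul=half, dlen=128))  # 1
```

In this model, LMUL=1/2 with VLEN equal to the datapath width leaves half the datapath idle, while doubling VLEN on the same datapath fills it again.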
@swallach: ARM SVE uses predicates to implement vector length, so unsurprisingly ends up needing more mask resources. RVV vl can be considered additional mask that is AND-ed in with each mask.
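The view of vl as an extra mask can be sketched in a few lines of Python. This is an illustrative model only: the function names are hypothetical, and leaving inactive destination elements unchanged is an assumption made here to keep the example short.

```python
def effective_mask(mask_bits, vl):
    """Element i is active only if i < vl AND its mask bit is set."""
    return [bool(m) and i < vl for i, m in enumerate(mask_bits)]

def masked_add(dest, a, b, mask_bits, vl):
    """Add a and b where active; keep the old dest value elsewhere (assumed)."""
    active = effective_mask(mask_bits, vl)
    return [x + y if on else d for d, x, y, on in zip(dest, a, b, active)]

mask = [1, 1, 0, 1, 1, 1, 1, 1]
print(effective_mask(mask, vl=5))
# [True, True, False, True, True, False, False, False]
print(masked_add([0] * 8, list(range(8)), [10] * 8, mask, vl=5))
# [10, 11, 0, 13, 14, 0, 0, 0]
```

Because vl already trims the tail, the explicit mask only has to encode the data-dependent condition, whereas a predicate-only scheme must spend a predicate register on the loop bound as well.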
On Wed, 16 Dec 2020 14:08:22 -0800, Grant Martin <gmartin15@...> said:
| Having been a silent observer of this group for what seems like a
| very long time, but now recently liberated from previous constraints,
| I will observe that I have seen the use in DSPs of both dedicated mask register files and use of general vector type registers to serve this purpose.
| Along with operations for manipulating them.
| While there are pros and cons for both, I lean to the side of not
| having a special mask register file and special operations, but instead use existing resources and operations.
| However I have a process observation as well - it has taken RV Vector
| proposal a long time to converge to a near 1.0 specification. Would
| going down a different route cause enough delay and debate that it would derange the process and significantly delay the standardization that is desired? As opposed to more modest suggestions.
| Thanks and best regards
| Grant Martin
| Mobile +1.510.703.7470
| Home +1.925.846.8683
| Sent from my iPad
| On Dec 16, 2020, at 12:54 PM, swallach <steven.wallach@...> wrote:
| i guess i am looking at the wrong set of apps.
| in any case VM registers NOT in the vector registers permit robust and performance-optimized operations under mask.
| wrt extra instructions. i am neutral.
| On Dec 16, 2020, at 3:49 PM, Roger Espasa <roger.espasa@...> wrote:
| 8 mask registers are quite needed in modern outer-vectorized loops. Also in graphic shaders. I would say 16 is
| Now, and I am not defending this, if we had to go this route, I would seriously fight for masks-in-x-registers. I.e.:
| no new state, no new instructions. Only a few arch tricks to try to avoid loss of decoupling between vector unit and
| scalar unit. That’s better than a new set of registers and
| On Wed, 16 Dec 2020 at 21:34, swallach <steven.wallach@...> wrote:
| in my experience only one maybe two vm registers are
| nested loops under if statements is rare.
|| On Dec 16, 2020, at 3:29 PM, Bill Huffman <huffman@...> wrote:
|| I don’t think a separate mask register will do at all. It would take
|| a mask register file with at least 8 and
|| maybe 16 registers. Lots of compare results need to be kept and operations need to be done on mask registers. I
|| don't think we should have a separate mask register file.
|| -----Original Message-----
|| From: tech-vector-ext@...
|| <tech-vector-ext@...> On Behalf Of swallach
|| Sent: Wednesday, December 16, 2020 12:26 PM
|| To: Alex Solomatnikov <sols@...>
|| Cc: Krste Asanovic <krste@...>;
|| Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes
|| i totally agree. if this is done, then instructions like: count
|| bits, etc can directly apply to the mask
|| also, from a hardware implementation, the VM register can be implemented with LATCHES. this facilitates a
|| better implementation (imho) for operations under mask
|| and yes load and store VM are required
|| If separate loads and stores are introduced for mask, then separate
|| vmask register can be introduced to avoid
|| dual use of v0 (as a regular vector register and as a mask register) and its complications.