Vector Task Group minutes 2020/12/04


Thang Tran
 

I am totally in agreement with Krste. Adding the mask load/store is an improvement, but adding new mask registers is too disruptive and increases area.
Thanks, Thang

-----Original Message-----
From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
Sent: Thursday, December 17, 2020 1:43 AM
To: Grant Martin <gmartin15@...>
Cc: Steven Wallach <steven.wallach@...>; Roger Espasa <roger.espasa@...>; Alex Solomatnikov <sols@...>; Bill Huffman <huffman@...>; Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/12/04


I'm not contemplating changing mask design (yet again) at this point in process. I don't see any great advantage to any of this last round of proposals, as they all have significant downsides for some part of implementation space. The current design, like any real design, is not perfect, but does balance a lot of competing concerns coming from different design points.

@sols: The mask load instructions are being added to allow a microarchitecture to see all common mask writes, enabling complex microarchitectures to perform mask optimizations. In particular, for wide datapaths and for renamed registers.

Without renaming, and without deep temporal registers, having v0 be only mask source reduces cost of mask read port.

@swallach: The mask logical operations can be fused with masked operations in more complex machines to reduce software cost of only allowing v0 be mask.

@sols,lidawei: Adding more dedicated mask register state increases cost/complexity for all machines. Long LMUL needs a lot of bits to hold mask. Dropping longer LMUL would reduce efficiency of simple machines.

@roger: Using x registers for masks breaks vector-length agnostic goal and would limit LMUL.

@lidawei: Fractional LMUL helps with case where you want widening operations and lots of mask registers. If uarch utilization is low with lower LMUL, then one solution is to increase VLEN for same physical datapath width.

@swallach: ARM SVE uses predicates to implement vector length, so unsurprisingly ends up needing more mask resources. RVV vl can be considered additional mask that is AND-ed in with each mask.

Krste



Krste Asanovic
 

I'm not contemplating changing mask design (yet again) at this point
in process. I don't see any great advantage to any of this last
round of proposals, as they all have significant downsides for some
part of implementation space. The current design, like any real
design, is not perfect, but does balance a lot of competing concerns
coming from different design points.

@sols: The mask load instructions are being added to allow a
microarchitecture to see all common mask writes, enabling complex
microarchitectures to perform mask optimizations. In particular, for
wide datapaths and for renamed registers.

Without renaming, and without deep temporal registers, having v0 be
only mask source reduces cost of mask read port.

@swallach: The mask logical operations can be fused with masked
operations in more complex machines to reduce software cost of only
allowing v0 be mask.

@sols,lidawei: Adding more dedicated mask register state increases
cost/complexity for all machines. Long LMUL needs a lot of bits to
hold mask. Dropping longer LMUL would reduce efficiency of simple
machines.

@roger: Using x registers for masks breaks vector-length agnostic goal and
would limit LMUL.

@lidawei: Fractional LMUL helps with case where you want widening
operations and lots of mask registers. If uarch utilization is low
with lower LMUL, then one solution is to increase VLEN for same
physical datapath width.

@swallach: ARM SVE uses predicates to implement vector length, so
unsurprisingly ends up needing more mask resources. RVV vl can be
considered additional mask that is AND-ed in with each mask.

Krste
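
The last point above, that RVV's vl can be viewed as one more mask AND-ed in with the explicit mask, can be sketched in a few lines. This is only an illustrative model in Python, not spec text: the names `vl_as_mask` and `active_elements` are hypothetical, and the values written to inactive or tail elements (agnostic vs. undisturbed policies) are deliberately not modelled.

```python
def vl_as_mask(vl, vlmax):
    """Mask equivalent to a vector length: element i is enabled iff i < vl."""
    return [i < vl for i in range(vlmax)]


def active_elements(v0_mask, vl):
    """Elements a masked instruction actually operates on.

    Models the active set as (explicit mask bit) AND (i < vl).
    Tail and inactive value policies are not modelled here.
    """
    length_mask = vl_as_mask(vl, len(v0_mask))
    return [m and l for m, l in zip(v0_mask, length_mask)]


# Hypothetical example: VLMAX = 8, vl = 5, mask bits as held in v0.
v0 = [True, False, True, True, False, True, True, True]
print(active_elements(v0, 5))
# prints [True, False, True, True, False, False, False, False]
```

In this view a predicate-only design has to spend one of its predicate registers on the length mask, which is the extra pressure on mask resources the comparison with SVE refers to.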


On Wed, 16 Dec 2020 14:08:22 -0800, Grant Martin <gmartin15@...> said:
| Having been a silent observer of this group for what seems like a very long time, but now recently liberated from previous
| constraints, I will observe that I have seen the use in DSPs of both dedicated mask register files and use of general vector
| type registers to serve this purpose.

| Along with operations for manipulating them.

| While there are pros and cons for both, I lean to the side of not having a special mask register file and special operations,
| but instead use existing resources and operations.

| However I have a process observation as well - it has taken RV Vector proposal a long time to converge to a near 1.0
| specification. Would going down a different route cause enough delay and debate that it would derange the process and
| significantly delay the standardization that is desired? As opposed to more modest suggestions.

| Thanks and best regards

| Grant Martin
| gmartin@...
| (gmartin15@...)



swallach
 

imho, since we are trying to address both the embedded market and the hpc market, we have conflicts wrt logic, power, and cost

addressing the hpc market, adding 8 extra registers for VM, appropriately defined, to increase the performance of loops with conditionals is not an issue.

on the other hand, for embedded these registers may be unnecessary overhead.

attached is a paper on what ARM and fujitsu have implemented. just for a reference. worth a read





http://bsc.es/disclaimer


lidawei14@...
 

In some cases we have widening computations with large LMUL settings, and we will quickly run out of v0-v31 if we also have to keep masks in these registers.
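
To make the register-pressure concern concrete, here is a rough count of how many of the 32 vector registers a widening operation ties up. It is a back-of-the-envelope sketch, not an allocator: the helper names are hypothetical, only integer LMUL is considered, and register-group alignment constraints are ignored. It relies on the destination of a widening operation using EMUL = 2*LMUL, which limits the source LMUL to at most 4.

```python
NUM_VREGS = 32  # architectural vector registers v0-v31


def widening_op_regs(lmul, num_sources=2):
    """Registers occupied by one widening op: sources at LMUL, destination at 2*LMUL."""
    if lmul not in (1, 2, 4):
        raise ValueError("integer LMUL must be 1, 2, or 4 so that 2*LMUL <= 8")
    return num_sources * lmul + 2 * lmul


def regs_left_for_masks(lmul, live_ops):
    """Registers remaining once `live_ops` widening ops keep all their operands live."""
    return NUM_VREGS - live_ops * widening_op_regs(lmul)


for lmul in (1, 2, 4):
    print(lmul, widening_op_regs(lmul), regs_left_for_masks(lmul, live_ops=2))
# prints:
# 1 4 24
# 2 8 16
# 4 16 0
```

Under these assumptions, two live widening operations at LMUL=4 already consume all 32 registers, leaving none in which to hold masks.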


Grant Martin
 

 Having been a silent observer of this group for what seems like a very long time, but now recently liberated from previous constraints, I will observe that I have seen the use in DSPs of both dedicated mask register files and use of general vector type registers to serve this purpose.

Along with operations for manipulating them.

While there are pros and cons for both, I lean to the side of not having a special mask register file and special operations, but instead use existing resources and operations.

However I have a process observation as well - it has taken RV Vector proposal a long time to converge to a near 1.0 specification.  Would going down a different route cause enough delay and debate that it would derange the process and significantly delay the standardization that is desired?  As opposed to more modest suggestions.

Thanks and best regards 

Grant Martin
gmartin@...
(gmartin15@...)

On Dec 16, 2020, at 12:54 PM, swallach <steven.wallach@...> wrote:


i guess i am looking at the wrong set of apps.  

in any case VM registers NOT in the vector registers permit robust and performance-optimized operations under mask.

wrt extra instructions. i am neutral.  




Alex Solomatnikov
 

One option is to allow mask generating instructions (compares) to write either to regular vector regs or to vmask and to provide move instructions between vector regs and vmask.

But mask consuming instructions can use only vmask as a mask. Mask load and store are also only for vmask.

This is no worse than current design.

Alex


On Wed, Dec 16, 2020 at 12:29 PM Bill Huffman <huffman@...> wrote:
I don’t think a separate mask register will do at all.  It would take a mask register file with at least 8 and maybe 16 registers.  Lots of compare results need to be kept and operations need to be done on mask registers.  I don't think we should have a separate mask register file.

      Bill







swallach
 

i guess i am looking at the wrong set of apps.  

in any case VM registers NOT in the vector registers permit robust and performance-optimized operations under mask.

wrt extra instructions. i am neutral.  


On Dec 16, 2020, at 3:49 PM, Roger Espasa <roger.espasa@...> wrote:


8 mask registers are quite needed in modern outer-vectorized loops. Also in graphics shaders. I would say 16 is overkill.

Now, and I am not defending this, if we had to go this route, I would seriously fight for masks-in-x-registers, i.e., no new state, no new instructions. Only a few arch tricks to try to avoid loss of decoupling between vector unit and scalar unit. That’s better than a new set of registers and instructions.

Roger. 



Roger Espasa
 

8 mask registers are quite needed in modern outer-vectorized loops. Also in graphics shaders. I would say 16 is overkill.

Now, and I am not defending this, if we had to go this route, I would seriously fight for masks-in-x-registers, i.e., no new state, no new instructions. Only a few arch tricks to try to avoid loss of decoupling between vector unit and scalar unit. That’s better than a new set of registers and instructions.

Roger. 

On Wed, 16 Dec 2020 at 21:34, swallach <steven.wallach@...> wrote:
in my experience only one, maybe two, vm registers are needed

nested loops under if statements are rare.









swallach
 

i would also add that if 8 or 16 registers are needed, why do we only have one register, V0? if this were true we would need to multiplex between various vector registers and V0

i believe i have interpreted your comment of 8 or 16 registers correctly



------------------------------------


I don’t think a separate mask register will do at all.  It would take a mask register file with at least 8 and maybe 16 registers.  Lots of compare results need to be kept and operations need to be done on mask registers. I don't think we should have a separate mask register file.


     Bill

WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer


swallach
 

in my experience only one, maybe two, vm registers are needed

nested loops under if statements are rare.



On Dec 16, 2020, at 3:29 PM, Bill Huffman <huffman@...> wrote:

I don’t think a separate mask register will do at all. It would take a mask register file with at least 8 and maybe 16 registers. Lots of compare results need to be kept and operations need to be done on mask registers. I don't think we should have a separate mask register file.

Bill



Bill Huffman
 

I don’t think a separate mask register will do at all. It would take a mask register file with at least 8 and maybe 16 registers. Lots of compare results need to be kept and operations need to be done on mask registers. I don't think we should have a separate mask register file.

Bill

-----Original Message-----
From: tech-vector-ext@... <tech-vector-ext@...> On Behalf Of swallach
Sent: Wednesday, December 16, 2020 12:26 PM
To: Alex Solomatnikov <sols@...>
Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/12/04

EXTERNAL MAIL


i totally agree. if this is done, then instructions like: count bits, etc can directly apply to the mask register.

also, from a hardware implementation, the VM register can be implemented with LATCHES. this facilitates a better implementation (imho) for operations under mask

and yes load and store VM are required



swallach
 

i totally agree. if this is done, then instructions like: count bits, etc can directly apply to the mask register.

also, from a hardware implementation, the VM register can be implemented with LATCHES. this facilitates a better implementation (imho) for operations under mask

and yes load and store VM are required

——


If separate loads and stores are introduced for mask, then separate vmask register can be introduced to avoid dual use of v0 (as a regular vector register and as a mask register) and its complications.

Alex


Alex Solomatnikov
 

If separate loads and stores are introduced for mask, then separate vmask register can be introduced to avoid dual use of v0 (as a regular vector register and as a mask register) and its complications.

Alex

On Fri, Dec 4, 2020 at 7:01 PM Krste Asanovic <krste@...> wrote:


Date: 2020/12/04
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~12
Current issues on github: https://github.com/riscv/riscv-v-spec

Note: No meeting week Dec 11 due to Summit week.

Issues discussed:

# Memory ordering

Most of the meeting was spent discussing a strategy to handle vector
memory ordering wrt the RISC-V MCM (RVWMO, RVTSO), i.e., ordering
as observed/influenced by other harts in the system.

A big concern is ordering of younger scalar loads after older vector
loads when both are to the same address, as this complicates
high-performance in-order implementations (OoO implementations already
have to deal with ordering around unknown addresses in any case, so
this is not considered a significant additional burden there).  This
load-load ordering is required for the existing MCM, and the discussion
was around how it would be difficult to remove this ordering guarantee on
current vector load instructions while preserving the existing software
view of memory, possibly either complicating mapping of standard languages
or requiring software to add fences that would hurt performance on a
large class of machines.

One possible approach that was discussed was to add separate vector
memory instructions with weaker memory ordering, either encoded as new
opcodes or with some CSR field that modifies behavior of existing
instruction encodings.  This might only be required for gather
operations, but there was some discussion of whether even greater
weakening, including of intra-thread ordering, should be considered.

It was felt defining and experimenting with these variants on
memory ordering would delay the vector spec even further, and so the
consensus was to enter public review with the current PoR that follows
standard RVWMO (or really, the standard MCM including TSO) at the
instruction level with the current instructions (intra-instruction
ordering was already relaxed per current draft spec), and consider
weaker instruction forms as a later extension.

# Mask handling

We further discussed the challenges of distributing mask register
values for machines with spatial wide datapaths using internal dynamic
data striping.  In particular, all common instructions used to produce
mask values are explicitly encoded in the ISA, except for loads from
memory.  Machines with internal dynamic data striping will therefore require
hiccups (additional microops) in the pipeline to rearrange load data
whenever used as a mask (heuristics/predictors might be possible to
reduce hiccups).

The most important case is that of mask register spill/refill, but
another important case is loading of packed bit vectors from memory
for use as masks.

Oblivious context save/restore would still likely require hiccups as
the save/restore code would not know data type assumption for next use
of a register, but these hiccups would be rare.

To help reduce these hiccups, we discussed the addition of new
unit-stride loads and stores that would use the lumop/sumop field to
encode EEW=1, and also use effective vl = ceil(vl/8) (implying
effectively EMUL<=1).  Proposed instructions would be:

       vle1.v vd, (rs1)    # Byte load with effective vl = ceil(vl/8)
       vse1.v vs2, (rs1)   # Byte store with effective vl = ceil(vl/8)

For context switch, or where multiple vector lengths are present in a
loop, whole register versions would also be useful, and might be
simpler to provide and could be the only alternative.

       vl1re1.v vd, (rs1) # Whole register load

These options, and whether any should be added to v1.0 for public
review, are to be discussed further on email.  How to treat the extra
bits in a byte loaded from memory is an open issue: 1) set them to 0s,
2) set them to 1s to match tail-agnostic, or 3) use the whole byte from
memory.  Option 3) is probably simplest for implementations.
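
The three options for the extra bits can be compared with a small
Python sketch. It assumes mask bit i lives in bit (i mod 8) of byte
(i div 8), and the policy labels "zeros", "ones" and "byte" are names
invented here for options 1), 2) and 3); none of this is spec text.

```python
def load_last_mask_byte(mem_byte: int, vl: int, policy: str) -> int:
    """Value written to the final, partially used mask byte (vl > 0).

    policy (labels invented for this sketch):
      "zeros" - option 1: extra bits forced to 0
      "ones"  - option 2: extra bits forced to 1 (tail-agnostic style)
      "byte"  - option 3: whole byte taken from memory unchanged
    """
    used = vl % 8            # live mask bits in the last byte
    if used == 0:            # vl is a multiple of 8: no extra bits
        return mem_byte
    live = (1 << used) - 1   # low `used` bits hold live mask bits
    if policy == "zeros":
        return mem_byte & live
    if policy == "ones":
        return (mem_byte | ~live) & 0xFF
    if policy == "byte":
        return mem_byte
    raise ValueError(f"unknown policy: {policy}")
```

Only option 3) needs no per-bit logic on the loaded byte, which is
consistent with it being described as the simplest to implement.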








David Horner
 



On Thu, Dec 10, 2020, 04:44 Bill Huffman, <huffman@...> wrote:
On the issue of what bits to load for vle1.v, we need to decide whether
these are byte loads of length ceil(vl/8) or whether they are bit loads
of length vl.  Bit loads _can_ have the additional bits as tail-agnostic
but must not have them as tail-undisturbed. 
I concur.
Software can achieve the effect of tail-undisturbed by:
A) preconditioning the load,
B) loading into a temporary register and then using bitwise logic to merge into the target, or
C) saving the last byte of the target, doing the vle1.v load, reading the last byte, and writing back the merge of the two saved last bytes.
In most cases this 'need' could be avoided by other means.
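
A minimal sketch of approach B, written as a byte-level Python model
rather than vector code. The function name and the bytes-based register
model are assumptions for illustration, as is the layout with mask bit
i in bit (i mod 8) of byte (i div 8).

```python
def merge_tail_undisturbed(old: bytes, loaded: bytes, vl: int) -> bytes:
    """Take bits [0, vl) from `loaded`, keep all other bits from `old`.

    `old` is the previous target register value and `loaded` is the
    temporary register filled by the byte load (hypothetical model).
    """
    n = (vl + 7) // 8        # bytes touched by the byte load
    used = vl % 8            # live bits in the last touched byte
    out = bytearray(old)
    out[:n] = loaded[:n]
    if used:
        live = (1 << used) - 1
        # Bitwise merge: live bits from the load, tail bits from old.
        out[n - 1] = (loaded[n - 1] & live) | (old[n - 1] & ~live & 0xFF)
    return bytes(out)
```

For example, with vl = 11 only the low three bits of the second byte
come from the loaded data; the remaining bits keep their old values.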
It would be nice if these
were bit loads, but it will be a little more complex for implementation
and I expect we may run into other issues down the line.  I think I lean
toward byte loads.
+1

We have a similar issue for vse1.v as the remaining bits in the memory
byte _must_ be stored with something.  Here it seems simpler and perhaps
more logical to say this is a byte store with length ceil(vl/8) - which
helps re-enforce the choice of byte load for vle1.v.
+again

      Bill



Bill Huffman
 

On the issue of what bits to load for vle1.v, we need to decide whether
these are byte loads of length ceil(vl/8) or whether they are bit loads
of length vl. Bit loads _can_ have the additional bits as tail-agnostic
but must not have them as tail-undisturbed. It would be nice if these
were bit loads, but it will be a little more complex for implementation
and I expect we may run into other issues down the line. I think I lean
toward byte loads.

We have a similar issue for vse1.v as the remaining bits in the memory
byte _must_ be stored with something. Here it seems simpler and perhaps
more logical to say this is a byte store with length ceil(vl/8) - which
helps re-enforce the choice of byte load for vle1.v.

Bill


lidawei14@...
 

Hi Krste,

This mask loading instruction is exactly the one we have been looking forward to.

I am a bit confused about the hiccups: why do machines with internal dynamic data striping require hiccups whenever loaded data is used as a mask?
Does it mean that mask registers and normal vector registers are arranged differently, so that the two have to be distinguished while loading?
How do the proposed instructions help reduce these hiccups?

Thank you,
Dawei


Krste Asanovic
 

Date: 2020/12/04
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~12
Current issues on github: https://github.com/riscv/riscv-v-spec

Note: No meeting week Dec 11 due to Summit week.

Issues discussed:

# Memory ordering

Most of the meeting was spent discussing a strategy to handle vector
memory ordering with respect to the RISC-V MCM (RVWMO, RVTSO), i.e.,
ordering as observed/influenced by other harts in the system.

A big concern is ordering of younger scalar loads after older vector
loads when both are to the same address, as this complicates
high-performance in-order implementations (OoO implementations already
have to deal with ordering around unknown addresses in any case, so
this is not considered a significant additional burden there). This
load-load ordering is required for the existing MCM, and the discussion
was around how it would be difficult to remove this ordering guarantee
on current vector load instructions while preserving the existing
software view of memory, possibly either complicating the mapping of
standard languages or requiring software to add fences that would hurt
performance on a large class of machines.

One possible approach that was discussed was to add separate vector
memory instructions with weaker memory ordering, either encoded as new
opcodes or with some CSR field that modifies behavior of existing
instruction encodings. This might only be required for gather
operations, but there was some discussion of whether even greater
weakening, including of intra-thread ordering, should be considered.

It was felt defining and experimenting with these variants on
memory ordering would delay the vector spec even further, and so the
consensus was to enter public review with the current PoR that follows
standard RVWMO (or really, the standard MCM including TSO) at the
instruction level with the current instructions (intra-instruction
ordering was already relaxed per current draft spec), and consider
weaker instruction forms as a later extension.

# Mask handling

We further discussed the challenges of distributing mask register
values for machines with spatial wide datapaths using internal dynamic
data striping. In particular, all common instructions used to produce
mask values are explicitly encoded in the ISA, except for loads from
memory. Machines with internal dynamic data striping will therefore require
hiccups (additional microops) in the pipeline to rearrange load data
whenever used as a mask (heuristics/predictors might be possible to
reduce hiccups).

The most important case is that of mask register spill/refill, but
another important case is loading of packed bit vectors from memory
for use as masks.

Oblivious context save/restore would still likely require hiccups as
the save/restore code would not know data type assumption for next use
of a register, but these hiccups would be rare.

To help reduce these hiccups, we discussed the addition of new
unit-stride loads and stores that would use the lumop/sumop field to
encode EEW=1, and also use effective vl = ceil(vl/8) (implying
effectively EMUL<=1). Proposed instructions would be:

vle1.v vd, (rs1) # Byte load with effective vl = ceil(vl/8)
vse1.v vs2, (rs1) # Byte store with effective vl = ceil(vl/8)

For context switches, or where multiple vector lengths are present in
a loop, whole-register versions would also be useful; they might be
simpler to provide and could be the only alternative provided.

vl1re1.v vd, (rs1) # Whole register load

These options, and whether any should be added to v1.0 for public
review, are to be discussed further on email. How to treat the extra
bits in a byte loaded from memory is an open issue: 1) set them to 0s,
2) set them to 1s to match tail-agnostic, or 3) use the whole byte from
memory. Option 3) is probably simplest for implementations.