
v0.10 release of vector spec

Krste Asanovic
 

I cut a v0.10 release after adding all the substantial pending updates. There is still a bunch of work to do before public review, but this is a convenient milestone for toolchain developers,
Krste


Next Vector TG meeting tomorrow, Friday Jan 29

Krste Asanovic
 

I have scheduled the next vector TG meeting for tomorrow in the usual slot, with the usual Zoom link on the TG Google calendar.

I hope to push out updated spec sometime before then,

Krste


Re: Restarting vector TG meetings next week

Jeffrey Osier-Mixon <josiermixon@...>
 

Hi Krste - was this ever scheduled? 

thanks

On Thu, Jan 21, 2021 at 4:12 PM Krste Asanovic <krste@...> wrote:
I was going to restart the vector TG meetings next week (Jan 29), and have a goal of having most pending updates added to the spec a few days before then.

Krste








--
Jeffrey Osier-Mixon  |  jefro@...   •   jefro@... 
Linux Foundation | linuxfoundation.org   •  RISC-V International | riscv.org 


About vmv.x.s should be vs1 = 0?

yahan@...
 

I see this in riscv-v-spec at commit 0e8cdeb26bb98de2b1089d79a681af2c5a65e712:
vmv.x.s rd, vs2 # x[rd] = vs2[0] (rs1=0)
vmv.x.s belongs to VWXUNARY0 and is encoded as OPMVV.
But OPMVV has only a vs1 field, not rs1:
 
  funct6   | vm  |   vs2    |    vs1   | 0 1 0 |  vd/rd  |1010111| OP-V (OPMVV)
So I think `vmv.x.s rd, vs2  # x[rd] = vs2[0] (rs1=0)` should be changed to `vmv.x.s rd, vs2  # x[rd] = vs2[0] (vs1=0)`. Or should
vmv.x.s instead be encoded under VRXUNARY0 and OPMVX?
 
  funct6   | vm  |   vs2    |    rs1   | 1 1 0 |  vd/rd  |1010111| OP-V (OPMVX)
See also: https://github.com/riscv/riscv-v-spec/issues/625
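
To make the OPMVV/OPMVX distinction concrete, here is a small Python sketch that packs the fields shown in the two diagrams above and reports whether bits 19:15 name vs1 or rs1. The bit positions are the usual 32-bit RISC-V ones implied by the field order in the diagrams. The helper names and the funct6 value used in the example are illustrative assumptions, not taken from the spec text quoted here.

```python
# Field layout, following the OP-V diagrams in the message (bit positions assumed):
#   funct6[31:26] | vm[25] | vs2[24:20] | vs1/rs1[19:15] | funct3[14:12] | vd/rd[11:7] | opcode[6:0]
OP_V = 0b1010111
OPMVV = 0b010  # bits 19:15 name a vector register (vs1)
OPMVX = 0b110  # bits 19:15 name a scalar register (rs1)


def encode_opv(funct6, vm, vs2, src1, funct3, vd_rd):
    """Pack the OP-V fields into a 32-bit instruction word."""
    return (
        ((funct6 & 0x3F) << 26)
        | ((vm & 0x1) << 25)
        | ((vs2 & 0x1F) << 20)
        | ((src1 & 0x1F) << 15)
        | ((funct3 & 0x7) << 12)
        | ((vd_rd & 0x1F) << 7)
        | OP_V
    )


def src1_name(word):
    """Return 'vs1' or 'rs1' for bits 19:15, decided by funct3 alone."""
    funct3 = (word >> 12) & 0x7
    if funct3 == OPMVV:
        return "vs1"
    if funct3 == OPMVX:
        return "rs1"
    raise ValueError("funct3 is neither OPMVV nor OPMVX")


# Hypothetical funct6 value, used only so there is something to encode.
EXAMPLE_FUNCT6 = 0b010000
word = encode_opv(EXAMPLE_FUNCT6, vm=1, vs2=8, src1=0, funct3=OPMVV, vd_rd=10)
print(src1_name(word))  # names the field that the "=0" annotation refers to
```

Under this reading, an instruction encoded with funct3 = 010 has a zero vs1 field, which is the point of the proposed wording change.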


Restarting vector TG meetings next week

Krste Asanovic
 

I was going to restart the vector TG meetings next week (Jan 29), and have a goal of having most pending updates added to the spec a few days before then.

Krste


Re: Vector TG minutes for 2020/12/18 meeting

lidawei14@...
 

Perhaps, as an explicit naming convention for mask operations, we could rename "vle1.v" to "vmle1.v" instead.


Vector Extension Workgroup Meeting

Krste Asanovic
 

I plan to restart meetings, probably in two weeks. I hope to have
a mostly updated draft before then,
Krste

On Fri, 8 Jan 2021 16:21:31 +0000, "Bill Huffman" <huffman@cadence.com> said:
| I don’t see Vector Extension meetings on the calendar. Is the group meeting?
| Bill

| Bill Huffman
| Distinguished Engineer
| T: 408.944.7613

|


Vector Extension Workgroup Meeting

Bill Huffman
 

I don’t see Vector Extension meetings on the calendar.  Is the group meeting?

 

       Bill

 

Bill Huffman
Distinguished Engineer
T: 408.944.7613   

 



 


Re: Vector TG minutes for 2020/12/18 meeting

Zalman Stern
 

Does it get easier if the specification is just the immediate value plus one?

I really don't understand how this encoding is particularly great for immediates as many of the values are likely very rarely or even never used and it seems like one can't get long enough values even for existing SIMD hardware in some data types. Compare to e.g.:
    (first_bit ? 3 : 1) << rest_of_the_bits
or:
    map[] = { 1, 3, 5, 8 }; // Or maybe something else for 5 and 8
    map[first_two_bits] << rest_of_the_bits;

I.e. get a lot of powers of two, multiples of three-vecs for graphics, maybe something else.
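
As a rough illustration of the two alternative decodings suggested above, the following Python sketch enumerates the values each would reach. It assumes a 5-bit immediate and takes "first bit" to mean the most significant bit; both are assumptions, as are the function names. The base table is the one from the message, where 5 and 8 are explicitly placeholders.

```python
def decode_two_way(imm5):
    """(first_bit ? 3 : 1) << rest_of_the_bits, with bit 4 taken as first_bit."""
    first_bit = (imm5 >> 4) & 0x1
    rest = imm5 & 0xF
    return (3 if first_bit else 1) << rest


BASE = (1, 3, 5, 8)  # map[] from the message; 5 and 8 are placeholders there


def decode_four_way(imm5):
    """map[first_two_bits] << rest_of_the_bits, with bits 4:3 as the table index."""
    first_two = (imm5 >> 3) & 0x3
    rest = imm5 & 0x7
    return BASE[first_two] << rest


two_way = sorted({decode_two_way(i) for i in range(32)})
four_way = sorted({decode_four_way(i) for i in range(32)})
print(len(two_way), two_way[:8], max(two_way))
print(len(four_way), four_way[:8], max(four_way))
```

The trade-off is visible in the enumeration: the two-way form reaches much larger values, while the four-way form covers more small non-power-of-two lengths.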

-Z-


On Mon, Dec 21, 2020 at 10:47 AM Guy Lemieux <guy.lemieux@...> wrote:
for vsetivli, with the uimm=00000 encoding, rather than setting vl to 32, how about setting it to some other meaning?

one option is to set vl=VLMAX. i have some concerns about software using this safely (eg, if VLMAX turns out to be much larger than software anticipated, then it would fail; correcting this requires more instructions than just using the regular vsetvl/vsetvli would have used). 

another option is to allow an implementation-defined vl to be chosen by hardware; this could be anywhere between 1 and VLMAX. for example, implementations may just choose vl=32, or they may choose something else. it allows the CPU architect to devise a scheme that best fits the implementation. this may consider factors like the effective width of the execution engine, the pipeline depth (to reduce likelihood of stalls from dependent instructions), or that the vector register file is actually a multi-level memory hierarchy where some smaller values may operate with greater efficiency (lower power), or matching VL to the optimal memory system burst length. perhaps some guidance by the spec could be given here for the default scheme, eg whether the implementation optimizes for best performance or power (while still allowing implementations to modify this default via an implementation-defined CSR). software using a few extra cycles to check the returned vl against AVL should not be a big problem (the simplest solution being vsetvli followed by vsetivli)

g


On Fri, Dec 18, 2020 at 6:13 PM Krste Asanovic <krste@...> wrote:

# vsetivli

A new variant of vsetvl was proposed providing an immediate as the AVL
in rs1[4:0].  The immediate encoding is the same as for CSR immediate
instructions. The instruction would have bit 31:30 = 11 and bits 29:20
would be encoded same as vsetvli.

This would be used when AVL was statically known, and known to fit
inside vector register group.  Compared with existing PoR, it removes
need to load immediate into a spare scalar register before executing
vsetvli, and is useful for handling scalar values in vector register
(vl=1) and other cases where short fixed-sized vectors are the
datatype (e.g., graphics).

There was discussion on whether uimm=00000 should represent 32 or be
reserved.  32 is more useful, but adds a little complexity to
hardware.

There was also discussion on whether instruction should set vill if
selected AVL is not supported, or whether should clip vl to VLMAX as
with other instructions, or if behavior should be reserved.  Group
generally favored writing vill to expose software errors.


Re: Vector TG minutes for 2020/12/18 meeting

Guy Lemieux
 

for vsetivli, with the uimm=00000 encoding, rather than setting vl to 32, how about setting it to some other meaning?

one option is to set vl=VLMAX. i have some concerns about software using this safely (eg, if VLMAX turns out to be much larger than software anticipated, then it would fail; correcting this requires more instructions than just using the regular vsetvl/vsetvli would have used). 

another option is to allow an implementation-defined vl to be chosen by hardware; this could be anywhere between 1 and VLMAX. for example, implementations may just choose vl=32, or they may choose something else. it allows the CPU architect to devise a scheme that best fits the implementation. this may consider factors like the effective width of the execution engine, the pipeline depth (to reduce likelihood of stalls from dependent instructions), or that the vector register file is actually a multi-level memory hierarchy where some smaller values may operate with greater efficiency (lower power), or matching VL to the optimal memory system burst length. perhaps some guidance by the spec could be given here for the default scheme, eg whether the implementation optimizes for best performance or power (while still allowing implementations to modify this default via an implementation-defined CSR). software using a few extra cycles to check the returned vl against AVL should not be a big problem (the simplest solution being vsetvli followed by vsetivli)
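
A toy Python model of the second option may help show what "check the returned vl against AVL" costs software. It is only a sketch: impl_defined_vl and its preferred parameter are hypothetical stand-ins for whatever scheme an implementation might pick, and the element-wise "+ 1" is a placeholder for real vector work.

```python
def impl_defined_vl(vlmax, preferred):
    """vl that hardware might pick for uimm=00000 under this proposal: anywhere in 1..VLMAX."""
    return max(1, min(preferred, vlmax))


def process(data, vlmax, preferred):
    """Strip-mine over data without assuming the granted vl covers the remaining AVL."""
    out = []
    i = 0
    while i < len(data):
        avl = len(data) - i
        # Software-side check: never use more elements than are actually left.
        vl = min(impl_defined_vl(vlmax, preferred), avl)
        out.extend(x + 1 for x in data[i:i + vl])
        i += vl
    return out


print(process(list(range(10)), vlmax=4, preferred=32))
```

The loop gives the same result whatever value the implementation prefers, which is the property software would need to rely on.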

g


On Fri, Dec 18, 2020 at 6:13 PM Krste Asanovic <krste@...> wrote:

# vsetivli

A new variant of vsetvl was proposed providing an immediate as the AVL
in rs1[4:0].  The immediate encoding is the same as for CSR immediate
instructions. The instruction would have bit 31:30 = 11 and bits 29:20
would be encoded same as vsetvli.

This would be used when AVL was statically known, and known to fit
inside vector register group.  Compared with existing PoR, it removes
need to load immediate into a spare scalar register before executing
vsetvli, and is useful for handling scalar values in vector register
(vl=1) and other cases where short fixed-sized vectors are the
datatype (e.g., graphics).

There was discussion on whether uimm=00000 should represent 32 or be
reserved.  32 is more useful, but adds a little complexity to
hardware.

There was also discussion on whether instruction should set vill if
selected AVL is not supported, or whether should clip vl to VLMAX as
with other instructions, or if behavior should be reserved.  Group
generally favored writing vill to expose software errors.


Vector TG minutes for 2020/12/18 meeting

Krste Asanovic
 

Date: 2020/12/18
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~10
Current issues on github: https://github.com/riscv/riscv-v-spec

Note: No more meetings are scheduled until next year. The time slot may have
to change.

Issues discussed:

# Freeze process

We are close to freezing the spec. There is a waiver from chairs for
SAIL model and compatibility tests, but we will need to complete these
before ratification.

# auto pdf generation

There was a request to have the repo automatically generate a pdf
version on commits to avoid users having to install formatting tools.

# Mask handling

Continuing discussion, the concrete proposal is to add new unit-stride
loads and stores that would use the lumop/sumop field to encode byte
load/stores used for masks, and also use effective vl = ceil(vl/8)
(implying effectively EMUL<=1). Proposed instructions would be:

vle1.v vd, (rs1) # Byte load with effective vl = ceil(vl/8)
vse1.v vs2, (rs1) # Byte store with effective vl = ceil(vl/8)

Encoded with lumop/sumop = 00011.
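
The effective-vl rule can be sketched in a few lines of Python. The byte count follows directly from ceil(vl/8) above; the packing order (element 0 in bit 0 of byte 0) is an assumption about the mask layout made for illustration, and the function names are hypothetical.

```python
def mask_bytes(vl):
    """Effective vl, in bytes, for vle1.v / vse1.v: ceil(vl / 8)."""
    return (vl + 7) // 8


def pack_mask(bits):
    """Pack one mask bit per element into bytes (assumed layout: element i -> bit i % 8 of byte i // 8)."""
    out = bytearray(mask_bytes(len(bits)))
    for i, b in enumerate(bits):
        if b:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


for vl in (1, 8, 9, 64):
    print(vl, mask_bytes(vl))
```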

We discussed adding whole vector register load version with
lumop=01011, which would only be a mask hint, but for now, this seems
less necessary so is not on PoR.

vl1re1.v vd, (rs1) # Whole register load

# vsetivli

A new variant of vsetvl was proposed providing an immediate as the AVL
in rs1[4:0]. The immediate encoding is the same as for CSR immediate
instructions. The instruction would have bits 31:30 = 11 and bits 29:20
would be encoded the same as vsetvli.
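
The proposed bit layout can be sketched as follows. Bits 31:30, bits 29:20, and the use of the rs1 field for the immediate come from the paragraph above; funct3 = 111 and the OP-V major opcode are carried over from the existing vsetvli format as an assumption, since the minutes do not restate them. The function names are illustrative.

```python
def encode_vsetivli(rd, uimm, zimm10):
    """Pack a vsetivli word: 11 | zimm[9:0] | uimm[4:0] | funct3 | rd | opcode."""
    if not (0 <= rd < 32 and 0 <= uimm < 32 and 0 <= zimm10 < 1024):
        raise ValueError("field out of range")
    return (
        (0b11 << 30)        # bits 31:30 = 11 select the immediate-AVL variant
        | (zimm10 << 20)    # bits 29:20, encoded the same as for vsetvli
        | (uimm << 15)      # 5-bit unsigned AVL in the rs1 field
        | (0b111 << 12)     # funct3, assumed unchanged from vsetvli
        | (rd << 7)
        | 0b1010111         # OP-V major opcode, assumed unchanged from vsetvli
    )


def decode_uimm(word):
    """Extract the 5-bit AVL immediate, zero-extended as for CSR immediates."""
    return (word >> 15) & 0x1F


w = encode_vsetivli(rd=5, uimm=4, zimm10=0b0000001000)
print(hex(w), decode_uimm(w))
```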

This would be used when AVL was statically known, and known to fit
inside vector register group. Compared with existing PoR, it removes
need to load immediate into a spare scalar register before executing
vsetvli, and is useful for handling scalar values in vector register
(vl=1) and other cases where short fixed-sized vectors are the
datatype (e.g., graphics).

There was discussion on whether uimm=00000 should represent 32 or be
reserved. 32 is more useful, but adds a little complexity to
hardware.

There was also discussion on whether the instruction should set vill if the
selected AVL is not supported, or whether it should clip vl to VLMAX as
with other instructions, or if the behavior should be reserved. The group
generally favored writing vill to expose software errors.
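
The open questions in this section can be laid side by side with a small Python model. It is a sketch of the alternatives under discussion, not of any adopted behavior: the function names and policy strings are hypothetical, and returning vl = 0 when vill is set is an assumption made here for illustration.

```python
def decode_avl(uimm, zero_is_32):
    """Map the 5-bit immediate to an AVL; None models the 'reserved' alternative."""
    if not 0 <= uimm < 32:
        raise ValueError("uimm must fit in 5 bits")
    if uimm == 0:
        return 32 if zero_is_32 else None
    return uimm


def apply_avl(avl, vlmax, policy):
    """Return (vl, vill) for an AVL under one of the two discussed policies."""
    if avl <= vlmax:
        return avl, False
    if policy == "vill":
        return 0, True       # flag the software error (vl = 0 is an assumption)
    if policy == "clip":
        return vlmax, False  # behave like the other vsetvl variants
    raise ValueError("unknown policy")


for policy in ("vill", "clip"):
    print(policy, apply_avl(decode_avl(16, zero_is_32=True), vlmax=8, policy=policy))
```

With the vill policy, a statically chosen AVL that does not fit is reported rather than silently shortened, which matches the group's stated preference.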


Last vector TG meeting of 2020, usual time, Friday Dec 18

Krste Asanovic
 

Agenda is hopefully clearing up any remaining major issues before 1.0 draft can go out,

Krste


Re: Vector Task Group minutes 2020/12/04

Thang Tran
 

I am totally in agreement with Krste. Adding the mask load/store is an improvement, but adding new mask registers is too disruptive and increases area.
Thanks, Thang

-----Original Message-----
From: tech-vector-ext@lists.riscv.org [mailto:tech-vector-ext@lists.riscv.org] On Behalf Of Krste Asanovic
Sent: Thursday, December 17, 2020 1:43 AM
To: Grant Martin <gmartin15@pacbell.net>
Cc: Steven Wallach <steven.wallach@bsc.es>; Roger Espasa <roger.espasa@semidynamics.com>; Alex Solomatnikov <sols@sifive.com>; Bill Huffman <huffman@cadence.com>; Krste Asanovic <krste@berkeley.edu>; tech-vector-ext@lists.riscv.org
Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/12/04


I'm not contemplating changing mask design (yet again) at this point in process. I don't see any great advantage to any of this last round of proposals, as they all have significant downsides for some part of implementation space. The current design, like any real design, is not perfect, but does balance a lot of competing concerns coming from different design points.

@sols: The mask load instructions are being added to allow a microarchitecture to see all common mask writes, enabling complex microarchitectures to perform mask optimizations. In particular, for wide datapaths and for renamed registers.

Without renaming, and without deep temporal registers, having v0 be only mask source reduces cost of mask read port.

@swallach: The mask logical operations can be fused with masked operations in more complex machines to reduce software cost of only allowing v0 be mask.

@sols,lidawei: Adding more dedicated mask register state increases cost/complexity for all machines. Long LMUL needs a lot of bits to hold mask. Dropping longer LMUL would reduce efficiency of simple machines.

@roger: Using x registers for masks breaks vector-length agnostic goal and would limit LMUL.

@lidawei: Fractional LMUL helps with case where you want widening operations and lots of mask registers. If uarch utilization is low with lower LMUL, then one solution is to increase VLEN for same physical datapath width.

@swallach: ARM SVE uses predicates to implement vector length, so unsurprisingly ends up needing more mask resources. RVV vl can be considered additional mask that is AND-ed in with each mask.

Krste


On Wed, 16 Dec 2020 14:08:22 -0800, Grant Martin <gmartin15@pacbell.net> said:
| Having been a silent observer of this group for what seems like a
| very long time, but now recently liberated from previous constraints,
| I will observe that I have seen the use in DSPs of both dedicated mask register files and use of general vector type registers to serve this purpose.

| Along with operations for manipulating them.

| While there are pros and cons for both, I lean to the side of not
| having a special mask register file and special operations, but instead use existing resources and operations.

| However I have a process observation as well - it has taken RV Vector
| proposal a long time to converge to a near 1.0 specification. Would
| going down a different route cause enough delay and debate that it would derange the process and significantly delay the standardization that is desired? As opposed to more modest suggestions.

| Thanks and best regards

| Grant Martin
| gmartin@ieee.org
| (gmartin15@pacbell.net)
| Mobile +1.510.703.7470
| Home +1.925.846.8683
| Sent from my iPad

| On Dec 16, 2020, at 12:54 PM, swallach <steven.wallach@bsc.es> wrote:

| i guess i am looking at the wrong set of apps.

| in any case VM registers NOT in the vector registers permits a robust and performance optimized operations under mask.

| wrt extra instructions. i am neutral.

| On Dec 16, 2020, at 3:49 PM, Roger Espasa <roger.espasa@semidynamics.com> wrote:

| 8 mask registers are quite needed in modern outer-vectorized loops. Also in graphics shaders. I would say 16 is
| overkill.

| Now, and I am not defending this, if we had to go this route, I would seriously fight for masks-in-x-registers, i.e.,
| no new state, no new instructions. Only a few arch tricks to try to avoid loss of decoupling between the vector unit and
| the scalar unit. That’s better than a new set of registers and
| instructions.

| Roger.

| On Wed, 16 Dec 2020 at 21:34, swallach <steven.wallach@bsc.es> wrote:

| in my experience only one maybe two vm registers are
| needed

| nested loops under if statements is rare.

|| On Dec 16, 2020, at 3:29 PM, Bill Huffman <huffman@cadence.com> wrote:
||
|| I don’t think a separate mask register will do at all. It would take
|| a mask register file with at least 8 and
| maybe 16 registers. Lots of compare results need to be kept and operations need to be done on mask registers. I
| don't think we should have a separate mask register file.
||
|| Bill
||
|| -----Original Message-----
|| From: tech-vector-ext@lists.riscv.org
|| <tech-vector-ext@lists.riscv.org> On Behalf Of swallach
|| Sent: Wednesday, December 16, 2020 12:26 PM
|| To: Alex Solomatnikov <sols@sifive.com>
|| Cc: Krste Asanovic <krste@berkeley.edu>;
|| tech-vector-ext@lists.riscv.org
|| Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes
|| 2020/12/04
||
|| EXTERNAL MAIL
||
||
|| i totally agree. if this is done, then instructions like: count
|| bits, etc can directly apply to the mask
| register.
||
|| also, from a hardware implementation, the VM register can be implemented with LATCHES. this facilitates a
| better implementation (imho) for operations under mask
||
|| and yes load and store VM are required
||
|| ——
||
||
|| If separate loads and stores are introduced for mask, then separate
|| vmask register can be introduced to avoid
| dual use of v0 (as a regular vector register and as a mask register) and its complications.
||
|| Alex
||
|
| https://urldefense.com/v3/__http://bsc.es/disclaimer__;!!EHscmS1ygiU1l
| A!RJHAWw-769bPQyIHjTxb9o5uKdCXTVYJl2Bab73oZY-l_MvY1RgkMuZPnlTs5wU$
||
||
||
||
||

| http://bsc.es/disclaimer

| WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and
| may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If
| you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are
| strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this
| communication in error, please notify the sender and destroy and delete any copies you may have received.

| http://www.bsc.es/disclaimer
|


Re: Vector Task Group minutes 2020/12/04

Krste Asanovic
 

I'm not contemplating changing mask design (yet again) at this point
in process. I don't see any great advantage to any of this last
round of proposals, as they all have significant downsides for some
part of implementation space. The current design, like any real
design, is not perfect, but does balance a lot of competing concerns
coming from different design points.

@sols: The mask load instructions are being added to allow a
microarchitecture to see all common mask writes, enabling complex
microarchitectures to perform mask optimizations. In particular, for
wide datapaths and for renamed registers.

Without renaming, and without deep temporal registers, having v0 be
only mask source reduces cost of mask read port.

@swallach: The mask logical operations can be fused with masked
operations in more complex machines to reduce software cost of only
allowing v0 be mask.

@sols,lidawei: Adding more dedicated mask register state increases
cost/complexity for all machines. Long LMUL needs a lot of bits to
hold mask. Dropping longer LMUL would reduce efficiency of simple
machines.

@roger: Using x registers for masks breaks vector-length agnostic goal and
would limit LMUL.

@lidawei: Fractional LMUL helps with case where you want widening
operations and lots of mask registers. If uarch utilization is low
with lower LMUL, then one solution is to increase VLEN for same
physical datapath width.

@swallach: ARM SVE uses predicates to implement vector length, so
unsurprisingly ends up needing more mask resources. RVV vl can be
considered additional mask that is AND-ed in with each mask.
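
The last point, that vl acts as one more mask AND-ed with the explicit mask, can be written out as a tiny Python sketch. The function name is illustrative, and the model ignores tail and masked-off element policies.

```python
def active_elements(mask, vl):
    """Element i executes only if its mask bit is set AND i < vl."""
    return [bool(m) and i < vl for i, m in enumerate(mask)]


mask = [1, 1, 0, 1, 1]
print(active_elements(mask, vl=3))
```

Seen this way, a predicate-only design has to spend a predicate register on the loop-length information that RVV carries in vl.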

Krste


On Wed, 16 Dec 2020 14:08:22 -0800, Grant Martin <gmartin15@pacbell.net> said:
| Having been a silent observer of this group for what seems like a very long time, but now recently liberated from previous
| constraints, I will observe that I have seen the use in DSPs of both dedicated mask register files and use of general vector
| type registers to serve this purpose.

| Along with operations for manipulating them.

| While there are pros and cons for both, I lean to the side of not having a special mask register file and special operations,
| but instead use existing resources and operations.

| However I have a process observation as well - it has taken RV Vector proposal a long time to converge to a near 1.0
| specification. Would going down a different route cause enough delay and debate that it would derange the process and
| significantly delay the standardization that is desired? As opposed to more modest suggestions.

| Thanks and best regards

| Grant Martin
| gmartin@ieee.org
| (gmartin15@pacbell.net)
| Mobile +1.510.703.7470
| Home +1.925.846.8683
| Sent from my iPad

| On Dec 16, 2020, at 12:54 PM, swallach <steven.wallach@bsc.es> wrote:

| i guess i am looking at the wrong set of apps.

| in any case VM registers NOT in the vector registers permits a robust and performance optimized operations under mask.

| wrt extra instructions. i am neutral.

| On Dec 16, 2020, at 3:49 PM, Roger Espasa <roger.espasa@semidynamics.com> wrote:

| 8 mask registers are quite needed in modern outer-vectorized loops. Also in graphics shaders. I would say 16 is
| overkill.

| Now, and I am not defending this, if we had to go this route, I would seriously fight for masks-in-x-registers, i.e.,
| no new state, no new instructions. Only a few arch tricks to try to avoid loss of decoupling between the vector unit and
| the scalar unit. That’s better than a new set of registers and instructions.

| Roger.

| On Wed, 16 Dec 2020 at 21:34, swallach <steven.wallach@bsc.es> wrote:

| in my experience only one maybe two vm registers are needed

| nested loops under if statements is rare.

|| On Dec 16, 2020, at 3:29 PM, Bill Huffman <huffman@cadence.com> wrote:
||
|| I don’t think a separate mask register will do at all. It would take a mask register file with at least 8 and
| maybe 16 registers. Lots of compare results need to be kept and operations need to be done on mask registers. I
| don't think we should have a separate mask register file.
||
|| Bill
||
|| -----Original Message-----
|| From: tech-vector-ext@lists.riscv.org <tech-vector-ext@lists.riscv.org> On Behalf Of swallach
|| Sent: Wednesday, December 16, 2020 12:26 PM
|| To: Alex Solomatnikov <sols@sifive.com>
|| Cc: Krste Asanovic <krste@berkeley.edu>; tech-vector-ext@lists.riscv.org
|| Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/12/04
||
|| EXTERNAL MAIL
||
||
|| i totally agree. if this is done, then instructions like: count bits, etc can directly apply to the mask
| register.
||
|| also, from a hardware implementation, the VM register can be implemented with LATCHES. this facilitates a
| better implementation (imho) for operations under mask
||
|| and yes load and store VM are required
||
|| ——
||
||
|| If separate loads and stores are introduced for mask, then separate vmask register can be introduced to avoid
| dual use of v0 (as a regular vector register and as a mask register) and its complications.
||
|| Alex
||


Re: Vector Task Group minutes 2020/12/04

swallach
 

imho, since we are trying to address both the embedded market and the hpc market, we have conflicts wrt logic, power, and cost

addressing the hpc market, 8 extra registers for VM, appropriately defined, that increase the performance of loops with conditionals, are not an issue.

on the other hand, for embedded these registers may be unnecessary overhead.

attached is a paper on what ARM and fujitsu have implemented. just for a reference. worth a read







Re: Vector Task Group minutes 2020/12/04

lidawei14@...
 

In some cases we have widening computations with large LMUL settings, and we will quickly run out of v0-v31 if we also have to keep masks in these registers.
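
The register-pressure concern can be put in rough numbers with a short Python sketch. It counts aligned register groups for integer LMUL values and subtracts the groups that become unavailable when some single registers are reserved for masks. The packing assumption (masks placed in as few groups as possible) and the function names are illustrative only.

```python
import math

NUM_VREGS = 32  # v0-v31


def register_groups(lmul):
    """Number of aligned register groups at an integer LMUL (1, 2, 4 or 8)."""
    return NUM_VREGS // lmul


def groups_left(lmul, live_masks):
    """Groups still free for data when live_masks single registers hold masks.

    Assumes the masks are packed into as few groups as possible.
    """
    return register_groups(lmul) - math.ceil(live_masks / lmul)


for lmul in (1, 2, 4, 8):
    print(lmul, register_groups(lmul), groups_left(lmul, live_masks=4))
```

At LMUL=8 there are only four groups to begin with, and a widening operation with LMUL=4 sources already needs an eight-register destination group, so even one group given over to masks is a noticeable loss.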


Re: Vector Task Group minutes 2020/12/04

Grant Martin
 

 Having been a silent observer of this group for what seems like a very long time, but now recently liberated from previous constraints, I will observe that I have seen the use in DSPs of both dedicated mask register files and use of general vector type registers to serve this purpose.

Along with operations for manipulating them.

While there are pros and cons for both, I lean to the side of not having a special mask register file and special operations, but instead use existing resources and operations.

However I have a process observation as well - it has taken RV Vector proposal a long time to converge to a near 1.0 specification.  Would going down a different route cause enough delay and debate that it would derange the process and significantly delay the standardization that is desired?  As opposed to more modest suggestions.

Thanks and best regards 

Grant Martin
gmartin@...
(gmartin15@...)
Mobile +1.510.703.7470
Home +1.925.846.8683

On Dec 16, 2020, at 12:54 PM, swallach <steven.wallach@...> wrote:


i guess i am looking at the wrong set of apps.  

in any case VM registers NOT in the vector registers permits a robust and performance optimized operations under mask. 

wrt extra instructions. i am neutral.  


On Dec 16, 2020, at 3:49 PM, Roger Espasa <roger.espasa@...> wrote:


8 mask registers are quite needed in modern outer-vectorized loops. Also in graphics shaders. I would say 16 is overkill.

Now, and I am not defending this, if we had to go this route, I would seriously fight for masks-in-x-registers, i.e., no new state, no new instructions. Only a few arch tricks to try to avoid loss of decoupling between the vector unit and the scalar unit. That’s better than a new set of registers and instructions.

Roger. 

On Wed, 16 Dec 2020 at 21:34, swallach <steven.wallach@...> wrote:
in my experience only one maybe two vm registers are needed

nested loops under if statements is rare.   



> On Dec 16, 2020, at 3:29 PM, Bill Huffman <huffman@...> wrote:
>
> I don’t think a separate mask register will do at all.  It would take a mask register file with at least 8 and maybe 16 registers.  Lots of compare results need to be kept and operations need to be done on mask registers.  I don't think we should have a separate mask register file.
>
>      Bill
>
> -----Original Message-----
> From: tech-vector-ext@... <tech-vector-ext@...> On Behalf Of swallach
> Sent: Wednesday, December 16, 2020 12:26 PM
> To: Alex Solomatnikov <sols@...>
> Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
> Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/12/04
>
> EXTERNAL MAIL
>
>
> i  totally agree.  if this is done,  then instructions like:  count bits,  etc can directly apply to the mask register.
>
> also, from a hardware implementation, the VM register can be implemented with LATCHES. this facilitates a better implementation (imho) for operations under mask
>
> and yes load and store VM are required
>
> ——
>
>
> If separate loads and stores are introduced for mask, then separate vmask register can be introduced to avoid dual use of v0 (as a regular vector register and as a mask register) and its complications.
>
> Alex


Re: Vector Task Group minutes 2020/12/04

Alex Solomatnikov
 

One option is to allow mask generating instructions (compares) to write either to regular vector regs or to vmask and to provide move instructions between vector regs and vmask.

But mask consuming instructions can use only vmask as a mask. Mask load and store are also only for vmask.

This is no worse than current design.

Alex


On Wed, Dec 16, 2020 at 12:29 PM Bill Huffman <huffman@...> wrote:
I don’t think a separate mask register will do at all.  It would take a mask register file with at least 8 and maybe 16 registers.  Lots of compare results need to be kept and operations need to be done on mask registers.  I don't think we should have a separate mask register file.

      Bill

-----Original Message-----
From: tech-vector-ext@... <tech-vector-ext@...> On Behalf Of swallach
Sent: Wednesday, December 16, 2020 12:26 PM
To: Alex Solomatnikov <sols@...>
Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/12/04

EXTERNAL MAIL


i  totally agree.  if this is done,  then instructions like:  count bits,  etc can directly apply to the mask register.

also, from a hardware implementation, the VM register can be implemented with LATCHES. this facilitates a better implementation (imho) for operations under mask

and yes load and store VM are required

——


If separate loads and stores are introduced for mask, then separate vmask register can be introduced to avoid dual use of v0 (as a regular vector register and as a mask register) and its complications.

Alex






Re: Vector Task Group minutes 2020/12/04

swallach
 

i guess i am looking at the wrong set of apps.  

in any case VM registers NOT in the vector registers permits a robust and performance optimized operations under mask. 

wrt extra instructions. i am neutral.  


On Dec 16, 2020, at 3:49 PM, Roger Espasa <roger.espasa@...> wrote:


8 mask registers are quite needed in modern outer-vectorized loops. Also in graphics shaders. I would say 16 is overkill.

Now, and I am not defending this, if we had to go this route, I would seriously fight for masks-in-x-registers, i.e., no new state, no new instructions. Only a few arch tricks to try to avoid loss of decoupling between the vector unit and the scalar unit. That’s better than a new set of registers and instructions.

Roger. 

On Wed, 16 Dec 2020 at 21:34, swallach <steven.wallach@...> wrote:
in my experience only one maybe two vm registers are needed

nested loops under if statements is rare.   



> On Dec 16, 2020, at 3:29 PM, Bill Huffman <huffman@...> wrote:
>
> I don’t think a separate mask register will do at all.  It would take a mask register file with at least 8 and maybe 16 registers.  Lots of compare results need to be kept and operations need to be done on mask registers.  I don't think we should have a separate mask register file.
>
>      Bill
>
> -----Original Message-----
> From: tech-vector-ext@... <tech-vector-ext@...> On Behalf Of swallach
> Sent: Wednesday, December 16, 2020 12:26 PM
> To: Alex Solomatnikov <sols@...>
> Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
> Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/12/04
>
> EXTERNAL MAIL
>
>
> i  totally agree.  if this is done,  then instructions like:  count bits,  etc can directly apply to the mask register.
>
> also, from a hardware implementation, the VM register can be implemented with LATCHES. this facilitates a better implementation (imho) for operations under mask
>
> and yes load and store VM are required
>
> ——
>
>
> If separate loads and stores are introduced for mask, then separate vmask register can be introduced to avoid dual use of v0 (as a regular vector register and as a mask register) and its complications.
>
> Alex


Re: Vector Task Group minutes 2020/12/04

Roger Espasa
 

8 mask registers are quite needed in modern outer-vectorized loops. Also in graphics shaders. I would say 16 is overkill.

Now, and I am not defending this, if we had to go this route, I would seriously fight for masks-in-x-registers, i.e., no new state, no new instructions. Only a few arch tricks to try to avoid loss of decoupling between the vector unit and the scalar unit. That’s better than a new set of registers and instructions.

Roger. 

On Wed, 16 Dec 2020 at 21:34, swallach <steven.wallach@...> wrote:
in my experience only one maybe two vm registers are needed

nested loops under if statements is rare.   



> On Dec 16, 2020, at 3:29 PM, Bill Huffman <huffman@...> wrote:
>
> I don’t think a separate mask register will do at all.  It would take a mask register file with at least 8 and maybe 16 registers.  Lots of compare results need to be kept and operations need to be done on mask registers.  I don't think we should have a separate mask register file.
>
>      Bill
>
> -----Original Message-----
> From: tech-vector-ext@... <tech-vector-ext@...> On Behalf Of swallach
> Sent: Wednesday, December 16, 2020 12:26 PM
> To: Alex Solomatnikov <sols@...>
> Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
> Subject: Re: [RISC-V] [tech-vector-ext] Vector Task Group minutes 2020/12/04
>
> EXTERNAL MAIL
>
>
> i  totally agree.  if this is done,  then instructions like:  count bits,  etc can directly apply to the mask register.
>
> also, from a hardware implementation, the VM register can be implemented with LATCHES. this facilitates a better implementation (imho) for operations under mask
>
> and yes load and store VM are required
>
> ——
>
>
> If separate loads and stores are introduced for mask, then separate vmask register can be introduced to avoid dual use of v0 (as a regular vector register and as a mask register) and its complications.
>
> Alex