
Re: Smaller embedded version of the Vector extension

Bruce Hoult
 

Note that the effective LMUL is limited to 8, the same cap as the actual LMUL, so if you've set e32m4 (32-bit elements with LMUL=4) then you can only widen to 64-bit results, not 128-bit.
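As a quick illustration of that cap, here is a tiny arithmetic sketch (plain Python, purely illustrative; `widen` is a hypothetical helper, not part of any spec or toolchain):

```python
# Toy model of the widening rule described above: widening doubles both the
# effective element width (EEW) and the effective LMUL (EMUL), and the
# destination is only encodable while EMUL stays <= 8.

def widen(sew, lmul):
    """Return (eew, emul) of a widening result, or None if EMUL would exceed 8."""
    eew, emul = 2 * sew, 2 * lmul
    return (eew, emul) if emul <= 8 else None

print(widen(32, 4))  # e32m4 widens to a 64-bit result in an EMUL=8 group
print(widen(64, 8))  # cannot widen further: EMUL would need to be 16
```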


 


Re: Smaller embedded version of the Vector extension

Bruce Hoult
 

Yes. The Standard Element Width (SEW) would be limited to 32 bits, but the widening multiplies and accumulates produce the same number of wider results using multiple registers (higher effective LMUL).

See section 5.2. Vector Operands

Each vector operand has an effective element width (EEW) and an effective LMUL (EMUL) that is used to determine the size and location of all the elements within a vector register group. By default, for most operands of most instructions, EEW=SEW and EMUL=LMUL.

Some vector instructions have source and destination vector operands with the same number of elements but different widths, so that EEW and EMUL differ from SEW and LMUL respectively but EEW/EMUL = SEW/LMUL. For example, most widening arithmetic instructions have a source group with EEW=SEW and EMUL=LMUL but destination group with EEW=2*SEW and EMUL=2*LMUL. Narrowing instructions have a source operand that has EEW=2*SEW and EMUL=2*LMUL but destination where EEW=SEW and EMUL=LMUL.

Vector operands or results may occupy one or more vector registers depending on EMUL, but are always specified using the lowest-numbered vector register in the group. Using other than the lowest-numbered vector register to specify a vector register group is a reserved encoding.
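A small sketch of that register-group rule (plain Python, illustrative only; `group_registers` is a made-up helper, not from any toolchain):

```python
# Model of the rule quoted above: a vector operand with EMUL > 1 occupies
# EMUL consecutive registers and must be named by the lowest-numbered
# register in the group; any other register number is a reserved encoding.

def group_registers(vreg, emul):
    """Registers occupied by a group, or None for a reserved encoding."""
    if vreg % emul != 0:
        return None          # not the lowest-numbered register of a group
    return list(range(vreg, vreg + emul))

print(group_registers(8, 4))   # v8 with EMUL=4 -> group v8..v11
print(group_registers(6, 4))   # reserved: v6 is not a multiple of 4
```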




 


Re: Smaller embedded version of the Vector extension

Tony Cole
 

Having 32x 32-bit registers with LMUL=4, giving 8x 128 bits - does this allow for 64-bit elements?

I don't think it does, but it’s not clear in the spec.

 

I use 64-bit elements for “wide” and “quad” accumulators.

 

 


 


Re: Smaller embedded version of the Vector extension

Bruce Hoult
 

There is nothing to prevent implementing 32x 32-bit registers on a 32-bit CPU. The application processor spec has quite recently (a few months ago) specified a 128-bit minimum register size, but I don't think there's any good reason for this, especially in embedded.

With that configuration, LMUL=4 gives 8x 128 bits, the same as MVE.

If floating point is desired then Zfinx is available, sharing int & fp scalar registers instead of fp and vector registers.

Of course profiles (or just custom chips for custom applications) can define subsets of instructions.


 


Re: Smaller embedded version of the Vector extension

Andrew Waterman
 

The uppercase-V V extension is meant to cater to apps processors, where the VLEN >= 128 constraint is not inappropriate and is sometimes beneficial.  But there's nothing fundamental about the ISA design that prohibits VLEN < 128.  A minimal configuration is VLEN=ELEN=32, giving the same total amount of state as MVE.  (And if you set LMUL=4, then you even get the same shape: 8 registers of 128 bits apiece.)
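The state comparison above checks out with back-of-envelope arithmetic (plain Python, nothing implementation-specific):

```python
# A minimal VLEN=ELEN=32 configuration: 32 registers of 32 bits holds the
# same total vector state as MVE's 8 registers of 128 bits, and grouping
# with LMUL=4 even reproduces the same shape.

VLEN, NREGS, LMUL = 32, 32, 4
total_bits = VLEN * NREGS     # total vector state in bits
groups = NREGS // LMUL        # number of register groups at LMUL=4
group_bits = VLEN * LMUL      # bits per register group

print(total_bits)             # 1024, same as 8 x 128
print(groups, group_bits)     # 8 groups of 128 bits apiece
```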

Such a thing wouldn't be called V, but perhaps something like Zvmin.  Other than agreeing on a feature set and assigning it a name, the architecting is already done.

(If you search the spec for Zfinx, you'll see that a Zfinx variant is planned, but only barely sketched out.)



 


Smaller embedded version of the Vector extension

Tariq Kurd
 

Hi everyone,

 

Are there any plans for a cut-down configuration of the vector extension suitable for embedded cores? It seems that the 32x 128-bit register file is suitable for application-class cores, but it is very large for embedded cores, especially if the F registers also need to be implemented (which I think is the case, unless a Zfinx version is specified).

 

ARM MVE only has 8x 128-bit registers for FP and Vector, so it is much more suitable for embedded applications.

https://en.wikichip.org/wiki/arm/helium

 

What’s the approach here? Should embedded applications implement the P-extension instead?

 

Tariq

 

Tariq Kurd

Processor Design | RISC-V Cores, Bristol

E-mail: Tariq.Kurd@...

Company: Huawei Technologies R&D (UK) Ltd | Address: 290 Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32 4TR, UK

 

http://www.huawei.com

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!


 


Re: Check mask all ones / all zeros

Guy Lemieux
 

yeeesh glad i don’t have to stare at that code too long.

i know it’s not your code ...

i think it could use an abs followed by a max reduction, then do the rest as scalar ops?

these macros appear to be targeted towards fixed-width simd. in particular i think they are making an assumption of very short vectors. in this snippet, it appears to want to compute all elements of the vector the same way ... with longer vectors, i would expect to use masks to separate the different computation types so it can be individualized for each element.

i haven’t studied the code in depth, but on the surface the all-mask-ones case seems to be not very useful here, nor does it really help with performance.

guy





Re: Check mask all ones / all zeros

Roger Ferrer Ibanez
 

Hi Guy,

On 20/5/21 12:09, Guy Lemieux wrote:
so, what exactly do you plan to do after knowing the result is all-0
or all-1 ? do you want to initiate a branch or something else? does a
precise (synchronized) result matter, or can you tolerate decoupling
delays?
The code I've been looking at uses this for a branch.

FWIW: this is the SLEEF library (vector math library). An example of how it uses the check can be found at https://github.com/shibatch/sleef/blob/master/src/libm/sleefsimddp.c#L340

(Not claiming that this specific library as written is a good or bad fit for RVV, just looking at the code to get an idea of what are its expectations)

Kind regards,

--
Roger Ferrer Ibáñez - roger.ferrer@bsc.es
Barcelona Supercomputing Center - Centro Nacional de Supercomputación


http://bsc.es/disclaimer


Re: Check mask all ones / all zeros

Guy Lemieux
 

It depends -- exactly what do you plan to do after determining if a
mask is all-0 or all-1 or other?

vpopc and vfirst can both special-case these common results via
precomputation, so they both take minimal cycles. in that regard, they are
equivalent and there is no need to add your special instruction.

the problem is that both vpopc.m and vfirst.m write to the X register
file, which forces synchronization between scalar and vector units.
this may cost extra cycles of stalling ... which may negatively affect
performance. you could introduce a new instruction or a CSR read which
checks the mask result in an asynchronous fashion (or not).

so, what exactly do you plan to do after knowing the result is all-0
or all-1 ? do you want to initiate a branch or something else? does a
precise (synchronized) result matter, or can you tolerate decoupling
delays?

for example, it could be possible to specify that a CSR contains the
result of a mask being all-0, all-1, or otherwise, and that this CSR
is asynchronously updated. hence, a scalar control loop may operate
until the all-0 result is finally true without causing any hard
synchronization with the vector unit. this sort of approach would work
for some computations, e.g. Mandelbrot, which require a change in the
control flow after all units have achieved a certain status, and where
there is no harm to continuing an extra iteration or two due to
latency between vector instructions and the CSR.


g



On Wed, May 19, 2021 at 10:49 PM Roger Ferrer Ibanez
<roger.ferrer@bsc.es> wrote:

Hi all,

I could not find any instruction that immediately computes this. Apologies if I missed the obvious here.

Two options came to mind:

vpopc.m and check whether the result is 0 (all zeros) or VLMAX(SEW, LMUL). I am under the impression that population count is not a fast operation (though I guess it depends on the actual VLEN)
vfirst.m, returns -1 it the mask is all zeros. For all ones we can do vmnot.m first and then vfirst.m. Might not be much faster than vpopc.m but (at expense of vmnot.m) does not need to compute VLMAX(SEW,LMUL).

Perhaps there are other alternatives?

Thoughts on whether it'd make sense to have a specific instruction for these checks? As in one instruction that returns one of three possible results (e.g. 1 for all ones, -1 for all zeros, 0 otherwise) in a GPR.

Thank you very much,

--
Roger Ferrer Ibáñez - roger.ferrer@bsc.es
Barcelona Supercomputing Center - Centro Nacional de Supercomputación



WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer


Re: Check mask all ones / all zeros

Andrew Waterman
 




Er, nevermind, I got that wrong again.  Roger's instruction can also early-out with slightly more complexity (if at least one 1 and at least one 0 is detected).





Re: Check mask all ones / all zeros

Andrew Waterman
 




Yeah, it would've been more precise of me to have compared vpopc.m against Roger's hypothetical new instruction, which also must process all bits.




Re: Check mask all ones / all zeros

Krste Asanovic
 

Actually, vfirst.m can be implemented with an early out on long temporal vector machines, whereas vpopc.m has to process all bits.

If the common case for the input data is that all bits would be set/clear, then the choice doesn't really matter, but if an early out is commonly possible (i.e. the test fails), I'd go with vfirst.m.

Krste



Re: Check mask all ones / all zeros

Roger Ferrer Ibanez
 

Hi Andrew,

thanks for the prompt and insightful answer. I'll use vpopc.m then.

On 20/5/21 8:25, Andrew Waterman wrote:
PS. You probably already have the current vector length in a GPR, and that quantity is probably the more appropriate thing to compare against than VLMAX.  So you probably don't need to go to the trouble of materializing VLMAX.

Indeed, my question was motivated while looking at some code that operates on whole registers but it can definitely be generalised to any vector length.

Kind regards,

-- 
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación


WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer


Re: Check mask all ones / all zeros

Andrew Waterman
 



On Wed, May 19, 2021 at 10:49 PM Roger Ferrer Ibanez <roger.ferrer@...> wrote:

Hi all,

I could not find any instruction that immediately computes this. Apologies if I missed the obvious here.

Two options came to mind:

  • vpopc.m and check whether the result is 0 (all zeros) or VLMAX(SEW, LMUL). I am under the impression that population count is not a fast operation (though I guess it depends on the actual VLEN)
I think this approach is sufficient, actually.

On the machines I've worked on so far, vpopc.m is no slower than vfirst.m.

For machines with very wide spatial vectors, you could imagine vpopc.m being slightly higher latency than vfirst.m (say, one extra clock cycle) because of the depth of the reduction tree.  But this shouldn't be a dominant effect: in a machine like that, surely the data movement latency will be a more prominent factor than the reduction latency, since the latter scales logarithmically.

PS. You probably already have the current vector length in a GPR, and that quantity is probably the more appropriate thing to compare against than VLMAX.  So you probably don't need to go to the trouble of materializing VLMAX.


Check mask all ones / all zeros

Roger Ferrer Ibanez
 

Hi all,

I could not find any instruction that immediately computes this. Apologies if I missed the obvious here.

Two options came to mind:

  • vpopc.m and check whether the result is 0 (all zeros) or VLMAX(SEW, LMUL). I am under the impression that population count is not a fast operation (though I guess it depends on the actual VLEN).
  • vfirst.m returns -1 if the mask is all zeros. For all ones, we can apply vmnot.m first and then vfirst.m. This might not be much faster than vpopc.m, but (at the expense of the vmnot.m) it does not need to compute VLMAX(SEW, LMUL).

Perhaps there are other alternatives?

Thoughts on whether it would make sense to have a specific instruction for these checks? That is, one instruction that returns one of three possible results (e.g., 1 for all ones, -1 for all zeros, 0 otherwise) in a GPR.
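The proposed three-way check could be modeled in scalar C roughly as follows. This is a sketch of the suggested semantics only (no such instruction exists): a 64-bit integer stands in for the mask register, vl for the active vector length, and the name mask_classify is hypothetical.

```c
#include <stdint.h>

/* Scalar model of the proposed three-way mask check.  Mirrors the
 * vfirst.m approach: test for any set bit, then for any clear bit in
 * the complement, rather than counting all bits. */
static int mask_classify(uint64_t mask, unsigned vl) {
    uint64_t lanes = (vl >= 64) ? ~0ULL : ((1ULL << vl) - 1);
    uint64_t active = mask & lanes;
    if (active == 0)
        return -1;  /* vfirst.m on the mask would return -1: all zeros */
    if ((~mask & lanes) == 0)
        return 1;   /* vfirst.m on the vmnot.m result returns -1: all ones */
    return 0;       /* mixed */
}
```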

Thank you very much,

-- 
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación


WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer


Re: LLVM with RVV intrinsic support

David Horner
 

Excellent.
Congratulations, and thank you!!

On Fri, May 14, 2021, 05:21 Kai Wang <kai.wang@...> wrote:



LLVM with RVV intrinsic support

Kai Wang
 

Hi,

We would like to announce that the RISC-V V-extension v0.10 has been implemented in LLVM and the work has been committed upstream.


Barcelona Supercomputing Center (BSC), Codeplay Software, and SiFive have worked together to implement the RVV C API intrinsics for the V-extension and to lay the foundation of CodeGen for Vector Length Specific (VLS) and Vector Length Agnostic (VLA) autovectorization for RISC-V.


What we have committed to LLVM upstream:

* Support for the v0.10 V-extension specification

* Support for the RVV C intrinsics in https://github.com/riscv/rvv-intrinsic-doc/tree/v0.10

* Implementation of the draft vector calling convention in https://github.com/riscv/riscv-elf-psabi-doc/pull/171


Known issues:

* The implementation of the C intrinsics for Zvlsseg is under discussion:

 - https://lists.llvm.org/pipermail/llvm-dev/2021-March/149518.html

* Which type to use for fp16 is under discussion:

 - https://github.com/riscv/rvv-intrinsic-doc/issues/18#issuecomment-818472454


RISC-V RVV example:

https://github.com/riscv/rvv-intrinsic-doc/blob/master/rvv_saxpy.c


Build command:

clang --target=riscv64-unknown-elf -march=rv64gcv0p10 -menable-experimental-extensions rvv_saxpy.c -o rvv_saxpy.elf



Re: vector intrinsics for both RV32/RV64

Jim Wilson
 

On Wed, May 12, 2021 at 10:52 AM Guy Lemieux <guy.lemieux@...> wrote:
I’m starting a project where we want to use vector intrinsics and generate both 64b and 32b code (for RV64 and RV32).
It looks like the best way to do this right now is with GCC, where we were able to find up-to-date intrinsics for the v0.10 spec:

There is a GCC RVV port from SiFive, but it has been dormant for months and is not being actively maintained at the moment. You are better off using LLVM instead, which is being actively worked on by multiple parties, including SiFive.
Jim


Re: vector intrinsics for both RV32/RV64

Craig Topper
 

Hi Guy,

The latest LLVM git repository should have support for all intrinsics except segment load/store. The intrinsics missed the branch window for the LLVM 12 release, but should be in LLVM 13 when it is released in the second half of the year.

The riscv_vector.h header is autogenerated from other files when clang is built, so you won’t find the header in the repository.

~Craig

On May 12, 2021, at 10:52 AM, Guy Lemieux <guy.lemieux@...> wrote:




vector intrinsics for both RV32/RV64

Guy Lemieux
 

Hi,

I’m starting a project where we want to use vector intrinsics and generate both 64b and 32b code (for RV64 and RV32).

It looks like the best way to do this right now is with GCC, where we were able to find up-to-date intrinsics for the v0.10 spec:


Is there a similar capability with LLVM? Vector support seems to have been added, but there are no up-to-date intrinsics yet. This is the closest I could find, but it appears to be a bit out of date (vector spec 0.8) and RV32-only:


Sorry if this is an obvious question — I haven’t dug very deeply into this yet, but I thought this group would be able to give me better answers and save me a bit of time.

Thanks for any pointers.

Guy

