Smaller embedded version of the Vector extension


Tariq Kurd <tariq.kurd@...>
 

Hi everyone,

 

Are there any plans for a cut-down configuration of the vector extension suitable for embedded cores? It seems that the 32x128-bit register file is suitable for application class cores but is very large for embedded cores, especially if the F registers also need to be implemented (which I think is the case, unless a Zfinx version is specified).

 

ARM MVE only has 8x128-bit registers for FP and Vector, so it is much more suitable for embedded applications.

https://en.wikichip.org/wiki/arm/helium

 

What’s the approach here? Should embedded applications implement the P-extension instead?

 

Tariq

 

Tariq Kurd

Processor Design | RISC-V Cores, Bristol

E-mail: Tariq.Kurd@...

Company: Huawei technologies R&D (UK) Ltd | Address: 290 Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32 4TR, UK

 

http://www.huawei.com

This e-mail and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure,reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it !


 


Andrew Waterman <andrew@...>
 

The uppercase-V V extension is meant to cater to apps processors, where the VLEN >= 128 constraint is not inappropriate and is sometimes beneficial.  But there's nothing fundamental about the ISA design that prohibits VLEN < 128.  A minimal configuration is VLEN=ELEN=32, giving the same total amount of state as MVE.  (And if you set LMUL=4, then you even get the same shape: 8 registers of 128 bits apiece.)
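As a quick check of the arithmetic above, here is a minimal sketch that compares the total vector state and the LMUL-grouped shape of the configurations mentioned in this thread. The helper names `regfile_bits` and `grouped_shape` are made up for illustration and are not part of any spec or toolchain.

```python
def regfile_bits(num_regs, vlen):
    """Total architectural vector state in bits (hypothetical helper)."""
    return num_regs * vlen


def grouped_shape(num_regs, vlen, lmul):
    """Return (number of register groups, bits per group) for an integer LMUL."""
    return num_regs // lmul, vlen * lmul


mve_bits = regfile_bits(8, 128)     # Arm MVE: 8 registers of 128 bits
small_bits = regfile_bits(32, 32)   # minimal RVV configuration: VLEN=32
v_bits = regfile_bits(32, 128)      # V with VLEN=128

print(mve_bits, small_bits, v_bits)   # 1024 1024 4096
print(grouped_shape(32, 32, 4))       # (8, 128)
```

With LMUL=4, the 32 registers of 32 bits are addressed as 8 groups of 128 bits, which is the MVE shape referred to above.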

Such a thing wouldn't be called V, but perhaps something like Zvmin.  Other than agreeing on a feature set and assigning it a name, the architecting is already done.

(If you search the spec for Zfinx, you'll see that a Zfinx variant is planned, but only barely sketched out.)



Bruce Hoult
 

There is nothing to prevent implementing 32x 32-bit registers on a 32-bit CPU. The application processor spec has quite
recently (a few months ago) specified a 128-bit minimum register size, but I don't think there's any good reason for this,
especially in embedded.

With that configuration, LMUL=4 gives 8x 128 bits, the same as MVE.

If floating point is desired then Zfinx is available, sharing int & fp scalar registers instead of fp and vector registers.

Of course profiles (or just custom chips for custom applications) can define subsets of instructions.


Tony Cole
 

Having 32x 32 bit registers with LMUL=4, giving 8x 128 bits - does this allow for 64-bit elements?

I don't think it does, but it’s not clear in the spec.

 

I use 64-bit elements for “wide” and “quad” accumulators.

 

 


Bruce Hoult
 

Yes. The selected element width (SEW) would be limited to 32 bits, but the widening multiplies and accumulates produce the same number of wider results using multiple registers (a higher effective LMUL).

See section 5.2. Vector Operands

Each vector operand has an effective element width (EEW) and an effective LMUL (EMUL) that is used to determine the size and location of all the elements within a vector register group. By default, for most operands of most instructions, EEW=SEW and EMUL=LMUL.

Some vector instructions have source and destination vector operands with the same number of elements but different widths, so that EEW and EMUL differ from SEW and LMUL respectively but EEW/EMUL = SEW/LMUL. For example, most widening arithmetic instructions have a source group with EEW=SEW and EMUL=LMUL but destination group with EEW=2*SEW and EMUL=2*LMUL. Narrowing instructions have a source operand that has EEW=2*SEW and EMUL=2*LMUL but destination where EEW=SEW and EMUL=LMUL.

Vector operands or results may occupy one or more vector registers depending on EMUL, but are always specified using the lowest-numbered vector register in the group. Using other than the lowest-numbered vector register to specify a vector register group is a reserved encoding.
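The EEW/EMUL relationship quoted above can be sketched as a few lines of arithmetic. This is only an illustration under stated assumptions: the function names are hypothetical, the example uses SEW=16 on a VLEN=32 machine, and it does not model which element widths a given implementation actually supports.

```python
from fractions import Fraction


def widened_operand(sew, lmul, factor=2):
    """Return (EEW, EMUL) for the wide operand of a widening or narrowing op."""
    return sew * factor, Fraction(lmul) * factor


def element_count(vlen, eew, emul):
    """Elements held by one register group: EMUL * VLEN / EEW."""
    return Fraction(emul) * vlen // eew


vlen, sew, lmul = 32, 16, 4
eew, emul = widened_operand(sew, lmul)

# The ratio EEW/EMUL equals SEW/LMUL, so both groups hold the same element count.
print(eew, emul)                                                      # 32 8
print(element_count(vlen, sew, lmul), element_count(vlen, eew, emul))  # 8 8
```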




Bruce Hoult
 

Note that the effective LMUL is limited to 8, the same as the actual LMUL, so if you've set e32m4 (32 bit elements with LMUL=4) then you can only widen to 64 bit results, not 128 bit. 
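A minimal sketch of that limit, assuming only the EMUL <= 8 rule stated above (the name `can_widen` is hypothetical, and any limit imposed by the supported element widths is not modelled here):

```python
from fractions import Fraction

MAX_EMUL = 8  # largest register group, as stated above


def can_widen(lmul, factor):
    """True if widening by `factor` keeps the effective LMUL within the limit."""
    return Fraction(lmul) * factor <= MAX_EMUL


print(can_widen(4, 2))  # e32m4 -> 64-bit results, EMUL=8: True
print(can_widen(4, 4))  # e32m4 -> 128-bit results, EMUL=16: False
print(can_widen(8, 2))  # LMUL=8 leaves no room to widen: False
```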


Guy Lemieux
 

Allowing VLEN<128 would allow for smaller vector register files, but it would also result in a profile that is not forward-compatible with the V spec. This would produce another fracture in the software ecosystem.

To avoid such a fracture, there are two choices:
(1) go with P instead
(2) relax the V spec to allow smaller implementations

So the key question for this group is whether to relax the minimum VLEN to 32 or 64?

Note: a possible justification for keeping 128 might be to recommend (1) instead. I don’t know anything about P, but it seems like it could be specified in a way that is competitive/comparable with Helium.

Guy

PS — I have started to design an “RVV-lite” profile which would be more amenable to embedded implementations. However, I have adopted a stance that it must remain forward compatible with the full V spec, so I have not considered VLEN below 128. I am happy to share my work on this and involve other contributors — email me if you would like to see a copy.



Tony Cole
 

So, (on a 32x 32-bit vector register machine) the widening and narrowing instructions can use 64-bit elements (for destination and source respectively), but not any of the other instructions, correct?

 

Note: I use many instructions while processing 64-bit “wide” and “quad” elements, e.g. vrgather_vx_i64m8, vslide1down_vx_i64m4, vslidedown_vx_i64m8, vredsum_vs_i64m8, etc.

 

Therefore, this code would not work on a 32x 32-bit vector register machine.
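To make the point concrete, here is a rough sketch that reads the element width out of the intrinsic names listed above and compares it with the widest element a machine supports. It assumes that an implementation rejects any SEW larger than its ELEN; the helper names are hypothetical, and the suffix parser only handles integer LMUL suffixes such as `i64m8`.

```python
import re


def parse_type_suffix(name):
    """Extract (SEW, LMUL) from a trailing type suffix such as '_i64m8'."""
    match = re.search(r"_[iuf](\d+)m(\d+)$", name)
    if match is None:
        raise ValueError("no recognised type suffix in " + name)
    return int(match.group(1)), int(match.group(2))


def usable_with_elen(name, elen):
    """Assumption: an operation is usable only if its SEW fits within ELEN."""
    sew, _lmul = parse_type_suffix(name)
    return sew <= elen


intrinsics = [
    "vrgather_vx_i64m8",
    "vslide1down_vx_i64m4",
    "vslidedown_vx_i64m8",
    "vredsum_vs_i64m8",
]

for name in intrinsics:
    # Every one of these needs 64-bit elements, so ELEN=32 rules them out.
    print(name, usable_with_elen(name, 32), usable_with_elen(name, 64))
```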

 

 

Tony

 

 


Bruce Hoult
 

I am not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.

I think I saw somewhere they are largely intended for debuggers.


Tariq Kurd <tariq.kurd@...>
 

OK, so it seems that to run our software (which Tony Cole referred to) we need VLEN>=64 for our embedded application.

Is there any scope for reducing the number of V registers? Could RV32E_Vmin have 16 X and V registers?

I know it doesn’t affect the number of F registers, which is tackled by having Zfinx instead to save area – but it seems that we need another solution for the vectors.

 

Then we can match ARM MVE for area – 8x128-bit compared to 16x64-bit

 

Tariq

 

From: tech-vector-ext@... <tech-vector-ext@...> On Behalf Of Bruce Hoult
Sent: 02 June 2021 13:34
To: Tony Cole <tony.cole@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

I an not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.

 

I think I saw somewhere they are largely intended for debuggers.

 

On Thu, Jun 3, 2021 at 12:15 AM Tony Cole <tony.cole@...> wrote:

So, (on a 32x 32-bit vector register machine) the widening and narrowing instructions can use 64-bit elements (for destination and source respectively), but not any of other instructions, correct?

 

Note: I use many instructions while processing 64-bit “wide” and “quad” elements, e.g. vrgather_vx_i64m8, vslide1down_vx_i64m4, vslidedown_vx_i64m8, vredsum_vs_i64m8, etc.

 

Therefore, this code would not work on a 32x 32-bit vector register machine.

 

 

Tony

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 12:18
To: Tony Cole <tony.cole@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

Note that the effective LMUL is limited to 8, the same as the actual LMUL, so if you've set e32m4 (32 bit elements with LMUL=4) then you can only widen to 64 bit results, not 128 bit. 

 

On Wed, Jun 2, 2021 at 11:15 PM Bruce Hoult <bruce@...> wrote:

Yes. The selected element width (SEW) would be limited to 32 bits, but the widening multiplies and accumulates produce the same number of wider results using multiple registers (higher effective LMUL).

 

See section 5.2. Vector Operands

 

Each vector operand has an effective element width (EEW) and an effective LMUL (EMUL) that is used to determine the size and location of all the elements within a vector register group. By default, for most operands of most instructions, EEW=SEW and EMUL=LMUL.


Some vector instructions have source and destination vector operands with the same number of elements but different widths, so that EEW and EMUL differ from SEW and LMUL respectively but EEW/EMUL = SEW/LMUL. For example, most widening arithmetic instructions have a source group with EEW=SEW and EMUL=LMUL but destination group with EEW=2*SEW and EMUL=2*LMUL. Narrowing instructions have a source operand that has EEW=2*SEW and EMUL=2*LMUL but destination where EEW=SEW and EMUL=LMUL.

Vector operands or results may occupy one or more vector registers depending on EMUL, but are always specified using the lowest-numbered vector register in the group. Using other than the lowest-numbered vector register to specify a vector register group is a reserved encoding.
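
The EEW/EMUL rule quoted above can be checked with a few lines of arithmetic. The following is only a sketch in Python: the function name `widening_dest` and the constant `MAX_EMUL` are made up for illustration, and the model covers only the EMUL <= 8 cap and the EEW/EMUL = SEW/LMUL invariant, not any ELEN limit an implementation may impose.

```python
from fractions import Fraction

MAX_EMUL = 8  # assumption for this sketch: a register group holds at most 8 registers


def widening_dest(sew, lmul):
    """Return (EEW, EMUL) for the destination of a 2x widening operation,
    or None when the doubled EMUL would exceed MAX_EMUL."""
    eew = 2 * sew
    emul = 2 * Fraction(lmul)
    if emul > MAX_EMUL:
        return None
    # The ratio is preserved, so source and destination hold the same
    # number of elements.
    assert Fraction(eew) / emul == Fraction(sew) / Fraction(lmul)
    return eew, emul


# e32m4 can widen once (to 64-bit elements in an 8-register group) ...
print(widening_dest(32, 4))
# ... but a further widening to 128-bit elements would need EMUL=16.
print(widening_dest(64, 8))
```

Under these assumptions the first call yields EEW=64 with EMUL=8 and the second is rejected, which is the e32m4 case described in the message above.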

 

 

 

On Wed, Jun 2, 2021 at 11:11 PM Tony Cole <tony.cole@...> wrote:

Having 32x 32 bit registers with LMUL=4, giving 8x 128 bits - does this allow for 64-bit elements?

I don't think it does, but it’s not clear in the spec.

 

I use 64-bit elements for “wide” and “quad” accumulators.

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 11:19
To: Tariq Kurd <tariq.kurd@...>
Cc: tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

There is nothing to prevent implementing 32x 32 bit registers on a 32 bit CPU. The application processor spec has quite recently (a few months) specified a 128 bit minimum register size but I don't think there's any good reason for this, especially in embedded.

 

With that configuration, LMUL=4 gives 8x 128 bits, the same as MVE.

 

If floating point is desired then Zfinx is available, sharing int & fp scalar registers instead of fp and vector registers.

 

Of course profiles (or just custom chips for custom applications) can define subsets of instructions.

 

On Wed, Jun 2, 2021 at 10:05 PM Tariq Kurd via lists.riscv.org <tariq.kurd=huawei.com@...> wrote:

Hi everyone,

 

Are there any plans for a cut-down configuration of the vector extension suitable for embedded cores? It seems that the 32x128-bit register file is suitable for application class cores but it is very large for embedded cores, especially if the F registers also need to be implemented (which I think is the case, unless a Zfinx version is specified).

 

ARM MVE only has 8x128-bit registers for FP and Vector, so it is much more suitable for embedded applications.

https://en.wikichip.org/wiki/arm/helium

 

What’s the approach here? Should embedded applications implement the P-extension instead?

 

Tariq

 



andrew@...
 

It’s actually not fundamental to the ISA design that VLEN >= ELEN. An implementation with VLEN=32 could support SEW=64 whenever LMUL >= 2. This approach starts to pose code-generation headaches, but it is at least theoretically viable.

As compared to cutting the number of registers in half, the above approach has the advantage of offering more vector registers when longer elements are not needed, even though the total storage cost is the same.
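
As a rough illustration of the storage argument above, the sketch below compares the total register-file state of the configurations mentioned in this thread and shows how many 64-bit elements fit in a register group when VLEN=32. It is plain arithmetic in Python; the helper names `total_bits` and `elements_per_group` are invented for this example and do not come from the spec.

```python
def total_bits(num_regs, vlen):
    """Total vector register-file state in bits."""
    return num_regs * vlen


def elements_per_group(vlen, sew, lmul):
    """VLMAX = LMUL * VLEN / SEW; 0 means one element does not fit."""
    return (lmul * vlen) // sew


configs = {
    "8 regs x 128 bits (MVE-like)": (8, 128),
    "16 regs x 64 bits": (16, 64),
    "32 regs x 32 bits": (32, 32),
}
for name, (num_regs, vlen) in configs.items():
    print(f"{name}: {total_bits(num_regs, vlen)} bits")

# With VLEN=32, a 64-bit element needs a group of at least two registers.
for lmul in (1, 2, 4, 8):
    groups = 32 // lmul
    print(f"VLEN=32 LMUL={lmul}: {groups} groups, "
          f"{elements_per_group(32, 64, lmul)} x 64-bit elements per group")
```

All three configurations hold the same amount of state, but only the 32-register layout still offers 32 separately named registers when the code works on 32-bit or narrower elements.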


Guy Lemieux
 

For widening and narrowing instructions to work, the V spec depends upon changing SEW (to EEW) and LMUL (to EMUL), such that EEW/EMUL == SEW/LMUL. That is, to change the element size (widen or narrow) to EEW, one must also change the EMUL setting accordingly.

In my RVV-lite proposal, I recommend a simplification where the only settings permitted are SEW/LMUL = 8/1, 16/2, 32/4, and 64/8, thereby creating 32 named registers of bytes, 16 of halfwords, 8 of words, and 4 of doublewords. This allows the widening and narrowing to work, and it ensures that VLMAX is the same for all element sizes. The primary negative side effect is that fewer named registers are available for the larger sizes, but this seems an acceptable simplification of both hardware and software.

In other words, if you want to further reduce the number of named registers below the 32 specified by V, then you will have to consider the impact on the narrowing/widening instructions. For example, you could fix SEW/LMUL at 16, e.g. SEW/LMUL = 8/0.5, which under-utilizes vector data storage by 50% if you are operating on bytes. Or, you could remove widening/narrowing instructions entirely. Or, you could introduce new widening/narrowing instructions that do not use EEW and/or EMUL (e.g., they fix EMUL==LMUL, and deal with the shortening of VLMAX somehow).

Guy
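
The register counts in the RVV-lite proposal above follow directly from the fixed SEW/LMUL pairs. The sketch below tabulates them in Python; `VLEN = 128` is an arbitrary illustrative value, and the helper names are assumptions made for this example rather than anything defined by the proposal.

```python
from fractions import Fraction

NUM_REGS = 32  # architectural vector registers in V
VLEN = 128     # illustrative value only; any VLEN gives the same pattern

# The only SEW/LMUL pairs permitted in the proposal described above.
RVV_LITE = [(8, 1), (16, 2), (32, 4), (64, 8)]


def named_groups(lmul):
    """Number of distinct register groups that can be named at this LMUL."""
    lmul = Fraction(lmul)
    return int(NUM_REGS / lmul) if lmul >= 1 else NUM_REGS


def vlmax(vlen, sew, lmul):
    """VLMAX = LMUL * VLEN / SEW."""
    return int(vlen * Fraction(lmul) / sew)


def used_fraction(lmul):
    """Fraction of each named register's storage that holds data."""
    return min(Fraction(lmul), Fraction(1))


for sew, lmul in RVV_LITE:
    print(f"SEW={sew:2d} LMUL={lmul}: {named_groups(lmul):2d} named groups, "
          f"VLMAX={vlmax(VLEN, sew, lmul)}")

# The alternative of fixing SEW/LMUL at 16: bytes then run at LMUL=1/2.
print("bytes at LMUL=1/2 use", used_fraction(Fraction(1, 2)), "of each register")
```

With the ratio SEW/LMUL held at 8, VLMAX comes out identical for every element size, while the number of nameable groups drops from 32 to 4 as elements widen.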




Guy Lemieux
 



On Wed, Jun 2, 2021 at 8:38 AM Andrew Waterman <andrew@...> wrote:
It’s actually not fundamental to the ISA design that VLEN >= ELEN. An implementation with VLEN=32 could support SEW=64 whenever LMUL >= 2. 

I think the concern here is the lack of a clearly defined data layout pattern for such cases.

E.g., should the LSBs be in the odd or even register half, or should it be implementation-defined?

Guy
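
To make the layout question above concrete, here is a purely hypothetical sketch in Python of the two obvious ways one 64-bit element could be split across a pair of 32-bit registers. Neither layout is taken from the spec; `split64`, `join64`, and the `low_in_even` flag are names invented for this illustration, and only a single element in a single register pair is modelled.

```python
MASK32 = 0xFFFFFFFF


def split64(value, low_in_even=True):
    """Return (even_reg, odd_reg) contents for one 64-bit element."""
    lo = value & MASK32
    hi = (value >> 32) & MASK32
    return (lo, hi) if low_in_even else (hi, lo)


def join64(even_reg, odd_reg, low_in_even=True):
    """Reassemble the 64-bit element from the register pair."""
    lo, hi = (even_reg, odd_reg) if low_in_even else (odd_reg, even_reg)
    return (hi << 32) | lo


x = 0x1122334455667788
even, odd = split64(x, low_in_even=True)
print(hex(even), hex(odd))

# Reading the pair back with the other convention silently swaps the halves,
# which is why software would need the layout to be pinned down.
print(hex(join64(even, odd, low_in_even=False)))
```

Code that moves data between element widths through the same registers would observe different results under the two conventions, so an implementation-defined choice would leak into software.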


Tony Cole
 

Hi Bruce,

 

“I am not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.”

 

I don't see what you mean. Please can you elaborate, with examples, on why/how it exposes the size of the vector register in a very unfortunate way and how it breaks down if VLEN=1 (do you mean LMUL=1?).

 

The vslide instruction speeds up my code a lot because it avoids reloading (mostly the same) data over and over again.

 

 

Tony
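
The trade-off discussed above (sliding data inside a register versus reloading it from memory at an offset) can be modelled in a few lines. This is a Python sketch of a sliding-window sum, not the actual code referred to in the message; `vslide1down` here is a list-based stand-in that mimics the instruction's element movement (each element moves down one place and a scalar fills the top), and the other function names are assumptions.

```python
def vslide1down(vs2, x):
    """Model of vslide1down.vx: vd[i] = vs2[i+1], vd[vl-1] = x."""
    return vs2[1:] + [x]


def window_sums_reload(data, vl, taps):
    """Reload variant: one vector load per tap, each at a new offset."""
    acc = [0] * vl
    for t in range(taps):
        v = data[t:t + vl]  # stands in for a unit-stride vector load
        acc = [a + b for a, b in zip(acc, v)]
    return acc


def window_sums_slide(data, vl, taps):
    """Slide variant: one vector load, then one scalar fetch per extra tap."""
    v = data[0:vl]  # the only vector load
    acc = list(v)
    for t in range(1, taps):
        v = vslide1down(v, data[vl + t - 1])  # scalar element enters at the top
        acc = [a + b for a, b in zip(acc, v)]
    return acc


data = list(range(10))
print(window_sums_reload(data, vl=4, taps=3))
print(window_sums_slide(data, vl=4, taps=3))
```

Both variants compute the same sums; the difference is that the slide variant touches memory for one vector plus one scalar per additional tap, whereas the reload variant re-reads mostly the same elements for every tap.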

 

 



Thang Tran
 

It seems that restricting the minimum LMUL to 2 would halve the number of vector registers (to 16), and a minimum LMUL of 4 would leave 8 vector registers.

Thang

 



Krste Asanovic
 

The VLEN>=128 constraint is only for the application processor "V"
extension for the app profile - not for embedded vectors which can
have VLEN=32.

From spec Introduction:
'
The term base vector extension is used informally to describe the standard set of vector ISA components that will be required for the single-letter "V" extension, which is intended for use in standard server and application-processor platform profiles. The set of mandatory instructions and supported element widths will vary with the base ISA (RV32I, RV64I) as described below.

Other profiles, including embedded profiles, may choose to mandate only subsets of these extensions. The exact set of mandatory supported instructions for an implementation to be compliant with a given profile will only be determined when each profile spec is ratified. For convenience in defining subset profiles, vector instruction subsets are given ISA string names beginning with the "Zv" prefix.
'

There is a set of Zve* names for the embedded subsets (see github issue
#550).

A minimal embedded implementation using RV32E+Zfinx+vectors would have
the same state size as ARM MVE.

The P extension does not have floating-point, but it makes sense as an
alternative for short integer/fixed-point SIMD.

The software fragmentation issue is that some library routines that
expose VLEN might not be portable between app cores and embedded
cores, but these are different software ecosystems (e.g. ABI/calling
convention might be different) and only a few kinds of routine rely on
VLEN.

For app cores that can afford VLEN>=128, the advantage is the removal
of stripmining code in cases that operate on fixed-size vectors.
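
To illustrate the stripmining point, here is a rough Python model. It is a sketch only: `vsetvl` here is a simplified stand-in that grants min(AVL, VLMAX) elements, and `vector_add` is a hypothetical routine, not real vector code.

```python
def vsetvl(avl, vlen_bits, sew_bits, lmul=1):
    """Simplified model of vsetvl: grant min(AVL, VLMAX) elements."""
    vlmax = vlen_bits * lmul // sew_bits
    return min(avl, vlmax)

def vector_add(a, b, vlen_bits, sew_bits=32):
    """Stripmined add: loop until every element has been processed."""
    out, i, passes = [], 0, 0
    while i < len(a):
        vl = vsetvl(len(a) - i, vlen_bits, sew_bits)
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
        passes += 1
    return out, passes

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print(vector_add(a, b, vlen_bits=128))  # ([11, 22, 33, 44], 1)
print(vector_add(a, b, vlen_bits=32))   # ([11, 22, 33, 44], 4)
```

With VLEN>=128 guaranteed, a fixed 4 x 32-bit operation always completes in one pass, so the loop can be dropped; on a VLEN=32 core the same routine needs the loop.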

Krste



On Wed, 2 Jun 2021 05:10:32 -0700, "Guy Lemieux" <guy.lemieux@...> said:
| Allowing VLEN<128 would allow for smaller vector register files, but it would
| also result in a profile that is not forward-compatible with the V spec. This
| would produce another fracture in the software ecosystem.

| To avoid such a fracture, there are two choices:
| (1) go with P instead
| (2) relax the V spec to allow smaller implementations

| So the key question for this group is whether to relax the minimum VLEN to 32
| or 64?

| Note: a possible justification for keeping 128 might be to recommend (1)
| instead. I don’t know anything about P, but it seems like it could be specified
| in a way that is competitive/comparable with Helium.

| Guy

| PS — I have started to design an “RVV-lite” profile which would be more
| amenable to embedded implementations. However, I have adopted a stance that it
| must remain forward compatible with the full V spec, so I have not considered
| VLEN below 128. I am happy to share my work on this and involve other
| contributors — email me if you would like to see a copy.

| On Wed, Jun 2, 2021 at 3:15 AM Andrew Waterman <andrew@...> wrote:

| The uppercase-V V extension is meant to cater to apps processors, where
| the VLEN >= 128 constraint is not inappropriate and is sometimes
| beneficial.  But there's nothing fundamental about the ISA design that
| prohibits VLEN < 128.  A minimal configuration is VLEN=ELEN=32, giving the
| same total amount of state as MVE.  (And if you set LMUL=4, then you even
| get the same shape: 8 registers of 128 bits apiece.)

| Such a thing wouldn't be called V, but perhaps something like Zvmin. 
| Other than agreeing on a feature set and assigning it a name, the
| architecting is already done.

| (If you search the spec for Zfinx, you'll see that a Zfinx variant is
| planned, but only barely sketched out.)



Krste Asanovic
 

We do allow supported SEW to vary with LMUL, so an implementation can
support single-width operations on SEW=64. See section 4.5.

Krste

On Wed, 2 Jun 2021 12:14:33 +0000, "Tony Cole via lists.riscv.org" <tony.cole=huawei.com@...> said:
| So, (on a 32x 32-bit vector register machine) the widening and narrowing
| instructions can use 64-bit elements (for destination and source
| respectively), but none of the other instructions can, correct?

| Note: I use many instructions while processing 64-bit “wide” and “quad”
| elements, e.g. vrgather_vx_i64m8, vslide1down_vx_i64m4, vslidedown_vx_i64m8,
| vredsum_vs_i64m8, etc.

| Therefore, this code would not work on a 32x 32-bit vector register machine.

| Tony

| From: tech-vector-ext@... [mailto:tech-vector-ext@...]
| On Behalf Of Bruce Hoult
| Sent: 02 June 2021 12:18
| To: Tony Cole <tony.cole@...>
| Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...;
| Shaofei (B) <shaofei1@...>
| Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector
| extension

| Note that the effective LMUL is limited to 8, the same as the actual LMUL, so
| if you've set e32m4 (32 bit elements with LMUL=4) then you can only widen to
| 64 bit results, not 128 bit.
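
The EMUL limit described above can be sketched in a few lines of Python. This models only the EMUL <= 8 constraint; any ELEN limit an implementation imposes is deliberately left out, and the function name is hypothetical.

```python
from fractions import Fraction

MAX_EMUL = 8  # largest register group size, per the message above

def widen(sew_bits, lmul):
    """Return (EEW, EMUL) of a widening destination, or None if EMUL would exceed 8."""
    eew, emul = 2 * sew_bits, 2 * Fraction(lmul)
    if emul > MAX_EMUL:
        return None
    return eew, emul

print(widen(32, 4))  # e32m4 widens to 64-bit results in an 8-register group
print(widen(64, 8))  # None: widening again to 128 bits would need EMUL=16
```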

| On Wed, Jun 2, 2021 at 11:15 PM Bruce Hoult <bruce@...> wrote:

| Yes. The Selected Element Width (SEW) would be limited to 32 bits, but the
| widening multiplies and accumulates produce the same number of wider
| results using multiple registers (higher effective LMUL).

| See section 5.2. Vector Operands

| Each vector operand has an effective element width (EEW) and an effective
| LMUL (EMUL) that is used to determine the size and location of all the
| elements within a vector register group. By default, for most operands of
| most instructions, EEW=SEW and EMUL=LMUL.

| Some vector instructions have source and destination vector operands with
| the same number of elements but different widths, so that EEW and EMUL
| differ from SEW and LMUL respectively but EEW/EMUL = SEW/LMUL. For
| example, most widening arithmetic instructions have a source group with
| EEW=SEW and EMUL=LMUL but destination group with EEW=2*SEW and EMUL=
| 2*LMUL. Narrowing instructions have a source operand that has EEW=2*SEW
| and EMUL=2*LMUL but destination where EEW=SEW and EMUL=LMUL.

| Vector operands or results may occupy one or more vector registers
| depending on EMUL, but are always specified using the lowest-numbered
| vector register in the group. Using other than the lowest-numbered vector
| register to specify a vector register group is a reserved encoding.
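
The relation EEW/EMUL = SEW/LMUL quoted above means a widened destination holds the same number of elements as its source. A small Python check, with a hypothetical helper name and example values chosen purely for illustration:

```python
from fractions import Fraction

def elements_per_group(vlen_bits, eew_bits, emul):
    """Number of elements held by a register group: VLEN * EMUL / EEW."""
    return Fraction(vlen_bits) * Fraction(emul) / eew_bits

vlen, sew, lmul = 32, 16, 2
src = elements_per_group(vlen, sew, lmul)          # source: EEW=SEW, EMUL=LMUL
dst = elements_per_group(vlen, 2 * sew, 2 * lmul)  # widened destination
print(src, dst)  # 4 4
```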



mark
 

Could an extension just change state, such as the number of vector registers?



Krste Asanovic
 

See section 4.5.

Krste

On Wed, 2 Jun 2021 08:41:52 -0700, "Guy Lemieux" <guy.lemieux@...> said:
| On Wed, Jun 2, 2021 at 8:38 AM Andrew Waterman <andrew@...> wrote:
| It’s actually not fundamental to the ISA design that VLEN >= ELEN. An
| implementation with VLEN=32 could support SEW=64 whenever LMUL >= 2. 

| I think the concern here is lack of a clearly defined data layout pattern for
| such cases.

| e.g., should the LSBs be in the odd or even register half, or should it be
| implementation-defined?

| Guy
|


Tony Cole
 

Thanks, I must have missed this bit:

"4.5. Mapping with LMUL > 1 and ELEN > VLEN
If vector registers are grouped to support larger SEW, with ELEN > VLEN, the vector registers in the group are concatenated
to form a single array of bytes, with the lowest-numbered register in the group holding the lowest-addressed bytes from the
memory layout."
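
A rough Python sketch of that mapping, for illustration only. The function name is hypothetical, and it assumes the group is treated as one flat byte array with element bytes laid out in increasing address order.

```python
def element_byte_locations(elem_index, sew_bits, vlen_bits, base_reg=0):
    """Map each byte of one element to (register number, byte offset in register)."""
    bytes_per_reg = vlen_bits // 8
    bytes_per_elem = sew_bits // 8
    first = elem_index * bytes_per_elem  # position in the concatenated byte array
    return [(base_reg + b // bytes_per_reg, b % bytes_per_reg)
            for b in range(first, first + bytes_per_elem)]

# VLEN=32, SEW=64, group starting at v2: element 0 fills v2, then v3.
print(element_byte_locations(0, 64, 32, base_reg=2))
```

Under these assumptions the lowest-addressed four bytes of a 64-bit element land in the lower-numbered register of the pair.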
