Smaller embedded version of the Vector extension


Krste Asanovic
 

On Wed, 2 Jun 2021 11:19:36 -0700, Mark Himelstein <markhimelstein@...> said:
| could an extension just change state like the number of vector registers?
|

Don't understand this question - please elaborate.

Krste


Tony Cole
 

Hi Bruce,

 

Do you mean vrgather instead of vslide?

 

I use vrgather_vx_* and vslidedown to perform a vector element rotate (and other things), see:

 

        https://github.com/riscv/riscv-v-spec/issues/671#issuecomment-837035001

 

-        I use vrgather_vx_i64m8( vec, 0, vl ) to splat the scalar in element 0 of vec to all elements in the result. I only want it in the top element, but there isn’t a better instruction for that (a rough assembly equivalent is sketched below).
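
A minimal sketch of that splat in assembly, assuming SEW=64 and LMUL=8; the register choices (v8 source, v16 result, a0 holding the application vector length) are arbitrary:

        vsetvli     t0, a0, e64, m8, ta, ma
        vrgather.vx v16, v8, x0            # v16[i] = v8[0] for every active element
        # (vrgather.vi v16, v8, 0 is equivalent, using the immediate form)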

 

I think you are referring to: vrgather_vv_*  ??

 

 

Tony

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Tony Cole via lists.riscv.org
Sent: 02 June 2021 18:13
To: Bruce Hoult <bruce@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

Hi Bruce,

 

“I am not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.”

 

I don't see what you mean. Please can you elaborate, with examples, on why/how it exposes the size of the vector register in a very unfortunate way and breaks down if VLEN=1 (do you mean LMUL=1?).

 

The vslide instructions speed up my code a lot, as they reduce reloading (mostly the same) data over and over again.

 

 

Tony

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 13:34
To: Tony Cole <tony.cole@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

I am not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.

 

I think I saw somewhere they are largely intended for debuggers.

 

On Thu, Jun 3, 2021 at 12:15 AM Tony Cole <tony.cole@...> wrote:

So, (on a 32x 32-bit vector register machine) the widening and narrowing instructions can use 64-bit elements (for destination and source respectively), but not any of the other instructions, correct?

 

Note: I use many instructions while processing 64-bit “wide” and “quad” elements, e.g. vrgather_vx_i64m8, vslide1down_vx_i64m4, vslidedown_vx_i64m8, vredsum_vs_i64m8, etc.

 

Therefore, this code would not work on a 32x 32-bit vector register machine.

 

 

Tony

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 12:18
To: Tony Cole <tony.cole@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

Note that the effective LMUL is limited to 8, the same as the actual LMUL, so if you've set e32m4 (32 bit elements with LMUL=4) then you can only widen to 64 bit results, not 128 bit. 

 

On Wed, Jun 2, 2021 at 11:15 PM Bruce Hoult <bruce@...> wrote:

Yes. The Selected Element Width (SEW) would be limited to 32 bits, but the widening multiplies and accumulates produce the same number of wider results using multiple registers (higher effective LMUL).

 

See section 5.2. Vector Operands

 

Each vector operand has an effective element width (EEW) and an effective LMUL (EMUL) that is used to determine the size and location of all the elements within a vector register group. By default, for most operands of most instructions, EEW=SEW and EMUL=LMUL.


Some vector instructions have source and destination vector operands with the same number of elements but different widths, so that EEW and EMUL differ from SEW and LMUL respectively but EEW/EMUL = SEW/LMUL. For example, most widening arithmetic instructions have a source group with EEW=SEW and EMUL=LMUL but destination group with EEW=2*SEW and EMUL=2*LMUL. Narrowing instructions have a source operand that has EEW=2*SEW and EMUL=2*LMUL but destination where EEW=SEW and EMUL=LMUL.

Vector operands or results may occupy one or more vector registers depending on EMUL, but are always specified using the lowest-numbered vector register in the group. Using other than the lowest-numbered vector register to specify a vector register group is a reserved encoding.
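
As a concrete sketch of those rules (the multiply-accumulate and register numbers are illustrative, not taken from the spec):

        vsetvli   t0, a0, e32, m4, ta, ma  # SEW=32, LMUL=4
        vle32.v   v8, (a1)                 # source operand: EEW=32, EMUL=4 (v8-v11)
        vle32.v   v12, (a2)                # source operand: EEW=32, EMUL=4 (v12-v15)
        vwmacc.vv v16, v8, v12             # widening destination: EEW=64, EMUL=8 (v16-v23)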

 

 

 

On Wed, Jun 2, 2021 at 11:11 PM Tony Cole <tony.cole@...> wrote:

Having 32x 32 bit registers with LMUL=4, giving 8x 128 bits - does this allow for 64-bit elements?

I don't think it does, but it’s not clear in the spec.

 

I use 64-bit elements for “wide” and “quad” accumulators.

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 11:19
To: Tariq Kurd <tariq.kurd@...>
Cc: tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

There is nothing to prevent implementing 32x 32-bit registers on a 32-bit CPU. The application processor spec has quite recently (a few months ago) specified a 128-bit minimum register size, but I don't think there's any good reason for this, especially in embedded.

 

With that configuration, LMUL=4 gives 8x 128 bits, the same as MVE.

 

If floating point is desired then Zfinx is available, sharing the int and FP scalar registers rather than sharing the FP and vector registers (as MVE does).

 

Of course profiles (or just custom chips for custom applications) can define subsets of instructions.

 

On Wed, Jun 2, 2021 at 10:05 PM Tariq Kurd via lists.riscv.org <tariq.kurd=huawei.com@...> wrote:

Hi everyone,

 

Are there any plans for a cut-down configuration of the vector extension suitable for embedded cores? It seems that the 32x128-bit register file is suitable for application class cores but is very large for embedded cores, especially if the F registers also need to be implemented (which I think is the case, unless a Zfinx version is specified).

 

ARM MVE only has 8x128-bit registers for FP and Vector, so it is much more suitable for embedded applications.

https://en.wikichip.org/wiki/arm/helium

 

What’s the approach here? Should embedded applications implement the P-extension instead?

 

Tariq

 

Tariq Kurd

Processor Design I RISC-V Cores, Bristol

E-mail: Tariq.Kurd@...

Company: Huawei technologies R&D (UK) Ltd I Address: 290 Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32 4TR, UK      

 

http://www.huawei.com

 


Nick Knight
 

Hi Tony,

All of the vector permutation instructions can be simulated using the memory system. For example, vslide can be simulated by storing the vector register and loading it at an offset; vrgather can be simulated by an indexed store followed by a unit-stride load (or unit-stride store and indexed load); etc. Whether or not this is more efficient depends on details of the microarchitecture and particular workload.
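
For instance, a vslidedown by the element count in a0 could be simulated roughly as follows (a sketch only: SEW=32, t1 points to a sufficiently large scratch buffer, and the elements slid in at the tail are simply whatever the buffer holds):

        vsetvli t0, a2, e32, m1, ta, ma   # a2 = number of elements
        vse32.v v8, (t1)                  # spill the vector register to scratch memory
        slli    t2, a0, 2                 # byte offset = slide amount * 4 (SEW=32)
        add     t2, t1, t2
        vle32.v v8, (t2)                  # reload at the offset: elements move down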

Best,
Nick Knight


On Wed, Jun 2, 2021 at 1:35 PM Tony Cole via lists.riscv.org <tony.cole=huawei.com@...> wrote:

Hi Bruce,

 

Do you mean vrgather instead of vslide?

 

I use vrgather_vx_* and vslidedown to perform a vector element rotate (and other things), see:

 

        https://github.com/riscv/riscv-v-spec/issues/671#issuecomment-837035001

 

-        I use vrgather_vx_i64m8( vec, 0, vl ) to splat the scalar in element 0 of vec to all elements in the result, I just want it in the top element but there isn’t a better instruction for that.

 

I think you are referring to: vrgather_vv_*  ??

 

 

Tony

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Tony Cole via lists.riscv.org
Sent: 02 June 2021 18:13
To: Bruce Hoult <bruce@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

Hi Bruce,

 

“I an not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.”

 

I don't see what you mean, please can you elaborate with examples of why/how it exposes the size of the vector register in a very unfortunate way and breaking down if VLEN=1 (do you mean LMUL=1??).

 

The vslide instruction speeds up my code a lot as it reduce reloading (mostly the same) data over and over again.

 

 

Tony

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 13:34
To: Tony Cole <tony.cole@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

I an not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.

 

I think I saw somewhere they are largely intended for debuggers.

 

On Thu, Jun 3, 2021 at 12:15 AM Tony Cole <tony.cole@...> wrote:

So, (on a 32x 32-bit vector register machine) the widening and narrowing instructions can use 64-bit elements (for destination and source respectively), but not any of other instructions, correct?

 

Note: I use many instructions while processing 64-bit “wide” and “quad” elements, e.g. vrgather_vx_i64m8, vslide1down_vx_i64m4, vslidedown_vx_i64m8, vredsum_vs_i64m8, etc.

 

Therefore, this code would not work on a 32x 32-bit vector register machine.

 

 

Tony

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 12:18
To: Tony Cole <tony.cole@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

Note that the effective LMUL is limited to 8, the same as the actual LMUL, so if you've set e32m4 (32 bit elements with LMUL=4) then you can only widen to 64 bit results, not 128 bit. 

 

On Wed, Jun 2, 2021 at 11:15 PM Bruce Hoult <bruce@...> wrote:

Yes. The Standard Element Width (SEW) would be limited to 32 bits, but the widening multiplies and accumulates produce the same number of wider results using multiple registers (higher effective LMUL)

 

See section 5.2. Vector Operands

 

Each vector operand has an effective element width (EEW) and an effective LMUL (EMUL) that is used to determine the size and location of all the elements within a vector register group. By default, for most operands of most instructions, EEW=SEW and EMUL=LMUL.


Some vector instructions have source and destination vector operands with the same number of elements but different widths, so that EEW and EMUL differ from SEW and LMUL respectively but EEW/EMUL = SEW/LMUL. For example, most widening arithmetic instructions have a source group with EEW=SEW and EMUL=LMUL but destination group with EEW=2*SEW and EMUL=2*LMUL. Narrowing instructions have a source operand that has EEW=2*SEW and EMUL=2*LMUL but destination where EEW=SEW and EMUL=LMUL.

Vector operands or results may occupy one or more vector registers depending on EMUL, but are always specified using the lowest-numbered vector register in the group. Using other than the lowest-numbered vector register to specify a vector register group is a reserved encoding.

 

 

 

On Wed, Jun 2, 2021 at 11:11 PM Tony Cole <tony.cole@...> wrote:

Having 32x 32 bit registers with LMUL=4, giving 8x 128 bits - does this allow for 64-bit elements?

I don't think it does, but it’s not clear in the spec.

 

I use 64-bit elements for “wide” and “quad” accumulators.

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 11:19
To: Tariq Kurd <
tariq.kurd@...>
Cc:
tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

There is nothing to prevent implementing 32x 32 bit registers on a 32 bit CPU. The application processor spec has quite

recently (a few months) specified a 128 bit minimum register size but I don't think there's any good reason for this,

especially in embedded.

 

With that configuration, LMUL=4 gives 8x 128 bits, the same as MVE.

 

If floating point is desired then Zfinx is available, sharing int & fp scalar registers instead of fp and vector registers.

 

Of course profiles (or just custom chips for custom applications) can define subsets of instructions.

 

On Wed, Jun 2, 2021 at 10:05 PM Tariq Kurd via lists.riscv.org <tariq.kurd=huawei.com@...> wrote:

Hi everyone,

 

Are there any plans for a cut-down configuration of the vector extension suitable for embedded cores? It seems that the 32x128-bit register file is suitable for application class cores but it very large for embedded cores, especially if the F registers also need to be implemented (which I think is the case, unless a Zfinx version is specified).

 

ARM MVE only has 8x128-bit registers for FP and Vector, so it much more suitable for embedded applications.

https://en.wikichip.org/wiki/arm/helium

 

What’s the approach here? Should embedded applications implement the P-extension instead?

 

Tariq

 

Tariq Kurd

Processor Design I RISC-V Cores, Bristol

E-mail: Tariq.Kurd@...

Company: Huawei technologies R&D (UK) Ltd I Address: 290 Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32 4TR, UK      

 

http://www.huawei.com

 


Guy Lemieux
 

What is the advantage to RVV requiring VLEN >= 128?

I think this should be changed to VLEN >= 64 because:

1) VLEN = 64 is more likely for small implementations; creating a
mandatory expectation to improve software portability

2) two implementations, each with VLEN >= 64, do not expose anything
new to software that is not already exposed by VLEN >= 128

3) allowing VLEN =32 would expose something new to software (register
file data layout when SEW=64)

4) are there any disadvantages to VLEN >= 64 (versus the current VLEN >= 128)? (I can't see any)
Guy


On Wed, Jun 2, 2021 at 11:13 AM <krste@...> wrote:


The VLEN>=128 constraint is only for the application processor "V"
extension for the app profile - not for embedded vectors which can
have VLEN=32.

From spec Introduction:
'
The term base vector extension is used informally to describe the standard set of vector ISA components that will be required for the single-letter "V" extension, which is intended for use in standard server and application-processor platform profiles. The set of mandatory instructions and supported element widths will vary with the base ISA (RV32I, RV64I) as described below.

Other profiles, including embedded profiles, may choose to mandate only subsets of these extensions. The exact set of mandatory supported instructions for an implementation to be compliant with a given profile will only be determined when each profile spec is ratified. For convenience in defining subset profiles, vector instruction subsets are given ISA string names beginning with the "Zv" prefix.
'

There are a set of Zve* names for the embedded subsets (see github issue
#550).

A minimal embedded implementation using RV32E+Zfinx+vectors would be the
same state size as ARM MVE.

P extension does not have floating-point, but for short
integer/fixed-point SIMD it makes sense as an alternative.

The software fragmentation issue is that some library routines that
expose VLEN might not be portable between app cores and embedded
cores, but these are different software ecosystems (e.g. ABI/calling
convention might be different) and only a few kinds of routine rely on
VLEN.

For app cores that can afford VLEN>=128, the advantage is the removal
of stripmining code in cases that operate on fixed-size vectors.

Krste



On Wed, 2 Jun 2021 05:10:32 -0700, "Guy Lemieux" <guy.lemieux@...> said:
| Allowing VLEN<128 would allow for smaller vector register files, but it would
| also result in a profile that is not forward-compatible with the V spec. This
| would produce another fracture in the software ecosystem.

| To avoid such a fracture, there are two choices:
| (1) go with P instead
| (2) relax the V spec to allow smaller implementations

| So the key question for this group is whether to relax the minimum VLEN to 32
| or 64?

| note: a possible justification for keeping 128 might be to recommend (1)
| instead. I don’t know anything about P, but it seems like it could be speced
| in a way that is competitive/comparable with Helium.

| Guy

| PS — I have started to design an “RVV-lite” profile which would be more
| amenable to embedded implementations. However, I have adopted a stance that it
| must remain forward compatible with the full V spec, so I have not considered
| VLEN below 128. I am happy to share my work on this and involve other
| contributors — email me if you would like to see a copy.

| On Wed, Jun 2, 2021 at 3:15 AM Andrew Waterman <andrew@...> wrote:

| The uppercase-V V extension is meant to cater to apps processors, where
| the VLEN >= 128 constraint is not inappropriate and is sometimes
| beneficial. But there's nothing fundamental about the ISA design that
| prohibits VLEN < 128. A minimal configuration is VLEN=ELEN=32, giving the
| same total amount of state as MVE. (And if you set LMUL=4, then you even
| get the same shape: 8 registers of 128 bits apiece.)

| Such a thing wouldn't be called V, but perhaps something like Zvmin.
| Other than agreeing on a feature set and assigning it a name, the
| architecting is already done.

| (If you search the spec for Zfinx, you'll see that a Zfinx variant is
| planned, but only barely sketched out.)

| On Wed, Jun 2, 2021 at 3:04 AM Tariq Kurd via lists.riscv.org <tariq.kurd=
| huawei.com@...> wrote:

| Hi everyone,

|

| Are there any plans for a cut-down configuration of the vector
| extension suitable for embedded cores? It seems that the 32x128-bit
| register file is suitable for application class cores but it very
| large for embedded cores, especially if

| the F registers also need to be implemented (which I think is the
| case, unless a Zfinx version is specified).

|

| ARM MVE only has 8x128-bit registers for FP and Vector, so it much
| more suitable for embedded applications.

| https://en.wikichip.org/wiki/arm/helium

|

| What’s the approach here? Should embedded applications implement the
| P-extension instead?

|

| Tariq

|

| Tariq Kurd

| Processor Design

| I RISC-V Cores, Bristol

| E-mail:

| Tariq.Kurd@...

| Company:

| Huawei technologies R&D (UK) Ltd

| I Address: 290

| Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32
| 4TR, UK

|

| http://www.huawei.com


Krste Asanovic
 

> On Jun 3, 2021, at 9:16 AM, Guy Lemieux <guy.lemieux@...> wrote:
>
> What is the advantage to RVV requiring VLEN >= 128?
>
> I think this should be changed to VLEN >= 64 because:
>
> 1) VLEN = 64 is more likely for small implementations; creating a
> mandatory expectation to improve software portability

This is the requirement for app processors, which are not generally small cores.
Most competing SIMD extensions are at least 128b per vector register.

> 2) two implementations, each with VLEN >= 64, do not expose anything
> new to software that is not already exposed by VLEN >= 128
>
> 3) allowing VLEN =32 would expose something new to software (register
> file data layout when SEW=64)
>
> 4) are there any disadvantages to VLEN >= 64 (versus the current VLEN >= 128)? (I can't see any)

Lower performance on codes that work well on other app architectures.

Krste


Guy


On Wed, Jun 2, 2021 at 11:13 AM <krste@...> wrote:


The VLEN>=128 constraint is only for the application processor "V"
extension for the app profile - not for embedded vectors which can
have VLEN=32.

From spec Introduction:
'
The term base vector extension is used informally to describe the standard set of vector ISA components that will be required for the single-letter "V" extension, which is intended for use in standard server and application-processor platform profiles. The set of mandatory instructions and supported element widths will vary with the base ISA (RV32I, RV64I) as described below.

Other profiles, including embedded profiles, may choose to mandate only subsets of these extensions. The exact set of mandatory supported instructions for an implementation to be compliant with a given profile will only be determined when each profile spec is ratified. For convenience in defining subset profiles, vector instruction subsets are given ISA string names beginning with the "Zv" prefix.
'

There are a set Zve* names for the embedded subsets (see github issue
#550).

A minimal embedded implementaton using RV32E+Zfinx+vectors would be
same state size as ARM MVE.

P extension does not have floating-point, but for short
integer/fixed-point SIMD makes sense as alternative.

The software fragmentation issue is that some library routines that
expose VLEN might not be portable between app cores and embedded
cores, but these are different software ecosystems (e.g. ABI/calling
convention might be different) and only a few kinds of routine rely on
VLEN.

For app cores that can afford VLEN>=128, the advantage is the removal
of stripmining code in cases that operate on fixed-size vectors.

Krste



On Wed, 2 Jun 2021 05:10:32 -0700, "Guy Lemieux" <guy.lemieux@...> said:
| Allowing VLEN<128 would allow for smaller vector register files, bit it would
| also result in a profile that is not forward-compatible with the V spec. This
| would produce another fracture the software ecosystem.

| To avoid such a fracture, there are two choices:
| (1) go with P instead
| (2) relax the V spec to allow smaller implementations

| So the key question for this group is whether to relax the minimum VLEN to 32
| or 64?

| note: a possible justification for keeping 128 might be to recommend (1)
| instead. I don’t know anything about P, but it seems like it could be speced
| in a way that is competitive/comparable with Helium.

| Guy

| PS — I have started to design an “RVV-lite” profile which would be more
| amenable to embedded implementations. However, I have adopted a stance that it
| must remain forward compatible with the full V spec, so I have not considered
| VLEN below 128. I am happy to share my work on this and involve other
| contributors — email me if you would like to see a copy.

| On Wed, Jun 2, 2021 at 3:15 AM Andrew Waterman <andrew@...> wrote:

| The uppercase-V V extension is meant to cater to apps processors, where
| the VLEN >= 128 constraint is not inappropriate and is sometimes
| beneficial. But there's nothing fundamental about the ISA design that
| prohibits VLEN < 128. A minimal configuration is VLEN=ELEN=32, giving the
| same total amount of state as MVE. (And if you set LMUL=4, then you even
| get the same shape: 8 registers of 128 bits apiece.)

| Such a thing wouldn't be called V, but perhaps something like Zvmin.
| Other than agreeing on a feature set and assigning it a name, the
| architecting is already done.

| (If you search the spec for Zfinx, you'll see that a Zfinx variant is
| planned, but only barely sketched out.)

| On Wed, Jun 2, 2021 at 3:04 AM Tariq Kurd via lists.riscv.org <tariq.kurd=
| huawei.com@...> wrote:

| Hi everyone,

|

| Are there any plans for a cut-down configuration of the vector
| extension suitable for embedded cores? It seems that the 32x128-bit
| register file is suitable for application class cores but it very
| large for embedded cores, especially if

| the F registers also need to be implemented (which I think is the
| case, unless a Zfinx version is specified).

|

| ARM MVE only has 8x128-bit registers for FP and Vector, so it much
| more suitable for embedded applications.

| https://en.wikichip.org/wiki/arm/helium

|

| What’s the approach here? Should embedded applications implement the
| P-extension instead?

|

| Tariq

|

| Tariq Kurd

| Processor Design

| I RISC-V Cores, Bristol

| E-mail:

| Tariq.Kurd@...

| Company:

| Huawei technologies R&D (UK) Ltd

| I Address: 290

| Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32
| 4TR, UK

|

| http://www.huawei.com


Tony Cole
 

Software should still work with VLEN>=64 if written correctly, as it should be VLEN agnostic.
Maybe it should be a recommendation that VLEN>=128, with a minimum of 64 for app processors?
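
For reference, the usual VLEN-agnostic stripmine loop looks something like the sketch below (a0 = element count, a1/a2/a3 = array pointers; the e32/m8 setting and register numbers are arbitrary):

loop:
        vsetvli t0, a0, e32, m8, ta, ma   # vl = min(remaining, VLMAX), whatever VLEN is
        vle32.v v0, (a1)
        vle32.v v8, (a2)
        vadd.vv v16, v0, v8
        vse32.v v16, (a3)
        sub     a0, a0, t0                # elements processed this pass
        slli    t1, t0, 2                 # bytes consumed (SEW=32)
        add     a1, a1, t1
        add     a2, a2, t1
        add     a3, a3, t1
        bnez    a0, loop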

Lower performance is an implementation cost/benefit decision.

Tony

-----Original Message-----
From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
Sent: 03 June 2021 17:24
To: Guy Lemieux <guy.lemieux@...>
Cc: Andrew Waterman <andrew@...>; Tariq Kurd <tariq.kurd@...>; Shaofei (B) <shaofei1@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension



On Jun 3, 2021, at 9:16 AM, Guy Lemieux <guy.lemieux@...> wrote:

What is the advantage to RVV requiring VLEN >= 128?

I think this should be changed to VLEN >= 64 because:

1) VLEN = 64 is more likely for small implementations; creating a
mandatory expectation to improve software portability
This is the requirement for app processors, which are not generally small cores.
Most competing SIMD extensions are at least 128b per vector register.


2) two implementations, each with VLEN >= 64, do not expose anything
new to software that is not already exposed by VLEN >= 128

3) allowing VLEN =32 would expose something new to software (register
file data layout when SEW=64)

4) are there any disadvantages to VLEN >= 64 (versus the current VLEN
= 128)? (I can't see any)
Lower performance on codes that work well on other app architectures.

Krste


Guy


On Wed, Jun 2, 2021 at 11:13 AM <krste@...> wrote:


The VLEN>=128 constraint is only for the application processor "V"
extension for the app profile - not for embedded vectors which can
have VLEN=32.

From spec Introduction:
'
The term base vector extension is used informally to describe the standard set of vector ISA components that will be required for the single-letter "V" extension, which is intended for use in standard server and application-processor platform profiles. The set of mandatory instructions and supported element widths will vary with the base ISA (RV32I, RV64I) as described below.

Other profiles, including embedded profiles, may choose to mandate only subsets of these extensions. The exact set of mandatory supported instructions for an implementation to be compliant with a given profile will only be determined when each profile spec is ratified. For convenience in defining subset profiles, vector instruction subsets are given ISA string names beginning with the "Zv" prefix.
'

There are a set Zve* names for the embedded subsets (see github issue
#550).

A minimal embedded implementaton using RV32E+Zfinx+vectors would be
same state size as ARM MVE.

P extension does not have floating-point, but for short
integer/fixed-point SIMD makes sense as alternative.

The software fragmentation issue is that some library routines that
expose VLEN might not be portable between app cores and embedded
cores, but these are different software ecosystems (e.g. ABI/calling
convention might be different) and only a few kinds of routine rely
on VLEN.

For app cores that can afford VLEN>=128, the advantage is the removal
of stripmining code in cases that operate on fixed-size vectors.

Krste



On Wed, 2 Jun 2021 05:10:32 -0700, "Guy Lemieux" <guy.lemieux@...> said:
| Allowing VLEN<128 would allow for smaller vector register files,
| bit it would also result in a profile that is not
| forward-compatible with the V spec. This would produce another fracture the software ecosystem.

| To avoid such a fracture, there are two choices:
| (1) go with P instead
| (2) relax the V spec to allow smaller implementations

| So the key question for this group is whether to relax the minimum
| VLEN to 32 or 64?

| note: a possible justification for keeping 128 might be to
| recommend (1) instead. I don’t know anything about P, but it seems
| like it could be speced in a way that is competitive/comparable with Helium.

| Guy

| PS — I have started to design an “RVV-lite” profile which would be
| more amenable to embedded implementations. However, I have adopted
| a stance that it must remain forward compatible with the full V
| spec, so I have not considered VLEN below 128. I am happy to share
| my work on this and involve other contributors — email me if you would like to see a copy.

| On Wed, Jun 2, 2021 at 3:15 AM Andrew Waterman <andrew@...> wrote:

| The uppercase-V V extension is meant to cater to apps processors, where
| the VLEN >= 128 constraint is not inappropriate and is sometimes
| beneficial. But there's nothing fundamental about the ISA design that
| prohibits VLEN < 128. A minimal configuration is VLEN=ELEN=32, giving the
| same total amount of state as MVE. (And if you set LMUL=4, then you even
| get the same shape: 8 registers of 128 bits apiece.)

| Such a thing wouldn't be called V, but perhaps something like Zvmin.
| Other than agreeing on a feature set and assigning it a name, the
| architecting is already done.

| (If you search the spec for Zfinx, you'll see that a Zfinx variant is
| planned, but only barely sketched out.)

| On Wed, Jun 2, 2021 at 3:04 AM Tariq Kurd via lists.riscv.org <tariq.kurd=
| huawei.com@...> wrote:

| Hi everyone,

|

| Are there any plans for a cut-down configuration of the vector
| extension suitable for embedded cores? It seems that the 32x128-bit
| register file is suitable for application class cores but it very
| large for embedded cores, especially if

| the F registers also need to be implemented (which I think is the
| case, unless a Zfinx version is specified).

|

| ARM MVE only has 8x128-bit registers for FP and Vector, so it much
| more suitable for embedded applications.

| https://en.wikichip.org/wiki/arm/helium

|

| What’s the approach here? Should embedded applications implement the
| P-extension instead?

|

| Tariq

|

| Tariq Kurd

| Processor Design

| I RISC-V Cores, Bristol

| E-mail:

| Tariq.Kurd@...

| Company:

| Huawei technologies R&D (UK) Ltd

| I Address: 290

| Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32
| 4TR, UK

|

| http://www.huawei.com


Guy Lemieux
 

Krste, to be clear, the issue



On Thu, Jun 3, 2021 at 9:24 AM Krste Asanovic <krste@...> wrote:
> > On Jun 3, 2021, at 9:16 AM, Guy Lemieux <guy.lemieux@...> wrote:
> >
> > What is the advantage to RVV requiring VLEN >= 128?
> >
> > I think this should be changed to VLEN >= 64 because:
> >
> > 1) VLEN = 64 is more likely for small implementations; creating a
> > mandatory expectation to improve software portability
>
> This is the requirement for app processors, which are not generally small cores.
> Most competing SIMD extensions are at least 128b per vector register.

The RVV spec should be inclusive, rather than exclusive. Setting VLEN
>= 128 is a higher threshold that makes it less inclusive.

> > 4) are there any disadvantages to VLEN >= 64 (versus the current VLEN >= 128)? (I can't see any)
>
> Lower performance on codes that work well on other app architectures.
Sorry I wasn't clear. Of course, an implementation with VLEN=64 would
likely be slower than one with VLEN=128.

To clarify: are there any disadvantages to allowing VLEN=64 in the
spec as a minimum threshold?

Software should be agnostic of VLEN, but the truth is programmers will
squeeze out every last bit where they can and they will latch on to
this minimum value when doing things like re-using LSBs of pointers,
setting minimum chunk sizes, etc. Hence, asking them to expect VLEN=64
as a minimum would be better (more inclusive).

I can't see how this would hurt performance.

Guy


Zalman Stern
 

If the minimum VLEN is at least 128-bits, one can translate NEON/SSE intrinsics directly without having to have every vector instruction dominated by a loop over the vector length.
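
(For example, a 128-bit packed 32-bit add in SSE/NEON could be translated roughly as below when VLEN >= 128 is guaranteed; the registers and pointers are illustrative only:

        vsetivli x0, 4, e32, m1, ta, ma   # vl = 4 is guaranteed if VLEN >= 128
        vle32.v  v1, (a0)
        vle32.v  v2, (a1)
        vadd.vv  v1, v1, v2
        vse32.v  v1, (a0)

No stripmine loop is needed because the four elements always fit in one register.)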

-Z-


On Thu, Jun 3, 2021 at 9:38 AM Guy Lemieux <guy.lemieux@...> wrote:
Krste, to be clear,The issue



On Thu, Jun 3, 2021 at 9:24 AM Krste Asanovic <krste@...> wrote:
> > On Jun 3, 2021, at 9:16 AM, Guy Lemieux <guy.lemieux@...> wrote:
> >
> > What is the advantage to RVV requiring VLEN >= 128?
> >
> > I think this should be changed to VLEN >= 64 because:
> >
> > 1) VLEN = 64 is more likely for small implementations; creating a
> > mandatory expectation to improve software portability
>
> This is the requirement for app processors, which are not generally small cores.
> Most competing SIMD extensions are at least 128b per vector register.


The RVV spec should be inclusive, rather than exclusive. Setting VLEN
>= 128 is a higher threshold that makes it less inclusive.


> > 4) are there any disadvantages to VLEN >= 64 (versus the current VLEN
> >> = 128)? (I can't see any)
>
> Lower performance on codes that work well on other app architectures.

Sorry I wasn't clear. Of course, an implementation with VLEN=64 would
likely be slower than one with VLEN=128.

To clarify: are there any disadvantages to allowing VLEN=64 in the
spec as a minimum threshold?

Software should be agnostic of VLEN, but the truth is programmers will
squeeze out every last bit where they can and they will latch on to
this minimum value when doing things like re-using LSBs of pointers,
setting minimum chunk sizes, etc. Hence, asking them to expect VLEN=64
as a minimum would be better (more inclusive).

I can't see how this would hurt performance.

Guy






Zalman Stern
 

"...if written correctly" is precisely the point. If VLEN is specified as >=128, code that targets 128-bits explicitly by setting VL to an appropriate constant for a large swath *is* correct. This allows one to do basically what NEON/SSE do today as a baseline for performance.

Whether this is worthwhile or not may be debated, but insisting that everything should be completely vector length agnostic or it is broken is missing the point. Ideally there would be a lot more quantitative data on this, but I'm not going to tilt at that windmill right now. The worst case for the overhead of hardware vector length independence occurs at the smallest sizes as well.

In general it's pretty dubious that the same set of fully lowered instruction bits can efficiently cover everything from the bottom of the embedded space to HPC. Ideally we'd be moving to more sophisticated lowering -- e.g. dynamic and multi-stage compilation -- rather than forcing the issue in the ISA design.

Another way to go would be to split 32-bit and 64-bit implementations such that VLEN >= 64 for 32-bit implementations and VLEN >= 128 for 64-bit ones. (Application code is rarely going to target 32-bit these days. Minimal embedded implementations are probably 32-bit.) Though truth be told, code likely needs a scalar fallback anyway unless the V extension is required. (Which it almost certainly won't be if we're talking embedded space.) As such, VLEN not being large enough for the expectations the code was compiled to is the same as not having the vector unit.

-Z-

On Thu, Jun 3, 2021 at 9:33 AM Tony Cole via lists.riscv.org <tony.cole=huawei.com@...> wrote:
Software should still work with VLEN>=64 if written correctly, as it should be VLEN agnostic.
Maybe it should be a recommendation that VLEN>=128, with a minimum of 64 for app processors?

Lower performance is an implementation cost/benefit decision.

Tony

-----Original Message-----
From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
Sent: 03 June 2021 17:24
To: Guy Lemieux <guy.lemieux@...>
Cc: Andrew Waterman <andrew@...>; Tariq Kurd <tariq.kurd@...>; Shaofei (B) <shaofei1@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension



> On Jun 3, 2021, at 9:16 AM, Guy Lemieux <guy.lemieux@...> wrote:
>
> What is the advantage to RVV requiring VLEN >= 128?
>
> I think this should be changed to VLEN >= 64 because:
>
> 1) VLEN = 64 is more likely for small implementations; creating a
> mandatory expectation to improve software portability

This is the requirement for app processors, which are not generally small cores.
Most competing SIMD extensions are at least 128b per vector register.

>
> 2) two implementations, each with VLEN >= 64, do not expose anything
> new to software that is not already exposed by VLEN >= 128
>
> 3) allowing VLEN =32 would expose something new to software (register
> file data layout when SEW=64)
>
> 4) are there any disadvantages to VLEN >= 64 (versus the current VLEN
>> = 128)? (I can't see any)

Lower performance on codes that work well on other app architectures.

Krste

>
> Guy
>
>
> On Wed, Jun 2, 2021 at 11:13 AM <krste@...> wrote:
>>
>>
>> The VLEN>=128 constraint is only for the application processor "V"
>> extension for the app profile - not for embedded vectors which can
>> have VLEN=32.
>>
>> From spec Introduction:
>> '
>> The term base vector extension is used informally to describe the standard set of vector ISA components that will be required for the single-letter "V" extension, which is intended for use in standard server and application-processor platform profiles. The set of mandatory instructions and supported element widths will vary with the base ISA (RV32I, RV64I) as described below.
>>
>> Other profiles, including embedded profiles, may choose to mandate only subsets of these extensions. The exact set of mandatory supported instructions for an implementation to be compliant with a given profile will only be determined when each profile spec is ratified. For convenience in defining subset profiles, vector instruction subsets are given ISA string names beginning with the "Zv" prefix.
>> '
>>
>> There are a set Zve* names for the embedded subsets (see github issue
>> #550).
>>
>> A minimal embedded implementaton using RV32E+Zfinx+vectors would be
>> same state size as ARM MVE.
>>
>> P extension does not have floating-point, but for short
>> integer/fixed-point SIMD makes sense as alternative.
>>
>> The software fragmentation issue is that some library routines that
>> expose VLEN might not be portable between app cores and embedded
>> cores, but these are different software ecosystems (e.g. ABI/calling
>> convention might be different) and only a few kinds of routine rely
>> on VLEN.
>>
>> For app cores that can afford VLEN>=128, the advantage is the removal
>> of stripmining code in cases that operate on fixed-size vectors.
>>
>> Krste
>>
>>
>>
>>>>>>> On Wed, 2 Jun 2021 05:10:32 -0700, "Guy Lemieux" <guy.lemieux@...> said:
>>
>> | Allowing VLEN<128 would allow for smaller vector register files,
>> | bit it would also result in a profile that is not
>> | forward-compatible with the V spec. This would produce another fracture the software ecosystem.
>>
>> | To avoid such a fracture, there are two choices:
>> | (1) go with P instead
>> | (2) relax the V spec to allow smaller implementations
>>
>> | So the key question for this group is whether to relax the minimum
>> | VLEN to 32 or 64?
>>
>> | note: a possible justification for keeping 128 might be to
>> | recommend (1) instead. I don’t know anything about P, but it seems
>> | like it could be speced in a way that is competitive/comparable with Helium.
>>
>> | Guy
>>
>> | PS — I have started to design an “RVV-lite” profile which would be
>> | more amenable to embedded implementations. However, I have adopted
>> | a stance that it must remain forward compatible with the full V
>> | spec, so I have not considered VLEN below 128. I am happy to share
>> | my work on this and involve other contributors — email me if you would like to see a copy.
>>
>> | On Wed, Jun 2, 2021 at 3:15 AM Andrew Waterman <andrew@...> wrote:
>>
>> |     The uppercase-V V extension is meant to cater to apps processors, where
>> |     the VLEN >= 128 constraint is not inappropriate and is sometimes
>> |     beneficial.  But there's nothing fundamental about the ISA design that
>> |     prohibits VLEN < 128.  A minimal configuration is VLEN=ELEN=32, giving the
>> |     same total amount of state as MVE.  (And if you set LMUL=4, then you even
>> |     get the same shape: 8 registers of 128 bits apiece.)
>>
>> |     Such a thing wouldn't be called V, but perhaps something like Zvmin.
>> |     Other than agreeing on a feature set and assigning it a name, the
>> |     architecting is already done.
>>
>> |     (If you search the spec for Zfinx, you'll see that a Zfinx variant is
>> |     planned, but only barely sketched out.)
>>
>> |     On Wed, Jun 2, 2021 at 3:04 AM Tariq Kurd via lists.riscv.org <tariq.kurd=
>> |     huawei.com@...> wrote:
>>
>> |         Hi everyone,
>>
>> |
>>
>> |         Are there any plans for a cut-down configuration of the vector
>> |         extension suitable for embedded cores? It seems that the 32x128-bit
>> |         register file is suitable for application class cores but it very
>> |         large for embedded cores, especially if
>>
>> |         the F registers also need to be implemented (which I think is the
>> |         case, unless a Zfinx version is specified).
>>
>> |
>>
>> |         ARM MVE only has 8x128-bit registers for FP and Vector, so it much
>> |         more suitable for embedded applications.
>>
>> |         https://en.wikichip.org/wiki/arm/helium
>>
>> |
>>
>> |         What’s the approach here? Should embedded applications implement the
>> |         P-extension instead?
>>
>> |
>>
>> |         Tariq
>>
>> |
>>
>> |         Tariq Kurd
>>
>> |         Processor Design
>>
>> |         I RISC-V Cores, Bristol
>>
>> |         E-mail:
>>
>> |         Tariq.Kurd@...
>>
>> |         Company:
>>
>> |         Huawei technologies R&D (UK) Ltd
>>
>> |         I Address: 290
>>
>> |         Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32
>> |         4TR, UK
>>
>> |
>>
>> |         http://www.huawei.com












Guy Lemieux
 

On Thu, Jun 3, 2021 at 1:08 PM Zalman Stern <zalman@...> wrote:

If the minimum VLEN is at least 128-bits, one can translate NEON/SSE intrinsics directly without having to have every vector instruction dominated by a loop over the vector length.
That's pretty handy, actually. I'm not sure it should be a property of
the V spec itself; rather, software translated in this way could
require an implementation with VLEN >= 128, else it would fall back to
a scalar translation.

For RVV, I was pretty comfortable with the requirement that RVV
require VLEN >= 128 before this whole thread started. It seemed like a
good length (4 x 32b words) which matched other SIMD instruction sets,
as you have noted.

With this post, Tariq indicated that he wants to reduce the amount of
state. From this, I started to think it might be better to shorten
this to VLEN >= 64, or perhaps VLEN >= max(XLEN,FLEN), rather than
reducing the number of named registers [*].

Regarding performance, VLEN=32 or 64 seems ridiculously small until
you consider register grouping. The RVV-lite profile that I'm
proposing would require SEW/LMUL=8, so VLMAX=4 for VLEN=32, and
VLMAX=8 for VLEN=64. These are reasonable vector lengths to get
reasonable amounts of parallelism.
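
(Using the spec's formula VLMAX = LMUL * VLEN / SEW, fixing SEW/LMUL = 8 gives VLMAX = VLEN / 8,
so VLEN=32 -> VLMAX=4 and VLEN=64 -> VLMAX=8 for every legal SEW/LMUL pair with that ratio.)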


[*] Why not just restrict small implementations to 16 or 8 named
registers with VLEN >= 128? It is a consequence of how RVV has chosen
to implement widening and narrowing instructions, which require using
register grouping. In my RVV-lite profile, I considered eliminating
register groups entirely, but this would require some other way to do
widening/narrowing which would not be compatible with RVV. With
SEW/LMUL=32/4, a common setting, there are only 8 vector registers
available. To save register file area, restricting this to just 4
vector registers seems too restrictive. Instead, I think relaxing to
VLEN >= 64 achieves the same effect (halving the required register
file size) without requiring such a restriction.

Guy


Krste Asanovic
 

If there were no cost, then supporting VLEN=64 in the general apps
processor profile would be a good thing to do. But not allowing
standard software to assume VLEN>=128 imposes a non-trivial impact on
bigger cores, and the expectation is that the vast majority of apps
cores will want VLEN>=128.

As Zalman points out, the main advantage is removing stripmining code
when it is known that vectors will fit, and translating existing code
is one important such use case, though not the only one. Removing
stripmining reduces static and dynamic code size and increases
performance. While LMUL>1 allows more cases to be handled without
stripmining, it also reduces the available arch registers.

Anyone can of course still build a compatible apps processor with
VLEN=64, but this would fail to run some of the code written for the
VLEN>=128 case. And almost anything goes in embedded space.

Krste

On Thu, 3 Jun 2021 13:35:03 -0700, Zalman Stern <zalman@...> said:
| "...if written correctly" is precisely the point. If VLEN is specified as >=128, code that targets 128-bits explicitly by
| setting VL to an appropriate constant for a large swath *is* correct. This allows one to do basically what NEON/SSE do today as
| a baseline for performance.

| Whether this is worthwhile or not may be debated, but insisting that everything should be completely vector length agnostic or
| it is broken is missing the point. Ideally there would be a lot more quantitative data on this, but I'm not going to tilt at
| that windmill right now. The worst case for the overhead of hardware vector length independence occurs at the smallest sizes as
| well.

| In general it's pretty dubious that the same set of fully lowered instruction bits can efficiently cover everything from the
| bottom of the embedded space to HPC. Ideally we'd be moving to more sophisticated lowering -- e.g. dynamic and multi-stage
| compilation -- rather than forcing the issue in the ISA design.

| Another way to go would be to split 32-bit and 64-bit implementations such that the VLEN >= 64 for 32-bit implementations and
| VLEN >= 128 for 64-bit ones. (Application code is rarely going to target 32-bit these days. Minimal embedded implementations
| are probably 32-bit.) Though truth be told, code likely needs a scalar fallback anyway unless the V extension is required.
| (Which it almost certainly won't be if we're talking embedded space.) As such, VLEN not being large enough for the expectations
| code was compiled to is the same as not having the vector unit.

| -Z-

| On Thu, Jun 3, 2021 at 9:33 AM Tony Cole via lists.riscv.org <tony.cole=huawei.com@...> wrote:

| Software should still work with VLEN>=64 if written correctly, as it should be VLEN agnostic.
| Maybe it should be a recommendation that VLEN>=128, with a minimum of 64 for app processors?

| Lower performance is an implementation cost/benefit decision.

| Tony

| -----Original Message-----
| From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
| Sent: 03 June 2021 17:24
| To: Guy Lemieux <guy.lemieux@...>
| Cc: Andrew Waterman <andrew@...>; Tariq Kurd <tariq.kurd@...>; Shaofei (B) <shaofei1@...>;
| tech-vector-ext@...
| Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

|| On Jun 3, 2021, at 9:16 AM, Guy Lemieux <guy.lemieux@...> wrote:
||
|| What is the advantage to RVV requiring VLEN >= 128?
||
|| I think this should be changed to VLEN >= 64 because:
||
|| 1) VLEN = 64 is more likely for small implementations; creating a
|| mandatory expectation to improve software portability

| This is the requirement for app processors, which are not generally small cores.
| Most competing SIMD extensions are at least 128b per vector register.

||
|| 2) two implementations, each with VLEN >= 64, do not expose anything
|| new to software that is not already exposed by VLEN >= 128
||
|| 3) allowing VLEN =32 would expose something new to software (register
|| file data layout when SEW=64)
||
|| 4) are there any disadvantages to VLEN >= 64 (versus the current VLEN
||| = 128)? (I can't see any)

| Lower performance on codes that work well on other app architectures.

| Krste

||
|| Guy
||
||
|| On Wed, Jun 2, 2021 at 11:13 AM <krste@...> wrote:
|||
|||
||| The VLEN>=128 constraint is only for the application processor "V"
||| extension for the app profile - not for embedded vectors which can
||| have VLEN=32.
|||
||| From spec Introduction:
||| '
||| The term base vector extension is used informally to describe the standard set of vector ISA components that will be
| required for the single-letter "V" extension, which is intended for use in standard server and application-processor
| platform profiles. The set of mandatory instructions and supported element widths will vary with the base ISA (RV32I,
| RV64I) as described below.
|||
||| Other profiles, including embedded profiles, may choose to mandate only subsets of these extensions. The exact set of
| mandatory supported instructions for an implementation to be compliant with a given profile will only be determined when
| each profile spec is ratified. For convenience in defining subset profiles, vector instruction subsets are given ISA string
| names beginning with the "Zv" prefix.
||| '
|||
||| There are a set Zve* names for the embedded subsets (see github issue
||| #550).
|||
||| A minimal embedded implementaton using RV32E+Zfinx+vectors would be
||| same state size as ARM MVE.
|||
||| P extension does not have floating-point, but for short
||| integer/fixed-point SIMD makes sense as alternative.
|||
||| The software fragmentation issue is that some library routines that
||| expose VLEN might not be portable between app cores and embedded
||| cores, but these are different software ecosystems (e.g. ABI/calling
||| convention might be different) and only a few kinds of routine rely
||| on VLEN.
|||
||| For app cores that can afford VLEN>=128, the advantage is the removal
||| of stripmining code in cases that operate on fixed-size vectors.
|||
||| Krste
|||
|||
|||
|||||||| On Wed, 2 Jun 2021 05:10:32 -0700, "Guy Lemieux" <guy.lemieux@...> said:
|||
||| | Allowing VLEN<128 would allow for smaller vector register files,
||| | bit it would also result in a profile that is not
||| | forward-compatible with the V spec. This would produce another fracture the software ecosystem.
|||
||| | To avoid such a fracture, there are two choices:
||| | (1) go with P instead
||| | (2) relax the V spec to allow smaller implementations
|||
||| | So the key question for this group is whether to relax the minimum
||| | VLEN to 32 or 64?
|||
||| | note: a possible justification for keeping 128 might be to
||| | recommend (1) instead. I don’t know anything about P, but it seems
||| | like it could be speced in a way that is competitive/comparable with Helium.
|||
||| | Guy
|||
||| | PS — I have started to design an “RVV-lite” profile which would be
||| | more amenable to embedded implementations. However, I have adopted
||| | a stance that it must remain forward compatible with the full V
||| | spec, so I have not considered VLEN below 128. I am happy to share
||| | my work on this and involve other contributors — email me if you would like to see a copy.
|||
||| | On Wed, Jun 2, 2021 at 3:15 AM Andrew Waterman <andrew@...> wrote:
|||
||| |     The uppercase-V V extension is meant to cater to apps processors, where
||| |     the VLEN >= 128 constraint is not inappropriate and is sometimes
||| |     beneficial.  But there's nothing fundamental about the ISA design that
||| |     prohibits VLEN < 128.  A minimal configuration is VLEN=ELEN=32, giving the
||| |     same total amount of state as MVE.  (And if you set LMUL=4, then you even
||| |     get the same shape: 8 registers of 128 bits apiece.)
|||
||| |     Such a thing wouldn't be called V, but perhaps something like Zvmin.
||| |     Other than agreeing on a feature set and assigning it a name, the
||| |     architecting is already done.
|||
||| |     (If you search the spec for Zfinx, you'll see that a Zfinx variant is
||| |     planned, but only barely sketched out.)
|||
||| |     On Wed, Jun 2, 2021 at 3:04 AM Tariq Kurd via lists.riscv.org <tariq.kurd=
||| |     huawei.com@...> wrote:
|||
||| |         Hi everyone,
|||
||| |
|||
||| |         Are there any plans for a cut-down configuration of the vector
||| |         extension suitable for embedded cores? It seems that the 32x128-bit
||| |         register file is suitable for application class cores but it very
||| |         large for embedded cores, especially if
|||
||| |         the F registers also need to be implemented (which I think is the
||| |         case, unless a Zfinx version is specified).
|||
||| |
|||
||| |         ARM MVE only has 8x128-bit registers for FP and Vector, so it much
||| |         more suitable for embedded applications.
|||
||| |         https://en.wikichip.org/wiki/arm/helium
|||
||| |
|||
||| |         What’s the approach here? Should embedded applications implement the
||| |         P-extension instead?
|||
||| |
|||
||| |         Tariq
|||
||| |
|||
||| |         Tariq Kurd
|||
||| |         Processor Design
|||
||| |         I RISC-V Cores, Bristol
|||
||| |         E-mail:
|||
||| |         Tariq.Kurd@...
|||
||| |         Company:
|||
||| |         Huawei technologies R&D (UK) Ltd
|||
||| |         I Address: 290
|||
||| |         Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32
||| |         4TR, UK
|||
||| |
|||
||| |         http://www.huawei.com

|


Bruce Hoult
 

On Fri, Jun 4, 2021 at 8:09 AM Zalman Stern via lists.riscv.org <zalman=google.com@...> wrote:
If the minimum VLEN is at least 128-bits, one can translate NEON/SSE intrinsics directly without having to have every vector instruction dominated by a loop over the vector length.

This is an excellent point, but there are only 8 SSE/AVX/AVX2 registers in 32 bit mode and 16 in 64 bit.

Therefore a 32 bit RISC-V could use 32 bit VLEN and LMUL=4 to directly translate SSE code without stripmining, and a 64 bit RISC-V could use 64 bit VLEN and LMUL=2. For AVX/AVX2 VLEN=64 is required on 32 bit and VLEN=128 on 64 bit, using the same LMUL.

Similarly, 32 bit ARM NEON works as sixteen 128 bit registers or thirty two 64 bit registers. Thus a 32 bit RISC-V with VLEN=64 can directly translate NEON code using LMUL=1 or LMUL=2.

Aarch64 has thirty two registers of 128 bits each, which can also be treated as thirty two registers of 64 bits each (effectively just setting a smaller VL, the upper half is zeroed). So directly porting 64 bit ARM Advanced SIMD code does require 128 bit registers.

For maximum SIMD-porting compatibility with both ARM and x86 code a 64 bit RISC-V needs VLEN=128 but a 32 bit RISC-V is fine with VLEN=64.
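
To make the register-group mapping concrete, on a 32-bit core with VLEN=32 one SSE xmm register corresponds to an LMUL=4 register group (a sketch; v8 and a0 are arbitrary choices):

        vsetivli x0, 4, e32, m4, ta, ma   # 4 x 32-bit elements = 128 bits across v8-v11
        vle32.v  v8, (a0)                 # load one xmm register's worth of data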


Gregory Kielian
 

Hi Everyone, wanted to continue this interesting discussion.


Was wondering if this is a complete listing of the requirements (so far) for the Zve* extensions, or if there might be another document/spreadsheet/source file which would have a running list of requirements?

In particular, hoping to check if there might be a running list of instructions required by the Zve* extensions (e.g. whether we would need to implement vector integer division) and the range of LMUL levels we would be required to support.

Looking forward to continuing the discussion.

All the best,
Gregory


Guy Lemieux
 

I’ve taken a stab at reducing the number of instructions in my RVV-lite proposal. The overriding goal, in my mind, is to preserve forward software compatibility so the ecosystem doesn’t need to fragment.

There are lots of instructions that are not essential which I have eliminated. Also, I have dropped or limited the scope of the widening and narrowing instructions — they are awkward to implement because they change the demand on register file read and write bandwidth due to the mixing of data element sizes.

Limiting LMUL is far more difficult, because it is fundamental to the way RVV changes data widths. The best I could do in my proposal is require SEW/LMUL to always be 8.

I’m happy to share my proposal on request, but I’ve not broadcast it here because it still needs more work. I’d welcome any thoughts on improving it though.

Guy



On Sun, Jun 27, 2021 at 11:54 PM Gregory Kielian <gkielian@...> wrote:
Hi Everyone, wanted to continue this interesting discussion.


Was wondering if this is a complete listing of the requirements (so far) for the ZVE* extensions? or if there might be another document/spreadsheet/source-file which would have a running-list of requirements?

In particular, hoping to check if there might be a running-list of instructions required by the ZVE* extensions (e.g. if we would need to implement vector integer division) and the range of LMUL levels we would be required to support?

Looking forward to continuing the discussion.

All the best,
Gregory