Re: Smaller embedded version of the Vector extension


Nick Knight
 

Hi Tony,

All of the vector permutation instructions can be simulated using the memory system. For example, vslide can be simulated by storing the vector register and loading it at an offset; vrgather can be simulated by an indexed store followed by a unit-stride load (or unit-stride store and indexed load); etc. Whether or not this is more efficient depends on details of the microarchitecture and particular workload.
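As a rough illustration of the two simulations described above, here is a scalar C model rather than real vector code. The function names, the scratch buffer, and the `MAX_VL` bound are hypothetical, and the sketch assumes `vl` equals VLMAX so that anything past the end of the register reads as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_VL 16 /* hypothetical upper bound on vl for this sketch */

/* vslidedown by `offset`: store the register to a scratch buffer, then
   reload it starting `offset` elements in. The scratch buffer is zeroed
   so lanes that would read past the end come back as 0. */
void slidedown_via_memory(int64_t *dst, const int64_t *src,
                          size_t vl, size_t offset)
{
    int64_t scratch[2 * MAX_VL] = {0};
    assert(vl <= MAX_VL && offset <= MAX_VL);
    memcpy(scratch, src, vl * sizeof *src);          /* unit-stride store */
    memcpy(dst, scratch + offset, vl * sizeof *dst); /* load at an offset */
}

/* vrgather: unit-stride store followed by an indexed load. A real indexed
   load would not return 0 for an out-of-range index, so the model checks
   the index explicitly to mirror vrgather's result. */
void gather_via_memory(int64_t *dst, const int64_t *src,
                       const size_t *idx, size_t vl)
{
    int64_t scratch[MAX_VL];
    assert(vl <= MAX_VL);
    memcpy(scratch, src, vl * sizeof *src);          /* unit-stride store */
    for (size_t i = 0; i < vl; i++)                  /* indexed load */
        dst[i] = idx[i] < vl ? scratch[idx[i]] : 0;
}
```

Each call costs a store plus a load through memory, which is why the trade-off against a register-to-register permute depends on the microarchitecture.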

Best,
Nick Knight


On Wed, Jun 2, 2021 at 1:35 PM Tony Cole via lists.riscv.org <tony.cole=huawei.com@...> wrote:

Hi Bruce,

 

Do you mean vrgather instead of vslide?

 

I use vrgather_vx_* and vslidedown to perform a vector element rotate (and other things), see:

 

        https://github.com/riscv/riscv-v-spec/issues/671#issuecomment-837035001

 

-        I use vrgather_vx_i64m8( vec, 0, vl ) to splat the scalar in element 0 of vec to all elements in the result. I only want it in the top element, but there isn’t a better instruction for that.
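One plausible reading of this rotate, written as a scalar C model: splat element 0, slide the vector down by one, and take only the top lane from the splat. The helper names and `MAX_VL` are hypothetical, and the actual intrinsic sequence is the one in the linked issue comment, which may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VL 16 /* hypothetical upper bound on vl for this sketch */

/* Model of vrgather.vx: every result lane receives src[index],
   or 0 when the index is out of range. */
void rgather_vx(int64_t *dst, const int64_t *src, size_t index, size_t vl)
{
    int64_t picked = index < vl ? src[index] : 0;
    for (size_t i = 0; i < vl; i++)
        dst[i] = picked;
}

/* Rotate down by one element: dst[i] = src[i + 1], dst[vl - 1] = src[0]. */
void rotate_down_by_one(int64_t *dst, const int64_t *src, size_t vl)
{
    int64_t splat[MAX_VL];
    assert(vl >= 1 && vl <= MAX_VL);
    rgather_vx(splat, src, 0, vl);      /* element 0 in every lane */
    for (size_t i = 0; i + 1 < vl; i++)
        dst[i] = src[i + 1];            /* vslidedown by 1 */
    dst[vl - 1] = splat[vl - 1];        /* only the top lane is needed */
}
```

The model makes the inefficiency visible: the splat writes every lane even though only one of them is used.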

 

I think you are referring to: vrgather_vv_*  ??

 

 

Tony

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Tony Cole via lists.riscv.org
Sent: 02 June 2021 18:13
To: Bruce Hoult <bruce@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

Hi Bruce,

 

“I am not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.”

 

I don't see what you mean. Please can you elaborate, with examples, on why/how it exposes the size of the vector registers in a very unfortunate way, and how it breaks down if VLEN=1 (do you mean LMUL=1??).

 

The vslide instruction speeds up my code a lot as it reduces reloading (mostly the same) data over and over again.

 

 

Tony

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 13:34
To: Tony Cole <tony.cole@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

I am not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.

 

I think I saw somewhere they are largely intended for debuggers.

 

On Thu, Jun 3, 2021 at 12:15 AM Tony Cole <tony.cole@...> wrote:

So, (on a 32x 32-bit vector register machine) the widening and narrowing instructions can use 64-bit elements (for destination and source respectively), but none of the other instructions can, correct?

 

Note: I use many instructions while processing 64-bit “wide” and “quad” elements, e.g. vrgather_vx_i64m8, vslide1down_vx_i64m4, vslidedown_vx_i64m8, vredsum_vs_i64m8, etc.

 

Therefore, this code would not work on a 32x 32-bit vector register machine.

 

 

Tony

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 12:18
To: Tony Cole <tony.cole@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

Note that the effective LMUL is limited to 8, the same as the actual LMUL, so if you've set e32m4 (32 bit elements with LMUL=4) then you can only widen to 64 bit results, not 128 bit. 
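The arithmetic behind this limit can be sketched in C as follows. The function name and the "LMUL in eighths" encoding are hypothetical conveniences, and the check covers only the EMUL <= 8 bound, not whether the implementation supports the wider element width at all.

```c
#include <assert.h>
#include <stdbool.h>

/* LMUL is passed in eighths so fractional settings fit in an integer:
   mf8 = 1, mf4 = 2, mf2 = 4, m1 = 8, m2 = 16, m4 = 32, m8 = 64. */
bool widening_fits(unsigned sew, unsigned lmul_eighths, unsigned dest_eew)
{
    assert(sew != 0 && dest_eew % sew == 0);
    /* EMUL = LMUL * (EEW / SEW), still expressed in eighths. */
    unsigned emul_eighths = lmul_eighths * (dest_eew / sew);
    return emul_eighths <= 64; /* EMUL must not exceed 8 */
}
```

With e32m4, `widening_fits(32, 32, 64)` is true (EMUL = 8), while `widening_fits(32, 32, 128)` is false (EMUL would be 16).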

 

On Wed, Jun 2, 2021 at 11:15 PM Bruce Hoult <bruce@...> wrote:

Yes. The selected element width (SEW) would be limited to 32 bits, but the widening multiplies and accumulates produce the same number of wider results using multiple registers (higher effective LMUL).

 

See section 5.2. Vector Operands

 

Each vector operand has an effective element width (EEW) and an effective LMUL (EMUL) that is used to determine the size and location of all the elements within a vector register group. By default, for most operands of most instructions, EEW=SEW and EMUL=LMUL.


Some vector instructions have source and destination vector operands with the same number of elements but different widths, so that EEW and EMUL differ from SEW and LMUL respectively but EEW/EMUL = SEW/LMUL. For example, most widening arithmetic instructions have a source group with EEW=SEW and EMUL=LMUL but destination group with EEW=2*SEW and EMUL=2*LMUL. Narrowing instructions have a source operand that has EEW=2*SEW and EMUL=2*LMUL but destination where EEW=SEW and EMUL=LMUL.

Vector operands or results may occupy one or more vector registers depending on EMUL, but are always specified using the lowest-numbered vector register in the group. Using other than the lowest-numbered vector register to specify a vector register group is a reserved encoding.
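The relationships quoted above can be checked with a small C sketch. The type and function names are hypothetical, only integer LMUL values are modelled, and the alignment test assumes register groups are aligned to their size, which is what the lowest-numbered-register rule implies.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned eew;  /* effective element width in bits */
    unsigned emul; /* effective LMUL (integer values only here) */
} operand_shape;

/* Destination shape of a widening instruction: EEW = 2*SEW, EMUL = 2*LMUL. */
operand_shape widened(unsigned sew, unsigned lmul)
{
    operand_shape op = { 2u * sew, 2u * lmul };
    return op;
}

/* EEW/EMUL == SEW/LMUL, cross-multiplied to stay in integers. */
bool same_ratio(unsigned sew, unsigned lmul, operand_shape op)
{
    return op.eew * lmul == sew * op.emul;
}

/* A group may only be named by its lowest-numbered register. */
bool is_group_base(unsigned vreg, unsigned emul)
{
    assert(emul >= 1);
    return vreg < 32 && vreg % emul == 0;
}
```

For SEW=32 and LMUL=4, the widened destination is EEW=64 with EMUL=8, so it can only be named as v0, v8, v16, or v24.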

 

 

 

On Wed, Jun 2, 2021 at 11:11 PM Tony Cole <tony.cole@...> wrote:

Having 32x 32 bit registers with LMUL=4, giving 8x 128 bits - does this allow for 64-bit elements?

I don't think it does, but it’s not clear in the spec.

 

I use 64-bit elements for “wide” and “quad” accumulators.

 

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
Sent: 02 June 2021 11:19
To: Tariq Kurd <tariq.kurd@...>
Cc: tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

 

There is nothing to prevent implementing 32x 32 bit registers on a 32 bit CPU. The application processor spec has quite recently (a few months ago) specified a 128 bit minimum register size but I don't think there's any good reason for this, especially in embedded.

 

With that configuration, LMUL=4 gives 8x 128 bits, the same as MVE.
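The register-file arithmetic here is easy to verify with a short C sketch. The struct and function names are hypothetical, and the model assumes 32 architectural vector registers and an integer LMUL.

```c
#include <assert.h>

typedef struct {
    unsigned groups;         /* number of usable register groups */
    unsigned bits_per_group; /* width of each group in bits */
} regfile_view;

/* How a 32-register file of VLEN-bit registers looks at a given LMUL. */
regfile_view group_view(unsigned vlen_bits, unsigned lmul)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    regfile_view v = { 32u / lmul, vlen_bits * lmul };
    return v;
}

/* Total storage of the vector register file in bytes. */
unsigned regfile_bytes(unsigned vlen_bits)
{
    return 32u * vlen_bits / 8u;
}
```

`group_view(32, 4)` gives 8 groups of 128 bits, and `regfile_bytes(32)` is 128 bytes, the same storage as 8x128-bit registers and a quarter of the 512 bytes needed when VLEN=128.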

 

If floating point is desired then Zfinx is available, sharing int & fp scalar registers instead of fp and vector registers.

 

Of course profiles (or just custom chips for custom applications) can define subsets of instructions.

 

On Wed, Jun 2, 2021 at 10:05 PM Tariq Kurd via lists.riscv.org <tariq.kurd=huawei.com@...> wrote:

Hi everyone,

 

Are there any plans for a cut-down configuration of the vector extension suitable for embedded cores? It seems that the 32x128-bit register file is suitable for application class cores but is very large for embedded cores, especially if the F registers also need to be implemented (which I think is the case, unless a Zfinx version is specified).

 

ARM MVE only has 8x128-bit registers for FP and Vector, so it is much more suitable for embedded applications.

https://en.wikichip.org/wiki/arm/helium

 

What’s the approach here? Should embedded applications implement the P-extension instead?

 

Tariq

 

Tariq Kurd

Processor Design | RISC-V Cores, Bristol

E-mail: Tariq.Kurd@...

Company: Huawei technologies R&D (UK) Ltd | Address: 290 Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32 4TR, UK

 
