Re: Smaller embedded version of the Vector extension

Tony Cole

Thanks, I must have missed this bit:

"4.5. Mapping with LMUL > 1 and ELEN > VLEN
If vector registers are grouped to support larger SEW, with ELEN > VLEN, the vector registers in the group are concatenated
to form a single array of bytes, with the lowest-numbered register in the group holding the lowest-addressed bytes from the
memory layout."
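
The concatenation rule quoted above can be illustrated with a small model. This is only a sketch: it assumes VLEN = 32 bits, a little-endian memory layout, and a group of two registers named v8 and v9 purely for illustration; the helper names are hypothetical and not taken from the spec.

```python
# Hypothetical model: VLEN = 32 bits, so each vector register holds 4 bytes.
VLEN_BYTES = 4


def group_bytes(registers):
    """Concatenate a register group into one byte array, lowest-numbered register first."""
    out = bytearray()
    for reg in registers:
        assert len(reg) == VLEN_BYTES
        out += reg
    return bytes(out)


def read_element(registers, index, sew_bytes):
    """Read element `index` of width `sew_bytes` from the group (little-endian assumed)."""
    data = group_bytes(registers)
    start = index * sew_bytes
    return int.from_bytes(data[start:start + sew_bytes], "little")


# One 64-bit value as it would sit in memory, split across two 32-bit registers.
value = 0x1122334455667788
mem = value.to_bytes(8, "little")
v8, v9 = mem[0:4], mem[4:8]  # v8 holds the lowest-addressed bytes
```

Reading element 0 with `read_element([v8, v9], 0, 8)` recovers the original 64-bit value, with v8 contributing the low half.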

-----Original Message-----
From: krste@... [mailto:krste@...]
Sent: 02 June 2021 19:17
To: Tony Cole <tony.cole@...>
Cc: Bruce Hoult <bruce@...>; Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

We do allow the supported SEW to vary with LMUL, so an implementation can support single-width operations on SEW=64. See section 4.5.


On Wed, 2 Jun 2021 12:14:33 +0000, "Tony Cole via" <> said:
| So, (on a 32x 32-bit vector register machine) the widening and
| narrowing instructions can use 64-bit elements (for destination and
| source respectively), but not any of the other instructions, correct?

| Note: I use many instructions while processing 64-bit “wide” and “quad”
| elements, e.g. vrgather_vx_i64m8, vslide1down_vx_i64m4,
| vslidedown_vx_i64m8, vredsum_vs_i64m8, etc.

| Therefore, this code would not work on a 32x 32-bit vector register machine.

| Tony

| From: tech-vector-ext@...
| [mailto:tech-vector-ext@...]
| On Behalf Of Bruce Hoult
| Sent: 02 June 2021 12:18
| To: Tony Cole <tony.cole@...>
| Cc: Tariq Kurd <tariq.kurd@...>;
| tech-vector-ext@...; Shaofei (B) <shaofei1@...>
| Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of
| the Vector extension

| Note that the effective LMUL is limited to 8, the same as the actual
| LMUL, so if you've set e32m4 (32 bit elements with LMUL=4) then you
| can only widen to 64 bit results, not 128 bit.
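
The EMUL limit described above can be checked with a few lines of arithmetic. This is a minimal sketch under the assumption that a result `factor` times wider than SEW needs EMUL = `factor` * LMUL; the function name is hypothetical, and other limits (such as ELEN) are deliberately ignored.

```python
from fractions import Fraction

MAX_EMUL = 8  # effective LMUL is limited to 8, as stated above


def widened(sew, lmul, factor):
    """Return (EEW, EMUL) for a result `factor` times wider than SEW,
    or None if the required EMUL would exceed the limit."""
    eew = sew * factor
    emul = Fraction(lmul) * factor
    if emul > MAX_EMUL:
        return None
    return eew, emul


# e32m4: doubling is allowed (EMUL=8), quadrupling is not (EMUL would be 16).
double = widened(32, 4, 2)
quad = widened(32, 4, 4)
```

With e32m4, `double` is `(64, 8)` and `quad` is `None`, matching the statement that only 64 bit results are reachable from that setting.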

| On Wed, Jun 2, 2021 at 11:15 PM Bruce Hoult <bruce@...> wrote:

| Yes. The selected element width (SEW) would be limited to 32 bits, but the
| widening multiplies and accumulates produce the same number of wider
| results using multiple registers (higher effective LMUL).

| See section 5.2. Vector Operands

| Each vector operand has an effective element width (EEW) and an effective
| LMUL (EMUL) that is used to determine the size and location of all the
| elements within a vector register group. By default, for most operands of
| most instructions, EEW=SEW and EMUL=LMUL.

| Some vector instructions have source and destination vector operands with
| the same number of elements but different widths, so that EEW and EMUL
| differ from SEW and LMUL respectively but EEW/EMUL = SEW/LMUL. For
| example, most widening arithmetic instructions have a source group with
| EEW=SEW and EMUL=LMUL but destination group with EEW=2*SEW and EMUL=
| 2*LMUL. Narrowing instructions have a source operand that has EEW=2*SEW
| and EMUL=2*LMUL but destination where EEW=SEW and EMUL=LMUL.

| Vector operands or results may occupy one or more vector registers
| depending on EMUL, but are always specified using the lowest-numbered
| vector register in the group. Using other than the lowest-numbered vector
| register to specify a vector register group is a reserved encoding.
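
The operand rules quoted above can be sketched as follows. The code assumes 32 architectural vector registers and that a group of EMUL registers starts at a register number that is a multiple of EMUL; the function names are hypothetical and exist only for illustration.

```python
from fractions import Fraction

NUM_VREGS = 32  # assumed number of architectural vector registers


def widening_operands(sew, lmul):
    """(EEW, EMUL) for the source and destination of a widening instruction."""
    lmul = Fraction(lmul)
    src = (sew, lmul)
    dst = (2 * sew, 2 * lmul)
    # Same element count on both sides, so the EEW/EMUL ratio is unchanged.
    assert Fraction(src[0]) / src[1] == Fraction(dst[0]) / dst[1]
    return src, dst


def register_group(base, emul):
    """Registers occupied by a group that is named by its lowest-numbered register."""
    n = max(1, int(emul))  # a fractional EMUL still occupies one register
    if base % n != 0 or base + n > NUM_VREGS:
        raise ValueError(f"v{base} cannot name an EMUL={emul} group")
    return [f"v{r}" for r in range(base, base + n)]


src, dst = widening_operands(32, 4)   # source e32/m4, destination e64/m8
dst_regs = register_group(8, dst[1])  # v8 through v15
```

In this model a widening destination with EMUL=8 can only be named as v0, v8, v16 or v24; asking for `register_group(4, 8)` raises `ValueError`.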

| On Wed, Jun 2, 2021 at 11:11 PM Tony Cole <tony.cole@...> wrote:

| Having 32x 32 bit registers with LMUL=4, giving 8x 128 bits - does
| this allow for 64-bit elements?

| I don't think it does, but it’s not clear in the spec.

| I use 64-bit elements for “wide” and “quad” accumulators.

| From: tech-vector-ext@... [mailto:
| tech-vector-ext@...] On Behalf Of Bruce Hoult
| Sent: 02 June 2021 11:19
| To: Tariq Kurd <tariq.kurd@...>
| Cc: tech-vector-ext@...; Shaofei (B) <
| shaofei1@...>
| Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of
| the Vector extension

| There is nothing to prevent implementing 32x 32 bit registers on a 32
| bit CPU. The application processor spec has quite recently (a few
| months ago) specified a 128 bit minimum register size, but I don't
| think there's any good reason for this, especially in embedded.

| With that configuration, LMUL=4 gives 8x 128 bits, the same as MVE.
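
The register-file arithmetic behind this comparison is easy to reproduce. A minimal sketch, assuming 32 architectural registers and an integer LMUL; the function name is hypothetical.

```python
def register_file(vlen_bits, lmul, num_regs=32):
    """Return (number of groups, bits per group, total bytes of vector state)."""
    groups = num_regs // lmul
    bits_per_group = vlen_bits * lmul
    total_bytes = num_regs * vlen_bits // 8
    return groups, bits_per_group, total_bytes


embedded = register_file(32, 4)     # 32x 32-bit registers grouped with LMUL=4
app_class = register_file(128, 1)   # 32x 128-bit registers, ungrouped
```

Here `embedded` is `(8, 128, 128)`, i.e. 8 groups of 128 bits in 128 bytes of state, while `app_class` is `(32, 128, 512)`, four times the storage.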

| If floating point is desired then Zfinx is available, sharing int & fp
| scalar registers instead of fp and vector registers.

| Of course profiles (or just custom chips for custom applications) can
| define subsets of instructions.

| On Wed, Jun 2, 2021 at 10:05 PM Tariq Kurd via
| <> wrote:

| Hi everyone,

| Are there any plans for a cut-down configuration of the vector
| extension suitable for embedded cores? It seems that the
| 32x128-bit register file is suitable for application class cores
| but it is very large for embedded cores, especially if the F
| registers also need to be implemented (which I think is the case,
| unless a Zfinx version is specified).

| ARM MVE only has 8x128-bit registers for FP and Vector, so it is much
| more suitable for embedded applications.


| What’s the approach here? Should embedded applications implement
| the P-extension instead?

| Tariq

| Tariq Kurd

| Processor Design | RISC-V Cores, Bristol

| E-mail: Tariq.Kurd@...

| Company: Huawei technologies R&D (UK) Ltd | Address: 290 Park
| Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32 4TR, UK



