Smaller embedded version of the Vector extension
On Wed, 2 Jun 2021 11:19:36 -0700, Mark Himelstein <markhimelstein@...> said:
| could an extension just change state like the number of vector registers?
Don't understand this question - please elaborate.
Krste
|
Tony Cole
Hi Bruce,
Do you mean vrgather instead of vslide?
I use vrgather_vx_* and vslidedown to perform a vector element rotate (and other things), see:
https://github.com/riscv/riscv-v-spec/issues/671#issuecomment-837035001
- I use vrgather_vx_i64m8( vec, 0, vl ) to splat the scalar in element 0 of vec to all elements in the result, I just want it in the top element but there isn’t a better instruction for that.
I think you are referring to: vrgather_vv_* ??
Tony
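For readers unfamiliar with the two instructions, here is a minimal Python model of the rotate-by-one described above. It is a sketch under stated assumptions, not the code from the linked issue: the function names are hypothetical, vectors are plain lists, vl is assumed equal to VLMAX, and the final merge stands in for whatever masked operation the real code uses.

```python
def vslidedown(vs2, offset):
    """Model of vslidedown: vd[i] = vs2[i + offset], reading 0 past the end.

    Assumes vl == VLMAX == len(vs2).
    """
    n = len(vs2)
    return [vs2[i + offset] if i + offset < n else 0 for i in range(n)]


def vrgather_vx(vs2, index):
    """Model of vrgather.vx: every result element is vs2[index] (0 if out of range)."""
    n = len(vs2)
    value = vs2[index] if 0 <= index < n else 0
    return [value] * n


def rotate_down_by_one(vec):
    """Rotate elements one place toward index 0; element 0 wraps to the top."""
    slid = vslidedown(vec, 1)      # [v1, v2, ..., v(n-1), 0]
    splat = vrgather_vx(vec, 0)    # [v0, v0, ..., v0]
    top = len(vec) - 1
    # Hypothetical merge step: take the splatted value only in the top element.
    return [splat[i] if i == top else slid[i] for i in range(len(vec))]
```

The splat writes v0 to every element even though only the top one is needed, which is the inefficiency noted above.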
From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Tony Cole via lists.riscv.org
Sent: 02 June 2021 18:13
To: Bruce Hoult <bruce@...>
Cc: Tariq Kurd <tariq.kurd@...>; tech-vector-ext@...; Shaofei (B) <shaofei1@...>
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension
Hi Bruce,
“I am not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.”
I don't see what you mean. Please can you elaborate, with examples, on why/how it exposes the size of the vector register in a very unfortunate way and how it breaks down if VLEN=1 (do you mean LMUL=1??).
The vslide instruction speeds up my code a lot, as it reduces reloading (mostly the same) data over and over again.
Tony
From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Bruce Hoult
I am not a fan of the vslide instructions. It seems they expose the size of the vector registers in a very unfortunate way. In particular they break down if VLEN=1. Most code would be better off storing and loading with an offset.
I think I saw somewhere they are largely intended for debuggers.
On Thu, Jun 3, 2021 at 12:15 AM Tony Cole <tony.cole@...> wrote:
|
|
Hi Tony,

All of the vector permutation instructions can be simulated using the memory system. For example, vslide can be simulated by storing the vector register and loading it at an offset; vrgather can be simulated by an indexed store followed by a unit-stride load (or unit-stride store and indexed load); etc. Whether or not this is more efficient depends on details of the microarchitecture and particular workload.

Best,
Nick Knight
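The store-then-reload idea can be sketched in Python with a list standing in for a scratch memory buffer. This is an illustrative model only: the function names are hypothetical, and the zero padding and bounds check mimic the instructions' out-of-range behaviour, which a real load from memory would not provide by itself.

```python
def vslidedown_via_memory(vreg, offset):
    """Slide down by storing the register, then loading at an element offset."""
    n = len(vreg)
    # Unit-stride store into a scratch buffer, zero-padded so the offset load
    # reads zeros past the end (assumption made for this model).
    memory = list(vreg) + [0] * offset
    # Unit-stride load starting `offset` elements into the buffer.
    return memory[offset:offset + n]


def vrgather_via_memory(vreg, indices):
    """Gather via a unit-stride store followed by an indexed load."""
    memory = list(vreg)  # unit-stride store
    # Indexed load; indices are assumed non-negative, out-of-range reads give 0.
    return [memory[i] if i < len(memory) else 0 for i in indices]
```

Whether either variant beats the register-to-register instruction is, as noted above, a question of microarchitecture and workload.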
|
|
Guy Lemieux
What is the advantage to RVV requiring VLEN >= 128?
I think this should be changed to VLEN >= 64 because:
1) VLEN = 64 is more likely for small implementations; creating a mandatory expectation to improve software portability
2) two implementations, each with VLEN >= 64, do not expose anything new to software that is not already exposed by VLEN >= 128
3) allowing VLEN = 32 would expose something new to software (register file data layout when SEW=64)
4) are there any disadvantages to VLEN >= 64 (versus the current VLEN >= 128)? (I can't see any)
Guy
On Wed, Jun 2, 2021 at 11:13 AM <krste@...> wrote:
|
|
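On Jun 3, 2021, at 9:16 AM, Guy Lemieux <guy.lemieux@...> wrote:
| 1) VLEN = 64 is more likely for small implementations; creating a mandatory expectation to improve software portability
This is the requirement for app processors, which are not generally small cores. Most competing SIMD extensions are at least 128b per vector register.
| 4) are there any disadvantages to VLEN >= 64 (versus the current VLEN >= 128)? (I can't see any)
Lower performance on codes that work well on other app architectures.
Krste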
|
|
Tony Cole
Software should still work with VLEN>=64 if written correctly, as it should be VLEN agnostic.
Maybe it should be a recommendation that VLEN>=128, with a minimum of 64 for app processors?
Lower performance is an implementation cost/benefit decision.
Tony

-----Original Message-----
From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
Sent: 03 June 2021 17:24
To: Guy Lemieux <guy.lemieux@...>
Cc: Andrew Waterman <andrew@...>; Tariq Kurd <tariq.kurd@...>; Shaofei (B) <shaofei1@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension

On Jun 3, 2021, at 9:16 AM, Guy Lemieux <guy.lemieux@...> wrote:
| 1) VLEN = 64 is more likely for small implementations; creating a mandatory expectation to improve software portability
This is the requirement for app processors, which are not generally small cores. Most competing SIMD extensions are at least 128b per vector register.
| 4) are there any disadvantages to VLEN >= 64 (versus the current VLEN >= 128)? (I can't see any)
Lower performance on codes that work well on other app architectures.
Krste
|
|
Guy Lemieux
Krste, to be clear,The issue
On Thu, Jun 3, 2021 at 9:24 AM Krste Asanovic <krste@...> wrote:
| On Jun 3, 2021, at 9:16 AM, Guy Lemieux <guy.lemieux@...> wrote:
| This is the requirement for app processors, which are not generally small cores.

The RVV spec should be inclusive, rather than exclusive. Setting VLEN >= 128 is a higher threshold that makes it less inclusive.

|| 4) are there any disadvantages to VLEN >= 64 (versus the current VLEN >= 128)? (I can't see any)
| Lower performance on codes that work well on other app architectures.

Sorry I wasn't clear. Of course, an implementation with VLEN=64 would likely be slower than one with VLEN=128. To clarify: are there any disadvantages to allowing VLEN=64 in the spec as a minimum threshold?

Software should be agnostic of VLEN, but the truth is programmers will squeeze out every last bit where they can and they will latch on to this minimum value when doing things like re-using LSBs of pointers, setting minimum chunk sizes, etc. Hence, asking them to expect VLEN=64 as a minimum would be better (more inclusive). I can't see how this would hurt performance.

Guy
|
Zalman Stern
If the minimum VLEN is at least 128-bits, one can translate NEON/SSE intrinsics directly without having to have every vector instruction dominated by a loop over the vector length.
-Z-
On Thu, Jun 3, 2021 at 9:38 AM Guy Lemieux <guy.lemieux@...> wrote:
| Krste, to be clear, The issue
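The difference between a length-agnostic loop and a fixed 128-bit operation can be sketched in Python. This is a simplified model, not RVV code: `setvl` is a hypothetical stand-in for vsetvli that always grants min(AVL, VLMAX), LMUL is assumed to be 1, and element-wise addition stands in for any translated NEON/SSE intrinsic.

```python
def setvl(avl, vlmax):
    """Simplified vsetvli model: grant min(avl, vlmax) elements (assumption)."""
    return min(avl, vlmax)


def add_stripmined(a, b, vlen_bits, sew_bits=32):
    """Vector-length-agnostic add: loops until every element is processed."""
    vlmax = vlen_bits // sew_bits  # LMUL=1 assumed
    out = []
    i = 0
    while i < len(a):
        vl = setvl(len(a) - i, vlmax)
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out


def add_fixed_128(a, b, vlen_bits, sew_bits=32):
    """Fixed 128-bit add: a single operation, valid only if VLEN >= 128."""
    if vlen_bits < 128:
        raise ValueError("fixed-width code assumes VLEN >= 128")
    if len(a) != 128 // sew_bits or len(b) != len(a):
        raise ValueError("operands must be exactly 128 bits wide")
    return [x + y for x, y in zip(a, b)]
```

Both functions return the same result for four 32-bit elements; the fixed version has no loop and no per-iteration vl computation, but it only works when the VLEN >= 128 guarantee holds.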
|
Zalman Stern
"...if written correctly" is precisely the point. If VLEN is specified as >=128, code that targets 128-bits explicitly by setting VL to an appropriate constant for a large swath *is* correct. This allows one to do basically what NEON/SSE do today as a baseline for performance.

Whether this is worthwhile or not may be debated, but insisting that everything should be completely vector length agnostic or it is broken is missing the point. Ideally there would be a lot more quantitative data on this, but I'm not going to tilt at that windmill right now. The worst case for the overhead of hardware vector length independence occurs at the smallest sizes as well.

In general it's pretty dubious that the same set of fully lowered instruction bits can efficiently cover everything from the bottom of the embedded space to HPC. Ideally we'd be moving to more sophisticated lowering -- e.g. dynamic and multi-stage compilation -- rather than forcing the issue in the ISA design.

Another way to go would be to split 32-bit and 64-bit implementations such that the VLEN >= 64 for 32-bit implementations and VLEN >= 128 for 64-bit ones. (Application code is rarely going to target 32-bit these days. Minimal embedded implementations are probably 32-bit.) Though truth be told, code likely needs a scalar fallback anyway unless the V extension is required. (Which it almost certainly won't be if we're talking embedded space.) As such, VLEN not being large enough for the expectations code was compiled to is the same as not having the vector unit.

-Z-

| Software should still work with VLEN>=64 if written correctly, as it should be VLEN agnostic.
|
Guy Lemieux
On Thu, Jun 3, 2021 at 1:08 PM Zalman Stern <zalman@...> wrote:
That's pretty handy, actually. I'm not sure it should be a property of the V spec itself; rather, it could be a requirement that software which is translated in this method could require an implementation with VLEN >= 128, else it would fall back to a scalar translation.

For RVV, I was pretty comfortable with the requirement that RVV require VLEN >= 128 before this whole thread started. It seemed like a good length (4 x 32b words) which matched other SIMD instruction sets, as you have noted.

With this post, Tariq indicated that he wants to reduce the amount of state. From this, I started to think it might be better to shorten this to VLEN >= 64 or perhaps VLEN >= max(XLEN,FLEN) rather than reducing the number of named registers [*]

Regarding performance, VLEN=32 or 64 seems ridiculously small until you consider register grouping. The RVV-lite profile that I'm proposing would require SEW/LMUL=8, so VLMAX=4 for VLEN=32, and VLMAX=8 for VLEN=64. These are reasonable vector lengths to get reasonable amounts of parallelism.

[*] Why not just restrict small implementations to 16 or 8 named registers with VLEN >= 128? It is a consequence of how RVV has chosen to implement widening and narrowing instructions, which require using register grouping. In my RVV-lite profile, I considered eliminating register groups entirely, but this would require some other way to do widening/narrowing which would not be compatible with RVV. With SEW/LMUL=32/4, a common setting, there are only 8 vector registers available. To save register file area, restricting this to just 4 vector registers seems too restrictive. Instead, I think relaxing VLMAX >= 64 achieves the same effect (halving the required register file size) without requiring such a restriction.

Guy
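The numbers in this message follow from the spec's relation VLMAX = LMUL * VLEN / SEW and from the fact that LMUL groups the 32 architectural registers. A small Python sketch, restricted to integer LMUL and using hypothetical helper names, reproduces them:

```python
def vlmax(vlen_bits, sew_bits, lmul):
    """VLMAX = LMUL * VLEN / SEW (integer LMUL only in this sketch)."""
    return (lmul * vlen_bits) // sew_bits


def register_groups(lmul, num_regs=32):
    """Number of usable register groups when registers are grouped by LMUL."""
    return num_regs // lmul


def register_file_bits(vlen_bits, num_regs=32):
    """Total vector register file state in bits."""
    return num_regs * vlen_bits


# Every (SEW, LMUL) pair with SEW/LMUL = 8 gives the same VLMAX for a given VLEN.
for vlen in (32, 64):
    for sew, lmul in ((8, 1), (16, 2), (32, 4)):
        print(f"VLEN={vlen} SEW={sew} LMUL={lmul} -> VLMAX={vlmax(vlen, sew, lmul)}")
```

With SEW=32 and LMUL=4, `register_groups(4)` gives the 8 registers mentioned above, and `register_file_bits(64)` is half of `register_file_bits(128)`, which is the area saving under discussion.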
|
If there was no cost, then supporting VLEN=64 on the general apps processor profile would be a good thing to do. But not allowing standard software to assume VLEN>=128 imposes a non-trivial impact on bigger cores, and the expectation is that the vast majority of apps cores will want VLEN>=128.

As Zalman points out, the main advantage is removing stripmining code when it is known vectors will fit, and translating existing code is one important such use case though not the only one. Removing stripmining reduces static and dynamic code size and increases performance. While LMUL>1 allows more cases to be handled without stripmining, it also reduces available arch registers.

Anyone can of course still build a compatible apps processor with VLEN=64, but this would fail to run some of the code written for the VLEN>=128 case. And almost anything goes in embedded space.

Krste

On Thu, 3 Jun 2021 13:35:03 -0700, Zalman Stern <zalman@...> said:
| "...if written correctly" is precisely the point. If VLEN is specified as >=128, code that targets 128-bits explicitly by setting VL to an appropriate constant for a large swath *is* correct. This allows one to do basically what NEON/SSE do today as a baseline for performance.
| Whether this is worthwhile or not may be debated, but insisting that everything should be completely vector length agnostic or it is broken is missing the point. Ideally there would be a lot more quantitative data on this, but I'm not going to tilt at that windmill right now. The worst case for the overhead of hardware vector length independence occurs at the smallest sizes as well.
| In general it's pretty dubious that the same set of fully lowered instruction bits can efficiently cover everything from the bottom of the embedded space to HPC. Ideally we'd be moving to more sophisticated lowering -- e.g. dynamic and multi-stage compilation -- rather than forcing the issue in the ISA design.
| Another way to go would be to split 32-bit and 64-bit implementations such that the VLEN >= 64 for 32-bit implementations and VLEN >= 128 for 64-bit ones. (Application code is rarely going to target 32-bit these days. Minimal embedded implementations are probably 32-bit.) Though truth be told, code likely needs a scalar fallback anyway unless the V extension is required. (Which it almost certainly won't be if we're talking embedded space.) As such, VLEN not being large enough for the expectations code was compiled to is the same as not having the vector unit.
| -Z-
| On Thu, Jun 3, 2021 at 9:33 AM Tony Cole via lists.riscv.org <tony.cole=huawei.com@...> wrote:
| Software should still work with VLEN>=64 if written correctly, as it should be VLEN agnostic.
| Maybe it should be a recommendation that VLEN>=128, with a minimum of 64 for app processors?
| Lower performance is an implementation cost/benefit decision.
| Tony
| -----Original Message-----
| From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
| Sent: 03 June 2021 17:24
| To: Guy Lemieux <guy.lemieux@...>
| Cc: Andrew Waterman <andrew@...>; Tariq Kurd <tariq.kurd@...>; Shaofei (B) <shaofei1@...>; tech-vector-ext@...
| Subject: Re: [RISC-V] [tech-vector-ext] Smaller embedded version of the Vector extension
|| On Jun 3, 2021, at 9:16 AM, Guy Lemieux <guy.lemieux@...> wrote:
||
|| What is the advantage to RVV requiring VLEN >= 128?
||
|| I think this should be changed to VLEN >= 64 because:
||
|| 1) VLEN = 64 is more likely for small implementations; creating a mandatory expectation to improve software portability
| This is the requirement for app processors, which are not generally small cores.
| Most competing SIMD extensions are at least 128b per vector register.
||
|| 2) two implementations, each with VLEN >= 64, do not expose anything new to software that is not already exposed by VLEN >= 128
||
|| 3) allowing VLEN = 32 would expose something new to software (register file data layout when SEW=64)
||
|| 4) are there any disadvantages to VLEN >= 64 (versus the current VLEN >= 128)? (I can't see any)
| Lower performance on codes that work well on other app architectures.
| Krste
||
|| Guy
||
|| On Wed, Jun 2, 2021 at 11:13 AM <krste@...> wrote:
|||
||| The VLEN>=128 constraint is only for the application processor "V" extension for the app profile - not for embedded vectors which can have VLEN=32.
|||
||| From spec Introduction:
||| '
||| The term base vector extension is used informally to describe the standard set of vector ISA components that will be required for the single-letter "V" extension, which is intended for use in standard server and application-processor platform profiles. The set of mandatory instructions and supported element widths will vary with the base ISA (RV32I, RV64I) as described below.
|||
||| Other profiles, including embedded profiles, may choose to mandate only subsets of these extensions. The exact set of mandatory supported instructions for an implementation to be compliant with a given profile will only be determined when each profile spec is ratified. For convenience in defining subset profiles, vector instruction subsets are given ISA string names beginning with the "Zv" prefix.
||| '
|||
||| There are a set of Zve* names for the embedded subsets (see github issue #550).
|||
||| A minimal embedded implementation using RV32E+Zfinx+vectors would be same state size as ARM MVE.
|||
||| P extension does not have floating-point, but for short integer/fixed-point SIMD makes sense as alternative.
|||
||| The software fragmentation issue is that some library routines that expose VLEN might not be portable between app cores and embedded cores, but these are different software ecosystems (e.g. ABI/calling convention might be different) and only a few kinds of routine rely on VLEN.
|||
||| For app cores that can afford VLEN>=128, the advantage is the removal of stripmining code in cases that operate on fixed-size vectors.
|||
||| Krste
|||
||| On Wed, 2 Jun 2021 05:10:32 -0700, "Guy Lemieux" <guy.lemieux@...> said:
|||
||| | Allowing VLEN<128 would allow for smaller vector register files, but it would also result in a profile that is not forward-compatible with the V spec. This would produce another fracture in the software ecosystem.
||| | To avoid such a fracture, there are two choices:
||| | (1) go with P instead
||| | (2) relax the V spec to allow smaller implementations
||| | So the key question for this group is whether to relax the minimum VLEN to 32 or 64?
||| | note: a possible justification for keeping 128 might be to recommend (1) instead. I don’t know anything about P, but it seems like it could be speced in a way that is competitive/comparable with Helium.
||| | Guy
||| | PS: I have started to design an “RVV-lite” profile which would be more amenable to embedded implementations. However, I have adopted a stance that it must remain forward compatible with the full V spec, so I have not considered VLEN below 128. I am happy to share my work on this and involve other contributors; email me if you would like to see a copy.
||| | On Wed, Jun 2, 2021 at 3:15 AM Andrew Waterman <andrew@...> wrote:
||| | The uppercase-V V extension is meant to cater to apps processors, where the VLEN >= 128 constraint is not inappropriate and is sometimes beneficial. But there's nothing fundamental about the ISA design that prohibits VLEN < 128. A minimal configuration is VLEN=ELEN=32, giving the same total amount of state as MVE. (And if you set LMUL=4, then you even get the same shape: 8 registers of 128 bits apiece.)
||| | Such a thing wouldn't be called V, but perhaps something like Zvmin. Other than agreeing on a feature set and assigning it a name, the architecting is already done.
||| | (If you search the spec for Zfinx, you'll see that a Zfinx variant is planned, but only barely sketched out.)
||| | On Wed, Jun 2, 2021 at 3:04 AM Tariq Kurd via lists.riscv.org <tariq.kurd=huawei.com@...> wrote:
||| | Hi everyone,
||| | Are there any plans for a cut-down configuration of the vector extension suitable for embedded cores? It seems that the 32x128-bit register file is suitable for application class cores but it is very large for embedded cores, especially if the F registers also need to be implemented (which I think is the case, unless a Zfinx version is specified).
||| | ARM MVE only has 8x128-bit registers for FP and Vector, so it is much more suitable for embedded applications.
||| | https://en.wikichip.org/wiki/arm/helium
||| | What’s the approach here? Should embedded applications implement the P-extension instead?
||| | Tariq
||| | Tariq Kurd
||| | Processor Design I RISC-V Cores, Bristol
||| | E-mail: Tariq.Kurd@...
||| | Company: Huawei technologies R&D (UK) Ltd I Address: 290 Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32 4TR, UK
||| | http://www.huawei.com
|
On Fri, Jun 4, 2021 at 8:09 AM Zalman Stern via lists.riscv.org <zalman=google.com@...> wrote:
This is an excellent point, but there are only 8 SSE/AVX/AVX2 registers in 32 bit mode and 16 in 64 bit. Therefore a 32 bit RISC-V could use 32 bit VLEN and LMUL=4 to directly translate SSE code without stripmining, and a 64 bit RISC-V could use 64 bit VLEN and LMUL=2. For AVX/AVX2 VLEN=64 is required on 32 bit and VLEN=128 on 64 bit, using the same LMUL.

Similarly, 32 bit ARM NEON works as sixteen 128 bit registers or thirty two 64 bit registers. Thus a 32 bit RISC-V with VLEN=64 can directly translate NEON code using LMUL=1 or LMUL=2.

Aarch64 has thirty two registers of 128 bits each, which can also be treated as thirty two registers of 64 bits each (effectively just setting a smaller VL, the upper half is zeroed). So directly porting 64 bit ARM Advanced SIMD code does require 128 bit registers.

For maximum SIMD-porting compatibility with both ARM and x86 code a 64 bit RISC-V needs VLEN=128 but a 32 bit RISC-V is fine with VLEN=64.
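The register counts and widths quoted above map onto RVV by simple arithmetic: group the 32 architectural registers so that the number of groups equals the foreign register count, then size VLEN so each group is as wide as one foreign register. The sketch below checks those figures; `rvv_mapping` is a hypothetical helper, and it assumes 32 RVV registers and a foreign register count that divides 32.

```python
def rvv_mapping(num_regs, reg_bits, rvv_regs=32):
    """Return (LMUL, minimum VLEN) so that RVV offers `num_regs` register
    groups of `reg_bits` bits each, with no stripmining needed."""
    lmul = rvv_regs // num_regs   # group registers until the counts match
    vlen = reg_bits // lmul       # each group must hold one foreign register
    return lmul, vlen


# (ISA, register count, register width in bits) as stated in the message above.
cases = [
    ("SSE, 32-bit x86", 8, 128),
    ("SSE, 64-bit x86", 16, 128),
    ("AVX, 32-bit x86", 8, 256),
    ("AVX, 64-bit x86", 16, 256),
    ("NEON, 32-bit ARM (Q regs)", 16, 128),
    ("NEON, 32-bit ARM (D regs)", 32, 64),
    ("Advanced SIMD, AArch64", 32, 128),
]
for name, count, bits in cases:
    lmul, vlen = rvv_mapping(count, bits)
    print(f"{name}: LMUL={lmul}, VLEN>={vlen}")
```

The largest VLEN among the 32-bit cases is 64 and among the 64-bit cases is 128, matching the conclusion above.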
|
Hi Everyone,

Wanted to continue this interesting discussion. Noticed that the ZVE* Extensions are listed now on the vspec.adoc:
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#181-zve-vector-extensions-for-embedded-processors

Was wondering if this is a complete listing of the requirements (so far) for the ZVE* extensions? Or if there might be another document/spreadsheet/source-file which would have a running-list of requirements?

In particular, hoping to check if there might be a running-list of instructions required by the ZVE* extensions (e.g. if we would need to implement vector integer division) and the range of LMUL levels we would be required to support?

Looking forward to continuing the discussion.

All the best,
Gregory
|
Guy Lemieux
I’ve taken a stab at reducing the number of instructions in my RVV-lite proposal. The overriding goal, in my mind, is to preserve forward software compatibility so the ecosystem doesn’t need to fragment.

There are lots of instructions that are not essential which I have eliminated. Also, I have dropped or limited the scope of the widening and narrowing instructions; they are awkward to implement because they change the demand in register file read or write bandwidth due to a mixing of data element sizes.

Limiting LMUL is far more difficult, because it is fundamental to the way RVV changes data widths. The best I could do in my proposal is require SEW/LMUL to always be 8.

I’m happy to share my proposal on request, but I’ve not broadcast it here because it still needs more work. I’d welcome any thoughts on improving it though.

Guy

On Sun, Jun 27, 2021 at 11:54 PM Gregory Kielian <gkielian@...> wrote:
|
|