Re: Vector Memory Ordering

swallach
I have not been following this thread in much detail.
Could someone please explain why we need to differentiate between ordered and unordered loads/stores?
In the six or so vector systems I have been involved with, vector references bypassed the cache and main memory was highly interleaved.
Compilers could not care less. One of the major performance optimizations was to eliminate power-of-2 strides (generally done manually).
At Convey we even had an option to deploy a prime-number-interleaved memory system (the BSP did this first).
I saw one application, a major CFD code, that sorted references into a stencil-based reference pattern. This was done to optimize performance for cache-based vector systems.
With the presence of HBM memory systems and some clever memory controller design (which could make use of vector reference information), I am pretty sure ordered and unordered loads/stores will have the same implementation. To be more specific, a memory design for GUPS would have this type of implementation.
I look forward to a better understanding.
Guy Lemieux commented:
I think 90%+ of implementations will choose to do ordered loads and stores even though unordered is permitted.
This means programmers will expect them to be ordered, and such software will not work properly on the remaining implementations. This compatibility problem is a concern.
I think the best way to combat this is to have two sets of instructions: ordered and unordered. The unordered instructions can simply behave as ordered in simple implementations.
Ordered stores to a FIFO are a paradigm I was hoping to use for inter-processor communication.
I think compiler considerations are also important, but I don’t know the implications here.
Guy
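For concreteness, here is a minimal sketch of what the two-form proposal above could look like at the ISA level. The ordered/unordered indexed-store mnemonics (vsoxei/vsuxei) and the register choices are illustrative only; the exact names have varied between spec drafts:

  # a0 = base address, a2 = data array, a3 = byte-offset array, a1 = n
  vsetvli    t0, a1, e32, m1   # take up to vl 32-bit elements
  vle32.v    v4, (a2)          # data to be scattered
  vle32.v    v8, (a3)          # byte offsets
  vsoxei32.v v4, (a0), v8      # ordered scatter: element 0, then 1, ... in order,
                               # so a FIFO or other ordered channel behaves sensibly
  vsuxei32.v v4, (a0), v8      # unordered scatter: elements may reach memory in any
                               # order; only safe for ordinary idempotent RAM

A simple implementation can execute both forms identically, in element order, which is exactly the fallback Guy describes.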
Maybe what's below could be improved by saying that if the base address (in src1) is non-idempotent or an "ordered channel," the entire instruction runs in order; if not, it does not. We could allow a stride of zero but not other overlapping strides for stores. Having a later access fall into a non-idempotent or ordered region would raise an exception. That would allow loads from and stores to a FIFO to work, but it wouldn't allow an instruction to "fall into" such a memory segment part way through.
Bill
On 9/4/20 10:18 AM, Bill Huffman wrote:
I think from this morning, we are considering:
- Ordered scatters are done truly in order
- Strided stores that overlap (including segmented ones) will trap as illegal
- All other vector loads and stores do their memory accesses in arbitrary order.
- A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
- All loads with vector sources must use a different register for the destination than any source (including mask).
- Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??
A few of the consequences of this are:
- A gather with repeated elements can access the higher numbered elements first and lower ones later
- A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
- A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
- When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
- Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected.
- A stride-0 store to a FIFO will trap
- A stride-0 load to a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
- A non-idempotent memory location accessed by a vector load may be accessed multiple times.
We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time. Even strides can't always be known at compile time. Will this plan reduce the amount of auto-vectorization that can be done?
Exception reporting still has issues:
- Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
- For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
- There may need to be a way to report a watchpoint on one element but restart on an earlier element
- If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element? Does that help anything if we do?
I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way. If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than for unit stride or even strided accesses.
Bill
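To make the strided-store cases above concrete, here is a small sketch assuming SEW=32 (4-byte elements), with the byte stride in rs2. Which of these trap depends on which variant of the proposal is adopted; the comments only restate the distinctions being discussed:

  li       t1, 8
  vsse32.v v4, (a0), t1   # stride 8 >= element size: elements never overlap,
                          # so accesses may be performed in arbitrary order
  li       t1, 0
  vsse32.v v4, (a0), t1   # stride 0: every element writes the same 4 bytes
                          # (the FIFO case); trapped or kept in order per the proposal
  li       t1, 2
  vsse32.v v4, (a0), t1   # stride 2 < element size: neighbouring elements partially
                          # overlap; the "overlapping strided stores trap" rule applies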
|
|
Usual vector TG meeting today

Krste Asanovic
Though I don’t know if we’re affected by calendar changes,
Krste
|
|
Re: Signed v Unsigned Immediate: vsaddu.vi
Andrew, Nick,
Thank you for the quick responses. Nick, the text updates look like they directly reflect the intent.
-Cohen
|
|
Re: Decompress Instruction
Thanks Krste, that makes sense, but the logic is not that straightforward: people usually need "decompress" when they are using "compress". Maybe we can add a comment about this in the "vcompress" section?
|
|

Krste Asanovic
If the decompress is the inverse of compress, then there will be a packed vector holding the non-zero elements and a bit mask indicating which elements should receive the elements after unpacking:

  7 6 5 4 3 2 1 0   # vid
        e d c b a   # packed vector of 5 elements
  1 0 0 1 1 1 0 1   # mask vector of 8 elements
  e 0 0 d c b 0 a   # result of decompress

This can be synthesized by using iota and a masked vrgather:

  1 0 0 1 1 1 0 1   # mask vector
  4 4 4 3 2 1 1 0   # viota.m
  0 0 0 0 0 0 0 0   # zero result register
  e 0 0 d c b 0 a   # vrgather using viota.m under mask

The code is:

  # v0 holds mask
  # v1 holds packed data
  # v11 holds decompressed data
  viota.m     v10, v0              # Calc iota from mask in v0
  vmv.v.i     v11, 0               # Clear destination
  vrgather.vv v11, v1, v10, v0.t   # Expand into destination

So decompress is quite fast already. The reason there is a compress instruction is that it cannot be synthesized from other instructions in the same way. You could provide a "compress bit mask into packed indices" instruction, then do a vrgather, but that is not much simpler than just doing the compress.
Krste
|
|
Hi all, For common AI workloads such as DNNs, data communications between network layers introduce huge pressure on capacity and bandwidth of the memory hierarchy.
For instance, dynamic large activation or feature map data needs to be buffered and communicated across multiple layers, which often appears to be sparse (e.g. ReLU). People use bit vectors to "compress" the data buffered and "decompress" for the following layer computations.
Here we can see from the spec that "vcompress" has already been included, how about "vdecompress"?
Thanks, Dawei
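For the buffered-activation use case, a round trip can be built entirely from existing instructions: vcompress packs the active elements, and Krste's viota/vrgather sequence above unpacks them again. A minimal sketch, with arbitrary register assignments:

  # v0 = mask of active elements, v4 = dense activation data
  vcompress.vm v1, v4, v0           # pack elements of v4 where v0[i]=1 into v1
                                    # (v1 is what would be stored between layers)
  viota.m      v10, v0              # running count of set mask bits
  vmv.v.i      v11, 0               # clear destination; inactive elements stay 0
  vrgather.vv  v11, v1, v10, v0.t   # "decompress": expand packed v1 back under mask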
|
|
Re: EEW and non-indexed loads/stores

Krste Asanovic
Correct, Krste
On Sep 2, 2020, at 11:10 PM, Roger Ferrer Ibanez <roger.ferrer@...> wrote:
Hi all,
I understand that the EEW, as explicitly encoded in the load/store instructions, applies to the vector of indices for the indexed loads and stores. For instance we can load a vector "SEW=8,LMUL=1" using a vector of indices of "SEW=64,LMUL=8" by making sure vtype has "SEW=8,LMUL=1" and using v{l,s}xei64.
I'd like to confirm I'm understanding correctly the EEW for unit-stride and strided loads and stores.
Say that vtype is such that SEW=16,LMUL=1 and we execute a v{l,s}{,s}e32.v. Now the EEW of the data and address operands is EEW=32 (as encoded in the instruction) so EMUL=(EEW/SEW)*LMUL=(32/16)*1=2. So in this case we're loading/storing a vector SEW=32,LMUL=2.
Is my interpretation correct?
If it is, I assume this is useful in sequences such as the following one
  # SEW=16, LMUL=1
  vle16.v  v1, (t0)    # Load a vector of sew=16, lmul=1
  vle32.v  v2, (t1)    # Load a vector of sew=32, lmul=2; cool, no need to change vtype
  vwadd.wv v4, v2, v1  # v4_v5(32)[:] ← v2_v3(32)[:] + sign-extend(v1(16)[:])
  vse32.v  v4, (t1)    # Store a vector of sew=32, lmul=2, no need to change vtype either
Thank you,
-- Roger Ferrer Ibáñez - roger.ferrer@... Barcelona Supercomputing Center - Centro Nacional de Supercomputación
|
|
EEW and non-indexed loads/stores

Roger Ferrer Ibanez
Hi all,
I understand that the EEW, as explicitly encoded in the load/store instructions, applies to the vector of indices for the indexed loads and stores. For instance we can load a vector "SEW=8,LMUL=1" using a vector of indices of "SEW=64,LMUL=8" by making sure vtype has "SEW=8,LMUL=1" and using v{l,s}xei64.
I'd like to confirm I'm understanding correctly the EEW for unit-stride and strided loads and stores.
Say that vtype is such that SEW=16,LMUL=1 and we execute a v{l,s}{,s}e32.v. Now the EEW of the data and address operands is EEW=32 (as encoded in the instruction) so EMUL=(EEW/SEW)*LMUL=(32/16)*1=2. So in this case we're loading/storing a vector SEW=32,LMUL=2.
Is my interpretation correct?
If it is, I assume this is useful in sequences such as the following one:

  # SEW=16, LMUL=1
  vle16.v  v1, (t0)    # Load a vector of sew=16, lmul=1
  vle32.v  v2, (t1)    # Load a vector of sew=32, lmul=2; cool, no need to change vtype
  vwadd.wv v4, v2, v1  # v4_v5(32)[:] ← v2_v3(32)[:] + sign-extend(v1(16)[:])
  vse32.v  v4, (t1)    # Store a vector of sew=32, lmul=2, no need to change vtype either

Thank you,
-- Roger Ferrer Ibáñez - roger.ferrer@... Barcelona Supercomputing Center - Centro Nacional de Supercomputación
|
|
Re: Signed v Unsigned Immediate: vsaddu.vi

Nick Knight
Hi Cohen,
Thanks for your careful reading.
Best, Nick Knight
On Wed, Sep 2, 2020 at 2:44 PM Andrew Waterman < andrew@...> wrote: The non-normative text you quoted should be edited to delete the words “it is signed”.
The immediate is sign-extended, but then is treated as an unsigned value. So the operation doesn’t differ based on the argument type.
(This sign-extended-but-unsigned-immediate pattern also exists for e.g. sltiu in the base ISA and vmsgtu.vi in the vector extension.)
From chapter 11, section 1 (#3):
The 5-bit immediate is unsigned when either providing a register index in vrgather or a count for shift, clip, or slide. In all other cases
it is signed and sign extended to SEW bits, even for bitwise and unsigned instructions, notably compare and add.
From chapter 13, section 1: Saturating forms of integer add and subtract are provided, for both signed and unsigned integers. If the result would overflow the destination, the result is replaced with the closest representable value, and the vxsat bit is set.
This results in a conundrum:

  operation   SEW   RS1     RS2
  vsaddu.vv   8     0x0ff   0x01
  vsaddu.vi   8     0x01f   0x01

These two operations now produce different results. Taking the maximum unsigned integer value and adding one causes saturation. The result value for the vector-vector operation would be 0xff and the VXSAT bit would be set. This shouldn't be a surprise. However, the immediate form is more difficult. The immediate value is sign-extended to SEW size and treated as a signed value. This means the arithmetic is now (-1) + 1 = 0. This does not create a saturation (a value outside expected return parameters). The result value from the vector-immediate operation would be 0x1f and the VXSAT bit would be clear.
This is from the specification, as written, in a strict sense.
From a use-case sense, what is trying to be accomplished here? Two counter perspectives:
1 - From a use-case perspective, why would a programmer or compiler specifically pick an unsigned operation, only to operate on values using a signed immediate in a signed format? I'm curious what this case is.
2 - From an architecture/implementation perspective, this is the first time that an engine will have to operate on an instruction differently based on the *source* of the operand. That is, more narrowly, the arithmetic engines are given an operation encoding (usually an "onto" mapping from the opcode space) and operands, but do not care where the operands came from. In other words, the vector engine itself would receive a full bit set in RS1 for both cases above, for a saturating unsigned (sorta) add. However, the outcome is required to be different?
I would imagine others have run into this situation, and I'd like to know both the intent of having a signed-immediate value for this unsigned operation, as well as the applicability of section 11.1 to this instruction.
|
|
Re: Signed v Unsigned Immediate: vsaddu.vi
The non-normative text you quoted should be edited to delete the words “it is signed”.
The immediate is sign-extended, but then is treated as an unsigned value. So the operation doesn’t differ based on the argument type.
(This sign-extended-but-unsigned-immediate pattern also exists for e.g. sltiu in the base ISA and vmsgtu.vi in the vector extension.)
From chapter 11, section 1 (#3):
The 5-bit immediate is unsigned when either providing a register index in vrgather or a count for shift, clip, or slide. In all other cases
it is signed and sign extended to SEW bits, even for bitwise and unsigned instructions, notably compare and add.
From chapter 13, section 1: Saturating forms of integer add and subtract are provided, for both signed and unsigned integers. If the result would overflow the destination, the result is replaced with the closest representable value, and the vxsat bit is set.
This results in a conundrum:

  operation   SEW   RS1     RS2
  vsaddu.vv   8     0x0ff   0x01
  vsaddu.vi   8     0x01f   0x01

These two operations now produce different results. Taking the maximum unsigned integer value and adding one causes saturation. The result value for the vector-vector operation would be 0xff and the VXSAT bit would be set. This shouldn't be a surprise. However, the immediate form is more difficult. The immediate value is sign-extended to SEW size and treated as a signed value. This means the arithmetic is now (-1) + 1 = 0. This does not create a saturation (a value outside expected return parameters). The result value from the vector-immediate operation would be 0x1f and the VXSAT bit would be clear.
This is from the specification, as written, in a strict sense.
From a use-case sense, what is trying to be accomplished here? Two counter perspectives:
1 - From a use-case perspective, why would a programmer or compiler specifically pick an unsigned operation, only to operate on values using a signed immediate in a signed format? I'm curious what this case is.
2 - From an architecture/implementation perspective, this is the first time that an engine will have to operate on an instruction differently based on the *source* of the operand. That is, more narrowly, the arithmetic engines are given an operation encoding (usually an "onto" mapping from the opcode space) and operands, but do not care where the operands came from. In other words, the vector engine itself would receive a full bit set in RS1 for both cases above, for a saturating unsigned (sorta) add. However, the outcome is required to be different?
I would imagine others have run into this situation, and I'd like to know both the intent of having a signed-immediate value for this unsigned operation, as well as the applicability of section 11.1 to this instruction.
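A small assembly sketch of the behaviour Andrew describes above (sign-extend the 5-bit immediate to SEW bits, then treat the SEW-wide value as unsigned), assuming SEW=8; the register choices are arbitrary and only one element is interesting:

  vsetvli   t0, a0, e8, m1   # SEW=8 elements
  vmv.v.i   v1, -1           # v1[i] = 0xff (255 unsigned)
  vmv.v.i   v2, 1            # v2[i] = 0x01
  vsaddu.vv v3, v1, v2       # 255 + 1 saturates: v3[i] = 0xff, vxsat set
  vsaddu.vi v4, v2, -1       # imm -1 (0x1f) sign-extends to 0xff, then is treated as
                             # unsigned 255, so 1 + 255 also saturates: v4[i] = 0xff, vxsat set

Under this reading the .vi and .vv forms agree, which resolves the conundrum in the original question.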
|
|
Signed v Unsigned Immediate: vsaddu.vi
From chapter 11, section 1 (#3):
The 5-bit immediate is unsigned when either providing a register index in vrgather or a count for shift, clip, or slide. In all other cases
it is signed and sign extended to SEW bits, even for bitwise and unsigned instructions, notably compare and add.
From chapter 13, section 1: Saturating forms of integer add and subtract are provided, for both signed and unsigned integers. If the result would overflow the destination, the result is replaced with the closest representable value, and the vxsat bit is set.
This results in a conundrum:

  operation   SEW   RS1     RS2
  vsaddu.vv   8     0x0ff   0x01
  vsaddu.vi   8     0x01f   0x01

These two operations now produce different results. Taking the maximum unsigned integer value and adding one causes saturation. The result value for the vector-vector operation would be 0xff and the VXSAT bit would be set. This shouldn't be a surprise. However, the immediate form is more difficult. The immediate value is sign-extended to SEW size and treated as a signed value. This means the arithmetic is now (-1) + 1 = 0. This does not create a saturation (a value outside expected return parameters). The result value from the vector-immediate operation would be 0x1f and the VXSAT bit would be clear.
This is from the specification, as written, in a strict sense.
From a use-case sense, what is trying to be accomplished here? Two counter perspectives:
1 - From a use-case perspective, why would a programmer or compiler specifically pick an unsigned operation, only to operate on values using a signed immediate in a signed format? I'm curious what this case is.
2 - From an architecture/implementation perspective, this is the first time that an engine will have to operate on an instruction differently based on the *source* of the operand. That is, more narrowly, the arithmetic engines are given an operation encoding (usually an "onto" mapping from the opcode space) and operands, but do not care where the operands came from. In other words, the vector engine itself would receive a full bit set in RS1 for both cases above, for a saturating unsigned (sorta) add. However, the outcome is required to be different?
I would imagine others have run into this situation, and I'd like to know both the intent of having a signed-immediate value for this unsigned operation, as well as the applicability of section 11.1 to this instruction.
|
|
Cancelling Vector TG meeting today

Krste Asanovic
Sorry for late notice, but I have to cancel the vector tech meeting today,
Krste
|
|
Re: GNU toolchain with RVV intrinsic support
Thank you for the clarification. Excellent.
On Mon, Aug 24, 2020, 17:35 Bruce Hoult, < bruce@...> wrote: On Tue, Aug 25, 2020 at 5:34 AM David Horner <ds2horner@...> wrote:
Thank you very much for this advancement.
I have two concerns, in the body is a response.
.
On 2020-08-21 9:34 a.m., Kito Cheng wrote:
I am pleased to announce that our/SiFive's RVV intrinsic enabled GCC is open-sourced now.
We put the sources on riscv's github, and the RVV intrinsics have been integrated in the riscv-gnu-toolchain, so you can build the RVV intrinsic enabled GNU toolchain as usual.
$ git clone git@...:riscv/riscv-gnu-toolchain.git -b rvv-intrinsic
$ <path-to-riscv-gnu-toolchain>/configure --with-arch=rv64gcv_zfh --prefix=<INSTALL-PATH>
$ make newlib build-qemu
$ cat rvv_vadd.c
>
> #include <riscv_vector.h>
> #include <stdio.h>
>
> void vec_add_rvv
Shouldn't this be vec_add32_rvv ? It is not a generalized vector
add.
The user can call functions anything they want. The example might be better if this was clear by calling it foo() or demo_vector_add() or something.
(int *a, int *b, int *c, size_t n) {
> size_t vl;
> vint32m2_t va, vb, vc;
> for (;vl = vsetvl_e32m2 (n);n -= vl) {
> vb = vle32_v_i32m2 (b);
> vc = vle32_v_i32m2 (c);
> va = vadd_vv_i32m2 (vb, vc);
> vse32_v_i32m2 (a, va);
> a += vl;
The vector pointer should be advanced by vl * 32.
The variable "a" in an "int *" pointer. When you add an integer to it C automatically scales the integer (vl) by sizeof(int).
|
|
Re: GNU toolchain with RVV intrinsic support

Bruce Hoult
On Tue, Aug 25, 2020 at 5:34 AM David Horner <ds2horner@...> wrote:
Thank you very much for this advancement.
I have two concerns, in the body is a response.
.
On 2020-08-21 9:34 a.m., Kito Cheng wrote:
I am pleased to announce that our/SiFive's RVV intrinsic enabled GCC is open-sourced now.
We put the sources on riscv's github, and the RVV intrinsics have been integrated in the riscv-gnu-toolchain, so you can build the RVV intrinsic enabled GNU toolchain as usual.
$ git clone git@...:riscv/riscv-gnu-toolchain.git -b rvv-intrinsic
$ <path-to-riscv-gnu-toolchain>/configure --with-arch=rv64gcv_zfh --prefix=<INSTALL-PATH>
$ make newlib build-qemu
$ cat rvv_vadd.c
>
> #include <riscv_vector.h>
> #include <stdio.h>
>
> void vec_add_rvv
Shouldn't this be vec_add32_rvv ? It is not a generalized vector
add.
The user can call functions anything they want. The example might be better if this was clear by calling it foo() or demo_vector_add() or something.
(int *a, int *b, int *c, size_t n) {
> size_t vl;
> vint32m2_t va, vb, vc;
> for (;vl = vsetvl_e32m2 (n);n -= vl) {
> vb = vle32_v_i32m2 (b);
> vc = vle32_v_i32m2 (c);
> va = vadd_vv_i32m2 (vb, vc);
> vse32_v_i32m2 (a, va);
> a += vl;
The vector pointer should be advanced by vl * 32.
The variable "a" in an "int *" pointer. When you add an integer to it C automatically scales the integer (vl) by sizeof(int).
|
|
Re: GNU toolchain with RVV intrinsic support
Thank you very much for this advancement.
I have two concerns, in the body is a response.
.
On 2020-08-21 9:34 a.m., Kito Cheng wrote:
I am pleased to announce that our/SiFive's RVV intrinsic enabled GCC is open-sourced now.
We put the sources on riscv's github, and the RVV intrinsics have been integrated in the riscv-gnu-toolchain, so you can build the RVV intrinsic enabled GNU toolchain as usual.
$ git clone git@...:riscv/riscv-gnu-toolchain.git -b rvv-intrinsic
$ <path-to-riscv-gnu-toolchain>/configure --with-arch=rv64gcv_zfh --prefix=<INSTALL-PATH>
$ make newlib build-qemu
$ cat rvv_vadd.c
>
> #include <riscv_vector.h>
> #include <stdio.h>
>
> void vec_add_rvv
Shouldn't this be vec_add32_rvv ? It is not a generalized vector
add.
(int *a, int *b, int *c, size_t n) {
> size_t vl;
> vint32m2_t va, vb, vc;
> for (;vl = vsetvl_e32m2 (n);n -= vl) {
> vb = vle32_v_i32m2 (b);
> vc = vle32_v_i32m2 (c);
> va = vadd_vv_i32m2 (vb, vc);
> vse32_v_i32m2 (a, va);
> a += vl;
The vector pointer should be advanced by vl * 32.
(I originally thought the vl = vsetvl may have done the by-32 scaling and that n was in bytes, but I have now convinced myself that the problem is likely the pointer advance, and that VLEN is at least 256 so there is only one pass of the loop for the test case below.)
> b += vl;
> c += vl;
> }
> }
>
> int x[10] = {1,2,3,4,5,6,7,8,9,0};
> int y[10] = {0,9,8,7,6,5,4,3,2,1};
> int z[10];
>
> int main()
> {
> int i;
> vec_add_rvv(z, x, y, 10);
> for (i=0; i<10; i++)
> printf ("%d ", z[i]);
> printf("\n");
> return 0;
> }
$ riscv64-unknown-elf-gcc rvv_vadd.c -O2
$ qemu-riscv64 -cpu rv64,x-v=true,vlen=256,elen=64,vext_spec=v1.0 a.out
It is verified with our internal testsuite and several internal projects; however, this project is still a work in progress, and we intend to improve the work continually. Feedback and bug reports are welcome, as well as contributions and pull requests.
Current status:
- ~95% of the RVV intrinsic functions listed in the intrinsic spec (https://github.com/riscv/rvv-intrinsic-doc) are implemented.
- FP16 is supported for both vector and scalar.
- fp16 uses __fp16 temporarily; this might change in future.
- Fractional LMUL is not implemented yet.
- RV32 is not well supported for scalar-vector operations with SEW=64.
- Function calls with vector types are not well supported yet; arguments will be passed/returned in memory in the current implementation.
- *NO* auto-vectorization support.
|
|
Re: V extension groups analogue to the standard groups

mark
Just a reminder that we will differentiate between branding (i.e. what we trademark and what members can advertise) and internal use (like uname in linux vs. splash screen, etc.).
The proposed policy is under review in the policies/proposed folder.
On Sun, Aug 23, 2020 at 3:26 PM Simon Davidmann Imperas < simond@...> wrote: thanks - I am OK with whichever you choose.
On Sat, Aug 22, 2020 at 12:30 AM Andrew Waterman < andrew@...> wrote:
On Fri, Aug 21, 2020 at 2:43 PM Simon Davidmann < simond@...> wrote: A question to clarify. You state: RV32IV doesn’t mandate any FP hardware in the vector unit, whereas RV32IFV means both scalar and vector support single-precision, etc.
This means if I understand you that we need to add F to get F hardware in the vector unit - so RV32IV means V with no F hardware, and RV32IFV includes F hardware.
So for consistency...
What does RV32IV mean for M hardware multiply - do I need RV32IMV to get scalar and vector hardware multiply?
I don’t believe the spec explicitly addresses this question, but I agree it makes sense. Alternatively, V could require M, since it doesn’t make much sense to pay for a vector unit but be too stingy to pay for a multiplier. But that might be less consistent. (My recommendation is that RV32IV continue to mean “no multiplier”, even though it’s a silly configuration.)
RV32IV means no F and no M hardware? - so I need to explicitly include the extensions I need as V assumes nothing but I?
My recommendation is to clarify in the spec that RV32IV is a valid config with no FPU in the vector unit, and RV32IFV is also a valid config with an FPU in both scalar and vector.
Or is something assumed for M?
If we choose to define that V implies M, RV32IV and RV32IMV would be synonyms.
thanks
On Thu, Aug 20, 2020 at 8:48 PM Andrew Waterman < andrew@...> wrote: Quad-widening ops have been moved to a separate extension, Zvqmac.
I believe the intent is that the capital-V V extension supports the same FP datatypes as the scalar ISA, so e.g., RV32IV doesn’t mandate any FP hardware in the vector unit, whereas RV32IFV means both scalar and vector support single-precision, etc.
I’m surprised all those hashtags made it past the spam filter! On Thu, Aug 20, 2020 at 11:42 AM Strauch, Tobias (HENSOLDT Cyber GmbH) < tobias.strauch@...> wrote:
Apologies if this is old stuff already dismissed. But I give it a try anyway.
Wouldn't it make sense to separate more complex vector instructions from more trivial ones? Already with the very first base release ? Vector instructions can also be helpful in small devices #IOT #Edge #GAP8
#RISCY without the need to fully support floating point instructions or without the need for a quad multiply.
The suggestion would be to basically group vector extensions analogue to the standard instructions (I, M, F, D, Q, …), instead of having an already complex base and then subtract or re-define subsets of instructions
again ?
Wouldn't that be in-line with the RISC-V philosophy of modularity and simplicity ? The beauty would be that you have a non-vector and a vector group version.
Possible nomenclature based on order:
M: Standard Multiply Divide Instructions (MUL, ...)
V: Very Basic Vector Instructions (VSETVL, ...)
MV: Standard Multiply Divide Instructions and Very Basic Vector Instructions (MUL, VSETVL, ...)
VM: Standard Multiply Divide Instructions, Very Basic Vector Instructions and Vector Integer Multiply/Divide Instructions (MUL, VSETVL, VMUL, ...)
F, D, Q analogue to M as suggested.
The V version will not be a 1:1 match with the standard version and will cover additional aspects. But it can be argued, that when you implement the V version (of M, F, D, Q, ...), then you most likely will have
the relevant standard counterparts implemented as well anyway.
Kind Regards, Tobias
|
|
Re: V extension groups analogue to the standard groups

Simon Davidmann Imperas
thanks - I am OK with whichever you choose.
On Sat, Aug 22, 2020 at 12:30 AM Andrew Waterman < andrew@...> wrote:
On Fri, Aug 21, 2020 at 2:43 PM Simon Davidmann < simond@...> wrote: A question to clarify. You state: RV32IV doesn’t mandate any FP hardware in the vector unit, whereas RV32IFV means both scalar and vector support single-precision, etc.
This means if I understand you that we need to add F to get F hardware in the vector unit - so RV32IV means V with no F hardware, and RV32IFV includes F hardware.
So for consistency...
What does RV32IV mean for M hardware multiply - do I need RV32IMV to get scalar and vector hardware multiply?
I don’t believe the spec explicitly addresses this question, but I agree it makes sense. Alternatively, V could require M, since it doesn’t make much sense to pay for a vector unit but be too stingy to pay for a multiplier. But that might be less consistent. (My recommendation is that RV32IV continue to mean “no multiplier”, even though it’s a silly configuration.)
RV32IV means no F and no M hardware? - so I need to explicitly include the extensions I need as V assumes nothing but I?
My recommendation is to clarify in the spec that RV32IV is a valid config with no FPU in the vector unit, and RV32IFV is also a valid config with an FPU in both scalar and vector.
Or is something assumed for M?
If we choose to define that V implies M, RV32IV and RV32IMV would be synonyms.
thanks
On Thu, Aug 20, 2020 at 8:48 PM Andrew Waterman < andrew@...> wrote: Quad-widening ops have been moved to a separate extension, Zvqmac.
I believe the intent is that the capital-V V extension supports the same FP datatypes as the scalar ISA, so e.g., RV32IV doesn’t mandate any FP hardware in the vector unit, whereas RV32IFV means both scalar and vector support single-precision, etc.
I’m surprised all those hashtags made it past the spam filter! On Thu, Aug 20, 2020 at 11:42 AM Strauch, Tobias (HENSOLDT Cyber GmbH) < tobias.strauch@...> wrote:
Apologies if this is old stuff already dismissed. But I give it a try anyway.
Wouldn't it make sense to separate more complex vector instructions from more trivial ones? Already with the very first base release ? Vector instructions can also be helpful in small devices #IOT #Edge #GAP8
#RISCY without the need to fully support floating point instructions or without the need for a quad multiply.
The suggestion would be to basically group vector extensions analogue to the standard instructions (I, M, F, D, Q, …), instead of having an already complex base and then subtract or re-define subsets of instructions
again ?
Wouldn't that be in-line with the RISC-V philosophy of modularity and simplicity ? The beauty would be that you have a non-vector and a vector group version.
Possible nomenclature based on order:
M: Standard Multiply Divide Instructions (MUL, ...)
V: Very Basic Vector Instructions (VSETVL, ...)
MV: Standard Multiply Divide Instructions and Very Basic Vector Instructions (MUL, VSETVL, ...)
VM: Standard Multiply Divide Instructions, Very Basic Vector Instructions and Vector Integer Multiply/Divide Instructions (MUL, VSETVL, VMUL, ...)
F, D, Q analogue to M as suggested.
The V version will not be a 1:1 match with the standard version and will cover additional aspects. But it can be argued, that when you implement the V version (of M, F, D, Q, ...), then you most likely will have
the relevant standard counterparts implemented as well anyway.
Kind Regards, Tobias
|
|
Re: V extension groups analogue to the standard groups

Krste Asanovic
Anybody is free to use any subset of supported instructions and element widths/types. The Z names can be extended down to individual instructions/widths if necessary. However, we have to guide the software ecosystem where to spend the available finite effort. So we choose and name some common combinations to inform software/tool providers what to support, and to enable compliance testing of those combinations. We can always add new Z names later for subsets that prove popular. This can happen after the instruction spec itself is ratified, in a much lighter-weight process.
Krste
On Sat, 22 Aug 2020 00:32:56 -0700, "Allen Baum" <allen.baum@...> said:
Works for me.
-Allen
On Aug 21, 2020, at 11:41 PM, Andrew Waterman <andrew@...> wrote:
It's OK for esoteric combinations to require long ISA strings, I think.
|
|