
official v0.8 release of vector spec reference simulator

Simon Davidmann Imperas
 

Krste, Roger

We have just released an update of our free riscvOVPsim reference simulator, version 20191217.0, and put it on GitHub: https://github.com/riscv/riscv-ovpsim.

riscvOVPsim supports the full, latest v0.8 release of the Vector instructions and is available now.

Our policy is to update our free reference simulator as soon as the vector specification changes, normally within 2-3 days of the spec update. [We need to do this because our customers who are creating silicon need an up-to-the-minute reference model for verification.]

riscvOVPsim is a complete envelope model of the full RISC-V 32/64 specification and is configured by command line options.

An example from its documentation for the vector extension (last month's spec changes) is:
Version 0.8
Stable 0.8 official release (commit 9a65519), with these changes compared to version 0.8-draft-20191118:
- vector context status in mstatus register is now implemented;
- whole register load and store operations have been restricted to a single register only;
- whole register move operations have been restricted to aligned groups of 1, 2, 4 or 8 registers only.
Version 0.8-draft-20191118
Stable 0.8 draft of November 18 2019, with these changes compared to version 0.8-draft-20191117:
- vsetvl/vsetvli with rd!=zero and rs1=zero sets vl to the maximum vector length.
Version 0.8-draft-20191117
Stable 0.8 draft of November 17 2019, with these changes compared to version 0.8-draft-20191004:
- indexed load/store instructions zero-extend offsets (previously, they were sign-extended);
- vslide1up/vslide1down instructions sign-extend XLEN values to SEW length (previously, they were zero-extended);
- vadc/vsbc instruction encodings require vm=0 (previously, they required vm=1);
- vmadc/vmsbc instruction encodings allow both vm=0, implying carry input is used, and vm=1, implying carry input is zero (previously, only vm=1 was permitted, implying carry input is used);
- vaaddu.vv, vaaddu.vx, vasubu.vv and vasubu.vx instructions added;
- vaadd.vv and vaadd.vx instruction encodings changed;
- vaadd.vi instruction removed;
- all widening saturating scaled multiply-add instructions removed;
- vqmaccu.vv, vqmaccu.vx, vqmacc.vv, vqmacc.vx, vqmaccsu.vv, vqmaccsu.vx and vqmaccus.vx instructions added;
- CSR vlenb added (vector register length in bytes);
- load/store whole register instructions added;
- whole register move instructions added.
Version 0.8-draft-20191004
Stable 0.8 draft of October 4 2019, with these changes compared to version 0.8-draft-20190906:
- vwsmaccsu and vwsmaccus instruction encodings exchanged.
etc...
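
The indexed load/store change listed above (offsets are now zero-extended rather than sign-extended) can be illustrated with a small model. This is only a sketch under stated assumptions: the function names are hypothetical, offsets are treated as SEW-bit values, and address wrap-around at XLEN is ignored. It is not taken from riscvOVPsim.

```python
def zero_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as an unsigned integer."""
    return value & ((1 << bits) - 1)


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def indexed_addresses(base: int, offsets: list, sew: int, zero_ext: bool = True) -> list:
    """Effective addresses for an indexed access (hypothetical helper).

    zero_ext=True models the behaviour described for 0.8-draft-20191117;
    zero_ext=False models the earlier sign-extending behaviour.
    """
    extend = zero_extend if zero_ext else sign_extend
    return [base + extend(off, sew) for off in offsets]


if __name__ == "__main__":
    base = 0x1000
    offsets = [0x04, 0xFC]  # 0xFC is 252 unsigned, -4 signed, for SEW=8
    print([hex(a) for a in indexed_addresses(base, offsets, sew=8)])
    print([hex(a) for a in indexed_addresses(base, offsets, sew=8, zero_ext=False)])
```

With SEW=8, the offset 0xFC reaches 0x10fc when zero-extended and 0xffc when sign-extended, so code relying on negative offsets would behave differently across the two drafts.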
For full documentation, please clone the repo or browse the simulator doc:  https://github.com/riscv/riscv-ovpsim/blob/master/doc/OVP_Model_Specific_Information_riscv_RV64GCV.pdf

Thanks for your interest.

Simon Davidmann
Imperas Software


Calling Convention for Vector ?

戎杰杰
 

Hi,

 Does anyone know of any additional ABI information (such as a calling convention)
 designed for the vector registers?


--Jojo


Re: Calling Convention for Vector ?

Andrew Waterman
 

There is a brief sketch of the Linux vector calling convention here:
https://github.com/riscv/riscv-v-spec/blob/master/calling-convention.adoc

Note this is the convention for normal C ABI calls; a separate convention will be adopted for vector millicode calls.



Re: Calling Convention for Vector ?

戎杰杰
 

Hi,

 Thanks for the pointer.

 It is clear and simple, but is there really no convention for vector arguments and function return values?
 Also, based on our long experience designing CPUs, there should be some
 callee-saved vector registers for performance across complicated function calls, right? :)

 Are there any considerations or details behind excluding things like vector arguments?


--Jojo


Re: Calling Convention for Vector ?

Earl Killian
 

Vectors are passed in memory and returned in memory. Vectors are arbitrary length, whereas the vector registers are fixed length, and can only be used to temporarily hold a portion of a memory vector. Thus it doesn’t make sense to pass or return things in vector registers, or to have the registers saved or restored as part of the calling convention.
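
The model described here, where a register only ever holds a slice of an arbitrary-length memory vector, is the usual strip-mining pattern. The following is a minimal sketch, assuming a hypothetical `VLMAX` of 8 elements and using Python lists to stand in for memory and vector registers; it is not code from the spec.

```python
VLMAX = 8  # assumed maximum number of elements a vector register can hold


def vector_add(a: list, b: list):
    """Add two arbitrary-length memory vectors by strip-mining.

    Returns the result vector and the list of vl values used per
    iteration, so the chunking is visible to the caller.
    """
    n = len(a)
    result = [0] * n
    chunk_lengths = []
    i = 0
    while i < n:
        vl = min(VLMAX, n - i)       # models the vl granted for this strip
        va = a[i:i + vl]             # models a vector load of vl elements
        vb = b[i:i + vl]
        result[i:i + vl] = [x + y for x, y in zip(va, vb)]  # add + store
        chunk_lengths.append(vl)
        i += vl
    return result, chunk_lengths


if __name__ == "__main__":
    out, chunks = vector_add(list(range(20)), [1] * 20)
    print(out)
    print(chunks)
```

A 20-element vector is processed as strips of 8, 8 and 4 elements; the caller passes only memory vectors and never sees the register contents.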

On Dec 24, 2019, at 4:46 AM (+0800), Andrew Waterman <andrew@...> wrote:
There is a brief sketch of the Linux vector calling convention here:
https://github.com/riscv/riscv-v-spec/blob/master/calling-convention.adoc

Note this is the convention for normal C ABI calls; a separate convention will be adopted for vector millicode calls.



Re: Calling Convention for Vector ?

Jim Wilson
 

On Thu, Dec 26, 2019 at 2:01 PM Earl Killian <earl.killian@...> wrote:
Vectors are passed in memory and returned in memory. Vectors are arbitrary length, whereas the vector registers are fixed length, and can only be used to temporarily hold a portion of a memory vector. Thus it doesn’t make sense to pass or return things in vector registers, or to have the registers saved or restored as part of the calling convention.
Some code will not want args in vector regs, so that we don't have to
save/restore them around calls. Some code will want args in vector
regs, so that they can have subroutines that operate on vectors. If
you have already loaded part of a vector into a vector register, it is
silly to send it back to memory just so you can call a function that
reads it back in. It is better to leave it in a register to reduce
memory bandwidth. So we need two calling conventions. Or
alternatively, one calling convention with optional vector support
that can be enabled only when needed. If you look at ARM SVE, you
will see that this is what they have done.

I think this is more complicated for rvv though as we have LMUL up to
8, which means we need 16 registers worst case for two arguments,
which will have to be v8-v15 or v16-v23 or v24-v31 because of alignment
issues. Plus we need v0 for an optional mask so we can't use v1-v7
for arguments. And vlen will have to be an implicit argument.
Someone will have to spend time doing experiments to see how well this
works in practice to make sure it is reasonable. And we will need a
reasonable compiler first before we can do experiments, which we don't
really have yet, and may not have for a while. Not to mention
hardware to test on. I think it will be a while before we can
formally specify a vector calling convention.
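
The register-pressure argument above can be checked with a short enumeration. This is an illustrative sketch only: `argument_groups` is a hypothetical helper, and it assumes 32 vector registers, groups aligned to LMUL, and that the group containing v0 is unavailable because v0 is kept for the mask.

```python
NUM_VREGS = 32  # v0..v31


def argument_groups(lmul: int, reserve_mask: bool = True) -> list:
    """List the LMUL-aligned register groups usable for arguments.

    Each group is returned as (first, last). When reserve_mask is True,
    the group that contains v0 is skipped so v0 stays free for a mask.
    """
    groups = []
    for first in range(0, NUM_VREGS, lmul):
        if reserve_mask and first == 0:
            continue  # this aligned group includes v0
        groups.append((first, first + lmul - 1))
    return groups


if __name__ == "__main__":
    lmul = 8
    groups = argument_groups(lmul)
    print(groups)
    print("registers needed for two arguments:", 2 * lmul)
```

At LMUL=8 only three groups remain (v8-v15, v16-v23, v24-v31), and two arguments already consume 16 of the 32 registers.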

Jim


Re: Calling Convention for Vector ?

戎杰杰
 

Hi,

 We ran into the problems you mention as well.

 Considering that some code will want arguments in vector registers, we studied the SVE
 vector register layout and configured our RISC-V vector register layout as follows:

 | Register | ABI Name | Description                      | Saver  |
 | v0-7     | v0-7     | Temporaries                      | Caller |
 | v8-15    | v8-15    | Function arguments/return values | Caller |
 | v16-23   | v16-23   | Function arguments               | Caller |
 | v24-31   | v24-31   | Saved registers                  | Callee |

 This layout handles cases such as the v0 mask register,
 and it lets us use 16 registers for two arguments at LMUL=8.
 We can write a draft to improve the calling convention with arguments in vector registers :)


--Jojo


Re: Calling Convention for Vector ?

Andrew Waterman
 

Providing callee-saved vector registers in the regular C calling convention might actually degrade performance, as most vector computation is done in leaf functions or in strip-mine loops that don't call functions.  Functions that want to use all the vector registers will have to spill some callee-saved registers, even if the callee-saved registers aren't providing much benefit.

By contrast, the vector millicode calling convention (for routines like element-wise transcendentals) would likely benefit from an alternate calling convention that has some callee-saved vector registers.




Re: Calling Convention for Vector ?

Andy Glew Si5
 

Oh, heck [*]:

 

Callee saved registers of any form can have bad performance where there is a potential partial register issue. E.g. on an out of order machine with register renaming. Although even some simple non-out of order microarchitectures benefit from register renaming.

 

RISC-V vectors have partial register issues due to masks and vector length.

 

(Note *: I sent something like this email to Andrew, since I was chicken to talk to the list. Embarrassingly, justifying my cowardice, I flipped a bit between callee and caller saved registers in that original email. It's callee save that has partial register issues. Andrew reminded me about vector masks as a cause of partial register issues, which I should've known about if my brain had been working right, and told me about vector length as a cause of partial register issues in RISC-V, which I should've realized but admittedly have not worked on a vector length architecture in many years.)

 



Slidedown overlapping of dest and source registers

Thang Tran
 

The slideup instruction has this restriction:

The destination vector register group for vslideup cannot overlap the source vector register group or the mask register, otherwise an illegal instruction exception is raised.

The slidedown instruction has a different restriction:

The destination vector register group cannot overlap the mask register if LMUL>1, otherwise an illegal instruction exception is raised.

Allowing the source and destination registers to overlap assumes a particular implementation, which is inflexible. I think the slidedown instruction should have the same restriction as slideup: no overlap between source and destination registers.

Thanks, Thang


Re: Slidedown overlapping of dest and source registers

Andrew Waterman
 

It's important that the slidedown instruction can overwrite its source operand.  Debuggers will use this feature to populate a vector register in-place without clobbering other architectural state.





Re: Slidedown overlapping of dest and source registers

Thang Tran
 

Hi Andrew,

I do not understand your statement. Why is it important? Why is it different from slideup?

For slideup, the destination cannot overlap the source because the destination write would clobber the source register before the source operand is read.

The slidedown instruction should be the same, because my implementation would write to the source register before the source operand is read. Allowing the source and destination registers to overlap assumes a particular implementation of slidedown, which does not suit other implementations.

 

Thanks, Thang

 



Re: Slidedown overlapping of dest and source registers

Guy Lemieux
 

Hi Thang,

I think Andrew is suggesting that the vslideup restriction is there to
allow some flexibility with implementations. However, one of
(vslideup/vslidedown) needs to allow the same source/dest register
(group) because the debugger is going to use this feature to inject
new data without clobbering other vector registers.

I believe most implementations iterating over a vector will be
incrementing the element index -- this allows vslidedown to safely
clobber earlier elements (higher index values are being read out while
lower index values are being written, so the lower index values will
have been previously read and the elements are in-transit in the
pipeline). If your vector implementation is decrementing the element
index, then you couldn't allow src/dst overlap with vslidedown, but
you could allow it with vslideup. Hence, there is an implicit
assumption here about implementations (i.e., counting up is preferred, or
else you have to buffer the whole vector register group).
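
The ordering argument above can be made concrete with a toy model. This is a sketch under assumptions, not a description of any real implementation: elements are processed strictly one at a time with no buffering, vl is taken to equal the register length, and source indices past the end read as zero.

```python
def slidedown_in_place(reg: list, offset: int, ascending: bool = True) -> list:
    """Model vslidedown with destination == source, one element at a time.

    reg[i] receives reg[i + offset]; indices past the end read as 0.
    `ascending` selects whether the element index counts up or down.
    """
    vl = len(reg)
    order = range(vl) if ascending else range(vl - 1, -1, -1)
    for i in order:
        src = i + offset
        reg[i] = reg[src] if src < vl else 0
    return reg


if __name__ == "__main__":
    data = [10, 11, 12, 13, 14, 15]
    print(slidedown_in_place(list(data), 2, ascending=True))
    print(slidedown_in_place(list(data), 2, ascending=False))
```

Counting up, every source element is read before it is overwritten, so the in-place result is correct. Counting down, the high elements are zeroed first and those zeros then propagate into every lower element, which is the clobbering the message describes.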

I'm not sure how the debugger would be using this feature, but if I
had to guess, I think the debugger would actually be using vslide1down
(not vslidedown) to inject data into a vector. So, perhaps the
overlapping src/dst requirement should only be for vslide1down? As an
alternative, there are also various vmv instructions that could
be used by the debugger which move one element at a time and do allow
overlapping src/dst. I don't think debugger performance is crucial.

Guy



Re: Slidedown overlapping of dest and source registers

Thang Tran
 

Thanks, Guy, for the explanation, but my implementation uses both: an incrementing element index for slideup and a decrementing element index for slidedown (a symmetrical implementation, and the simplest from my point of view).

I have no issue with dest/source registers overlapping for slide1down and slide1up. As you suggested, these can be used for debugging.

Thanks, Thang



Re: Slidedown overlapping of dest and source registers

Guy Lemieux
 

Thanks Guy for the explanation, but my implementation is both incrementing element index for slideup and decrementing element index for slidedown (which is symmetrical implementation and simplest from my point of view).
I'm curious why you chose to be symmetrical (there is no need), and why you
decided on incrementing for slideup and decrementing for slidedown (I would do
the opposite).

By incrementing for vslidedown, and decrementing for vslideup, it
eliminates the race condition in both directions and allows
overlapping src/dst for both.
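
The slideup half of this claim can be sketched the same way. The model below is hypothetical: it processes one element at a time with no buffering, takes vl equal to the register length, and leaves elements below the offset unchanged. It describes the implementation choice discussed here, not what the spec permits.

```python
def slideup_in_place(reg: list, offset: int, ascending: bool) -> list:
    """Model vslideup with destination == source, one element at a time.

    reg[i] receives reg[i - offset] for i >= offset; lower elements
    are left unchanged. `ascending` selects the iteration direction.
    """
    vl = len(reg)
    if ascending:
        order = range(offset, vl)
    else:
        order = range(vl - 1, offset - 1, -1)
    for i in order:
        reg[i] = reg[i - offset]
    return reg


if __name__ == "__main__":
    data = [10, 11, 12, 13, 14, 15]
    print(slideup_in_place(list(data), 2, ascending=False))
    print(slideup_in_place(list(data), 2, ascending=True))
```

Counting down gives the expected shifted result, while counting up re-reads elements that were already overwritten and repeats the first `offset` values across the register.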

However, by supporting both incrementing and decrementing, you are
adding extra hardware that isn't strictly necessary.

Guy


Re: Slidedown overlapping of dest and source registers

Andrew Waterman
 



On Tue, Jan 28, 2020 at 1:40 PM Guy Lemieux <glemieux@...> wrote:
Hi Thang,

I think Andrew is suggesting that the vslideup restriction is there to
allow some flexibility with implementations. However, one of
(vslideup/vslidedown) needs to allow the same source/dest register
(group) because the debugger is going to use this feature to inject
new data without clobbering other vector registers.

I believe most implementations iterating over a vector will be
incrementing the element index -- this allows vslidedown to safely
clobber earlier elements (higher index values are being read out while
lower index values are being written, so the lower index values will
have been previously read and the elements are in-transit in the
pipeline). If your vector implementation is decrementing the element
index, then you couldn't allow src/dst overlap with vslidedown, but
you could allow it with vslideup. Hence, there is an implicit
assumption here about implementations (ie, count up is preferred, or
else you have to buffer the whole vector register group).

I'm not sure how the debugger would be using this feature, but if I
had to guess, I think the debugger would actually be using vslide1down
(not vslidedown) to inject data into a vector. So, perhaps the
overlapping src/dst requirement should only be for vslide1down? Also,
as an alternative, there are also various vmv instructions that could
be used by the debugger which move one element at a time and do allow
overlapping src/dst. I don't think debugger performance is crucial.

Oops, yes, I meant vslide1down.

Using vslide1down isn't about performance; it's the only way I know of for the debugger to construct a vector without additional storage.  The alternative would have been to add an instruction to insert an element into an arbitrary element position, which for various reasons was deemed a less-preferable alternative.
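
How a debugger might build a vector this way can be sketched as follows. This is an assumption-laden model, not the debug specification: `vslide1down_in_place` and `debugger_write_register` are hypothetical names, vl is taken to equal the register length, and the scalar stands in for the value supplied from an integer register.

```python
def vslide1down_in_place(reg: list, scalar: int) -> None:
    """Model vslide1down with destination == source.

    Every element moves down one position and the scalar is written
    into the highest element.
    """
    vl = len(reg)
    for i in range(vl - 1):
        reg[i] = reg[i + 1]
    reg[vl - 1] = scalar


def debugger_write_register(reg: list, values: list) -> None:
    """Populate a vector register using only repeated vslide1down.

    After len(reg) slides, the old contents have been shifted out
    and reg holds `values` in order; no other register is touched.
    """
    assert len(values) == len(reg)
    for v in values:
        vslide1down_in_place(reg, v)


if __name__ == "__main__":
    vreg = [9, 9, 9, 9]  # stale contents
    debugger_write_register(vreg, [1, 2, 3, 4])
    print(vreg)
```

Each slide consumes one scalar, so writing a register of vl elements takes vl instructions and needs no scratch vector register or memory.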


Guy


On Tue, Jan 28, 2020 at 12:42 PM Thang Tran <thang@...> wrote:
>
> Hi Andrew,
>
> I do not understand your statement. Why is it important? Why is the difference with slideup?
>
>
>
> The slideup cannot clobber the source operand with destination operand because the destination register writes to source register before the source operand is read.
>
>
>
> The slidedown instruction should be the same because my implementation would writes to the source register before the source operand is read. The allowed overlapping of source & destination registers assumes a certain implementation of slidedown which is not good for other people.
>
>
>
> Thanks, Thang
>
>
>
> From: Andrew Waterman [mailto:andrew@...]
> Sent: Tuesday, January 28, 2020 11:23 AM
> To: Thang Tran <thang@...>
> Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
> Subject: Re: [RISC-V] [tech-vector-ext] Slidedown overlapping of dest and source regsiters
>
>
>
> It's important that the slidedown instruction can overwrite its source operand.  Debuggers will use this feature to populate a vector register in-place without clobbering other architectural state.
>
>
>
> On Tue, Jan 28, 2020 at 10:59 AM Thang Tran <thang@...> wrote:
>
> The slideup instruction has this restriction:
>
> The destination vector register group for vslideup cannot overlap the source vector register group or the mask register, otherwise an illegal instruction exception is raised.
>
> The slidedown instruction has a different restriction:
>
> The destination vector register group cannot overlap the mask register if LMUL>1, otherwise an illegal instruction exception is raised.
>
> Allowing the source and destination registers to overlap assumes a particular implementation, which is inflexible. I think that the slidedown instruction should have the same restriction of non-overlapping source and destination registers.
>
> Thanks, Thang
>


Minutes from 2020/1/24 meeting

Krste Asanovic
 

Date: 2020/1/24
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~20
github: https://github.com/riscv/riscv-v-spec

Outstanding items from prior meeting:
#362/#354, #341, #348, #347, #335, #326, #318, #235

Current items:
#362/#354, #341

New actionable items to be addressed in next meeting:

Is group tracking to roadmap? y

The main discussion in the group that occupied most of the time was
around #362/#354, whether to drop the fixed-size vector load/store
instructions, leaving only the SEW-sized load/store instructions.
Dropping these instructions would save considerable complexity in
memory pipelines. However, dropping support would also require
execution of additional instructions in some common cases. A remedy
would be to add more widening or quad-widening (quadening) compute
instructions to reduce this impact.

During the call a second issue was raised. The current design uses
constant SEW/LMUL ratios to align data types of different element
widths. If only SEW-sized load/stores were available, then a
computation using a mixture of element widths would have to use larger
LMUL for larger SEW values, which effectively reduces the number of
available registers and so increases register pressure. The fixed
width load/stores allow, e.g., a byte to be loaded into a vector
register with four-byte width with LMUL=1 so avoids this issue.
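As a rough illustration of the register-pressure point, the following sketch works through the arithmetic, assuming VLMAX = VLEN * LMUL / SEW, 32 vector registers, and an illustrative VLEN of 128 bits. The helper names are hypothetical.

```python
NUM_VREGS = 32


def vlmax(vlen, sew, lmul):
    """Maximum number of elements in one register group."""
    return vlen * lmul // sew


def lmul_for_constant_ratio(sew, base_sew=8, base_lmul=1):
    """LMUL needed at a wider SEW to keep the same SEW/LMUL ratio (same VLMAX)."""
    return base_lmul * sew // base_sew


def register_groups(lmul):
    """How many register groups of this LMUL fit in the register file."""
    return NUM_VREGS // lmul


VLEN = 128  # assumed for illustration
lmul32 = lmul_for_constant_ratio(32)
print("SEW=8,  LMUL=1 : VLMAX =", vlmax(VLEN, 8, 1), "groups =", register_groups(1))
print("SEW=32, LMUL=%d : VLMAX =" % lmul32, vlmax(VLEN, 32, lmul32), "groups =", register_groups(lmul32))
print("SEW=32, LMUL=1 : VLMAX =", vlmax(VLEN, 32, 1), "groups =", register_groups(1))
```

With SEW-sized loads only, matching the element count of the 8-bit data needs LMUL=4 for the 32-bit data, leaving 8 register groups. A fixed-width byte load into SEW=32 elements can stay at LMUL=1 and keep all 32 registers, at a shorter vector length.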

No resolution was reached, but this was to be studied further.

The second point that was quickly discussed at the end of the meeting
was the proposed separation of a vcsr from fcsr (#341). There was
little obvious enthusiasm in favor of one choice over the other, and
so the design is left as is for now.


RISC-V Vector Task Group: fractional LMUL

Krste Asanovic
 

In the last meeting, we discussed a problem that would be introduced
if we were to drop the fixed-size (b/h/w) variants of the vector
load/stores and only have the SEW-size (e) variants. From the
minutes:

    The current design uses constant SEW/LMUL ratios to align data
    types of different element widths. If only SEW-sized load/stores
    were available, then a computation using a mixture of element
    widths would have to use larger LMUL for larger SEW values, which
    effectively reduces the number of available registers and so
    increases register pressure. The fixed width load/stores allow,
    e.g., a byte to be loaded into a vector register with four-byte
    width with LMUL=1 so avoids this issue.

Consider the case of a byte (8b) load into a word (32b) register.
The effect of a byte load is to use only one quarter of the bits in a
register, with widening to replicate zero/sign bits into the other
bits of the register.

A different strategy to use a portion of the bits in a vector register
would be to add the concept of a fractional LMUL, i.e., LMUL=1/2, 1/4,
1/8. This has the effect of supporting a given SEW/LMUL ratio with
smaller LMUL values. This can be done without adding additional state
to the machine, but only by adding a new variant of vsetvli that sets
vl according to a shorter VLMAX calculated with the appropriate
reduction in VLEN.

E.g.,    vsetvli rd, rs1, e8,f2    # LMUL=1/2
         vsetvli rd, rs1, e8,f4    # LMUL=1/4
         vsetvli rd, rs1, e8,f8    # LMUL=1/8

These instructions leave LMUL=1 in vtype, and the machine executes
subsequent instructions as before, except that vl will be shorter.
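The shorter VLMAX can be sketched numerically. This assumes VLMAX = (VLEN * LMUL) / SEW with a fractional LMUL, the simplified rule vl = min(AVL, VLMAX), and an illustrative VLEN of 256 bits; the function names are hypothetical and are not part of any specification.

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul):
    """VLMAX for a given (possibly fractional) LMUL."""
    return int(Fraction(vlen_bits) * lmul / sew_bits)


def vsetvli(avl, vlen_bits, sew_bits, lmul=Fraction(1)):
    """Simplified model: vl is the requested length capped at VLMAX."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


VLEN = 256  # assumed for illustration
for name, lmul in (("f2", Fraction(1, 2)), ("f4", Fraction(1, 4)), ("f8", Fraction(1, 8))):
    print("e8,%s: VLMAX = %d" % (name, vlmax(VLEN, 8, lmul)))

# e8,f4 gives the same VLMAX as e32 with LMUL=1, i.e. the same SEW/LMUL ratio
print(vlmax(VLEN, 8, Fraction(1, 4)) == vlmax(VLEN, 32, Fraction(1)))
```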

The same effect could be achieved without any new ISA instructions by
performing vsetvli with the widest SEW to set vl, then repeating vsetvli
with rd=x0,rs1=x0 to keep this vl value. However, this would add an
additional instruction in the general case (sometimes the widest
operation isn't naturally the first in a loop). In other cases vl
is fixed throughout a loop and the code can be arranged so that the
first setvl uses the widest SEW, so the additional instruction can be
avoided.
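A quick check of the claimed equivalence, as a sketch only: it assumes vl = min(AVL, VLEN/SEW) at LMUL=1, that vsetvli with rd=x0,rs1=x0 changes SEW while keeping vl, and an illustrative VLEN of 256 bits. The function names are hypothetical.

```python
from fractions import Fraction

VLEN = 256  # assumed for illustration


def two_step_vl(avl, widest_sew, narrow_sew):
    """vsetvli with the widest SEW, then vsetvli x0,x0 with the narrow SEW keeping vl."""
    vl = min(avl, VLEN // widest_sew)
    # The kept vl must still be legal at the narrow SEW with LMUL=1.
    assert vl <= VLEN // narrow_sew
    return vl


def fractional_vl(avl, widest_sew, narrow_sew):
    """Single vsetvli at the narrow SEW with LMUL = narrow_sew / widest_sew."""
    lmul = Fraction(narrow_sew, widest_sew)
    return min(avl, int(VLEN * lmul) // narrow_sew)


# Both approaches give the same vl for every requested length tried here.
print(all(two_step_vl(avl, 32, 8) == fractional_vl(avl, 32, 8) for avl in range(40)))
```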

With or without new vsetvli implementation for fractional LMUL, there
are still more dynamic instructions required in general than the
fixed-size loads into SEW elements, which don't need to change SEW.

We can discuss further in the next task group meeting tomorrow.
Members can find login details on the members task group calendar.

Krste


Re: RISC-V Vector Task Group: fractional LMUL

Nick Knight
 

Could a fractional LMUL circumvent the constraint that a widening instruction's output register group must be larger than its input register group(s)?

--Nick Knight


Re: RISC-V Vector Task Group: fractional LMUL

Krste Asanovic
 

I'm realizing the idea doesn't quite work unless the machine actually
stores the fractional LMUL value (to cope with SLEN and widening
instruction behavior), and so it would need the new instruction and an
extra bit of state in vtype.

But with this change, yes.

For floating-point code, where there is no way to load a narrower type
into a wider element, this can be used to increase the number of
registers of a wider width. E.g., a matrix multiply accumulating
double-precision values that are the product of single-precision
values, using widening muladds, can be arranged as something like:

    vsetvli x0, x0, e32,f2      # Fractional Lmul
    vle.v v0, (bmatp)           # Get row of matrix
    flw f1, (x15)               # Get scalar
    vfwmacc.vf v1, f1, v0       # One row of outer-product
    add x15, x15, acolstride    # Bump pointer
    ...
    flw f31, (x15)              # Get scalar
    vfwmacc.vf v31, f31, v0     # Last row of outer-product
    ...

which is holding 31 rows of the destination matrix accumulators as
doubles while performing widening muladds from a single-precision
vector load held in v0. This is probably overkill register blocking
for this particular example, but shows the general improvement.
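The register-count benefit can be estimated with a short sketch. It assumes 32 vector registers, a source operand held in the group starting at v0, a widening destination with twice the source LMUL, destination groups aligned to their size, and no overlap allowed between destination and source; these rules are stated here as assumptions, and the function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def widening_accumulators(src_lmul, num_regs=32):
    """Count destination groups usable as accumulators when the source sits at v0."""
    dst_lmul = 2 * src_lmul
    src_regs = max(1, ceil(src_lmul))   # a fractional-LMUL source still occupies one register
    dst_regs = max(1, ceil(dst_lmul))
    # Destination groups start at multiples of their size and must not overlap v0..v(src_regs-1).
    return sum(1 for start in range(0, num_regs, dst_regs) if start >= src_regs)


print("source LMUL=1/2:", widening_accumulators(Fraction(1, 2)), "accumulators")
print("source LMUL=1:  ", widening_accumulators(Fraction(1)), "accumulators")
```

Under these assumptions the fractional-LMUL source leaves 31 accumulator registers, matching the v1 to v31 usage in the example, while a source at LMUL=1 leaves 15 two-register accumulator groups.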

Krste


On Thu, 6 Feb 2020 22:05:50 -0800, "Nick Knight" <nick.knight@...> said:
| Could a fractional LMUL circumvent the constraint that a widening instruction's output register group must be larger than its input
| register group(s)?

| --Nick Knight
