official v0.8 release of vector spec reference simulator
Krste, Roger,

We have just released an update of our free riscvOVPsim reference simulator, version 20191217.0, and put it on https://github.com/riscv/riscv-ovpsim. riscvOVPsim supports the full latest Vector instruction v0.8 release and is available now. Our policy is to update our free reference simulator as soon as the vector specification updates, normally 2-3 days after the vector spec is updated. [We need to do this as our customers who are creating silicon need an up-to-the-minute reference model for verification.]

riscvOVPsim is a complete envelope model of the full RISC-V 32/64 specification and is configured by command line options. An example from its documentation for the vectors (last month's spec changes):

Version 0.8: Version 0.8-draft-20191118. Stable 0.8 draft of November 18 2019, with these changes compared to version 0.8-draft-20191117: ...
Version 0.8-draft-20191117: Stable 0.8 draft of November 17 2019, with these changes compared to version 0.8-draft-20191004: ...
Version 0.8-draft-20191004: Stable 0.8 draft of October 4 2019, with these changes compared to version 0.8-draft-20190906: ...
etc.

For full documentation, please clone the repo or browse the simulator doc: https://github.com/riscv/riscv-ovpsim/blob/master/doc/OVP_Model_Specific_Information_riscv_RV64GCV.pdf

Thanks for your interest,
Simon Davidmann, Imperas Software |
|
Calling Convention for Vector ?
戎杰杰
Hi,
Does anyone know of additional ABI information (such as the calling convention)
for the vector registers?
--Jojo
|
|
Re: Calling Convention for Vector ?
Andrew Waterman
There is a brief sketch of the Linux vector calling convention here: Note this is the convention for normal C ABI calls; a separate convention will be adopted for vector millicode calls. On Mon, Dec 23, 2019 at 2:12 PM 戎杰杰 <jiejie.rjj@...> wrote:
|
|
Re: Calling Convention for Vector ?
戎杰杰
Hi,
Thanks for the pointer.
It's clear and simple, but there is no convention for vector arguments and return values of a function?
Also, from our long experience designing CPUs, there should be some
callee-saved vector registers to improve performance across complicated function calls, right? :)
Are there any considerations or details behind excluding things like vector arguments?
--Jojo
On Dec 24, 2019 at 4:46 AM +0800, Andrew Waterman <andrew@...> wrote:
|
|
Re: Calling Convention for Vector ?
Earl Killian
Vectors are passed in memory and returned in memory. Vectors are arbitrary length, whereas the vector registers are fixed length, and can only be used to temporarily hold a portion of a memory vector. Thus it doesn’t make sense to pass or return things in vector registers, or to have the registers saved or restored as part of the calling convention.
On Dec 25, 2019, at 20:17, 戎杰杰 <jiejie.rjj@...> wrote: |
|
Re: Calling Convention for Vector ?
On Thu, Dec 26, 2019 at 2:01 PM Earl Killian <earl.killian@...> wrote:
> Vectors are passed in memory and returned in memory. Vectors are arbitrary length, whereas the vector registers are fixed length, and can only be used to temporarily hold a portion of a memory vector. Thus it doesn't make sense to pass or return things in vector registers, or to have the registers saved or restored as part of the calling convention.

Some code will not want args in vector regs, so that we don't have to save/restore them around calls. Some code will want args in vector regs, so that they can have subroutines that operate on vectors. If you have already loaded part of a vector into a vector register, it is silly to send it back to memory just so you can call a function that reads it back in. It is better to leave it in a register to reduce memory bandwidth. So we need two calling conventions. Or alternatively, one calling convention with optional vector support that can be enabled only when needed. If you look at ARM SVE, you will see that this is what they have done.

I think this is more complicated for RVV though, as we have LMUL up to 8, which means we need 16 registers worst case for two arguments, which will have to be v8-v15 or v16-v23 or v24-v31 because of alignment issues. Plus we need v0 for an optional mask, so we can't use v1-v7 for arguments. And vlen will have to be an implicit argument.

Someone will have to spend time doing experiments to see how well this works in practice to make sure it is reasonable. And we will need a reasonable compiler first before we can do experiments, which we don't really have yet, and may not have for a while. Not to mention hardware to test on. I think it will be a while before we can formally specify a vector calling convention.

Jim |
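Jim's alignment point can be sketched numerically. The following Python sketch is my own illustration (not from the thread): under the RVV grouping rules, a register group of LMUL registers must start at a register number that is a multiple of LMUL, and reserving v0 for the optional mask rules out the group that contains it.

```python
# Illustrative sketch (assumptions mine): enumerate the register groups
# available for vector arguments under RVV register-group alignment rules.

def aligned_groups(lmul, num_regs=32):
    """Base registers of all LMUL-aligned register groups."""
    return list(range(0, num_regs, lmul))

def argument_groups(lmul, num_regs=32):
    """Groups usable for arguments: aligned groups that do not contain
    v0, which is reserved for the optional mask."""
    return [b for b in aligned_groups(lmul, num_regs) if b != 0]

# With LMUL=8, two arguments must use two of the three groups
# v8-v15, v16-v23, v24-v31.
print(argument_groups(8))  # -> [8, 16, 24]
```

At LMUL=8 only three aligned groups remain once v0-v7 is excluded, which is exactly the "v8-v15 or v16-v23 or v24-v31" constraint Jim describes.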
|
Re: Calling Convention for Vector ?
戎杰杰
Hi,
We ran into some of the same problems you mention.
Since some code will want arguments in vector registers, we studied the SVE
vector-register layout and configured our RISC-V vector-register layout as follows:
| Registers | Purpose                           | Saver  |
| v0-v7     | Temporaries                       | Caller |
| v8-v15    | Function arguments/return values  | Caller |
| v16-v23   | Function arguments                | Caller |
| v24-v31   | Saved registers                   | Callee |
This configuration keeps v0 fixed as the mask register,
and we can still use 16 registers for two arguments at LMUL=8.
We could put together a draft to improve the calling convention with arguments in vector registers :)
--Jojo
On Dec 28, 2019 at 12:12 AM +0800, Jim Wilson <jimw@...> wrote: On Thu, Dec 26, 2019 at 2:01 PM Earl Killian <earl.killian@...> wrote: |
|
Re: Calling Convention for Vector ?
Andrew Waterman
Providing callee-saved vector registers in the regular C calling convention might actually degrade performance, as most vector computation is done in leaf functions or in strip-mine loops that don't call functions. Functions that want to use all the vector registers will have to spill some callee-saved registers, even if the callee-saved registers aren't providing much benefit. By contrast, the vector millicode calling convention (for routines like element-wise transcendentals) would likely benefit from an alternate calling convention that has some callee-saved vector registers. On Mon, Jan 13, 2020 at 12:35 AM 戎杰杰 <jiejie.rjj@...> wrote:
|
|
Re: Calling Convention for Vector ?
Andy Glew Si5
Oh, heck [*]:
Callee-saved registers of any form can have bad performance where there is a potential partial-register issue, e.g. on an out-of-order machine with register renaming (though even some simple non-out-of-order microarchitectures benefit from register renaming).
RISC-V vectors have partial register issues due to masks and vector length.
(Note *: I sent something like this email to Andrew, since I was chicken to talk to the list. Embarrassingly, justifying my cowardice, I flipped a bit between callee and caller saved registers in that original email. It's callee save that has partial register issues. Andrew reminded me about vector masks as a cause of partial register issues, which I should've known about if my brain had been working right, and told me about vector length as a cause of partial register issues in RISC-V, which I should've realized but admittedly have not worked on a vector length architecture in many years.)
From: tech-vector-ext@... <tech-vector-ext@...> On Behalf Of Andrew Waterman
Providing callee-saved vector registers in the regular C calling convention might actually degrade performance, as most vector computation is done in leaf functions or in strip-mine loops that don't call functions. Functions that want to use all the vector registers will have to spill some callee-saved registers, even if the callee-saved registers aren't providing much benefit.
By contrast, the vector millicode calling convention (for routines like element-wise transcendentals) would likely benefit from an alternate calling convention that has some callee-saved vector registers.
On Mon, Jan 13, 2020 at 12:35 AM 戎杰杰 <jiejie.rjj@...> wrote:
|
|
Slidedown overlapping of dest and source registers
Thang Tran
The slideup instruction has this restriction:
> The destination vector register group for vslideup cannot overlap the source vector register group or the mask register, otherwise an illegal instruction exception is raised.

The slidedown instruction has a different restriction:

> The destination vector register group cannot overlap the mask register if LMUL>1, otherwise an illegal instruction exception is raised.

Allowing overlap of the source and destination registers assumes the implementation works in a certain way, which is inflexible. I think that the slidedown instruction should have the same restriction of non-overlapping source and destination registers.

Thanks, Thang |
|
Re: Slidedown overlapping of dest and source registers
Andrew Waterman
It's important that the slidedown instruction can overwrite its source operand. Debuggers will use this feature to populate a vector register in-place without clobbering other architectural state. On Tue, Jan 28, 2020 at 10:59 AM Thang Tran <thang@...> wrote: The slideup instruction has this restriction: |
|
Re: Slidedown overlapping of dest and source registers
Thang Tran
Hi Andrew, I do not understand your statement. Why is it important? Why the difference from slideup?
For slideup, the destination cannot overlap the source operand because destination elements are written before the corresponding source elements are read.
The slidedown instruction should be the same, because my implementation would write the destination register before the source operand is read. Allowing overlap of the source and destination registers assumes a particular implementation of slidedown, which is not good for other implementers.
Thanks, Thang
From: Andrew Waterman [mailto:andrew@...]
Sent: Tuesday, January 28, 2020 11:23 AM To: Thang Tran <thang@...> Cc: Krste Asanovic <krste@...>; tech-vector-ext@... Subject: Re: [RISC-V] [tech-vector-ext] Slidedown overlapping of dest and source registers
It's important that the slidedown instruction can overwrite its source operand. Debuggers will use this feature to populate a vector register in-place without clobbering other architectural state.
On Tue, Jan 28, 2020 at 10:59 AM Thang Tran <thang@...> wrote:
|
|
Re: Slidedown overlapping of dest and source registers
Guy Lemieux
Hi Thang,
I think Andrew is suggesting that the vslideup restriction is there to allow some flexibility in implementations. However, one of vslideup/vslidedown needs to allow the same source/dest register (group), because the debugger is going to use this feature to inject new data without clobbering other vector registers.

I believe most implementations iterating over a vector will be incrementing the element index -- this allows vslidedown to safely clobber earlier elements (higher index values are being read out while lower index values are being written, so the lower index values will have been previously read, and the elements are in transit in the pipeline). If your vector implementation is decrementing the element index, then you couldn't allow src/dst overlap with vslidedown, but you could allow it with vslideup. Hence, there is an implicit assumption here about implementations (i.e., counting up is preferred, or else you have to buffer the whole vector register group).

I'm not sure how the debugger would be using this feature, but if I had to guess, I think the debugger would actually be using vslide1down (not vslidedown) to inject data into a vector. So, perhaps the overlapping src/dst requirement should only be for vslide1down? Also, as an alternative, there are various vmv instructions that could be used by the debugger, which move one element at a time and do allow overlapping src/dst. I don't think debugger performance is crucial.

Guy

On Tue, Jan 28, 2020 at 12:42 PM Thang Tran <thang@...> wrote:
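Guy's ordering argument can be shown with a toy model. This Python sketch is my own illustration (element semantics simplified; I assume zero-fill for the vacated tail positions, purely for illustration): iterating the element index upward means every source element is read before the write to that position happens, so an in-place vslidedown is safe.

```python
# Toy model (assumptions mine): in-place vslidedown with an
# increasing element index. Element i of the result is element
# i+offset of the source; since i < i+offset, each source element
# is read before its slot is overwritten. A decreasing-index
# implementation would clobber sources before reading them.

def vslidedown_in_place(v, offset):
    """Slide elements down by `offset`, writing into the same storage.
    Vacated tail positions are zero-filled here for illustration."""
    vl = len(v)
    for i in range(vl):                      # increasing index order
        v[i] = v[i + offset] if i + offset < vl else 0
    return v

print(vslidedown_in_place([10, 20, 30, 40], 1))  # -> [20, 30, 40, 0]
```

The same in-place loop run in decreasing index order would read already-overwritten values, which is the implicit implementation assumption Guy points out.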
|
|
Re: Slidedown overlapping of dest and source registers
Thang Tran
Thanks Guy for the explanation, but my implementation is both incrementing the element index for slideup and decrementing it for slidedown (which is a symmetrical implementation and the simplest from my point of view).
I have no issue with dest/source registers overlapping for slide1down and slide1up. As you suggested, those can be used for debugging.

Thanks, Thang

-----Original Message-----
From: Guy Lemieux [mailto:glemieux@...] Sent: Tuesday, January 28, 2020 1:40 PM To: Thang Tran <thang@...> Cc: Andrew Waterman <andrew@...>; Krste Asanovic <krste@...>; tech-vector-ext@... Subject: Re: [RISC-V] [tech-vector-ext] Slidedown overlapping of dest and source registers
|
|
Re: Slidedown overlapping of dest and source registers
Guy Lemieux
> Thanks Guy for the explanation, but my implementation is both incrementing element index for slideup and decrementing element index for slidedown (which is symmetrical implementation and simplest from my point of view).

I'm curious why you chose to be symmetrical (no need), and why you decided on incrementing for slideup and decrementing for slidedown (I would do the opposite). By incrementing for vslidedown and decrementing for vslideup, you eliminate the race condition in both directions and allow overlapping src/dst for both. However, by supporting both incrementing and decrementing, you are adding extra hardware that isn't strictly necessary.

Guy |
|
Re: Slidedown overlapping of dest and source registers
Andrew Waterman
On Tue, Jan 28, 2020 at 1:40 PM Guy Lemieux <glemieux@...> wrote: Hi Thang,

Oops, yes, I meant vslide1down. Using vslide1down isn't about performance; it's the only way I know of for the debugger to construct a vector without additional storage. The alternative would have been to add an instruction to insert an element into an arbitrary element position, which for various reasons was deemed less preferable.
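As a toy illustration (my own sketch, not from the debug spec) of the in-place construction Andrew describes: issuing vslide1down once per element, with successive values as the scalar operand, rebuilds the entire register using no scratch storage and no other architectural state.

```python
# Sketch (assumptions mine) of a debugger populating a vector register
# in place using only vslide1down.vx: each step shifts the register
# down one element and inserts the next value at the top position.

def vslide1down(v, x):
    """v[i] = v[i+1] for i < vl-1; v[vl-1] = x.  Overlapping src/dst is
    safe because elements are consumed in increasing index order."""
    return v[1:] + [x]

def debugger_write_vreg(values):
    """Inject `values` into a register of the same length in place."""
    vreg = [0] * len(values)      # prior register contents (don't care)
    for x in values:
        vreg = vslide1down(vreg, x)
    return vreg

print(debugger_write_vreg([1, 2, 3, 4]))  # -> [1, 2, 3, 4]
```

After vl applications of vslide1down, the original contents have been shifted out entirely and the injected values occupy the register in order.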
|
|
Minutes from 2020/1/24 meeting
Date: 2020/1/24
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~20
github: https://github.com/riscv/riscv-v-spec

Outstanding items from prior meeting: #362/#354, #341, #348, #347, #335, #326, #318, #235
Current items: #362/#354, #341
New actionable items to be addressed in next meeting:
Is group tracking to roadmap? y

The main discussion in the group, which occupied most of the time, was around #362/#354: whether to drop the fixed-size vector load/store instructions, leaving only the SEW-sized load/store instructions. Dropping these instructions would save considerable complexity in memory pipelines. However, dropping support would also require execution of additional instructions in some common cases. A remedy would be to add more widening or quad-widening ("quadening") compute instructions to reduce this impact.

During the call a second issue was raised. The current design uses constant SEW/LMUL ratios to align data types of different element widths. If only SEW-sized load/stores were available, then a computation using a mixture of element widths would have to use a larger LMUL for larger SEW values, which effectively reduces the number of available registers and so increases register pressure. The fixed-width load/stores allow, e.g., a byte to be loaded into a vector register with four-byte element width at LMUL=1, avoiding this issue. No resolution was reached; this is to be studied further.

The second point, quickly discussed at the end of the meeting, was the proposed separation of a vcsr from fcsr (#341). There was little obvious enthusiasm in favor of one choice over the other, and so the design is left as is for now. |
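The register-pressure point from the minutes can be made concrete with a small sketch. The following Python arithmetic is my own illustration (not part of the minutes): holding SEW/LMUL constant forces larger LMUL at wider SEW, so fewer independent register groups remain.

```python
# Illustrative sketch (mine): at a fixed SEW/LMUL ratio, the LMUL needed
# grows with element width, and the 32 vector registers partition into
# fewer groups, i.e. higher register pressure for wider elements.

def lmul_for(sew, sew_lmul_ratio):
    """LMUL needed to hold SEW-bit elements at a fixed SEW/LMUL ratio."""
    return sew // sew_lmul_ratio

def register_groups(lmul, num_regs=32):
    """Number of architectural register groups available at this LMUL."""
    return num_regs // lmul

# Mixing 8-, 16-, and 32-bit data at SEW/LMUL = 8:
for sew in (8, 16, 32):
    lmul = lmul_for(sew, 8)
    print(sew, lmul, register_groups(lmul))
# -> 8 1 32 / 16 2 16 / 32 4 8
```

This is why the fixed-width loads matter: they let a byte land in a four-byte-element register at LMUL=1, sidestepping the group shrinkage shown above.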
|
RISC-V Vector Task Group: fractional LMUL
In the last meeting, we discussed a problem that would be introduced
if we were to drop the fixed-size (b/h/w) variants of the vector load/stores and only have the SEW-sized (e) variants. From the minutes:

> The current design uses constant SEW/LMUL ratios to align data types of different element widths. If only SEW-sized load/stores were available, then a computation using a mixture of element widths would have to use larger LMUL for larger SEW values, which effectively reduces the number of available registers and so increases register pressure. The fixed-width load/stores allow, e.g., a byte to be loaded into a vector register with four-byte width with LMUL=1, avoiding this issue.

Consider the case of a byte (8b) load into a word (32b) register. The effect of a byte load is to use only one quarter of the bits in a register, with widening to replicate zero/sign bits into the other bits of the register.

A different strategy to use a portion of the bits in a vector register would be to add the concept of a fractional LMUL, i.e., LMUL=1/2, 1/4, 1/8. This has the effect of supporting a given SEW/LMUL ratio with smaller LMUL values. This can be done without adding additional state to the machine, but only by adding a new variant of vsetvli that sets vl according to a shorter VLMAX calculated with the appropriate reduction in VLEN. E.g.,

    vsetvli rd, rs1, e8,f2  # LMUL=1/2
    vsetvli rd, rs1, e8,f4  # LMUL=1/4
    vsetvli rd, rs1, e8,f8  # LMUL=1/8

These instructions leave LMUL=1 in vtype, and the machine executes instructions as before; just the vl set by these instructions will be shorter.

The same effect could be achieved without any new ISA instructions by performing vsetvli with the widest SEW to set vl, then repeating vsetvli with rd=x0, rs1=x0 to keep this vl value. However, this would add an additional instruction in the general case (sometimes the widest operation isn't naturally the first in a loop); in other cases vl is fixed throughout a loop and the code can arrange for the first vsetvli to use the widest SEW, so the additional instructions can be avoided.
With or without new vsetvli implementation for fractional LMUL, there are still more dynamic instructions required in general than the fixed-size loads into SEW elements, which don't need to change SEW. We can discuss further in the next task group meeting tomorrow. Members can find login details on the members task group calendar. Krste |
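To see how the proposed variants shorten vl, here is a back-of-the-envelope Python sketch (my own, using the usual relation VLMAX = LMUL * VLEN / SEW) computing VLMAX for integer and fractional LMUL:

```python
# Sketch (assumptions mine): VLMAX = LMUL * VLEN / SEW, extended to the
# proposed fractional LMUL values 1/2, 1/4, 1/8. Fractional LMUL keeps
# the SEW/LMUL ratio of a wider type while using narrow elements.

from fractions import Fraction

def vlmax(vlen, sew, lmul):
    """Maximum vector length for VLEN-bit registers, SEW-bit elements,
    and a (possibly fractional) register-group multiplier LMUL."""
    return int(Fraction(lmul) * vlen // sew)

VLEN = 128  # example implementation width
print(vlmax(VLEN, 8, 1))               # e8,m1       -> 16 elements
print(vlmax(VLEN, 8, Fraction(1, 2)))  # e8,f2 (1/2) -> 8
print(vlmax(VLEN, 8, Fraction(1, 4)))  # e8,f4 (1/4) -> 4
print(vlmax(VLEN, 8, Fraction(1, 8)))  # e8,f8 (1/8) -> 2
```

Note that e8 at LMUL=1/4 yields the same VLMAX as e32 at LMUL=1, which is exactly the constant SEW/LMUL alignment the proposal preserves.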
|
Re: RISC-V Vector Task Group: fractional LMUL
Could a fractional LMUL circumvent the constraint that a widening instruction's output register group must be larger than its input register group(s)? --Nick Knight On Thu, Feb 6, 2020 at 9:34 PM Krste Asanovic <krste@...> wrote:
|
|
Re: RISC-V Vector Task Group: fractional LMUL
I'm realizing the idea doesn't quite work unless the machine actually
stores the fractional LMUL value (to cope with SLEN and widening instruction behavior), and so it would need the new instruction and an extra bit of state in vtype. But with this change, yes.

For floating-point code, where there is no way to load a narrower type into a wider element, this can be used to increase the number of registers of a wider width. E.g., a matrix multiply accumulating double-precision values that are the product of single-precision values, using widening muladds, can be arranged as something like:

    vsetvli x0, x0, e32,f2   # Fractional LMUL
    vle.v v0, (bmatp)        # Get row of matrix
    flw f1, (x15)            # Get scalar
    vfwmacc.vf v1, f1, v0    # One row of outer-product
    add x15, x15, acolstride # Bump pointer
    ...
    flw f31, (x15)           # Get scalar
    vfwmacc.vf v31, f31, v0  # Last row of outer-product
    ...

This holds 31 rows of the destination matrix accumulators as doubles while performing widening muladds from a single-precision vector load held in v0. This is probably overkill register blocking for this particular example, but it shows the general improvement.

Krste

On Thu, 6 Feb 2020 22:05:50 -0800, Nick Knight <nick.knight@...> said:
> Could a fractional LMUL circumvent the constraint that a widening instruction's output register group must be larger than its input register group(s)?
> --Nick Knight |
|