Zvediv extension discussions
Ken Dockser
I am hearing renewed interest in adding a dot-product extension to the RISC-V Vector extension. This ranges from adding a handful of FP and integer dot-product instructions all the way to completing the Zvediv extension. What is the most efficient way to revive the efforts in these areas? Can we reconvene the Vector TG meetings?

Thanks,
Ken
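For readers unfamiliar with the kind of instruction being discussed, here is a small reference model of what a widening integer dot-product instruction might compute. The function name, the group size of 4, and the accumulate-into-destination behaviour are all illustrative assumptions, not taken from any ratified spec.

```python
# Hypothetical reference model of a widening integer dot-product
# instruction: each group of 4 narrow elements in vs1/vs2 is
# multiplied pairwise and accumulated into one wider element of vd.
# Names and the group size are illustrative only.
def vdot_accum(vd, vs1, vs2, group=4):
    assert len(vs1) == len(vs2) and len(vs1) == group * len(vd)
    out = list(vd)
    for i in range(len(vd)):
        for j in range(group):
            out[i] += vs1[group * i + j] * vs2[group * i + j]
    return out

acc = vdot_accum([0, 0], [1, 2, 3, 4, 5, 6, 7, 8],
                 [1, 1, 1, 1, 2, 2, 2, 2])
# acc == [10, 52]
```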


Peter Lieber
In the Graphics/ML SIG, we also discussed matrix operations. We talked about a single matrix opcode with various functions, like vector-matrix, matrix-matrix, and dot-product, among others. Just some initial ideas... The Zvediv extension would be a great start toward adding dimensionality to vectors, and we would want to keep matrices in mind when reviving the effort.

Regards,
Peter Lieber
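A toy sketch of the "single matrix opcode with a function field" idea mentioned above: one opcode, with a mode selector choosing vector-matrix, matrix-matrix, or dot-product semantics. The encoding, names, and mode strings are purely illustrative.

```python
# Illustrative model: one "matrix" opcode dispatching on a function
# field. None of these names come from any RISC-V proposal.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

def matrix_op(func, src1, src2):
    if func == "vec_mat":    # 1xN vector times NxM matrix
        return matmul([src1], src2)[0]
    if func == "mat_mat":    # NxK matrix times KxM matrix
        return matmul(src1, src2)
    if func == "dot":        # plain dot product of two vectors
        return sum(a * b for a, b in zip(src1, src2))
    raise ValueError(func)

# matrix_op("dot", [1, 2, 3], [4, 5, 6]) == 32
```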
On Thu, Feb 3, 2022 at 2:17 PM Ken Dockser <kad@...> wrote:


Earl Killian
I hope that these discussions will begin with algorithms and applications that need the additional performance, and proceed to analyze how proposed instructions address the greater need.


Guy Lemieux
The AMX extension for AVX-512 is an interesting approach: https://en.wikichip.org/wiki/intel/dl_boost
On Thu, Feb 3, 2022 at 8:02 PM Earl Killian <earl.killian@...> wrote:


Ken Dockser
Thanks folks, these are all very good points.

Earl: I absolutely agree that these extensions (like all RISC-V extensions) need to be developed based on real-world needs and need to be able to show their value (including utility and usability), as well as their value/cost ratio.

Guy: I agree that we need to look at the other leading architectures, as we are well behind them in this area. We then need to come up with our own solutions that address the market needs and fit within the RISC-V approach and philosophy.

Peter: Yes, we need to work to create extensions that take into account our future needs and intentions. In this case, where we are talking about adding instructions that improve matrix performance in the vector registers, we need to keep in mind how this might fit with future extensions that operate on matrices.

We still need to figure out how we can effectively and efficiently take this next step in RISC-V Vector. It seems the best approach would be to leverage the existing Vector TG by producing an updated charter focused on completing the Zvediv extension. Is this permitted/possible? Are the current Chair and Vice Chair amenable to this?

Thanks,
Ken
On Thu, Feb 3, 2022 at 10:11 PM Guy Lemieux <guy.lemieux@...> wrote:


Guy Lemieux
Great points Ken & Earl. One thing I'll point out is that this does not necessarily have much to do with EDIV specifically. For example, the main goal of EDIV is to support smaller-element dot products, particularly for integers. This helps with ML inference, sure. But it leaves a lot of performance on the table, may not help much with operations other than dot products, and probably won't help much with other applications (including ML training).

There are two angles that would supersede this in capability and performance:
- adding vector shapes
- adding accelerators to the vector unit
- in particular, the combination of both shapes + accelerators

In my work (embodied primarily by the VectorBlox MXP architecture), these were its two primary advantages.

a) Allow streaming accelerators to replace the regular ALUs. In particular, you can add a systolic array, where you have N^2 PEs and only require O(N) inputs on each edge and produce O(N) outputs. Such accelerators can be standardized, but first the ISA interface needs to be standardized (see below) and possibly the physical interface (e.g., AXI-Stream).

b) Allow vector operands to have shapes, in particular tensor shapes. This affects data readout ordering and writeback ordering when connecting to the accelerator. It would allow, for example, a traditional 1D vector to be interpreted as a 2D tile, or even a 2D sub-tile. This affects the address generators to the register file, and may require data swizzling to rearrange into the proper shapes and to avoid data bubbles.

In addition to the above, we loosely organized our register file as a multi-banked scratchpad, rather than having fixed-size disjoint registers. This allowed a named vector register to be replaced by a scalar address (a pointer) which fits in the scalar register file. This allowed vectors of arbitrary length, starting at arbitrary locations, producing much more flexible shapes and sub-shapes to be read out.

This property is probably too much for people to accept right away, but it is needed when you want maximum flexibility for both (a) and (b) above.

Note that none of this has anything specifically to do with EDIV. However, it could build upon the vtype system that Krste has devised. (He previously suggested matrix shapes as a possible vtype. In his suggestion, each vector register had its own type descriptor; in the MXP architecture, the type descriptor is global like the vector length register, but there are 3 copies of them so src1, src2, and dst can all be different shapes.)

Guy
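The "vector shapes" idea in (b) can be sketched as an address-generation question: the same flat register (or scratchpad) contents can be read out as a 2D tile just by changing the generator's strides. The function and shape descriptor below are illustrative, not from MXP or any proposal.

```python
# Sketch: reading a flat register/scratchpad as a 2D tile.
# The address generator walks rows with a configurable stride,
# so a 1D vector can be viewed as a 2D tile or sub-tile.
def readout(reg, rows, cols, row_stride, base=0):
    """Read a rows x cols tile out of flat storage `reg`."""
    return [[reg[base + r * row_stride + c] for c in range(cols)]
            for r in range(rows)]

reg = list(range(16))          # a flat 16-element vector
tile = readout(reg, 2, 2, 4)   # top-left 2x2 sub-tile of a 4x4 view
# tile == [[0, 1], [4, 5]]
```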
On Fri, Feb 4, 2022 at 10:27 AM Ken Dockser <kad@...> wrote:


Krste Asanovic
The current vector TG is only tasked with one more deliverable, the IEEE FP16 half-precision extensions (Zvfh, Zvfhmin). These are effectively already defined in the ratified specification, so it is just a question of ratifying the extension names and the detailed contents of each. There is also Zvamo, which was dropped due to a conflict with the scalar subword AMO encoding space; it will need a non-trivial redesign and hence should be a new TG, though it is probably not a high-priority extension compared to others. Zvediv was only sketched out during the vector development process, and it is not clear to me (as the original creator) that it is the right way to go for general matrix operations. Another thing to be aware of: from TSC, vector crypto has the highest priority among new vector extensions. Vector BF16 support is another TSC priority. There are a lot of ideas in this thread, but there is a lot of work to define something that will also work well on OoO cores. The SIG is probably a good place to work through some ideas and arrive at more concrete proposals that can lead to a TG.

Krste


Victor Moya
I have serious doubts that Zvediv helps with graphics. My previous experience is that trying to force SIMD4 over vectors doesn't really help performance or optimize hardware usage; every major vendor moved away from it. In my experience you get the same or better performance by just unrolling the color (or, for geometry, coordinate) component loop over flat vectors. So I would need to look at the actual use cases and whether they are really relevant for modern 3D graphics workloads.

My experience with ML is more limited, but I don't see any major issue with the current vector definition (other than float16 and bfloat16 still missing from the spec). We are also interested in dot products or similar operations that increase compute density for matrix multiplication, but the hardware cost versus performance improvement needs to be managed carefully. If you add a 2D shape to a vector there are hardware costs related to register file access and/or the crossbar between computing lanes. So I would like to see the use cases and how they help. I don't have a closed opinion on this topic.

Talking from experience, from a hardware perspective the current V extension is already somewhat challenging when you add OoO, so better not to add extra complexity that doesn't bring real performance :). I expect the SIG will work on these proposals and I will be around to collaborate.

Victor
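The unrolling Victor describes can be sketched as an AoS-vs-SoA choice: instead of mapping one RGBA pixel to a SIMD4 group (array-of-structures), keep each color component in its own flat vector (structure-of-arrays) and unroll the component loop. Both layouts below are illustrative scalar models of the two mappings, assuming a simple per-pixel scale.

```python
# AoS: one SIMD4 group per pixel -- lanes sit idle if a format has
# fewer than 4 components.
def scale_aos(pixels, s):          # pixels: [[r, g, b, a], ...]
    return [[c * s for c in px] for px in pixels]

# SoA: one flat vector per component, component loop unrolled --
# every lane does useful work regardless of component count.
def scale_soa(r, g, b, a, s):
    return ([c * s for c in r], [c * s for c in g],
            [c * s for c in b], [c * s for c in a])

# scale_aos([[1, 2, 3, 4]], 2) == [[2, 4, 6, 8]]
# scale_soa([1], [2], [3], [4], 2) == ([2], [4], [6], [8])
```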
On Wed, Feb 9, 2022 at 7:25 AM Krste Asanovic <krste@...> wrote:


Abel Bernabeu
Sorry, I have followed the thread but COVID kept me busy here.

Zvediv is not strictly needed for matrix multiply. I have to correct Peter Lieber, who took that interpretation from what we discussed at the Graphics and ML SIG and said so in this thread. What I tried to communicate at the Graphics and ML SIG was this: if Zvediv were reintroduced, then I would suggest a different behaviour for matrix multiply depending on whether a source operand (left-hand-side or right-hand-side matrix) is the same for all the work-items or not.

About Zvediv itself (the original subject), the value I see is that one can perform several work-items in parallel and change the number of processed elements dynamically. If there is a reduce operation per work-item, without Zvediv we need one reduce instruction per work-item. If the number of work-items changes, the program has to be recompiled accordingly. So the number of work-items becomes a non-orthogonal part of the program's state, and the program needs to be recompiled every time this parameter changes. In some graphics designs I worked with in the past (like Intel's Gen) that implies having different shaders for different patch shapes (8x8, 8x4, 8x2, 8x1, and all the transpositions).

Ideally I would like the same shader code to be valid for every possible patch shape: write the shader binary once and use it with any present and future patch size. Being patch-size agnostic is as beneficial for graphics as being vector-length agnostic is for other domains. And that is the motivation behind reintroducing Zvediv.

Regards.

PS: OoO is not a high priority for graphics or ML. I would be perfectly happy if OoO were not possible for any instruction on a shading core for graphics.
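The per-work-item reduction argument above can be sketched as follows: with an EDIV-style subdivision, a single grouped-reduce instruction sums each work-item's elements independently, so the same code serves any work-item count set at run time. The instruction name and semantics below are invented for illustration.

```python
# Sketch of a grouped reduction under an EDIV-style subdivision:
# one instruction produces one sum per work-item group, instead of
# one reduce instruction per work-item.
def vredsum_grouped(vs, ediv):
    """Sum each contiguous group of `ediv` elements of vs."""
    assert len(vs) % ediv == 0
    return [sum(vs[i:i + ediv]) for i in range(0, len(vs), ediv)]

# Four work-items of two elements each -- one instruction, four sums:
# vredsum_grouped([1, 2, 3, 4, 5, 6, 7, 8], 2) == [3, 7, 11, 15]
# Regrouping needs no recompile, just a different ediv:
# vredsum_grouped([1, 2, 3, 4, 5, 6, 7, 8], 4) == [10, 26]
```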
On Wed, Feb 9, 2022 at 8:50 AM Victor Moya <victor.moya@...> wrote:

