Re: Vector element groups

Krste Asanovic
 

On Fri, 15 Jul 2022 09:10:49 -0700, Earl Killian <earl.killian@...> said:
| On another topic, I have this vague feeling that it would be best if we had VL and SEW always set for vector instructions, and
| not be implicit in the opcode, but I have not fleshed out this thought. Perhaps someone who has thought about it more would
| like to elucidate the issues?

We already have vector loads and stores with static EEW in the
instruction, which ignore dynamic SEW. Future 64-bit encodings would
also have static EEWs in the instruction. If static encoding space had
been available, we would not have had dynamic SEW at all.

The current EG proposal does require vl to be set.

Krste


Re: Vector element groups

Nicolas Brunie
 

Hi Yann,
   I think Ken is referencing the optimization of splitting the SHA-256 state in two and merging rounds. It is, for example, described here: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html
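
For illustration, here is a minimal scalar sketch of that split-state trick (the function name and layout are illustrative, modeled loosely on Intel's SHA256RNDS2, not on any ratified RISC-V instruction): the eight working variables are held as two 4x32-bit groups, {a,b,e,f} and {c,d,g,h}, and two rounds are merged so that the old {a,b,e,f} group simply becomes the new {c,d,g,h}.

#include <stdint.h>

/* Scalar model of a merged two-round SHA-256 update on a split state:
 * state0 = {a,b,e,f}, state1 = {c,d,g,h}, wk[] = K[t]+W[t] and
 * K[t+1]+W[t+1] already summed.  Hypothetical helper, for illustration. */
static inline uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }
static uint32_t ch(uint32_t e, uint32_t f, uint32_t g)  { return (e & f) ^ (~e & g); }
static uint32_t maj(uint32_t a, uint32_t b, uint32_t c) { return (a & b) ^ (a & c) ^ (b & c); }
static uint32_t sum0(uint32_t a) { return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22); }
static uint32_t sum1(uint32_t e) { return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25); }

static void sha256_two_rounds(uint32_t state0[4], uint32_t state1[4], const uint32_t wk[2])
{
    uint32_t a = state0[0], b = state0[1], e = state0[2], f = state0[3];
    uint32_t c = state1[0], d = state1[1], g = state1[2], h = state1[3];

    for (int i = 0; i < 2; i++) {
        uint32_t t1 = h + sum1(e) + ch(e, f, g) + wk[i];
        uint32_t t2 = sum0(a) + maj(a, b, c);
        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }
    /* After two rounds, the old {a,b,e,f} is exactly the new {c,d,g,h}. */
    state1[0] = state0[0]; state1[1] = state0[1];
    state1[2] = state0[2]; state1[3] = state0[3];
    state0[0] = a; state0[1] = b; state0[2] = e; state0[3] = f;
}

This two-rounds-per-instruction shape, with the state split across two 128-bit groups, is why Ken argues the natural element group for SHA256 is 4x32 = 128 bits rather than 8x32.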

Regards,
Nicolas 

On Tue, Jul 19, 2022 at 1:47 AM, Yann Loisel <yann.loisel@...> wrote:
Hi Ken
I'm not sure I follow you on the 128-bit inputs and outputs for SHA256.
The spec speaks of eight 32-bit working variables, a, b, c, ..., g, h, used as the current state, so 256 bits, i.e., a group of eight 32-bit values.
Could you please elaborate here?
Having an algorithmic representation could be helpful for the overall discussion too.
Thanks
yann

On Fri, Jul 15, 2022 at 8:07 PM Ken Dockser <kad@...> wrote:
Thanks for putting this concept proposal together, Krste.

I have several initial comments and questions:
  1. I am all for the concept of element groups. As you point out, they are especially useful in cryptography where we need to operate on data sizes that are greater than the (current) largest SEW=64.
  2. SHA256 is used as an example in the document, but with an incorrect number of elements in a group. To be competitive, SHA256 instructions require 128-bit (not 256-bit) inputs and outputs. More specifically, the outputs need to be a group of 4 (not 8) 32-bit values, and each of the three inputs needs to be a group of 4 32-bit values.
  3. If LMUL is allowed to be used for the concatenation of registers in narrow implementations (i.e., VLEN < EGW) to form the groups, what would be the meaning of LMUL>1 for wider implementations (i.e., when VLEN >= EGW)?
  4. How would vl be interpreted when VLMUL is used for the concatenation of registers to form groups?
  5. Would instructions that are based on element groups be required to support the use of LMUL to form those groups on narrower implementations? Or, could the instruction be defined to require a minimum VLEN?
Thanks,
Ken



--

Yann Loisel
Principal Security Architect


Re: Vector element groups

Yann Loisel
 

Hi Ken
I'm not sure I follow you on the 128-bit inputs and outputs for SHA256.
The spec speaks of eight 32-bit working variables, a, b, c, ..., g, h, used as the current state, so 256 bits, i.e., a group of eight 32-bit values.
Could you please elaborate here?
Having an algorithmic representation could be helpful for the overall discussion too.
Thanks
yann

On Fri, Jul 15, 2022 at 8:07 PM Ken Dockser <kad@...> wrote:
Thanks for putting this concept proposal together, Krste.

I have several initial comments and questions:
  1. I am all for the concept of element groups. As you point out, they are especially useful in cryptography where we need to operate on data sizes that are greater than the (current) largest SEW=64.
  2. SHA256 is used as an example in the document, but with an incorrect number of elements in a group. To be competitive, SHA256 instructions require 128-bit (not 256-bit) inputs and outputs. More specifically, the outputs need to be a group of 4 (not 8) 32-bit values, and each of the three inputs needs to be a group of 4 32-bit values.
  3. If LMUL is allowed to be used for the concatenation of registers in narrow implementations (i.e., VLEN < EGW) to form the groups, what would be the meaning of LMUL>1 for wider implementations (i.e., when VLEN >= EGW)?
  4. How would vl be interpreted when VLMUL is used for the concatenation of registers to form groups?
  5. Would instructions that are based on element groups be required to support the use of LMUL to form those groups on narrower implementations? Or, could the instruction be defined to require a minimum VLEN?
Thanks,
Ken



--

Yann Loisel
Principal Security Architect


Re: Vector element groups

Allen Baum
 

There is a plan to insert the formal Sail code for an operator in the spec. 
That is an ongoing project that has been started (there's a prototype), but currently has no resources.

Vector has many state variables that can affect the result
(i.e., ~80? different legal configurations of ELEN, VLEN, SEW, LMUL, etc.),
and parameterizing the functions to take all the possibilities into account 
without replicating the functions is proving to be a bit challenging.
But once that Sail code is written, having it appear in the spec text doesn't look too difficult
(for pure operators. Loads and stores are a different matter....)

I'm not sure that will be as readable and understandable as a picture, though.

On Fri, Jul 15, 2022 at 10:09 AM Zalman Stern <zalman@...> wrote:
In as much as this is a concept, it may be worth commenting on how this should interact with 64-bit instructions that encode SEW in the instruction. (Which, if I am not mistaken, is a possible future direction.) Specifically, I would expect the types of instructions mentioned as having, or requiring, fixed SEW to not get longer forms. It may be worth introducing standard terminology for an instruction that is specified as not using SEW.

Adding both an array notation and a graphical representation to define instruction behavior would help a great deal. It is a general problem throughout ML-operator documentation that prose describing array operations is imprecise and invariably relies on the reader already having a pretty good idea of what is going on in order to fill in the gaps. This concept would be stronger if it were a coding style guide for a more formal language in addition to describing vocabulary. Clarity and precision of specification become more important as lane grouping and cross-lane interaction increase in the instruction operation.

-Z-


On Fri, Jul 15, 2022 at 9:11 AM Earl Killian <earl.killian@...> wrote:
While I share some concern about the cited language, as this is a concept, and not a spec, I think the time to require checking would be when individual specs implement the concept. I would think it would require some pretty good justification to not have an exception.

On another topic, I have this vague feeling that it would be best if we had VL and SEW always set for vector instructions, and not be implicit in the opcode, but I have not fleshed out this thought. Perhaps someone who has thought about it more would like to elucidate the issues?

On Jul 15, 2022, at 08:30, Allen Baum <allen.baum@...> wrote:

A quick look, and ugh: yet another architectural option unconnected to any opcode or CSR bit that we have to specify to get the correct operation in Sail:
Implementations are recommended to raise an illegal instruction exception for a vl value that is not a multiple of the element group size.

On Thu, Jul 14, 2022 at 10:31 PM Krste Asanovic <krste@...> wrote:

I've been working up a scheme to handle vector element groups in
general, with vector crypto being the first anticipated use case.
This replaces the EDIV concept with a more general group concept that
is also less costly to implement as it does not require new vector
memory instructions.  I put the file up as a separate document in the
vector spec github:

https://github.com/riscv/riscv-v-spec/blob/master/element_groups.adoc

I'm hoping to discuss it in the next vector crypto meeting, but also welcome
discussion on these lists (or github issues) as well.

Krste








Re: Vector element groups

Ken Dockser
 

Thanks for putting this concept proposal together, Krste.

I have several initial comments and questions:
  1. I am all for the concept of element groups. As you point out, they are especially useful in cryptography where we need to operate on data sizes that are greater than the (current) largest SEW=64.
  2. SHA256 is used as an example in the document, but with an incorrect number of elements in a group. To be competitive, SHA256 instructions require 128-bit (not 256-bit) inputs and outputs. More specifically, the outputs need to be a group of 4 (not 8) 32-bit values, and each of the three inputs needs to be a group of 4 32-bit values.
  3. If LMUL is allowed to be used for the concatenation of registers in narrow implementations (i.e., VLEN < EGW) to form the groups, what would be the meaning of LMUL>1 for wider implementations (i.e., when VLEN >= EGW)?
  4. How would vl be interpreted when VLMUL is used for the concatenation of registers to form groups?
  5. Would instructions that are based on element groups be required to support the use of LMUL to form those groups on narrower implementations? Or, could the instruction be defined to require a minimum VLEN?
Thanks,
Ken


Re: Vector element groups

Zalman Stern
 

In as much as this is a concept, it may be worth commenting on how this should interact with 64-bit instructions that encode SEW in the instruction. (Which, if I am not mistaken, is a possible future direction.) Specifically, I would expect the types of instructions mentioned as having, or requiring, fixed SEW to not get longer forms. It may be worth introducing standard terminology for an instruction that is specified as not using SEW.

Adding both an array notation and a graphical representation to define instruction behavior would help a great deal. It is a general problem throughout ML-operator documentation that prose describing array operations is imprecise and invariably relies on the reader already having a pretty good idea of what is going on in order to fill in the gaps. This concept would be stronger if it were a coding style guide for a more formal language in addition to describing vocabulary. Clarity and precision of specification become more important as lane grouping and cross-lane interaction increase in the instruction operation.

-Z-


On Fri, Jul 15, 2022 at 9:11 AM Earl Killian <earl.killian@...> wrote:
While I share some concern about the cited language, as this is a concept, and not a spec, I think the time to require checking would be when individual specs implement the concept. I would think it would require some pretty good justification to not have an exception.

On another topic, I have this vague feeling that it would be best if we had VL and SEW always set for vector instructions, and not be implicit in the opcode, but I have not fleshed out this thought. Perhaps someone who has thought about it more would like to elucidate the issues?

On Jul 15, 2022, at 08:30, Allen Baum <allen.baum@...> wrote:

A quick look, and ugh: yet another architectural option unconnected to any opcode or CSR bit that we have to specify to get the correct operation in Sail:
Implementations are recommended to raise an illegal instruction exception for a vl value that is not a multiple of the element group size.

On Thu, Jul 14, 2022 at 10:31 PM Krste Asanovic <krste@...> wrote:

I've been working up a scheme to handle vector element groups in
general, with vector crypto being the first anticipated use case.
This replaces the EDIV concept with a more general group concept that
is also less costly to implement as it does not require new vector
memory instructions.  I put the file up as a separate document in the
vector spec github:

https://github.com/riscv/riscv-v-spec/blob/master/element_groups.adoc

I'm hoping to discuss it in the next vector crypto meeting, but also welcome
discussion on these lists (or github issues) as well.

Krste








Re: Vector element groups

Earl Killian
 

While I share some concern about the cited language, as this is a concept, and not a spec, I think the time to require checking would be when individual specs implement the concept. I would think it would require some pretty good justification to not have an exception.

On another topic, I have this vague feeling that it would be best if we had VL and SEW always set for vector instructions, and not be implicit in the opcode, but I have not fleshed out this thought. Perhaps someone who has thought about it more would like to elucidate the issues?

On Jul 15, 2022, at 08:30, Allen Baum <allen.baum@...> wrote:

A quick look, and ugh: yet another architectural option unconnected to any opcode or CSR bit that we have to specify to get the correct operation in Sail:
Implementations are recommended to raise an illegal instruction exception for a vl value that is not a multiple of the element group size.

On Thu, Jul 14, 2022 at 10:31 PM Krste Asanovic <krste@...> wrote:

I've been working up a scheme to handle vector element groups in
general, with vector crypto being the first anticipated use case.
This replaces the EDIV concept with a more general group concept that
is also less costly to implement as it does not require new vector
memory instructions.  I put the file up as a separate document in the
vector spec github:

https://github.com/riscv/riscv-v-spec/blob/master/element_groups.adoc

I'm hoping to discuss it in the next vector crypto meeting, but also welcome
discussion on these lists (or github issues) as well.

Krste








Re: Vector element groups

Jon Tate
 

It's interesting that there seems to be some overlap between this work and the work I've been doing for a matrix multiplication subextension proposal for machine learning workloads. Perhaps we should approach this element group concept from a more general direction?

On Fri, Jul 15, 2022 at 10:30 AM Allen Baum <allen.baum@...> wrote:
A quick look, and ugh: yet another architectural option unconnected to any opcode or CSR bit that we have to specify to get the correct operation in Sail:
Implementations are recommended to raise an illegal instruction exception for a vl value that is not a multiple of the element group size.

On Thu, Jul 14, 2022 at 10:31 PM Krste Asanovic <krste@...> wrote:

I've been working up a scheme to handle vector element groups in
general, with vector crypto being the first anticipated use case.
This replaces the EDIV concept with a more general group concept that
is also less costly to implement as it does not require new vector
memory instructions.  I put the file up as a separate document in the
vector spec github:

https://github.com/riscv/riscv-v-spec/blob/master/element_groups.adoc

I'm hoping to discuss it in the next vector crypto meeting, but also welcome
discussion on these lists (or github issues) as well.

Krste






--
Jon Tate
Software Engineer, Project Shodan, Google


Re: Vector element groups

Allen Baum
 

A quick look, and ugh: yet another architectural option unconnected to any opcode or CSR bit that we have to specify to get the correct operation in Sail:
Implementations are recommended to raise an illegal instruction exception for a vl value that is not a multiple of the element group size.
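
For reference, the recommendation above amounts to a one-line legality check at decode time; a minimal sketch (names hypothetical, C pseudocode rather than actual Sail):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decode-time check for an element-group instruction.
 * egs is the element group size in elements (e.g., 4 for a 4x32-bit
 * group); returns true if the recommended illegal-instruction
 * exception should be raised.  Recommended, not required. */
static bool eg_vl_illegal(uint64_t vl, uint64_t egs)
{
    return (vl % egs) != 0;
}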

On Thu, Jul 14, 2022 at 10:31 PM Krste Asanovic <krste@...> wrote:

I've been working up a scheme to handle vector element groups in
general, with vector crypto being the first anticipated use case.
This replaces the EDIV concept with a more general group concept that
is also less costly to implement as it does not require new vector
memory instructions.  I put the file up as a separate document in the
vector spec github:

https://github.com/riscv/riscv-v-spec/blob/master/element_groups.adoc

I'm hoping to discuss it in the next vector crypto meeting, but also welcome
discussion on these lists (or github issues) as well.

Krste







Vector element groups

Krste Asanovic
 

I've been working up a scheme to handle vector element groups in
general, with vector crypto being the first anticipated use case.
This replaces the EDIV concept with a more general group concept that
is also less costly to implement as it does not require new vector
memory instructions. I put the file up as a separate document in the
vector spec github:

https://github.com/riscv/riscv-v-spec/blob/master/element_groups.adoc

I'm hoping to discuss it in the next vector crypto meeting, but also welcome
discussion on these lists (or github issues) as well.

Krste


Re: I have some questions about the VMADC/VMSBC instructions, thank you for your valuable comments.

Andrew Waterman
 



On Thu, Jun 16, 2022 at 2:43 AM <lilei2@...> wrote:
1. Question for tail bits of mask-producing instructions.
In the case of mask-producing instructions, the tail elements are the bits with (vl <= bit index < VLEN). So, according to riscv-v-spec-1.0, page 46, VMADC/VMSBC instructions operate with a tail-agnostic policy, which means the tail bits can be written with 1s or left unchanged.
Meanwhile, according to page 13, we have a more relaxed constraint: except for mask load instructions, any element in the tail of a mask result can also be written with the value the mask-producing operation would have calculated with vl=VLMAX. This means we can overwrite all remaining bits past vl either with 1s or with the value the mask-producing operation would have calculated.
For example, if VLEN=128, LMUL=1, SEW=32, there are only 4 body bits for the VMADC instruction. If the current vl=2, we can write the calculation results to bits[1:0] and fill all other 126 bits with 1s.

It is legal to fill bits vl..VLEN-1 with 1s because of the clause that these instructions are always tail-agnostic.

Or we can write 4 bits of calculation results to bits[3:0], in which only bits[1:0] are body bits, and fill all other 124 bits with 1s.

It is also legal to compute bits 0..VLMAX-1 (as a function of elements 0..VLMAX-1), then fill bits VLMAX..VLEN-1 with 1s, because of the clause that mask-producing instructions are permitted to write the result they would have written if vl had been set to VLMAX.

But I'm not sure that the remark "in which only bits[1:0] are body bits" matters.  In this style of implementation, the behavior is the same as if the body contained elements 0..VLMAX-1.

Is either of these implementations legal?
 
2. Question for inactive body bits of mask-producing instructions.
When vtype.vma=0, which means the mask-undisturbed policy, the inactive body bits should retain their values.
For example, when vma=0, VMSBF.M and VMSEQ with vm=0 should not change the inactive body bits.
Is my understanding correct? Thanks.

Right.
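
To make the two legal implementation styles concrete, here is a minimal scalar model (illustrative only, not normative; body(i) stands for the mask bit the operation computes for element i):

#include <stdint.h>

#define VLEN_BITS 128   /* running the VLEN=128 example */

static void set_bit(uint8_t *m, int i, int v)
{
    if (v) m[i / 8] |= (uint8_t)(1u << (i % 8));
    else   m[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* Style 1: compute bits 0..vl-1, then fill vl..VLEN-1 with 1s
 * (plain tail-agnostic, all-ones fill). */
static void mask_result_style1(uint8_t *m, int vl, int (*body)(int))
{
    for (int i = 0; i < vl; i++)         set_bit(m, i, body(i));
    for (int i = vl; i < VLEN_BITS; i++) set_bit(m, i, 1);
}

/* Style 2: compute bits 0..VLMAX-1 as if vl were VLMAX, then fill
 * VLMAX..VLEN-1 with 1s (the page-13 allowance). */
static void mask_result_style2(uint8_t *m, int vlmax, int (*body)(int))
{
    for (int i = 0; i < vlmax; i++)         set_bit(m, i, body(i));
    for (int i = vlmax; i < VLEN_BITS; i++) set_bit(m, i, 1);
}

In the VLEN=128, SEW=32, LMUL=1 example above (VLMAX=4, vl=2), style 1 writes bits[1:0] and fills the other 126 bits with 1s, while style 2 writes bits[3:0] and fills the other 124 bits with 1s; per the two clauses quoted above, both are legal. (Masking and the mask-undisturbed handling of inactive body bits are deliberately left out of this sketch.)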


I have some questions about the VMADC/VMSBC instructions, thank you for your valuable comments.

lilei2@...
 

1. Question for tail bits of mask-producing instructions.
In the case of mask-producing instructions, the tail elements are the bits with (vl <= bit index < VLEN). So, according to riscv-v-spec-1.0, page 46, VMADC/VMSBC instructions operate with a tail-agnostic policy, which means the tail bits can be written with 1s or left unchanged.
Meanwhile, according to page 13, we have a more relaxed constraint: except for mask load instructions, any element in the tail of a mask result can also be written with the value the mask-producing operation would have calculated with vl=VLMAX. This means we can overwrite all remaining bits past vl either with 1s or with the value the mask-producing operation would have calculated.
For example, if VLEN=128, LMUL=1, SEW=32, there are only 4 body bits for the VMADC instruction. If the current vl=2, we can write the calculation results to bits[1:0] and fill all other 126 bits with 1s. Or we can write 4 bits of calculation results to bits[3:0], in which only bits[1:0] are body bits, and fill all other 124 bits with 1s.
Is either of these implementations legal?
 
2. Question for inactive body bits of mask-producing instructions.
When vtype.vma=0, which means the mask-undisturbed policy, the inactive body bits should retain their values.
For example, when vma=0, VMSBF.M and VMSEQ with vm=0 should not change the inactive body bits.
Is my understanding correct? Thanks.


Re: Zvediv extension discussions

Abel Bernabeu
 

Sorry, I have been following the thread, but COVID kept me busy here.

Zvediv is not strictly needed for matrix multiply. I have to correct Peter Lieber, who inferred that from what we discussed at the Graphics and ML SIG and said so in this thread.

What I tried to communicate at the Graphics and ML SIG was that if Zvediv were reintroduced, I would suggest a different behaviour for matrix multiply depending on whether a source operand (the left-hand-side or right-hand-side matrix) is the same for all the work-items or not.

About Zvediv itself (original subject), the value I see is that one can perform several work-items in parallel and change the number of processed elements dynamically.

If there is a reduce operation per work-item, without Zvediv we need one reduce instruction per work-item. If the number of work-items changes, the program has to be recompiled accordingly. So the number of work-items becomes a non-orthogonal piece of program state, and the program needs to be recompiled every time this parameter changes.
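
For illustration, a scalar model of the difference (the grouped-reduction semantics here are an assumption for the sketch, not a ratified Zvediv definition): with an EDIV-style grouping, one instruction performs an independent reduction per work-item, and the number of work-items is just a runtime parameter rather than something baked into the instruction stream.

#include <stddef.h>
#include <stdint.h>

/* Scalar model of an EDIV-style grouped reduction: n_items work-items
 * of ediv elements each, summed independently.  One such instruction
 * would replace n_items separate vredsum invocations, and n_items and
 * ediv can vary at run time without recompiling the kernel. */
static void grouped_redsum(int32_t *dst, const int32_t *src,
                           size_t n_items, size_t ediv)
{
    for (size_t g = 0; g < n_items; g++) {   /* one group per work-item */
        int32_t acc = 0;
        for (size_t i = 0; i < ediv; i++)
            acc += src[g * ediv + i];
        dst[g] = acc;
    }
}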

In some graphics designs I worked with in the past (like Intel's Gen), that implies having different shaders for different patch shapes (8x8, 8x4, 8x2, 8x1, and all the transpositions). Ideally I would like the same shader code to be valid for every possible patch shape: write the shader binary once and use it with any present and future patch size.

Being patch-size agnostic is as beneficial for graphics as being vector-length agnostic is for other domains. And that is the motivation behind reintroducing Zvediv.

Regards.

PS: OoO is not a high priority for graphics or ML. I would be perfectly happy if OoO is not possible for any instruction on a shading core for graphics.


On Wed, Feb 9, 2022 at 8:50 AM Victor Moya <victor.moya@...> wrote:

I have serious doubts Zvediv helps with graphics. My previous experience is that trying to force SIMD4 over vectors doesn't really help nor optimize hardware usage. Every major vendor moved away from it. My experience is that you get the same or better performance by just unrolling the color (or, for geometry, coordinate) component loop on flat vectors. So I would need to look at the actual use cases and whether these use cases are really relevant for modern 3D graphics workloads.

My experience with ML is somewhat more limited, but I don't see any major issue with the current vector definition (other than the still-missing float16 and bfloat16 in the spec). We are also interested in dot products or similar operations that increase compute density for matrix multiplication, but the hardware cost versus performance improvement needs to be managed carefully.

If you add a 2D shape to a vector, there are hardware costs related to register file access and/or the crossbar between computing lanes. So I would like to see the use cases and how they help. I don't have a closed opinion on this topic.

Talking from experience, from a hardware perspective the current V extension is already somewhat challenging when you add OoO, so better not to add extra complexity that doesn't add real performance :).

I expect the SIG will work on these proposals and I will be around to collaborate.

Victor


On Wed, Feb 9, 2022 at 7:25 AM Krste Asanovic <krste@...> wrote:

The current vector TG is only tasked with one more deliverable, which
is the IEEE FP16 half-precision extensions (Zvfh, Zvfhmin).  This is
effectively already defined in the ratified specification, and so is just
a question of ratifying these extension names and the detailed
contents of each extension.

There is also Zvamo, which was dropped due to conflict with scalar
subword AMO encoding space, but which will need a non-trivial redesign
and hence should be a new TG.  It is probably not a high priority
extension compared to others.

Zvediv was only sketched out during the vector development process,
and it is not clear to me (as the original creator) that this is the
right way to go for general matrix operations.

Another thing to be aware of is that from TSC, vector crypto has the
highest priority among any new vector extensions.  Vector BF16 support
is another TSC priority.

There are a lot of ideas in this thread, but there is a lot of work to
define something that will also work well on OoO cores.  The SIG is
probably a good place to work through some ideas and arrive at more
concrete proposals that can lead to a TG.

Krste


>>>>> On Fri, 4 Feb 2022 11:00:23 -0800, "Guy Lemieux" <guy.lemieux@...> said:

| Great points Ken & Earl.
| One thing I'll point out is that this does not necessarily have much to do with EDIV specifically.

| For example, the main goal of EDIV is to support smaller element dot products, particularly for integers. This helps with ML inference, sure. But it leaves a lot of performance on the
| table, may not help much with operations other than dot products, and probably won't help much with other applications (including ML training).

| There are two angles that would supersede this in capability and performance:
| -- adding vector shapes
| -- adding accelerators to the vector unit
| -- in particular, the combination of both shapes + accelerators

| In my work (embodied primarily by the VectorBlox MXP architecture), these were its two primary advantages.

| a) allow streaming accelerators to replace the regular ALUs. in particular, you can add a systolic array, where you have N^2 PEs and only require O(N) inputs on each edge and produce O
| (N) outputs. such accelerators can be standardized, but first the ISA interface needs to be standardized (see below) and possibly the physical interface (eg, AXI-stream).

| b) allow vector operands to have shapes, in particular to allow tensor shapes. this affects data readout ordering and writeback ordering when connecting to the accelerator. this would
| allow, for example, a traditional 1D vector to be interpreted as a 2D tile, or even a 2D subtile. this affects the address generators to the register file, and may require data swizzling
| to rearrange into the proper shapes and to avoid data bubbles.

| In addition to the above, we loosely organized our register file as a multi-banked scratchpad, rather than having fixed-size disjoint registers. This allowed a named vector register to
| be replaced by a scalar address (a pointer) which fits in the scalar register file. This allowed vectors of arbitrary length, and to start at arbitrary locations, producing much more
| flexible shapes and subshapes to be read out. This property is probably too much for people to accept right away, but it is needed when you want to have maximum flexibility for both (a)
| and (b) above.

| Note that none of this has anything specifically to do with EDIV. However, it could build upon the vtypes system that Krste has devised. (He previously tried to suggest matrix shapes as
| possible vtype. In his suggestion, each vector register had its own type descriptor; in the MXP architecture, the type descriptor is global like the vector length register but there are
| 3 copies of them so src1, src2 and dst can all be different shapes.)

| Guy

| On Fri, Feb 4, 2022 at 10:27 AM Ken Dockser <kad@...> wrote:

|     Thanks folks, these are all very good points.

|     Earl: I absolutely agree that these extensions (like all RISC-V extensions) need to be developed based on real-world needs and need to be able to show their value (including utility
|     and usability),  as well as value/cost ratio.

|     Guy: I agree that we need to look at the other leading architectures as we are well behind them in this area. We then need to come up with our own solutions that address the market
|     needs and fit within the RISC-V approach and philosophy. 

|     Peter: Yes, we need to work to create extensions that take into account our future needs and intentions. In this case, where we are talking about adding instructions that improve
|     matrix performance in the vector registers, we need to keep in mind how this might fit with future extensions that operate on matrices.

|     We still need to figure out how we can effectively and efficiently take this next step in RISC-V Vector.  It seems like the best approach would be to leverage the existing Vector TG
|     by producing an updated charter that is focused on completing the Zvediv extension. Is this permitted/possible? Are the current Chair and Vice Chair amenable to this?

|     Thanks,
|     Ken

|     On Thu, Feb 3, 2022 at 10:11 PM Guy Lemieux <guy.lemieux@...> wrote:

|         The AMX extension for AVX512 is an interesting approach.... 

|         https://en.wikichip.org/wiki/intel/dl_boost

|         On Thu, Feb 3, 2022 at 8:02 PM Earl Killian <earl.killian@...> wrote:

|             I hope that these discussions will begin with algorithms and applications that need the additional performance, and proceed to analyze how proposed instructions address the
|             greater need.

|                 On Feb 3, 2022, at 14:45, Peter Lieber <peteralieber@...> wrote:

|                 In the Graphics/ML SIG, we also discussed matrix operations.  We talked about a single matrix opcode with various functions like vector-matrix, matrix-matrix, and
|                 dot product functions, among others.  Just some initial ideas... 

|                 The Zvediv extension would be a great start to get dimensionality to vectors, and we would want to keep matrices in mind when reviving the effort.

|                 Regards,
|                 Peter Lieber

|                 On Thu, Feb 3, 2022 at 2:17 PM Ken Dockser <kad@...> wrote:

|                     I am hearing renewed interest in adding a dot-product extension to the RISC-V Vectors. This includes everything from adding a handful of FP and Int dot-product
|                     instructions all the way to completing the Zvediv extension.

|                     What is the most efficient way to revive the efforts in these areas? Can we reconvene the Vector TG meetings?

|                     Thanks,
|                     Ken

|






Re: Zvediv extension discussions

Victor Moya
 


I have serious doubts Zvediv helps with graphics. My previous experience is that trying to force SIMD4 over vectors doesn't really help nor optimize hardware usage. Every major vendor moved away from it. My experience is that you get the same or better performance by just unrolling the color (or, for geometry, coordinate) component loop on flat vectors. So I would need to look at the actual use cases and whether these use cases are really relevant for modern 3D graphics workloads.

My experience with ML is somewhat more limited, but I don't see any major issue with the current vector definition (other than the still-missing float16 and bfloat16 in the spec). We are also interested in dot products or similar operations that increase compute density for matrix multiplication, but the hardware cost versus performance improvement needs to be managed carefully.

If you add a 2D shape to a vector, there are hardware costs related to register file access and/or the crossbar between computing lanes. So I would like to see the use cases and how they help. I don't have a closed opinion on this topic.

Talking from experience, from a hardware perspective the current V extension is already somewhat challenging when you add OoO, so better not to add extra complexity that doesn't add real performance :).

I expect the SIG will work on these proposals and I will be around to collaborate.

Victor


On Wed, Feb 9, 2022 at 7:25 AM Krste Asanovic <krste@...> wrote:

The current vector TG is only tasked with one more deliverable, which
is the IEEE FP16 half-precision extensions (Zvfh, Zvfhmin).  This is
effectively already defined in the ratified specification, and so is just
a question of ratifying these extension names and the detailed
contents of each extension.

There is also Zvamo, which was dropped due to conflict with scalar
subword AMO encoding space, but which will need a non-trivial redesign
and hence should be a new TG.  It is probably not a high priority
extension compared to others.

Zvediv was only sketched out during the vector development process,
and it is not clear to me (as the original creator) that this is the
right way to go for general matrix operations.

Another thing to be aware of is that from TSC, vector crypto has the
highest priority among any new vector extensions.  Vector BF16 support
is another TSC priority.

There are a lot of ideas in this thread, but there is a lot of work to
define something that will also work well on OoO cores.  The SIG is
probably a good place to work through some ideas and arrive at more
concrete proposals that can lead to a TG.

Krste


>>>>> On Fri, 4 Feb 2022 11:00:23 -0800, "Guy Lemieux" <guy.lemieux@...> said:

| Great points Ken & Earl.
| One thing I'll point out is that this does not necessarily have much to do with EDIV specifically.

| For example, the main goal of EDIV is to support smaller element dot products, particularly for integers. This helps with ML inference, sure. But it leaves a lot of performance on the
| table, may not help much with operations other than dot products, and probably won't help much with other applications (including ML training).

| There are two angles that would supersede this in capability and performance:
| -- adding vector shapes
| -- adding accelerators to the vector unit
| -- in particular, the combination of both shapes + accelerators

| In my work (embodied primarily by the VectorBlox MXP architecture), these were its two primary advantages.

| a) allow streaming accelerators to replace the regular ALUs. in particular, you can add a systolic array, where you have N^2 PEs and only require O(N) inputs on each edge and produce O
| (N) outputs. such accelerators can be standardized, but first the ISA interface needs to be standardized (see below) and possibly the physical interface (eg, AXI-stream).

| b) allow vector operands to have shapes, in particular to allow tensor shapes. this affects data readout ordering and writeback ordering when connecting to the accelerator. this would
| allow, for example, a traditional 1D vector to be interpreted as a 2D tile, or even a 2D subtile. this affects the address generators to the register file, and may require data swizzling
| to rearrange into the proper shapes and to avoid data bubbles.

| In addition to the above, we loosely organized our register file as a multi-banked scratchpad, rather than having fixed-size disjoint registers. This allowed a named vector register to
| be replaced by a scalar address (a pointer) which fits in the scalar register file. This allowed vectors of arbitrary length, and to start at arbitrary locations, producing much more
| flexible shapes and subshapes to be read out. This property is probably too much for people to accept right away, but it is needed when you want to have maximum flexibility for both (a)
| and (b) above.

| Note that none of this has anything specifically to do with EDIV. However, it could build upon the vtypes system that Krste has devised. (He previously tried to suggest matrix shapes as
| possible vtype. In his suggestion, each vector register had its own type descriptor; in the MXP architecture, the type descriptor is global like the vector length register but there are
| 3 copies of them so src1, src2 and dst can all be different shapes.)

| Guy

| On Fri, Feb 4, 2022 at 10:27 AM Ken Dockser <kad@...> wrote:

|     Thanks folks, these are all very good points.

|     Earl: I absolutely agree that these extensions (like all RISC-V extensions) need to be developed based on real-world needs and need to be able to show their value (including utility
|     and usability),  as well as value/cost ratio.

|     Guy: I agree that we need to look at the other leading architectures as we are well behind them in this area. We then need to come up with our own solutions that address the market
|     needs and fit within the RISC-V approach and philosophy. 

|     Peter: Yes, we need to work to create extensions that take into account our future needs and intentions. In this case, where we are talking about adding instructions that improve
|     matrix performance in the vector registers, we need to keep in mind how this might fit with future extensions that operate on matrices.

|     We still need to figure out how we can effectively and efficiently take this next step in RISC-V Vector.  It seems like the best approach would be to leverage the existing Vector TG
|     by producing an updated charter that is focused on completing the Zvediv extension. Is this permitted/possible? Are the current Chair and Vice Chair amenable to this?

|     Thanks,
|     Ken

|     On Thu, Feb 3, 2022 at 10:11 PM Guy Lemieux <guy.lemieux@...> wrote:

|         The AMX extension for AVX512 is an interesting approach.... 

|         https://en.wikichip.org/wiki/intel/dl_boost

|         On Thu, Feb 3, 2022 at 8:02 PM Earl Killian <earl.killian@...> wrote:

|             I hope that these discussions will begin with algorithms and applications that need the additional performance, and proceed to analyze how proposed instructions address the
|             greater need.

|                 On Feb 3, 2022, at 14:45, Peter Lieber <peteralieber@...> wrote:

|                 In the Graphics/ML SIG, we also discussed matrix operations.  We talked about a single matrix opcode with various functions like vector-matrix, matrix-matrix, and
|                 dot product functions, among others.  Just some initial ideas... 

|                 The Zvediv extension would be a great start to get dimensionality to vectors, and we would want to keep matrices in mind when reviving the effort.

|                 Regards,
|                 Peter Lieber

|                 On Thu, Feb 3, 2022 at 2:17 PM Ken Dockser <kad@...> wrote:

|                     I am hearing renewed interest in adding a dot-product extension to the RISC-V Vectors. This includes everything from adding a handful of FP and Int dot-product
|                     instructions all the way to completing the Zvediv extension.

|                     What is the most efficient way to revive the efforts in these areas? Can we reconvene the Vector TG meetings?

|                     Thanks,
|                     Ken

|






Re: Zvediv extension discussions

Krste Asanovic
 

The current vector TG is only tasked with one more deliverable, which
is the IEEE FP16 half-precision extensions (Zvfh, Zvfhmin). This is
effectively already defined in the ratified specification, and so is just
a question of ratifying these extension names and the detailed
contents of each extension.

There is also Zvamo, which was dropped due to conflict with scalar
subword AMO encoding space, but which will need a non-trivial redesign
and hence should be a new TG. It is probably not a high priority
extension compared to others.

Zvediv was only sketched out during the vector development process,
and it is not clear to me (as the original creator) that this is the
right way to go for general matrix operations.

Another thing to be aware of is that from TSC, vector crypto has the
highest priority among any new vector extensions. Vector BF16 support
is another TSC priority.

There are a lot of ideas in this thread, but there is a lot of work to
define something that will also work well on OoO cores. The SIG is
probably a good place to work through some ideas and arrive at more
concrete proposals that can lead to a TG.

Krste


On Fri, 4 Feb 2022 11:00:23 -0800, "Guy Lemieux" <guy.lemieux@...> said:
| Great points Ken & Earl.
| One thing I'll point out is that this does not necessarily have much to do with EDIV specifically.

| For example, the main goal of EDIV is to support smaller element dot products, particularly for integers. This helps with ML inference, sure. But it leaves a lot of performance on the
| table, may not help much with operations other than dot products, and probably won't help much with other applications (including ML training).

| There are two angles that would supersede this in capability and performance:
| -- adding vector shapes
| -- adding accelerators to the vector unit
| -- in particular, the combination of both shapes + accelerators

| In my work (embodied primarily by the VectorBlox MXP architecture), these were its two primary advantages.

| a) allow streaming accelerators to replace the regular ALUs. in particular, you can add a systolic array, where you have N^2 PEs and only require O(N) inputs on each edge and produce O
| (N) outputs. such accelerators can be standardized, but first the ISA interface needs to be standardized (see below) and possibly the physical interface (eg, AXI-stream).

| b) allow vector operands to have shapes, in particular to allow tensor shapes. this affects data readout ordering and writeback ordering when connecting to the accelerator. this would
| allow, for example, a traditional 1D vector to be interpreted as a 2D tile, or even a 2D subtile. this affects the address generators to the register file, and may require data swizzling
| to rearrange into the proper shapes and to avoid data bubbles.

| In addition to the above, we loosely organized our register file as a multi-banked scratchpad, rather than having fixed-size disjoint registers. This allowed a named vector register to
| be replaced by a scalar address (a pointer) which fits in the scalar register file. This allowed vectors of arbitrary length, and to start at arbitrary locations, producing much more
| flexible shapes and subshapes to be read out. This property is probably too much for people to accept right away, but it is needed when you want to have maximum flexibility for both (a)
| and (b) above.

| Note that none of this has anything specifically to do with EDIV. However, it could build upon the vtypes system that Krste has devised. (He previously tried to suggest matrix shapes as
| possible vtype. In his suggestion, each vector register had its own type descriptor; in the MXP architecture, the type descriptor is global like the vector length register but there are
| 3 copies of them so src1, src2 and dst can all be different shapes.)

| Guy

| On Fri, Feb 4, 2022 at 10:27 AM Ken Dockser <kad@...> wrote:

| Thanks folks, these are all very good points.

| Earl: I absolutely agree that these extensions (like all RISC-V extensions) need to be developed based on real-world needs and need to be able to show their value (including utility
| and usability),  as well as value/cost ratio.

| Guy: I agree that we need to look at the other leading architectures as we are well behind them in this area. We then need to come up with our own solutions that address the market
| needs and fit within the RISC-V approach and philosophy. 

| Peter: Yes, we need to work to create extensions that take into account our future needs and intentions. In this case, where we are talking about adding instructions that improve
| matrix performance in the vector registers, we need to keep in mind how this might fit with future extensions that operate on matrices.

| We still need to figure out how we can effectively and efficiently take this next step in RISC-V Vector.  It seems like the best approach would be to leverage the existing Vector TG
| by producing an updated charter that is focused on completing the Zvediv extension. Is this permitted/possible? Are the current Chair and Vice Chair amenable to this?

| Thanks,
| Ken

| On Thu, Feb 3, 2022 at 10:11 PM Guy Lemieux <guy.lemieux@...> wrote:

| The AMX extension for AVX512 is an interesting approach.... 

| https://en.wikichip.org/wiki/intel/dl_boost

| On Thu, Feb 3, 2022 at 8:02 PM Earl Killian <earl.killian@...> wrote:

| I hope that these discussions will begin with algorithms and applications that need the additional performance, and proceed to analyze how proposed instructions address the
| greater need.

| On Feb 3, 2022, at 14:45, Peter Lieber <peteralieber@...> wrote:

|                 In the Graphics/ML SIG, we also discussed matrix operations.  We talked about a single matrix opcode with various functions like vector-matrix, matrix-matrix, and
| dot product functions, among others.  Just some initial ideas... 

| The Zvediv extension would be a great start to get dimensionality to vectors, and we would want to keep matrices in mind when reviving the effort.

| Regards,
| Peter Lieber

| On Thu, Feb 3, 2022 at 2:17 PM Ken Dockser <kad@...> wrote:

| I am hearing renewed interest in adding a dot-product extension to the RISC-V Vectors. This includes everything from adding a handful of FP and Int dot-product
| instructions all the way to completing the Zvediv extension.

| What is the most efficient way to revive the efforts in these areas? Can we reconvene the Vector TG meetings?

| Thanks,
| Ken

|


Re: Zvediv extension discussions

Guy Lemieux
 

Great points Ken & Earl.

One thing I'll point out is that this does not necessarily have much to do with EDIV specifically.

For example, the main goal of EDIV is to support smaller element dot products, particularly for integers. This helps with ML inference, sure. But it leaves a lot of performance on the table, may not help much with operations other than dot products, and probably won't help much with other applications (including ML training).
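
For reference, a scalar sketch of what such an EDIV-style integer dot product computes (the semantics, function name, and sub-element order are assumptions for illustration, not a ratified encoding): each 32-bit element is treated as four packed int8 sub-elements, and their dot product accumulates into the corresponding 32-bit accumulator element.

#include <stddef.h>
#include <stdint.h>

/* Scalar model of an EDIV=4 int8 dot product: for each 32-bit element,
 * multiply the four packed int8 sub-elements of vs1 and vs2 pairwise
 * and accumulate the sum into acc.  Little-endian sub-element order
 * assumed; illustrative only. */
static void vdot4_i8(int32_t *acc, const uint32_t *vs1,
                     const uint32_t *vs2, size_t vl)
{
    for (size_t i = 0; i < vl; i++) {
        int32_t sum = 0;
        for (int k = 0; k < 4; k++) {
            int8_t a = (int8_t)(vs1[i] >> (8 * k));
            int8_t b = (int8_t)(vs2[i] >> (8 * k));
            sum += (int32_t)a * (int32_t)b;
        }
        acc[i] += sum;
    }
}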

There are two angles that would supersede this in capability and performance:
-- adding vector shapes
-- adding accelerators to the vector unit
-- in particular, the combination of both shapes + accelerators

In my work (embodied primarily by the VectorBlox MXP architecture), these were its two primary advantages.

a) allow streaming accelerators to replace the regular ALUs. in particular, you can add a systolic array, where you have N^2 PEs and only require O(N) inputs on each edge and produce O(N) outputs. such accelerators can be standardized, but first the ISA interface needs to be standardized (see below) and possibly the physical interface (eg, AXI-stream).

b) allow vector operands to have shapes, in particular to allow tensor shapes. this affects data readout ordering and writeback ordering when connecting to the accelerator. this would allow, for example, a traditional 1D vector to be interpreted as a 2D tile, or even a 2D subtile. this affects the address generators to the register file, and may require data swizzling to rearrange into the proper shapes and to avoid data bubbles.

In addition to the above, we loosely organized our register file as a multi-banked scratchpad, rather than having fixed-size disjoint registers. This allowed a named vector register to be replaced by a scalar address (a pointer) which fits in the scalar register file. This allowed vectors of arbitrary length, and to start at arbitrary locations, producing much more flexible shapes and subshapes to be read out. This property is probably too much for people to accept right away, but it is needed when you want to have maximum flexibility for both (a) and (b) above.

Note that none of this has anything specifically to do with EDIV. However, it could build upon the vtypes system that Krste has devised. (He previously tried to suggest matrix shapes as possible vtype. In his suggestion, each vector register had its own type descriptor; in the MXP architecture, the type descriptor is global like the vector length register but there are 3 copies of them so src1, src2 and dst can all be different shapes.)

Guy



On Fri, Feb 4, 2022 at 10:27 AM Ken Dockser <kad@...> wrote:
Thanks folks, these are all very good points.

Earl: I absolutely agree that these extensions (like all RISC-V extensions) need to be developed based on real-world needs and need to be able to show their value (including utility and usability),  as well as value/cost ratio.

Guy: I agree that we need to look at the other leading architectures as we are well behind them in this area. We then need to come up with our own solutions that address the market needs and fit within the RISC-V approach and philosophy. 

Peter: Yes, we need to work to create extensions that take into account our future needs and intentions. In this case, where we are talking about adding instructions that improve matrix performance in the vector registers, we need to keep in mind how this might fit with future extensions that operate on matrices.

We still need to figure out how we can effectively and efficiently take this next step in RISC-V Vector.  It seems like the best approach would be to leverage the existing Vector TG by producing an updated charter that is focused on completing the Zvediv extension. Is this permitted/possible? Are the current Chair and Vice Chair amenable to this?

Thanks,
Ken

On Thu, Feb 3, 2022 at 10:11 PM Guy Lemieux <guy.lemieux@...> wrote:
The AMX extension for AVX512 is an interesting approach.... 


On Thu, Feb 3, 2022 at 8:02 PM Earl Killian <earl.killian@...> wrote:
I hope that these discussions will begin with algorithms and applications that need the additional performance, and proceed to analyze how proposed instructions address the greater need.

On Feb 3, 2022, at 14:45, Peter Lieber <peteralieber@...> wrote:

In the Graphics/ML SIG, we also discussed matrix operations.  We talked about a single matrix opcode with various functions like vector-matrix, matrix-matrix, and dot product functions, among others.  Just some initial ideas... 

The Zvediv extension would be a great start to get dimensionality to vectors, and we would want to keep matrices in mind when reviving the effort.

Regards,
Peter Lieber


On Thu, Feb 3, 2022 at 2:17 PM Ken Dockser <kad@...> wrote:
I am hearing renewed interest in adding a dot-product extension to the RISC-V Vectors. This includes everything from adding a handful of FP and Int dot-product instructions all the way to completing the Zvediv extension.

What is the most efficient way to revive the efforts in these areas? Can we reconvene the Vector TG meetings?

Thanks,
Ken




Re: Zvediv extension discussions

Ken Dockser
 

Thanks folks, these are all very good points.

Earl: I absolutely agree that these extensions (like all RISC-V extensions) need to be developed based on real-world needs and need to be able to show their value (including utility and usability),  as well as value/cost ratio.

Guy: I agree that we need to look at the other leading architectures as we are well behind them in this area. We then need to come up with our own solutions that address the market needs and fit within the RISC-V approach and philosophy. 

Peter: Yes, we need to work to create extensions that take into account our future needs and intentions. In this case, where we are talking about adding instructions that improve matrix performance in the vector registers, we need to keep in mind how this might fit with future extensions that operate on matrices.

We still need to figure out how we can effectively and efficiently take this next step in RISC-V Vector.  It seems like the best approach would be to leverage the existing Vector TG by producing an updated charter that is focused on completing the Zvediv extension. Is this permitted/possible? Are the current Chair and Vice Chair amenable to this?

Thanks,
Ken

On Thu, Feb 3, 2022 at 10:11 PM Guy Lemieux <guy.lemieux@...> wrote:
The AMX extension for AVX512 is an interesting approach.... 


On Thu, Feb 3, 2022 at 8:02 PM Earl Killian <earl.killian@...> wrote:
I hope that these discussions will begin with algorithms and applications that need the additional performance, and proceed to analyze how proposed instructions address the greater need.

On Feb 3, 2022, at 14:45, Peter Lieber <peteralieber@...> wrote:

In the Graphics/ML SIG, we also discussed matrix operations.  We talked about a single matrix opcode with various functions like vector-matrix, matrix-matrix, and dot product functions, among others.  Just some initial ideas... 

The Zvediv extension would be a great start to get dimensionality to vectors, and we would want to keep matrices in mind when reviving the effort.

Regards,
Peter Lieber


On Thu, Feb 3, 2022 at 2:17 PM Ken Dockser <kad@...> wrote:
I am hearing renewed interest in adding a dot-product extension to the RISC-V Vectors. This includes everything from adding a handful of FP and Int dot-product instructions all the way to completing the Zvediv extension.

What is the most efficient way to revive the efforts in these areas? Can we reconvene the Vector TG meetings?

Thanks,
Ken




Re: Zvediv extension discussions

Guy Lemieux
 

The AMX extension for AVX512 is an interesting approach.... 


On Thu, Feb 3, 2022 at 8:02 PM Earl Killian <earl.killian@...> wrote:
I hope that these discussions will begin with algorithms and applications that need the additional performance, and proceed to analyze how proposed instructions address the greater need.

On Feb 3, 2022, at 14:45, Peter Lieber <peteralieber@...> wrote:

In the Graphics/ML SIG, we also discussed matrix operations.  We talked about a single matrix opcode with various functions like vector-matrix, matrix-matrix, and dot product functions, among others.  Just some initial ideas... 

The Zvediv extension would be a great start to get dimensionality to vectors, and we would want to keep matrices in mind when reviving the effort.

Regards,
Peter Lieber


On Thu, Feb 3, 2022 at 2:17 PM Ken Dockser <kad@...> wrote:
I am hearing renewed interest in adding a dot-product extension to the RISC-V Vectors. This includes everything from adding a handful of FP and Int dot-product instructions all the way to completing the Zvediv extension.

What is the most efficient way to revive the efforts in these areas? Can we reconvene the Vector TG meetings?

Thanks,
Ken




Re: Zvediv extension discussions

Earl Killian
 

I hope that these discussions will begin with algorithms and applications that need the additional performance, and proceed to analyze how proposed instructions address the greater need.

On Feb 3, 2022, at 14:45, Peter Lieber <peteralieber@...> wrote:

In the Graphics/ML SIG, we also discussed matrix operations.  We talked about a single matrix opcode with various functions like vector-matrix, matrix-matrix, and dot product functions, among others.  Just some initial ideas... 

The Zvediv extension would be a great start to get dimensionality to vectors, and we would want to keep matrices in mind when reviving the effort.

Regards,
Peter Lieber


On Thu, Feb 3, 2022 at 2:17 PM Ken Dockser <kad@...> wrote:
I am hearing renewed interest in adding a dot-product extension to the RISC-V Vectors. This includes everything from adding a handful of FP and Int dot-product instructions all the way to completing the Zvediv extension.

What is the most efficient way to revive the efforts in these areas? Can we reconvene the Vector TG meetings?

Thanks,
Ken




Re: Zvediv extension discussions

Peter Lieber
 

In the Graphics/ML SIG, we also discussed matrix operations.  We talked about a single matrix opcode with various functions like vector-matrix, matrix-matrix, and dot product functions, among others.  Just some initial ideas... 

The Zvediv extension would be a great start to get dimensionality to vectors, and we would want to keep matrices in mind when reviving the effort.

Regards,
Peter Lieber


On Thu, Feb 3, 2022 at 2:17 PM Ken Dockser <kad@...> wrote:
I am hearing renewed interest in adding a dot-product extension to the RISC-V Vectors. This includes everything from adding a handful of FP and Int dot-product instructions all the way to completing the Zvediv extension.

What is the most efficient way to revive the efforts in these areas? Can we reconvene the Vector TG meetings?

Thanks,
Ken
