Vector element groups


Krste Asanovic
 

I've been working up a scheme to handle vector element groups in
general, with vector crypto being the first anticipated use case.
This replaces the EDIV concept with a more general group concept that
is also less costly to implement as it does not require new vector
memory instructions. I put the file up as a separate document in the vector
spec GitHub repository:

https://github.com/riscv/riscv-v-spec/blob/master/element_groups.adoc

I'm hoping to discuss this in the next vector crypto meeting, but also
welcome discussion on these lists (or in GitHub issues).

Krste


Allen Baum
 

A quick look, and ugh: yet another architectural option, unconnected to any opcode or CSR bit, that we have to specify to get the correct operation in Sail:
"Implementations are recommended to raise an illegal instruction exception for
vl value that is not a multiple of the element group size."


Jon Tate
 

It's interesting that there seems to be some overlap between this work and the work I've been doing for a matrix multiplication subextension proposal for machine learning workloads. Perhaps we should approach this element group concept from a more general direction?

--
Jon Tate
Software Engineer, Project Shodan, Google


Earl Killian
 

While I share some concern about the cited language, as this is a concept, and not a spec, I think the time to require checking would be when individual specs implement the concept. I would think it would require some pretty good justification to not have an exception.

On another topic, I have this vague feeling that it would be best if we had VL and SEW always set for vector instructions, and not be implicit in the opcode, but I have not fleshed out this thought. Perhaps someone who has thought about it more would like to elucidate the issues?


Zalman Stern
 

Inasmuch as this is a concept, it may be worth commenting on how this should interact with 64-bit instructions that encode SEW in the instruction. (Which, if I am not mistaken, is a possible future direction.) Specifically, I would expect the types of instructions mentioned as having, or requiring, fixed SEW to not get longer forms. It may be worth introducing standard terminology for an instruction that is specified as not using SEW.

Adding both an array notation and a graphical representation to define instruction behavior would help a great deal. It is a general problem in documenting ML operators that prose describing array operations is imprecise and invariably relies on the reader already having a pretty good idea of what is going on in order to fill in the gaps. This concept would be stronger if it were a coding style guide for a more formal language in addition to describing vocabulary. Clarity and precision of specification become more important as lane grouping and cross-lane interaction increase in the instruction operation.

-Z-


Ken Dockser
 

Thanks for putting this concept proposal together, Krste.

I have several initial comments and questions:
  1. I am all for the concept of element groups. As you point out, they are especially useful in cryptography where we need to operate on data sizes that are greater than the (current) largest SEW=64.
  2. SHA256 is used as an example in the document, but with an incorrect number of elements in a group. To be competitive, SHA256 instructions require 128-bit (not 256-bit) inputs and outputs. More specifically, the output needs to be a group of 4 (not 8) 32-bit values and each of the three inputs needs to be a group of 4 32-bit values.
  3. If LMUL is allowed to be used for the concatenation of registers in narrow implementations (i.e., VLEN < EGW) to form the groups, what would be the meaning of LMUL>1 for wider implementations (i.e., when VLEN >= EGW)?
  4. How would vl be interpreted when VLMUL is used for the concatenation of registers to form groups?
  5. Would instructions that are based on element groups be required to support the use of LMUL to form those groups on narrower implementations? Or, could the instruction be defined to require a minimum VLEN?
Thanks,
Ken


Allen Baum
 

There is a plan to insert the formal Sail code for an operator in the spec.
That is an ongoing project that has been started (there's a prototype), but it has no resources.

Vector has many state variables that can affect the result
(i.e., ~80? different legal configurations of ELEN, VLEN, SEW, LMUL, etc.),
and parameterizing the functions to take all the possibilities into account
without replicating the functions is proving to be a bit challenging.
But once that Sail code is written, having it appear in the spec text doesn't look too difficult
(for pure operators. Loads and stores are a different matter....)

I'm not sure that will be as readable and understandable as a picture, though.


Yann Loisel
 

Hi Ken
I'm not sure I follow you on the 128-bit inputs and outputs for SHA256.
The spec speaks of eight 32-bit working variables (a, b, c, ..., g, h) used as the current state, so 256 bits, and hence a group of eight 32-bit values.
Could you please elaborate?
Having an algorithmic representation could be helpful for the overall discussion too.
Thanks
yann

--

Yann Loisel
Principal Security Architect


Nicolas Brunie
 

Hi Yann,
I think Ken is referring to the optimization of splitting the SHA256 state in two and merging rounds. It is described, for example, here: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html
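
To make the split-state idea concrete, here is a minimal Python sketch of two merged SHA-256 rounds. It relies only on the standard round function: after two rounds the new (c, d, g, h) are just the old (a, b, e, f), so only the new (a, b, e, f), i.e. four 32-bit values, has to be produced. The function names and the operand layout (`abef`, `cdgh`, two pre-added W+K words) are assumptions for illustration, not the definition of any actual instruction.

```python
MASK = 0xFFFFFFFF


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK


def sha256_round(state, wk):
    """One standard SHA-256 round; wk is W[t] + K[t] (mod 2^32)."""
    a, b, c, d, e, f, g, h = state
    s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
    ch = (e & f) ^ (~e & MASK & g)
    t1 = (h + s1 + ch + wk) & MASK
    s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
    maj = (a & b) ^ (a & c) ^ (b & c)
    t2 = (s0 + maj) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)


def two_rounds(abef, cdgh, wk0, wk1):
    """Hypothetical two-round step on the split state.

    Returns only the new (a, b, e, f); the new (c, d, g, h) equal the
    old (a, b, e, f), so the caller can simply reuse its input.
    """
    a, b, e, f = abef
    c, d, g, h = cdgh
    s = sha256_round((a, b, c, d, e, f, g, h), wk0)
    s = sha256_round(s, wk1)
    return (s[0], s[1], s[4], s[5])
```

Under this layout every operand and the result is a 4-element group of 32-bit values (128 bits), which is the group size Ken argues for.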

Regards,
Nicolas 


Krste Asanovic
 

On Fri, 15 Jul 2022 09:10:49 -0700, Earl Killian <earl.killian@...> said:
| On another topic, I have this vague feeling that it would be best if we had VL and SEW always set for vector instructions, and
| not be implicit in the opcode, but I have not fleshed out this thought. Perhaps someone who has thought about it more would
| like to elucidate the issues?

We already have vector loads and stores with static EEW in the
instruction, which ignore dynamic SEW. Future 64-bit encodings would
also have static EEWs in the instruction. If static encoding space had been
available, we would not have had dynamic SEW at all.

The current EG proposal does require vl to be set.

Krste


Krste Asanovic
 

On Fri, 15 Jul 2022 09:10:49 -0700, Earl Killian <earl.killian@...> said:
| While I share some concern about the cited language, as this is a concept, and not a spec, I think the time to require checking
| would be when individual specs implement the concept. I would think it would require some pretty good justification to not have
| an exception.

On further thought, I do think it makes sense to require raising an
illegal instruction exception when vl is not a multiple of the element
group size, rather than leaving that case reserved. I will be updating
the doc with the rationale.
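
As a rough illustration of the rule, here is a small Python model of the check. It assumes the element group size is EGS = EGW / EEW; the exception class and function names are hypothetical stand-ins, not part of any spec or simulator.

```python
class IllegalInstruction(Exception):
    """Stand-in for an illegal instruction exception (hypothetical model)."""


def element_group_size(egw_bits, eew_bits):
    """EGS = EGW / EEW: the number of elements in one element group."""
    if egw_bits % eew_bits != 0:
        raise ValueError("EGW must be a multiple of EEW")
    return egw_bits // eew_bits


def check_vl(vl, egs):
    """Return the number of whole element groups, or raise if vl is not
    a multiple of the element group size."""
    if vl % egs != 0:
        raise IllegalInstruction(f"vl={vl} is not a multiple of EGS={egs}")
    return vl // egs
```

For example, with EGW=128 and EEW=32 the group size is 4, so vl=8 covers two groups while vl=6 would raise.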

Krste


Abel Bernabeu
 

Krste,

Sorry it took me a long time to provide feedback.

Yes, this is the kind of feature we could need for graphics and GPGPU-style SIMT. Many thanks for taking the time to think about how the idea behind Zvediv can be introduced.

This is the kind of concept that is needed for designing with vectors for things that are typically designed with warps.

One comment I have is that groups of 3 elements are not a power of two, and they turn out to be:
- popular for graphics
- demanded by OpenCL as well

Is there anything we can do from the graphics SIG to help drive this work?

Regards.






Krste Asanovic
 

On Fri, 26 Aug 2022 13:58:28 +0200, Abel Bernabeu <abel.bernabeu@...> said:
| Krste,
| Sorry it took me a long time to provide feedback.

| Yes, this is the kind of feature we could need for graphics and GPGPU-style SIMT. Many thanks for taking the
| time to think about how the idea behind Zediv can be introduced.

| This is the kind of concept that is needed for designing with vectors for things that are typically designed
| with warps.

| One comment I have is that groups of 3 elements are not power of two and turn out to be:
| - popular for graphics
| - demanded by OpenCL as well

- and a real pain to handle as an element group.

I tried looking at ways to incorporate non-power-of-two (non-POT) group
sizes, but they just introduce too many corner cases that
implementations will not want to handle.

Handling them as four-element groups seems fairly common in other
graphics-oriented programmable hardware, at least old packed-SIMD
ISAs. I took a quick look and OpenCL even specifies that the 3-vectors
are aligned on 4-element boundaries in memory, so that would work fine
with the element group model.

Of course, the 3-element vectors can instead be handled as 3-field
segments, loading the three components into separate vector registers.
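
The two options can be sketched in Python as plain data layouts. This is only an illustration: the helper names are made up here, and the pad value is an arbitrary choice.

```python
def pad_to_groups(vec3s, pad=0.0):
    """Lay out 3-component vectors as 4-element groups, one pad slot each."""
    flat = []
    for x, y, z in vec3s:
        flat.extend([x, y, z, pad])
    return flat


def split_fields(vec3s):
    """Alternative: de-interleave into one list per component, as a
    3-field segment load into separate registers would."""
    xs = [v[0] for v in vec3s]
    ys = [v[1] for v in vec3s]
    zs = [v[2] for v in vec3s]
    return xs, ys, zs
```

With the padded layout each 3-vector occupies one power-of-two group, so vl stays a multiple of the group size.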

Krste




Abel Bernabeu
 

Krste,

Yes, I double checked and you are right. Padding OpenCL vectors to POT is fine.

In terms of the instructions affected by the element groups semantics, I see three cases:
- dot (this was already taken into account in Zvediv)
- workgroup reduction operations
- shuffles, named vector gather in RVV

The impact of element groups is different depending on the case.

For dot, the semantics would define that the scope of the operation would be the work-item.

For workgroup reduction operations, the element groups are the leaves of a binary tree of work-items. The element group width defines the number of components for the vectors on the leaves.

For shuffles (vrgather), the elements are only reordered within their group, like in this proposal for a graphics swizzle:

https://github.com/riscv/riscv-v-spec/compare/master...abel-bernabeu:riscv-v-spec:master

One good thing regarding shuffling in element groups is that there would be no need to change existing vrgather implementations. New, simple instructions that generate the index pattern, taking into account the number of elements per group, would be enough (similar to the swizzle example above).
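
A minimal Python sketch of that idea: generate absolute indices that repeat a per-group pattern, then feed them to an ordinary gather. Both function names are hypothetical, and `gather` only models the in-range case of an index gather.

```python
def group_swizzle_indices(vl, egs, pattern):
    """Absolute gather indices that apply `pattern` inside every group
    of `egs` consecutive elements."""
    if vl % egs != 0 or len(pattern) != egs:
        raise ValueError("vl must be a multiple of egs, one pattern entry per slot")
    if not all(0 <= p < egs for p in pattern):
        raise ValueError("pattern entries must stay inside the group")
    return [(i // egs) * egs + pattern[i % egs] for i in range(vl)]


def gather(src, idx):
    """Model of a plain index gather: out[i] = src[idx[i]]."""
    return [src[j] for j in idx]
```

For instance, with 4-element groups the pattern [3, 2, 1, 0] reverses the components of every group while leaving the groups themselves in place.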

My question is, would you define the element group infrastructure as an extension (more or less what you have) and later add the use cases in additional extensions? That would be four extensions:

- basic CSRs for element groups
- element-group-aware dot
- element-group-aware workgroup reductions
- element-group-aware shuffles

I kind of like it like that, in four extensions, because the list of identified cases may grow during the discussion and you still want to release something within a reasonable time.

Regards.



Krste Asanovic
 

On Sun, 4 Sep 2022 03:33:22 +0200, Abel Bernabeu <abel.bernabeu@...> said:
| Krste,
| Yes, I double checked and you are right. Padding OpenCL vectors to POT is fine.

| In terms of the instructions affected by the element groups semantics, I see three cases:
| - dot (this was already taken into account in Zvediv)
| - workgroup reduction operations
| - shuffles, named vector gather in RVV

| The impact of element groups is different depending on the case.

| For dot, the semantics would define that the scope of the operation would be the work-item.

| For workgroup reduction operations, the element groups are the leaves of a binary tree of workitems. The element group width defines
| the number of components for the vectors on the leaves.

| For shuffles (vrgather), the element groups are only reordered within their group, like in this proposal for a graphics swizzle:

| https://github.com/riscv/riscv-v-spec/compare/master...abel-bernabeu:riscv-v-spec:master

| One good thing regarding shuffling in element groups is that there would be no need for changing existing vrgather implementations.
| Only new simple instructions are needed for generating the indices pattern taking into account the number of elements per group would
| be enough (similarly to the swizzle example above).

| My question is, would you define the elements group infrastructure as an extension (more or less what you have) and later add the use
| cases in additional extensions? That would be four extensions:

The element group is an architectural design pattern meant to be used
by a specific extension, not an extension itself.

| - basic CSRs for element groups
| - element groups aware dot
| - element groups aware workgroup reductions
| - element groups aware shuffles

| I kind of like it like that, in four extensions, because the list of identified cases may grow during the discussion and you still
| want to release something within a reasonable time.

Each of these separate extensions could define the specific element
groups they want to support.

For one example, given in the doc, a vector dot extension might be defined to
take two vectors of 4-element groups of 8b elements as inputs and
accumulate the dot products into a vector of 32b accumulators (e.g.,
INT32 acc[i] += INT8 a[i*4+j] * INT8 b[i*4+j], for j=0..3).
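
A small Python model of such a dot operation, as a sketch only. The function name is hypothetical, and wrapping the accumulator to a signed 32-bit value is an assumption made here, not something stated in the thread.

```python
def eg_dot_accumulate(acc, a, b, egs=4):
    """acc[i] += sum over j of a[i*egs + j] * b[i*egs + j], j = 0..egs-1.

    a and b model vectors of INT8 elements in groups of `egs`;
    acc models INT32 accumulators (wrap-around is an assumption).
    """
    if not (len(a) == len(b) == len(acc) * egs):
        raise ValueError("inputs must hold one element group per accumulator")

    def wrap32(x):
        x &= 0xFFFFFFFF
        return x - (1 << 32) if x & 0x80000000 else x

    return [
        wrap32(acc[i] + sum(a[i * egs + j] * b[i * egs + j] for j in range(egs)))
        for i in range(len(acc))
    ]
```

Each output element consumes exactly one element group from each input, so the input vectors are four times as long as the accumulator vector.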

Similarly, EG reduction operations would probably take the shape of
something like reduce-over-N elements, where N=2,4,8, and where input
and output could differ in EEW.
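
One possible shape for a reduce-over-N operation, sketched in Python with a sum as the reduction. The name and the choice of sum are illustrative assumptions; Python integers are unbounded, which stands in for an output EEW wider than the input.

```python
def eg_reduce_sum(src, n):
    """Sum each group of n consecutive elements, one output per group."""
    if n not in (2, 4, 8):
        raise ValueError("this sketch only models N = 2, 4, 8")
    if len(src) % n != 0:
        raise ValueError("length must be a multiple of the group size")
    return [sum(src[i:i + n]) for i in range(0, len(src), n)]
```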

Recasting the old ediv vrgather into the EG model, we could define an
operation that took vectors of N*8b index groups into N*SEW data
groups and produced N*SEW output data groups (e.g. N=4, 8, 16). There
is an advantage to this over just generating the appropriate index
vector, as knowing the source data vector portion at dispatch time is
much simpler than having to handle random index values at execution
time.
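
A Python sketch of a group-relative gather along these lines. The function name is hypothetical, and rejecting out-of-range indices is an assumption of this model; the thread does not say how they would be handled.

```python
def eg_gather(data, idx, n):
    """out[g*n + k] = data[g*n + idx[g*n + k]].

    Indices are relative to the start of their own group, so output
    group g only ever reads from data group g.
    """
    if len(data) != len(idx) or len(data) % n != 0:
        raise ValueError("data and idx must hold the same whole number of groups")
    out = []
    for base in range(0, len(data), n):
        for k in range(n):
            j = idx[base + k]
            if not 0 <= j < n:
                raise ValueError("index must stay inside the group")
            out.append(data[base + j])
    return out
```

Because the source group of every output group is fixed by position, it is known before any index value is read, which is the dispatch-time advantage described above.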


Krste

