Vector element groups
I've been working up a scheme to handle vector element groups in general, with vector crypto being the first anticipated use case. This replaces the EDIV concept with a more general group concept that is also less costly to implement, as it does not require new vector memory instructions.

I've put the file up as a separate document in the vector spec GitHub repository: https://github.com/riscv/riscv-v-spec/blob/master/element_groups.adoc

I'm hoping to discuss this at the next vector crypto meeting, but also welcome discussion on these lists (or in GitHub issues).

Krste |
|
A quick look, and ugh: yet another architectural option unconnected to any opcode or CSR bit that we have to specify to get the correct operation in Sail:
On Thu, Jul 14, 2022 at 10:31 PM Krste Asanovic <krste@...> wrote:
|
|
Jon Tate
It's interesting that there seems to be some overlap between this work and the work I've been doing for a matrix multiplication subextension proposal for machine learning workloads. Perhaps we should approach this element group concept from a more general direction? On Fri, Jul 15, 2022 at 10:30 AM Allen Baum <allen.baum@...> wrote:
--
Jon Tate Software Engineer, Project Shodan, Google |
|
Earl Killian
While I share some concern about the cited language, as this is a concept, and not a spec, I think the time to require checking would be when individual specs implement the concept. I would think it would require some pretty good justification to not have an exception.
On another topic, I have this vague feeling that it would be best if we had VL and SEW always set for vector instructions, and not be implicit in the opcode, but I have not fleshed out this thought. Perhaps someone who has thought about it more would like to elucidate the issues?
|
|
Zalman Stern
Inasmuch as this is a concept, it may be worth commenting on how this should interact with 64-bit instructions that encode SEW in the instruction. (Which, if I am not mistaken, is a possible future direction.) Specifically, I would expect the types of instructions mentioned as having, or requiring, fixed SEW to not get longer forms. It may be worth introducing standard terminology for an instruction that is specified as not using SEW.

Adding both an array notation and a graphical representation to define instruction behavior would help a great deal. It is a general problem throughout the documentation of ML operators that prose describing array operations is imprecise and invariably relies on the reader already having a pretty good idea of what is going on in order to fill in the gaps. This concept would be stronger if it were a coding style guide for a more formal language in addition to describing vocabulary. Clarity and precision of specification become more important as lane grouping and cross-lane interaction increase in the instruction operation.

-Z-

On Fri, Jul 15, 2022 at 9:11 AM Earl Killian <earl.killian@...> wrote:
|
|
Thanks for putting this concept proposal together, Krste.
I have several initial comments and questions:
Ken |
|
There is a plan to insert the formal Sail code for an operator in the spec. That is an ongoing project that has been started (there's a prototype), but it has no resources. Vector has many state variables that can affect the result (i.e., ~80? different legal configurations of elen, vlen, sew, lmul, etc.), and parameterizing the functions to take all the possibilities into account without replicating the functions is proving to be a bit challenging.

But once that Sail code is written, having it appear in the spec text doesn't look too difficult (for pure operators; loads and stores are a different matter). I'm not sure that will be as readable and understandable as a picture, though.

On Fri, Jul 15, 2022 at 10:09 AM Zalman Stern <zalman@...> wrote:
|
|
Hi Ken,

I'm not sure I follow you on your 128-bit inputs and outputs for SHA256. The spec speaks of eight 32-bit working variables, a, b, c, ..., g, h, used as the current state, so 256 bits, and thus a group of eight 32-bit values. Could you please elaborate here? Having an algorithmic representation could be helpful for the overall discussion too.

Thanks,
yann

On Fri, Jul 15, 2022 at 8:07 PM Ken Dockser <kad@...> wrote: Thanks for putting this concept proposal together, Krste.
--
Yann Loisel Principal Security Architect SiFive France |
|
Nicolas Brunie
Hi Yann,

I think Ken is referring to the optimization of splitting the SHA-256 state in two and merging rounds. It is described, for example, here: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sha-extensions.html

Regards,
Nicolas

On Tue, Jul 19, 2022 at 01:47, Yann Loisel <yann.loisel@...> wrote:
|
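As a rough illustration of the state-splitting optimization mentioned in the message above, the Python sketch below keeps the SHA-256 working variables as two 4-element groups of 32-bit words (128 bits each) and advances them two rounds at a time. The ABEF/CDGH packing follows the Intel SHA extensions as commonly described and is an assumption here; it is not taken from the element-group document. All function names are hypothetical, and only single-block messages (under 56 bytes) are handled.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def icbrt(n):
    """Exact integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Round constants and initial hash values, derived as in the SHA-256 definition:
# fractional bits of cube/square roots of the first 64/8 primes.
PRIMES = [p for p in range(2, 312) if all(p % d for d in range(2, p))][:64]
K = [icbrt(p << 96) & MASK for p in PRIMES]
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def one_round(a, b, c, d, e, f, g, h, wk):
    """One standard SHA-256 round; wk is W[t] + K[t] (mod 2^32)."""
    t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
          + ((e & f) ^ (~e & g)) + wk) & MASK
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
          + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

def two_rounds(abef, cdgh, wk0, wk1):
    """Two merged rounds on two 4 x 32-bit groups (assumed ABEF/CDGH packing)."""
    a, b, e, f = abef
    c, d, g, h = cdgh
    s = one_round(a, b, c, d, e, f, g, h, wk0)
    a2, b2, _, _, e2, f2, _, _ = one_round(*s, wk1)
    # After two rounds, the new CDGH group is exactly the old ABEF group.
    return (a2, b2, e2, f2), abef

def compress(state, block):
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    abef, cdgh = (a, b, e, f), (c, d, g, h)
    for t in range(0, 64, 2):
        abef, cdgh = two_rounds(abef, cdgh,
                                (w[t] + K[t]) & MASK,
                                (w[t + 1] + K[t + 1]) & MASK)
    a, b, e, f = abef
    c, d, g, h = cdgh
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]

def sha256_one_block(msg):
    """SHA-256 of a message shorter than 56 bytes (fits in one padded block)."""
    if len(msg) >= 56:
        raise ValueError("sketch handles single-block messages only")
    padded = msg + b"\x80" + b"\x00" * (55 - len(msg)) + struct.pack(">Q", 8 * len(msg))
    return struct.pack(">8I", *compress(H0, padded))

# Cross-check against the standard library implementation.
assert sha256_one_block(b"abc") == hashlib.sha256(b"abc").digest()
```

Because the new CDGH group is just the old ABEF group, only one 128-bit group has to be computed per pair of rounds, which is consistent with an instruction that takes and produces 128-bit (4 x 32-bit) element groups.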
|
On Fri, 15 Jul 2022 09:10:49 -0700, Earl Killian <earl.killian@...> said:
| On another topic, I have this vague feeling that it would be best if we had VL and SEW always set for vector instructions, and
| not be implicit in the opcode, but I have not fleshed out this thought. Perhaps someone who has thought about it more would
| like to elucidate the issues?

We already have vector loads and stores with static EEW in the instruction, which ignore dynamic SEW. Future 64-bit encodings would also have static EEWs in the instruction. If static encoding space had been available, we would not have had dynamic SEW at all.

The current EG proposal does require vl to be set.

Krste |
|
On Fri, 15 Jul 2022 09:10:49 -0700, Earl Killian <earl.killian@...> said:
| While I share some concern about the cited language, as this is a concept, and not a spec, I think the time to require checking
| would be when individual specs implement the concept. I would think it would require some pretty good justification to not have
| an exception.

On further thought, I do think it makes sense to require raising an illegal instruction exception when vl is not a multiple of the element group size, rather than leaving that case reserved. I will be updating the doc with the rationale.

Krste |
|
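The rule stated in the message above (trap when vl is not a multiple of the element group size) can be sketched as a small check. This is only an illustrative model in Python, not Sail and not spec text; the names `IllegalInstruction` and `element_group_count` are hypothetical, and the power-of-two restriction on the group size is an assumption based on the later discussion in this thread.

```python
class IllegalInstruction(Exception):
    """Stand-in for an illegal instruction exception in this toy model."""

def element_group_count(vl, egs):
    """Return the number of element groups covered by vl.

    egs is the element group size in elements (assumed to be a power of two).
    Raises IllegalInstruction when vl is not a multiple of egs.
    """
    if egs <= 0 or egs & (egs - 1):
        raise ValueError("element group size must be a positive power of two")
    if vl % egs != 0:
        raise IllegalInstruction(f"vl={vl} is not a multiple of EGS={egs}")
    return vl // egs
```

For example, `element_group_count(8, 4)` gives 2 groups, while `element_group_count(6, 4)` raises `IllegalInstruction`, since the last group would be only partially covered.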
Abel Bernabeu
Krste,

Sorry it took me a long time to provide feedback.

Yes, this is the kind of feature we could need for graphics and GPGPU-style SIMT. Many thanks for taking the time to think about how the idea behind Zvediv can be introduced. This is the kind of concept that is needed for designing with vectors for things that are typically designed with warps.

One comment I have is that groups of 3 elements are not a power of two and turn out to be:
- popular for graphics
- demanded by OpenCL as well

Is there anything we can do from the graphics SIG to help drive this work?

Regards.

On Fri, Jul 22, 2022 at 10:36 AM Krste Asanovic <krste@...> wrote:
|
|
On Fri, 26 Aug 2022 13:58:28 +0200, Abel Bernabeu <abel.bernabeu@...> said:
| Krste,
| Sorry it took me a long time to provide feedback.
| Yes, this is the kind of feature we could need for graphics and GPGPU-style SIMT. Many thanks for taking the
| time to think about how the idea behind Zvediv can be introduced.
| This is the kind of concept that is needed for designing with vectors for things that are typically designed
| with warps.
| One comment I have is that groups of 3 elements are not a power of two and turn out to be:
| - popular for graphics
| - demanded by OpenCL as well

- and a real pain to handle as an element group.

I tried looking at ways to incorporate non-POT group sizes, but they just introduce too many corner cases that implementations will not want to handle.

Handling them as four-element groups seems fairly common in other graphics-oriented programmable hardware, at least old packed-SIMD ISAs. I took a quick look and OpenCL even specifies that the 3-vectors are aligned on 4-element boundaries in memory, so that would work fine with the element group model.

Of course, the 3-element vectors can instead be handled as 3-field segments, loading the three components into separate vector registers.

Krste

| Is there anything we can do from the graphics SIG to help drive this work?
| Regards.
|
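The two ways of handling 3-element vectors described in the message above can be pictured with plain Python lists. This is a minimal sketch of the data layouts only; the helper names `pad_to_groups` and `split_segments` are hypothetical and do not correspond to any instruction or API.

```python
def pad_to_groups(flat, n=3, egs=4, fill=0.0):
    """Lay out tightly packed n-element vectors as egs-element groups.

    Each n-element vector is followed by (egs - n) padding elements, which
    mirrors 3-vectors stored on 4-element boundaries.
    """
    if len(flat) % n != 0:
        raise ValueError("input length must be a multiple of n")
    out = []
    for i in range(0, len(flat), n):
        out.extend(flat[i:i + n])
        out.extend([fill] * (egs - n))
    return out

def split_segments(flat, nfields=3):
    """Split interleaved records into one list per field.

    This models a 3-field segment access that places the x, y and z
    components in three separate vector registers.
    """
    if len(flat) % nfields != 0:
        raise ValueError("input length must be a multiple of nfields")
    return [flat[f::nfields] for f in range(nfields)]

xyz = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # two packed 3-vectors
groups = pad_to_groups(xyz)            # two 4-element groups, one pad lane each
fields = split_segments(xyz)           # [xs, ys, zs]
```

In the padded layout one lane in four carries no data, whereas the segment layout uses every lane but spreads each 3-vector across three registers.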
Abel Bernabeu
Krste,

Yes, I double checked and you are right. Padding OpenCL vectors to POT is fine.

In terms of the instructions affected by the element groups semantics, I see three cases:
- dot (this was already taken into account in Zvediv)
- workgroup reduction operations
- shuffles, named vector gather in RVV

The impact of element groups is different depending on the case.

For dot, the semantics would define that the scope of the operation would be the work-item.

For workgroup reduction operations, the element groups are the leaves of a binary tree of work-items. The element group width defines the number of components for the vectors on the leaves.

For shuffles (vrgather), the elements are only reordered within their group, like in this proposal for a graphics swizzle: https://github.com/riscv/riscv-v-spec/compare/master...abel-bernabeu:riscv-v-spec:master

One good thing regarding shuffling in element groups is that there would be no need to change existing vrgather implementations. New simple instructions that generate the index pattern, taking into account the number of elements per group, would be enough (similarly to the swizzle example above).

My question is, would you define the element group infrastructure as an extension (more or less what you have) and later add the use cases in additional extensions? That would be four extensions:
- basic CSRs for element groups
- element groups aware dot
- element groups aware workgroup reductions
- element groups aware shuffles

I kind of like it like that, in four extensions, because the list of identified cases may grow during the discussion and you still want to release something within a reasonable time.

Regards.

On Mon, Aug 29, 2022 at 1:34 AM <krste@...> wrote:
|
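The shuffle case in the message above (generate a per-group index pattern, then reuse an ordinary gather) can be sketched as follows. This is a toy Python model under stated assumptions: `vrgather` models only the basic unmasked behaviour of the RVV gather (out-of-range indices yield 0), and `swizzle_indices` is a hypothetical helper, not an instruction from the linked proposal.

```python
def vrgather(src, idx):
    """Toy model of an unmasked vector gather: out[i] = src[idx[i]], or 0 if out of range."""
    return [src[j] if 0 <= j < len(src) else 0 for j in idx]

def swizzle_indices(vl, egs, pattern):
    """Build a full-length index vector that applies `pattern` inside every group.

    pattern[k] names which element of the group ends up in position k, so all
    generated indices stay within the group that contains position i.
    """
    if len(pattern) != egs or any(not 0 <= p < egs for p in pattern):
        raise ValueError("pattern must hold egs indices in the range [0, egs)")
    if vl % egs != 0:
        raise ValueError("vl must be a multiple of the element group size")
    return [(i - i % egs) + pattern[i % egs] for i in range(vl)]

data = list("abcdefgh")                       # two groups of four elements
idx = swizzle_indices(8, 4, [3, 2, 1, 0])     # reverse within each group
shuffled = vrgather(data, idx)
```

With these inputs `idx` is `[3, 2, 1, 0, 7, 6, 5, 4]` and `shuffled` is `['d', 'c', 'b', 'a', 'h', 'g', 'f', 'e']`: the gather itself is unchanged, and only the index generation knows about groups.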
|
On Sun, 4 Sep 2022 03:33:22 +0200, Abel Bernabeu <abel.bernabeu@...> said:
| Krste,
| Yes, I double checked and you are right. Padding OpenCL vectors to POT is fine.
| In terms of the instructions affected by the element groups semantics, I see three cases:
| - dot (this was already taken into account in Zvediv)
| - workgroup reduction operations
| - shuffles, named vector gather in RVV
| The impact of element groups is different depending on the case.
| For dot, the semantics would define that the scope of the operation would be the work-item.
| For workgroup reduction operations, the element groups are the leaves of a binary tree of workitems. The element group width defines
| the number of components for the vectors on the leaves.
| For shuffles (vrgather), the element groups are only reordered within their group, like in this proposal for a graphics swizzle:
| https://github.com/riscv/riscv-v-spec/compare/master...abel-bernabeu:riscv-v-spec:master
| One good thing regarding shuffling in element groups is that there would be no need for changing existing vrgather implementations.
| Only new simple instructions are needed for generating the indices pattern taking into account the number of elements per group would
| be enough (similarly to the swizzle example above).
| My question is, would you define the elements group infrastructure as an extension (more or less what you have) and later add the use
| cases in additional extensions? That would be four extensions:

The element group is an architectural design pattern meant to be used by a specific extension, not an extension itself.

| - basic CSRs for element groups
| - element groups aware dot
| - element groups aware workgroup reductions
| - element groups aware shuffles
| I kind of like it like that, in four extensions, because the list of identified cases may grow during the discussion and you still
| want to release something within a reasonable time.

Each of these separate extensions could define the specific element groups they want to support.

For one example, given in the doc, a vector dot extension might be defined to take two vectors of 4-element groups of 8b elements as inputs and accumulate the dot products into a vector of 32b accumulators (e.g., INT32[i] += INT8[i*4+j] * INT8[i*4+j], j=0..3).

Similarly, EG reduction operations would probably take the shape of something like reduce-over-N elements, where N=2,4,8, and where input and output could differ in EEW.

Recasting the old ediv vrgather into the EG model, we could define an operation that took vectors of N*8b index groups into N*SEW data groups and produced N*SEW output data groups (e.g. N=4, 8, 16). There is an advantage to this over just generating the appropriate index vector, as knowing the source data vector portion at dispatch time is much simpler than having to handle random index values at execution time.

Krste

| Regards.
|
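The dot-product and gather shapes described in the last message can be written out as a small Python model. This is a sketch under assumptions, not a definition of any instruction: the function names are hypothetical, the two INT8 operands are taken to be the two input vectors, accumulators are assumed to wrap modulo 2^32, and the gather is assumed to reject indices that point outside the group.

```python
def wrap_int32(x):
    """Wrap an integer to signed 32-bit range (assumed accumulator behaviour)."""
    return ((x + 2**31) % 2**32) - 2**31

def eg_dot_accumulate(acc, a, b, egs=4):
    """acc[i] += sum over j in 0..egs-1 of a[i*egs + j] * b[i*egs + j].

    a and b hold egs-element groups of small integers (8b in the example),
    and acc holds one 32b accumulator per group.
    """
    if len(a) != len(b) or len(a) != egs * len(acc):
        raise ValueError("inputs must hold exactly one group per accumulator")
    out = []
    for i, total in enumerate(acc):
        for j in range(egs):
            total += a[i * egs + j] * b[i * egs + j]
        out.append(wrap_int32(total))
    return out

def eg_gather(data, index, n=4):
    """Group-local gather: out[g*n + j] = data[g*n + index[g*n + j]].

    Every index selects an element from the same n-element group as the
    destination position, so the source group is known from the position alone.
    """
    if len(data) != len(index) or len(data) % n != 0:
        raise ValueError("data and index must hold the same whole number of groups")
    if any(not 0 <= k < n for k in index):
        raise ValueError("indices must stay within the group (assumption)")
    return [data[(i - i % n) + k] for i, k in enumerate(index)]

sums = eg_dot_accumulate([10, 0],
                         [1, 2, 3, 4, 1, 1, 1, 1],
                         [1, 1, 1, 1, 5, 6, 7, 8])
picked = eg_gather([10, 11, 12, 13, 20, 21, 22, 23],
                   [3, 3, 0, 1, 0, 0, 2, 2])
```

Here `sums` is `[20, 26]` and `picked` is `[13, 13, 10, 11, 20, 20, 22, 22]`. In `eg_gather` the group base `i - i % n` depends only on the destination position, which illustrates the point that the source portion of the data vector is known without looking at the index values.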