Andrew Waterman
Guy pointed out to me that, since several V ISA-design issues have been punted to an eventual 64-bit instruction encoding, we should consider recording them somewhere. I've set up the GitHub wiki for the purpose of recording design rationale that doesn't belong in the spec proper, and have seeded it with a very short list of hypothetical 64b features. Feel free to edit directly if you have write permissions: https://github.com/riscv/riscv-v-spec/wiki

Richard Newell
Hi all,
I am not sure if these require 64-bit encoding, but I am interested in extended data types, especially signed-integer-complex, single-precision floating-point complex, and unums.
Rich
G. Richard Newell
Assoc. Technical Fellow, FPGA Business Unit, Microchip Technology

Claire Wolf
My current "long instruction encoding" proposal has an example encoding for 64-bit V extension instructions:
this contains everything currently in that wiki except:
- Indexed memory accesses that implicitly scale the index by SEW/8
- Indexed memory accesses that decouple index width from data width
where can I find more information on these?

Nagendra Gulur
It appears I cannot edit the wiki, but I can clarify one item.
Regarding "Indexed memory accesses that implicitly scale the index by SEW/8": Explanation: In scientific sparse matrix codes (and perhaps also DNN codes), sparse matrices are represented by column indices of non-zero values. In such cases, the loaded indices must be converted to element address offsets by scaling (left shifting) the indices by 0 / 1 / 2 / 3 positions. For eg: if SEW=32, then scale the indices by 4 (left shift by 2). The idea of the instruction capability is to specify this scaling behavior -- it is not always desirable to have scaling going on, so there needs to be a way for the instruction to specify if and what scaling is to be done. Note that this scaling operation is tied to vector loads that load the index data and not to the indexed vector loads that use these (now-scaled) indices.
I am not sure what Andrew had in mind regarding the other index width topic listed.
Since I cannot edit the wiki, I have to raise another item for 64-bit encoding here: the vector reduction destination. Would it be possible to specify the reduction destination explicitly in the instruction, rather than always using the implicit vd[0]?
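Roughly the semantics I have in mind, as a scalar C sketch (the names and the seeding convention are illustrative only):

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of a sum reduction with an explicit destination element.
     * Today the result always lands in the implicit vd[0]; the proposal
     * is to let the instruction name the destination element instead.
     */
    void vredsum_to_element(int32_t *vd, const int32_t *vs1,
                            const int32_t *vs2, size_t vl, size_t dst)
    {
        int32_t acc = vs1[0];              /* scalar seed, as in vredsum today */
        for (size_t i = 0; i < vl; i++)
            acc += vs2[i];
        vd[dst] = acc;                     /* explicit element, not vd[0] */
    }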

Claire Wolf
regarding vector reduction destination: the V spec seems to allow for really large vector machines with thousands of vector elements. I'm not sure what the right bit width for the field holding the reduction destination would be.

Nagendra Gulur
What if the destination element number came from a scalar register? Then we would need only 5 bits to specify the x register.
This may even work better than hard-coding the destination inside the vector reduction instruction, since it lets software control the destination dynamically.
Best regards, Nagendra

Claire Wolf
to me, reading the dest element index from a scalar reg sounds like significant microarchitectural overhead. can you describe why this is needed?
I would assume the use cases for this replace sequences of reduce operations interleaved with vector slide operations. if the reduction is a multi-cycle op and the slide is a single-cycle op, then getting rid of the single-cycle op won't change much in terms of performance. and if it's just to squeeze out the last bit of extra performance, then maybe an alternative would be to implement instruction fusion between reduce and vector slide (although that might cause issues with instruction throughput if some of the instructions are already 64-bit wide).

Nagendra Gulur
Open to feedback here.
But my thought was that I would not need vslide1up if I were able to control the reduction destination.
A loop around the instructions:
vredsum vd[rs1], vs1[rs1], vs2[*]
rs1 = rs1 + 1
can perform a vector-width's worth of reductions into adjacent elements without sliding data through the vector between reductions. Doing it this way is probably cheaper than shifting data through vd.
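In scalar C terms, the effect I am after is something like the sketch below (rows and n_rows are illustrative stand-ins for whatever data each reduction consumes):

    #include <stddef.h>
    #include <stdint.h>

    /* Scalar sketch of the loop's effect: one reduction per iteration,
     * each result landing in the next element of vd, with no slides
     * in between.
     */
    void reductions_into_adjacent_elems(int32_t *vd, const int32_t *rows,
                                        size_t n_rows, size_t vl)
    {
        for (size_t r = 0; r < n_rows; r++) {  /* rs1 = rs1 + 1 each iteration */
            int32_t acc = 0;
            for (size_t i = 0; i < vl; i++)
                acc += rows[r * vl + i];       /* the vredsum over vs2[*] */
            vd[r] = acc;                       /* written to vd[rs1], not vd[0] */
        }
    }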
See any issues with the way I am thinking here?
Best regards, Nagendra

Guy
my response below is now off-topic, as it covers the more flexible reductions Nagendra wants. i discourage any further follow-ups here (instead, please search for another recent series of posts by Nagendra);
this thread should stay on-topic for 64b extension suggestions.
-----
it is unlikely reductions will target scalar registers. as mentioned before, this forces a hard synchronization between the vector/scalar engines.
specifying the target element id is a nice workaround, and avoids slides. i'm not sure there are sufficient bits to encode the scalar register ID (rs1) needed to determine the element id. btw, i would rewrite your suggestion to be something like:
vredsum vd[x[rs1]], vs1[x[rs1]], vs2[*]
addi rs1, rs1, 1
instead, i think it would be better to add a generic element-to-element move, something like:
vredsum vd, vs1, vs2[*]
vmv vd, vs2, rs1, rs2   // vd[x[rs1]] = vs2[x[rs2]], where in this case rs2 == x0
this would be more efficient than vslide1up, and more generally useful than the change to vredsum you are proposing.
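in scalar C terms, the semantics of that move would be something like this sketch (names illustrative only):

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of the generic element-to-element move:
     * vd[x[rs1]] = vs2[x[rs2]]. Both element indices come from scalar
     * registers, so no slide is needed to reposition a reduction result.
     */
    void vmv_elem(int32_t *vd, const int32_t *vs2,
                  size_t rs1_val, size_t rs2_val)
    {
        vd[rs1_val] = vs2[rs2_val];   /* with rs2 == x0, this reads vs2[0] */
    }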
Guy