my response below is now off-topic, and covers the more flexible
reductions wanted by Nagendra. i discourage any further followups here
(instead, please search for another recent series of posts by
Nagendra)
this thread should stay on-topic for 64b extension suggestions
-----
it is unlikely reductions will target scalar registers. as mentioned
before, this forces a hard synchronization between the vector/scalar
engines.
specifying the target element id is a nice workaround, and avoids
slides. i'm not sure there are sufficient bits to encode the scalar
register ID (rs1) needed to determine the element id. btw, i would
rewrite your suggestion to be something like:
vredsum vd[x[rs1]], vs1[x[rs1]], vs2[*]
addi rs1, rs1, 1
instead, i think it would be better to add a generic
element-to-element move, something like:
vredsum vd, vs1, vs2[*]
vmv vd, vs2, rs1, rs2 // vd[x[rs1]] = vs2[x[rs2]], where in this
case rs2==x0
this would be more efficient than vslide1up, and be more generally
useful than the change to vredsum you are proposing.
Guy
On Wed, Mar 11, 2020 at 10:59 AM Nagendra Gulur <nagendra.gd@...> wrote:
Open to feedback here..
But my thought was that I will not need vslide1up if I am able to control the reduction destination.
A loop around the instructions:
vredsum vd[rs1], vs1[rs1], vs2[*]
rs1 = rs1 + 1
can perform a vector-width's worth of reductions into adjacent elements without sliding data through the vector between reductions. Probably cheaper this way than shifting data through vd.
See any issues with the way I am thinking here?
Best Regards
Nagendra
On Wed, Mar 11, 2020 at 12:32 PM Claire Wolf <claire@...> wrote:
to me it seems like reading the dest element index from a scalar reg sounds like a significant microarchitectural overhead. can you describe why this is needed?
I would assume that the use cases for this replace sequences of reduce operations interleaved with vector slide operations. If the reduction is a multi-cycle op and the slide is a single-cycle op, then I would assume that getting rid of the single-cycle op won't change much in terms of performance. And if it's just to squeeze out the last bit of extra performance, then maybe an alternative would be to implement instruction fusion between reduce and vector slide (although that might cause issues with instruction throughput if some of the instructions are already 64 bits wide).
On Wed, 11 Mar 2020 at 16:15, Nagendra Gulur <nagendra.gd@...> wrote:
How about if the destination element number came from a scalar register? So we will need only 5 bits to specify the x register.
This may even work better than hard coding the destination inside the vector reduction instruction permitting software to dynamically control the destination.
Best regards
Nagendra
On Wed, Mar 11, 2020 at 9:11 AM Claire Wolf <claire@...> wrote:
regarding vector reduction destination: the V spec seems to allow for really large vector machines with thousands of vector elements. I'm not sure what the right bit width for the field with the reduction destination would be.
On Wed, 11 Mar 2020 at 14:57, Nagendra Gulur <nagendra.gd@...> wrote:
It appears I can not edit the wiki. But I can clarify one item.
Regarding "Indexed memory accesses that implicitly scale the index by SEW/8":
Explanation: In scientific sparse-matrix codes (and perhaps also DNN codes), sparse matrices are represented by the column indices of their non-zero values. In such cases, the loaded indices must be converted to element address offsets by scaling (left-shifting) the indices by 0 / 1 / 2 / 3 positions. For example, if SEW=32, then scale the indices by 4 (left shift by 2). The idea of this instruction capability is to specify the scaling behavior -- scaling is not always desirable, so there needs to be a way for the instruction to specify whether and by how much indices are to be scaled. Note that this scaling operation is tied to the vector loads that load the index data, not to the indexed vector loads that use these (now-scaled) indices.
I am not sure what Andrew had in mind regarding the other index width topic listed.
Since I can not edit the wiki, I have to raise another item for 64-bit encoding: vector reduction destination. Would it be possible to specify vector reduction destination explicitly in the instruction rather than always the implicit vd[0]?