
Roger Ferrer Ibanez
Hi all,
I could not find any instruction that immediately computes this.
Apologies if I missed the obvious here.
Two options came to mind:
- vpopc.m and check whether the
result is 0 (all zeros) or VLMAX(SEW,
LMUL). I am under the impression that population count
is not a fast operation (though I guess it depends on the actual
VLEN)
- vfirst.m, returns -1 it the mask
is all zeros. For all ones we can do vmnot.m
first and then vfirst.m. Might
not be much faster than vpopc.m
but (at expense of vmnot.m) does
not need to compute VLMAX(SEW,LMUL).
Perhaps there are other alternatives?
Thoughts on whether it'd make sense to have a specific
instruction for these checks? As in one instruction that returns
one of three possible results (e.g. 1 for all ones, -1 for all
zeros, 0 otherwise) in a GPR.
Thank you very much,
--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación
WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.
http://www.bsc.es/disclaimer
|
|
Hi all,
I could not find any instruction that immediately computes this.
Apologies if I missed the obvious here.
Two options came to mind:
- vpopc.m and check whether the
result is 0 (all zeros) or VLMAX(SEW,
LMUL). I am under the impression that population count
is not a fast operation (though I guess it depends on the actual
VLEN)
I think this approach is sufficient, actually.
On the machines I've worked on so far, vpopc.m is no slower than vfirst.m.
For machines with very wide spatial vectors, you could imagine vpopc.m being slightly higher latency than vfirst.m (say, one extra clock cycle) because of the depth of the reduction tree. But this shouldn't be a dominant effect: in a machine like that, surely the data movement latency will be a more prominent factor than the reduction latency, since the latter scales logarithmically.
PS. You probably already have the current vector length in a GPR, and that quantity is probably the more appropriate thing to compare against than VLMAX. So you probably don't need to go to the trouble of materializing VLMAX. - vfirst.m, returns -1 it the mask
is all zeros. For all ones we can do vmnot.m
first and then vfirst.m. Might
not be much faster than vpopc.m
but (at expense of vmnot.m) does
not need to compute VLMAX(SEW,LMUL).
Perhaps there are other alternatives?
Thoughts on whether it'd make sense to have a specific
instruction for these checks? As in one instruction that returns
one of three possible results (e.g. 1 for all ones, -1 for all
zeros, 0 otherwise) in a GPR.
Thank you very much,
--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación
WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.
http://www.bsc.es/disclaimer
|
|

Roger Ferrer Ibanez
Hi Andrew,
thanks for the prompt and insightful answer. I'll use vpopc.m then.
On 20/5/21 8:25, Andrew Waterman wrote:
PS. You probably already have the
current vector length in a GPR, and that quantity is probably
the more appropriate thing to compare against than VLMAX. So
you probably don't need to go to the trouble of materializing
VLMAX.
Indeed, my question was motivated while looking at some code that
operates on whole registers but it can definitely be generalised
to any vector length.
Kind regards,
--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación
WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.
http://www.bsc.es/disclaimer
|
|

Krste Asanovic
Actually, vfirst,m can be implemented with an early out on long temporal vector machines, whereas vpopc.m has to process all bits.
If the common case for the input data is that all bits would be set/clear, then choice doesn’t really matter, but if common to be able to early out (i.e. test fails), I’d go with vfirst.m
toggle quoted message
Show quoted text
Hi Andrew, thanks for the prompt and insightful answer. I'll use vpopc.m then.
On 20/5/21 8:25, Andrew Waterman wrote:
PS. You probably already have the
current vector length in a GPR, and that quantity is probably
the more appropriate thing to compare against than VLMAX. So
you probably don't need to go to the trouble of materializing
VLMAX.
Indeed, my question was motivated while looking at some code that
operates on whole registers but it can definitely be generalised
to any vector length. Kind regards,
--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación
WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.
http://www.bsc.es/disclaimer
|
|
On Thu, May 20, 2021 at 12:16 AM Krste Asanovic < krste@...> wrote: Actually, vfirst,m can be implemented with an early out on long temporal vector machines, whereas vpopc.m has to process all bits.
If the common case for the input data is that all bits would be set/clear, then choice doesn’t really matter, but if common to be able to early out (i.e. test fails), I’d go with vfirst.m
Yeah, it would've been more precise of me to have compared vpopc.m against Roger's hypothetical new instruction, which also must process all bits.
Krste
Hi Andrew, thanks for the prompt and insightful answer. I'll use vpopc.m then.
On 20/5/21 8:25, Andrew Waterman wrote:
PS. You probably already have the
current vector length in a GPR, and that quantity is probably
the more appropriate thing to compare against than VLMAX. So
you probably don't need to go to the trouble of materializing
VLMAX.
Indeed, my question was motivated while looking at some code that
operates on whole registers but it can definitely be generalised
to any vector length. Kind regards,
--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación
WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.
http://www.bsc.es/disclaimer
|
|
On Thu, May 20, 2021 at 12:27 AM Andrew Waterman < andrew@...> wrote:
On Thu, May 20, 2021 at 12:16 AM Krste Asanovic < krste@...> wrote: Actually, vfirst,m can be implemented with an early out on long temporal vector machines, whereas vpopc.m has to process all bits.
If the common case for the input data is that all bits would be set/clear, then choice doesn’t really matter, but if common to be able to early out (i.e. test fails), I’d go with vfirst.m
Yeah, it would've been more precise of me to have compared vpopc.m against Roger's hypothetical new instruction, which also must process all bits.
Er, nevermind, I got that wrong again. Roger's instruction can also early-out with slightly more complexity (if at least one 1 and at least one 0 is detected).
Krste
Hi Andrew, thanks for the prompt and insightful answer. I'll use vpopc.m then.
On 20/5/21 8:25, Andrew Waterman wrote:
PS. You probably already have the
current vector length in a GPR, and that quantity is probably
the more appropriate thing to compare against than VLMAX. So
you probably don't need to go to the trouble of materializing
VLMAX.
Indeed, my question was motivated while looking at some code that
operates on whole registers but it can definitely be generalised
to any vector length. Kind regards,
--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación
WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.
http://www.bsc.es/disclaimer
|
|
It depends -- exactly what do you plan to do after determining if a mask is all-0 or all-1 or other? vpopc and vfirst can both special-case these common results via precomputation, so they both take minimal cycles. in that regard, they are equivalent and there is no need to add your special instruction. the problem is that both vpopc.m and vfirst.m write to the X register file, which forces synchronization between scalar and vector units. this may cost extra cycles of stalling ... which may negatively affect performance. you could introduce a new instruction or a CSR read which checks the mask result in an asynchronous fashion (or not). so, what exactly do you plan to do after knowing the result is all-0 or all-1 ? do you want to initiate a branch or something else? does a precise (synchronized) result matter, or can you tolerate decoupling delays? for example, it could be possible to specify that a CSR contains the result of a mask being all-0, all-1, or otherwise, and that this CSR is asynchronously updated. hence, a scalar control loop may operate until the all-0 result is finally true without causing any hard synchronization with the vector unit. this sort of approach would work for some computaitons, eg mandelbrot, which require a change in the control flow after all units have achieved a certain status, and where there is no harm to continuing an extra iteration or two due to latency between vector instructions and the CSR. g On Wed, May 19, 2021 at 10:49 PM Roger Ferrer Ibanez <roger.ferrer@...> wrote: Hi all,
I could not find any instruction that immediately computes this. Apologies if I missed the obvious here.
Two options came to mind:
vpopc.m and check whether the result is 0 (all zeros) or VLMAX(SEW, LMUL). I am under the impression that population count is not a fast operation (though I guess it depends on the actual VLEN) vfirst.m, returns -1 it the mask is all zeros. For all ones we can do vmnot.m first and then vfirst.m. Might not be much faster than vpopc.m but (at expense of vmnot.m) does not need to compute VLMAX(SEW,LMUL).
Perhaps there are other alternatives?
Thoughts on whether it'd make sense to have a specific instruction for these checks? As in one instruction that returns one of three possible results (e.g. 1 for all ones, -1 for all zeros, 0 otherwise) in a GPR.
Thank you very much,
-- Roger Ferrer Ibáñez - roger.ferrer@... Barcelona Supercomputing Center - Centro Nacional de Supercomputación
WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received.
http://www.bsc.es/disclaimer
|
|

Roger Ferrer Ibanez
Hi Guy, On 20/5/21 12:09, Guy Lemieux wrote: so, what exactly do you plan to do after knowing the result is all-0 or all-1 ? do you want to initiate a branch or something else? does a precise (synchronized) result matter, or can you tolerate decoupling delays? The code I've been looking at, uses this for a branch. FWIW: this is the SLEEF library (vector math library). An example of how it uses the check can be found at https://github.com/shibatch/sleef/blob/master/src/libm/sleefsimddp.c#L340(Not claiming that this specific library as written is a good or bad fit for RVV, just looking at the code to get an idea of what are its expectations) Kind regards, -- Roger Ferrer Ibáñez - roger.ferrer@... Barcelona Supercomputing Center - Centro Nacional de Supercomputación http://bsc.es/disclaimer
|
|
yeeesh glad i don’t have to stare at that code too long.
i know it’s not your code ...
i think it could use a abs followed by a max reduction, then do the rest as scalar ops?
these macros appear to be targeted towards fixed-width simd. in particular i think they are making an assumption of very short vectors. in this snippet, it appears to want to compute all elements of the vector the same way ... with longer vectors, i would expect to use masks to separate the different computation types so it can be individualized for each element.
i haven’t studied the code in depth, but on the surface the all-mask-ones case seems to be not very useful here, nor does it really help with performance.
guy
toggle quoted message
Show quoted text
Hi Guy,
On 20/5/21 12:09, Guy Lemieux wrote:
> so, what exactly do you plan to do after knowing the result is all-0
> or all-1 ? do you want to initiate a branch or something else? does a
> precise (synchronized) result matter, or can you tolerate decoupling
> delays?
The code I've been looking at, uses this for a branch.
FWIW: this is the SLEEF library (vector math library). An example of how
it uses the check can be found at
https://github.com/shibatch/sleef/blob/master/src/libm/sleefsimddp.c#L340
(Not claiming that this specific library as written is a good or bad fit
for RVV, just looking at the code to get an idea of what are its
expectations)
Kind regards,
--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación
http://bsc.es/disclaimer
|
|