On Thu, May 20, 2021 at 12:27 AM Andrew Waterman <andrew@...
On Thu, May 20, 2021 at 12:16 AM Krste Asanovic <krste@...
Actually, vfirst.m can be implemented with an early out on long temporal vector machines, whereas vpopc.m has to process all bits.
If the common case for the input data is that all bits are set or all are clear, then the choice doesn't really matter, but if it's common to be able to early out (i.e. the test fails), I'd go with vfirst.m.
Yeah, it would've been more precise of me to have compared vpopc.m against Roger's hypothetical new instruction, which also must process all bits.
Er, never mind, I got that wrong again. Roger's instruction can also early-out with slightly more complexity (once at least one 1 and at least one 0 have been detected).
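To make the early-out argument concrete, here is a minimal scalar sketch in C of the three behaviours being compared. It only models which elements each operation has to look at, not real hardware or intrinsics. The function names are made up for illustration, and since the hypothetical instruction is not defined in this part of the thread, the sketch assumes it classifies a mask as all clear, all set, or mixed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Model of vpopc.m: counts the set mask bits.
 * The count depends on every element, so all vl elements are visited. */
size_t model_vpopc(const bool *mask, size_t vl) {
    size_t count = 0;
    for (size_t i = 0; i < vl; i++)
        count += mask[i] ? 1 : 0;
    return count;
}

/* Model of vfirst.m: index of the first set mask bit, or -1 if none.
 * The scan can stop at the first set bit (the early out). */
long model_vfirst(const bool *mask, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            return (long)i;
    return -1;
}

/* ASSUMPTION: stand-in for the hypothetical instruction, taken here to
 * classify the mask. It can stop as soon as both a 1 and a 0 were seen. */
enum mask_kind { MASK_ALL_CLEAR, MASK_ALL_SET, MASK_MIXED };

enum mask_kind model_classify(const bool *mask, size_t vl) {
    bool seen_one = false, seen_zero = false;
    for (size_t i = 0; i < vl; i++) {
        if (mask[i])
            seen_one = true;
        else
            seen_zero = true;
        if (seen_one && seen_zero)
            return MASK_MIXED; /* early out */
    }
    /* vl == 0 falls through to MASK_ALL_CLEAR in this sketch. */
    return seen_one ? MASK_ALL_SET : MASK_ALL_CLEAR;
}
```

In this model, vfirst.m and the classifier both finish after a few elements when the mask is mixed near the start, while the popcount always walks the whole vector.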
Thanks for the prompt and insightful answer. I'll use vpopc.m then.
On 20/5/21 8:25, Andrew Waterman wrote:
PS. You probably already have the
current vector length in a GPR, and that quantity is probably
the more appropriate thing to compare against than VLMAX. So
you probably don't need to go to the trouble of materializing
Indeed, my question was motivated by some code that
operates on whole registers, but it can definitely be generalised
to any vector length.
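As a rough illustration of comparing against the current vector length rather than VLMAX, here is a small scalar sketch in C. It assumes the goal is to test whether every active mask bit is set; the helper names are hypothetical, and the loop only stands in for the count that vpopc.m would return.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the vpopc.m result: set bits among the first vl elements. */
size_t count_active_set(const bool *mask, size_t vl) {
    size_t count = 0;
    for (size_t i = 0; i < vl; i++)
        count += mask[i] ? 1 : 0;
    return count;
}

/* All active bits are set exactly when the count equals vl.
 * Elements past vl are never inspected, so VLMAX is not needed. */
bool all_active_set(const bool *mask, size_t vl) {
    return count_active_set(mask, vl) == vl;
}
```

With a buffer of eight elements of which only the first three are active, the answer depends on those three alone, which is why vl is the natural quantity to compare against.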
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación