Here's the strlen example from the spec:
# size_t strlen(const char *str)
# a0 holds *str
strlen:
    mv a3, a0 # Save start
loop:
    vsetvli a1, x0, e8, m8, ta, ma # Vector of bytes of maximum length
    vle8ff.v v8, (a3) # Load bytes
    csrr a1, vl # Get bytes read
    vmseq.vi v0, v8, 0 # Set v0[i] where v8[i] = 0
    vfirst.m a2, v0 # Find first set bit
    add a3, a3, a1 # Bump pointer
    bltz a2, loop # Not found?
    add a0, a0, a1 # Sum start + bump
    add a3, a3, a2 # Add index
    sub a0, a3, a0 # Subtract start address+bump
    ret
This exits when vfirst.m returns a non-negative value (i.e., something
triggered the exit condition). The vfirst.m instruction can early-out once
the exit is found (though vle8ff/vmseq will still have to run to completion).
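To make the point concrete, here is a minimal scalar sketch of the same loop in Python. It is a model, not an implementation: `vle8ff`, `vfirst`, and the `trim` parameter are hypothetical names invented for illustration. The sketch assumes vl may be trimmed to any nonzero value, either because the next element would fault (modelled as running off the end of `mem`) or for no architectural reason at all (modelled by `trim`).

```python
def vle8ff(mem, addr, avl, trim=None):
    """Hypothetical model of vle8ff.v over a bytes object `mem`.

    Loads up to `avl` bytes starting at `addr`. Bytes past the end of `mem`
    stand in for faulting addresses. `trim` stands in for an arbitrary
    implementation choice to stop early. Returns (data, vl) with vl >= 1.
    """
    if addr >= len(mem):
        raise MemoryError("element 0 faults: this is a real trap")
    vl = min(avl, len(mem) - addr)      # trim at the first faulting element
    if trim is not None:
        vl = max(1, min(vl, trim))      # arbitrary trim, but never to zero
    return mem[addr:addr + vl], vl


def vfirst(mask):
    """Model of vfirst.m: index of the first set bit, or -1 if none."""
    for i, bit in enumerate(mask):
        if bit:
            return i
    return -1


def strlen(mem, start, vlmax=16, trim=None):
    """Scalar model of the strlen loop, mirroring its register use."""
    a3 = start
    while True:
        v8, a1 = vle8ff(mem, a3, vlmax, trim)   # vle8ff.v + csrr a1, vl
        v0 = [b == 0 for b in v8]               # vmseq.vi v0, v8, 0
        a2 = vfirst(v0)                         # vfirst.m a2, v0
        a3 += a1                                # add a3, a3, a1
        if a2 >= 0:                             # bltz a2, loop (not taken)
            break
    # a3 - a1 is the start of the last chunk; a2 indexes the NUL within it
    return (a3 - a1 + a2) - start


mem = b"hello, vector world\x00"
for trim in (None, 1, 3, 7):
    print(trim, strlen(mem, 0, trim=trim))
```

Every trim policy yields the same length, because the loop only ever exits on the data it sees; the trimmed vl just sets how far the pointer is bumped on each pass.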
On Fri, 16 Oct 2020 20:04:15 +0200, Roger Espasa <roger.espasa@...> said:
| So all the vleff use cases end up then using a vmpopc of some sort to determine the exit condition and never use the trimmed VL? (other than, of course, to
| control within the while how many elements should be operated upon). Do the compiler folks on the list agree that's the only use of vleff?
| On Fri, Oct 16, 2020 at 7:53 PM Bill Huffman <huffman@...> wrote:
| I don't think the cases where there was no fault look any different to software than the fault cases. Either can happen anywhere and the while loop may
| continue. The while loop isn't ended by a trimmed vl, it's ended by data it sees.
| On 10/16/20 10:47 AM, Roger Espasa wrote:
| We're all in agreement that if the spec says "pick where you stop" we'd all pick to trim to VL=3. I was under the impression this was not yet closed
| (in light of the "stop at cache misses" discussion), but I sense everyone else is already in the "pick where you stop" camp.
| Speaking of which, did we ever close on whether vleff could trim even when there was no fault (i.e., just because there's a cache miss for example)?
| If the answer is "yes, you can arbitrarily stop on any element other than element 0", can someone show a while loop and how the compiler would then
| use vleff? I'm not seeing how they would use it, other than enclose the vleff loop in a second loop to make sure that "the index variable has reached
| the limit" (i.e., i<n, make sure that vleff has run enough times so that i has reached n).
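As a hedged illustration of the counted case raised here, the following Python sketch models a search loop of the form `while (i < n && a[i] != key) i++`. It is a scalar model for discussion, not compiler output; `vleff`, `find_first`, and `trim` are hypothetical names, and the model assumes the implementation may stop at any element other than element 0.

```python
def vleff(mem, index, avl, trim=None):
    """Hypothetical model of a fault-only-first load over a list `mem`.

    Elements past the end of `mem` stand in for faulting addresses, and
    `trim` stands in for an arbitrary early stop. Returns (data, vl), vl >= 1.
    """
    if index >= len(mem):
        raise MemoryError("element 0 faults: this is a real trap")
    vl = min(avl, len(mem) - index)     # trim at the first faulting element
    if trim is not None:
        vl = max(1, min(vl, trim))      # arbitrary trim, but never to zero
    return mem[index:index + vl], vl


def find_first(mem, n, key, vlmax=8, trim=None):
    """Return the first i < n with mem[i] == key, or n if there is none."""
    i = 0
    while i < n:                              # the index has not reached n
        avl = min(vlmax, n - i)               # vsetvli with AVL = n - i
        data, vl = vleff(mem, i, avl, trim)   # vl may come back smaller
        for j, x in enumerate(data):          # compare + find-first
            if x == key:
                return i + j
        i += vl                               # advance by what was loaded
    return n


# Only four elements are "mapped", but the loop bound is much larger.
print(find_first([5, 1, 4, 9], 100, 4, trim=3))
print(find_first([1, 2, 3, 4, 5], 5, 9, trim=2))
```

In this model the ordinary strip-mining loop already plays the role of the enclosing loop: it keeps issuing loads until either the data ends the search or `i` reaches `n`, whatever vl each load returns.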
| On Fri, Oct 16, 2020 at 7:33 PM <krste@...> wrote:
| As you get to pick where vl is trimmed, you would probably choose the
| vl=3 case here to simplify implementation.
|||||| On Fri, 16 Oct 2020 18:59:55 +0200, Roger Espasa <roger.espasa@...> said:
| | Bill you said element 9, but did you mean element labeled "a" which is the 11th element in the vector? (I agree with that).
| | However, I would NOT agree that a masked out element has been written, even if past the failing point.
| | roger.
| | On Fri, Oct 16, 2020 at 6:57 PM Roger Espasa <roger.espasa@...> wrote:
| | Here's where the "implementation" cost comes in (at least in our implementation; others, of course, may have more clever ways of doing this)
| | - If you pick "vl=3", then the vstart and vltrim calculations can be made one and the same
| | - If you pick "vl=6" then the vstart and vltrim calculations are not exactly equal and vltrim needs a LZC on the mask for the elements within the
| | followed by an adder. At SEW=8b, there can be lots of elements within a line...
| | roger.
| | On Fri, Oct 16, 2020 at 6:31 PM Bill Huffman <huffman@...> wrote:
| | The way the discussion has been going, I think either would be permissible. Not only that, but it would have been permissible for element 9 already
| | to have been overwritten with 1's (if vma allows it).
| | I think bringing this up is good as we need to be sure what precisely we mean by the v*ff instructions.
| | Bill
| | On 10/16/20 8:57 AM, Roger Espasa wrote:
| | Here's a question for the group: I did it as a picture... hopefully it will go through the mailing list:
| | image.png
| | On Fri, Oct 16, 2020 at 4:56 PM David Horner <ds2horner@...> wrote:
| | On 2020-10-16 10:30 a.m., krste@... wrote:
| ||||||| On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
| || | First I am very happy that "arbitrary decisions by the
| || | micro-architecture" allow reduction of vl to any [non-zero] value.
| || | Even if such appear "random".
| || [...]
| || | A check for vl=0 on platforms that allow it is eminently doable, low
| || | overhead for many use cases AND guarantees forward progress under
| || | SOFTWARE control.
| || If we allowed implementation to return vl=0, how does software
| || guarantee forward progress?
| | The forward progress is to advance to another task.
| | In the case of machine mode it can potentially "resolve" the cause of
| | the vl=0 return and re-execute the loop (without the overhead of the trap).
| || | I see it as no different [in fundamental principle] than other cases
| || | such as RVI integer divide by zero behaviour that does not trap but can
| || | be readily checked for.
| || | Also RVI integer overflow that if you want to check for it is at most a
| || | few instructions including the branch.
| || I don't see how these examples relate to returning vl=0 on some
| || microarchitectural event. The examples here have results that depend
| || only on architectural values, so can be deterministically handled.
| | The similarity is the avoidance of trap handling, when it is sufficient
| | to check instead register state.
| || vl=0 is more related to load-reserved/store-conditional failure, where
| || we need to add implementation constraints to guarantee forward
| || progress.
| | Ok. I can see providing guidance as to when vl=0 is allowed, but not to
| | exclude it outright.
| || Krste
| [DELETED ATTACHMENT image.png, PNG image]