Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)


krste@...
 

- [tech-cmo] so they don't get bothered with this off-topic discussion

On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@sifive.com> said:
| [DH]: I see this openness/lack of arbitrary constraint as precisely the strength of RISC-V.
| Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
| or hardware (reminiscent of delayed branch because prediction was too expensive) is short-sighted.
| A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under
| SOFTWARE control.

| Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero
| on the 11th.   That check could be in hardware, or it could be in
| the software that's calling the FF instruction.

I don't want us to rathole on how to guarantee forward progress for
the vl=0 case, but do want to note that this kind of forward progress
is nasty to guarantee, implying there's long-lasting microarch state
to keep around - what if you're context swapped out before you get to
the 11th? Do you have to force the first one after a context swap to
not trim? What if there's a sequence of ff's and the second one goes
back to vl=0?
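
For the software-side variant of that check, a minimal sketch, assuming a
hypothetical FF load that may return vl=0 (ff0_load) and an ordinary load
that completes or traps on element 0 (trapping_load). Both are stand-in
callables for illustration, not real instructions or APIs. The retry count
here is an ordinary software variable, so it is saved and restored with the
rest of the thread's state across a context switch:

```python
def load_with_bounded_retries(ff0_load, trapping_load, max_zero=10):
    """Bound the number of successive vl=0 results under software control.

    ff0_load:      hypothetical FF load that may return vl == 0.
    trapping_load: hypothetical load that completes element 0 or traps,
                   so it always returns vl >= 1 when it returns at all.
    Returns (vl, number_of_zero_results_seen).
    """
    zero_count = 0
    while zero_count < max_zero:
        vl = ff0_load()
        if vl > 0:
            return vl, zero_count
        zero_count += 1
    # After max_zero zero-length results, fall back to the trapping form.
    return trapping_load(), zero_count
```

With max_zero=10 the trapping form is used on the 11th attempt, matching
the example numbers in the quoted text.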

| But this does not need to be in the RISC-V architectural standard. Not yet.

Let's agree on this and move on.

|  As long as the VL=0  encoding is free,  not used for  some other purpose, you can do that in your implementation.

|  Your implementation might not be able to pass the RISC-V architectural tests for FF, which I assume will probably assert an error if they
| find an FF with VL=0. But if your hardware has a chicken bit to reduce the threshold of VL=0 FFs to zero, or if you have a binary translator
| from the compliance tests to your software-guaranteed forward progress, sure.

|  Build something like that, show that a lot of people want it, and a few years from now we can put it into a future version of the vector
| standard.

To be clear, if this is ever done, it will be with a separate
encoding, not by expanding the behavior of current instructions.
Returning vl=0 is not a "free" part of the encoding. Software might
rightly want to take advantage of knowing vl>0, so you cannot allow the
same instruction to return vl=0 after the fact; that needs a different
opcode/mode.
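
As a minimal sketch of why software cares, assume a simulated FF load (the
ff_load helper and its mapped_len and impl_limit parameters are
hypothetical, not part of any spec or library) that either completes
element 0 or traps, and may otherwise trim vl to any value >= 1. A
strip-mined strlen built on it advances by vl each iteration, so it would
spin forever if the same instruction were ever allowed to return vl=0:

```python
def ff_load(mem, base, avl, mapped_len, impl_limit=None):
    """Simulated fault-on-first byte load (hypothetical model, not a real API).

    Element 0 either completes or traps. Elements past the mapped region,
    or past an arbitrary implementation-chosen limit, only trim vl.
    Always returns 1 <= vl <= avl.
    """
    if avl < 1:
        raise ValueError("avl must be at least 1")
    if base >= mapped_len:
        raise MemoryError("trap: element 0 faulted")
    vl = min(avl, mapped_len - base)
    if impl_limit is not None:
        vl = max(1, min(vl, impl_limit))  # trim for any reason, never below 1
    return mem[base:base + vl], vl


def strlen_ff(mem, vlmax, mapped_len, impl_limit=None):
    """Strip-mined strlen that relies on every FF load returning vl >= 1."""
    pos = 0
    while True:
        data, vl = ff_load(mem, pos, vlmax, mapped_len, impl_limit)
        for i in range(vl):
            if data[i] == 0:
                return pos + i
        pos += vl  # forward progress: only guaranteed because vl >= 1
```

With impl_limit=1 the loop takes more iterations but returns the same
length, which is the sense in which trimming to any vl >= 1 leaves the
software model unchanged.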

Krste


| On 10/16/2020 4:48, David Horner wrote:

| First I am very happy that "arbitrary decisions by the micro-architecture" allow reduction of vl to any [non-zero] value.

| Even if such appear "random".

| On 2020-10-16 2:01 a.m., krste@berkeley.edu wrote:

| - I'm sure there's probably
| papers out there with this already).

| Exactly.
| I see this openness/lack of arbitrary constraint as precisely the strength of RISC-V.
| Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
| or hardware (reminiscent of delayed branch because prediction was too expensive) is short-sighted.

| A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under
| SOFTWARE control.

| I see it as no different [in fundamental principle] from other cases such as RVI integer divide-by-zero behaviour, which does not trap but
| can be readily checked for.
| The same goes for RVI integer overflow: checking for it, if you want to, takes at most a few instructions including the branch.

| (sending replies to vector list - as this is off topic for CMOs)

| My opinion is that baking SIMT execution model into ISA for purposes
| of exposing microarchitectural performance (i.e., cache misses)
| exposes too much of the machine, forcing application software to add
| extra retry loops (2nd nested loop inside of stripmining) and forcing
| system software to deal with complex traps.

|   [  Random historical connection - having a partial completion mask based
|      on cache misses is a vector version of the Stanford proposal for
|      "informing memory operations" where scalar core can branch on cache miss.
|                 https://dl.acm.org/doi/10.1145/232974.233000 ]
|   Most of the benefit for SIMT execution around microarchitectural
| hiccups can be obtained under the hood in the microarchitecture (and
| there are several hundred ISCA/MICRO/HPCA papers on doing that - I
| might be exaggerating, but only slightly - and I know Andy worked in
| this space at some point), and should outperform putting this handling
| into software.

| That said, I think it's OK to allow FF V loads to stop anywhere past
| element 0 including at a long-latency cache miss, mainly because it
| doesn't change anything in software model.

| I'm not sure it will really help perf that much in practice.  While
| it's easy to construct an example where it looks like it would help, I
| think in general most loops touch multiple vector operands, hardware
| prefetchers do well on vector streams, vector units are more efficient
| on larger chunks, scatter-gathers missing in cache limit perf anyway,
| etc., so it's probably a fairly brittle optimization (yes, you could
| add a predictor to figure out whether to wait for the other elements
| or go ahead with a partial vector result - I'm sure there's probably
| papers out there with this already).

| Krste

| On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@cadence.com> said:

| | My take is the same as Andrew has outlined below.
| |       Bill

| | On 10/15/20 4:30 PM, andrew@sifive.com wrote:

| |     Forwarding this to tech-vector-ext; couple comments below.
| |     On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@sifive.com> wrote:

| |         In vector meeting last Friday I listened to both Krste and David Horner's different opinions about fault-on-first and vector
| |         length trimming. I realized (and may have convinced other attendees) that the RISC-V "fault-on-first" vector length trimming
| |         need not be done just for things like page-faults.

| |         Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed,
| |         because vector element zero is the forward progress mechanism.

| |         Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the element zero of the operation,
| |         but that it can randomly stop with the appropriate indication for vector length trimming at any point in the middle of the
| |         instruction.

| |     Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values).
| |     Enumerating all cases in which the length might be trimmed seems like a fool's errand, so just saying it can be truncated
| |     to >= 1 for any reason is the way to go.

| |         This is part of what David Horner wants. However, it does not give him the fault-on-first with zero length complete
| |         mechanism. It could, if there were something else in the system that guaranteed forward progress.

| |     My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward
| |     progress, and cleanly matches the while-loop vectorization model.

| |         ---+ Expanded

| |         From vector meeting last Friday: trimming, fault-on-first. I realized that it is similar to the forms of SW visible
| |         non-faulting speculative loads some machines, especially VLIWs, have. However, instead of delivering a NaN or NaT, it is
| |         non-faulting except for vector element 0, where it faults. The NaT-ness is implied by trimmed vector length. It could be
| |         implied by a mask showing which vector operations had completed.

| |         All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison.
| |         Software control flow must fall through such a check operation, and through a redo of the faulting load if necessary. In
| |         scalar, non-faulting and faulting loads are different instructions, so there must be a branch.

| |         The RISC-V fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction, i.e. it is
| |         its own non-faulting check. It gets away with this because the trimmed vector length indicates which parts were valid and
| |         which were not. Forward progress is guaranteed by trapping on vector element zero, i.e. never allowing a trim to zero length.
| |         If a non-faulting vector approach were used instead of fault-on-first, it could return a vector complete mask, but to make
| |         forward progress it would have to guarantee that at least one vector element had completed.

| |         David Horner's desire for fault-on-first that may have performed no operations at all is (1) reasonable IMHO (I think I
| |         managed to explain that to Krste), but (2) would require some other mechanism for forward progress, e.g. instead of trapping
| |         on element zero, the bitmask that I described above. That is almost certainly a bigger architectural change than RISC-V
| |         should make at this time.

| |         Although more and more I am happy that I included such a completion bitmask in nearly every vector instruction set that I've
| |         ever done, particularly those vector instruction sets that were supposed to implement SIMT efficiently. (I think of SIMT as a
| |         programming model that is implemented on top of what amounts to a vector instruction set and microarchitecture.
| |         https://pharr.org/matt/papers/ispc_inpar_2012.pdf ) It would be unfortunate for such an SIMT program to lose work completed
| |         after the first fault.
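
The difference can be sketched as follows, assuming a hypothetical
per-element flag list ok (True where the element could complete without a
fault). The function names are illustrative only and do not correspond to
any real instruction or library:

```python
def fault_on_first_vl(ok):
    """Trimmed vector length under fault-on-first.

    Element 0 must complete or the load traps; the first later element
    that cannot complete trims vl, discarding everything after it.
    """
    if not ok[0]:
        raise MemoryError("trap: element 0 faulted")
    vl = 1
    while vl < len(ok) and ok[vl]:
        vl += 1
    return vl


def completion_mask(ok):
    """Completion mask: every element that could complete is recorded."""
    return list(ok)


ok = [True, True, False, True, True, True]
trimmed_vl = fault_on_first_vl(ok)         # elements 0..1 are kept
mask_completed = sum(completion_mask(ok))  # elements 0, 1, 3, 4, 5 are kept
```

In this example the trimmed vl keeps 2 elements and the work on elements 3
to 5 (0-based) has to be redone, while the mask retains all 5 completed
elements. As the text notes, the mask form would still need some guarantee
that at least one element completes.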

| |         MORAL: fault-on-first may be suitable for vector loads that might speculate past the end of the vector, where the length is
| |         not known or inconvenient when the vector load instruction is started. Fault-on-first is suboptimal for running SIMT on top
| |         of vectors, i.e. fault-on-first is the equivalent of precise exceptions for in-order execution, and for a single thread
| |         executing vector instructions, whereas a completion mask allows out-of-order within a vector and/or vector-length threading.

| |         IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is
| |         totally fine to have the fault-on-first stuff return up to the first really long latency cache miss, as long as it always
| |         guarantees that at least vector element zero was complete, because vector element zero completing is what guarantees forward
| |         progress.

| |         Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could choose to
| |         actually implement a page-fault that did copy-on-write or swapped in from disk. That would be visible to the operating
| |         system, not the user program. However, such an OS implementation would have to guarantee that it would not kill a process as
| |         a result of a true permissions error page-fault. Or, if the virtual memory architecture made the distinction between
| |         permissions faults and the sorts of page-fault that are for disk swapping or copy-on-write or copy-on-read, the OS does not
| |         need to be involved.

| |         EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole (unless
| |         you trim only on true faults and not on COW, COR or disk-swap page-faults). However, fault-on-first on any page-fault is a
| |         much lower bandwidth information leak channel than fault-on-first on long latency cache misses. So a general purpose system
| |         might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss.
| |         However, there are some systems for which that sort of security issue is not a concern, e.g. a data center or embedded system
| |         where all of the CPUs are dedicated to a single problem. In that case, if they can gain performance by doing fault-on-first
| |         on particular long latency cache misses, power to them!

| |         Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually much
| |         less of a virtualization hole than fault-on-first for page-faults. The operating system or hypervisor has very little control
| |         over cache misses. The OS and hypervisor have almost full control over page-faults. The usual rule in security and
| |         virtualization is that an application should not be able to detect that it has had an "innocent" page-fault, such as COW or
| |         COR or disk swapping.

| |         --
| |         --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis
|     |

| --
| --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis
