Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)


Forwarding this to tech-vector-ext; couple comments below.

On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@...> wrote:

In vector meeting last Friday  I listened to both Krste and David Horner's  different opinions about fault-on-first and vector length trimming. I realized (and may have convinced other attendees) that the  RISC-V "fault-on-first"  vector length trimming need not be done just for things like page-faults.

Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed,  because vector element zero is the forward progress mechanism.

Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the  element zero of the operation,  but that it can randomly stop with the appropriate indication for vector length  trimming at any point in the middle of the instruction.

Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values).  Enumerating all cases in which the length might be trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.

This is part of what David Horner wants.   However, it does not give him the  fault-on-first with zero length complete mechanism.   It could, if there were something else in the system that guaranteed forward progress

My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization model.

---+ Expanded


From vector meeting last Friday: trimming, fault-on-first.  I realized that it is similar to the forms of SW visible non-faulting speculative loads some machines, especially VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The NaT-ness is implied by trimmed vector length.  It could be implied by a mask showing which vector operations had completed.


All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison.  Software control flow must fall through such a check operation,  and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different instructions, so there must be a branch.


The RISC-V Fault-on-first approach  has the correctness check for non-faulting implied by redoing the instruction.  i.e. it is its own non-faulting check.  it gets away with this because the trend vector length indicates which parts were valid and not. forward progress is guaranteed by trapping on vector element zero, i.e. never allowing a trim to zero length.   if a non-faulting vector approach was used instead of fault-on-first, it could return a vector complete mask, but to make forward progress it would have to guarantee that at least one vector element had completed.


David Horner's desire for fault-on-first that may have performed no operations at all is (1)  reasonable IMHO (I think I managed to explain that the Krste), but (2) Would require some other mechanism for forward progress. E.g. instead of trapping on element zero, the bitmask that I described above. Which is almost certainly a bigger architectural change than RISC-V should make it this time.


Although more and more I am happier that I included such a completion bitmask in newly every vector instruction set that I've ever done. Particularly those vector instruction sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what amounts to a vector instruction set and microarchitecture. ).  It would be unfortunate for such an SIMT program to lose  work completed after the first fault.


MORAL:  fault-on-first may be suitable for vector load that might speculate past the end of the vector -  where the length is  not known or inconvenient when the vector load instruction is started. Fault-on-first is  suboptimal for running SIMT on top of vectors.   i.e. fault-on-first  is the equivalent of precise exceptions for in order execution,  and for a single thread executing vector instructions, whereas  completion mask  allows out of order within a vector and/or vector length  threading.


IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is totally fine to have the fault-on-first stuff return up to the  first really long latency cost miss, as long as it always  guarantees that at least vector element zero was complete. Because vector element zero complete is what guarantees forward progress.

Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could actually choose to actually implement a page-fault that did copy-on-write or  swapped in from disk.   but that would be visible to the operating system, not the user program.  However, such an OS implementation  would have to guarantee that it would not kill a process as a result  of a true permissions error page-fault. Or, if the virtual memory architecture made the distinction between permissions faults and the sorts of page-fault that is for disk swapping or copy-on-write or copy  on read,  the OS does not need to be involved.

 EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole. (Unless you only trim only on true faults and not  COW or COR or disk swappage-faults).   However,  fault-on-first on any page-fault is a much  lower bandwidth  information leak  channel  than is fault-on-first on long latency cache misses.  so a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss.  However, there are some systems for which that sort of security issue is not a concern. E.g. a data center or embedded system where all of the CPUs are dedicated to a single problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to them!

Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually  much less of a virtualization hole than fault-on-first for page-faults.   The operating system or hypervisor has very little control over cache misses.  the OS and hypervisor have almost full control over page-faults.  The usual rule in security and virtualization is that an application should not be able to detect that it has had an "innocent"  page-fault, such as COW or COR or disk swapping.




--- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis

Join to automatically receive all group messages.