Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)
krste@...
- [tech-cmo] so they don't get bothered with this off-topic discussion
| [DH]: I see this openness/lack of arbitrary constraint as precisely the strength of RISC-V.

On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@...> said:

| Limiting vector operations due to current constraints in software (Linux does it this
| way; compilers cannot optimize that formulation [yet]) or hardware (reminiscent of the
| delayed branch, adopted because prediction was too expensive) is short-sighted.
| A check for vl=0 on platforms that allow it is eminently doable, low overhead for many
| use cases, AND guarantees forward progress under SOFTWARE control.

| Sure. You could guarantee forward progress, e.g., by allowing no more than 10 successive
| "fault-on-first" executions with VL=0, and requiring a trap on element zero on the 11th.
| That check could be in hardware, or it could be in the software that's calling the FF
| instruction.

I don't want us to rathole on how to guarantee forward progress for the vl=0 case, but I
do want to note that this kind of forward progress is nasty to guarantee, implying there
is long-lasting microarchitectural state to keep around - what if you're context-swapped
out before you get to the 11th? Do you have to force the first one after a context swap
to not trim? What if there's a sequence of FFs and the second one goes back to vl=0?

| But this does not need to be in the RISC-V architectural standard. Not yet. Let's agree
| on this and move on.
| As long as the VL=0 encoding is free, not used for some other purpose, you can do that
| in your implementation.
| Your implementation might not be able to pass the RISC-V architectural compliance tests
| for FF, which I assume will probably assert an error if they find an FF instruction
| ending with VL=0. But if your hardware has a chicken bit to reduce the threshold of
| successive VL=0 FFs to zero, or if you have a binary translator from the compliance
| tests to your software-guaranteed forward progress, sure.
| Build something like that, show that there are a lot of people who want it, and a few
| years from now we can put it into a future version of the vector standard.

To be clear, if this is ever done, it will be with a separate encoding, not by expanding
the behavior of current instructions. Returning vl=0 is not a "free" part of the
encoding. Software might rightly want to take advantage of knowing vl>0, so you cannot
allow the same instruction to return vl=0 after the fact; that needs a different
opcode/mode.

Krste

| On 10/16/2020 4:48, David Horner wrote:

| First, I am very happy that "arbitrary decisions by the micro-architecture" allow
| reduction of vl to any [non-zero] value. Even if such reductions appear "random".

| On 2020-10-16 2:01 a.m., krste@... wrote:
| - I'm sure there's probably
| papers out there with this already).

| Exactly.

| I see this openness/lack of arbitrary constraint as precisely the strength of RISC-V.
| Limiting vector operations due to current constraints in software (Linux does it this
| way; compilers cannot optimize that formulation [yet]) or hardware (reminiscent of the
| delayed branch, adopted because prediction was too expensive) is short-sighted.
| A check for vl=0 on platforms that allow it is eminently doable, low overhead for many
| use cases, AND guarantees forward progress under SOFTWARE control.
| I see it as no different [in fundamental principle] from other cases such as RVI
| integer divide-by-zero behaviour, which does not trap but can be readily checked for.
| Also RVI integer overflow, which, if you want to check for it, costs at most a few
| instructions including the branch.
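As a concrete illustration of the software-controlled check being argued about above,
here is a minimal sketch in RVV-style assembly. It assumes a hypothetical FF load
variant that is permitted to return vl=0 (the current vle*ff.v instructions are not);
the retry budget of 10 and the scalar-probe fallback are arbitrary illustrative choices,
not anything specified.

    # Hypothetical FF variant that MAY trim vl to 0 (not current vle32ff.v behavior).
    # a0 = base address of this stripmine chunk, a2 = elements remaining.
        li        a4, 10                    # software-chosen retry budget (illustrative)
    retry:
        vsetvli   a3, a2, e32, m1, ta, ma   # request up to a2 elements
        vle32ff.v v8, (a0)                  # FF load; hypothetically may return vl=0
        csrr      a3, vl                    # elements actually loaded
        bnez      a3, got_data              # vl > 0: normal stripmine path
        addi      a4, a4, -1                # vl == 0: spend one retry
        bnez      a4, retry
        lw        t0, 0(a0)                 # budget exhausted: a scalar access either
                                            # completes element 0 or takes the real fault
        j         retry                     # then resume the vector loop
    got_data:
        # ... process v8[0 .. a3-1], bump a0, subtract a3 from a2, continue stripmining ...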
| (sending replies to vector list - as this is off-topic for CMOs)

| My opinion is that baking the SIMT execution model into the ISA for purposes of exposing
| microarchitectural performance (i.e., cache misses) exposes too much of the machine,
| forcing application software to add extra retry loops (a 2nd nested loop inside of
| stripmining) and forcing system software to deal with complex traps.

| [ Random historical connection - having a partial completion mask based on cache misses
| is a vector version of the Stanford proposal for "informing memory operations", where
| the scalar core can branch on a cache miss.
| https://dl.acm.org/doi/10.1145/232974.233000 ]

| Most of the benefit for SIMT execution around microarchitectural hiccups can be obtained
| under the hood in the microarchitecture (and there are several hundred ISCA/MICRO/HPCA
| papers on doing that - I might be exaggerating, but only slightly - and I know Andy
| worked in this space at some point), and that should outperform putting this handling
| into software.

| That said, I think it's OK to allow FF vector loads to stop anywhere past element 0,
| including at a long-latency cache miss, mainly because it doesn't change anything in
| the software model.

| I'm not sure it will really help perf that much in practice. While it's easy to
| construct an example where it looks like it would help, I think in general most loops
| touch multiple vector operands, hardware prefetchers do well on vector streams, vector
| units are more efficient on larger chunks, scatter-gathers missing in cache limit perf
| anyway, etc., so it's probably a fairly brittle optimization (yes, you could add a
| predictor to figure out whether to wait for the other elements or go ahead with a
| partial vector result - I'm sure there's probably papers out there with this already).

| Krste

| On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@...> said:

| | My take is the same as Andrew has outlined below.
| | Bill

| | On 10/15/20 4:30 PM, andrew@... wrote:
| | EXTERNAL MAIL

| | Forwarding this to tech-vector-ext; a couple of comments below.

| | On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@...> wrote:

| | In the vector meeting last Friday I listened to both Krste's and David Horner's
| | different opinions about fault-on-first and vector length trimming. I realized (and
| | may have convinced other attendees) that RISC-V "fault-on-first" vector length trimming
| | need not be done just for things like page-faults.

| | Fault-on-first could be done for the first long-latency cache miss, as long as vector
| | element zero has been completed, because vector element zero is the forward-progress
| | mechanism.

| | Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes
| | element zero of the operation, but that it can randomly stop, with the appropriate
| | indication for vector length trimming, at any point in the middle of the instruction.

| | Indeed, I've found other microarchitectural reasons to favor this approach (e.g.,
| | speculating through mask-register values). Enumerating all the cases in which the
| | length might be trimmed seems like a fool's errand, so just saying it can be truncated
| | to >= 1 for any reason is the way to go.

| | This is part of what David Horner wants. However, it does not give him the
| | fault-on-first with zero-length-complete mechanism.
| | It could, if there were something else in the system that guaranteed forward progress.

| | My take is that requiring that element 0 either complete or trap is already a solid
| | mechanism for guaranteeing forward progress, and it cleanly matches the while-loop
| | vectorization model.

| | ---+ Expanded

| | From the vector meeting last Friday: trimming, fault-on-first. I realized that it is
| | similar to the forms of SW-visible non-faulting speculative loads that some machines,
| | especially VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting
| | except for vector element 0, where it faults. The NaT-ness is implied by the trimmed
| | vector length. It could also be implied by a mask showing which vector operations had
| | completed.

| | All such SW non-faulting loads need a "was this correct" operation, which might just
| | be a faulting load and a comparison. Software control flow must fall through such a
| | check operation, and through a redo of the faulting load if necessary. In scalar code,
| | non-faulting and faulting loads are different instructions, so there must be a branch.

| | The RISC-V fault-on-first approach has the correctness check for non-faulting implied
| | by redoing the instruction, i.e., it is its own non-faulting check. It gets away with
| | this because the trimmed vector length indicates which parts were valid and which were
| | not. Forward progress is guaranteed by trapping on vector element zero, i.e., by never
| | allowing a trim to zero length. If a non-faulting vector approach were used instead of
| | fault-on-first, it could return a vector completion mask, but to make forward progress
| | it would have to guarantee that at least one vector element had completed.

| | David Horner's desire for a fault-on-first that may have performed no operations at
| | all is (1) reasonable IMHO (I think I managed to explain that to Krste), but (2) would
| | require some other mechanism for forward progress, e.g., instead of trapping on element
| | zero, the bitmask that I described above. Which is almost certainly a bigger
| | architectural change than RISC-V should make at this time.

| | Although, more and more, I am happy that I included such a completion bitmask in
| | nearly every vector instruction set that I've ever done, particularly those vector
| | instruction sets that were supposed to implement SIMT efficiently. (I think of SIMT as
| | a programming model that is implemented on top of what amounts to a vector instruction
| | set and microarchitecture: https://pharr.org/matt/papers/ispc_inpar_2012.pdf ). It
| | would be unfortunate for such a SIMT program to lose work completed after the first
| | fault.

| | MORAL: fault-on-first may be suitable for a vector load that might speculate past the
| | end of the vector - where the length is not known, or is inconvenient to compute, when
| | the vector load instruction is started. Fault-on-first is suboptimal for running SIMT
| | on top of vectors; i.e., fault-on-first is the equivalent of precise exceptions for
| | in-order execution and for a single thread executing vector instructions, whereas a
| | completion mask allows out-of-order execution within a vector and/or vector-length
| | threading.

| | IMHO an important realization I made in that meeting is that fault-on-first does not
| | need to be just about faulting. It is totally fine to have the fault-on-first mechanism
| | return up to the first really long-latency cache miss, as long as it always guarantees
| | that at least vector element zero was completed, because vector element zero completing
| | is what guarantees forward progress.
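To make the "while-loop vectorization model" mentioned above concrete: a minimal sketch
of the usual fault-only-first strlen, assuming RVV 1.0-style mnemonics (label names are
illustrative). The element-zero complete-or-trap rule is what guarantees the loop
advances by at least one byte per iteration, however vl gets trimmed.

    # a0 = pointer to NUL-terminated string; returns the length in a0.
    strlen_ff:
        mv        a3, a0                    # a3 = running pointer
    loop:
        vsetvli   a1, x0, e8, m8, ta, ma    # take as many bytes as the hardware offers
        vle8ff.v  v8, (a3)                  # fault-only-first: element 0 completes or
                                            # traps; vl may be trimmed past that point
        csrr      a1, vl                    # bytes actually loaded this pass
        vmseq.vi  v0, v8, 0                 # mask of positions holding the NUL byte
        vfirst.m  a2, v0                    # index of first NUL, or -1 if none loaded
        bgez      a2, done                  # terminator found in this chunk?
        add       a3, a3, a1                # no: advance by vl >= 1 elements
        j         loop
    done:
        add       a3, a3, a2                # address of the NUL byte
        sub       a0, a3, a0                # length = NUL address - start address
        ret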
| | Furthermore, it is not even required that fault-on-first stop at the first page-fault.
| | An implementation could choose to actually service a page-fault that did copy-on-write
| | or swapped in from disk, but that would be visible to the operating system, not the
| | user program. However, such an OS implementation would have to guarantee that it would
| | not kill a process as a result of a true permissions-error page-fault. Or, if the
| | virtual-memory architecture made the distinction between permissions faults and the
| | sorts of page-faults that are for disk swapping or copy-on-write or copy-on-read, the
| | OS does not need to be involved.

| | EVERYTHING about fault-on-first is a microarchitectural security/information-leak
| | channel and/or a virtualization hole (unless you trim only on true faults and not on
| | COW or COR or disk-swap page-faults). However, fault-on-first on any page-fault is a
| | much lower-bandwidth information-leak channel than fault-on-first on long-latency
| | cache misses. So a general-purpose system might choose to implement fault-on-first on
| | any page-fault, but might not want to implement fault-on-first on any cache miss.

| | However, there are some systems for which that sort of security issue is not a
| | concern, e.g. a data center or embedded system where all of the CPUs are dedicated to
| | a single problem. In which case, if they can gain performance by doing fault-on-first
| | on particular long-latency cache misses, power to them!

| | Interestingly, although fault-on-first on long-latency cache misses is a high-bandwidth
| | information leak, it is actually much less of a virtualization hole than fault-on-first
| | for page-faults. The operating system or hypervisor has very little control over cache
| | misses; the OS and hypervisor have almost full control over page-faults. The usual rule
| | in security and virtualization is that an application should not be able to detect that
| | it has had an "innocent" page-fault, such as COW or COR or disk swapping.

| | --
| | --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis

| --
| --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis