[RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)
If we allow regular FoF loads to return with vl=0, we must provide a forward-progress guarantee, otherwise the instructions are practically unusable. The forward-progress guarantee must not add overhead to the common cases where returning vl=0 serves no useful purpose. I believe this is difficult to describe, especially when code may have several FoF loads in a stripmine loop. If allowing FoF loads to return vl=0 requires application overhead to support the forward-progress guarantee, then we should have a separate encoding for that instruction so that the common case is not burdened by the esoteric case.

You're incorrectly characterizing FoF below. The FoF loads are not intended for software to dynamically probe the microarch state to check for possible faults (though it can be misused that way). The point is to support software vector-length speculation, where whether an access is really needed is not known ahead of time. The FoF instructions allow software vector-length speculation in a safe way, where the first element is checked as normal and raises any necessary traps to the OS, while the later elements are not processed if they're problematic. Only if software attempts to actually process the later elements, because processing the earlier elements deems it necessary, is the required trap actioned.

The trap is serviced by the OS, not the application. Most commonly, it will be a page fault, sometimes a protection violation. Neither is reported to the application (in general), because the application can do nothing about these traps. This is different from the other cases you bring up (integer overflow, FP flags).

There is no difficulty in providing forward progress on FoF loads in a microarchitecture, as otherwise regular vector loads wouldn't work. FoF loads are only a small modification to regular vector loads, basically flushing the pipeline to change vl on a trap instead of taking the trap and setting vstart.
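The semantics described above can be sketched as a small behavioral model. This is only an illustration in Python, not RISC-V code: the names `Trap`, `fof_load`, and `strlen_fof` are hypothetical, memory is modeled as a dict in which a missing address stands for an access that would fault, and the only reason for trimming modeled here is a fault on a later element.

```python
class Trap(Exception):
    """Models a synchronous trap taken to the OS (e.g. a page fault)."""


def fof_load(memory, base, avl):
    """Model a unit-stride fault-only-first load of up to `avl` elements.

    `memory` maps address -> value; a missing address models a faulting
    access. Returns (values, new_vl).
    """
    if avl == 0:
        return [], 0
    # Element 0 is checked like a regular load: a fault here traps.
    if base not in memory:
        raise Trap(f"fault on element 0 at address {base}")
    values = [memory[base]]
    # Later elements are speculative: a fault trims vl instead of trapping.
    for i in range(1, avl):
        addr = base + i
        if addr not in memory:
            break
        values.append(memory[addr])
    return values, len(values)


def strlen_fof(memory, base, vlmax=8):
    """strlen-style stripmine loop: the length is unknown until data is read."""
    length = 0
    while True:
        values, vl = fof_load(memory, base + length, vlmax)
        if 0 in values:
            return length + values.index(0)
        # vl >= 1 is guaranteed here, so the loop always advances.
        length += vl


if __name__ == "__main__":
    # A 5-byte string plus terminator, with nothing mapped after it.
    mem = {100 + i: ord(c) for i, c in enumerate("hello")}
    mem[105] = 0
    print(strlen_fof(mem, 100, vlmax=4))
```

In this model the loop never needs to test for vl=0: either element 0 completes and the loop advances, or the trap is raised and handled outside the application.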
The only way I would contemplate allowing trimming to vl=0 for the 1.0 standard was if there was a forward-progress guarantee that did not burden regular uses of FoF loads. Also, the guarantee would have to actually enable some improvement in an implementation (as otherwise, no one would choose to trim to 0, and we can then keep the spec simple).

Krste

| On 2020-10-17 6:49 p.m., krste@... wrote:
On Sat, 17 Oct 2020 22:39:37 -0400, "David Horner" <ds2horner@...> said:
|| - [tech-cmo] so they don't get bothered with this off-topic discussion
||
|| On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@...> said:
|| | [DH]: I see this openness/lack of arbitrary constraint as precisely the strength of RISCV.
|| | Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
|| | or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted.
|| | A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases AND guarantees forward progress under
|| | SOFTWARE control.
||
|| | Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero
|| | on the 11th. That check could be in hardware, or it could be in
|| | the software that's calling the FF instruction.
||
|| I don't want us to rathole on how to guarantee forward progress for
|| vl=0 case, but do want to note that this kind of forward progress is
|| nasty to guarantee, implying there's long-lasting microarch state to
|| keep around - what if you're context swapped out before you get to the
|| 11th? Do you have to force the first one after a context swap to not
|| trim? What if there's a sequence of ff's and second one goes back to
|| vl=0?
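The counter scheme suggested in the quoted text, and the context-switch concern raised against it, can be illustrated with a short Python model. This is a sketch under stated assumptions, not a proposed implementation: `CountingFoFUnit`, `load`, and `context_switch` are hypothetical names, the limit of 10 comes from the example in the thread, and the model assumes the streak counter is not saved and restored across a context switch.

```python
class Trap(Exception):
    """Models a trap on element 0 being delivered to the OS."""


class CountingFoFUnit:
    """Hypothetical unit that may return vl=0 at most `limit` times in a row."""

    def __init__(self, limit=10):
        self.limit = limit
        self.zero_streak = 0  # state that must persist between instructions

    def load(self, memory, base, avl):
        """FoF-style load that is allowed to trim all the way to vl=0."""
        values = []
        for i in range(avl):
            if base + i not in memory:
                break
            values.append(memory[base + i])
        if values or avl == 0:
            self.zero_streak = 0
            return values, len(values)
        # Element 0 is problematic: trim to zero until the limit is reached.
        if self.zero_streak < self.limit:
            self.zero_streak += 1
            return [], 0
        self.zero_streak = 0
        raise Trap(f"fault on element 0 at address {base}")

    def context_switch(self):
        # Assumption: the counter is not part of the saved context.
        self.zero_streak = 0


if __name__ == "__main__":
    unit = CountingFoFUnit(limit=10)
    unmapped = {}
    for attempt in range(1, 12):
        try:
            print(attempt, unit.load(unmapped, 0, 4))
        except Trap as exc:
            print(attempt, "trap:", exc)
```

Under these assumptions, calling `context_switch()` more often than once per 10 loads means the trap is never reached, which is one way to picture the "long-lasting microarch state" objection.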
| Krste: I gather your answer is more in the context of lr/sc type forward
| guarantees, instructions that are designed not to trap when delivering
| on their primary function.
| So I agree that determining an appropriate deferred trap is
| problematic.
| However, the intent of vl*ff is to operate in an environment
| anticipating exception behaviour.
| It is the instruction's raison d'être. If the full vl is expected
| to always be returned [with very, very few exceptions] we would not have
| this instruction, but rather direct the EE to reduce vl or abort the
| process.
| So rather than a rathole we have the Elephant-In-The-Room. What
| does the EE do when deferred forward progress is not possible?
| Given that the application is anticipating "trouble" with the read
| memory access, does it make sense to only address the "safe" case?
| With float exceptions RISCV does not provide trap handlers, but
| rather FFlags for the application to electively check.
| With integer overflow or zero divide RISCV does not provide trap
| handlers, but requires the application to include code to detect the
| condition.
| Trap handlers for vl*ff are only incidental. They are no more
| special to vl*ff than any other of the vl*, or the RVI lw,lh,lb, etc.
| In apparent contradiction to the spec, a valid implementation can
| "trap" as it would for the non-ff but not service the fault, only reduce
| the vl accordingly until the fault occurs on the first element.
| Thus central to the functioning of the instruction is what happens
| when the fault occurs on the first element.
| Punting to the handler is not an answer. Return at least one
| element or trap does not define the operational characteristics [even if
| it may arguably be an ISA architectural answer].
| There is nothing prohibiting the trap from returning vl=0. And I
| argue that EEs will indeed elect to do that when there can be no forward
| progress [e.g. the requested address is mapped execute only].
| Platforms will stipulate a behaviour and vl=0 will be a choice.
| What we should try to address is how to allow greatest portability and
| least software fragmentation.
| I believe this should be accomplished exactly as was effected for the
| integer overflow. Exclude the checking code if you do not need it, and
| include it if you are not assured that it is superfluous.
| In other words vl=0 must be handled, either by avoidance or
| explicitly as an indication that there is nothing to process.
|| | But this does not need to be in the RISC-V architectural standard. Not yet.
||
|| Let's agree on this and move on.
| There is no value in ignoring the issue.
||
|| | As long as the VL=0 encoding is free, not used for some other purpose, you can do that in your implementation.
||
|| | Your implementation might not be able to pass the RISC-V architectural for FF, which I assume will probably assert an error if they find FF and
|| | with VL=0. but if your hardware has a chicken bit to reduce the threshold of VL= FFs to zero, or if you have a binary translator from the
|| | compliance tests to your software guaranteed forward progress, sure.
||
|| | Build something like that, so there's a lot of people who want it, and a few years from now we can put it into a future version of the vector
|| | standard.
||
|| To be clear, if this is ever done, it will be with a separate
|| encoding, not expanding behavior of current instructions. Returning
|| vl=0 is not a "free" part of encoding. Software might rightly want to
|| take advantage of knowing vl>0 so you cannot allow same instruction to
|| return vl=0 after the fact, so need a different opcode/mode.
| And it is precisely because this backward compatibility is not managed
| if we tactically ignore vl=0 that we must address it, and allow vl=0 for
| V1.0.
|| || Krste || || || | On 10/16/2020 4:48, David Horner wrote: || || | First I am very happy that "arbitrary decisions by the micro-architecture" allow reduction of vl to any [non-zero] value. || || | Even if such appear "random". || || | On 2020-10-16 2:01 a.m., krste@... wrote: || || | - I'm sure there's probably || | papers out there with this already). || || | Exactly. || | I see this openness/lack of arbitrary constraint as precisely the strength of RISCV. || | Limiting vector operations due to current constraints in software (Linuz does it this way, compilers cannot optimize that formulation[yet]) || | or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted. || || | A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases AND guarantees forward progress under || | SOFTWARE control. || || | I see it as no different [in fundamental principle] than other cases such as RVI integer divide by zero behaviour that does not trap but can || | be readily checked for. || | Also RVI integer overflow that if you want to check for it is at most a few instructions including the branch. || || | (sending replies to vector list - as this is off topic for CMOs) || || | My opinion is that baking SIMT execution model into ISA for purposes || | of exposing microarchitectural performance (i.e., cache misses) || | exposes too much of the machine, forcing application software to add || | extra retry loops (2nd nested loop inside of stripmining) and forcing || | system software to deal with complex traps. || || | [ Random historical connection - having a partial completion mask based || | on cache misses is a vector version of the Stanford proposal for || | "informing memory operations" where scalar core can branch on cache miss. 
|| | https://dl.acm.org/doi/10.1145/232974.233000 ] || | Most of the benefit for SIMT execution around microarchitectural || | hiccups can be obtained under the hood in the microarchitecture (and || | there are several hundred ISCA/MICRO/HPCA papers on doing that - I || | might be exaggerating, but only slightly - and I know Andy worked in || | this space at some point), and should outperform putting this handling || | into software. || || | That said, I think it's OK to allow FF V loads to stop anywhere past || | element 0 including at a long-latency cache miss, mainly because it || | doesn't change anything in software model. || || | I'm not sure it will really help perf that much in practice. While || | it's easy to construct an example where it looks like it would help, I || | think in general most loops touch multiple vector operands, hardware || | prefetchers do well on vector streams, vector units are more efficient || | on larger chunks, scatter-gathers missing in cache limit perf anyway, || | etc., so it's probably a fairly brittle optimization (yes, you could || | add a predictor to figure out whether to wait for the other elements || | or go ahead with a partial vector result - I'm sure there's probably || | papers out there with this already). || || | Krste || || | On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@...> said: || || | | My take is the same as Andrew has outlined below. || | | Bill || || | | On 10/15/20 4:30 PM, andrew@... wrote: || || | | EXTERNAL MAIL || | | Forwarding this to tech-vector-ext; couple comments below. || | | On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@...> wrote: || || | | In vector meeting last Friday I listened to both Krste and David Horner's different opinions about fault-on-first and vector || | length trimming. I realized (and may have || | | convinced other attendees) that the RISC-V "fault-on-first" vector length trimming need not be done just for things like || | page-faults. 
|| | | Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed, || | because vector element zero is the forward progress || | | mechanism. || | | Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the element zero of the || | operation, but that it can randomly stop with the appropriate || | | indication for vector length trimming at any point in the middle of the instruction. || | | Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values). || | Enumerating all cases in which the length might be || | | trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go. || || | | This is part of what David Horner wants. However, it does not give him the fault-on-first with zero length complete || | mechanism. It could, if there were something else in || | | the system that guaranteed forward progress || | | My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward || | progress, and cleanly matches the while-loop vectorization || | | model. || || | | ---+ Expanded || || | | From vector meeting last Friday: trimming, fault-on-first. I realized that it is similar to the forms of SW visible non-faulting || | speculative loads some machines, especially || | | VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The || | NaT-ness is implied by trimmed vector length. || | | It could be implied by a mask showing which vector operations had completed. || || | | All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison. || | Software control flow must fall through such a || | | check operation, and through a redo of the faulting load if necessary. 
In scalar, non-faulting and faulting loads are different || | instructions, so there must be a branch. || || | | The RISC-V Fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction. i.e. it is || | its own non-faulting check. it gets away with || | | this because the trend vector length indicates which parts were valid and not. forward progress is guaranteed by trapping on || | vector element zero, i.e. never allowing a trim || | | to zero length. if a non-faulting vector approach was used instead of fault-on-first, it could return a vector complete mask, || | but to make forward progress it would have to || | | guarantee that at least one vector element had completed. || || | | David Horner's desire for fault-on-first that may have performed no operations at all is (1) reasonable IMHO (I think I managed || | to explain that the Krste), but (2) Would || | | require some other mechanism for forward progress. E.g. instead of trapping on element zero, the bitmask that I described above. || | Which is almost certainly a bigger || | | architectural change than RISC-V should make it this time. || || | | Although more and more I am happier that I included such a completion bitmask in newly every vector instruction set that I've || | ever done. Particularly those vector instruction || | | sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what || | amounts to a vector instruction set and || | | microarchitecture. https://pharr.org/matt/papers/ispc_inpar_2012.pdf ). It would be unfortunate for such an SIMT program to || | lose work completed after the first fault. || || | | MORAL: fault-on-first may be suitable for vector load that might speculate past the end of the vector - where the length is || | not known or inconvenient when the vector load || | | instruction is started. Fault-on-first is suboptimal for running SIMT on top of vectors. i.e. 
fault-on-first is the || | equivalent of precise exceptions for in order || | | execution, and for a single thread executing vector instructions, whereas completion mask allows out of order within a vector || | and/or vector length threading. || || | | IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is || | totally fine to have the fault-on-first stuff || | | return up to the first really long latency cost miss, as long as it always guarantees that at least vector element zero was || | complete. Because vector element zero complete || | | is what guarantees forward progress. || || | | Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could actually choose to || | actually implement a page-fault that did || | | copy-on-write or swapped in from disk. but that would be visible to the operating system, not the user program. However, such || | an OS implementation would have to || | | guarantee that it would not kill a process as a result of a true permissions error page-fault. Or, if the virtual memory || | architecture made the distinction between || | | permissions faults and the sorts of page-fault that is for disk swapping or copy-on-write or copy on read, the OS does not need || | to be involved. || || | | EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole. (Unless || | you only trim only on true faults and not COW || | | or COR or disk swappage-faults). However, fault-on-first on any page-fault is a much lower bandwidth information leak || | channel than is fault-on-first on long latency || | | cache misses. so a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to || | implement fault-on-first on any cache miss. || | | However, there are some systems for which that sort of security issue is not a concern. E.g. 
a data center or embedded system || | where all of the CPUs are dedicated to a single || | | problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to || | them! || || | | Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually much || | less of a virtualization hole than || | | fault-on-first for page-faults. The operating system or hypervisor has very little control over cache misses. the OS and || | hypervisor have almost full control over || | | page-faults. The usual rule in security and virtualization is that an application should not be able to detect that it has had || | an "innocent" page-fault, such as COW or COR || | | or disk swapping. || || | | -- || | | --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis || | | || || | -- || | --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis | |
|
David Horner
| You're incorrectly characterizing FoF below. The FoF loads are not
| intended for software to dynamically probe the microarch state to check for possible faults

That is not what I am advocating.

| (though it can be misused that way). The
| point is to support software vector-length speculation, where whether an access is really needed is not known ahead of time.

That is not precisely the full use case.
Rather, your intended use case is: when the application is assured that a constrained load can succeed [the system guarantees a termination condition for the load exists, that it is detectable from the data read up to and including the end point, and that all the data from the start point to the end point is readable], then FoF provides a convenient and expedited way to advance through the load.

And if you define "not known ahead of time" to mean before each successive load, then that time frame is not precisely true either. The load could be performed one unit at a time, and each time the need would be known. The unit requested could be of arbitrary length [successive packets of ethernet data or crypto segments].

I'm not trying to be obtuse and oppositional.

The value of FoF is to avoid the complexities of such tracking, but if an EE were to reasonably guarantee that the data to be loaded can be speculatively read up to a page boundary, then FoF is not needed, nor does it necessarily provide any hard advantage over the regular strided load. [Some implementations may detect such things as debug breakpoints and not trigger them, but as far as the software is concerned it has the speculative to-the-end-of-the-page guarantee, thus it will be content even if the debugger is annoyed.]
| The FoF loads are not intended for software to dynamically probe the microarch state to check for possible faults (though it can be misused that way).

The detection of microarch state is incidental to the characterization I attribute to FoF. And it is not only microarch state that can be revealed but system and EE level state.

FoF fails in situations that are not covered by your use case. Specifically, what does the EE do when it detects a situation in which forward progress is not possible, e.g. the data requested is not mapped into the process?

As I understand your use case, the [standard] FoF load is aborted and its process as well. The "enhanced/dangerous" FoF load will be allowed vl=0 to identify the "abort" case.
Consider this scenario: a process requests the EE to map into another process' [e.g. child's] address space pages to scan, and the asynchronous [child] co-process does the scanning. FoF return vl=0 is eminently suited to this use case.

It is certainly possible to add to the handshaking/synchronization process the current end point of the data that would need to be checked as each page is processed. This can be substantial overhead and delay.

It is certainly possible to ensure that each request overreaches the natural page alignment. However, as FoF allows the processor to reduce vl at any point, it could continually reduce vl so that it is better aligned to cache, anticipating that the following request will be optimized. The program will still work, and detect potential page failures, but the false positives could be substantial and even more costly and substantially variable across implementations [not to mention the EE thinking the process is attempting a side-channel attack].

These use cases argue for vl=0 return. And as I mentioned before, these use cases will motivate the EE to return vl=0, even without the application using the "new/corrupted" FoF encoding for vl=0 allowed.
On Tue, Oct 20, 2020 at 5:08 AM Krste Asanovic <krste@...> wrote:

I believe I have shown practical uses above.

| The forward-progress guarantee must not add overhead to the ...

I certainly agree. But when does returning vl=0 serve no useful purpose?

| ... this is difficult to describe, especially when code may have several ...

There are different forward-progress guarantees.

As I mentioned before, separate encoding will not provide a practical benefit.
Once the new encoding is introduced, legacy processors will just have their EE emulate it by allowing vl=0 return under the same conditions and the link editor will replace the new FoF with the old.

As mentioned before, if we think outside the box of the "classic" use case, there certainly are meaningful and significant ways that applications can handle EE level events (analogous to divide by zero).

The default case is just such a non-burdensome approach.
Check vl=0 if you are not guaranteed to succeed.
Ignore vl=0 at your peril if you are unsure (you could end up in an infinite loop).
Ignore vl=0 if you are guaranteed not to read past valid memory.
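The three rules above can be pictured with a small Python model of a stripmine loop that does check for vl=0. This is a sketch only: `fof_load_allow_zero` is a hypothetical FoF variant that trims to vl=0 rather than trapping on element 0 (not the behaviour of the current spec), `scan_for` is an illustrative name, and memory is modeled as a dict where a missing address stands for unreadable data.

```python
def fof_load_allow_zero(memory, base, avl):
    """Hypothetical FoF variant: trims to vl=0 instead of trapping on element 0."""
    values = []
    for i in range(avl):
        if base + i not in memory:
            break
        values.append(memory[base + i])
    return values, len(values)


def scan_for(memory, base, target, vlmax=8):
    """Return the offset of `target` from `base`, or None if the data runs out."""
    offset = 0
    while True:
        values, vl = fof_load_allow_zero(memory, base + offset, vlmax)
        if vl == 0:
            # Explicit software check: nothing left to process. Without it
            # the loop would never advance and would spin forever.
            return None
        if target in values:
            return offset + values.index(target)
        offset += vl


if __name__ == "__main__":
    mem = {i: 2 * i for i in range(10)}  # addresses 0..9 are readable
    print(scan_for(mem, 0, 14, vlmax=4))
    print(scan_for(mem, 0, 99, vlmax=4))
```

The cost of the check in this model is one compare and branch per iteration; code that is guaranteed never to read past valid memory could omit it.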
| Also, the guarantee would have to ...

The spec will need to address this case in any event, even if to say we do not recommend EE return with vl=0. The spec cannot mandate that EE not return vl=0. Certification does not extend to runtime constrained EEs. Code needs to be aware that this can happen.

The net is, I don't believe the "prohibition" significantly simplifies the spec. It may actually make it more contentious.

You simplified integer divide over other ISAs that mandated a trap for divide by zero. With this approach we mandate a trap for FoF when vl=0 would be sufficient, where it is inevitable that the EE will do the sensible thing and return vl=0 when forward progress [within reasonable constraints] is not possible.
|
|
krste@...
- [tech-cmo] so they don't get bothered with this off-topic discussion
| [DH]: I see this openness/lack of arbitrary constraint as precisely the strength of RISCV.On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@...> said: | Limiting vector operations due to current constraints in software (Linuz does it this way, compilers cannot optimize that formulation[yet]) | or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted. | A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases AND guarantees forward progress under | SOFTWARE control. | Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero | on the 11th. That check could be in hardware, or it could be in | the software that's calling the FF instruction. I don't want us to rathole on how to guarantee forward progress for vl=0 case, but do want to note that this kind of forward progress is nasty to guarantee, implying there's long-lasting microarch state to keep around - what if you're context swapped out before you get to the 11th? Do you have to force the first one after a context swap to not trim? What if there's a sequence of ff's and second one goes back to vl=0? | But this does not need to be in the RISC-V architectural standard. Not yet. Let's agree on this and move on. | As long as the VL=0 encoding is free, not used for some other purpose, you can do that in your implementation. | Your implementation might not be able to pass the RISC-V architectural for FF, which I assume will probably assert an error if they find FF and | with VL=0. but if your hardware has a chicken bit to reduce the threshold of VL= FFs to zero, or if you have a binary translator from the | compliance tests to your software guaranteed forward progress, sure. | Build something like that, so there's a lot of people who want it, and a few years from now we can put it into a future version of the vector | standard. 
To be clear, if this is ever done, it will be with a separate encoding, not expanding behavior of current instructions. Returning vl=0 is not a "free" part of encoding. Software might rightly want to take advantage of knowing vl>0 so you cannot allow same instruction to return vl=0 after the fact, so need a different opcode/mode. Krste | On 10/16/2020 4:48, David Horner wrote: | First I am very happy that "arbitrary decisions by the micro-architecture" allow reduction of vl to any [non-zero] value. | Even if such appear "random". | On 2020-10-16 2:01 a.m., krste@... wrote: | - I'm sure there's probably | papers out there with this already). | Exactly. | I see this openness/lack of arbitrary constraint as precisely the strength of RISCV. | Limiting vector operations due to current constraints in software (Linuz does it this way, compilers cannot optimize that formulation[yet]) | or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted. | A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases AND guarantees forward progress under | SOFTWARE control. | I see it as no different [in fundamental principle] than other cases such as RVI integer divide by zero behaviour that does not trap but can | be readily checked for. | Also RVI integer overflow that if you want to check for it is at most a few instructions including the branch. | (sending replies to vector list - as this is off topic for CMOs) | My opinion is that baking SIMT execution model into ISA for purposes | of exposing microarchitectural performance (i.e., cache misses) | exposes too much of the machine, forcing application software to add | extra retry loops (2nd nested loop inside of stripmining) and forcing | system software to deal with complex traps. 
| [ Random historical connection - having a partial completion mask based | on cache misses is a vector version of the Stanford proposal for | "informing memory operations" where scalar core can branch on cache miss. | https://dl.acm.org/doi/10.1145/232974.233000 ] | Most of the benefit for SIMT execution around microarchitectural | hiccups can be obtained under the hood in the microarchitecture (and | there are several hundred ISCA/MICRO/HPCA papers on doing that - I | might be exaggerating, but only slightly - and I know Andy worked in | this space at some point), and should outperform putting this handling | into software. | That said, I think it's OK to allow FF V loads to stop anywhere past | element 0 including at a long-latency cache miss, mainly because it | doesn't change anything in software model. | I'm not sure it will really help perf that much in practice. While | it's easy to construct an example where it looks like it would help, I | think in general most loops touch multiple vector operands, hardware | prefetchers do well on vector streams, vector units are more efficient | on larger chunks, scatter-gathers missing in cache limit perf anyway, | etc., so it's probably a fairly brittle optimization (yes, you could | add a predictor to figure out whether to wait for the other elements | or go ahead with a partial vector result - I'm sure there's probably | papers out there with this already). | Krste | On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@...> said: | | My take is the same as Andrew has outlined below. | | Bill | | On 10/15/20 4:30 PM, andrew@... wrote: | | EXTERNAL MAIL | | Forwarding this to tech-vector-ext; couple comments below. | | On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@...> wrote: | | In vector meeting last Friday I listened to both Krste and David Horner's different opinions about fault-on-first and vector | length trimming. 
| I realized (and may have convinced other attendees) that the RISC-V "fault-on-first" vector length trimming need not be done just for things like page-faults.
|
| Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed, because vector element zero is the forward progress mechanism.
|
| Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes element zero of the operation, but that it can randomly stop, with the appropriate indication for vector length trimming, at any point in the middle of the instruction.
|
| Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values). Enumerating all cases in which the length might be trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.
|
| This is part of what David Horner wants. However, it does not give him the fault-on-first with zero length complete mechanism. It could, if there were something else in the system that guaranteed forward progress.
|
| My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization model.
|
| ---+ Expanded
|
| From vector meeting last Friday: trimming, fault-on-first. I realized that it is similar to the forms of SW-visible non-faulting speculative loads that some machines, especially VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The NaT-ness is implied by the trimmed vector length. It could instead be implied by a mask showing which vector operations had completed.
|
| All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison. Software control flow must fall through such a check operation, and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different instructions, so there must be a branch.
|
| The RISC-V fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction, i.e. it is its own non-faulting check. It gets away with this because the trimmed vector length indicates which parts were valid and which were not. Forward progress is guaranteed by trapping on vector element zero, i.e. never allowing a trim to zero length. If a non-faulting vector approach were used instead of fault-on-first, it could return a vector completion mask, but to make forward progress it would have to guarantee that at least one vector element had completed.
|
| David Horner's desire for a fault-on-first that may have performed no operations at all (1) is reasonable IMHO (I think I managed to explain that to Krste), but (2) would require some other mechanism for forward progress, e.g. instead of trapping on element zero, the bitmask that I described above. That is almost certainly a bigger architectural change than RISC-V should make at this time.
|
| Although more and more I am happy that I included such a completion bitmask in nearly every vector instruction set that I've ever done, particularly those vector instruction sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what amounts to a vector instruction set and microarchitecture. https://pharr.org/matt/papers/ispc_inpar_2012.pdf ) It would be unfortunate for such a SIMT program to lose work completed after the first fault.
|
| MORAL: fault-on-first may be suitable for a vector load that might speculate past the end of the vector, where the length is not known or is inconvenient to compute when the vector load instruction is started. Fault-on-first is suboptimal for running SIMT on top of vectors. I.e. fault-on-first is the equivalent of precise exceptions for in-order execution, and for a single thread executing vector instructions, whereas a completion mask allows out-of-order execution within a vector and/or vector length threading.
|
| IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is totally fine to have fault-on-first return up to the first really long latency cache miss, as long as it always guarantees that at least vector element zero was completed, because completing vector element zero is what guarantees forward progress.
|
| Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could choose to actually take a page-fault that did copy-on-write or swapped in from disk, but that would be visible to the operating system, not the user program. However, such an OS implementation would have to guarantee that it would not kill a process as a result of a true permissions-error page-fault. Or, if the virtual memory architecture made the distinction between permissions faults and the sorts of page-fault that are for disk swapping or copy-on-write or copy-on-read, the OS would not need to be involved.
|
| EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole (unless you trim only on true faults and not on COW or COR or disk-swap page-faults). However, fault-on-first on any page-fault is a much lower bandwidth information leak channel than is fault-on-first on long latency cache misses. So a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss.
|
| However, there are some systems for which that sort of security issue is not a concern, e.g. a data center or embedded system where all of the CPUs are dedicated to a single problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to them!
|
| Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually much less of a virtualization hole than fault-on-first for page-faults. The operating system or hypervisor has very little control over cache misses. The OS and hypervisor have almost full control over page-faults. The usual rule in security and virtualization is that an application should not be able to detect that it has had an "innocent" page-fault, such as COW or COR or disk swapping.
|
| --
| --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis
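The "trim to any length >= 1, for any reason" semantics described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not RISC-V code: `fof_load`, `strlen_stripmine`, and `PageFault` are hypothetical names, memory is modelled as a Python dict in which missing addresses stand for unmapped pages, and a random number generator stands in for whatever microarchitectural reason (fault, long latency cache miss, anything else) causes the trim.

```python
import random


class PageFault(Exception):
    """Models a trap on element 0 (serviced by the OS, not the application)."""


def fof_load(memory, base, vl, rng):
    """Hypothetical model of a fault-on-first load.

    Returns (values, new_vl) with 1 <= new_vl <= vl whenever vl > 0.
    """
    if vl == 0:
        return [], 0
    if base not in memory:
        # Element 0 is checked as normal and traps; it is never trimmed away.
        raise PageFault(base)
    # Assumption: the implementation may stop anywhere after element 0.
    limit = rng.randint(1, vl)
    values = []
    for i in range(limit):
        addr = base + i
        if addr not in memory:
            break  # a later element would fault: trim instead of trapping
        values.append(memory[addr])
    return values, len(values)


def strlen_stripmine(memory, base, vlmax, rng):
    """While-loop vectorization: scan for a 0 terminator using FoF loads."""
    n = 0
    while True:
        values, vl = fof_load(memory, base + n, vlmax, rng)
        for i, v in enumerate(values):
            if v == 0:
                return n + i
        n += vl  # vl >= 1, so every iteration makes forward progress
```

The loop never needs a vl=0 check: because element 0 either completes or traps, `n` strictly increases on every iteration no matter how aggressively the model trims.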
|
I have not totally been following this discussion, but at Convex we handled this very simply: if VL = 0, no vector operation was executed, the vector instruction was executed, and sequential operation proceeded. To the best of my knowledge this never came up as an issue.
--------------------------------------------------------------------
On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@...> said:
| [DH]: I see this openness/lack of arbitrary constraint as precisely the strength of RISC-V.
| Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
| or hardware (reminiscent of delayed branch because prediction was too expensive) is short-sighted.
| A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases AND guarantees forward progress under
| SOFTWARE control.
|
| Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero
| on the 11th. That check could be in hardware, or it could be in
| the software that's calling the FF instruction.

I don't want us to rathole on how to guarantee forward progress for the vl=0 case, but I do want to note that this kind of forward progress is nasty to guarantee, implying there's long-lasting microarchitectural state to keep around. What if you're context-swapped out before you get to the 11th? Do you have to force the first one after a context swap to not trim? What if there's a sequence of FFs and the second one goes back to vl=0?
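The "no more than 10 successive vl=0 returns, then force element zero on the 11th" idea can be sketched as a software-side check. This is a toy model, not a proposal for the ISA: `fof_load_may_return_zero`, `forced_load`, `copy_with_bounded_retries`, and `MAX_ZERO_TRIMS` are hypothetical names, and the random generator again stands in for an implementation that is allowed to return vl=0 for any reason.

```python
import random

MAX_ZERO_TRIMS = 10  # the figure used as an example in the thread


def fof_load_may_return_zero(data, start, vl, rng):
    """Hypothetical FoF variant that may return zero elements."""
    n = rng.randint(0, min(vl, len(data) - start))
    return data[start:start + n]


def forced_load(data, start, vl):
    """Stand-in for a load that must complete element 0 (or trap)."""
    return data[start:start + 1]


def copy_with_bounded_retries(data, vlmax, rng):
    """Stripmine copy that tolerates vl=0 returns but still terminates."""
    out, pos, zero_trims = [], 0, 0
    while pos < len(data):
        if zero_trims < MAX_ZERO_TRIMS:
            chunk = fof_load_may_return_zero(data, pos, vlmax, rng)
        else:
            # Too many empty returns in a row: insist on element 0.
            chunk = forced_load(data, pos, vlmax)
        # Any progress resets the count of consecutive empty returns.
        zero_trims = 0 if chunk else zero_trims + 1
        out.extend(chunk)
        pos += len(chunk)
    return out
```

In this sketch the retry count is ordinary program state, so it would be saved and restored with the rest of the thread's registers across a context switch; the price is the extra per-iteration bookkeeping that the thread is concerned about. A hardware-side counter would raise exactly the questions listed above.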
|
David Horner
On 2020-10-21 6:33 p.m., swallach wrote:
https://github.com/riscv/riscv-v-spec/issues/587#issuecomment-711087236
To clarify, Andrew's reading of the spec has the vstart >= vl behaviour superseding the behaviour implied by vl=0. Thus some vector instructions are executed even when vl=0; vfirst and vpopc are two of them.
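One way to picture why vfirst and vpopc can still "execute" with vl=0 is that they produce a scalar result, and a scalar result is well defined over an empty element range. The sketch below is a plain software model with hypothetical function signatures; it ignores vstart and masking, and it assumes the empty-range results are 0 for vpopc and -1 for vfirst.

```python
def vpopc(mask, vl):
    """Count the set mask bits among the first vl elements."""
    return sum(1 for bit in mask[:vl] if bit)


def vfirst(mask, vl):
    """Index of the first set mask bit among the first vl elements, else -1."""
    for i, bit in enumerate(mask[:vl]):
        if bit:
            return i
    return -1
```

With vl=0 neither function touches any element, yet both still return a value that the caller can branch on, which is different from a vector arithmetic instruction that simply writes nothing.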
|
|