Forwarding this to tech-vector-ext; couple comments below.
On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 < andy.glew@...> wrote:
In vector meeting last Friday I listened to both Krste and David Horner's different opinions about fault-on-first and vector length trimming. I realized (and may have convinced other attendees) that the RISC-V "fault-on-first" vector length trimming need not be done just for things like page-faults.
Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed, because vector element zero is the forward progress mechanism.
Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the element zero of the operation, but that it can randomly stop with the appropriate indication for vector length trimming at any point in the middle of the instruction.
Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values). Enumerating all cases in which the length might be trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.
This is part of what David Horner wants. However, it does not give him the fault-on-first with zero length complete mechanism. It could, if there were something else in the system that guaranteed forward progress.
My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization model.
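To make the "element 0 completes or traps, anything later may trim" rule concrete, here is a minimal Python sketch of a unit-stride fault-on-first load and a strlen-style while loop built on it. It is a toy model, not the RISC-V specification: `load_ff`, `strlen_ff`, `ElementFault`, the `early_stop` hook (standing in for any microarchitectural reason to stop, such as a cache miss) and the use of `None` for an inaccessible address are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple


class ElementFault(Exception):
    """Raised when element 0 of the modeled load cannot be accessed."""


def load_ff(memory: List[Optional[int]], base: int, vl: int,
            early_stop: Callable[[int], bool] = lambda i: False
            ) -> Tuple[List[int], int]:
    """Toy model of a unit-stride fault-on-first load.

    memory[a] is None when address a is inaccessible. Element 0 either
    completes or traps. Any later element may end the load early, either
    because it would fault or because early_stop(i) says so.
    Returns (values, new_vl) with 1 <= new_vl <= vl.
    """
    if vl < 1:
        raise ValueError("this model assumes a requested vl of at least 1")
    values: List[int] = []
    for i in range(vl):
        addr = base + i
        accessible = addr < len(memory) and memory[addr] is not None
        if i == 0:
            if not accessible:
                raise ElementFault(f"fault at address {addr}")
        elif not accessible or early_stop(i):
            break  # trim: elements i.. are simply not loaded
        values.append(memory[addr])
    return values, len(values)


def strlen_ff(memory, base, vlmax=4, early_stop=lambda i: False):
    """While-loop (strlen-style) stripmining on top of load_ff."""
    n = 0
    while True:
        values, vl = load_ff(memory, base + n, vlmax, early_stop)
        for v in values:
            if v == 0:
                return n
            n += 1
        # new_vl >= 1, so n grew by at least one element: forward progress.
```

With `memory = [104, 105, 33, 0, None, None]`, calling `strlen_ff(memory, 0, vlmax=8)` reads past the terminator without trapping, and passing `early_stop=lambda i: True` (trim after every element) still terminates, only more slowly.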
---+ Expanded
From vector meeting last Friday: trimming, fault-on-first.
I realized that it is similar to the forms of SW visible non-faulting speculative loads some machines, especially VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The NaT-ness is implied by trimmed vector length. It could be implied by a mask showing which vector operations had completed.
All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison. Software control flow must fall through such a check operation, and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different instructions, so there must be a branch.
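The scalar pattern described above (speculative non-faulting load, then a check, then a redo with the faulting load) can be sketched roughly as follows. This is a toy model under stated assumptions: `NAT`, `LoadFault`, `load_faulting`, `load_nonfaulting` and `speculative_then_check` are hypothetical names, and `None` again marks an inaccessible address.

```python
NAT = object()  # "Not a Thing": marks a speculative result that is not valid


class LoadFault(Exception):
    """Raised by the ordinary (faulting) load in this model."""


def load_faulting(memory, addr):
    """Ordinary load: traps on an inaccessible address."""
    if addr < 0 or addr >= len(memory) or memory[addr] is None:
        raise LoadFault(addr)
    return memory[addr]


def load_nonfaulting(memory, addr):
    """Speculative load: never traps, delivers NAT instead."""
    try:
        return load_faulting(memory, addr)
    except LoadFault:
        return NAT


def speculative_then_check(memory, addr, needed):
    """Hoist the load above the condition that guards it, then check."""
    value = load_nonfaulting(memory, addr)   # speculative, cannot trap
    if not needed:
        return None                          # speculation wasted, harmlessly
    if value is NAT:                         # the "was this correct" check
        value = load_faulting(memory, addr)  # redo: traps for real if it must
    return value
```

The branch on `value is NAT` is the extra control flow the text refers to: two different load operations, with a check between them.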
The RISC-V fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction, i.e. it is its own non-faulting check. It gets away with this because the trimmed vector length indicates which parts were valid and which were not. Forward progress is guaranteed by trapping on vector element zero, i.e. never allowing a trim to zero length. If a non-faulting vector approach were used instead of fault-on-first, it could return a vector completion mask, but to make forward progress it would have to guarantee that at least one vector element had completed.
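As a rough illustration of the completion-mask alternative, the sketch below models a non-faulting vector load that reports which elements completed, plus the software retry loop it would need. Everything here is hypothetical (`load_masked`, `gather_all`, `LoadTrap`, the `can_complete` hook); the one rule taken from the text is that at least one element must complete per attempt, which this model enforces by making the first pending element complete or trap.

```python
class LoadTrap(Exception):
    """Raised when the first pending element is truly inaccessible."""


def load_masked(memory, base, pending, can_complete):
    """Hypothetical non-faulting vector load with a completion mask.

    pending[i] is True for elements still wanted. can_complete(addr) stands
    in for "this element finished on this attempt". The first pending
    element always completes or traps, which gives forward progress.
    Returns (values, done); done[i] marks elements loaded on this attempt.
    """
    n = len(pending)
    values = [None] * n
    done = [False] * n
    first = True
    for i in range(n):
        if not pending[i]:
            continue
        addr = base + i
        accessible = addr < len(memory) and memory[addr] is not None
        if first:
            first = False
            if not accessible:
                raise LoadTrap(f"fault at address {addr}")
            ok = True
        else:
            ok = accessible and can_complete(addr)
        if ok:
            values[i] = memory[addr]
            done[i] = True
    return values, done


def gather_all(memory, base, n, can_complete):
    """Retry until every element has been loaded; returns (result, attempts)."""
    result = [None] * n
    pending = [True] * n
    attempts = 0
    while any(pending):
        values, done = load_masked(memory, base, pending, can_complete)
        attempts += 1
        for i in range(n):
            if done[i]:
                result[i] = values[i]
                pending[i] = False
    return result, attempts
```

Because each attempt retires at least one pending element, `gather_all` needs at most `n` attempts; elements that complete out of order are kept rather than discarded.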
David Horner's desire for fault-on-first that may have performed no operations at all is (1) reasonable IMHO (I think I managed to explain that to Krste), but (2) would require some other mechanism for forward progress, e.g. instead of trapping on element zero, the bitmask that I described above. Which is almost certainly a bigger architectural change than RISC-V should make at this time.
Although more and more I am happier that I included such a completion bitmask in nearly every vector instruction set that I've ever done. Particularly those vector instruction sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what amounts to a vector instruction set and microarchitecture. https://pharr.org/matt/papers/ispc_inpar_2012.pdf ). It would be unfortunate for such an SIMT program to lose work completed after the first fault.
MORAL: fault-on-first may be suitable for a vector load that might speculate past the end of the vector, where the length is not known or is inconvenient to determine when the vector load instruction is started. Fault-on-first is suboptimal for running SIMT on top of vectors, i.e. fault-on-first is the equivalent of precise exceptions for in-order execution, and for a single thread executing vector instructions, whereas a completion mask allows out-of-order within a vector and/or vector length threading.
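A tiny sketch of why the prefix rule loses work that a completion mask would keep. The per-element `ok` flags and both function names are made up for illustration; `ok[0]` is assumed True, since otherwise a fault-on-first load would trap.

```python
def retained_fault_on_first(ok):
    """Elements kept when the load is trimmed at the first element that is
    not ok (prefix semantics, like a trimmed vector length)."""
    n = 0
    for flag in ok:
        if not flag:
            break
        n += 1
    return n


def retained_completion_mask(ok):
    """Elements kept when every completed element is reported in a mask."""
    return sum(1 for flag in ok if flag)


# Eight SIMT "threads"; threads 2 and 6 hit a fault or a long-latency miss.
ok = [True, True, False, True, True, True, False, True]
print(retained_fault_on_first(ok), retained_completion_mask(ok))
```

In this example the prefix rule keeps 2 elements and the mask keeps 6; the four completed elements after the first stall are the "lost work" in the SIMT case.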
IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is totally fine to have the fault-on-first stuff return up to the first really long latency cache miss, as long as it always guarantees that at least vector element zero was complete. Because vector element zero complete is what guarantees forward progress.
Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could choose to actually implement a page-fault that did copy-on-write or swapped in from disk, but that would be visible to the operating system, not the user program. However, such an OS implementation would have to guarantee that it would not kill a process as a result of a true permissions error page-fault. Or, if the virtual memory architecture made the distinction between permissions faults and the sorts of page-fault that are for disk swapping or copy-on-write or copy-on-read, the OS does not need to be involved.
EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole. (Unless you trim only on true faults and not on COW or COR or disk swap page-faults.) However, fault-on-first on any page-fault is a much lower bandwidth information leak channel than is fault-on-first on long latency cache misses. So a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss. However, there are some systems for which that sort of security issue is not a concern, e.g. a data center or embedded system where all of the CPUs are dedicated to a single problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to them!
Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually much less of a virtualization hole than fault-on-first for page-faults. The operating system or hypervisor has very little control over cache misses. The OS and hypervisor have almost full control over page-faults. The usual rule in security and virtualization is that an application should not be able to detect that it has had an "innocent" page-fault, such as COW or COR or disk swapping.
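To see why trimming on cache misses is an information channel, consider a toy model in which the returned vector length is a direct function of which cache lines are resident. The function name, the `cached_lines` set and the `line_size` of 4 elements are assumptions for illustration only; no real cache or ISA behaves exactly like this.

```python
def load_ff_stop_on_miss(cached_lines, base, vl, line_size=4):
    """Trimmed vl of a modeled fault-on-first load that stops at the first
    element after element 0 whose cache line is not resident.

    cached_lines is the set of resident line numbers (address // line_size).
    Element 0 always completes in this model, so the result is at least 1.
    """
    new_vl = 1
    for i in range(1, vl):
        if (base + i) // line_size not in cached_lines:
            break
        new_vl += 1
    return new_vl


# The same instruction, observed from user code, under two cache states:
print(load_ff_stop_on_miss({0, 1}, base=0, vl=16))
print(load_ff_stop_on_miss({0}, base=0, vl=16))
```

The two calls return different lengths (8 and 4 in this model), so the user-visible vl reveals cache residency without any timing measurement.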
--
--- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis
|
|
This is part of what David Horner wants. However, it does not give him the fault-on-first with zero length complete mechanism. It could, if there were something else in the system that guaranteed forward progress
My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization model.
Yep, it's sufficient for the needs of while-loop vectorization. It is suboptimal for "SIMT on vector". For that you need a completion mask, and it is far too late to add that to the RISC-V vector spec.
|
|
My take is the same as Andrew has outlined below.
Bill
|
|

Krste Asanovic
(sending replies to vector list - as this is off topic for CMOs)
My opinion is that baking SIMT execution model into ISA for purposes of exposing microarchitectural performance (i.e., cache misses) exposes too much of the machine, forcing application software to add extra retry loops (2nd nested loop inside of stripmining) and forcing system software to deal with complex traps.
[ Random historical connection - having a partial completion mask based on cache misses is a vector version of the Stanford proposal for "informing memory operations" where scalar core can branch on cache miss. https://dl.acm.org/doi/10.1145/232974.233000 ]
Most of the benefit for SIMT execution around microarchitectural hiccups can be obtained under the hood in the microarchitecture (and there are several hundred ISCA/MICRO/HPCA papers on doing that - I might be exaggerating, but only slightly - and I know Andy worked in this space at some point), and should outperform putting this handling into software.
That said, I think it's OK to allow FF V loads to stop anywhere past element 0 including at a long-latency cache miss, mainly because it doesn't change anything in software model.
I'm not sure it will really help perf that much in practice. While it's easy to construct an example where it looks like it would help, I think in general most loops touch multiple vector operands, hardware prefetchers do well on vector streams, vector units are more efficient on larger chunks, scatter-gathers missing in cache limit perf anyway, etc., so it's probably a fairly brittle optimization (yes, you could add a predictor to figure out whether to wait for the other elements or go ahead with a partial vector result - I'm sure there's probably papers out there with this already).
Krste
On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@...> said:
| My take is the same as Andrew has outlined below.
| Bill
[...]
|
|
On 2020-10-15 7:30 p.m., Andrew Waterman wrote:
My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization model.
I agree. However, it still does not answer the ISA-visible behavioural question: "Is the trap allowed to set vl=0 on return?"
Can this be compliant behaviour for certain platforms?
If so, then it would be equivalent to hardware doing the same thing, and thus the actual vector hardware instruction should also be allowed this behaviour for the given platform.
This is a corollary of instruction emulation by trapping on unimplemented opcodes.
|
|
First, I am very happy that "arbitrary decisions by the micro-architecture" allow reduction of vl to any [non-zero] value, even if such appear "random".
On 2020-10-16 2:01 a.m., krste@... wrote:
- I'm sure there's probably papers out there with this already).
Exactly. I see this openness/lack of arbitrary constraint as precisely the strength of RISC-V. Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet]) or hardware (reminiscent of delayed branch because prediction was too expensive) is short-sighted.
A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases AND guarantees forward progress under SOFTWARE control.
I see it as no different [in fundamental principle] than other cases such as RVI integer divide by zero behaviour that does not trap but can be readily checked for. Also RVI integer overflow, which, if you want to check for it, is at most a few instructions including the branch.
My opinion is that baking SIMT execution model into ISA for purposes of exposing microarchitectural performance (i.e., cache misses) exposes too much of the machine, forcing application software to add extra retry loops (2nd nested loop inside of stripmining) and forcing system software to deal with complex traps.
[ Random historical connection - having a partial completion mask based on cache misses is a vector version of the Stanford proposal for "informing memory operations" where scalar core can branch on cache miss. https://dl.acm.org/doi/10.1145/232974.233000 ] Most of the benefit for SIMT execution around microarchitectural hiccups can be obtained under the hood in the microarchitecture (and there are several hundred ISCA/MICRO/HPCA papers on doing that - I might be exaggerating, but only slightly - and I know Andy worked in this space at some point), and should outperform putting this handling into software.
That said, I think it's OK to allow FF V loads to stop anywhere past element 0 including at a long-latency cache miss, mainly because it doesn't change anything in software model.
I'm not sure it will really help perf that much in practice. While it's easy to construct an example where it looks like it would help, I think in general most loops touch multiple vector operands, hardware prefetchers do well on vector streams, vector units are more efficient on larger chunks, scatter-gathers missing in cache limit perf anyway, etc., so it's probably a fairly brittle optimization (yes, you could add a predictor to figure out whether to wait for the other elements or go ahead with a partial vector result - I'm sure there's probably papers out there with this already).
Krste
On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@...> said:
| My take is the same as Andrew has outlined below. | Bill
| On 10/15/20 4:30 PM, andrew@... wrote:
| EXTERNAL MAIL | Forwarding this to tech-vector-ext; couple comments below. | On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@...> wrote:
| In vector meeting last Friday I listened to both Krste and David Horner's different opinions about fault-on-first and vector length trimming. I realized (and may have | convinced other attendees) that the RISC-V "fault-on-first" vector length trimming need not be done just for things like page-faults. | Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed, because vector element zero is the forward progress | mechanism. | Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the element zero of the operation, but that it can randomly stop with the appropriate | indication for vector length trimming at any point in the middle of the instruction. | Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values). Enumerating all cases in which the length might be | trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.
| This is part of what David Horner wants. However, it does not give him the fault-on-first with zero length complete mechanism. It could, if there were something else in | the system that guaranteed forward progress | My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization | model.
| ---+ Expanded
| From vector meeting last Friday: trimming, fault-on-first. I realized that it is similar to the forms of SW visible non-faulting speculative loads some machines, especially | VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The NaT-ness is implied by trimmed vector length. | It could be implied by a mask showing which vector operations had completed.
| All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison. Software control flow must fall through such a | check operation, and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different instructions, so there must be a branch.
| The RISC-V Fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction. i.e. it is its own non-faulting check. it gets away with | this because the trend vector length indicates which parts were valid and not. forward progress is guaranteed by trapping on vector element zero, i.e. never allowing a trim | to zero length. if a non-faulting vector approach was used instead of fault-on-first, it could return a vector complete mask, but to make forward progress it would have to | guarantee that at least one vector element had completed.
| David Horner's desire for fault-on-first that may have performed no operations at all is (1) reasonable IMHO (I think I managed to explain that the Krste), but (2) Would | require some other mechanism for forward progress. E.g. instead of trapping on element zero, the bitmask that I described above. Which is almost certainly a bigger | architectural change than RISC-V should make it this time.
| More and more, though, I am glad that I included such a completion bitmask in nearly every vector instruction set that I have ever done, particularly those vector instruction sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what amounts to a vector instruction set and microarchitecture; see https://pharr.org/matt/papers/ispc_inpar_2012.pdf.) It would be unfortunate for such a SIMT program to lose work completed after the first fault.
| MORAL: fault-on-first may be suitable for vector loads that might speculate past the end of the vector, where the length is not known, or is inconvenient to compute, when the vector load instruction is started. Fault-on-first is suboptimal for running SIMT on top of vectors. That is, fault-on-first is the equivalent of precise exceptions for in-order execution, and for a single thread executing vector instructions, whereas a completion mask allows out-of-order execution within a vector and/or vector-length threading.
| IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is totally fine for fault-on-first to return only the elements up to the first really long latency cache miss, as long as it always guarantees that at least vector element zero has completed, because completing vector element zero is what guarantees forward progress.
| Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could choose to actually take a page-fault that did copy-on-write or swapped in from disk; that would be visible to the operating system, but not to the user program. However, such an OS implementation would have to guarantee that it would not kill a process as a result of a true permissions-error page-fault. Or, if the virtual memory architecture made the distinction between permissions faults and the sorts of page-fault that are for disk swapping, copy-on-write, or copy-on-read, the OS does not need to be involved.
| EVERYTHING about fault-on-first is a microarchitectural security/information leak channel and/or a virtualization hole (unless you trim only on true faults and not on COW, COR, or disk-swap page-faults). However, fault-on-first on any page-fault is a much lower bandwidth information leak channel than fault-on-first on long latency cache misses. So a general-purpose system might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss. However, there are some systems for which that sort of security issue is not a concern, e.g. a data center or embedded system where all of the CPUs are dedicated to a single problem. In that case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to them!
| Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually much less of a virtualization hole than fault-on-first for page-faults. The operating system or hypervisor has very little control over cache misses, whereas the OS and hypervisor have almost full control over page-faults. The usual rule in security and virtualization is that an application should not be able to detect that it has had an "innocent" page-fault, such as COW, COR, or disk swapping.
| -- | --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis |
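To make the contrast between vector length trimming and a completion bitmask concrete, here is a minimal Python sketch. It is only a model under stated assumptions: the per-element `ok` flags and the function names `trimmed_vl`, `completion_mask` and `lost_work` are hypothetical and do not come from any spec or from the message above.

```python
from typing import List


def trimmed_vl(ok: List[bool]) -> int:
    """Fault-on-first style result: length of the leading run of completed elements.

    Assumption of this model: element 0 must complete, otherwise the
    instruction takes a trap (modelled here as an exception).
    """
    if not ok:
        return 0  # nothing was requested
    if not ok[0]:
        raise RuntimeError("element 0 did not complete: take the trap")
    vl = 0
    for done in ok:
        if not done:
            break
        vl += 1
    return vl


def completion_mask(ok: List[bool]) -> int:
    """Bitmask with bit i set when element i completed, wherever it sits."""
    mask = 0
    for i, done in enumerate(ok):
        if done:
            mask |= 1 << i
    return mask


def lost_work(ok: List[bool]) -> int:
    """Number of elements that completed but are discarded by trimming."""
    vl = trimmed_vl(ok)
    return sum(1 for done in ok[vl:] if done)
```

With `ok = [True, True, False, True, True, False, True, True]`, trimming keeps only the first two elements, while the mask also records the four elements that completed after the first failure. That is the "lost work" a SIMT-style program would have to redo.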
|
|

Krste Asanovic
On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
| First I am very happy that "arbitrary decisions by the
| micro-architecture" allow reduction of vl to any [non-zero] value.
| Even if such appear "random".
[...]
| A check for vl=0 on platforms that allow it is eminently doable, low
| overhead for many use cases AND guarantees forward progress under
| SOFTWARE control.

If we allowed implementation to return vl=0, how does software guarantee forward progress?

| I see it as no different [in fundamental principle] than other cases
| such as RVI integer divide by zero behaviour that does not trap but can
| be readily checked for.
| Also RVI integer overflow that if you want to check for it is at most a
| few instructions including the branch.

I don't see how these examples relate to returning vl=0 on some microarchitectural event. The examples here have results that depend only on architectural values, so can be deterministically handled.

vl=0 is more related to load-reserved/store-conditional failure, where we need to add implementation constraints to guarantee forward progress.

Krste
|
|
On 2020-10-16 10:30 a.m., krste@... wrote:
> On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
> | First I am very happy that "arbitrary decisions by the
> | micro-architecture" allow reduction of vl to any [non-zero] value.
> | Even if such appear "random".
> [...]
> | A check for vl=0 on platforms that allow it is eminently doable, low
> | overhead for many use cases AND guarantees forward progress under
> | SOFTWARE control.
>
> If we allowed implementation to return vl=0, how does software
> guarantee forward progress?
The forward progress is to advance to another task.
In the case of machine mode it can potentially "resolve" the cause of
the vl=0 return and re-execute the loop (without the overhead of the trap).
>
> | I see it as no different [in fundamental principle] than other cases
> | such as RVI integer divide by zero behaviour that does not trap but can
> | be readily checked for.
> | Also RVI integer overflow that if you want to check for it is at most a
> | few instructions including the branch.
>
> I don't see how these examples relate to returning vl=0 on some
> microarchitectural event. The examples here have results that depend
> only on architectural values, so can be deterministically handled.
The similarity is the avoidance of trap handling, when it is sufficient
to check instead register state.
>
> vl=0 is more related to load-reserved/store-conditional failure, where
> we need to add implementation constraints to guarantee forward
> progress.
Ok. I can see providing guidance as to when vl=0 is allowed, but not to
exclude it outright.
> Krste
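The forward-progress question in this exchange can be illustrated with a small Python model. It is a sketch, not a proposal: `process`, `at_least_one`, `always_zero` and the `max_zero_retries` policy are hypothetical names and choices made for illustration. The load is modelled as a callback that reports how many elements completed.

```python
from typing import Callable


def process(n: int, ff_load: Callable[[int, int], int],
            max_zero_retries: int = 3) -> int:
    """Consume n elements using a load that reports how many elements completed.

    Returns the number of load attempts. If the load keeps returning zero,
    the loop gives up after max_zero_retries consecutive zeros; that is the
    point where software would have to do something else (switch task, fall
    back to a trapping load, ...).
    """
    done = 0
    attempts = 0
    zero_streak = 0
    while done < n:
        vl = ff_load(done, n - done)
        attempts += 1
        if vl == 0:
            # Nothing completed: only software policy bounds the retries.
            zero_streak += 1
            if zero_streak > max_zero_retries:
                raise RuntimeError("no forward progress: vl=0 returned repeatedly")
            continue
        zero_streak = 0
        done += vl
    return attempts


def at_least_one(start: int, remaining: int) -> int:
    """Worst-case trimming under the 'element 0 completes or traps' rule."""
    return 1


def always_zero(start: int, remaining: int) -> int:
    """A load that is allowed to return vl=0 and always does so."""
    return 0
```

Under the vl >= 1 rule the loop finishes in at most `n` attempts, however aggressively the implementation trims. With vl=0 allowed, termination depends on something outside the loop, which is the same shape as the load-reserved/store-conditional retry problem mentioned above.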
|
|

Roger Espasa
Here's a question for the group. I did it as a picture; hopefully it will go through the mailing list:
[image.png: attachment not preserved in this archive]
|
|
The way the discussion has been going, I think either would be permissible. Not only that, but it would have been permissible for element 9 already to have been overwritten with 1's (if vma allows it).
I think bringing this up is good as we need to be sure what precisely we mean by the v*ff instructions.
Bill
|
|

Roger Espasa
Here's where the "implementation" cost comes in (at least in our implementation; others, of course, may have more clever ways of doing this)
-> If you pick "vl=3", then the vstart and vltrim calculations can be made one and the same
-> If you pick "vl=6" then the vstart and vltrim calculations are not exactly equal and vltrim needs a LZC on the mask for the elements within the line followed by an adder. At SEW=8b, there can be lots of elements within a line...
roger.
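The two trim points can be sketched in Python. The picture is not available in the archive, so the layout used here is an assumption chosen only so that the two candidates come out as 3 and 6: the faulting line starts at element 3, and elements 3 to 5 are masked off. The function name `trim_candidates` and its parameters are hypothetical.

```python
from typing import List, Tuple


def trim_candidates(mask: List[bool], line_first: int,
                    line_elems: int) -> Tuple[int, int]:
    """Return (vl_at_line_start, vl_at_first_active) for a fault in one line.

    mask[i] is True when element i is active. line_first is the index of the
    first element held in the faulting line and line_elems is the number of
    elements in that line.
    """
    in_line = mask[line_first:line_first + line_elems]
    # Count the inactive elements at the start of the line. Hardware would use
    # a leading-zero counter (LZC) over the mask bits belonging to the line.
    leading_inactive = 0
    for active in in_line:
        if active:
            break
        leading_inactive += 1
    if leading_inactive == len(in_line):
        raise ValueError("no active element in this line, so it cannot fault")
    # First candidate: trim at the line boundary (no extra logic needed).
    # Second candidate: line start plus the LZC result (the extra adder).
    return line_first, line_first + leading_inactive
```

For a mask with elements 0 to 2 active, 3 to 5 inactive and 6 active, with a faulting line that starts at element 3, this returns `(3, 6)`. The first value needs no per-line mask scan; the second needs the count plus an add, and at SEW=8b the scan covers many elements per line.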
|
|

Roger Espasa
Bill you said element 9, but did you mean element labeled "a" which is the 11th element in the vector? (I agree with that). However, I would NOT agree that a masked out element has been written, even if past the failing point.
roger.
|
|
Roger,
I think it's an implementation choice whether vl is trimmed to 3 or 6 (or theoretically other values). I don't know a reason why the implementation couldn't always trim vl to the same value that vstart would have been set to if the exception were being taken. Does anyone know such a reason? It seems simplest to me always to trim vl to the value vstart would have been set to.
I meant element 9. If vma=1, then inactive elements can be undisturbed or set to 1's. Element 'a' couldn't have been loaded in the case described because it was in a line with a fault. In general, I think our discussions would have allowed element 'a' to be written if there were some other reason for trimming vl.
Bill
|
|

Krste Asanovic
As you get to pick where vl is trimmed, you would probably choose the vl=3 case here to simplify implementation.
Krste
|
|

Roger Espasa
We're all in agreement that if the spec says "pick where you stop" we'd all pick to trim to VL=3. I was under the impression this was not yet closed (in light of the "stop at cache misses" discussion), but I sense everyone else is already in the "pick where you stop" camp.
Speaking of which, did we ever close on whether vleff could trim even when there was no fault (i.e., just because there's a cache miss for example)? If the answer is "yes, you can arbitrarily stop on any element other than element 0", can someone show a while loop and how the compiler would then use vleff? I'm not seeing how they would use it, other than enclose the vleff loop in a second loop to make sure that "the index variable has reached the limit" (i.e., i<n, make sure that vleff has run enough times so that i has reached n).
roger.
|
|
I don't think the cases where there was no fault look any different to software than the fault cases. Either can happen anywhere and the while loop may continue. The while loop isn't ended by a trimmed vl, it's ended by data it sees.
Bill
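As an illustration of a while loop that is ended by data rather than by the trimmed vl, here is a strlen-style sketch in Python. It is a behavioural model only, not compiler output: `ff_load`, `strlen_ff` and the `trim` callback (which stands in for any reason an implementation might shorten the load) are hypothetical names, and the end of the `mem` buffer plays the role of the first faulting address.

```python
from typing import Callable, List


def ff_load(mem: bytes, addr: int, avl: int,
            trim: Callable[[int], int]) -> List[int]:
    """Model of a unit-stride fault-on-first byte load.

    Element 0 must be readable, otherwise the access traps (IndexError here).
    Beyond that, the result is cut before the first unreadable byte and may be
    cut further by `trim`, but never below one element.
    """
    if addr >= len(mem):
        raise IndexError("element 0 faulted: take the trap")
    vl = min(avl, len(mem) - addr)    # stop before the first faulting element
    vl = max(1, min(vl, trim(vl)))    # arbitrary trimming, never below 1
    return list(mem[addr:addr + vl])


def strlen_ff(mem: bytes, start: int, vlmax: int,
              trim: Callable[[int], int]) -> int:
    """Length of the NUL-terminated string that begins at `start`."""
    addr = start
    while True:
        data = ff_load(mem, addr, vlmax, trim)
        vl = len(data)                # the (possibly trimmed) vl
        # The exit test looks only at the loaded data: index of the first zero
        # byte, standing in for a compare plus a find-first on the mask.
        first_zero = next((i for i, b in enumerate(data) if b == 0), -1)
        if first_zero >= 0:
            return addr + first_zero - start
        addr += vl                    # the trimmed vl only sets the stride
```

Calling `strlen_ff(b"hello, vector\0", 0, 8, trim)` gives the same length whether `trim` leaves vl alone, cuts it to 3, or cuts it to 1; only the number of iterations changes. No second enclosing loop is needed, because the loop already repeats until the data says to stop.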
|
|

Roger Espasa
So all the vleff use cases then end up using a vmpopc of some sort to determine the exit condition, and never use the trimmed VL (other than, of course, to control how many elements should be operated upon within the while loop)? Do the compiler folks on the list agree that's the only use of vleff?
roger.
On Fri, Oct 16, 2020 at 7:53 PM Bill Huffman < huffman@...> wrote:
I don't think the cases where there was no fault look any different to software than the fault cases. Either can happen anywhere and the while loop may continue. The while loop isn't ended by a trimmed vl, it's ended by data
it sees.
Bill
On 10/16/20 10:47 AM, Roger Espasa wrote:
We're all in agreement that if the spec says "pick where you stop" we'd all pick to trim to VL=3. I was under the impression this was not yet closed (in light of the "stop at cache misses" discussion), but I sense everyone else is already on
the "pick where you stop" camp.
Speaking of which, did we ever close on whether vleff could trim even when there was no fault (i.e., just because there's a cache miss for example)?
If the answer is "yes, you can arbitrarily stop on any element other than element 0", can someone show a while loop and how the compiler would then use vleff? I'm not seeing how they would use it, other than enclose the vleff loop in a second loop to make
sure that "the index variable has reached the limit" (i.e., i<n, make sure that vleff has run enough times so that i has reached n).
roger.
On Fri, Oct 16, 2020 at 7:33 PM < krste@...> wrote:
As you get to pick where vl is trimmed, you would probably choose the
vl=3 case here to simplify implementation.
Krste
>>>>> On Fri, 16 Oct 2020 18:59:55 +0200, Roger Espasa <roger.espasa@...> said:
| Bill you said element 9, but did you mean element labeled "a" which is the 11th element in the vector? (I agree with that).
| However, I would NOT agree that a masked out element has been written, even if past the failing point.
| roger.
| On Fri, Oct 16, 2020 at 6:57 PM Roger Espasa <roger.espasa@...> wrote:
| Here's where the "implementation" cost comes in (at least in our implementation; others, of course, may have more clever ways of doing this)
-| If you pick "vl=3", then the vstart and vltrim calculations can be made one and the same
-| If you pick "vl=6" then the vstart and vltrim calculations are not exactly equal and vltrim needs a LZC on the mask for the elements within the line
| followed by an adder. At SEW=8b, there can be lots of elements within a line...
| roger.
| On Fri, Oct 16, 2020 at 6:31 PM Bill Huffman <huffman@...> wrote:
| The way the discussion has been going, I think either would be permissible. Not only that, but it would have been permissible for element 9 already
| to have been overwritten with 1's (if vma allows it).
| I think bringing this up is good as we need to be sure what precisely we mean by the v*ff instructions.
| Bill
| On 10/16/20 8:57 AM, Roger Espasa wrote:
| Here's a question for the group: I did it as a picture... hopefully it will go through the mailing list:
| image.png
| On Fri, Oct 16, 2020 at 4:56 PM David Horner <ds2horner@...> wrote:
| On 2020-10-16 10:30 a.m., krste@... wrote:
||
||||||| On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
|| | First I am very happy that "arbitrary decisions by the
|| | micro-architecture" allow reduction of vl to any [non-zero] value.
||
|| | Even if such appear "random".
|| [...]
|| | A check for vl=0 on platforms that allow it is eminently doable, low
|| | overhead for many use cases AND guarantees forward progress under
|| | SOFTWARE control.
||
|| If we allowed implementation to return vl=0, how does software
|| guarantee forward progress?
| The forward progress is to advance to another task.
| In the case of machine mode it can potentially "resolve" the cause of
| the vl=0 return and re-execute the loop (without the overhead of the trap).
||
|| | I see it as no different [in fundamental principle] than other cases
|| | such as RVI integer divide by zero behaviour that does not trap but can
|| | be readily checked for.
|| | Also RVI integer overflow that if you want to check for it is at most a
|| | few instructions including the branch.
||
|| I don't see how these examples relate to returning vl=0 on some
|| microarchitectural event. The examples here have results that depend
|| only on architectural values, so can be deterministically handled.
| The similarity is the avoidance of trap handling, when it is sufficient
| to check instead register state.
||
|| vl=0 is more related to load-reserved/store-conditional failure, where
|| we need to add implementation constraints to guarantee forward
|| progress.
| Ok. I can see providing guidance as to when vl=0 is allowed, but not to
| exclude it outright.
|| Krste
| x[DELETED ATTACHMENT image.png, PNG image]
|
|

Krste Asanovic
Here's the strlen example from spec:

    .text
    .balign 4
    .global strlen
# size_t strlen(const char *str)
# a0 holds *str

strlen:
    mv a3, a0                       # Save start
loop:
    vsetvli a1, x0, e8,m8, ta,ma    # Vector of bytes of maximum length
    vle8ff.v v8, (a3)               # Load bytes
    csrr a1, vl                     # Get bytes read
    vmseq.vi v0, v8, 0              # Set v0[i] where v8[i] = 0
    vfirst.m a2, v0                 # Find first set bit
    add a3, a3, a1                  # Bump pointer
    bltz a2, loop                   # Not found?

    add a0, a0, a1                  # Sum start + bump
    add a3, a3, a2                  # Add index
    sub a0, a3, a0                  # Subtract start address+bump

    ret

This exits when vfirst.m returns non-negative (i.e., something triggered exit condition) - the vfirst.m instruction can early-out when exit found (though vle8ff/vmseq will still have to run to completion).

Krste

On Fri, 16 Oct 2020 20:04:15 +0200, Roger Espasa <roger.espasa@...> said:
| So all the vleff use cases end up then using a vmpopc of some sort to determine the exit condition and never use the trimmed VL ? (other than, of course, to
| control within the while how many elements should be operated upon). Do the compiler folks on the list agree that's the only use of vleff?
| roger.
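To make the exit condition of the strlen loop concrete, the following is a scalar Python model of it, offered as a sketch only. The helper names (`vle8ff`, `vfirst`, `strlen_ff`), the `vlmax=16` default, the use of `MemoryError` to represent a trap, and the seeded random generator standing in for implementation-chosen trimming are all assumptions for illustration; none of them come from the spec. Running off the end of the buffer plays the role of a fault: on element 0 it traps, on any later element it only shortens vl.

```python
import random


def vle8ff(mem, addr, vlmax, rng):
    """Fault-on-first byte load model.

    An access at or past len(mem) stands in for a fault: on element 0 it
    traps (raises); on later elements it only trims vl.  The model may also
    trim further for no visible reason, but never below 1.
    """
    if addr >= len(mem):
        raise MemoryError("trap: fault on element 0")
    limit = min(vlmax, len(mem) - addr)   # trim at the fault boundary
    vl = rng.randint(1, limit)            # arbitrary extra trimming
    return mem[addr:addr + vl], vl


def vfirst(mask):
    """Index of the first set mask bit, or -1 if none is set."""
    for i, bit in enumerate(mask):
        if bit:
            return i
    return -1


def strlen_ff(mem, start, vlmax=16, seed=0):
    rng = random.Random(seed)
    a3 = start
    while True:
        v8, a1 = vle8ff(mem, a3, vlmax, rng)   # vle8ff.v, then csrr a1, vl
        a2 = vfirst([b == 0 for b in v8])      # vmseq.vi, then vfirst.m
        if a2 >= 0:                            # exit is decided by the data
            return a3 + a2 - start
        a3 += a1                               # bump by elements actually loaded
```

For example, `strlen_ff(b"hello, vector world\0", 0, seed=s)` gives the same length for every seed, which is the sense in which a trimmed vl is invisible to the loop's result: it changes the number of iterations, not the answer.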
[DH]: I see this openness/lack of arbitrary constraint as
precisely the strength of RISCV.
Limiting vector operations due to current constraints in
software (Linux does it this way, compilers cannot optimize that
formulation[yet])
or hardware (reminiscent of delayed branch because prediction
was too expensive) is short sighted.
A check for vl=0 on platforms that allow it is eminently doable,
low overhead for many use cases AND guarantees forward progress
under SOFTWARE control.
Sure. You could guarantee forward progress, e.g. by allowing no
more than 10 successive "first fault" with VL=0, and requiring
trap on element zero on the 11th. That check could be in
hardware, or it could be in the software that's calling the FF
instruction.
But this does not need to be in the RISC-V architectural
standard. Not yet.
As long as the VL=0 encoding is free, not used for some other
purpose, you can do that in your implementation.
Your implementation might not be able to pass the RISC-V
architectural tests for FF, which I assume will probably assert an
error if they find an FF with VL=0. But if your hardware has a
chicken bit to reduce the threshold of VL=0 FFs to zero, or if you
have a binary translator from the compliance tests to your
software-guaranteed forward progress, sure.
Build something like that, show that a lot of people want
it, and a few years from now we can put it into a future version
of the vector standard.
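The bounded-retry idea (no more than 10 successive VL=0 results, with element zero forced to complete or trap on the 11th attempt) could be sketched in software roughly as follows. This is a hypothetical wrapper, not anything the spec defines: `ff_load` stands for a fault-on-first load that is allowed to return vl=0, `trapping_load` stands for a load that either completes element 0 or traps, and both callables, along with the name `load_with_bounded_zero`, are assumptions made for the example.

```python
MAX_ZERO_TRIMS = 10  # threshold taken from the example in the message above


def load_with_bounded_zero(ff_load, trapping_load, addr, avl):
    """Try a vl=0-capable fault-on-first load up to MAX_ZERO_TRIMS times.

    If every attempt comes back with vl == 0, fall back to a load that must
    either complete element 0 or trap, which restores forward progress.
    """
    for _ in range(MAX_ZERO_TRIMS):
        data, vl = ff_load(addr, avl)
        if vl > 0:
            return data, vl
    # 11th attempt: no more zero-length results are tolerated.
    return trapping_load(addr, avl)
```

Because the retry count lives in an ordinary local variable, it is saved and restored with the rest of the thread's state. A hardware counter doing the same job would need that state to be kept somewhere across the successive instructions.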
On 10/16/2020 4:48, David Horner wrote:
First I am very happy that "arbitrary decisions by the micro-architecture"
allow reduction of vl to any [non-zero] value.
Even if such appear "random".
On 2020-10-16 2:01 a.m., krste@... wrote:
- I'm sure there's probably
papers out there with this already).
Exactly.
I see this openness/lack of arbitrary constraint as precisely
the strength of RISCV.
Limiting vector operations due to current constraints in software
(Linux does it this way, compilers cannot optimize that
formulation[yet])
or hardware (reminiscent of delayed branch because prediction was
too expensive) is short sighted.
A check for vl=0 on platforms that allow it is eminently doable,
low overhead for many use cases AND guarantees forward progress
under SOFTWARE control.
I see it as no different [in fundamental principle] than other
cases such as RVI integer divide by zero behaviour that does not
trap but can be readily checked for.
Also RVI integer overflow that if you want to check for it is at
most a few instructions including the branch.
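For reference, the two scalar cases mentioned here can be modelled in a few lines. The sketch below is Python, assumes XLEN=32, and uses made-up helper names (`to_signed`, `add`, `add_overflowed`, `div`). It follows the base ISA behaviour as I understand it: ADD wraps without trapping, signed DIV by zero returns -1 (all bits set), and the signed overflow case of DIV returns the dividend. The overflow check mirrors the usual `add` / `slti` / `slt` / `bne` sequence for a general signed add.

```python
XLEN = 32
MASK = (1 << XLEN) - 1


def to_signed(x):
    """Interpret the low XLEN bits of x as a two's-complement value."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def add(a, b):
    """ADD wraps silently; there is no trap on overflow."""
    return to_signed(a + b)


def add_overflowed(a, b):
    """Software check in the shape of:
        add  t0, t1, t2
        slti t3, t2, 0
        slt  t4, t0, t1
        bne  t3, t4, overflow
    """
    s = add(a, b)
    return (b < 0) != (s < a)


def div(a, b):
    """Signed DIV: no trap; the special cases return fixed values."""
    if b == 0:
        return -1                       # division by zero: all bits set
    if a == -(1 << (XLEN - 1)) and b == -1:
        return a                        # signed overflow: result is the dividend
    q = abs(a) // abs(b)                # round toward zero
    return -q if (a < 0) != (b < 0) else q
```

In both cases the result depends only on the architectural operands, so the check is deterministic and software can include or omit it as needed.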
(sending replies to vector list - as this is off topic for CMOs)

My opinion is that baking SIMT execution model into ISA for purposes
of exposing microarchitectural performance (i.e., cache misses)
exposes too much of the machine, forcing application software to add
extra retry loops (2nd nested loop inside of stripmining) and forcing
system software to deal with complex traps.

[ Random historical connection - having a partial completion mask based
on cache misses is a vector version of the Stanford proposal for
"informing memory operations" where scalar core can branch on cache miss.
https://dl.acm.org/doi/10.1145/232974.233000 ]

Most of the benefit for SIMT execution around microarchitectural
hiccups can be obtained under the hood in the microarchitecture (and
there are several hundred ISCA/MICRO/HPCA papers on doing that - I
might be exaggerating, but only slightly - and I know Andy worked in
this space at some point), and should outperform putting this handling
into software.

That said, I think it's OK to allow FF V loads to stop anywhere past
element 0 including at a long-latency cache miss, mainly because it
doesn't change anything in software model.

I'm not sure it will really help perf that much in practice. While
it's easy to construct an example where it looks like it would help, I
think in general most loops touch multiple vector operands, hardware
prefetchers do well on vector streams, vector units are more efficient
on larger chunks, scatter-gathers missing in cache limit perf anyway,
etc., so it's probably a fairly brittle optimization (yes, you could
add a predictor to figure out whether to wait for the other elements
or go ahead with a partial vector result - I'm sure there's probably
papers out there with this already).
Krste
On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@...> said:
| My take is the same as Andrew has outlined below.
| Bill
| On 10/15/20 4:30 PM, andrew@... wrote:
| Forwarding this to tech-vector-ext; couple comments
below.
| On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5
<andy.glew@...> wrote:
| In vector meeting last Friday I listened to both
Krste and David Horner's different opinions about
fault-on-first and vector length trimming. I realized (and may
have
| convinced other attendees) that the RISC-V
"fault-on-first" vector length trimming need not be done just
for things like page-faults.
| Fault-on-first could be done for the first
long latency cache miss, as long as vector element zero has been
completed, because vector element zero is the forward progress
| mechanism.
| Indeed, IMHO the correct semantic requirement
for fault-on-first is that it completes the element zero of the
operation, but that it can randomly stop with the appropriate
| indication for vector length trimming at any point in
the middle of the instruction.
| Indeed, I've found other microarchitectural
reasons to favor this approach (e.g., speculating through
mask-register values). Enumerating all cases in which the
length might be
| trimmed seems like a fool's errand, so just saying it can
be truncated to >= 1 for any reason is the way to go.
| This is part of what David Horner wants. However, it
does not give him the fault-on-first with zero length complete
mechanism. It could, if there were something else in
| the system that guaranteed forward progress
| My take is that requiring that element 0 either
complete or trap is already a solid mechanism for guaranteeing
forward progress, and cleanly matches the while-loop
vectorization
| model.
| ---+ Expanded
| From vector meeting last Friday: trimming,
fault-on-first. I realized that it is similar to the forms of
SW visible non-faulting speculative loads some machines,
especially
| VLIWs, have. However, instead of delivering a NaN or
NaT, it is non-faulting except for vector element 0, where it
faults. The NaT-ness is implied by trimmed vector length.
| It could be implied by a mask showing which vector
operations had completed.
| All such SW non-faulting loads need a "was this
correct" operation, which might just be a faulting load and a
comparison. Software control flow must fall through such a
| check operation, and through a redo of the faulting
load if necessary. In scalar, non-faulting and faulting loads
are different instructions, so there must be a branch.
| The RISC-V Fault-on-first approach has the
correctness check for non-faulting implied by redoing the
instruction. i.e. it is its own non-faulting check. it gets
away with
| this because the trimmed vector length indicates which
parts were valid and not. forward progress is guaranteed by
trapping on vector element zero, i.e. never allowing a trim
| to zero length. if a non-faulting vector approach
was used instead of fault-on-first, it could return a vector
complete mask, but to make forward progress it would have to
| guarantee that at least one vector element had
completed.
| David Horner's desire for fault-on-first that may have
performed no operations at all is (1) reasonable IMHO (I think
I managed to explain that to Krste), but (2) would
| require some other mechanism for forward progress.
E.g. instead of trapping on element zero, the bitmask that I
described above. Which is almost certainly a bigger
| architectural change than RISC-V should make at this
time.
| Although more and more I am happier that I included
such a completion bitmask in nearly every vector instruction set
that I've ever done. Particularly those vector instruction
| sets that were supposed to implement SIMT efficiently.
(I think of SIMT as a programming model that is implemented on
top of what amounts to a vector instruction set and
| microarchitecture.
https://pharr.org/matt/papers/ispc_inpar_2012.pdf ). It would
be unfortunate for such an SIMT program to lose work completed
after the first fault.
| MORAL: fault-on-first may be suitable for vector load
that might speculate past the end of the vector - where the
length is not known or inconvenient when the vector load
| instruction is started. Fault-on-first is suboptimal
for running SIMT on top of vectors. i.e. fault-on-first is
the equivalent of precise exceptions for in order
| execution, and for a single thread executing vector
instructions, whereas completion mask allows out of order
within a vector and/or vector length threading.
| IMHO an important realization I made in that meeting
is that fault-on-first does not need to be just about faulting.
It is totally fine to have the fault-on-first stuff
| return up to the first really long latency cache miss,
as long as it always guarantees that at least vector element
zero was complete. Because vector element zero complete
| is what guarantees forward progress.
| Furthermore, it is not even required that
fault-on-first stop at the first page-fault. An implementation
could actually choose to actually implement a page-fault that
did
| copy-on-write or swapped in from disk. but that
would be visible to the operating system, not the user program.
However, such an OS implementation would have to
| guarantee that it would not kill a process as a
result of a true permissions error page-fault. Or, if the
virtual memory architecture made the distinction between
| permissions faults and the sorts of page-fault that is
for disk swapping or copy-on-write or copy on read, the OS
does not need to be involved.
| EVERYTHING about fault-on-first is a
microarchitecture security/information leak channel and/or a
virtualization hole. (Unless you trim only on true faults
and not COW
| or COR or disk-swap page-faults). However,
fault-on-first on any page-fault is a much lower bandwidth
information leak channel than is fault-on-first on long
latency
| cache misses. so a general purpose system might
choose to implement fault-on-first on any page-fault, but might
not want to implement fault-on-first on any cache miss.
| However, there are some systems for which that sort of
security issue is not a concern. E.g. a data center or embedded
system where all of the CPUs are dedicated to a single
| problem. In which case, if they can gain performance
by doing fault-on-first on particular long latency cache misses,
power to them!
| Interestingly, although fault-on-first on long latency
cache misses is a high-bandwidth information leak, it is
actually much less of a virtualization hole than
| fault-on-first for page-faults. The operating system
or hypervisor has very little control over cache misses. the OS
and hypervisor have almost full control over
| page-faults. The usual rule in security and
virtualization is that an application should not be able to
detect that it has had an "innocent" page-fault, such as COW or
COR
| or disk swapping.
| --
| --- Sorry: Typos (Speech-Os?) Writing Errors <=
Speech Recognition <= Computeritis
--
---
Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition
<= Computeritis
|
|
On 2020-10-17 6:49 p.m., krste@... wrote: - [tech-cmo] so they don't get bothered with this off-topic discussion
On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@...> said:
| [DH]: I see this openness/lack of arbitrary constraint as precisely the strength of RISCV. | Limiting vector operations due to current constraints in software (Linuz does it this way, compilers cannot optimize that formulation[yet]) | or hardware (reminiscent of delayed branch because prediction was too expensive) is short sighted. | A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases AND guarantees forward progress under | SOFTWARE control. | Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero | on the 11th. That check could be in hardware, or it could be in | the software that's calling the FF instruction.
I don't want us to rathole on how to guarantee forward progress for vl=0 case, but do want to note that this kind of forward progress is nasty to guarantee, implying there's long-lasting microarch state to keep around - what if you're context swapped out before you get to the 11th? Do you have to force the first one after a context swap to not trim? What if there's a sequence of ff's and second one goes back to vl=0?

Krste: I gather your answer is more in the context of lr/sc type forward guarantees, instructions that are designed not to trap when delivering on their primary function. So I agree that determining an appropriate deferred trap is problematic.

However, the intent of vl*ff is to operate in an environment anticipating exception behaviour. It is the instruction's raison d'être. If the full vl is expected to always be returned [with very, very few exceptions] we would not have this instruction, but rather direct the EE to reduce vl or abort the process.

So rather than a rathole we have the Elephant-In-The-Room. What does the EE do when deferred forward progress is not possible? Given that the application is anticipating "trouble" with the read memory access, does it make sense to only address the "safe" case?

With float exceptions RISCV does not provide trap handlers, but rather FFlags for the application to electively check. With integer overflow or zero divide RISCV does not provide trap handlers, but requires the application to include code to detect the condition.

Trap handlers for vl*ff are only incidental. They are no more special to vl*ff than any other of the vl*, or the RVI lw,lh,lb, etc. In apparent contradiction to the spec, a valid implementation can "trap" as it would for the non-ff but not service the fault, only reduce the vl accordingly until the fault occurs on the first element. Thus central to the functioning of the instruction is what happens when the fault occurs on the first element. Punting to the handler is not an answer.

Return at least one element or trap does not define the operational characteristics [even if it may arguably be an ISA architectural answer]. There is nothing prohibiting the trap from returning vl=0. And I argue that EEs will indeed elect to do that when there can be no forward progress [e.g. the requested address is mapped execute only]. Platforms will stipulate a behaviour and vl=0 will be a choice.

What we should try to address is how to allow greatest portability and least software fragmentation. I believe this should be accomplished exactly as was effected for the integer overflow. Exclude the checking code if you do not need it, and include it if you are not assured that it is superfluous. In other words, vl=0 must be handled, either by avoidance or explicitly as an indication that there is nothing to process.

| But this does not need to be in the RISC-V architectural standard. Not yet.
Let's agree on this and move on. There is no value in ignoring the issue.
| As long as the VL=0 encoding is free, not used for some other purpose, you can do that in your implementation.
| Your implementation might not be able to pass the RISC-V architectural for FF, which I assume will probably assert an error if they find FF and | with VL=0. but if your hardware has a chicken bit to reduce the threshold of VL= FFs to zero, or if you have a binary translator from the | compliance tests to your software guaranteed forward progress, sure.
| Build something like that, so there's a lot of people who want it, and a few years from now we can put it into a future version of the vector | standard.
To be clear, if this is ever done, it will be with a separate encoding, not expanding behavior of current instructions. Returning vl=0 is not a "free" part of encoding. Software might rightly want to take advantage of knowing vl>0 so you cannot allow same instruction to return vl=0 after the fact, so need a different opcode/mode.
And it is precisely because this backward compatibility is not managed if we tactically ignore vl=0 that we must address it, and allow vl=0 for V1.0.
Krste