[DH]: I see this openness/lack of arbitrary constraint as precisely the strength of RISC-V.
Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet]) or hardware (reminiscent of delayed branches, adopted because prediction was too expensive) is short-sighted.
A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases, AND guarantees forward progress under SOFTWARE control.
Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" results with VL=0, and requiring a trap on element zero on the 11th. That check could be in hardware, or it could be in the software that's calling the FF instruction.
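As a minimal sketch of the software-side variant, the wrapper below models the idea in plain C. The names are mine and the FF load itself is a purely hypothetical stub (ff_load_may_trim) standing in for a vle*ff.v on an implementation that is permitted to return vl=0; the threshold of 10 and the scalar fallback are illustrative, not part of any spec.

#include <stddef.h>

/* Hypothetical stand-in for issuing a fault-on-first load on an
 * implementation allowed to return vl = 0 on a transient hiccup.
 * This toy stub returns 0 a few times, then completes up to 8 elements. */
static size_t ff_load_may_trim(const char *base, size_t avl) {
    static int hiccups = 3;
    (void)base;
    if (hiccups > 0) { hiccups--; return 0; }    /* nothing completed */
    return avl < 8 ? avl : 8;                    /* partial/full completion */
}

/* Software-guaranteed forward progress: tolerate at most MAX_VL0_RETRIES
 * consecutive vl = 0 results, then force the issue for element 0 by
 * touching it with a scalar load, which either completes or traps. */
#define MAX_VL0_RETRIES 10

size_t ff_load_with_progress(const char *base, size_t avl) {
    for (int tries = 0; tries < MAX_VL0_RETRIES; ++tries) {
        size_t vl = ff_load_may_trim(base, avl);
        if (vl != 0)
            return vl;                           /* forward progress made */
    }
    char c = *(volatile const char *)base;       /* scalar access: completes or traps */
    (void)c;
    return 1;   /* element 0 is now known accessible; a real wrapper would also
                 * deliver the element-0 data to the caller */
}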
But this does not need to be in the RISC-V architectural
standard. Not yet.
As long as the VL=0 encoding is free, not used for some other
purpose, you can do that in your implementation.
Your implementation might not be able to pass the RISC-V architectural tests for FF, which I assume will probably assert an error if they find an FF instruction completing with VL=0. But if your hardware has a chicken bit to reduce the threshold of VL=0 FFs to zero, or if you have a binary translator from the compliance tests to your software-guaranteed forward progress, sure.
Build something like that, show there are a lot of people who want it, and a few years from now we can put it into a future version of the vector standard.
On 10/16/2020 4:48, David Horner wrote:
First I
am very happy that "arbitrary decisions by the micro-architecture"
allow reduction of vl to any [non-zero] value.
Even if such appear "random".
On 2020-10-16 2:01 a.m., krste@... wrote:
- I'm sure there's probably
papers out there with this already).
Exactly.
I see this openness/lack of arbitrary constraint as precisely the strength of RISC-V.
Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet]) or hardware (reminiscent of delayed branches, adopted because prediction was too expensive) is short-sighted.
A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases, AND guarantees forward progress under SOFTWARE control.
I see it as no different [in fundamental principle] from other cases, such as the RVI integer divide-by-zero behaviour, which does not trap but can be readily checked for.
Likewise RVI integer overflow, where the check, if you want it, is at most a few instructions including the branch.
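For reference, those scalar checks look roughly like the C sketch below (helper names are mine). It relies on the RVI rules that integer division never traps (x/0 returns all ones, and the INT_MIN/-1 overflow case returns the dividend), and the overflow test mirrors the add/slti/slt/bne sequence suggested in the RISC-V ISA manual.

#include <stdint.h>

/* RVI integer division never traps: x / 0 returns all ones (-1 signed) and
 * INT64_MIN / -1 returns INT64_MIN.  A caller that cares simply checks
 * first - one compare and branch per special case. */
int64_t checked_div(int64_t a, int64_t b, int *err) {
    if (b == 0)                    { *err = 1; return -1; }
    if (a == INT64_MIN && b == -1) { *err = 1; return a;  }
    *err = 0;
    return a / b;
}

/* Signed-add overflow check in a handful of instructions plus a branch:
 * the sum overflows iff (b < 0) differs from (a + b < a), the same test
 * as the add/slti/slt/bne sequence in the ISA manual. */
int add_would_overflow(int64_t a, int64_t b) {
    int64_t sum = (int64_t)((uint64_t)a + (uint64_t)b);  /* wrapping add, no UB */
    return (b < 0) != (sum < a);
}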
(sending replies to vector list - as this
is off topic for CMOs)
My opinion is that baking the SIMT execution model into the ISA for
purposes of exposing microarchitectural performance (i.e., cache misses)
exposes too much of the machine, forcing application software to add
extra retry loops (a 2nd nested loop inside of stripmining) and forcing
system software to deal with complex traps.
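For concreteness, the extra loop nesting being described might look something like the C sketch below. ff_gather() is a purely hypothetical FF-style operation that may complete only a prefix of the requested elements (e.g., stopping at a long-latency miss); the outer loop stripmines while the inner loop retries until the chunk is consumed. All names and the chunk size are illustrative.

#include <stddef.h>

/* Hypothetical FF-style gather that may complete only a prefix of the
 * requested elements.  This toy stand-in completes at most 3 elements per
 * call and always at least 1, so it still provides forward progress. */
static size_t ff_gather(double *dst, const double *src,
                        const size_t *idx, size_t n) {
    size_t done = (n > 3) ? 3 : n;
    for (size_t k = 0; k < done; ++k)
        dst[k] = src[idx[k]];
    return done;
}

void gather_all(double *dst, const double *src, const size_t *idx, size_t n) {
    const size_t vlmax = 8;                      /* illustrative chunk size */
    for (size_t i = 0; i < n; i += vlmax) {      /* outer stripmine loop    */
        size_t chunk = (n - i < vlmax) ? (n - i) : vlmax;
        size_t done = 0;
        while (done < chunk)                     /* extra inner retry loop  */
            done += ff_gather(dst + i + done, src, idx + i + done,
                              chunk - done);
    }
}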
[ Random historical connection - having a partial completion
mask based
on cache misses is a vector version of the Stanford
proposal for
"informing memory operations" where scalar core can branch
on cache miss.
https://dl.acm.org/doi/10.1145/232974.233000 ]
Most of the benefit for SIMT execution around
microarchitectural
hiccups can be obtained under the hood in the microarchitecture
(and
there are several hundred ISCA/MICRO/HPCA papers on doing that -
I
might be exaggerating, but only slightly - and I know Andy
worked in
this space at some point), and should outperform putting this
handling
into software.
That said, I think it's OK to allow FF V loads to stop anywhere past
element 0, including at a long-latency cache miss, mainly because it
doesn't change anything in the software model.
I'm not sure it will really help perf that much in practice.
While
it's easy to construct an example where it looks like it would
help, I
think in general most loops touch multiple vector operands,
hardware
prefetchers do well on vector streams, vector units are more
efficient
on larger chunks, scatter-gathers missing in cache limit perf
anyway,
etc., so it's probably a fairly brittle optimization (yes, you
could
add a predictor to figure out whether to wait for the other
elements
or go ahead with a partial vector result - I'm sure there's
probably
papers out there with this already).
Krste
On Fri, 16 Oct 2020 04:03:17
+0000, "Bill Huffman" <huffman@...>
said:
| My take is the same as Andrew has outlined below.
| Bill
| On 10/15/20 4:30 PM, andrew@... wrote:
| EXTERNAL MAIL
| Forwarding this to tech-vector-ext; couple comments
below.
| On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5
<andy.glew@...> wrote:
| In the vector meeting last Friday I listened to both Krste's and David Horner's different opinions about fault-on-first and vector length trimming. I realized (and may have convinced other attendees) that the RISC-V "fault-on-first" vector length trimming need not be done just for things like page-faults.
| Fault-on-first could be done for the first long-latency cache miss, as long as vector element zero has been completed, because vector element zero is the forward progress mechanism.
| Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes element zero of the operation, but that it can randomly stop, with the appropriate indication for vector length trimming, at any point in the middle of the instruction.
| Indeed, I've found other microarchitectural
reasons to favor this approach (e.g., speculating through
mask-register values). Enumerating all cases in which the
length might be
| trimmed seems like a fool's errand, so just saying it can
be truncated to >= 1 for any reason is the way to go.
| This is part of what David Horner wants. However, it does not give him the fault-on-first-with-zero-length-completed mechanism. It could, if there were something else in the system that guaranteed forward progress.
| My take is that requiring that element 0 either
complete or trap is already a solid mechanism for guaranteeing
forward progress, and cleanly matches the while-loop
vectorization
| model.
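A concrete instance of that while-loop model is the classic strlen pattern, sketched here against the v1.0 RVV C intrinsics (which postdate this thread, so treat the exact intrinsic names and the helper name vec_strlen as assumptions rather than anything from the discussion).

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

/* While-loop vectorization with a fault-only-first load.  Wherever the
 * implementation chooses to trim vl (a page-fault, or anything else past
 * element 0), the loop simply advances by new_vl and reissues; element 0
 * completing (or trapping) is what guarantees forward progress. */
size_t vec_strlen(const char *s) {
    const uint8_t *p = (const uint8_t *)s;
    size_t len = 0;
    for (;;) {
        size_t vl = __riscv_vsetvlmax_e8m1();
        size_t new_vl;
        vuint8m1_t v = __riscv_vle8ff_v_u8m1(p + len, &new_vl, vl);
        vbool8_t is_nul = __riscv_vmseq_vx_u8m1_b8(v, 0, new_vl);
        long first = __riscv_vfirst_m_b8(is_nul, new_vl);
        if (first >= 0)
            return len + (size_t)first;   /* found the terminating NUL   */
        len += new_vl;                    /* new_vl >= 1: forward progress */
    }
}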
| ---+ Expanded
| From the vector meeting last Friday: trimming, fault-on-first. I realized that it is similar to the forms of SW-visible non-faulting speculative loads that some machines, especially VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The NaT-ness is implied by the trimmed vector length. It could be implied by a mask showing which vector operations had completed.
| All such SW non-faulting loads need a "was this
correct" operation, which might just be a faulting load and a
comparison. Software control flow must fall through such a
| check operation, and through a redo of the faulting
load if necessary. In scalar, non-faulting and faulting loads
are different instructions, so there must be a branch.
| The RISC-V fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction; i.e., it is its own non-faulting check. It gets away with this because the trimmed vector length indicates which parts were valid and which were not. Forward progress is guaranteed by trapping on vector element zero, i.e., never allowing a trim to zero length. If a non-faulting vector approach were used instead of fault-on-first, it could return a vector completion mask, but to make forward progress it would have to guarantee that at least one vector element had completed.
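A rough C model of that completion-mask alternative is sketched below; nf_load_mask() and load_chunk() are hypothetical names, the operation is nothing from the current vector spec, and the "at least one bit set" rule stands in for the element-zero guarantee. Note that, unlike FF, completion need not be a prefix, so the consumer has to track an arbitrary bitmask rather than a trimmed length.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical non-faulting vector load that returns a bitmask of which
 * of the outstanding elements completed.  This toy stub completes every
 * other outstanding element, always including the lowest one, so it
 * returns a nonempty subset of 'todo' and thus guarantees progress. */
static uint64_t nf_load_mask(int32_t *dst, const int32_t *src, uint64_t todo) {
    uint64_t done = 0;
    int take = 1;
    for (int i = 0; i < 64; ++i) {
        if (todo & (UINT64_C(1) << i)) {
            if (take) { dst[i] = src[i]; done |= UINT64_C(1) << i; }
            take = !take;
        }
    }
    return done;
}

/* Consumer loop (handles up to 64 elements): keep reissuing for the
 * elements whose bits are still clear until the whole chunk is done. */
void load_chunk(int32_t *dst, const int32_t *src, size_t n) {
    uint64_t todo = (n >= 64) ? ~UINT64_C(0) : ((UINT64_C(1) << n) - 1);
    while (todo)
        todo &= ~nf_load_mask(dst, src, todo);
}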
| David Horner's desire for fault-on-first that may have performed no operations at all is (1) reasonable IMHO (I think I managed to explain that to Krste), but (2) would require some other mechanism for forward progress - e.g., instead of trapping on element zero, the bitmask that I described above. Which is almost certainly a bigger architectural change than RISC-V should make at this time.
| Although more and more I am happy that I included such a completion bitmask in nearly every vector instruction set that I've ever done, particularly those vector instruction sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what amounts to a vector instruction set and microarchitecture: https://pharr.org/matt/papers/ispc_inpar_2012.pdf .) It would be unfortunate for such a SIMT program to lose work completed after the first fault.
| MORAL: fault-on-first may be suitable for a vector load that might speculate past the end of the vector - where the length is not known, or is inconvenient to compute, when the vector load instruction is started. Fault-on-first is suboptimal for running SIMT on top of vectors; i.e., fault-on-first is the equivalent of precise exceptions for in-order execution and for a single thread executing vector instructions, whereas a completion mask allows out-of-order completion within a vector and/or vector-length threading.
| IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is totally fine to have the fault-on-first stuff return up to the first really long-latency cache miss, as long as it always guarantees that at least vector element zero was completed - because vector element zero completing is what guarantees forward progress.
| Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could actually choose to service a page-fault that did copy-on-write or swapped in from disk - but that would be visible to the operating system, not the user program. However, such an OS implementation would have to guarantee that it would not kill a process as a result of a true permissions-error page-fault. Or, if the virtual memory architecture made the distinction between permissions faults and the sorts of page-faults that are for disk swapping or copy-on-write or copy-on-read, the OS does not need to be involved.
| EVERYTHING about fault-on-first is a microarchitectural security/information-leak channel and/or a virtualization hole (unless you trim only on true faults and not on COW or COR or disk-swap page-faults). However, fault-on-first on any page-fault is a much lower-bandwidth information-leak channel than fault-on-first on long-latency cache misses. So a general-purpose system might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss.
| However, there are some systems for which that sort of security issue is not a concern, e.g. a data center or embedded system where all of the CPUs are dedicated to a single problem. In which case, if they can gain performance by doing fault-on-first on particular long-latency cache misses, power to them!
| Interestingly, although fault-on-first on long-latency cache misses is a high-bandwidth information leak, it is actually much less of a virtualization hole than fault-on-first for page-faults. The operating system or hypervisor has very little control over cache misses; the OS and hypervisor have almost full control over page-faults. The usual rule in security and virtualization is that an application should not be able to detect that it has had an "innocent" page-fault, such as COW or COR or disk swapping.
| --
| --- Sorry: Typos (Speech-Os?) Writing Errors <=
Speech Recognition <= Computeritis
--
---
Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition
<= Computeritis