Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

David Horner

> You're incorrectly characterizing FoF below.  The FoF loads are not
> intended for software to dynamically probe the microarch state to
> check for possible faults

That  is not what I am advocating.

> (though it can be misused that way).  The
> point is to support software vector-length speculation, where whether
> an access is really needed is not known ahead of time.

That is not precisely the full use case.
Rather, your intended use case is:

When the application is assured that a constrained load can succeed
     [the system guarantees that a termination condition for the load exists,
      that it is detectable from the data read up to and including the end point,
      and that all the data from the start point to the end point is readable],
   then FoF provides a convenient and expedited way to advance through the load.

And if you define "not known ahead of time" to mean before each successive load, then that time frame is not precisely true either.
The load could be performed one unit at a time, and each time the need would be known.
The unit requested could be of arbitrary length [successive packets of Ethernet data, or crypto segments].
I'm not trying to be obtuse and oppositional.
The value of FoF is to avoid the complexities of such tracking.
But if an EE were to reasonably guarantee that the data to be loaded
can be speculatively read up to a page boundary, then FoF is not needed,
nor does it necessarily provide any hard advantage over a regular strided load.
[Some implementations may detect such things as debug breakpoints and not trigger them, but as far as the software is concerned it has the speculative read-to-the-end-of-the-page guarantee, so it will be content even if the debugger is annoyed.]
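
As a sketch of that alternative: assuming a hypothetical EE guarantee that reads are safe up to the next page boundary, software could clamp the element count itself and use a regular (non-FoF) load. `PAGE_SIZE` and `safe_vl` are illustrative names, not from any spec.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative page size; a real program would query the EE/OS. */
#define PAGE_SIZE 4096u

/* Largest element count, up to 'requested', whose bytes all lie
 * before the next page boundary and are therefore covered by the
 * hypothetical speculative-read guarantee discussed above.        */
static size_t safe_vl(uintptr_t addr, size_t requested, size_t elem_size)
{
    /* Bytes remaining before the next page boundary.              */
    size_t to_page_end = PAGE_SIZE - (addr & (PAGE_SIZE - 1));
    /* Whole elements that fit before the boundary (floor).        */
    size_t max_elems = to_page_end / elem_size;
    return requested < max_elems ? requested : max_elems;
}
```

Each pass then loads `safe_vl` elements with an ordinary strided load and advances, so FoF adds nothing that the guarantee does not already provide.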

> The FoF loads are not
> intended for software to dynamically probe the microarch state to
> check for possible faults (though it can be misused that way).

The detection of microarch state is incidental to the characterization I attribute to FoF.
And it is not only microarch state that can be revealed, but also system- and EE-level state.

FoF fails in situations that are not covered by your use case.
Specifically, what does the EE do when it detects that forward progress is not possible,
e.g. when the requested data is not mapped into the process?
As I understand your use case, the [standard] FoF load is aborted, and so is its process.
The "enhanced/dangerous" FoF load will be allowed to return vl=0 to identify the "abort" case.
Consider this scenario:
A process requests the EE to map pages to scan into another process's [e.g. a child's] address space,
and the asynchronous [child] co-process does the scanning.
An FoF return of vl=0 is eminently suited to this use case.
It is certainly possible to add, to the handshaking/synchronization process, the current end point of the data,
to be checked as each page is processed.
But this can add substantial overhead and delay.

It is certainly possible to ensure that each request overreaches the natural page alignment.
However, as FoF allows the processor to reduce vl at any point, it could continually reduce vl so that requests are better aligned to the cache, anticipating that the following request will be optimized.
The program will still work, and will detect potential page failures, but the false positives could be substantial, even more costly, and highly variable across implementations.
[Not to mention the EE suspecting the process of attempting a side-channel attack.]

These use cases argue for the vl=0 return. And as I mentioned before, these use cases will motivate the EE to return vl=0 even without the application using the "new/corrupted" FoF encoding that allows vl=0.
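
The scanning scenario above can be sketched with a toy model of an FoF load. The names (`fof_load`, `scan_for_zero`, `VLEN`) and the `mapped` array are illustrative assumptions, not RVV spec behaviour; the point is only how a consumer treats vl=0 as an "abort / no forward progress" signal rather than a fatal trap.

```c
#include <stddef.h>

#define VLEN 8   /* toy vector length, in elements */

/* Toy model of a unit-stride fault-only-first load: 'mapped' marks
 * which bytes of 'mem' are readable; the load trims vl at the first
 * unreadable element, and returns vl = 0 when even element 0 is
 * unreadable -- the behaviour under discussion.                    */
static size_t fof_load(const unsigned char *mem, const unsigned char *mapped,
                       size_t start, size_t request, unsigned char *dst)
{
    size_t vl = request < VLEN ? request : VLEN;
    for (size_t i = 0; i < vl; i++) {
        if (!mapped[start + i])
            return i;            /* trim; i == 0 models the vl=0 return */
        dst[i] = mem[start + i];
    }
    return vl;
}

/* Scan for a zero byte, treating vl == 0 as "no forward progress
 * possible" (e.g. page not mapped) instead of a fatal trap.
 * Returns the index of the zero byte, or -1 on abort/not-found.    */
static long scan_for_zero(const unsigned char *mem, const unsigned char *mapped,
                          size_t limit)
{
    unsigned char buf[VLEN];
    size_t pos = 0;
    while (pos < limit) {
        size_t vl = fof_load(mem, mapped, pos, limit - pos, buf);
        if (vl == 0)
            return -1;           /* EE signalled: abort this scan */
        for (size_t i = 0; i < vl; i++)
            if (buf[i] == 0)
                return (long)(pos + i);
        pos += vl;
    }
    return -1;
}
```

The vl=0 check is one branch per stripmine pass; without it, the unmapped-page case would either spin forever or have to be reported by killing the scanning co-process.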

On Tue, Oct 20, 2020 at 5:08 AM Krste Asanovic <krste@...> wrote:

> If we allow regular FoF loads to return with vl=0, we must provide a
> forward-progress guarantee, otherwise the instructions are practically [...]

I believe I have shown practical uses above.

> The forward-progress guarantee must not add overhead to the
> common cases where returning vl=0 serves no useful purpose.

I certainly agree. But when does returning vl=0 serve no useful purpose?

> [...] this is difficult to describe, especially when code may have several
> FoF loads in a stripmine loop.  If allowing FoF loads to return vl=0
> requires application overhead to support the forward-progress
> guarantee, then we should have a separate encoding for that
> instruction so that the common case is not burdened by the esoteric [...]

There are different forward-progress guarantees.
As I mentioned before, a separate encoding will not provide a practical benefit.
Once the new encoding is introduced,
legacy processors will just have their EE emulate it by allowing a vl=0 return
under the same conditions, and the link editor will replace the new FoF with the old.

> The FoF instructions allow software vector-length speculation in a
> safe way, where the first element is checked as normal and raises any
> necessary traps to the OS, while the later elements are not processed
> if they're problematic.  Only if software attempts to actually process
> the later elements, because processing the earlier elements deems it
> necessary, is the required trap actioned.
>
> The trap is serviced by the OS, not the application.  Most commonly, it
> will be a page fault, sometimes a protection violation.  Neither is
> reported to the application (in general), because the application can
> do nothing about these traps.  This is different from the other cases
> you bring up (integer overflow, FP flags).

As mentioned before, if we think outside the box of the "classic" use case,
there certainly are meaningful and significant ways that applications can
handle EE-level events (analogous to divide by zero).

> There is no difficulty in providing forward progress on FoF loads in a
> microarchitecture, as otherwise regular vector loads wouldn't work.
> FoF loads are only a small modification to regular vector loads,
> basically flushing the pipeline to change vl on a trap instead of
> taking the trap and setting vstart.

> The only way I would contemplate allowing trimming to vl=0 for the 1.0
> standard was if there was a forward-progress guarantee that did not
> burden regular uses of FoF loads.

The default case is just such a non-burdensome approach.

Check vl=0 if you are not guaranteed to succeed.
Ignore vl=0 at your peril if you are unsure (you could end up in an infinite loop).
Ignore vl=0 if you are guaranteed not to read past valid memory.
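
That checklist can be made concrete as a defensive stripmine skeleton. `run_stripmine`, `fof_fn`, and `MAX_ZERO_TRIMS` are hypothetical names of my own: the FoF load is modelled as a callback that may legally return 0, and a software-chosen bound on consecutive vl=0 returns keeps the loop out of the infinite-loop trap named above.

```c
#include <stddef.h>

#define MAX_ZERO_TRIMS 10        /* arbitrary software-chosen bound */

/* Callback modelling an FoF load: given a position and a requested
 * element count, it returns the vl actually delivered (possibly 0). */
typedef size_t (*fof_fn)(size_t pos, size_t request);

/* Returns total elements processed, or -1 when the load returned
 * vl = 0 more than MAX_ZERO_TRIMS times in a row (no progress).    */
static long run_stripmine(fof_fn load, size_t total)
{
    size_t pos = 0;
    int zero_trims = 0;
    while (pos < total) {
        size_t vl = load(pos, total - pos);
        if (vl == 0) {
            if (++zero_trims > MAX_ZERO_TRIMS)
                return -1;       /* give up rather than spin forever */
            continue;
        }
        zero_trims = 0;          /* progress was made; reset the bound */
        pos += vl;               /* stand-in for processing vl elements */
    }
    return (long)pos;
}

/* Two toy loads: one always makes progress, one models an EE
 * signalling "no forward progress possible" via vl = 0.            */
static size_t load_ok(size_t pos, size_t request)
{
    (void)pos;
    return request < 4 ? request : 4;   /* at most 4 elements per pass */
}

static size_t load_stuck(size_t pos, size_t request)
{
    (void)pos; (void)request;
    return 0;                           /* always trims to vl = 0 */
}
```

The bound costs one counter in the common case and nothing in the spec: it is exactly the "exclude the check if you do not need it" approach argued for below.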

> Also, the guarantee would have to
> actually enable some improvement in an implementation (as otherwise,
> no one would choose to trim to 0, and we can then keep the spec [...]

The spec will need to address this case in any event, even if only to say that we do not recommend the EE return vl=0.
The spec cannot mandate that an EE not return vl=0: certification does not extend to runtime-constrained EEs.
Code needs to be aware that this can happen.
The net is, I don't believe the "prohibition" significantly simplifies the spec.
It may actually make it more contentious.

You simplified integer divide relative to other ISAs that mandate a trap for divide by zero.
With this approach we mandate a trap for FoF when vl=0 would be sufficient.

Yet it is inevitable that the EE will do the sensible thing and
return vl=0 when forward progress [within reasonable constraints] is not possible.



>>>>> On Sat, 17 Oct 2020 22:39:37 -0400, "David Horner" <ds2horner@...> said:

| On 2020-10-17 6:49 p.m., krste@... wrote:
|| - [tech-cmo] so they don't get bothered with this off-topic discussion
||||||| On Fri, 16 Oct 2020 18:14:48 -0700, Andy Glew Si5 <andy.glew@...> said:
|| |     [DH]: I see this openness/lack of arbitrary constraint  as  precisely  the strength of RISCV.
|| |     Limiting vector operations due to current constraints in software (Linux does it this way, compilers cannot optimize that formulation [yet])
|| |     or hardware (reminiscent of delayed branch because prediction was too expensive) is short-sighted.
|| |     A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under
|| |     SOFTWARE control.
|| | Sure. You could guarantee forward progress, e.g. by allowing no more than 10 successive "first fault" with VL=0, and requiring trap on element zero
|| | on the 11th.   That check could be in hardware, or it could be in
|| | the software that's calling the FF instruction.
|| I don't want us to rathole on how to guarantee forward progress for
|| vl=0 case, but do want to note that this kind of forward progress is
|| nasty to guarantee, implying there's long-lasting microarch state to
|| keep around - what if you're context swapped out before you get to the
|| 11th?  Do you have to force the first one after a context swap to not
|| trim?  What if there's a sequence of ff's and second one goes back to
|| vl=0?

| Krste: I gather your answer is more in the context of lr/sc-type
| forward-progress guarantees, instructions that are designed not to trap
| when delivering on their primary function.

|     So  I agree that determining an appropriate deferred trap is
| problematic.

|      However, the intent of vl*ff is to operate in an environment
| anticipating exception behaviour.

|       It is the instruction's raison d'être.  If the full vl is expected
| to always be returned [with very, very few exceptions] we would not have
| this instruction, but rather direct the EE to reduce vl or abort the
| process.

|      So rather than a rathole we have the Elephant-In-The-Room. What
| does the EE do when deferred forward progress is not possible?

|      Given that the application is anticipating "trouble" with the read
| memory access, does it make sense to only address the "safe" case?

|      With float exceptions RISCV does not provide trap handlers, but
| rather FFlags for the application to electively check.

|      With integer overflow or zero divide RISCV does not provide trap
| handlers, but requires the application to include code to detect the
| condition.

|      Trap handlers for vl*ff are only incidental. They are no more
| special to vl*ff than any other of the vl*, or the RVI lw,lh,lb, etc.

|      In apparent contradiction to the spec, a valid implementation can
| "trap" as it would for the non-ff case, but not service the fault, only
| reduce the vl accordingly until the fault occurs on the first element.

|      Thus central to the functioning of the instruction is what happens
| when the fault occurs on the first element.

|      Punting to the handler is not an answer. "Return at least one
| element or trap" does not define the operational characteristics [even if
| it may arguably be an ISA architectural answer].

|      There is nothing prohibiting the trap from returning vl=0. And I
| argue that EEs will indeed elect to do that when there can be no forward
| progress [e.g. the requested address is mapped execute only].

|      Platforms will stipulate a behaviour and vl=0 will be a choice.
| What we should try to address is how to allow greatest portability and
| least software fragmentation.

|    I believe this should be accomplished exactly as was effected for
| integer overflow. Exclude the checking code if you do not need it, and
| include it if you are not assured that it is superfluous.

|   In other words vl=0 must be handled, either by avoidance or
| explicitly as an indication that there is nothing to process.

|| | But this does not need to be in the RISC-V architectural standard. Not yet.
|| Let's agree on this and move on.
| There is no value in ignoring the issue.
