
Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Krste Asanovic
 

As you get to pick where vl is trimmed, you would probably choose the
vl=3 case here to simplify implementation.

Krste

On Fri, 16 Oct 2020 18:59:55 +0200, Roger Espasa <roger.espasa@...> said:
| Bill, you said element 9, but did you mean the element labeled "a", which is the 11th element in the vector? (I agree with that.)
| However, I would NOT agree that a masked out element has been written, even if past the failing point.

| roger.

| On Fri, Oct 16, 2020 at 6:57 PM Roger Espasa <roger.espasa@...> wrote:

| Here's where the "implementation" cost comes in (at least in our implementation; others, of course, may have more clever ways of doing this)

| -> If you pick "vl=3", then the vstart and vltrim calculations can be made one and the same
| -> If you pick "vl=6", then the vstart and vltrim calculations are not exactly equal, and vltrim needs an LZC on the mask for the elements within the line
| followed by an adder. At SEW=8b, there can be lots of elements within a line...

| roger.

| On Fri, Oct 16, 2020 at 6:31 PM Bill Huffman <huffman@...> wrote:

| The way the discussion has been going, I think either would be permissible.  Not only that, but it would have been permissible for element 9 already
| to have been overwritten with 1's (if vma allows it).

| I think bringing this up is good as we need to be sure what precisely we mean by the v*ff instructions.

|       Bill

| On 10/16/20 8:57 AM, Roger Espasa wrote:

| EXTERNAL MAIL

| Here's a question for the group: I did it as a picture... hopefully it will go through the mailing list:

| [attachment: image.png]

| On Fri, Oct 16, 2020 at 4:56 PM David Horner <ds2horner@...> wrote:

| On 2020-10-16 10:30 a.m., krste@... wrote:
||
||||||| On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
|| | First I am very happy that "arbitrary decisions by the
|| | micro-architecture" allow reduction of vl to any [non-zero] value.
||
|| | Even if such appear "random".
|| [...]
|| | A check for vl=0 on platforms that allow it is eminently doable, low
|| | overhead for many use cases  AND guarantees forward progress under
|| | SOFTWARE control.
||
|| If we allowed implementation to return vl=0, how does software
|| guarantee forward progress?

| The forward progress is to advance to another task.

| In the case of machine mode it can potentially "resolve" the cause of
| the vl=0 return and re-execute the loop (without the overhead of the trap).

||
|| | I see it as no different [in fundamental principle] than other cases
|| | such as RVI integer divide by zero behaviour that does not trap but can
|| | be  readily checked for.
|| | Also RVI integer overflow that if you want to check for it is at most a
|| | few instructions including the branch.
||
|| I don't see how these examples relate to returning vl=0 on some
|| microarchitectural event.  The examples here have results that depend
|| only on architectural values, so can be deterministically handled.
| The similarity is the avoidance of trap handling, when it is sufficient
| to check instead register state.
||
|| vl=0 is more related to load-reserved/store-conditional failure, where
|| we need to add implementation constraints to guarantee forward
|| progress.

| Ok. I can see providing guidance as to when vl=0 is allowed, but not to
| exclude it outright.

|| Krste

|


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Bill Huffman
 

Roger,

I think it's an implementation choice whether vl is trimmed to 3 or 6 (or theoretically other values).  I don't know a reason why the implementation couldn't always trim vl to the same value that vstart would have been set to if the exception were being taken.  Does anyone know such a reason?  It seems simplest to me always to trim vl to the value vstart would have been set to.

I meant element 9.  If vma=1, then inactive elements can be undisturbed or set to 1's.  Element 'a' couldn't have been loaded in the case described because it was in a line with a fault.  In general, I think our discussions would have allowed element 'a' to be written if there were some other reason for trimming vl.
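
A minimal sketch of the equivalence described above, assuming a hypothetical layout in which elements 0-2 of a fault-only-first byte load sit in a mapped page and element 3 is the first element of an unmapped page (register choices are illustrative only, not from the original thread):

    vsetvli t0, a0, e8, m1, ta, ma   # request up to a0 bytes
    vle8ff.v v8, (a1)                # elements 0..2 complete; element 3 would fault
    csrr t1, vl                      # an implementation may report vl=3 here, i.e. the
                                     # same index a trapping (non-FF) load would have
                                     # left in vstart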

      Bill

On 10/16/20 9:59 AM, Roger Espasa wrote:

EXTERNAL MAIL

Bill, you said element 9, but did you mean the element labeled "a", which is the 11th element in the vector? (I agree with that.)
However, I would NOT agree that a masked out element has been written, even if past the failing point.

roger.

On Fri, Oct 16, 2020 at 6:57 PM Roger Espasa <roger.espasa@...> wrote:
Here's where the "implementation" cost comes in (at least in our implementation; others, of course, may have more clever ways of doing this)

-> If you pick "vl=3", then the vstart and vltrim calculations can be made one and the same
-> If you pick "vl=6" then the vstart and vltrim calculations are not exactly equal and vltrim needs a LZC on the mask for the elements within the line followed by an adder. At SEW=8b, there can be lots of elements within a line...

roger.

On Fri, Oct 16, 2020 at 6:31 PM Bill Huffman <huffman@...> wrote:

The way the discussion has been going, I think either would be permissible.  Not only that, but it would have been permissible for element 9 already to have been overwritten with 1's (if vma allows it).

I think bringing this up is good as we need to be sure what precisely we mean by the v*ff instructions.

      Bill

On 10/16/20 8:57 AM, Roger Espasa wrote:
EXTERNAL MAIL

Here's a question for the group: I did it as a picture... hopefully it will go through the mailing list:

[attachment: image.png]

On Fri, Oct 16, 2020 at 4:56 PM David Horner <ds2horner@...> wrote:

On 2020-10-16 10:30 a.m., krste@... wrote:
>
>>>>>> On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
> | First I am very happy that "arbitrary decisions by the
> | micro-architecture" allow reduction of vl to any [non-zero] value.
>
> | Even if such appear "random".
> [...]
> | A check for vl=0 on platforms that allow it is eminently doable, low
> | overhead for many use cases  AND guarantees forward progress under
> | SOFTWARE control.
>
> If we allowed implementation to return vl=0, how does software
> guarantee forward progress?

The forward progress is to advance to another task.

In the case of machine mode it can potentially "resolve" the cause of
the vl=0 return and re-execute the loop (without the overhead of the trap).


>
> | I see it as no different [in fundamental principle] than other cases
> | such as RVI integer divide by zero behaviour that does not trap but can
> | be  readily checked for.
> | Also RVI integer overflow that if you want to check for it is at most a
> | few instructions including the branch.
>
> I don't see how these examples relate to returning vl=0 on some
> microarchitectural event.  The examples here have results that depend
> only on architectural values, so can be deterministically handled.
The similarity is the avoidance of trap handling, when it is sufficient
to check instead register state.
>
> vl=0 is more related to load-reserved/store-conditional failure, where
> we need to add implementation constraints to guarantee forward
> progress.

Ok. I can see providing guidance as to when vl=0 is allowed, but not to
exclude it outright.


> Krste






Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Roger Espasa
 

Bill, you said element 9, but did you mean the element labeled "a", which is the 11th element in the vector? (I agree with that.)
However, I would NOT agree that a masked out element has been written, even if past the failing point.

roger.

On Fri, Oct 16, 2020 at 6:57 PM Roger Espasa <roger.espasa@...> wrote:
Here's where the "implementation" cost comes in (at least in our implementation; others, of course, may have more clever ways of doing this)

-> If you pick "vl=3", then the vstart and vltrim calculations can be made one and the same
-> If you pick "vl=6" then the vstart and vltrim calculations are not exactly equal and vltrim needs a LZC on the mask for the elements within the line followed by an adder. At SEW=8b, there can be lots of elements within a line...

roger.

On Fri, Oct 16, 2020 at 6:31 PM Bill Huffman <huffman@...> wrote:

The way the discussion has been going, I think either would be permissible.  Not only that, but it would have been permissible for element 9 already to have been overwritten with 1's (if vma allows it).

I think bringing this up is good as we need to be sure what precisely we mean by the v*ff instructions.

      Bill

On 10/16/20 8:57 AM, Roger Espasa wrote:
EXTERNAL MAIL

Here's a question for the group: I did it as a picture... hopefully it will go through the mailing list:

[attachment: image.png]

On Fri, Oct 16, 2020 at 4:56 PM David Horner <ds2horner@...> wrote:

On 2020-10-16 10:30 a.m., krste@... wrote:
>
>>>>>> On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
> | First I am very happy that "arbitrary decisions by the
> | micro-architecture" allow reduction of vl to any [non-zero] value.
>
> | Even if such appear "random".
> [...]
> | A check for vl=0 on platforms that allow it is eminently doable, low
> | overhead for many use cases  AND guarantees forward progress under
> | SOFTWARE control.
>
> If we allowed implementation to return vl=0, how does software
> guarantee forward progress?

The forward progress is to advance to another task.

In the case of machine mode it can potentially "resolve" the cause of
the vl=0 return and re-execute the loop (without the overhead of the trap).


>
> | I see it as no different [in fundamental principle] than other cases
> | such as RVI integer divide by zero behaviour that does not trap but can
> | be  readily checked for.
> | Also RVI integer overflow that if you want to check for it is at most a
> | few instructions including the branch.
>
> I don't see how these examples relate to returning vl=0 on some
> microarchitectural event.  The examples here have results that depend
> only on architectural values, so can be deterministically handled.
The similarity is the avoidance of trap handling, when it is sufficient
to check instead register state.
>
> vl=0 is more related to load-reserved/store-conditional failure, where
> we need to add implementation constraints to guarantee forward
> progress.

Ok. I can see providing guidance as to when vl=0 is allowed, but not to
exclude it outright.


> Krste






Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Roger Espasa
 

Here's where the "implementation" cost comes in (at least in our implementation; others, of course, may have more clever ways of doing this)

-> If you pick "vl=3", then the vstart and vltrim calculations can be made one and the same
-> If you pick "vl=6" then the vstart and vltrim calculations are not exactly equal and vltrim needs a LZC on the mask for the elements within the line followed by an adder. At SEW=8b, there can be lots of elements within a line...

roger.

On Fri, Oct 16, 2020 at 6:31 PM Bill Huffman <huffman@...> wrote:

The way the discussion has been going, I think either would be permissible.  Not only that, but it would have been permissible for element 9 already to have been overwritten with 1's (if vma allows it).

I think bringing this up is good as we need to be sure what precisely we mean by the v*ff instructions.

      Bill

On 10/16/20 8:57 AM, Roger Espasa wrote:
EXTERNAL MAIL

Here's a question for the group: I did it as a picture... hopefully it will go through the mailing list:

[attachment: image.png]

On Fri, Oct 16, 2020 at 4:56 PM David Horner <ds2horner@...> wrote:

On 2020-10-16 10:30 a.m., krste@... wrote:
>
>>>>>> On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
> | First I am very happy that "arbitrary decisions by the
> | micro-architecture" allow reduction of vl to any [non-zero] value.
>
> | Even if such appear "random".
> [...]
> | A check for vl=0 on platforms that allow it is eminently doable, low
> | overhead for many use cases  AND guarantees forward progress under
> | SOFTWARE control.
>
> If we allowed implementation to return vl=0, how does software
> guarantee forward progress?

The forward progress is to advance to another task.

In the case of machine mode it can potentially "resolve" the cause of
the vl=0 return and re-execute the loop (without the overhead of the trap).


>
> | I see it as no different [in fundamental principle] than other cases
> | such as RVI integer divide by zero behaviour that does not trap but can
> | be  readily checked for.
> | Also RVI integer overflow that if you want to check for it is at most a
> | few instructions including the branch.
>
> I don't see how these examples relate to returning vl=0 on some
> microarchitectural event.  The examples here have results that depend
> only on architectural values, so can be deterministically handled.
The similarity is the avoidance of trap handling, when it is sufficient
to check instead register state.
>
> vl=0 is more related to load-reserved/store-conditional failure, where
> we need to add implementation constraints to guarantee forward
> progress.

Ok. I can see providing guidance as to when vl=0 is allowed, but not to
exclude it outright.


> Krste






Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Bill Huffman
 

The way the discussion has been going, I think either would be permissible.  Not only that, but it would have been permissible for element 9 already to have been overwritten with 1's (if vma allows it).

I think bringing this up is good as we need to be sure what precisely we mean by the v*ff instructions.

      Bill

On 10/16/20 8:57 AM, Roger Espasa wrote:

EXTERNAL MAIL

Here's a question for the group: I did it as a picture... hopefully it will go through the mailing list:

[attachment: image.png]

On Fri, Oct 16, 2020 at 4:56 PM David Horner <ds2horner@...> wrote:

On 2020-10-16 10:30 a.m., krste@... wrote:
>
>>>>>> On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
> | First I am very happy that "arbitrary decisions by the
> | micro-architecture" allow reduction of vl to any [non-zero] value.
>
> | Even if such appear "random".
> [...]
> | A check for vl=0 on platforms that allow it is eminently doable, low
> | overhead for many use cases  AND guarantees forward progress under
> | SOFTWARE control.
>
> If we allowed implementation to return vl=0, how does software
> guarantee forward progress?

The forward progress is to advance to another task.

In the case of machine mode it can potentially "resolve" the cause of
the vl=0 return and re-execute the loop (without the overhead of the trap).


>
> | I see it as no different [in fundamental principle] than other cases
> | such as RVI integer divide by zero behaviour that does not trap but can
> | be  readily checked for.
> | Also RVI integer overflow that if you want to check for it is at most a
> | few instructions including the branch.
>
> I don't see how these examples relate to returning vl=0 on some
> microarchitectural event.  The examples here have results that depend
> only on architectural values, so can be deterministically handled.
The similarity is the avoidance of trap handling, when it is sufficient
to check instead register state.
>
> vl=0 is more related to load-reserved/store-conditional failure, where
> we need to add implementation constraints to guarantee forward
> progress.

Ok. I can see providing guidance as to when vl=0 is allowed, but not to
exclude it outright.


> Krste






Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Roger Espasa
 

Here's a question for the group: I did it as a picture... hopefully it will go through the mailing list:

[attachment: image.png]

On Fri, Oct 16, 2020 at 4:56 PM David Horner <ds2horner@...> wrote:

On 2020-10-16 10:30 a.m., krste@... wrote:
>
>>>>>> On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
> | First I am very happy that "arbitrary decisions by the
> | micro-architecture" allow reduction of vl to any [non-zero] value.
>
> | Even if such appear "random".
> [...]
> | A check for vl=0 on platforms that allow it is eminently doable, low
> | overhead for many use cases  AND guarantees forward progress under
> | SOFTWARE control.
>
> If we allowed implementation to return vl=0, how does software
> guarantee forward progress?

The forward progress is to advance to another task.

In the case of machine mode it can potentially "resolve" the cause of
the vl=0 return and re-execute the loop (without the overhead of the trap).


>
> | I see it as no different [in fundamental principle] than other cases
> | such as RVI integer divide by zero behaviour that does not trap but can
> | be  readily checked for.
> | Also RVI integer overflow that if you want to check for it is at most a
> | few instructions including the branch.
>
> I don't see how these examples relate to returning vl=0 on some
> microarchitectural event.  The examples here have results that depend
> only on architectural values, so can be deterministically handled.
The similarity is the avoidance of trap handling, when it is sufficient
to check instead register state.
>
> vl=0 is more related to load-reserved/store-conditional failure, where
> we need to add implementation constraints to guarantee forward
> progress.

Ok. I can see providing guidance as to when vl=0 is allowed, but not to
exclude it outright.


> Krste






Re: Sequence to insert an element

David Horner
 

On 2020-10-16 11:10 a.m., Roger Ferrer Ibanez wrote:
> Hi,
>
> what is a reasonable sequence to insert an element into an arbitrary position in the vector?
>
> I considered the following sequence (assume the input vector is v12)
>
> vid.v v1
> vmseq.vx v0, v1, <index>
> vmerge.vxm v1, v12, <value>, v0
>
> But I think this is problematic for sew=8 as there may be overflow if vlmax(sew=8)>256.
The mask could be built with sew=16, as the mask is ordinal based.

And there are tricks to set it up, for example a direct load (register move) to v0 to set the correct bit.

The mask could be built in v2 and transferred under mask to clear lower or higher aliasing.

> It may be possible for lmul={1,2,4} sew=8 to compute vid and vmseq using lmul={2,4,8} sew=16, respectively, but the lmul=8,sew=8 case won't work as there is no lmul=16,sew=16.
>
> I also came up with this other sequence but doesn't look great to me:
>
> vslidedown.vx v1, v12, <index>
> vmv.s.x v1, <value>
> vslideup.vx v1, v1, <index>
> vsetvli x0, <index>, sew, lmul, tu, mu
> vmv.v.v v1, v12    # should leave the tail undisturbed
>
> Thanks a lot,
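
A minimal sketch of the sew=16 mask construction described above, assuming the data lives in v12 at sew=8, lmul=1, with the insertion index in a1, the new value in a2, and the element count in a0 (register choices are illustrative only):

    vsetvli t0, a0, e16, m2, ta, ma   # build the mask at sew=16 so indices don't wrap below 65536
    vid.v v2                          # v2[i] = i
    vmseq.vx v0, v2, a1               # v0[i] = (i == index); the mask layout is per-element,
                                      # so it is reusable at a different sew/lmul
    vsetvli t0, a0, e8, m1, ta, ma    # back to the data element width
    vmerge.vxm v1, v12, a2, v0        # v1[i] = v0[i] ? a2 : v12[i]

As noted in the thread, this covers lmul={1,2,4} data at sew=8 but not lmul=8, since there is no lmul=16 to pair with sew=16.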


Sequence to insert an element

Roger Ferrer Ibanez
 

Hi,

what is a reasonable sequence to insert an element into an arbitrary position in the vector?

I considered the following sequence (assume the input vector is v12)

vid.v v1
vmseq.vx v0, v1, <index>
vmerge.vxm v1, v12, <value>, v0

But I think this is problematic for sew=8 as there may be overflow if vlmax(sew=8)>256.

It may be possible for lmul={1,2,4} sew=8 to compute vid and vmseq using lmul={2,4,8} sew=16, respectively but the lmul=8,sew=8 case won't work as there is no lmul=16,sew=16.

I also came up with this other sequence but doesn't look great to me:

vslidedown.vx v1, v12, <index>
vmv.s.x v1, <value>
vslideup.vx v1, v1, <index>
vsetvli x0, <index>, sew, lmul, tu, mu
vmv.v.v v1, v12    # should leave the tail undisturbed

Thanks a lot,

--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación


http://bsc.es/disclaimer


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

David Horner
 

On 2020-10-16 10:30 a.m., krste@... wrote:
>
>>>>>> On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
> | First I am very happy that "arbitrary decisions by the
> | micro-architecture" allow reduction of vl to any [non-zero] value.
>
> | Even if such appear "random".
> [...]
> | A check for vl=0 on platforms that allow it is eminently doable, low
> | overhead for many use cases AND guarantees forward progress under
> | SOFTWARE control.
>
> If we allowed implementation to return vl=0, how does software
> guarantee forward progress?

The forward progress is to advance to another task.

In the case of machine mode it can potentially "resolve" the cause of the vl=0 return and re-execute the loop (without the overhead of the trap).

> | I see it as no different [in fundamental principle] than other cases
> | such as RVI integer divide by zero behaviour that does not trap but can
> | be readily checked for.
> | Also RVI integer overflow that if you want to check for it is at most a
> | few instructions including the branch.
>
> I don't see how these examples relate to returning vl=0 on some
> microarchitectural event. The examples here have results that depend
> only on architectural values, so can be deterministically handled.

The similarity is the avoidance of trap handling, when it is sufficient to check instead register state.

> vl=0 is more related to load-reserved/store-conditional failure, where
> we need to add implementation constraints to guarantee forward
> progress.

Ok. I can see providing guidance as to when vl=0 is allowed, but not to exclude it outright.

> Krste


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Krste Asanovic
 

On Fri, 16 Oct 2020 07:48:00 -0400, "David Horner" <ds2horner@...> said:
| First I am very happy that "arbitrary decisions by the
| micro-architecture" allow reduction of vl to any [non-zero] value.

| Even if such appear "random".
[...]
| A check for vl=0 on platforms that allow it is eminently doable, low
| overhead for many use cases  AND guarantees forward progress under
| SOFTWARE control.

If we allowed implementation to return vl=0, how does software
guarantee forward progress?

| I see it as no different [in fundamental principle] than other cases
| such as RVI integer divide by zero behaviour that does not trap but can
| be  readily checked for.
| Also RVI integer overflow that if you want to check for it is at most a
| few instructions including the branch.

I don't see how these examples relate to returning vl=0 on some
microarchitectural event. The examples here have results that depend
only on architectural values, so can be deterministically handled.

vl=0 is more related to load-reserved/store-conditional failure, where
we need to add implementation constraints to guarantee forward
progress.

Krste


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

David Horner
 

First I am very happy that "arbitrary decisions by the micro-architecture" allow reduction of vl to any [non-zero] value.

Even if such appear "random".


On 2020-10-16 2:01 a.m., krste@... wrote:
> ... - I'm sure there's probably
> papers out there with this already).
Exactly.
I see this openness/lack of arbitrary constraint as precisely the strength of RISC-V.
Limiting vector operations due to current constraints in software ("Linux does it this way", "compilers cannot optimize that formulation [yet]")
or hardware (reminiscent of delayed branches because prediction was too expensive) is short-sighted.

A check for vl=0 on platforms that allow it is eminently doable, low overhead for many use cases  AND guarantees forward progress under SOFTWARE control.

I see it as no different [in fundamental principle] than other cases such as RVI integer divide by zero behaviour that does not trap but can be  readily checked for.
Also RVI integer overflow that if you want to check for it is at most a few instructions including the branch.
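
A minimal sketch of the kind of software check being argued for, assuming (hypothetically) a platform profile that permitted a fault-only-first load to return vl=0 (registers and labels are illustrative only):

    stripmine:
        vsetvli t0, a0, e8, m8, ta, ma   # request up to a0 remaining elements
        vle8ff.v v8, (a1)                # fault-only-first load; vl may be trimmed
        csrr t1, vl                      # how many elements actually completed
        beqz t1, no_progress             # vl=0: nothing completed this pass
        # ... process t1 elements ...
        add a1, a1, t1                   # advance the source pointer
        sub a0, a0, t1                   # decrement the remaining count
        bnez a0, stripmine
        # done
    no_progress:
        # software-controlled forward progress: resolve the cause (e.g. in
        # machine mode) or switch to another task, then re-enter the loop
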
(sending replies to vector list - as this is off topic for CMOs)

My opinion is that baking SIMT execution model into ISA for purposes
of exposing microarchitectural performance (i.e., cache misses)
exposes too much of the machine, forcing application software to add
extra retry loops (2nd nested loop inside of stripmining) and forcing
system software to deal with complex traps.

[ Random historical connection - having a partial completion mask based
on cache misses is a vector version of the Stanford proposal for
"informing memory operations" where scalar core can branch on cache miss.
https://dl.acm.org/doi/10.1145/232974.233000 ]
Most of the benefit for SIMT execution around microarchitectural
hiccups can be obtained under the hood in the microarchitecture (and
there are several hundred ISCA/MICRO/HPCA papers on doing that - I
might be exaggerating, but only slightly - and I know Andy worked in
this space at some point), and should outperform putting this handling
into software.

That said, I think it's OK to allow FF V loads to stop anywhere past
element 0 including at a long-latency cache miss, mainly because it
doesn't change anything in software model.

I'm not sure it will really help perf that much in practice. While
it's easy to construct an example where it looks like it would help, I
think in general most loops touch multiple vector operands, hardware
prefetchers do well on vector streams, vector units are more efficient
on larger chunks, scatter-gathers missing in cache limit perf anyway,
etc., so it's probably a fairly brittle optimization (yes, you could
add a predictor to figure out whether to wait for the other elements
or go ahead with a partial vector result - I'm sure there's probably
papers out there with this already).

Krste

On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@...> said:
| My take is the same as Andrew has outlined below.
| Bill

| On 10/15/20 4:30 PM, andrew@... wrote:

| EXTERNAL MAIL
| Forwarding this to tech-vector-ext; couple comments below.
| On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@...> wrote:

| In vector meeting last Friday I listened to both Krste and David Horner's different opinions about fault-on-first and vector length trimming. I realized (and may have
| convinced other attendees) that the RISC-V "fault-on-first" vector length trimming need not be done just for things like page-faults.
| Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed, because vector element zero is the forward progress
| mechanism.
| Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the element zero of the operation, but that it can randomly stop with the appropriate
| indication for vector length trimming at any point in the middle of the instruction.
| Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values). Enumerating all cases in which the length might be
| trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.

| This is part of what David Horner wants. However, it does not give him the fault-on-first with zero length complete mechanism. It could, if there were something else in
| the system that guaranteed forward progress
| My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization
| model.

| ---+ Expanded

| From vector meeting last Friday: trimming, fault-on-first. I realized that it is similar to the forms of SW visible non-faulting speculative loads some machines, especially
| VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The NaT-ness is implied by trimmed vector length.
| It could be implied by a mask showing which vector operations had completed.

| All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison. Software control flow must fall through such a
| check operation, and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different instructions, so there must be a branch.

| The RISC-V Fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction. i.e. it is its own non-faulting check. it gets away with
| this because the trimmed vector length indicates which parts were valid and which were not. Forward progress is guaranteed by trapping on vector element zero, i.e. never allowing a trim
| to zero length. If a non-faulting vector approach were used instead of fault-on-first, it could return a vector complete mask, but to make forward progress it would have to
| guarantee that at least one vector element had completed.

| David Horner's desire for fault-on-first that may have performed no operations at all is (1) reasonable IMHO (I think I managed to explain that to Krste), but (2) would
| require some other mechanism for forward progress. E.g. instead of trapping on element zero, the bitmask that I described above. Which is almost certainly a bigger
| architectural change than RISC-V should make at this time.

| Although more and more I am happier that I included such a completion bitmask in nearly every vector instruction set that I've ever done. Particularly those vector instruction
| sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what amounts to a vector instruction set and
| microarchitecture. https://pharr.org/matt/papers/ispc_inpar_2012.pdf ). It would be unfortunate for such an SIMT program to lose work completed after the first fault.

| MORAL: fault-on-first may be suitable for vector load that might speculate past the end of the vector - where the length is not known or inconvenient when the vector load
| instruction is started. Fault-on-first is suboptimal for running SIMT on top of vectors. i.e. fault-on-first is the equivalent of precise exceptions for in order
| execution, and for a single thread executing vector instructions, whereas completion mask allows out of order within a vector and/or vector length threading.

| IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is totally fine to have the fault-on-first stuff
| return up to the first really long latency cache miss, as long as it always guarantees that at least vector element zero was complete. Because vector element zero complete
| is what guarantees forward progress.

| Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could actually choose to actually implement a page-fault that did
| copy-on-write or swapped in from disk. but that would be visible to the operating system, not the user program. However, such an OS implementation would have to
| guarantee that it would not kill a process as a result of a true permissions error page-fault. Or, if the virtual memory architecture made the distinction between
| permissions faults and the sorts of page-fault that is for disk swapping or copy-on-write or copy on read, the OS does not need to be involved.

| EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole. (Unless you trim only on true faults and not COW
| or COR or disk swap page-faults). However, fault-on-first on any page-fault is a much lower-bandwidth information leak channel than is fault-on-first on long latency
| cache misses. so a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss.
| However, there are some systems for which that sort of security issue is not a concern. E.g. a data center or embedded system where all of the CPUs are dedicated to a single
| problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to them!

| Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually much less of a virtualization hole than
| fault-on-first for page-faults. The operating system or hypervisor has very little control over cache misses. the OS and hypervisor have almost full control over
| page-faults. The usual rule in security and virtualization is that an application should not be able to detect that it has had an "innocent" page-fault, such as COW or COR
| or disk swapping.

| --
| --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis
|


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

David Horner
 

On 2020-10-15 7:30 p.m., Andrew Waterman wrote:
> My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization model.
I agree; however, it still does not answer the ISA-visible behavioural question: "Is the trap allowed to set vl=0 on return?"

Can this be compliant behaviour for certain platforms?

If so, then it would be equivalent to hardware doing the same thing, and thus the actual Vector hardware instruction should also be allowed this behaviour for the given platform.

This is a corollary of instruction emulation by trapping on unimplemented op codes.


Vector TG minutes from 2020/10/9 meeting

Krste Asanovic
 

Date: 2020/10/9
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~12
Current issues on github: https://github.com/riscv/riscv-v-spec

# 576 vlsegff exception behavior

Proposal is to allow updates of destination registers with portions of
a segment even if a fault occurs that causes vl to be trimmed to the start
of the segment.  This simplifies implementations that execute portions of a
segment at a time.

It was noted that software cannot in general rely on no modification past trimmed
vl anyway, only in contrived cases relying on non-zero vstart and known
protection boundaries.

Unaligned elements within non-segment unit-stride fault-on-first are
handled in same way as partial segment. If we later add indexed
segment fault-on-first (longer encoding), could have issue with
overlap (though usually would redo stripmine loop from new start
point).

Discussion continued on to interaction with debug watchpoints.
Consensus was that watchpoints shall not cause vl trimming, as this
would change behavior of code. Instead the debug trap is taken, and
debug mode figures out what to do, possibly just restarting
instruction at vstart with untrimmed vl.

Also, decided that vl shouldn't be trimmed on an interrupt trap, as the
handler should go handle the interrupt and resume at vstart.  An interrupt
can always be deferred until instruction end (modulo concerns on interrupt
latency).

Another case was whether to allow VL-trimming on cache misses. This
would allow stripmine loop to continue with elements already available
in cache while waiting for rest of vector to arrive from memory.

Also, if ECC error detected should this trap or trim VL?

General consensus was to allow vl-trimming to any random location to
give implementations flexibility, except that element 0 must always be
processed to ensure forward progress.


Minutes from 2020/10/2 meeting

Krste Asanovic
 

Date: 2020/10/2
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~12
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

# Implementation availability - discussion around when FPGA RTL
implementations could be made available to software community

# Imprecise traps

There was extensive discussion on how to allow imprecise traps in some
implementations.

For application processors, the general 'V' extension will mandate
traps precise to the vector element, with vstart pointing at the
faulting element (and for memory traps, badvaddr pointing to faulting
memory address).

For embedded processors, precise traps can be expensive and
unnecessary to support. There was extensive discussion on the
possible spectrum of less-than-precise trap models.

One point in the design space supports "swappable" traps, where the
microarchitectural state can be saved and restored around a trap
handler using special instructions, but it was decided to postpone
defining this model until later.

The consensus was to define the simplest imprecise model that does not
allow restart but does indicate the faulting instruction (epc/cause).

An open question was around debug watchpoints, and whether these
needed to be precise or whether two run modes should be supported
(slow but precise watchpoints, or fast but imprecise watchpoint
traps).


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Krste Asanovic
 

(sending replies to vector list - as this is off topic for CMOs)

My opinion is that baking SIMT execution model into ISA for purposes
of exposing microarchitectural performance (i.e., cache misses)
exposes too much of the machine, forcing application software to add
extra retry loops (2nd nested loop inside of stripmining) and forcing
system software to deal with complex traps.

[ Random historical connection - having a partial completion mask based
on cache misses is a vector version of the Stanford proposal for
"informing memory operations" where scalar core can branch on cache miss.
https://dl.acm.org/doi/10.1145/232974.233000 ]

Most of the benefit for SIMT execution around microarchitectural
hiccups can be obtained under the hood in the microarchitecture (and
there are several hundred ISCA/MICRO/HPCA papers on doing that - I
might be exaggerating, but only slightly - and I know Andy worked in
this space at some point), and should outperform putting this handling
into software.

That said, I think it's OK to allow FF V loads to stop anywhere past
element 0 including at a long-latency cache miss, mainly because it
doesn't change anything in software model.
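
A minimal sketch of why the software model is unchanged: a strlen-style while-loop-vectorized loop only consumes whatever vl the fault-only-first load returns, so any trim point >= 1 behaves identically (register usage is illustrative only):

    strlen_v:
        mv a3, a0                        # a0 = start of string; save it
    loop:
        vsetvli a1, x0, e8, m8, ta, ma   # ask for as many bytes as possible
        vle8ff.v v8, (a3)                # fault-only-first load; may trim vl anywhere >= 1
        csrr a1, vl                      # bytes actually loaded this pass
        vmseq.vi v0, v8, 0               # flag any NUL bytes
        vfirst.m a2, v0                  # index of first NUL, or -1 if none
        add a3, a3, a1                   # advance by the trimmed vl
        bltz a2, loop                    # no NUL seen yet: keep going
        sub a3, a3, a1                   # back up to the start of the last chunk
        add a3, a3, a2                   # address of the NUL byte
        sub a0, a3, a0                   # length = NUL address - start
        ret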

I'm not sure it will really help perf that much in practice. While
it's easy to construct an example where it looks like it would help, I
think in general most loops touch multiple vector operands, hardware
prefetchers do well on vector streams, vector units are more efficient
on larger chunks, scatter-gathers missing in cache limit perf anyway,
etc., so it's probably a fairly brittle optimization (yes, you could
add a predictor to figure out whether to wait for the other elements
or go ahead with a partial vector result - I'm sure there's probably
papers out there with this already).

Krste

On Fri, 16 Oct 2020 04:03:17 +0000, "Bill Huffman" <huffman@...> said:
| My take is the same as Andrew has outlined below.
| Bill

| On 10/15/20 4:30 PM, andrew@... wrote:

| EXTERNAL MAIL

| Forwarding this to tech-vector-ext; couple comments below.

| On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@...> wrote:

| In vector meeting last Friday I listened to both Krste and David Horner's different opinions about fault-on-first and vector length trimming. I realized (and may have
| convinced other attendees) that the RISC-V "fault-on-first" vector length trimming need not be done just for things like page-faults.

| Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed, because vector element zero is the forward progress
| mechanism.

| Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the element zero of the operation, but that it can randomly stop with the appropriate
| indication for vector length trimming at any point in the middle of the instruction.

| Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values). Enumerating all cases in which the length might be
| trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.

| This is part of what David Horner wants. However, it does not give him the fault-on-first with zero length complete mechanism. It could, if there were something else in
| the system that guaranteed forward progress

| My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization
| model.

| ---+ Expanded

| From vector meeting last Friday: trimming, fault-on-first. I realized that it is similar to the forms of SW visible non-faulting speculative loads some machines, especially
| VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The NaT-ness is implied by trimmed vector length.
| It could be implied by a mask showing which vector operations had completed.

| All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison. Software control flow must fall through such a
| check operation, and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different instructions, so there must be a branch.

| The RISC-V Fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction. i.e. it is its own non-faulting check. it gets away with
| this because the trimmed vector length indicates which parts were valid and which were not. Forward progress is guaranteed by trapping on vector element zero, i.e. never allowing a trim
| to zero length. If a non-faulting vector approach were used instead of fault-on-first, it could return a vector complete mask, but to make forward progress it would have to
| guarantee that at least one vector element had completed.

| David Horner's desire for fault-on-first that may have performed no operations at all is (1) reasonable IMHO (I think I managed to explain that to Krste), but (2) would
| require some other mechanism for forward progress. E.g. instead of trapping on element zero, the bitmask that I described above. Which is almost certainly a bigger
| architectural change than RISC-V should make at this time.

| Although more and more I am happier that I included such a completion bitmask in nearly every vector instruction set that I've ever done. Particularly those vector instruction
| sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what amounts to a vector instruction set and
| microarchitecture. https://pharr.org/matt/papers/ispc_inpar_2012.pdf ). It would be unfortunate for such an SIMT program to lose work completed after the first fault.

| MORAL: fault-on-first may be suitable for vector load that might speculate past the end of the vector - where the length is not known or inconvenient when the vector load
| instruction is started. Fault-on-first is suboptimal for running SIMT on top of vectors. i.e. fault-on-first is the equivalent of precise exceptions for in order
| execution, and for a single thread executing vector instructions, whereas completion mask allows out of order within a vector and/or vector length threading.

| IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is totally fine to have the fault-on-first stuff
| return up to the first really long latency cache miss, as long as it always guarantees that at least vector element zero was complete. Because vector element zero complete
| is what guarantees forward progress.

| Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could actually choose to actually implement a page-fault that did
| copy-on-write or swapped in from disk. but that would be visible to the operating system, not the user program. However, such an OS implementation would have to
| guarantee that it would not kill a process as a result of a true permissions error page-fault. Or, if the virtual memory architecture made the distinction between
| permissions faults and the sorts of page-fault that is for disk swapping or copy-on-write or copy on read, the OS does not need to be involved.

| EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole. (Unless you trim only on true faults and not COW
| or COR or disk swap page-faults). However, fault-on-first on any page-fault is a much lower-bandwidth information leak channel than is fault-on-first on long latency
| cache misses. so a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss.
| However, there are some systems for which that sort of security issue is not a concern. E.g. a data center or embedded system where all of the CPUs are dedicated to a single
| problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to them!

| Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually much less of a virtualization hole than
| fault-on-first for page-faults. The operating system or hypervisor has very little control over cache misses. the OS and hypervisor have almost full control over
| page-faults. The usual rule in security and virtualization is that an application should not be able to detect that it has had an "innocent" page-fault, such as COW or COR
| or disk swapping.

| --
| --- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis

|


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Bill Huffman
 

My take is the same as Andrew has outlined below.

      Bill

On 10/15/20 4:30 PM, andrew@... wrote:

EXTERNAL MAIL

Forwarding this to tech-vector-ext; couple comments below.

On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@...> wrote:

In vector meeting last Friday  I listened to both Krste and David Horner's  different opinions about fault-on-first and vector length trimming. I realized (and may have convinced other attendees) that the  RISC-V "fault-on-first"  vector length trimming need not be done just for things like page-faults.

Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed,  because vector element zero is the forward progress mechanism.

Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the  element zero of the operation,  but that it can randomly stop with the appropriate indication for vector length  trimming at any point in the middle of the instruction.

Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values).  Enumerating all cases in which the length might be trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.

This is part of what David Horner wants.   However, it does not give him the  fault-on-first with zero length complete mechanism.   It could, if there were something else in the system that guaranteed forward progress

My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization model.

---+ Expanded

 

From vector meeting last Friday: trimming, fault-on-first.  I realized that it is similar to the forms of SW visible non-faulting speculative loads some machines, especially VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The NaT-ness is implied by trimmed vector length.  It could be implied by a mask showing which vector operations had completed.

 

All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison.  Software control flow must fall through such a check operation,  and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different instructions, so there must be a branch.

 

The RISC-V Fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction, i.e. it is its own non-faulting check. It gets away with this because the trimmed vector length indicates which parts were valid and which were not. Forward progress is guaranteed by trapping on vector element zero, i.e. never allowing a trim to zero length. If a non-faulting vector approach were used instead of fault-on-first, it could return a vector complete mask, but to make forward progress it would have to guarantee that at least one vector element had completed.

 

David Horner's desire for fault-on-first that may have performed no operations at all is (1) reasonable IMHO (I think I managed to explain that to Krste), but (2) would require some other mechanism for forward progress, e.g. instead of trapping on element zero, the bitmask that I described above, which is almost certainly a bigger architectural change than RISC-V should make at this time.

 

Although more and more I am happier that I included such a completion bitmask in nearly every vector instruction set that I've ever done, particularly those vector instruction sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what amounts to a vector instruction set and microarchitecture.  https://pharr.org/matt/papers/ispc_inpar_2012.pdf ). It would be unfortunate for such an SIMT program to lose work completed after the first fault.

 

MORAL:  fault-on-first may be suitable for vector load that might speculate past the end of the vector -  where the length is  not known or inconvenient when the vector load instruction is started. Fault-on-first is  suboptimal for running SIMT on top of vectors.   i.e. fault-on-first  is the equivalent of precise exceptions for in order execution,  and for a single thread executing vector instructions, whereas  completion mask  allows out of order within a vector and/or vector length  threading.

 

IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is totally fine to have the fault-on-first stuff return up to the first really long latency cache miss, as long as it always guarantees that at least vector element zero was complete. Because vector element zero complete is what guarantees forward progress.


Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could actually choose to actually implement a page-fault that did copy-on-write or  swapped in from disk.   but that would be visible to the operating system, not the user program.  However, such an OS implementation  would have to guarantee that it would not kill a process as a result  of a true permissions error page-fault. Or, if the virtual memory architecture made the distinction between permissions faults and the sorts of page-fault that is for disk swapping or copy-on-write or copy  on read,  the OS does not need to be involved.


EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole (unless you trim only on true faults and not COW or COR or disk swap page-faults). However, fault-on-first on any page-fault is a much lower-bandwidth information leak channel than is fault-on-first on long latency cache misses, so a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss. However, there are some systems for which that sort of security issue is not a concern, e.g. a data center or embedded system where all of the CPUs are dedicated to a single problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to them!


Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually  much less of a virtualization hole than fault-on-first for page-faults.   The operating system or hypervisor has very little control over cache misses.  the OS and hypervisor have almost full control over page-faults.  The usual rule in security and virtualization is that an application should not be able to detect that it has had an "innocent"  page-fault, such as COW or COR or disk swapping.

 

 

 

--
--- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis


Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Andy Glew Si5
 



>> This is part of what David Horner wants.   However, it does not give him the  fault-on-first with zero length complete mechanism.   It could, if there were something else in the system that guaranteed forward progress
>
> My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization model.


Yep, it's sufficient for the needs of while-loop vectorization.

It is suboptimal for "SIMT on vector".  For that you need a completion mask, and it is far too late to add that to the RISC-V vector spec.



Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

andrew@...
 

Forwarding this to tech-vector-ext; couple comments below.

On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@...> wrote:

In vector meeting last Friday  I listened to both Krste and David Horner's  different opinions about fault-on-first and vector length trimming. I realized (and may have convinced other attendees) that the  RISC-V "fault-on-first"  vector length trimming need not be done just for things like page-faults.

Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed,  because vector element zero is the forward progress mechanism.

Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes the  element zero of the operation,  but that it can randomly stop with the appropriate indication for vector length  trimming at any point in the middle of the instruction.

Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values).  Enumerating all cases in which the length might be trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.

This is part of what David Horner wants.   However, it does not give him the  fault-on-first with zero length complete mechanism.   It could, if there were something else in the system that guaranteed forward progress

My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization model.

---+ Expanded

 

From vector meeting last Friday: trimming, fault-on-first.  I realized that it is similar to the forms of SW visible non-faulting speculative loads some machines, especially VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The NaT-ness is implied by trimmed vector length.  It could be implied by a mask showing which vector operations had completed.

 

All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison.  Software control flow must fall through such a check operation,  and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different instructions, so there must be a branch.

 

The RISC-V Fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction, i.e. it is its own non-faulting check. It gets away with this because the trimmed vector length indicates which parts were valid and which were not. Forward progress is guaranteed by trapping on vector element zero, i.e. never allowing a trim to zero length. If a non-faulting vector approach were used instead of fault-on-first, it could return a vector complete mask, but to make forward progress it would have to guarantee that at least one vector element had completed.

 

David Horner's desire for fault-on-first that may have performed no operations at all is (1) reasonable IMHO (I think I managed to explain that to Krste), but (2) would require some other mechanism for forward progress, e.g. instead of trapping on element zero, the bitmask that I described above, which is almost certainly a bigger architectural change than RISC-V should make at this time.

 

Although more and more I am happier that I included such a completion bitmask in nearly every vector instruction set that I've ever done, particularly those vector instruction sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what amounts to a vector instruction set and microarchitecture.  https://pharr.org/matt/papers/ispc_inpar_2012.pdf ). It would be unfortunate for such an SIMT program to lose work completed after the first fault.

 

MORAL:  fault-on-first may be suitable for vector load that might speculate past the end of the vector -  where the length is  not known or inconvenient when the vector load instruction is started. Fault-on-first is  suboptimal for running SIMT on top of vectors.   i.e. fault-on-first  is the equivalent of precise exceptions for in order execution,  and for a single thread executing vector instructions, whereas  completion mask  allows out of order within a vector and/or vector length  threading.

 

IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is totally fine to have the fault-on-first stuff return up to the first really long latency cache miss, as long as it always guarantees that at least vector element zero was complete. Because vector element zero complete is what guarantees forward progress.


Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could actually choose to actually implement a page-fault that did copy-on-write or  swapped in from disk.   but that would be visible to the operating system, not the user program.  However, such an OS implementation  would have to guarantee that it would not kill a process as a result  of a true permissions error page-fault. Or, if the virtual memory architecture made the distinction between permissions faults and the sorts of page-fault that is for disk swapping or copy-on-write or copy  on read,  the OS does not need to be involved.


EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole (unless you trim only on true faults and not COW or COR or disk swap page-faults). However, fault-on-first on any page-fault is a much lower-bandwidth information leak channel than is fault-on-first on long latency cache misses, so a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss. However, there are some systems for which that sort of security issue is not a concern, e.g. a data center or embedded system where all of the CPUs are dedicated to a single problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to them!


Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually  much less of a virtualization hole than fault-on-first for page-faults.   The operating system or hypervisor has very little control over cache misses.  the OS and hypervisor have almost full control over page-faults.  The usual rule in security and virtualization is that an application should not be able to detect that it has had an "innocent"  page-fault, such as COW or COR or disk swapping.

 

 

 

--
--- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis


Vector TG meeting today

Krste Asanovic
 

Per calendar instructions, in usual time slot,

Proposed agenda:

  • #560 vmulh rounding mode
  • #576 vlsegff exception behavior
  • #550 names/contents of initial vector subsets
  • #568 disabling/context switching vector unit


Krste


Updated Event: Vector Extension Task Group Meeting #cal-invite

tech-vector-ext@lists.riscv.org Calendar <noreply@...>
 

Vector Extension Task Group Meeting

When:
Friday, 12 June 2020
8:00am to 9:00am
(UTC-07:00) America/Los Angeles
Repeats: Weekly on Friday, through Thursday, 8 October 2020

Organizer: Krste Asanovic krste@...

Description:
DO NOT USE THIS CALENDAR ENTRY.
USE THE GOOGLE CALENDAR FOR MEETING INFORMATION.