
Re: [RISC-V] [tech-cmo] Fault-on-first should be allowed to return randomly on non-faults (also, running SIMT code on vector ISA)

Andrew Waterman
 

Forwarding this to tech-vector-ext; couple comments below.

On Thu, Oct 15, 2020 at 2:33 PM Andy Glew Si5 <andy.glew@...> wrote:

In the vector meeting last Friday I listened to both Krste's and David Horner's different opinions about fault-on-first and vector length trimming. I realized (and may have convinced other attendees) that the RISC-V "fault-on-first" vector length trimming need not be done just for things like page-faults.

Fault-on-first could be done for the first long latency cache miss, as long as vector element zero has been completed, because vector element zero is the forward progress mechanism.

Indeed, IMHO the correct semantic requirement for fault-on-first is that it completes element zero of the operation, but that it can randomly stop, with the appropriate indication for vector length trimming, at any point in the middle of the instruction.

Indeed, I've found other microarchitectural reasons to favor this approach (e.g., speculating through mask-register values).  Enumerating all cases in which the length might be trimmed seems like a fool's errand, so just saying it can be truncated to >= 1 for any reason is the way to go.

This is part of what David Horner wants. However, it does not give him the fault-on-first with zero-length completion mechanism. It could, if there were something else in the system that guaranteed forward progress.

My take is that requiring that element 0 either complete or trap is already a solid mechanism for guaranteeing forward progress, and cleanly matches the while-loop vectorization model.
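
To make the "element 0 completes or traps, anything after that may be trimmed" rule concrete, here is a minimal sketch. It is a toy model and not taken from the spec: the names load_ff and strlen_ff, the byte-array "memory", and the use of a random number generator to stand in for arbitrary microarchitectural trimming are all assumptions made for illustration.

```python
import random


def load_ff(memory, base, vl, rng):
    """Toy model of a unit-stride fault-on-first byte load.

    memory: a bytes object; any address >= len(memory) is treated as faulting.
    Returns (elements, new_vl). Element 0 either completes or traps.
    """
    if base >= len(memory):
        raise MemoryError("trap: element 0 faulted")
    # Number of elements that could be loaded without faulting (>= 1 here).
    limit = min(vl, len(memory) - base)
    # The implementation may stop early for any reason, but never before
    # element 0 has completed, so new_vl is always >= 1.
    new_vl = rng.randint(1, limit)
    return list(memory[base:base + new_vl]), new_vl


def strlen_ff(memory, start, vlmax=8, seed=0):
    """While-loop style strlen built on the toy load_ff above."""
    rng = random.Random(seed)
    pos = start
    while True:
        elems, vl = load_ff(memory, pos, vlmax, rng)
        for i, byte in enumerate(elems):
            if byte == 0:
                return pos + i - start
        pos += vl  # vl >= 1 guarantees forward progress
```

Whatever trimming the random generator picks, strlen_ff returns the same answer; only the number of loop iterations changes. That is the sense in which software does not need to know why the length was trimmed.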

---+ Expanded

 

From vector meeting last Friday: trimming, fault-on-first.  I realized that it is similar to the forms of SW visible non-faulting speculative loads some machines, especially VLIWs, have. However, instead of delivering a NaN or NaT, it is non-faulting except for vector element 0, where it faults. The NaT-ness is implied by trimmed vector length.  It could be implied by a mask showing which vector operations had completed.

 

All such SW non-faulting loads need a "was this correct" operation, which might just be a faulting load and a comparison.  Software control flow must fall through such a check operation,  and through a redo of the faulting load if necessary. In scalar, non-faulting and faulting loads are different instructions, so there must be a branch.

 

The RISC-V fault-on-first approach has the correctness check for non-faulting implied by redoing the instruction, i.e. it is its own non-faulting check. It gets away with this because the trimmed vector length indicates which parts were valid and which were not. Forward progress is guaranteed by trapping on vector element zero, i.e. never allowing a trim to zero length. If a non-faulting vector approach were used instead of fault-on-first, it could return a vector-complete mask, but to make forward progress it would have to guarantee that at least one vector element had completed.

 

David Horner's desire for fault-on-first that may have performed no operations at all is (1) reasonable IMHO (I think I managed to explain that to Krste), but (2) would require some other mechanism for forward progress, e.g. instead of trapping on element zero, the bitmask that I described above. That is almost certainly a bigger architectural change than RISC-V should make at this time.

 

Although more and more I am happy that I included such a completion bitmask in nearly every vector instruction set that I've ever done, particularly those vector instruction sets that were supposed to implement SIMT efficiently. (I think of SIMT as a programming model that is implemented on top of what amounts to a vector instruction set and microarchitecture. https://pharr.org/matt/papers/ispc_inpar_2012.pdf ). It would be unfortunate for such an SIMT program to lose work completed after the first fault.

 

MORAL: fault-on-first may be suitable for vector loads that might speculate past the end of the vector, where the length is not known or is inconvenient to compute when the vector load instruction is started. Fault-on-first is suboptimal for running SIMT on top of vectors, i.e. fault-on-first is the equivalent of precise exceptions for in-order execution, and for a single thread executing vector instructions, whereas a completion mask allows out-of-order within a vector and/or vector length threading.
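
A small sketch of the difference between the two reporting styles, under stated assumptions: the per-element "faults" list and both function names are hypothetical, and neither function models any real instruction.

```python
def fault_on_first(faults):
    """Trimmed vl: the number of elements before the first faulting one.

    A fault on element 0 traps instead of trimming to zero length.
    """
    if faults[0]:
        raise MemoryError("trap on element 0")
    for i, faulted in enumerate(faults):
        if faulted:
            return i
    return len(faults)


def completion_mask(faults):
    """Per-element mask: True where the element completed."""
    return [not faulted for faulted in faults]


# Element 2 faults; elements 3..5 would have completed.
faults = [False, False, True, False, False, False]
kept_by_trimming = fault_on_first(faults)      # prefix only
kept_by_mask = sum(completion_mask(faults))    # every completed element
```

With these inputs the trimmed length keeps 2 elements while the mask keeps 5, which is the work a SIMT-style program would lose after the first fault.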

 

IMHO an important realization I made in that meeting is that fault-on-first does not need to be just about faulting. It is totally fine to have the fault-on-first stuff return up to the first really long latency cache miss, as long as it always guarantees that at least vector element zero has completed, because completing vector element zero is what guarantees forward progress.


Furthermore, it is not even required that fault-on-first stop at the first page-fault. An implementation could choose to actually implement a page-fault that did copy-on-write or swapped in from disk. That would be visible to the operating system, but not to the user program. However, such an OS implementation would have to guarantee that it would not kill a process as a result of a true permissions error page-fault. Or, if the virtual memory architecture made the distinction between permissions faults and the sorts of page-fault that are for disk swapping or copy-on-write or copy-on-read, the OS does not need to be involved.


EVERYTHING about fault-on-first is a microarchitecture security/information leak channel and/or a virtualization hole. (Unless you trim only on true faults and not on COW, COR, or disk-swap page-faults.) However, fault-on-first on any page-fault is a much lower bandwidth information leak channel than is fault-on-first on long latency cache misses. So a general purpose system might choose to implement fault-on-first on any page-fault, but might not want to implement fault-on-first on any cache miss. However, there are some systems for which that sort of security issue is not a concern, e.g. a data center or embedded system where all of the CPUs are dedicated to a single problem. In which case, if they can gain performance by doing fault-on-first on particular long latency cache misses, power to them!


Interestingly, although fault-on-first on long latency cache misses is a high-bandwidth information leak, it is actually much less of a virtualization hole than fault-on-first for page-faults. The operating system or hypervisor has very little control over cache misses. The OS and hypervisor have almost full control over page-faults. The usual rule in security and virtualization is that an application should not be able to detect that it has had an "innocent" page-fault, such as COW or COR or disk swapping.

 

 

 

--
--- Sorry: Typos (Speech-Os?) Writing Errors <= Speech Recognition <= Computeritis


Vector TG meeting today

Krste Asanovic
 

Per calendar instructions, in usual time slot,

Proposed agenda:

  • #560 vmulh rounding mode
  • #576 vlsegff exception behavior
  • #550 names/contents of initial vector subsets
  • #568 disabling/context switching vector unit


Krste


Updated Event: Vector Extension Task Group Meeting #cal-invite

tech-vector-ext@lists.riscv.org Calendar <noreply@...>
 

Vector Extension Task Group Meeting

When:
Friday, 12 June 2020
8:00am to 9:00am
(UTC-07:00) America/Los Angeles
Repeats: Weekly on Friday, through Thursday, 8 October 2020

Organizer: Krste Asanovic krste@...

Description:
DO NOT USE THIS CALENDAR ENTRY.
USE THE GOOGLE CALENDAR FOR MEETING INFORMATION.


Re: Clarification on vid.v

Joseph Rahmeh <joseph.rahmeh@...>
 

Done.  Thanks Roger.

 

From: Roger Espasa <roger.espasa@...>
Date: Sunday, October 4, 2020 at 12:49 PM
To: Joseph Rahmeh <Joseph.Rahmeh@...>
Cc: "tech-vector-ext@..." <tech-vector-ext@...>, Robert Golla <Robert.Golla@...>, Cohen Steed <Cohen.Steed@...>, Christopher Olson <Christopher.Olson@...>, Matthew Smittle <Matthew.Smittle@...>
Subject: Re: [RISC-V] [tech-vector-ext] Clarification on vid.v

 

CAUTION: This email originated from outside of Western Digital. Do not click on links or open attachments unless you recognize the sender and know that the content is safe.

 

Joseph,

 

May I suggest you open a git issue here: https://github.com/riscv/riscv-v-spec/issues with these two questions? It will help better tracking and will ensure whatever the resolution is, it does make it into the spec.

 

roger.

 

On Sat, Oct 3, 2020 at 8:09 PM Joseph Rahmeh <joseph.rahmeh@...> wrote:

 

Should vid.v raise an illegal instruction exception when masked and when the destination group overlaps v0 ?

Should vid.v raise an illegal instruction exception when vstart > 0 ?


Re: Clarification on vid.v

Roger Espasa
 

Joseph,

May I suggest you open a git issue here: https://github.com/riscv/riscv-v-spec/issues with these two questions? It will help better tracking and will ensure whatever the resolution is, it does make it into the spec.

roger.

On Sat, Oct 3, 2020 at 8:09 PM Joseph Rahmeh <joseph.rahmeh@...> wrote:

 

Should vid.v raise an illegal instruction exception when masked and when the destination group overlaps v0 ?

Should vid.v raise an illegal instruction exception when vstart > 0 ?


Clarification on vid.v

Joseph Rahmeh <joseph.rahmeh@...>
 

 

Should vid.v raise an illegal instruction exception when masked and when the destination group overlaps v0 ?

Should vid.v raise an illegal instruction exception when vstart > 0 ?


Apologies - zoom on again if people can make

Krste Asanovic
 

Krste


Re: Vector TG meeting tomorrow

David Horner
 

I will add my thoughts related to "embedded" imprecise:

Why embedded specifically?

Linux handles GPUs as coprocessors. My understanding is that by their nature, the internal state of most GPUs is not precisely known, but it can be and is managed as a black box.

What is distinct or special about embedded that makes embedded more [or less] inclined/susceptible  to vector imprecise state?

What makes our vector engine less manageable/usable/precise if the unit has to be managed as a black box?

It is important to make a distinction between

   - precise identification of the element/component requiring intervention (whether that be emulation/reconfiguring the system for instruction completion, etc.) and

  - precise state of all the components in the coprocessor/attached-device, etc. is readily known/discoverable.

I don't know if we are only discussing "imprecise" as defined in the Vector spec.

(There is an issue open about that that is not embedded specific. https://github.com/riscv/riscv-v-spec/issues/364)

Or are we expanding from 1 the number of vector registers that can be affected by concurrent vector operations?

(#364 again).



On 2020-10-02 5:36 a.m., David Horner via lists.riscv.org wrote:

Is there already a doc/issue specific to this imprecise handling that we can reference before and during the meeting?

On Fri, Oct 2, 2020, 03:46 Krste Asanovic, <krste@...> wrote:
Reminder we’ll be meeting tomorrow in usual slot.

I’d like to spend the time discussing imprecise trap handling for embedded vector systems.

Hopefully, we can all see the new correct link on Google Calendar for meeting info.

I replaced old groups.io calendar entry with message not to use this entry.

Krste



Re: Vector TG meeting tomorrow

David Horner
 

Is there already a doc/issue specific to this imprecise handling that we can reference before and during the meeting?


On Fri, Oct 2, 2020, 03:46 Krste Asanovic, <krste@...> wrote:
Reminder we’ll be meeting tomorrow in usual slot.

I’d like to spend the time discussing imprecise trap handling for embedded vector systems.

Hopefully, we can all see the new correct link on Google Calendar for meeting info.

I replaced old groups.io calendar entry with message not to use this entry.

Krste



Vector TG meeting tomorrow

Krste Asanovic
 

Reminder we’ll be meeting tomorrow in usual slot.

I’d like to spend the time discussing imprecise trap handling for embedded vector systems.

Hopefully, we can all see the new correct link on Google Calendar for meeting info.

I replaced old groups.io calendar entry with message not to use this entry.

Krste



Re: Proposing more portable vector cod

David Horner
 

Never say never.
Appears to be the mantra for V extension. 


On Tue, Sep 29, 2020, 15:06 Nick Knight, <nick.knight@...> wrote:
Hi Joseph,

Thanks for the clarification.

The wording in the spec is admittedly vague: "LMUL can have integer values 1,2,4,8.", etc. My understanding of the intent is that all implementations must support the full range of LMUL values.

Yes, the intent is that the V specification mandates LMUL of 8, 4 and 1.
Even for minimal systems of VLEN=128; not only for interoperability, but because it provides a substantial functional benefit.

Future extension to larger LMUL, commensurate with expansion of the register set to more than 32, will likely continue to trap if the supported LMUL is exceeded; however, there is then an opportunity to press for auto vl sizing.

Auto vl sizing was previously discussed for all ops, not just first fault loads, nor just for widening. It is not currently on the agenda for v1.0 release.

However, it is expected that components of the V extension would be separately implementable,  verifiable and certifiable.
A reduction of LMUL could be allowed in that context.
Even an expansion of EMUL to 2*max LMUL is still on the table for post v1.0.
Joseph, thank you so much for your insightful input.


I'll defer to others to confirm or deny this.
And thanks to you Nick for replying (before I did).

Best,
Nick

...

I agree that we could make widening instructions more flexible by having them decrease VL (and LMUL) so that EMUL becomes valid. The fault-first loads adjust VL automatically, so this is not without some precedent. However, In my opinion, it's too much of a burden to do this manually (using vsetvli), and I don't see any portability issues with that.

 



Re: Proposing more portable vector cod

Nick Knight
 

Hi Joseph,

Thanks for the clarification.

The wording in the spec is admittedly vague: "LMUL can have integer values 1,2,4,8.", etc. My understanding of the intent is that all implementations must support the full range of LMUL values.

I'll defer to others to confirm or deny this.

Best,
Nick


On Tue, Sep 29, 2020 at 11:42 AM Joseph Rahmeh <Joseph.Rahmeh@...> wrote:

Hi Nick,

 

Thanks for the reply.  I was not asking for non-power of 2 LMUL.  I was asking about LMUL values not supported by some implementation.

 

Let’s say that for SEW=128, an implementation supports LMUL=1 but no other values of LMUL.  If the software tries to use a widening operation with SEW=64 and LMUL=1, then the EEW for the wide operand will be 128 and the EMUL will be 2.  The EEW/EMUL combination (128/2) is not supported on this implementation.  However, 128/1 is.  If we reduce VL instead of taking an illegal instruction exception, the code will work. 

 

Joe

 

From: Nick Knight <nick.knight@...>
Date: Tuesday, September 29, 2020 at 1:31 PM
To: Joseph Rahmeh <Joseph.Rahmeh@...>
Cc: "tech-vector-ext@..." <tech-vector-ext@...>, Robert Golla <Robert.Golla@...>, Cohen Steed <Cohen.Steed@...>, Christopher Olson <Christopher.Olson@...>, Matthew Smittle <Matthew.Smittle@...>, Ajay Ingle <Ajay.Ingle@...>
Subject: Re: [RISC-V] [tech-vector-ext] Proposing more portable vector cod

 


Hi Joseph,

 

Thanks for your comments. I apologize, but I don't fully understand your proposal, or the problem it solves. To help explain my confusion, here are two thoughts.

 

The supported LMUL (and EMUL) values are 2^k (k = -3:3) on all implementations, so software requesting EMUL > 8 is illegal everywhere.

 

I agree that we could make widening instructions more flexible by having them decrease VL (and LMUL) so that EMUL becomes valid. The fault-first loads adjust VL automatically, so this is not without some precedent. However, In my opinion, it's too much of a burden to do this manually (using vsetvli), and I don't see any portability issues with that.

 

Best,

Nick Knight

 

On Tue, Sep 29, 2020 at 9:32 AM Joseph Rahmeh <Joseph.Rahmeh@...> wrote:

 

In the latest vector proposal (draft of version 1.0), there is the following restriction on widening instructions (section 11.2)

 

For all widening instructions, the destination EEW and EMUL values must be a supported configuration, otherwise an illegal instruction exception is raised.

 

This seems unduly restrictive and will limit software portability.  If the destination EEW is supported but EMUL is not, it would improve code portability if strip-mining reduces VL accordingly instead of raising an exception.

 

Similarly,  code would be more portable, if any proposed combination of SEW/LMUL is replaced by SEW/LMUL2 if SEW is supported and LMUL is not.  LMUL2 would be the highest supported group multiplier for the given SEW.

 

 


Re: Proposing more portable vector cod

joseph.rahmeh@...
 

Hi Nick,

 

Thanks for the reply.  I was not asking for non-power of 2 LMUL.  I was asking about LMUL values not supported by some implementation.

 

Let’s say that for SEW=128, an implementation supports LMUL=1 but no other values of LMUL.  If the software tries to use a widening operation with SEW=64 and LMUL=1, then the EEW for the wide operand will be 128 and the EMUL will be 2.  The EEW/EMUL combination (128/2) is not supported on this implementation.  However, 128/1 is.  If we reduce VL instead of taking an illegal instruction exception, the code will work. 
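
The arithmetic in this example can be sketched as follows. This is only an illustration: the set of supported (EEW, EMUL) pairs is the hypothetical implementation described above, and the helper names are made up here.

```python
from fractions import Fraction


def widening_dest(sew, lmul):
    """Destination EEW and EMUL of a widening operation: both are doubled."""
    return 2 * sew, 2 * Fraction(lmul)


def is_supported(eew, emul, supported):
    """Check an (EEW, EMUL) pair against a set of supported pairs."""
    return (eew, Fraction(emul)) in supported


# Hypothetical implementation: SEW=64 and SEW=128 are supported at LMUL=1 only.
supported = {(64, Fraction(1)), (128, Fraction(1))}

eew, emul = widening_dest(64, 1)               # wide operand: EEW=128, EMUL=2
legal_today = is_supported(eew, emul, supported)
legal_if_reduced = is_supported(eew, 1, supported)
```

Here legal_today is False (128/2 is not a supported configuration) while legal_if_reduced is True (128/1 is), which is the case where reducing VL would let the code run.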

 

Joe

 

From: Nick Knight <nick.knight@...>
Date: Tuesday, September 29, 2020 at 1:31 PM
To: Joseph Rahmeh <Joseph.Rahmeh@...>
Cc: "tech-vector-ext@..." <tech-vector-ext@...>, Robert Golla <Robert.Golla@...>, Cohen Steed <Cohen.Steed@...>, Christopher Olson <Christopher.Olson@...>, Matthew Smittle <Matthew.Smittle@...>, Ajay Ingle <Ajay.Ingle@...>
Subject: Re: [RISC-V] [tech-vector-ext] Proposing more portable vector cod

 


Hi Joseph,

 

Thanks for your comments. I apologize, but I don't fully understand your proposal, or the problem it solves. To help explain my confusion, here are two thoughts.

 

The supported LMUL (and EMUL) values are 2^k (k = -3:3) on all implementations, so software requesting EMUL > 8 is illegal everywhere.

 

I agree that we could make widening instructions more flexible by having them decrease VL (and LMUL) so that EMUL becomes valid. The fault-first loads adjust VL automatically, so this is not without some precedent. However, In my opinion, it's too much of a burden to do this manually (using vsetvli), and I don't see any portability issues with that.

 

Best,

Nick Knight

 

On Tue, Sep 29, 2020 at 9:32 AM Joseph Rahmeh <Joseph.Rahmeh@...> wrote:

 

In the latest vector proposal (draft of version 1.0), there is the following restriction on widening instructions (section 11.2)

 

For all widening instructions, the destination EEW and EMUL values must be a supported configuration, otherwise an illegal instruction exception is raised.

 

This seems unduly restrictive and will limit software portability.  If the destination EEW is supported but EMUL is not, it would improve code portability if strip-mining reduces VL accordingly instead of raising an exception.

 

Similarly,  code would be more portable, if any proposed combination of SEW/LMUL is replaced by SEW/LMUL2 if SEW is supported and LMUL is not.  LMUL2 would be the highest supported group multiplier for the given SEW.

 

 


Re: Proposing more portable vector cod

Nick Knight
 

Sorry, in case it wasn't clear: typo

On Tue, Sep 29, 2020 at 11:30 AM Nick Knight <nick.knight@...> wrote:
However, In my opinion, it's too much of a burden to do this manually (using vsetvli),

it's not too much of a burden.

Best,
Nick Knight

On Tue, Sep 29, 2020 at 9:32 AM Joseph Rahmeh <Joseph.Rahmeh@...> wrote:

 

In the latest vector proposal (draft of version 1.0), there is the following restriction on widening instructions (section 11.2)

 

For all widening instructions, the destination EEW and EMUL values must be a supported configuration, otherwise an illegal instruction exception is raised.

 

This seems unduly restrictive and will limit software portability.  If the destination EEW is supported but EMUL is not, it would improve code portability if strip-mining reduces VL accordingly instead of raising an exception.

 

Similarly,  code would be more portable, if any proposed combination of SEW/LMUL is replaced by SEW/LMUL2 if SEW is supported and LMUL is not.  LMUL2 would be the highest supported group multiplier for the given SEW.

 

 


Re: Proposing more portable vector cod

Nick Knight
 

Hi Joseph,

Thanks for your comments. I apologize, but I don't fully understand your proposal, or the problem it solves. To help explain my confusion, here are two thoughts.

The supported LMUL (and EMUL) values are 2^k (k = -3:3) on all implementations, so software requesting EMUL > 8 is illegal everywhere.

I agree that we could make widening instructions more flexible by having them decrease VL (and LMUL) so that EMUL becomes valid. The fault-first loads adjust VL automatically, so this is not without some precedent. However, In my opinion, it's too much of a burden to do this manually (using vsetvli), and I don't see any portability issues with that.

Best,
Nick Knight


On Tue, Sep 29, 2020 at 9:32 AM Joseph Rahmeh <Joseph.Rahmeh@...> wrote:

 

In the latest vector proposal (draft of version 1.0), there is the following restriction on widening instructions (section 11.2)

 

For all widening instructions, the destination EEW and EMUL values must be a supported configuration, otherwise an illegal instruction exception is raised.

 

This seems unduly restrictive and will limit software portability.  If the destination EEW is supported but EMUL is not, it would improve code portability if strip-mining reduces VL accordingly instead of raising an exception.

 

Similarly,  code would be more portable, if any proposed combination of SEW/LMUL is replaced by SEW/LMUL2 if SEW is supported and LMUL is not.  LMUL2 would be the highest supported group multiplier for the given SEW.

 

 


Proposing more portable vector cod

Joseph Rahmeh <Joseph.Rahmeh@...>
 

 

In the latest vector proposal (draft of version 1.0), there is the following restriction on widening instructions (section 11.2)

 

For all widening instructions, the destination EEW and EMUL values must be a supported configuration, otherwise an illegal instruction exception is raised.

 

This seems unduly restrictive and will limit software portability.  If the destination EEW is supported but EMUL is not, it would improve code portability if strip-mining reduces VL accordingly instead of raising an exception.

 

Similarly,  code would be more portable, if any proposed combination of SEW/LMUL is replaced by SEW/LMUL2 if SEW is supported and LMUL is not.  LMUL2 would be the highest supported group multiplier for the given SEW.
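
One way to read the LMUL2 suggestion is sketched below. It is not spec behavior, only a model of the proposal as worded: the "supported" table, the function names, and the VLEN value in the example are assumptions for illustration.

```python
def effective_lmul(sew, requested_lmul, supported):
    """Pick the group multiplier under the proposed fallback rule.

    supported: dict mapping each supported SEW to its set of supported LMULs.
    If the requested LMUL is unsupported for a supported SEW, fall back to
    LMUL2, the highest supported LMUL for that SEW.
    """
    lmuls = supported.get(sew)
    if not lmuls:
        raise ValueError("illegal instruction: SEW not supported")
    if requested_lmul in lmuls:
        return requested_lmul
    return max(lmuls)


def vlmax(vlen, sew, lmul):
    """Maximum vector length for a configuration: VLEN * LMUL / SEW."""
    return vlen * lmul // sew


# Hypothetical implementation: SEW=128 only at LMUL=1, SEW=64 at LMUL 1 or 2.
supported = {128: {1}, 64: {1, 2}}
lmul2 = effective_lmul(128, 2, supported)      # falls back to LMUL=1
```

Strip-mined code that takes its vl from the reduced configuration, for example vlmax(256, 128, lmul2) with an assumed VLEN of 256, would then just run more loop iterations instead of trapping.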

 

 


Re: Vector TG meeting minutes 2020/9/25

David Horner
 

On 2020-09-26 7:15 p.m., Krste Asanovic wrote:
Date: 2020/9/25
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~14
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed

#551 Memory consistency model for scalar loads and vector loads

In current PoR, RVWMO memory model requires that scalar loads and
vector loads from same hart to same address are ordered following
program order. Proposal is to weaken this requirement so that scalar
loads and vector loads to the same address can be reordered,
simplifying implementations, except for ordered gathers. In
particular, the requirement for a younger scalar load to not occur
before an older vector gather to same address requires that the scalar
load wait (or speculates) to determine vector gather addresses.

Discussion centered around how much of an impact this would have on
software, and on constructing a case where the change would impact
software. In almost all cases where the scalar access is used to read
a signaling value from another hart, a FENCE would anyway be required
for correct operation as the synchronization would be associated with
the communication of more than one atomic word of memory. Only in the
case where the signal is part of an atomically written word of memory
(8 bytes max in current spec), and where the vector read is used to
read the same word (perhaps as a vector of bytes) might this cause an
issue. This was felt to be relatively rare.

Another worry is when a routine with a sync operation based on a
scalar read of a signaling variable then calls a routine, where the
subroutine is separately compiled and reads the data including the
signaling variable using vectors, there is a possibility that the
vector read will return inconsistent data. In general the caller is
unaware of whether the routine uses scalar or vector reads, and the
subroutine is unaware that the variable was used to communicate
between threads.

While modern programming languages require that access to variables
used to communicate between harts be annotated to ensure correct
compilation, in practice legacy code and incorrect code might fail to
include the correct annotations and have a latent bug.

It was noted there are two directions for the ordering.

sl -> vl: Older scalar load before newer vector load, and
vl -> sl: older vector load before newer scalar load

The sl->vl direction represents the signaling-value-check before
vector computation case and is easiest to implement in hardware as
vector instructions typically access memory later in the pipeline than
scalar instructions.

The vl->sl case is the difficult one to implement at high-performance
but is also easier for software to work around with some form of read
fence (either FENCE or ordered vector access or just scalar read of
affected address).

The sentiment was in favor of weakening the memory ordering constraint
but more discussion was needed. Potentially only the vl->sl
constraint could be weakened.
I am in favour of effectively weakening the scalar/vector vector/scalar load/load order requirement.

However, this cannot be performed in isolation  without regard to the rest of the RVWMO dependency requirements.


RVI has section 14.3 Source and Destination Register Listings, five pages detailing, identifying and categorizing dependencies between implicitly and explicitly opcode-identified persistent stores, including CSRs.

These dependencies form a critical component of the RVWMO specification.

They constrain global memory order for memory data entering and exiting a potentially lengthy sequence of non-memory accessing instructions.

They are also based on an intuitive engine: the hypothetical device that executes instructions in program order, the "hart".

The rules and constraints are crafted to accomplish results that are "strong enough to support programming language memory models".


For Vector extension, we have not yet stipulated what the Vector specific RVVWMO requirements are.

This is a necessary step; it will be instrumental in shaping or tempering the explicit WMO constraints.

To me the pivotal question is what execution model the Vector Engine follows.

Does it need to be constrained to support legacy programming language memory models?

Should it rather be envisioned as a novel model freed from past bondage, or, if not to that extreme, freed from some of those constraints?


A [simple/comprehensive] specific conceptual vector model may eliminate a swath of RVWMO rules.

Specifically, idealizing the vector processor as distinct from the hosting "hart",

    as an autonomous co-processor as far as Memory order is concerned.

    This functions conceptually as a set of independent "hardware threads" coordinating among themselves,

       and also collectively to the host hart to cause the required vector behaviour.

I believe "register" dependency must still be considered, at the element level and not solely named registers.

We should not profess to be "RVWMO, except vl -> sl, and except sl -> vl (except for ordered indexed reads), and in-order precise execution trapping except ..., and ...".

Rather we must define a model that **intuitively** allows all the optimizations we believe are necessary for a first class Vector design.


Vector TG meeting minutes 2020/9/25

Krste Asanovic
 

Date: 2020/9/25
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~14
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed

#551 Memory consistency model for scalar loads and vector loads

In current PoR, RVWMO memory model requires that scalar loads and
vector loads from same hart to same address are ordered following
program order. Proposal is to weaken this requirement so that scalar
loads and vector loads to the same address can be reordered,
simplifying implementations, except for ordered gathers. In
particular, the requirement for a younger scalar load to not occur
before an older vector gather to same address requires that the scalar
load wait (or speculates) to determine vector gather addresses.

Discussion centered around how much of an impact this would have on
software, and on constructing a case where the change would impact
software. In almost all cases where the scalar access is used to read
a signaling value from another hart, a FENCE would anyway be required
for correct operation as the synchronization would be associated with
the communication of more than one atomic word of memory. Only in the
case where the signal is part of an atomically written word of memory
(8 bytes max in current spec), and where the vector read is used to
read the same word (perhaps as a vector of bytes) might this cause an
issue. This was felt to be relatively rare.

Another worry is when a routine with a sync operation based on a
scalar read of a signaling variable then calls a routine, where the
subroutine is separately compiled and reads the data including the
signaling variable using vectors, there is a possibility that the
vector read will return inconsistent data. In general the caller is
unaware of whether the routine uses scalar or vector reads, and the
subroutine is unaware that the variable was used to communicate
between threads.

While modern programming languages require that access to variables
used to communicate between harts be annotated to ensure correct
compilation, in practice legacy code and incorrect code might fail to
include the correct annotations and have a latent bug.

It was noted there are two directions for the ordering.

sl -> vl: Older scalar load before newer vector load, and
vl -> sl: older vector load before newer scalar load

The sl->vl direction represents the signaling-value-check before
vector computation case and is easiest to implement in hardware as
vector instructions typically access memory later in the pipeline than
scalar instructions.

The vl->sl case is the difficult one to implement at high-performance
but is also easier for software to work around with some form of read
fence (either FENCE or ordered vector access or just scalar read of
affected address).

The sentiment was in favor of weakening the memory ordering constraint
but more discussion was needed. Potentially only the vl->sl
constraint could be weakened.
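
As a rough illustration of what weakening the same-address load/load ordering permits, here is a toy enumeration. It is not the formal RVWMO model: it assumes a single location initially 0, a single store of 1 from another hart, and simple interleaving of three events. The function and event names are made up for this sketch.

```python
from itertools import permutations


def outcomes(enforce_load_order):
    """Set of (older_load, younger_load) values the two loads may observe.

    If enforce_load_order is True, the older load must be performed before
    the younger one; otherwise the two loads may be reordered.
    """
    events = ["older_load", "younger_load", "store"]
    results = set()
    for order in permutations(events):
        if enforce_load_order and order.index("older_load") > order.index("younger_load"):
            continue  # this interleaving violates program order of the loads
        value = 0
        seen = {}
        for event in order:
            if event == "store":
                value = 1
            else:
                seen[event] = value
        results.add((seen["older_load"], seen["younger_load"]))
    return results


ordered = outcomes(True)
relaxed = outcomes(False)
new_behaviour = relaxed - ordered
```

In this toy model the only outcome added by relaxing the order is (1, 0): the older load sees the new value while the younger load still sees the old one. In the vl -> sl direction that corresponds to the older vector load observing the update and the younger scalar load missing it, which is the case software would need a read fence to rule out.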

# Imprecise Traps

Ways to support imprecise traps were also discussed, matching the very
brief descriptions in the spec 18.2-18.4, which will need expansion
and elaboration.


Re: Please check new Google calendar for new vector TG meeting link

Krste Asanovic
 

Nick forwarded the old deprecated calendar.

https://sites.google.com/a/riscv.org/risc-v-staff/home/tech-groups-cal

Is the new unified tech group calendar.

It is very hard to find, we're working on improving that.

Krste

On Fri, 25 Sep 2020 08:58:25 -0700, "Nick Knight" <nick.knight@sifive.com> said:
| Hi Cohen,
| I can see the calendar here:
| https://lists.riscv.org/g/tech-vector-ext/calendar

| Unfortunately, due to a conflict I can only rarely attend the TG meeting.

| Best,
| Nick

| On Fri, Sep 25, 2020 at 8:27 AM CDS <cohen.steed@wdc.com> wrote:

| Can you publish a link to what the Google Calendar is, or the TG meeting
| link? There are no links on this site to any of that information.

|


Re: Please check new Google calendar for new vector TG meeting link

CDS <cohen.steed@...>
 

It looks like there are two calendars – one on the risc-v site, and one on google (someone else sent to me). And the ZOOM meeting invites are not the same between them :\

 

-Cohen

 

From: Nick Knight <nick.knight@...>
Date: Friday, September 25, 2020 at 10:58 AM
To: Cohen Steed <Cohen.Steed@...>
Cc: "tech-vector-ext@..." <tech-vector-ext@...>
Subject: Re: [RISC-V] [tech-vector-ext] Please check new Google calendar for new vector TG meeting link

 


Hi Cohen,

 

 

Unfortunately, due to a conflict I can only rarely attend the TG meeting.

 

Best,

Nick

 

On Fri, Sep 25, 2020 at 8:27 AM CDS <cohen.steed@...> wrote:

Can you publish a link to what the Google Calendar is, or the TG meeting link? There are no links on this site to any of that information.
