Vector TG meeting minutes 2020/9/25


Krste Asanovic
 

Date: 2020/9/25
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~14
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed

#551 Memory consistency model for scalar loads and vector loads

In the current PoR, the RVWMO memory model requires that scalar loads and
vector loads from the same hart to the same address are ordered following
program order. The proposal is to weaken this requirement so that scalar
loads and vector loads to the same address can be reordered,
simplifying implementations, except for ordered gathers. In
particular, the requirement for a younger scalar load to not occur
before an older vector gather to the same address requires that the scalar
load wait (or speculate) to determine the vector gather addresses.

Discussion centered around how much of an impact this would have on
software, and on constructing a case where the change would impact
software. In almost all cases where the scalar access is used to read
a signaling value from another hart, a FENCE would anyway be required
for correct operation as the synchronization would be associated with
the communication of more than one atomic word of memory. Only in the
case where the signal is part of an atomically written word of memory
(8 bytes max in current spec), and where the vector read is used to
read the same word (perhaps as a vector of bytes) might this cause an
issue. This was felt to be relatively rare.
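
The rare case described above can be illustrated with a toy enumeration.
This is only a sketch under loud assumptions, not the RVWMO formalism: one
memory location x, initially 0, one store of 1 from another hart, and two
loads on the observing hart (an older scalar load "sl" and a younger vector
load "vl" covering the same word). The function name and the
allow_load_reorder switch are hypothetical, introduced only for this
illustration.

```python
def outcomes(allow_load_reorder):
    """Enumerate the (scalar_value, vector_value) pairs the hart can observe.

    Toy model: memory is a single variable x, and each operation performs
    atomically at one point in a global order.
    """
    results = set()
    # Orders in which the observing hart's two loads may perform.
    load_orders = [("sl", "vl")]          # program order
    if allow_load_reorder:
        load_orders.append(("vl", "sl"))  # weakened: the loads may swap
    for order in load_orders:
        # The remote store can land before, between, or after the loads.
        for store_pos in range(3):
            seq = list(order)
            seq.insert(store_pos, "store")
            x = 0
            seen = {}
            for op in seq:
                if op == "store":
                    x = 1
                else:
                    seen[op] = x
            results.add((seen["sl"], seen["vl"]))
    return results


if __name__ == "__main__":
    print(sorted(outcomes(allow_load_reorder=False)))
    print(sorted(outcomes(allow_load_reorder=True)))
```

In this model the only outcome added by the weakening is (1, 0): the older
scalar load sees the new signal value while the younger vector load still
sees the old contents of the same word.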

Another worry is the case where a routine performs a sync operation based
on a scalar read of a signaling variable and then calls a separately
compiled subroutine that reads the data, including the signaling
variable, using vectors. There is then a possibility that the vector
read will return inconsistent data. In general the caller is unaware of
whether the subroutine uses scalar or vector reads, and the subroutine is
unaware that the variable was used to communicate between threads.

While modern programming languages require that accesses to variables
used to communicate between harts be annotated to ensure correct
compilation, in practice legacy code and incorrect code might fail to
include the correct annotations and have a latent bug.

It was noted there are two directions for the ordering.

sl -> vl: Older scalar load before newer vector load, and
vl -> sl: older vector load before newer scalar load

The sl->vl direction represents the signaling-value-check before
vector computation case and is easiest to implement in hardware as
vector instructions typically access memory later in the pipeline than
scalar instructions.

The vl->sl case is the difficult one to implement at high performance,
but it is also easier for software to work around with some form of read
fence (either a FENCE, an ordered vector access, or just a scalar read of
the affected address).
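
The two directions and the fence workaround can be sketched with another
toy model. It assumes that every load in the program targets the same
address, that only the vl->sl direction is relaxed, and that a fence
between two loads orders them. The function allowed_orders and its
relax_vl_sl parameter are hypothetical names used only here; this is not a
statement of what the spec will adopt.

```python
from itertools import permutations


def allowed_orders(program, relax_vl_sl=True):
    """Return the orders in which the loads of `program` may perform.

    `program` lists "sl", "vl" or "fence" in program order.
    """
    loads = [(i, op) for i, op in enumerate(program) if op != "fence"]
    fences = [i for i, op in enumerate(program) if op == "fence"]

    def must_stay_ordered(older, newer):
        (i_old, kind_old), (i_new, kind_new) = older, newer
        if any(i_old < f < i_new for f in fences):
            return True   # a fence between the two loads orders them
        if relax_vl_sl and kind_old == "vl" and kind_new == "sl":
            return False  # the only direction relaxed in this sketch
        return True       # sl->vl, sl->sl and vl->vl stay ordered here

    result = []
    for perm in permutations(loads):
        ok = all(
            not must_stay_ordered(a, b) or perm.index(a) < perm.index(b)
            for n, a in enumerate(loads)
            for b in loads[n + 1:]
        )
        if ok:
            result.append([kind for _, kind in perm])
    return result


if __name__ == "__main__":
    print(allowed_orders(["vl", "sl"]))           # both orders possible
    print(allowed_orders(["vl", "fence", "sl"]))  # fence restores order
    print(allowed_orders(["sl", "vl"]))           # sl->vl stays ordered
```

Under these assumptions, code that depends on the vl->sl order only needs
the fence at the point where the dependence exists, while the
signal-check-then-vector-compute pattern (sl->vl) is unaffected.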

The sentiment was in favor of weakening the memory ordering constraint
but more discussion was needed. Potentially only the vl->sl
constraint could be weakened.

# Imprecise Traps

Ways to support imprecise traps were also discussed, matching the very
brief descriptions in the spec 18.2-18.4, which will need expansion
and elaboration.


David Horner
 

On 2020-09-26 7:15 p.m., Krste Asanovic wrote:
[quoted copy of the 2020/9/25 minutes snipped]

I am in favour of effectively weakening the scalar/vector and vector/scalar load/load ordering requirement.

However, this cannot be done in isolation, without regard to the rest of the RVWMO dependency requirements.


RVI has section 14.3, Source and Destination Register Listings: 5 pages detailing, identifying and categorizing dependencies between implicitly and explicitly opcode-identified persistent stores, including CSRs.

These dependencies form a critical component of the RVWMO specification.

They constrain global memory order for memory data entering and exiting a potentially lengthy sequence of non-memory accessing instructions.

They are also based on an intuitive engine: the hypothetical device that executes instructions in program order, the "hart".

The rules and constraints are crafted to accomplish results that are "strong enough to support programming language memory models".


For the Vector extension, we have not yet stipulated what the Vector-specific RVVWMO requirements are.

This is a necessary step; it will be instrumental in shaping or tempering the explicit WMO constraints.

To me the pivotal question is what execution model the Vector Engine follows.

Does it need to be constrained to support legacy programming language memory models?

Should it rather be envisioned as a novel model freed from past bondage or, if not to that extreme, freed from at least some of those constraints?


A [simple/comprehensive] specific conceptual vector model may eliminate a swath of RVWMO rules.

Specifically, idealizing the vector processor as distinct from the hosting "hart",

    as an autonomous co-processor as far as Memory order is concerned.

    This functions conceptually as a set of independent "hardware threads" coordinating among themselves,

       and also collectively to the host hart to cause the required vector behaviour.

I believe "register" dependency must still be considered, at the element level and not solely named registers.

We should not profess to be "RVWMO, except vl -> sl, and except sl -> vl (except for ordered indexed reads), and in-order precise execution trapping except ..., and ....".

Rather, we must define a model that **intuitively** allows all the optimizations we believe are necessary for a first-class Vector design.