Vector task group minutes for 2020/8/21

Krste Asanovic
Date: 2020/8/21 Task Group: Vector Extension Chair: Krste Asanovic Co-Chair: Roger Espasa Number of Attendees: ~12 Current issues on github: https://github.com/riscv/riscv-v-specIssues discussed: #501 Ordering of element loads Discussion was around vector memory access ordering and interaction with global memory consistency model. The original proposal for vector memory ordering was that it behaved like a scalar loop over the elements, however this would imply that a vector load that accessed the same address would have to access that address in element order. This only affects stride-0 loads and indexed gathers. The proposal was to relax indexed gathers to always be unordered, and to use vl=1 if the program required ordering between loop iterations. Another dicussion was on access to ordered and/or non-idempotent memory regions, where initiating memory accessed out-of-order would not give expected semantics, and restarting instructions once this was detected could be problematic for implementations without load buffers or renamed registers, and when having to maintain precise-to-the-element exception semantics. One proposal was to weaken precise-to-element semantics and forbid overlap of source index and destination vector on gathers. Implementing debug data watchpoints with unordered accesses also is problematic if watchpoints have to be triggered in element. A proposal was to allow watchpoints to be reported out of element order. Currently, strided segment stores where segments overlap are defined to occur in element order. Discussion was around loosening this constraint to out-of-order writes, or to trap if the overlap was detected.
|
|
Re: poll on vstart management issues #493, #510 and #532

swallach
attached is what we did at convex and it worked quite well. worked well in the context of compiler generated code for stencils and for runtimers like convolution and correlation
i am not sure this answers the questions you posed.
hope this helps
—————————
Vector first register - C4600
The vector register set of the C4600 Series CPUs contains an
additional vector register called the vector first register (VF).
VF specifies the first element of vector register Vi, Vj or Vk
accessed by a vector instruction, provided that the MSB of the
corresponding 5-bit register select field of the instruction is set.
VF cannot be applied to operations on VM.
VF is seven bits in length and may contain a value between 0 and
127. If the value of VF plus the value of VL is greater than 128,
the effective value ofVL for vector instructions that use VF is 128
minus VF. This effective VL value determines the number of
results written to a vector register or VM, or the number of
elements stored to memory.
If the value of VF plus Sj is greater than 127 in the mov
V i, S j, Sk and rnov S i, S j, Vk instructions, then the selected
element of the vector register is equal to (VF plus Sj) mod 128.
Therefore, the vector register wraps for these two instructions only.
If Vi or Vj of an instruction specifies the same register as Vk of
the instruction, and VF is applied to Vk, and VL is greater than
VF, then elements of the shared register may be written (as Vk)
before they are read (as Vi or Vj, depending of the hardware
implementation). In this case, the result in Vk is architecturally
undefined.Theinstructionmerg.x Vi,Vj,Vkhasthesame
behavior if Vi or Vj are the same as Vk.
WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.
http://www.bsc.es/disclaimer
|
|
Added details for vector TG meeting tomorrow

Krste Asanovic
I believe I added the correct zoom info on the correct new calendar for tomorrow’s vector task group meeting.
Please check and advise if you’re not seeing it,
Krste
|
|
poll on vstart management issues #493, #510 and #532
Ahead of the vector meeting I would like to see if we can address
or at least get direction on some of the flagged for pre-v1.0
resolution.
There are 3 related flagged issues that all deal with vstart.
493 - unbind vstart from element index
510 - add
element index to vstart for segment loads and stores
532 -
define vstart as an opaque value (some values will have specific
meanings
These have in common redefining vstart.
#493 proposes that the vstart value not have a one to one mapping
from its value to the usual vl ordering.
This was motivated in part by SLEN considerations
that are now invisible to the ISA architecture.
However, included is the consideration that some
operations , e.g. cryptography may desire restart part of the way
through the operation.
This even if only one element is contained in
the vlen*8 register set.
Clearly insufficient internal state is
expressed it only zero and 1 are allowed values.
#510 proposes the element position within a segmented load/store
is identified in vstart, not just the group position. **
Most of the discussion relates to which should be
identified, element position or group position.
POR is group position and it was substantially defended
as the incumbent definition.
The POR substantially limits an implementations options,
intended for the greater good.
Thus the question of where and how the additional
"element within group" information should be stored did not
progress.
However, even if segmented load/stores will settle on
restart granularity of groups or elements,
the larger question of alternate representation of
restart information for "special" cirumstances has been raised,
as it has in #493.
#532 proposes that vstart be defined as a value that will be
treated as opaque at the ISA level.
No intrinsic meaning should be inferred directly from
the value in vstart.
It is assured only to be a value used to instruct the
implementation to restart the instruction from either a) the
exception element or b) by adding 1 to the vstart value, the next
element.
The proposal allows for implementations to provide a
mechanism to provide additional trap information, including
related element at exception.
Escape mechanisms to convey that the index is simply
embedded in the rightmost bits of vstart is discussed as a
concession
to the expectation that plain text identification
of the element "active" at the time of the exception is valuable
information and sufficient in most cases for restart handling.
My request for the meeting is to poll the group to respond agree,
disagree or abstain on the following..
1) the POR [vstart is the plain text number indicating
the next element at which to resume] is sufficient for v1.0,
Any augmenting of vstart or adding new facilities
[e.g. csr] can be addressed later.
The ecosystem changes can also be made later as we
expect the changes required to be specific to new functionality
(crypto, ediv, etc.).
2) we identify specific special cases that we believe are
desired by ecosystem [ element within segment group, phased restart
of crypto ops, etc.]
and make specific allowance for supporting fields
in vstart.
Low order bit will still identify the item
[element, segment, crypto term, etc.] specific to the instruction.
Other fields will be populated according to the
nature of the operation and the element type.
3) vstart be considered opaque as above with the escape
mechanism that then reduces to the current POR.
The ecosystem will need to handle the general
case in which the element of exception must be determined by other
means than low bits in vstart.
Exception handlers can no longer resume at an
arbitrary element in the instruction and have a reasonable
expectation that the restart will work as expected.
4) opaque vstart without an escape signature in vstart. In
all cases an alternate mechanism will be required to identify the
element "of exception".
5) further investigation is still required before these
decisions can be made.
** (and presumably any future segmented operation)
|
|

swallach
i agree with your comment i got this paper from someone who applied their assessment to risc-v vectors On Sep 14, 2020, at 10:47 AM, krste@... wrote:
They mention RISC-V vectors in the intro, but on a quick scan, the results are very ARM-specific, with no real implication for RISC-V vectors. They're pointing out a problem with ARM SVE where all elements are executed regardless of vector length due to SVE using predication to implement vector length.
Krste
On Mon, 14 Sep 2020 08:46:05 -0500, "swallach" <steven.wallach@...> said:
| i was made aware of this paper. risc-v vectors are mentioned. | one of the key conclusions are (from the abstract)
| Our experiments show that VLA code reaches about 90% of the performance of | vector length specific code, i.e. a 10% overhead is inferred due to global | predication of instructions. Furthermore, we show that code performance is not | increasing proportionally with increasing vector lengths due to the higher | memory demands.
| my experience is just the opposite. (based on memory system design)
| i am curious to hear other opinions
| WARNING / LEGAL TEXT: This message is intended only for the use of the | individual or entity to which it is addressed and may contain information | which is privileged, confidential, proprietary, or exempt from disclosure | under applicable law. If you are not the intended recipient or the person | responsible for delivering the message to the intended recipient, you are | strictly prohibited from disclosing, distributing, copying, or in any way | using this message. If you have received this communication in error, please | notify the sender and destroy and delete any copies you may have received.
| http://www.bsc.es/disclaimer
| | x[DELETED ATTACHMENT PohlAPPMM19.pdf, PDF] http://bsc.es/disclaimer
|
|

Krste Asanovic
They mention RISC-V vectors in the intro, but on a quick scan, the results are very ARM-specific, with no real implication for RISC-V vectors. They're pointing out a problem with ARM SVE where all elements are executed regardless of vector length due to SVE using predication to implement vector length. Krste On Mon, 14 Sep 2020 08:46:05 -0500, "swallach" <steven.wallach@...> said:
| i was made aware of this paper. risc-v vectors are mentioned. | one of the key conclusions are (from the abstract) | Our experiments show that VLA code reaches about 90% of the performance of | vector length specific code, i.e. a 10% overhead is inferred due to global | predication of instructions. Furthermore, we show that code performance is not | increasing proportionally with increasing vector lengths due to the higher | memory demands. | my experience is just the opposite. (based on memory system design) | i am curious to hear other opinions | WARNING / LEGAL TEXT: This message is intended only for the use of the | individual or entity to which it is addressed and may contain information | which is privileged, confidential, proprietary, or exempt from disclosure | under applicable law. If you are not the intended recipient or the person | responsible for delivering the message to the intended recipient, you are | strictly prohibited from disclosing, distributing, copying, or in any way | using this message. If you have received this communication in error, please | notify the sender and destroy and delete any copies you may have received. | http://www.bsc.es/disclaimer| | x[DELETED ATTACHMENT PohlAPPMM19.pdf, PDF]
|
|

swallach
i was made aware of this paper. risc-v vectors are mentioned.
one of the key conclusions are (from the abstract)
Our experiments
show that VLA code reaches about 90% of the performance of
vector length specific code, i.e. a 10% overhead is inferred due
to global predication of instructions. Furthermore, we show that
code performance is not increasing proportionally with increasing
vector lengths due to the higher memory demands.
my experience is just the opposite. (based on memory system design)
i am curious to hear other opinions
WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.
http://www.bsc.es/disclaimer
|
|
No vector TG meeting tomorrow

Krste Asanovic
Reminder there’s no vector TG meeting tomorrow (I have a conflict). We’ll be meeting again on Friday Sep 18 (I’ll work with Stephano to figure out new calendar scheme), Krste
|
|
Re: Vector Memory Ordering
Guy Lemieux <guy.lemieux@...>
In less concerned with making the FIFO example work, and more concerned with ensuring programs are fully portable across all implementations.
Programmers don’t tend to read ISA specs and details. They try it, and infer behaviour from the implementations they have available to them. Hence, I believe there is a risk that many programmers will assume ordered stores in particular, but also ordered loads.
The ISAa cost of providing 2 modes, ordered and unordered, is either 1 bit in the ISA encoding or a mode bit along the lines of setvl. The implementation cost is nearly zero. The benefit is fully portable programs without surprises.
Guy
toggle quoted messageShow quoted text
On Fri, Sep 4, 2020 at 12:54 PM Bill Huffman < huffman@...> wrote:
Guy Lemieux commented:
I think 90%+ of implementations will choose to do ordered loads and stores even though unordered is permitted.
This means programmers will expect them to be ordered, and such software will not work properly on the remaining implementations. This compatibility problem is a concern.
I think the best way to combat this is to have 2 sets of instructions: ordered and unordered. The unordered implementation can simply do the ordered thing in simple implementations.
Ordered stores to a FIFO is a paradigm I was hoping to use for inter-processor communication.
I think compiler considerations are also important, but I don’t know the implications here.
Guy
Maybe what's below could be improved by saying that if the base address (in src1) was non-idempotent or an "ordered channel," the entire instruction would run in order. If not, it would not. We could allow stride of zero
but not other overlapping strides for stores. Having a later access come into a non-idempotent or ordered region would raise an exception. That would provide for loads from and stores to a FIFO to work. But it wouldn't provide for an instruction to "fall
into" such a memory segment part way through. Bill On 9/4/20 10:18 AM, Bill Huffman wrote:
I think from this morning, we are considering:
- Ordered scatters are done truly in order
- Strided stores that overlap (including segmented ones) will trap as illegal
- All other vector loads and stores do their memory accesses in arbitrary order.
- A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
- All loads with vector sources must use a different register for the destination than any source (including mask).
- Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??
A few of the consequences of this are:
- A gather with repeated elements can access the higher numbered elements first and lower ones later
- A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
- A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
- When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
- Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected.
- A stride-0 store to a FIFO will trap
- A stride-0 load to a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
- A non-idempotent memory location accessed by a vector load may be accessed multiple times.
We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time. Even strides can't always be known at compile time. Will this plan reduce the amount
of auto-vectorization that can be done?
Exception reporting still has issues:
- Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
- For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
- There may need to be a way to report a watchpoint on one element but restart on an earlier element
- If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element? Does that help anything if we do?
I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way. If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than
for unit stride or even strided accesses.
Bill
|
|
Re: Vector Memory Ordering
The problem is not that you have to ensure previous (in the scalar loop order sense) entries don't trap before you take the exception on the later one. The problem is that you have to not complete a later store before checking
for earlier traps. That is unfortunate for all vector stores but it is a disaster for scatters. It's impossible to know scatters won't trap unless they're run through the MMU/PMP. So you have to run all lower numbered ones through the MMU/PMP before you
allow a higher numbered one to complete. Which in an in-order implementation is doing them in order and in an out-of-order implementation is a significant constraint.
The problem, as I see it, is that reporting an exception on a higher numbered element when a lower numbered element has not yet been stored requires more state than vstart. I think we either have to add exception state (not
sure how) or allow stores to be done more than once. This seems critical for scatters - at least unordered ones. For strides larger than very small numbers, it is restrictive as well and we need to consider it, I think. In that case, do we extend the same
to unit/small stride as well?
Bill
On 9/8/20 4:41 PM, Allen Baum wrote:
toggle quoted messageShow quoted text
EXTERNAL MAIL
You don't have to ensure previous stores (or loads) have completed - I misspoke. You have to ensure that they don't trap, which might be a much easier proposition. Implementations may typically do those checks
in order anyway, and the checks only need to be done once/page (or PMA or PMP region, unfortunately).
On Tue, Sep 8, 2020 at 2:08 PM Bill Huffman < huffman@...> wrote:
On 9/8/20 1:47 PM, Allen Baum wrote:
EXTERNAL MAIL
If the memory ordering (in this case specifically memory access ordering, I think - correct me if I'm wrong) doesn't affect the final processor state (which would not be the case for a vector reduce that started
reduction as the values became available, instead of in some specific order), I don't know that compliance tests can tell the difference.
Where this does affect processor state is exception reporting: a golden model which accesses results in one order may not have the same exception state as an implementation that accesses in a different order.
Ditto for watchpoints (are they considered architectural?). A possible way to fix this is to have the golden model access in a canonical order, and implementations delay exception reporting until all accesses from logically previous addresses are complete
(and any logically previous exception override one that was reported earlier)
From most points of view it's a fine method (delay exception reporting until all accesses from logically previous addresses are complete and override). But it seems to require either that stores are done in order anyway, or
that stores of vector elements can be done more than once (only until the instruction completes). I
think I'm OK with vector elements being stored more than once during the instruction, but I don't know that it will be accepted easily. The other choice seems to be to do many stores in order anyway. For reasonably small strides, it may be possible
to do exception checks all at once, but for large strides and scatters, I don't expect it is.
Bill
Debug compliance testing will be verry interesting....
Non-idempotent memory accesses will also affect final state, but I'd argue that is non-ISA and we won't be testing that.
Concurrent, overlapping accesses is another pain point.
Bill, great lists.
can we start building a testcase list for these situations and others? maybe a github doc? i am sure these will be documented but I don't want to lose bullet lists like this and have to derive them from a
doc later if we can avoid it.
Allen, does the compliance group have a proposed way (or format) to define testcases and input data sets? If not should we?
general note, I think we need to specifically add debug to the definition of done. I can imagine things like watchpoints affecting a lot of extensions in interesting ways.
Mark
On Fri, Sep 4, 2020 at 10:18 AM Bill Huffman < huffman@...> wrote:
I think from this morning, we are considering:
- Ordered scatters are done truly in order
- Strided stores that overlap (including segmented ones) will trap as illegal
- All other vector loads and stores do their memory accesses in arbitrary order.
- A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
- All loads with vector sources must use a different register for the destination than any source (including mask).
- Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??
A few of the consequences of this are:
- A gather with repeated elements can access the higher numbered elements first and lower ones later
- A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
- A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
- When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
- Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected.
- A stride-0 store to a FIFO will trap
- A stride-0 load to a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
- A non-idempotent memory location accessed by a vector load may be accessed multiple times.
We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time. Even strides can't always be known at compile time. Will this plan reduce the amount
of auto-vectorization that can be done?
Exception reporting still has issues:
- Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
- For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
- There may need to be a way to report a watchpoint on one element but restart on an earlier element
- If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element? Does that help anything if we do?
I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way. If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than
for unit stride or even strided accesses.
Bill
|
|
Re: Vector Memory Ordering
You don't have to ensure previous stores (or loads) have completed - I misspoke. You have to ensure that they don't trap, which might be a much easier proposition. Implementations may typically do those checks in order anyway, and the checks only need to be done once/page (or PMA or PMP region, unfortunately).
toggle quoted messageShow quoted text
On Tue, Sep 8, 2020 at 2:08 PM Bill Huffman < huffman@...> wrote:
On 9/8/20 1:47 PM, Allen Baum wrote:
EXTERNAL MAIL
If the memory ordering (in this case specifically memory access ordering, I think - correct me if I'm wrong) doesn't affect the final processor state (which would not be the case for a vector reduce that started
reduction as the values became available, instead of in some specific order), I don't know that compliance tests can tell the difference.
Where this does affect processor state is exception reporting: a golden model which accesses results in one order may not have the same exception state as an implementation that accesses in a different order.
Ditto for watchpoints (are they considered architectural?). A possible way to fix this is to have the golden model access in a canonical order, and implementations delay exception reporting until all accesses from logically previous addresses are complete
(and any logically previous exception override one that was reported earlier)
From most points of view it's a fine method (delay exception reporting until all accesses from logically previous addresses are complete and override). But it seems to require either that stores are done in order anyway, or
that stores of vector elements can be done more than once (only until the instruction completes). I
think I'm OK with vector elements being stored more than once during the instruction, but I don't know that it will be accepted easily. The other choice seems to be to do many stores in order anyway. For reasonably small strides, it may be possible
to do exception checks all at once, but for large strides and scatters, I don't expect it is.
Bill
Debug compliance testing will be verry interesting....
Non-idempotent memory accesses will also affect final state, but I'd argue that is non-ISA and we won't be testing that.
Concurrent, overlapping accesses is another pain point.
Bill, great lists.
can we start building a testcase list for these situations and others? maybe a github doc? i am sure these will be documented but I don't want to lose bullet lists like this and have to derive them from a
doc later if we can avoid it.
Allen, does the compliance group have a proposed way (or format) to define testcases and input data sets? If not should we?
general note, I think we need to specifically add debug to the definition of done. I can imagine things like watchpoints affecting a lot of extensions in interesting ways.
Mark
On Fri, Sep 4, 2020 at 10:18 AM Bill Huffman < huffman@...> wrote:
I think from this morning, we are considering:
- Ordered scatters are done truly in order
- Strided stores that overlap (including segmented ones) will trap as illegal
- All other vector loads and stores do their memory accesses in arbitrary order.
- A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
- All loads with vector sources must use a different register for the destination than any source (including mask).
- Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??
A few of the consequences of this are:
- A gather with repeated elements can access the higher numbered elements first and lower ones later
- A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
- A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
- When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
- Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected.
- A stride-0 store to a FIFO will trap
- A stride-0 load to a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
- A non-idempotent memory location accessed by a vector load may be accessed multiple times.
We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time. Even strides can't always be known at compile time. Will this plan reduce the amount
of auto-vectorization that can be done?
Exception reporting still has issues:
- Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
- For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
- There may need to be a way to report a watchpoint on one element but restart on an earlier element
- If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element? Does that help anything if we do?
I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way. If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than
for unit stride or even strided accesses.
Bill
|
|
Re: Vector Memory Ordering
On 9/8/20 1:47 PM, Allen Baum wrote:
EXTERNAL MAIL
If the memory ordering (in this case specifically memory access ordering, I think - correct me if I'm wrong) doesn't affect the final processor state (which would not be the case for a vector reduce that started
reduction as the values became available, instead of in some specific order), I don't know that compliance tests can tell the difference.
Where this does affect processor state is exception reporting: a golden model which accesses results in one order may not have the same exception state as an implementation that accesses in a different order.
Ditto for watchpoints (are they considered architectural?). A possible way to fix this is to have the golden model access in a canonical order, and implementations delay exception reporting until all accesses from logically previous addresses are complete
(and any logically previous exception override one that was reported earlier)
From most points of view it's a fine method (delay exception reporting until all accesses from logically previous addresses are complete and override). But it seems to require either that stores are done in order anyway, or
that stores of vector elements can be done more than once (only until the instruction completes). I
think I'm OK with vector elements being stored more than once during the instruction, but I don't know that it will be accepted easily. The other choice seems to be to do many stores in order anyway. For reasonably small strides, it may be possible
to do exception checks all at once, but for large strides and scatters, I don't expect it is.
Bill
Debug compliance testing will be verry interesting....
Non-idempotent memory accesses will also affect final state, but I'd argue that is non-ISA and we won't be testing that.
Concurrent, overlapping accesses is another pain point.
Bill, great lists.
can we start building a testcase list for these situations and others? maybe a github doc? i am sure these will be documented but I don't want to lose bullet lists like this and have to derive them from a
doc later if we can avoid it.
Allen, does the compliance group have a proposed way (or format) to define testcases and input data sets? If not should we?
general note, I think we need to specifically add debug to the definition of done. I can imagine things like watchpoints affecting a lot of extensions in interesting ways.
Mark
On Fri, Sep 4, 2020 at 10:18 AM Bill Huffman < huffman@...> wrote:
I think from this morning, we are considering:
- Ordered scatters are done truly in order
- Strided stores that overlap (including segmented ones) will trap as illegal
- All other vector loads and stores do their memory accesses in arbitrary order.
- A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
- All loads with vector sources must use a different register for the destination than any source (including mask).
- Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??
A few of the consequences of this are:
- A gather with repeated elements can access the higher numbered elements first and lower ones later
- A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
- A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
- When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
- Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected.
- A stride-0 store to a FIFO will trap
- A stride-0 load to a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
- A non-idempotent memory location accessed by a vector load may be accessed multiple times.
We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time. Even strides can't always be known at compile time. Will this plan reduce the amount
of auto-vectorization that can be done?
Exception reporting still has issues:
- Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
- For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
- There may need to be a way to report a watchpoint on one element but restart on an earlier element
- If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element? Does that help anything if we do?
I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way. If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than
for unit stride or even strided accesses.
Bill
|
|
Re: Vector Memory Ordering
If the memory ordering (in this case specifically memory access ordering, I think - correct me if I'm wrong) doesn't affect the final processor state (which would not be the case for a vector reduce that started reduction as the values became available, instead of in some specific order), I don't know that compliance tests can tell the difference.
Where this does affect processor state is exception reporting: a golden model which accesses results in one order may not have the same exception state as an implementation that accesses in a different order. Ditto for watchpoints (are they considered architectural?). A possible way to fix this is to have the golden model access in a canonical order, and implementations delay exception reporting until all accesses from logically previous addresses are complete (and any logically previous exception override one that was reported earlier)
Debug compliance testing will be verry interesting....
Non-idempotent memory accesses will also affect final state, but I'd argue that is non-ISA and we won't be testing that. Concurrent, overlapping accesses is another pain point.
toggle quoted messageShow quoted text
Bill, great lists.
can we start building a testcase list for these situations and others? maybe a github doc? i am sure these will be documented but I don't want to lose bullet lists like this and have to derive them from a doc later if we can avoid it.
Allen, does the compliance group have a proposed way (or format) to define testcases and input data sets? If not should we?
general note, I think we need to specifically add debug to the definition of done. I can imagine things like watchpoints affecting a lot of extensions in interesting ways.
Mark
On Fri, Sep 4, 2020 at 10:18 AM Bill Huffman < huffman@...> wrote:
I think from this morning, we are considering:
- Ordered scatters are done truly in order
- Strided stores that overlap (including segmented ones) will trap as illegal
- All other vector loads and stores do their memory accesses in arbitrary order.
- A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
- All loads with vector sources must use a different register for the destination than any source (including mask).
- Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??
A few of the consequences of this are:
- A gather with repeated elements can access the higher numbered elements first and lower ones later
- A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
- A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
- When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
- Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected.
- A stride-0 store to a FIFO will trap
- A stride-0 load to a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
- A non-idempotent memory location accessed by a vector load may be accessed multiple times.
We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time. Even strides can't always be known at compile time. Will this plan reduce the amount
of auto-vectorization that can be done?
Exception reporting still has issues:
- Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
- For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
- There may need to be a way to report a watchpoint on one element but restart on an earlier element
- If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element? Does that help anything if we do?
I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way. If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than
for unit stride or even strided accesses.
Bill
|
|
Re: Vector Memory Ordering
On 9/4/20 10:25 PM, David Horner wrote:
EXTERNAL MAIL
On 2020-09-04 1:18 p.m., Bill Huffman wrote:
I think from this morning, we are considering:
- Ordered scatters are done truly in order
I tentatively agree *if* "truly in order"
means "as if writes were the equivalent scalar writes of the register elements processed in order from 0 to vl" .
However, RVWMO should be assumed and no further strengthening of "order" should be stipulated.
None overlapping element writes are allowed in RVWMO global memory order to freely pass each other.
Yes, that's what I intended.
Bill
I would however prefer to weaken the constraint
as #528 recommends (in addition to a mnemonic change):
... VSUXEI ... stipulate that the net result of the operation allows multiple outcomes visible to the local hart; essentially depending upon which element is last written to overlapping memory locations.
... [VSXEI] to stipulate that the data written to memory will be as if the active elements were first copied to an XLEN length buffer ... processing the elements in the order from 0 to vl, and then only the affected bytes written to memory in whatever order
RVVWMO allows.
...
The ideas are that
- rather than suggesting VSUXEI is the exception to the rule, it indicates that the ordered VSOXEI is a more constrained alternative to the default processor model that allows concurrent execution wherever possible and
- VSOXEI makes no guarantee on what is globally visible than what the RVVWMO model allows
- VSUXEI is still constrained by the RVVWMO rules.
|
|
ordered vs unordered and overlaps use cases

mark
what are the use cases? do we have examples in mind when they would/could be used?
are there examples of what developers would want from previous efforts on vector machines?
can we write them down?
thanks Mark
|
|
Re: Vector Memory Ordering

mark
Bill, great lists.
can we start building a testcase list for these situations and others? maybe a github doc? i am sure these will be documented but I don't want to lose bullet lists like this and have to derive them from a doc later if we can avoid it.
Allen, does the compliance group have a proposed way (or format) to define testcases and input data sets? If not should we?
general note, I think we need to specifically add debug to the definition of done. I can imagine things like watchpoints affecting a lot of extensions in interesting ways.
Mark
toggle quoted messageShow quoted text
On Fri, Sep 4, 2020 at 10:18 AM Bill Huffman < huffman@...> wrote:
I think from this morning, we are considering:
- Ordered scatters are done truly in order
- Strided stores that overlap (including segmented ones) will trap as illegal
- All other vector loads and stores do their memory accesses in arbitrary order.
- A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
- All loads with vector sources must use a different register for the destination than any source (including mask).
- Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??
A few of the consequences of this are:
- A gather with repeated elements can access the higher numbered elements first and lower ones later
- A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
- A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
- When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
- Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected.
- A stride-0 store to a FIFO will trap
- A stride-0 load to a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
- A non-idempotent memory location accessed by a vector load may be accessed multiple times.
We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time. Even strides can't always be known at compile time. Will this plan reduce the amount
of auto-vectorization that can be done?
Exception reporting still has issues:
- Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
- For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
- There may need to be a way to report a watchpoint on one element but restart on an earlier element
- If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element? Does that help anything if we do?
I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way. If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than
for unit stride or even strided accesses.
Bill
|
|
Re: Vector Memory Ordering
On 2020-09-04 5:00 p.m., swallach
wrote:
i have not
been following this thread in lots of detail
could someone please explain why we need to
differentiate between ordered and unordered load/stores.
These issues discuss the need for order and reasons to have
variants:
https://github.com/riscv/riscv-v-spec/issues/501 Unordered
Indexed Load
https://github.com/riscv/riscv-v-spec/issues/502 Unordered
Index Operation Memory Ordering Description
https://github.com/riscv/riscv-v-spec/issues/504 Further
Ordering Relaxation of Unordered Indexed Stores/Loads
https://github.com/riscv/riscv-v-spec/issues/528 rename
VSXEI<EEW> to VSOXEI<EEW>
in the 6 or vector systems i have been involved
with, vector references bypassed the cache, main memory was
highly interleaved.
And this part has not been as extensively discussed. RISCV has a
mechanism to characterize memory regions, including channels with
"special" features.
The idea is that RISCV vector should be all things to all needs,
and failing that, it should address as many needs as practical.
One of those "needs" is memory coherence that agrees with the
RVWMO model.
I am in favour of a vector enhanced RVWMO, "RVVWMO", if it proves
to be beneficial. Specifically, We propose a weaker RVVWMO for
software to validate, and advice that all future safe
implementations only use RVWMO. This allows a restraining of the
proposed RVVWMO if issues do arise.
compilers could not care less.
One objective is autovectorization, and for this case compilers
definitely care.
one of the major performance optimizations, was to
eliminate power of 2 strides. (generally manually done)
at convey we even had a option to deploy a prime
number of interleave memory system. (the bsp first did this)
i saw one application, a major cfd code, that
sorted references to a stencil based reference pattern. this
was done to optimize performance for cache based vector systems.
with the presence of HBM memory systems, and some
cleaver memory controller design (that could be done with vector
references information), i am pretty sure ordered and unordered
loads/stores will have the same implementation. to be more
specific, a memory design for GUPS, would have this type of
implementation
i look forward to a better understanding
Thank you for all your inputs that have help me get a better
understanding.
Guy Lemieux
commented:
I think 90%+ of implementations
will choose to do ordered loads and stores even though
unordered is permitted.
This means programmers will expect
them to be ordered, and such software will not work properly
on the remaining implementations. This compatibility problem
is a concern.
I think the best way to combat this
is to have 2 sets of instructions: ordered and unordered.
The unordered implementation can simply do the ordered thing
in simple implementations.
Ordered stores to a FIFO is a
paradigm I was hoping to use for inter-processor
communication.
I think compiler considerations are
also important, but I don’t know the implications here.
Guy
Maybe what's
below could be improved by saying that if the base address (in
src1) was non-idempotent or an "ordered channel," the entire
instruction would run in order. If not, it would not. We
could allow stride of zero but not other overlapping strides
for stores. Having a later access come into a non-idempotent
or ordered region would raise an exception. That would
provide for loads from and stores to a FIFO to work. But it
wouldn't provide for an instruction to "fall into" such a
memory segment part way through.
Bill
On 9/4/20 10:18 AM, Bill Huffman
wrote:
I think from
this morning, we are considering:
- Ordered
scatters are done truly in order
- Strided
stores that overlap (including segmented ones) will trap
as illegal
- All other
vector loads and stores do their memory accesses in
arbitrary order.
- A vector
load that accesses the same location multiple times is
free to use the same loaded value for any subset of
results
- All loads
with vector sources must use a different register for the
destination than any source (including mask).
- Maybe a vector load may access
the memory location corresponding to a given element
multiple times (for exception handling)??
A few of the
consequences of this are:
- A gather
with repeated elements can access the higher numbered
elements first and lower ones later
- A vector
memory access where multiple elements match watchpoint
criteria can trap on any of the multiple elements,
regardless of watchpoint requirements on order
- A
stride-0 load accessing an "incrementing" location can see
a result with larger values at lower element numbers than
smaller values
- When
vector loads or stores access an "ordered channel" the
elements will still be accessed in arbitrary order
- Strided
loads, gathers, and unordered scatters to non-idempotent
regions will not behave as might be expected.
- A
stride-0 store to a FIFO will trap
- A
stride-0 load to a FIFO will pop an arbitrary number of
entries from the FIFO (from 1 to more than vl) and
elements are distributed in an arbitrary way in the
result.
- A
non-idempotent memory location accessed by a vector load
may be accessed multiple times.
We need to
be sure software is OK with these characteristics as
"ordered channels" and non-idempotent regions can't be known
at compile time. Even strides can't always be known at
compile time. Will this plan reduce the amount of
auto-vectorization that can be done?
Exception
reporting still has issues:
- Unless
stores can be done multiple times, there is a need to save
some representation of what stores have and have not been
done.
- For loads
and stores, watchpoints can happen more than once without
some representation of what elements are complete.
- There may
need to be a way to report a watchpoint on one element but
restart on an earlier element
- If loads
have to do this exception reporting as well, do we forbid
loads to happen more than once for each element? Does
that help anything if we do?
I'd like to
see us relax the ordering of gathers and unordered scatters
with younger instructions in some way. If we don't, younger
scalar memory accesses will stall for some time as
comparisons are much more difficult than for unit stride or
even strided accesses.
Bill
WARNING / LEGAL TEXT: This message is intended only for the use of
the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or
exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to
the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you
have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.
http://www.bsc.es/disclaimer
|
|
Re: Vector Memory Ordering
On 2020-09-04 5:53 p.m., Andrew
Waterman wrote:
On Fri, Sep 4, 2020 at 10:19
AM Bill Huffman < huffman@...> wrote:
I think from this
morning, we are considering:
- Ordered scatters are
done truly in order
Just a quick note here, if Ztso is
active then "truly in order" has global ordering too.
- Strided stores that
overlap (including segmented ones) will trap as
illegal
This is not terribly straightforward.
I'll assume that the trap would only be a function of the
stride and element/segment width, rather than checking that
two active elements actually overlap at runtime.
Much of the discussion was driven by consideration for the
difficulty in runtime overlap detection and challenges to ensure
"correct" function.
Even so, very large strides can foul this up. Consider
(in RV32) vssseg8e32 with vl=5 and stride=0xC0000004.
Elements 0 and 4 overlap!
Thinking was
1) such a restriction would be addressed by application fall back
to vl=1 iterations
2) such a restriction could be relaxed later and thus defer
addressing all permutations such as this.
(This phenomenon can also happen for non-segment strided
stores by using a misaligned stride,
Thinking was misaligned stride would also be restricted with the
same fallback.
which (for good reason) is a valid thing to do.)
I believe misaligned stride could be very valuable for load, less
valuable for store.
Would you please elaborate on the good reason(s) for misaligned
stride?
But it Thinking was misaligned stride would also be restricted
with the same fallback.
such a restriction could be relaxed later and thus defer
addressing all permutations such as this.
It's not an inexpensive computation in general. It would
be better, I think, to either make them in-order or to
permit arbitrary reordering than to trap.
I am in favour of arbitrary reordering, but I'd prefer to refer
to such as parallel execution; memory ordering is a distinct issue
and often conflated with vector element processing sequences.
- All other vector
loads and stores do their memory accesses in
arbitrary order.
- A vector load that
accesses the same location multiple times is free to
use the same loaded value for any subset of results
- All loads with vector
sources must use a different register for the
destination than any source (including mask).
Why is this change necessary? Currently, depending on
EEW, non-segment indexed loads are allowed to overlap the
index register. Are you suggesting that, not only can
indexed loads access memory in arbitrary order, they can
also write back imprecisely (past vstart),
yes. It was noted that this is already allowed by some instructions
and explicitly by the relaxed (point 4) definition of precise vector
traps.
destroying the index register?
potentially, so avoid the situation as in other instructions where
vrestart is desired but costly, difficult or impossible to implement
without the restriction.
(Mask register overlap is already forbidden by the usual
rules for different-EEW overlap.)
|
|
Re: Vector Memory Ordering
On 2020-09-04 1:18 p.m., Bill Huffman
wrote:
I think from this morning, we
are considering:
- Ordered scatters are done
truly in order
I
tentatively agree *if* "truly in order" means
"as if writes were the equivalent scalar writes of the register
elements processed in order from 0 to vl" .
However, RVWMO should be assumed
and no further strengthening of "order" should be stipulated.
None overlapping element writes
are allowed in RVWMO global memory order to freely pass each
other.
I would however prefer to weaken
the constraint
as #528 recommends (in addition to
a mnemonic change):
... VSUXEI ... stipulate that the net result of the operation
allows multiple outcomes visible to the local hart; essentially
depending upon which element is last written to overlapping memory
locations.
... [VSXEI] to stipulate that the data written to memory will be
as if the active elements were first copied to an XLEN length
buffer ...
processing the elements in the order from 0 to vl, and then only
the affected bytes written to memory in whatever order RVVWMO
allows.
...
The ideas are that
- rather than suggesting VSUXEI is the exception to the rule, it
indicates that the ordered VSOXEI is a more constrained
alternative to the default processor model that allows
concurrent execution wherever possible and
- VSOXEI makes no guarantee on what is globally visible than
what the RVVWMO model allows
- VSUXEI is still constrained by the RVVWMO rules.
|
|
Re: Vector Memory Ordering
On Fri, Sep 4, 2020 at 10:19 AM Bill Huffman < huffman@...> wrote:
I think from this morning, we are considering:
- Ordered scatters are done truly in order
- Strided stores that overlap (including segmented ones) will trap as illegal
This is not terribly straightforward.
I'll assume that the trap would only be a function of the stride and element/segment width, rather than checking that two active elements actually overlap at runtime. Even so, very large strides can foul this up. Consider (in RV32) vssseg8e32 with vl=5 and stride=0xC0000004. Elements 0 and 4 overlap! (This phenomenon can also happen for non-segment strided stores by using a misaligned stride, which (for good reason) is a valid thing to do.)
It's not an inexpensive computation in general. It would be better, I think, to either make them in-order or to permit arbitrary reordering than to trap.
- All other vector loads and stores do their memory accesses in arbitrary order.
- A vector load that accesses the same location multiple times is free to use the same loaded value for any subset of results
- All loads with vector sources must use a different register for the destination than any source (including mask).
Why is this change necessary? Currently, depending on EEW, non-segment indexed loads are allowed to overlap the index register. Are you suggesting that, not only can indexed loads access memory in arbitrary order, they can also write back imprecisely (past vstart), destroying the index register?
(Mask register overlap is already forbidden by the usual rules for different-EEW overlap.) - Maybe a vector load may access the memory location corresponding to a given element multiple times (for exception handling)??
A few of the consequences of this are:
- A gather with repeated elements can access the higher numbered elements first and lower ones later
- A vector memory access where multiple elements match watchpoint criteria can trap on any of the multiple elements, regardless of watchpoint requirements on order
- A stride-0 load accessing an "incrementing" location can see a result with larger values at lower element numbers than smaller values
- When vector loads or stores access an "ordered channel" the elements will still be accessed in arbitrary order
- Strided loads, gathers, and unordered scatters to non-idempotent regions will not behave as might be expected.
- A stride-0 store to a FIFO will trap
- A stride-0 load to a FIFO will pop an arbitrary number of entries from the FIFO (from 1 to more than vl) and elements are distributed in an arbitrary way in the result.
- A non-idempotent memory location accessed by a vector load may be accessed multiple times.
We need to be sure software is OK with these characteristics as "ordered channels" and non-idempotent regions can't be known at compile time. Even strides can't always be known at compile time. Will this plan reduce the amount
of auto-vectorization that can be done?
Exception reporting still has issues:
- Unless stores can be done multiple times, there is a need to save some representation of what stores have and have not been done.
- For loads and stores, watchpoints can happen more than once without some representation of what elements are complete.
- There may need to be a way to report a watchpoint on one element but restart on an earlier element
- If loads have to do this exception reporting as well, do we forbid loads to happen more than once for each element? Does that help anything if we do?
I'd like to see us relax the ordering of gathers and unordered scatters with younger instructions in some way. If we don't, younger scalar memory accesses will stall for some time as comparisons are much more difficult than
for unit stride or even strided accesses.
Bill
|
|