Vector TG minutes for 2020/9/18
Date: 2020/9/18
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~17
Current issues on github: https://github.com/riscv/riscv-v-spec
#551 Memory orderings scalar-vector
#534/5 Element ordering
-----------------------------------
The following proposal was put forward and was agreeable to the group.
Vector unit-stride and constant-stride memory accesses (both load and
store) would always be unordered by element. Indexed accesses would
be supplied in both ordered and unordered forms. Where ordering is
required, software has to use an indexed instruction.
Existing encoding is retained in mop[1:0], with the previously
reserved load mop[1:0] encoding now allocated to unordered gather.
Strided operations are now treated as unordered.
Loads
mop[1:0]  Old                               New
0 0       unit-stride (ordered)    VLE      unit-stride (unordered)  VLE
0 1       reserved                 ---      indexed (unordered)      VLUXEI
1 0       strided (ordered)        VLSE     strided (unordered)      VLSE
1 1       indexed (ordered)        VLXEI    indexed (ordered)        VLOXEI
Stores
mop[1:0]  Old                               New
0 0       unit-stride (unordered)  VSE      unit-stride (unordered)  VSE
0 1       indexed (unordered)      VSUXEI   indexed (unordered)      VSUXEI
1 0       strided (unordered)      VSSE     strided (unordered)      VSSE
1 1       indexed (ordered)        VSXEI    indexed (ordered)        VSOXEI
(The mnemonics were not discussed at length in the meeting, and a
slightly different scheme is given here. The indexed operations have a
"U" or "O" to distinguish between unordered and ordered. This is a
little less consistent, as one could argue that unordered indexed
operations don't need a "U" to match the others, but this approach
minimizes disruption to existing software.)
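As an illustration only, the new mop[1:0] assignments from the tables above can be written as a small lookup. This is a sketch, not decoder logic from the spec: the names NEW_MOP and decode_mop are hypothetical, and the mnemonics are the ones listed in these minutes.

```python
# Proposed (new) meaning of mop[1:0], transcribed from the tables above.
# NEW_MOP and decode_mop are hypothetical names used only for this sketch.
NEW_MOP = {
    # mop: (addressing mode, ordered?, load mnemonic, store mnemonic)
    0b00: ("unit-stride", False, "VLE", "VSE"),
    0b01: ("indexed", False, "VLUXEI", "VSUXEI"),
    0b10: ("strided", False, "VLSE", "VSSE"),
    0b11: ("indexed", True, "VLOXEI", "VSOXEI"),
}


def decode_mop(mop, is_store):
    """Return the addressing mode, ordering, and mnemonic for a mop[1:0] value."""
    mode, ordered, load_name, store_name = NEW_MOP[mop & 0b11]
    return {
        "mode": mode,
        "ordered": ordered,
        "mnemonic": store_name if is_store else load_name,
    }
```

Under this assignment only mop=11 is ordered, which matches the rule that software needing ordering has to use an indexed instruction.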
For unordered instructions (mop!=11) there is no guarantee on element
access order. For segment loads and stores, the individual element
accesses within each segment are unordered with respect to each other.
If the accesses are to a strongly ordered IO region, the element
accesses can happen in any order.
Stride-0 Optimizations
----------------------
We also discussed stride-0 optimizations. The proposed scheme is that
if rs1=x0, then an implementation is allowed to perform only one
memory read and replicate the value to all destination elements, but
may read the location more than once. Similarly, a store might combine
one or more elements and do fewer writes to memory. With a zero-valued
stride in a register rs1!=x0, i.e., x[rs1]==0, the implementation must
perform all accesses (but these will be unordered). It was noted that
the compiler must take care not to convert a known stride of 0 into
use of x0 if all memory accesses are required.
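The difference between the two stride-0 cases can be sketched with a toy model of a strided load that logs each memory read. The function name, the regs list, and the access log are all assumptions made for illustration. The model shows one permitted behaviour (a single read when the stride register specifier is x0); an implementation may also read more than once, and the index-order loop is just one legal order for unordered accesses.

```python
def strided_load(memory, base, stride_reg, regs, vl, log):
    """Toy model of a strided vector load of vl elements.

    memory:     dict mapping address -> value
    stride_reg: index of the scalar register holding the stride (0 means x0)
    regs:       scalar register file as a list; regs[0] is always 0
    log:        list that receives the address of every memory read
    """
    if vl == 0:
        return []
    if stride_reg == 0:
        # Stride register specifier is x0: the implementation MAY read once
        # and replicate the value to every destination element.
        log.append(base)
        return [memory[base]] * vl
    # Any other register, even one holding 0: every element access must be
    # performed. Element order is not guaranteed; index order is one choice.
    stride = regs[stride_reg]
    result = []
    for i in range(vl):
        addr = base + i * stride
        log.append(addr)
        result.append(memory[addr])
    return result
```

For example, with memory = {100: 7} and vl = 4, using x0 as the stride register logs a single read of address 100 in this model, while using another register that holds 0 logs four reads of address 100. Both return [7, 7, 7, 7].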
If it is desired to perform multiple repeated ordered accesses to a
non-idempotent memory region (e.g., popping a memory-mapped FIFO),
then an ordered gather should be used targeting the single address.
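A minimal sketch of the FIFO case, assuming a hypothetical memory-mapped device whose every read pops one entry. FifoDevice and ordered_gather are illustrative names, not part of any spec or library. The gather uses all-zero offsets so that every element targets the same address, and it performs the reads strictly in element order.

```python
from collections import deque


class FifoDevice:
    """Hypothetical memory-mapped FIFO: each read of its address pops one entry."""

    def __init__(self, address, items):
        self.address = address
        self.queue = deque(items)

    def read(self, addr):
        if addr != self.address:
            raise ValueError("address does not map to the FIFO")
        return self.queue.popleft()


def ordered_gather(device, base, offsets, vl):
    """Model of an ordered indexed load: element i is read before element i+1."""
    return [device.read(base + offsets[i]) for i in range(vl)]


fifo = FifoDevice(0x1000, [10, 20, 30, 40])
popped = ordered_gather(fifo, 0x1000, [0, 0, 0], 3)  # all offsets target one address
```

Because the region is non-idempotent, each read has a side effect. That is why this case needs the ordered indexed form with every access performed, rather than a stride-0 access that may be combined or reordered.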
When a segment straddles a PMA boundary, the segment accesses must
obey the PMA constraints associated with each constituent element's
address for accesses to that element.
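The per-element PMA rule can be pictured as a lookup done separately for each field of a segment. The sketch below is a simplification under stated assumptions: regions are half-open (start, end, attribute) tuples, each element is classified by its starting address only, and the names pma_for and segment_element_pmas are hypothetical.

```python
def pma_for(address, regions):
    """Return the attribute of the region containing address.

    regions is a list of (start, end, attribute) tuples with end exclusive.
    """
    for start, end, attrs in regions:
        if start <= address < end:
            return attrs
    raise ValueError("address not covered by any region")


def segment_element_pmas(base, nfields, elem_bytes, regions):
    """PMA attribute that governs each field of one segment starting at base."""
    return [pma_for(base + f * elem_bytes, regions) for f in range(nfields)]


regions = [
    (0x000, 0x100, "idempotent"),
    (0x100, 0x200, "non-idempotent"),
]
# A 4-field segment of 4-byte elements that starts 8 bytes below the boundary.
attrs = segment_element_pmas(0xF8, 4, 4, regions)
```

Here the first two fields fall in the idempotent region and the last two in the non-idempotent region, so each access is governed by the attribute of its own element's address.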
Ordered AMOs
------------
The current PoR only has unordered AMOs. It was discussed whether
ordered AMOs are desirable, but there were few clear examples where
this would be useful, and so these are not planned for v1.0.
Element-Precise Exception Reporting
-----------------------------------
There was discussion around allowing vector stores to have updated
some locations in memory corresponding to elements before the element
that raises a synchronous exception. The proposal was to allow these
additional stores in idempotent memory regions. Stores to
non-idempotent memory regions must not occur for elements past an
element reporting an exception.
#493/510/532 Opaque vstart
--------------------------
There was brief discussion around opaque vstart. Given the relaxation
of memory access ordering, it was felt less critical to support opaque
vstart, but it was suggested to allow this in a subset of the base
architecture (treating non-opaque vstart in V as an extension).
However, it was also noted that opaque vstart alone will rarely be
sufficient to support resumable traps, and in general additional
mechanisms will be required to save and restore microarchitectural
state in complex processors with imprecise traps.