On 2020-09-04 5:00 p.m., swallach wrote:
i have not been following this thread in lots of detail.
could someone please explain why we need to differentiate between
ordered and unordered load/stores?
These issues discuss the need for order and reasons to have variants:
https://github.com/riscv/riscv-v-spec/issues/501 Unordered Indexed Load
https://github.com/riscv/riscv-v-spec/issues/502 Unordered Index Operation Memory Ordering Description
https://github.com/riscv/riscv-v-spec/issues/504 Further Ordering Relaxation of Unordered Indexed Stores/Loads
https://github.com/riscv/riscv-v-spec/issues/528 rename VSXEI<EEW> to VSOXEI<EEW>
in the 6 or so vector systems i have been involved with, vector
references bypassed the cache, and main memory was highly interleaved.
And this part has not been as extensively discussed. RISCV has a
mechanism to characterize memory regions, including channels with
"special" features.
The idea is that RISCV vector should be all things to all needs,
and failing that, it should address as many needs as practical.
One of those "needs" is memory coherence that agrees with the
RVWMO model.
I am in favour of a vector enhanced RVWMO, "RVVWMO", if it proves
to be beneficial. Specifically, we propose a weaker RVVWMO for
software to validate, and advise that all future safe
implementations only use RVWMO. This allows a tightening of the
proposed RVVWMO if issues do arise.
compilers could not care less.
One objective is autovectorization, and for this case compilers
definitely care.
one of the major performance optimizations was to eliminate
power-of-2 strides (generally done manually).
at convey we even had an option to deploy a memory system with a
prime number of interleaves. (the bsp first did this)
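To make the stride point concrete, here is a minimal sketch of why power-of-2 strides hurt on a power-of-2 interleave and why a prime number of banks helps. It assumes a deliberately simple model in which word address a maps to bank a % num_banks; the function name and the bank counts are illustrative only and do not describe any particular machine.

```python
from collections import Counter

def max_bank_load(num_banks, stride, vl):
    """Largest number of the vl element accesses that land in a single bank.

    Assumed model: word-interleaved memory, word address a -> bank a % num_banks.
    The base address is taken as 0; it only rotates the bank numbering.
    """
    banks = Counter((i * stride) % num_banks for i in range(vl))
    return max(banks.values())

# Stride 16 on 16 banks: every access hits the same bank.
print(max_bank_load(num_banks=16, stride=16, vl=64))  # 64
# Stride 16 on 17 banks (prime): accesses spread almost evenly.
print(max_bank_load(num_banks=17, stride=16, vl=64))  # 4
```

With 17 banks the stride is coprime to the bank count, so consecutive elements walk through all banks before any bank is revisited.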
i saw one application, a major cfd code, that
sorted references to a stencil based reference pattern. this
was done to optimize performance for cache based vector systems.
with the presence of HBM memory systems, and some
clever memory controller design (that could be done with vector
reference information), i am pretty sure ordered and unordered
loads/stores will have the same implementation. to be more
specific, a memory design for GUPS would have this type of
implementation.
i look forward to a better understanding
Thank you for all your inputs that have helped me get a better
understanding.
Guy Lemieux commented:
I think 90%+ of implementations
will choose to do ordered loads and stores even though
unordered is permitted.
This means programmers will expect
them to be ordered, and such software will not work properly
on the remaining implementations. This compatibility problem
is a concern.
I think the best way to combat this
is to have 2 sets of instructions: ordered and unordered.
The unordered implementation can simply do the ordered thing
in simple implementations.
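The compatibility concern can be illustrated with a small simulation of a scatter whose index vector contains a repeated entry. This is only a sketch: "arbitrary order" is modelled by shuffling the element order, and the function names are made up for the example rather than taken from the spec.

```python
import random

def scatter(memory, indices, values, ordered, rng=random):
    """Write values[i] to memory[indices[i]] for every element i.

    ordered=True  : elements are written in element order (last write wins).
    ordered=False : element order is shuffled, standing in for an
                    implementation that is free to pick any order.
    """
    order = list(range(len(indices)))
    if not ordered:
        rng.shuffle(order)
    for i in order:
        memory[indices[i]] = values[i]
    return memory

def possible_results(indices, values, size, trials=200):
    """Distinct final memory images observed over many unordered runs."""
    rng = random.Random(0)
    seen = set()
    for _ in range(trials):
        image = scatter([0] * size, indices, values, ordered=False, rng=rng)
        seen.add(tuple(image))
    return seen

# Elements 0 and 1 both target location 2.
print(scatter([0, 0, 0], [2, 2, 1], [10, 20, 30], ordered=True))  # [0, 30, 20]
print(sorted(possible_results([2, 2, 1], [10, 20, 30], size=3)))
```

Software tested only on an implementation that happens to behave as ordered would always see 20 in location 2, and could silently break where 10 is also a legal outcome.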
Ordered stores to a FIFO is a
paradigm I was hoping to use for inter-processor
communication.
I think compiler considerations are
also important, but I don’t know the implications here.
Guy
Maybe what's
below could be improved by saying that if the base address (in
src1) was non-idempotent or an "ordered channel," the entire
instruction would run in order. If not, it would not. We
could allow stride of zero but not other overlapping strides
for stores. Having a later access come into a non-idempotent
or ordered region would raise an exception. That would
provide for loads from and stores to a FIFO to work. But it
wouldn't provide for an instruction to "fall into" such a
memory segment part way through.
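One way to read the suggestion above is as a decision made once per instruction from the base address, sketched below. The region lookup is_special, the exception type, and the address window are all hypothetical stand-ins; the case where the base is in a special region but a later element leaves it is not addressed in the text and is simply treated as ordered here.

```python
class OrderedRegionFault(Exception):
    """Hypothetical trap: an element strayed into a special region."""

def plan_vector_access(base, element_addrs, is_special):
    """Return "ordered" or "unordered" for one vector memory instruction.

    is_special(addr) is an assumed lookup that is True for non-idempotent
    or ordered-channel regions.
    """
    if is_special(base):
        # Base address is special: run the whole instruction in order.
        return "ordered"
    for i, addr in enumerate(element_addrs):
        if is_special(addr):
            # Instruction would "fall into" a special region part way through.
            raise OrderedRegionFault(f"element {i} at {addr:#x}")
    return "unordered"

def in_fifo_window(addr):
    """Illustrative region map: a FIFO port occupying 0x1000..0x100f."""
    return 0x1000 <= addr < 0x1010

print(plan_vector_access(0x1000, [0x1000] * 4, in_fifo_window))    # ordered
print(plan_vector_access(0x2000, [0x2000, 0x2004], in_fifo_window))  # unordered
```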
Bill
On 9/4/20 10:18 AM, Bill Huffman wrote:
I think from this morning, we are considering:
- Ordered scatters are done truly in order
- Strided stores that overlap (including segmented ones) will trap
  as illegal
- All other vector loads and stores do their memory accesses in
  arbitrary order.
- A vector load that accesses the same location multiple times is
  free to use the same loaded value for any subset of results
- All loads with vector sources must use a different register for
  the destination than any source (including mask).
- Maybe a vector load may access the memory location corresponding
  to a given element multiple times (for exception handling)??
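As a rough illustration of the overlap rule for strided stores, the check below treats element i as writing nf * eew_bytes bytes starting at base + i * stride. How an implementation would actually detect the condition is not stated in the thread; the function and parameter names are assumptions for the sketch, masking is ignored, and address wrap-around is not modelled.

```python
def strided_store_overlaps(stride, eew_bytes, vl, nf=1):
    """True if any two elements (or segments) of a strided store share a byte.

    Element i covers [base + i*stride, base + i*stride + nf*eew_bytes).
    Adjacent elements are the closest pair, so an overlap exists exactly
    when |stride| is smaller than the per-element footprint.
    """
    if vl < 2:
        return False  # a single element cannot overlap itself
    return abs(stride) < nf * eew_bytes

print(strided_store_overlaps(stride=0, eew_bytes=4, vl=8))         # True
print(strided_store_overlaps(stride=4, eew_bytes=4, vl=8))         # False
print(strided_store_overlaps(stride=8, eew_bytes=4, vl=8, nf=3))   # True
```

Note that the stride is a run-time register value, so this condition generally cannot be ruled out at compile time.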
A few of the consequences of this are:
- A gather with repeated elements can access the higher numbered
  elements first and lower ones later
- A vector memory access where multiple elements match watchpoint
  criteria can trap on any of the multiple elements, regardless of
  watchpoint requirements on order
- A stride-0 load accessing an "incrementing" location can see a
  result with larger values at lower element numbers than smaller
  values
- When vector loads or stores access an "ordered channel" the
  elements will still be accessed in arbitrary order
- Strided loads, gathers, and unordered scatters to non-idempotent
  regions will not behave as might be expected.
- A stride-0 store to a FIFO will trap
- A stride-0 load to a FIFO will pop an arbitrary number of entries
  from the FIFO (from 1 to more than vl) and elements are
  distributed in an arbitrary way in the result.
- A non-idempotent memory location accessed by a vector load may be
  accessed multiple times.
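The stride-0 FIFO case is probably the least intuitive of these, so here is a toy model of it. It assumes the load may read the single address any number of times (at least once) and may reuse any loaded value for any subset of result elements; the upper bound of 2 * vl reads is an arbitrary modelling choice, and the names are invented for the sketch.

```python
import random
from collections import deque

def stride0_load_from_fifo(fifo, vl, rng):
    """Model a stride-0 vector load whose one address is a FIFO pop port.

    Returns (result, num_pops). The FIFO must hold at least one entry.
    """
    # The implementation may pop anywhere from 1 entry to more than vl.
    num_pops = rng.randint(1, min(len(fifo), 2 * vl))
    popped = [fifo.popleft() for _ in range(num_pops)]
    # Each result element may receive any of the popped values.
    result = [rng.choice(popped) for _ in range(vl)]
    return result, num_pops

fifo = deque(range(100))
result, num_pops = stride0_load_from_fifo(fifo, vl=4, rng=random.Random(1))
print(result, num_pops, len(fifo))
```

Neither the number of entries consumed nor their placement in the destination register is predictable, which is why a FIFO consumer cannot rely on this form of load.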
We need to
be sure software is OK with these characteristics as
"ordered channels" and non-idempotent regions can't be known
at compile time. Even strides can't always be known at
compile time. Will this plan reduce the amount of
auto-vectorization that can be done?
Exception reporting still has issues:
- Unless stores can be done multiple times, there is a need to save
  some representation of what stores have and have not been done.
- For loads and stores, watchpoints can happen more than once
  without some representation of what elements are complete.
- There may need to be a way to report a watchpoint on one element
  but restart on an earlier element
- If loads have to do this exception reporting as well, do we forbid
  loads to happen more than once for each element? Does that help
  anything if we do?
I'd like to
see us relax the ordering of gathers and unordered scatters
with younger instructions in some way. If we don't, younger
scalar memory accesses will stall for some time as
comparisons are much more difficult than for unit stride or
even strided accesses.
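The difference in comparison cost can be sketched as follows: a unit-stride access is covered by one address-range check, while an indexed access needs a check per element. This is a software analogy only, with illustrative function names; it says nothing about how a real load/store unit would be built.

```python
def ranges_overlap(a, a_len, b, b_len):
    """True if byte ranges [a, a+a_len) and [b, b+b_len) intersect."""
    return a < b + b_len and b < a + a_len

def unit_stride_hazard(base, vl, eew_bytes, scalar_addr, scalar_bytes):
    """One range comparison covers the whole vector access."""
    return ranges_overlap(base, vl * eew_bytes, scalar_addr, scalar_bytes)

def indexed_hazard(base, offsets, eew_bytes, scalar_addr, scalar_bytes):
    """Gather/scatter: every element address must be compared individually."""
    return any(
        ranges_overlap(base + off, eew_bytes, scalar_addr, scalar_bytes)
        for off in offsets
    )

print(unit_stride_hazard(0x100, 8, 4, 0x11C, 4))               # True
print(indexed_hazard(0x100, [0x0, 0x40, 0x8], 4, 0x104, 4))    # False
```

Until all vl element addresses are known and compared, a younger scalar access cannot be proven independent of the gather or scatter, which is the source of the stall.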
Bill