Basic options for chaining vector loads?


andrew@...
 

Historically, vector machines have employed a wide variety of strategies here.  Simply not executing past a vector memory access until exception checks have been performed is the simplest thing to do.  Standard speculative execution and rollback techniques (e.g. ROB/renaming) have also been used.  And there are a variety of microarchitectural hacks in between those extremes that I'll let others comment on if they wish.

Other vector ISAs haven't mandated precise exceptions at all, which makes this problem trivial, but RVV hasn't standardized this option, and it's inappropriate in many circumstances anyway.

(PS. The fault-only-first loads weren't designed to ease chaining; they were designed to support the vectorization of loops whose trip count isn't known at loop entry.)
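
As a rough illustration of that use case, here is a toy Python model of a strlen-style loop, where the trip count is unknown at loop entry. It is a sketch only, not RVV code: the flat `MEM` buffer, the `PageFault` class, and the `load_ff` and `vec_strlen` names are all made up for this example, and the only property borrowed from fault-only-first loads is that a fault on element 0 traps while a fault on a later element just shortens the vector length.

```python
# Toy model of a fault-only-first byte load (illustrative, not full RVV semantics).
# Assumption: addresses 0 .. len(MEM)-1 are mapped; anything beyond faults.
MEM = bytearray(b"hello, vectors\x00")


class PageFault(Exception):
    """Hypothetical stand-in for a synchronous load exception."""


def load_ff(base, vl):
    """Return (data, new_vl).

    Only a fault on element 0 traps. A fault on any later element
    trims the vector length instead of raising.
    """
    if base >= len(MEM):
        raise PageFault(hex(base))        # element 0 faults: take the trap
    new_vl = min(vl, len(MEM) - base)     # stop at the first unmapped byte
    return MEM[base:base + new_vl], new_vl


def vec_strlen(start, vlmax=8):
    """Length of the NUL-terminated string at `start`, scanned vlmax bytes at a time."""
    addr = start
    while True:
        data, vl = load_ff(addr, vlmax)
        hit = data.find(0)                # index of the first zero byte, or -1
        if hit >= 0:
            return addr - start + hit
        addr += vl                        # advance by the elements actually loaded
```

The second iteration of `vec_strlen(0)` asks for 8 bytes but only 7 are mapped, so `load_ff` returns a shortened vector rather than trapping, and the loop still finds the terminator.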


On Wed, Oct 27, 2021 at 7:31 AM Arjan Bink <Arjan.Bink@...> wrote:

Hello all,

Could somebody please comment on the basic options related to chaining vector loads? It is easy to see how arithmetic vector instructions can be chained together. However, if a vector load is followed by an arithmetic instruction that depends on it (as in the example below), chaining is harder, because the vector load could cause a synchronous exception on any of its elements. If we chain anyway and want to support precise exceptions, then instructions following the vector load might have to be undone (or prevented from updating the register file), possibly as late as an exception on the load's last element, which would be costly and complex.

VLD  v1, r1
VLD  v2, r2
VMUL v3, v1, v2
VST  v3, r3
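
To put rough numbers on what is at stake for one load followed by a dependent multiply, here is a back-of-the-envelope timing model in Python. Everything in it is an assumption made for illustration: the `finish_cycle` name, the fixed load startup latency, the single cycle of forwarding delay when chaining, and the idea that each unit streams a fixed number of elements per cycle. It does not describe any particular machine.

```python
from math import ceil


def finish_cycle(vl, load_epc, alu_epc, startup, chained):
    """Cycle at which an arithmetic op that consumes a load's result completes.

    vl        -- number of elements
    load_epc  -- elements per cycle delivered by the load path (assumed constant)
    alu_epc   -- elements per cycle consumed by the arithmetic unit (assumed constant)
    startup   -- cycles before the first load data arrives (assumed fixed)
    chained   -- True if the op may start as soon as the first elements arrive
    """
    load_stream = ceil(vl / load_epc)
    alu_stream = ceil(vl / alu_epc)
    if chained:
        # Both units stream in lockstep; the slower one sets the pace.
        # The +1 is an assumed one-cycle forwarding delay.
        return startup + max(load_stream, alu_stream) + 1
    # Unchained: wait for the whole load (and its exception checks) first.
    return startup + load_stream + alu_stream
```

With vl=64, a 10-cycle startup, and 4 elements per cycle on both units, this model gives 42 cycles unchained versus 27 chained. Widening only the load path to 16 elements per cycle brings the unchained case down to 30 while the chained case stays at 27.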

I see some basic strategies:

A) Unroll the above code such that the vector loads and arithmetic instructions become independent of each other (a vector load can then execute in parallel with preceding arithmetic instructions that do not source the vector register updated by the vector load).

B) Make the bus interface used by the vector load really wide, so that not being able to chain it hurts less (fewer cycles lost). This might work well for unit-stride loads, but not really well for vector indexed (scatter/gather) loads.

C) Use fault-only-first vector loads. Are such loads intended to allow for easier chaining (as it will be known early if a load will cause a fault or not)?

D) Somehow predict/compute early on whether exceptions can potentially happen during any element of a vector load/store. This seems not at all practical in the context of vector indexed loads/stores, but is maybe feasible for unit-stride vector loads/stores.

Any other good alternatives that I am missing here, or pros and cons that are worth mentioning?
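
A minimal Python sketch of why D) looks tractable for unit-stride accesses but not for indexed ones. It only considers page faults, and it assumes 4 KiB pages and a hypothetical `known_good_pages` set standing in for "pages already known to be accessible" (for example, current TLB hits with suitable permissions). Misaligned-access and other exception sources are ignored, and all function names are invented for this example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_touched_unit_stride(base, vl, elem_bytes):
    """Pages covered by a unit-stride access; computable from base and vl alone."""
    if vl == 0:
        return range(0)
    first = base // PAGE_SIZE
    last = (base + vl * elem_bytes - 1) // PAGE_SIZE
    return range(first, last + 1)


def can_fault_unit_stride(base, vl, elem_bytes, known_good_pages):
    """True if any touched page is not already known to be accessible."""
    touched = pages_touched_unit_stride(base, vl, elem_bytes)
    return any(page not in known_good_pages for page in touched)


def pages_touched_indexed(base, indices, elem_bytes):
    """Pages covered by an indexed access.

    The result depends on the index values (byte offsets here), so it is
    only known once the whole index vector has been read, and it can
    contain up to one or two pages per element.
    """
    pages = set()
    for offset in indices:
        addr = base + offset
        pages.add(addr // PAGE_SIZE)
        pages.add((addr + elem_bytes - 1) // PAGE_SIZE)
    return pages
```

For the unit-stride case the check needs at most a handful of page lookups before the first element is issued. For the indexed case the number of lookups grows with the vector length, which is the cost that makes early prediction unattractive there.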

Best regards,
Arjan

