Basic options for chaining vector loads?
Hello all,
Could somebody please comment on the basic options for chaining vector loads? It is easy to see how arithmetic vector instructions can be chained together. However, if a vector load is followed by an arithmetic instruction that depends on it (as in the example below), chaining is not as easy, because the vector load could raise a synchronous exception on any of its elements. If we chain anyway and want to support precise exceptions, then instructions following the vector load might have to be undone (or prevented from updating the register file), possibly as late as an exception on the load's last element, which would be costly and complex.
VLD v1, r1
VLD v2, r2
VMUL v3, v1, v2
VST v3, r3
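To put rough numbers on what is lost without chaining, here is a minimal sketch of a cycle model for one load feeding the VMUL. It assumes one element per cycle and fixed startup latencies; the function name and the parameters `mem_latency` and `alu_latency` are hypothetical and not taken from any particular implementation.

```python
def load_then_multiply_cycles(vl, mem_latency, alu_latency, chained):
    """Cycles until the last VMUL result is produced (one element per cycle).

    Toy model only: fixed startup latencies, no bank conflicts, no faults.
    """
    if chained:
        # VMUL consumes each element as soon as the load delivers it,
        # so the two element streams overlap almost completely.
        return mem_latency + alu_latency + vl
    # Without chaining, VMUL starts only after the last load element arrived.
    return (mem_latency + vl) + (alu_latency + vl)


print(load_then_multiply_cycles(64, 20, 4, chained=True))   # 88
print(load_then_multiply_cycles(64, 20, 4, chained=False))  # 152
```

In this model the penalty for not chaining is roughly one extra vector length of cycles per dependent load, which is why the question matters more for long vectors.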
I see some basic strategies:
A) Unroll the above code such that the vector loads and arithmetic instructions become independent of each other (a vector load can then execute in parallel with preceding arithmetic instructions that do not source the vector register updated by that load).
B) Make the bus interface used by the vector load really wide, so that not being able to chain it hurts less (fewer cycles are lost). This might work well for unit-stride loads, but not nearly as well for vector indexed (scatter/gather) loads.
C) Use fault-only-first vector loads. Are such loads intended to allow for easier chaining (as it will be known early if a load will cause a fault or not)?
D) Somehow predict/compute early on whether an exception can potentially happen on any element of a vector load/store. This seems not at all practical for vector indexed loads/stores, but is maybe feasible for unit-stride vector loads/stores.
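For option D with unit-stride accesses, the early check could amount to finding which pages the whole access touches before any element is issued, since the address range is known from the base, the vector length, and the element size. The following is only a sketch under the assumption of 4 KiB pages; `pages_touched` and its parameters are hypothetical names, and the actual translation and permission check per page is left out.

```python
def pages_touched(base, vl, elem_bytes, page_bytes=4096):
    """Page numbers covered by a unit-stride access of vl elements.

    If every page in this list translates without a fault, no element of
    the access can raise a page fault, so chaining would be safe.
    """
    if vl == 0:
        return []
    first = base // page_bytes
    last = (base + vl * elem_bytes - 1) // page_bytes
    return list(range(first, last + 1))


print(pages_touched(0x1000, 8, 8))  # [1]
print(pages_touched(0x1FF8, 2, 8))  # [1, 2]
```

A unit-stride access needs only a handful of page checks up front, whereas an indexed access can touch a different page on every element, which is why the same trick looks impractical there.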
Are there any other good alternatives that I am missing here, or pros and cons that are worth mentioning?
Best regards,
Arjan