Vector Indexed Loads - Partial Return?
This observation is a bit of a memory-side issue but is tied to vector indexed-load semantics.

The problem that I see is that the address offsets in my vector indexed load instructions are rather large. The offsets derived from column indices are often as large as 8 KB (depending on matrix size). Given that these indices are sparse, they tend to be jumpy, crossing typical cache-block boundaries. When such an indexed load is executed, the memory system has a bunch of work to do: depending on the system, it takes several cycles for the memory to gather the requested contents and return them, and these memory-side access cycles stall the pipe.

Question: is there a semantic extension of the indexed load instruction whereby, if the memory has accumulated several but not all values, it could return those values and let the pipe consume whatever was returned? Some status would be returned to indicate which indices completed, and the instruction could be retried for the as-yet-pending indices to complete it.

The alternative is a software solution wherein the indices supplied in each invocation of the indexed load are close to each other ("close" according to some spatial definition). If the indices are known statically, then the software solution could work.

Not sure if this issue has already been addressed in some way.

Thanks,
Nagendra
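The retry semantics being asked about can be modeled in software. Below is a minimal Python sketch (all names are illustrative, not from any spec): a gather returns values for whichever lanes hit resident cache lines plus a completion mask, and a driver loop consumes completed lanes and retries the pending ones.

```python
CACHE_LINE = 64  # bytes; illustrative

def partial_indexed_load(mem, base, offsets, cached_lines):
    """One pass of the hypothetical instruction: lanes whose cache line
    is resident return their value; the rest report not-done."""
    values, done = [], []
    for off in offsets:
        addr = base + off
        if addr // CACHE_LINE in cached_lines:
            values.append(mem[addr])
            done.append(True)
        else:
            values.append(None)
            done.append(False)
    return values, done

def gather_retry(mem, base, offsets, cached_lines):
    """Driver loop: consume completed lanes, then retry the pending ones.
    Between retries we fetch one missing line, standing in for the
    memory system making progress."""
    result = [None] * len(offsets)
    while True:
        vals, done = partial_indexed_load(mem, base, offsets, cached_lines)
        for i, d in enumerate(done):
            if d and result[i] is None:
                result[i] = vals[i]
        if all(r is not None for r in result):
            return result
        first_pending = done.index(False)
        cached_lines.add((base + offsets[first_pending]) // CACHE_LINE)
```

For example, with only the first cache line initially resident, `gather_retry` takes three passes to satisfy three lanes that touch three different lines.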
Guy Lemieux
Since this is happening to an already-slow memory access, I would guess performing checks in software would be the better way to go. There is no current talk about aborting a long-running indexed load.

However, internally, a microarchitecture may be able to perform partial execution of dependent instructions (e.g., if the following instruction is a VADD, then an OoO engine can do the addition for the parts of the data already received, thus enabling overlapped execution).

Guy

On Tue, Mar 10, 2020 at 5:47 PM Nagendra Gulur <nagendra.gd@...> wrote:
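Guy's chaining point can be illustrated with a toy timing model (purely illustrative, assuming a single ALU slot per cycle): with chaining, each element's add issues as soon as its load value arrives, instead of waiting for the whole gather to return.

```python
def vadd_finish_cycle(load_ready, add_latency=1, chained=True):
    """Cycle at which the last element of a dependent VADD completes.
    `load_ready[i]` is the cycle lane i's loaded value arrives; the ALU
    accepts one element per cycle. Without chaining, no add may issue
    until the entire gather has returned."""
    arrivals = sorted(load_ready)
    if not chained:
        start = max(arrivals)                # wait for the whole gather
        return start + len(arrivals) * add_latency
    t = 0
    for ready in arrivals:
        t = max(t, ready) + add_latency      # issue when value ready and ALU free
    return t
```

With arrival cycles [1, 2, 10, 11], the chained version finishes at cycle 12 versus 15 unchained: the early arrivals are fully overlapped with the stragglers' memory latency.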
Andy Glew Si5
Anecdote:
One of my initial inspirations for working on dynamic out-of-order scheduling was to support incrementally pipelined-in-time vectors, both scatter/gather and strided.
I.e. the P6 RS began as a vector machine stunt box.
From: tech-vector-ext@... <tech-vector-ext@...> On Behalf Of Nagendra Gulur
| This observation is a bit of a memory side issue but is tied to vector indexed loading semantics.
Intel's Larrabee machines had something like this, where indexed loads cleared down a mask of completed accesses. I think it's a very bad design pattern for something as commonplace as an indexed load (Roger will probably challenge me on that :-). Apart from introducing functional non-determinism into the spec (and more potential timing channels into the system), it also means that SW must always branch to check whether there's a need to loop around, even on machines that don't need it. Even if those branches are very predictable on a machine that doesn't need them, they pollute predictor tables - unless you separate these branches/checks out, which is just more microarchitectural junk.

Hardware chaining, as Guy mentions, gets you the benefits (even in an in-order uarch).

An alternative software approach is to use shorter-than-maximum-length vectors if you believe this will be a problem in your system. Or separately vectorize gathers that fall in a small neighborhood of addresses.

Krste

On Tue, 10 Mar 2020 18:18:22 -0700, "Guy Lemieux" <glemieux@...> said:
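Krste's suggestion of separately vectorizing gathers that fall in a small address neighborhood amounts to a preprocessing pass over the (statically known) indices. A sketch, where the 1 KiB span is an arbitrary tuning knob, not anything from the spec:

```python
def group_by_neighborhood(offsets, span=1024):
    """Partition sparse offsets into runs whose members all lie within
    `span` bytes of the run's first element; each run can then be
    gathered as its own short vector with good spatial locality."""
    groups = []
    for off in sorted(offsets):
        if groups and off - groups[-1][0] < span:
            groups[-1].append(off)
        else:
            groups.append([off])
    return groups
```

For example, offsets [5000, 10, 900, 5100] split into two neighborhoods, [10, 900] and [5000, 5100], each of which stays within a couple of cache lines' worth of reach.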
Not really challenge :-) But it's the only option if you can't / don't want to add a "vstart" equivalent (because you don't want to save/restore it...).

@Nagendra: note that the spec does require the microarchitecture to make some form of "forward progress" (in the presence of, say, page faults) and at least get "one element done" and advance "vstart" accordingly. So what you are describing will kind-of happen when truly bad things (page faults) occur.

Now, you were thinking of it from the point of "performance" and maybe doing something in the interim while the rest of the values return. I challenge that this is really practical. Your ideal microarchitecture should have used up all your MSHRs, and you wouldn't get much further ahead by "branching to do something else". That "something else" would most likely have one or two memory operations and, chances are, they would block due to lack of MSHRs.

Now, if your indexed load has sparse but small indices and they all hit in the L1 cache, then all this discussion is moot anyway, as the gather would finish much sooner than "branch to some other code, examine what data is missing, and decide what to do with the data we've got back".

roger.

On Mon, Mar 30, 2020 at 10:30 AM Krste Asanovic <krste@...> wrote:
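Roger's forward-progress point can be modeled as a trap-and-resume loop. This is a sketch of the contract, not of any implementation; `fault_addrs` and `handle_fault` are invented stand-ins for faulting pages and the OS handler:

```python
def indexed_load_with_vstart(mem, offsets, vstart, fault_addrs):
    """Execute lanes starting at vstart; stop at the first faulting lane.
    Returns (values_completed, new_vstart, faulted)."""
    values = []
    for i in range(vstart, len(offsets)):
        addr = offsets[i]
        if addr in fault_addrs:
            return values, i, True       # trap; resume here after fix-up
        values.append(mem[addr])
    return values, len(offsets), False

def run_to_completion(mem, offsets, fault_addrs, handle_fault):
    """Retry with the updated vstart until the whole vector completes."""
    result, vstart = [], 0
    while vstart < len(offsets):
        vals, vstart, faulted = indexed_load_with_vstart(
            mem, offsets, vstart, fault_addrs)
        result.extend(vals)
        if faulted:
            handle_fault(offsets[vstart], fault_addrs)  # e.g. page it in
    return result
```

The elements completed before the fault are retained, vstart records the resume point, and re-execution after the handler finishes the rest - which is the "kind-of-happen" partial-return behavior described above.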