Re: Vector Byte Arrangement in Wide Implementations


On Thu, Nov 5, 2020 at 11:05 PM Bill Huffman <huffman@...> wrote:

On 11/5/20 10:51 PM, Bill Huffman wrote:

On 11/5/20 8:33 PM, Andrew Waterman wrote:

On Thu, Nov 5, 2020 at 5:31 PM Bill Huffman <huffman@...> wrote:

On 11/5/20 4:36 PM, Andrew Waterman wrote:

On Thu, Nov 5, 2020 at 4:17 PM Bill Huffman <huffman@...> wrote:

On 11/5/20 3:35 PM, Andrew Waterman wrote:

On Thu, Nov 5, 2020 at 3:27 PM Bill Huffman <huffman@...> wrote:

I've been thinking through the cases where a wide implementation that wants "slices" could have to introduce a hiccup to rearrange bytes because of an EEW change (since SLEN is gone).  The ones I know of, with comments, are:

  • The programmer intended to read different size elements than were written
    • This should be extremely rare.  There are lots of things to manage - such as the vector length changing.
    • The hiccup will simply happen.
  • The compiler is spilling and filling vector registers within a compilation unit
    • The store should be a whole register store and the load should be a whole register load with a size hint for the next use, so the hiccup is avoided.
    • Filling with the wrong size will be a compiler performance bug.
  • The compiler is spilling and filling vector registers across compilation units
    • This probably happens only if there are callee-saved vector registers
    • Is IPA realistic for this case, or will it happen any time there are callee-saved vector registers regardless of compilation units?
    • If it does happen, hardware may want to predict the correct size for the fill
    • With the current instructions, I don't think there's a way to make that happen (see below)
  • The OS is swapping processes
    • This is rare and we will live with the hiccup.

So, the first question is whether these are all the cases.  Are there any other cases where the EEW of a register will change?

The second question is whether to provide for the spilling and filling across compilation units.  The problem is that the whole register loads all have a size hint.  If the desired load EEW is not known, the instruction must still state a load EEW hint and the hardware has no way to know that it would be a good idea to use its prediction on this load.

It might be nice here to have a load type that indicated that the hardware ought to predict the next use type by associating it with the type in use at the time of the previous whole register store of the same register.  It might work to have a separate whole register store encoding that indicated the next whole register load could predict the current micro-architectural value instead of using the size hint in the load.  The separate store is easier to encode.

Any thoughts on how a predictor might know without adding such a load?

I'm probably being obtuse, because you've surely already thought this through: if you can build an EEW predictor, why can't you build an ignore-the-encoded-EEW predictor?  If you have a PC-indexed and PC-tagged structure, a hit means you should ignore the specified EEW and use the one you've memoized in the structure.
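A minimal sketch of the PC-indexed, PC-tagged structure described above, as a software model only. The table size, the use of pc >> 1 for indexing, the allocation policy (train when the encoded hint turned out wrong), and all names here are assumptions for illustration, not anything from the spec.

```python
class PcTaggedEewOverride:
    """Hypothetical model of an ignore-the-encoded-EEW table.

    A hit (index and tag both match the load's PC) means: ignore the EEW
    encoded in the whole register load and use the memoized one instead.
    """

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        # Each entry is None (invalid) or a (tag, eew) pair.
        self.entries = [None] * num_entries

    def _index_and_tag(self, pc):
        # Assumption: instructions are at least 2-byte aligned.
        halfword = pc >> 1
        return halfword % self.num_entries, halfword // self.num_entries

    def train(self, pc, actual_eew):
        # Assumed policy: called after the encoded hint at this PC was wrong.
        index, tag = self._index_and_tag(pc)
        self.entries[index] = (tag, actual_eew)

    def predict(self, pc, encoded_eew):
        index, tag = self._index_and_tag(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]   # hit: override the encoded hint
        return encoded_eew    # miss: trust the encoded hint
```

A load whose PC misses in the table simply uses its encoded hint, so code the compiler hinted correctly never needs an entry.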

My idea of a predictor is one predicted size per vector register (or maybe two or three sizes operating as a stack).  Let's say for simplicity, we have both a separate store whole register and a separate load whole register that are used for spill/fill when the compiler doesn't know what EEW is in use.  They cooperate to reload in the same EEW form that was there before the store.  Maybe one of the two instructions doesn't have to be separate.
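A rough software model of the per-vector-register predictor described above, with the optional small stack of sizes. The register count of 32, the stack depth parameter, and the method names are assumptions made for the sketch.

```python
from collections import deque


class PerRegisterEewPredictor:
    """Hypothetical model: each vector register remembers the EEW(s) it held
    at its most recent whole register store(s), kept as a small stack."""

    NUM_VREGS = 32

    def __init__(self, depth=1):
        # A bounded deque drops the oldest entry when a push overflows it.
        self.saved = [deque(maxlen=depth) for _ in range(self.NUM_VREGS)]

    def on_whole_register_store(self, vreg, current_eew):
        # Spill: remember the EEW form the register was in.
        self.saved[vreg].append(current_eew)

    def on_whole_register_load(self, vreg, fallback_eew):
        # Fill: reload in the form seen at the matching store, if one is
        # remembered; otherwise fall back to the supplied EEW.
        stack = self.saved[vreg]
        if stack:
            return stack.pop()
        return fallback_eew
```

With depth greater than 1, nested spill/fill pairs of the same register (for example across two levels of calls) are matched last-in, first-out.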

If we use a PC to predict we can predict ignore-the-encoded-EEW as you say, but then the predictor gets bigger and, I expect, mispredicts more often.  If a function is called from multiple places and each place has a different set of EEWs in vector registers, then the predictor will need to follow PC history to be reasonably accurate, I think.  The whole register loads can load 2, 4, or 8 registers at once - and those registers will often have different EEWs because they are grouped for speed/size, not because they're actually related in current use.

If your per-vector-register predictor works well to begin with, I would think you could extend it with a valid bit that indicates whether to use the prediction or the encoded hint, and it would probably work OK.  Right?

How are you thinking that bit gets set/cleared?  The same store instruction is used whether or not the compiler will be able to put in a hint.

I was thinking of something along the lines of the Alpha 21264's store-wait predictor: every N cycles (with N probably somewhere in the range of [2^10, 2^14]), clear the valid bits.

Do you know of a reference for how the store-wait predictor works?  I can't find any reference to it in the 21264 hardware reference manual, though I found a reference (without description) in a paper.

I see it's called a "stWait table" in the hardware reference manual.

I see how that works for waiting loads.  I'm guessing that you're thinking of a PC-based table here that remembers that certain whole register loads fail on the hint and should use the previous store EEW instead.  I'll think about that.  Maybe it would work OK.

The key insight from the 21264 design is that the valid bits are cleared periodically.  I appreciate that you were willing to meet me more than halfway, but I actually think your original idea would suffice: per-vector-register state + valid bits + periodic clearing.  I think what you assumed I was suggesting would also work.  Deciding which is better is a matter of ISCAtecture :-)
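The combination named above (per-vector-register state + valid bits + periodic clearing) could be modeled roughly as follows. When the valid bit gets set is not spelled out in the thread; this sketch assumes it is set when a load that followed the encoded hint caused a hiccup. The clear interval default and all names are likewise assumptions.

```python
class ValidBitEewPredictor:
    """Hypothetical model: per-register stored EEW, a valid bit selecting
    prediction over the encoded hint, and periodic clearing of valid bits."""

    def __init__(self, num_vregs=32, clear_interval=4096):
        self.stored_eew = [None] * num_vregs
        self.valid = [False] * num_vregs
        self.clear_interval = clear_interval  # N, somewhere in [2^10, 2^14]
        self.cycles_since_clear = 0

    def tick(self, cycles=1):
        # Every N cycles, clear all valid bits so stale decisions age out.
        self.cycles_since_clear += cycles
        if self.cycles_since_clear >= self.clear_interval:
            self.cycles_since_clear %= self.clear_interval
            self.valid = [False] * len(self.valid)

    def on_whole_register_store(self, vreg, current_eew):
        self.stored_eew[vreg] = current_eew

    def on_hiccup(self, vreg):
        # Assumption: the encoded hint was wrong for this register, so
        # prefer the remembered EEW until the next periodic clear.
        self.valid[vreg] = True

    def choose_load_eew(self, vreg, hint_eew):
        if self.valid[vreg] and self.stored_eew[vreg] is not None:
            return self.stored_eew[vreg]
        return hint_eew
```

The periodic clear means a wrong decision to distrust the hint costs at most one interval before the hardware goes back to trusting the compiler.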


Maybe you're thinking of tracking whether it's better to go by the EEW of the whole register load hint or the EEW of the most recent whole register store.  If the one that's being used is wrong often enough, try the other one.
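One way to read "if the one that's being used is wrong often enough, try the other one" is a small chooser that switches source after a run of consecutive misses. The threshold of 2 and the names are assumptions; a real design might use a saturating counter per register or per PC instead.

```python
class EewSourceChooser:
    """Hypothetical chooser between the load's encoded hint and the EEW
    seen at the most recent whole register store."""

    def __init__(self, threshold=2):
        self.use_store_eew = False   # start by trusting the encoded hint
        self.wrong_count = 0
        self.threshold = threshold

    def choose(self, hint_eew, store_eew):
        return store_eew if self.use_store_eew else hint_eew

    def update(self, chosen_eew, actual_eew):
        # actual_eew is the EEW of the first real use after the load.
        if chosen_eew == actual_eew:
            self.wrong_count = 0
            return
        self.wrong_count += 1
        if self.wrong_count >= self.threshold:
            # Wrong often enough: switch to the other source.
            self.use_store_eew = not self.use_store_eew
            self.wrong_count = 0
```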

The saving grace is that the misprediction penalty isn't nearly as extreme as, say, a branch misprediction in an OOO superscalar.  If we can get rid of, say, 80% of the hiccups on interprocedure spills and fills (and of course 100% of intraprocedure ones) then the perf impact of the hiccups won't be a huge thing.

I'm hoping such a predictor is not needed for the reasons you say.  But I want to know that there is an "out" if it comes to that.

Gotcha.  And while I'm not 100% convinced of the efficacy of my particular proposal, I think it's sufficiently on the right track that clever microarchitects can devise a solution along those lines.

My immediate idea of a solution is that we have 3 bits of size hints for whole loads.  Only one value of the same field is used for whole register stores.  Let's use two values instead for the whole register store.  One hints that the following whole register load hint will be correct.  The other whole register store opcode hints that the EEW being stored is more likely to be correct for the following whole register load.  Architecturally, they do the same thing, of course.
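As a sketch of how hardware might act on two whole register store encodings, the model below remembers the stored EEW only when the store variant asks for it. The enum names, register count, and method names are hypothetical; architecturally both stores behave identically, so only the hint handling is modeled.

```python
from enum import Enum


class StoreKind(Enum):
    TRUST_LOAD_HINT = 0    # the following whole register load hint is right
    TRUST_STORED_EEW = 1   # the EEW being stored is the better prediction


class StoreHintTracker:
    """Hypothetical model of the micro-architectural side of the two stores."""

    def __init__(self, num_vregs=32):
        self.remembered = [None] * num_vregs

    def whole_register_store(self, vreg, kind, current_eew):
        # Only the second store variant asks the next load to reuse this EEW.
        if kind is StoreKind.TRUST_STORED_EEW:
            self.remembered[vreg] = current_eew
        else:
            self.remembered[vreg] = None

    def whole_register_load(self, vreg, hint_eew):
        remembered = self.remembered[vreg]
        return hint_eew if remembered is None else remembered
```

Under this reading the compiler would use the second variant for callee-saved spills, where it cannot know the caller's EEW, and the first variant everywhere else.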

