Whole Register Loads and Stores


Bill Huffman
 

The whole register loads and stores in section 7.9 of the spec are
currently specified as having an element size of 8 bits.  Could they be
extended to cover all sizes instead of just the 8-bit size?  It looks
like the encoding space is there.

The different sizes would be functionally identical, but they would let
software avoid the need for hardware to insert a cast operation in many
circumstances, by defining the byte arrangement and setting the
corresponding element-size tag.

Bill


Andrew Waterman
 

I guess the main use case is intra-procedure spills, since callee-save code and context-switch code can’t know the previous tag?

If the main use case is intra-procedure spills and caller-saved registers, is it implausible that the regular unit-stride loads and stores could be used instead? In many of these cases, it’s safe to use the current VL rather than saving the whole register. So the concern about extra setvls can often be avoided, and furthermore, using the runtime VL will sometimes reduce memory traffic versus loading and storing all VLEN bits. (Perhaps I’m being too optimistic about the implementability of this compiler analysis.)
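Andrew's alternative can be sketched concretely - a hedged illustration in current-draft vector assembly, assuming SEW=32 is already in effect and a0 points at a spill slot of at least VLEN/8 bytes:

```asm
    # Whole-register spill/fill (element size fixed at 8 bits in the
    # current spec): always moves all VLEN bits of v8.
    vs1r.v  v8, (a0)        # spill
    # ...
    vl1r.v  v8, (a0)        # fill; hardware may insert a cast to SEW=32

    # Unit-stride alternative at the runtime VL: no extra vsetvli is
    # needed if vtype/vl are unchanged between spill and fill, and only
    # vl active elements are moved.
    vse32.v v8, (a0)        # spill vl elements
    # ...
    vle32.v v8, (a0)        # fill with the correct SEW tag
```

The second sequence is only safe when the spilled register's tail elements are dead, which is the compiler analysis being hedged about here.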

On Mon, Jun 15, 2020 at 6:13 PM Bill Huffman <huffman@...> wrote:



Bill Huffman
 


On 6/15/20 6:54 PM, Andrew Waterman wrote:

I guess the main use case is intra-procedure spills, since callee-save code and context-switch code can’t know the previous tag?

Intra-procedure spills were my first concern.  I assume "callee-save code" means the code that saves callee-save registers before using them and restores them after.  If so, they may be helped in an inter-procedural-analysis context, I guess, so they're a reason as well, though not as strong a one.  Context-switch code can't know, but is much less important.

I'm thinking about wide SIMD and multiple instructions per cycle.  We currently complete several instructions per cycle and more than one memory transfer per cycle.  Each SIMD instruction is 512 bits wide or more.  SIMD will probably get wider, and the number of instructions per cycle will probably increase.  An inserted cast instruction will then take multiple cycles because of the physical arrangement.  So the cost of inserting a cast instruction in hardware (in an in-order machine) is many times the cost of having one put in by the compiler.


If the main use case is intra-procedure spills and caller-saved registers, is it implausible that the regular unit-stride loads and stores could be used instead? In many of these cases, it’s safe to use the current VL rather than saving the whole register. So the concern about extra setvls can often be avoided, and furthermore, using the runtime VL will sometimes reduce memory traffic versus loading and storing all VLEN bits. (Perhaps I’m being too optimistic about the implementability of this compiler analysis.)

I've thought about that.  I don't know how often the current vector length will differ from the one needed for the registers that must be saved (and so require two vsetvli instructions).  With realistic latencies and high dispatch rates, software pipelining often overlaps two or more iterations - the more so when there's enough code in a loop that spilling and restoring is an issue.  In general, every iteration has a different vector length, so it's very common for the register you're spilling to have a different length than the one you're about to load.  And whole register operations cost the same as other memory operations in a wide SIMD machine.

If we had a whole register load instruction that carried the expected element size, I think it would help.  Now that I'm thinking about it, I think a store of that sort is useless.  All sizes have the same effect for a store.  :-)
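Concretely, the proposal amounts to something like the following - the sized load mnemonics are hypothetical here, chosen only for illustration:

```asm
    vs1r.v    v8, (a0)      # one store suffices: every element size
                            # stores the same VLEN bits
    vl1re16.v v8, (a0)      # hypothetical: load VLEN bits, tag SEW=16
    vl1re32.v v8, (a0)      # hypothetical: load VLEN bits, tag SEW=32
```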

      Bill




Andrew Waterman
 



On Mon, Jun 15, 2020 at 9:55 PM Bill Huffman <huffman@...> wrote:



I've thought about that.  I don't know how often the length will be different currently than it is for the registers that need to be saved (and so need two vsetvli instructions).  With realistic latencies and high dispatch rates, software pipelining often overlaps two and more iterations - the more so when there's enough code in a loop that spilling and restoring is an issue.  In general, every iteration has a different vector length so it's very common for the register you're spilling to have a different length than the one you're going to load.  And whole register operations cost the same as other memory operations in a wide SIMD machine.

Yeah, the existing unit-stride loads and stores are probably an unsuitable solution for this problem on statically scheduled wide-issue machines with short chimes.

Is a microarchitectural solution out of the question?  It's obviously possible to predict the tag for stack reloads with high accuracy at low cost.  The repacking still needs to occur, but instead of the uop being inserted when the register is accessed, it can happen as a part of the load.  This approach reduces issue bandwidth and incorporates the latency into the load writeback, making it much easier to schedule.

I agree your proposed solution suffices, but I'm reluctant to spend opcode space on a problem we might be able to obviate or solve some other way.




Bill Huffman
 


On 6/15/20 11:14 PM, Andrew Waterman wrote:




Is a microarchitectural solution out of the question?  It's obviously possible to predict the tag for stack reloads with high accuracy at low cost.  The repacking still needs to occur, but instead of the uop being inserted when the register is accessed, it can happen as a part of the load.  This approach reduces issue bandwidth and incorporates the latency into the load writeback, making it much easier to schedule.

I've also thought about prediction.  If it works, it's just like a whole register load that always has the correct size.  The inserted cast will almost never be needed, and no cycles are added.  I think that prediction could be more accurate than having the size in the load instruction.  Or less.  I don't know enough about how well compilers can do at knowing the element widths in different circumstances.  Nor how often spills and restores are only, so to speak, one deep - where a predictor that remembered the most recent spill of any particular register, and used that size for the fill, might do extremely well.

For example, if a callee-save register were spilled and used without spilling again, such a mechanism would have the restore before return working correctly.  And perhaps if the compiler could be convinced to avoid spilling any register in a "nested" fashion, predicting the fill that way would always work.


I agree your proposed solution suffices, but I'm reluctant to spend opcode space on a problem we might be able to obviate or solve some other way.

It's not much opcode space.  There are 32 lumop codes and only three are used.  This wouldn't increase the number of used codes, just make the whole register load code apply to more widths - unless there are code combinations I'm not realizing are there.  And simpler implementations would be identical for every width.

Still, more is more.  And if a predictor is necessary anyway because of too many cases like callee-saved registers, then we don't want additional codes.

      Bill





Andrew Waterman
 



On Mon, Jun 15, 2020 at 11:38 PM Bill Huffman <huffman@...> wrote:



It's not much opcode space.  There are 32 lumop codes and only three are used.  This wouldn't increase the number of used codes, just make the whole register load code apply to more widths - unless there are code combinations I'm not realizing are there.  And simpler implementations would be identical for every width.

Yeah.  I'm thinking into the future where pressure to avoid spilling over into 48-/64-bit instruction encodings will use up more code points in the 32b load/store encoding space, reducing their orthogonality. 

Still, more is more.  And if a predictor is necessary anyway because of too many cases like callee-saved registers, then we don't want additional codes.

The callee-saved register case might not be a red herring, because of vector function calls for transcendentals etc.  Even though the standard C ABI must eschew callee-saved vector registers for compatibility reasons, these millicode routines will spill and fill temporaries.
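The millicode case might look like this sketch - the routine name, frame handling, and register choice are all assumptions for illustration:

```asm
# A vector exp() millicode routine.  It cannot know the caller's SEW,
# so under the current spec its only whole-register spill is 8-bit
# tagged, and the fill may trigger an inserted cast (or a prediction).
vexp_millicode:
    csrr   t0, vlenb        # VLEN/8, the spill-slot size in bytes
    sub    sp, sp, t0
    vs1r.v v31, (sp)        # spill a temporary
    # ... compute exp() using v31 as scratch ...
    vl1r.v v31, (sp)        # fill; tag is SEW=8 regardless of caller
    add    sp, sp, t0
    ret
```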




Bill Huffman
 

Hi Andrew,

I've been thinking about this some more.  It seems to me there's value in pursuing both element-sized whole register loads and predictors.  Taking the cases that seem to matter here, one at a time:

Intra-procedure spills:

  • Here the compiler should know the element size and can use it on the load.  If element sized whole register loads are available, predictors can be left to work for cases the compiler doesn't know about.
  • Not wanting to use the predictors for this case leads to a desire for two store types as well as a set of sized load types and an unknown-size load type: a store and a set of sized loads that work with no prediction, and a store and load pair that it makes sense to predict.
  • We mentioned that using normal length loads and stores for intra-procedure spills is an issue with software pipelining and short chimes.
  • This case can be quite involved and I'm not convinced that predictors won't often have trouble in complex cases.
  • Caller-saved registers are similar from the compiler point of view, but not from the predictor point of view.  The prediction is likely to be ruined during the called function.

Callee-saved registers (where such things exist for vectors):

  • Here, there's no way for the compiler to know what size to use in a library.  So prediction is better.
  • If the store/load pair that depends on prediction are different instructions from the ones where the compiler knows, the prediction will work better.  Same reason we use an absolute jump and not BEQ x0,x0 at the end of an "if" to branch around the "then."

Context switch:

  • It would be nice to solve, but it doesn't cost enough in cycles to be worth adding architectural state.

In the end, I see three categories for whole register load/store:

  1. Intra-procedure spills, which need a single store and a set of element-sized loads that the compiler can use.
  2. A separate store and load pair, which is used when the compiler doesn't know the size.  These can be predicted at some percentage of correctness.
  3. The context switch case which can use anything and hardware will spend cycles fixing it up later because it's not worth adding architectural state.

I'd like to see it possible to cover #1 and #2.  I think they will help not only in-order machines but OoO vector machines as well.  The penalty for the OoO machines won't be as high, but they will still insert a uop and it will still cost time.

You've expressed concern for size in a 48-/64-bit instruction encoding.  But the entire set of stride-1 cases is much smaller than the strided and indexed ones.  The whole register cases are even smaller, as they don't use the nf field.  To compare to indexed:

  • The sized whole register loads have 5 bits for an address register, 5 bits for a vector register, and 3 bits for a size - total 13 bits.
  • Indexed loads have 5 bits for an address register, 5 bits for an index register, 5 bits for a vector result register, 3 bits for a size, and 3 bits for a segment number - total 21 bits (and indexed stores have 22 bits because of ordered versus unordered).

So, I'm having a hard time thinking this will matter much even in the 48-/64-bit encoding.  And simple machines can implement both stores and all loads exactly alike.

      Bill




Krste Asanovic
 

If SLEN=VLEN layout is in force, then whole vector register
loads/stores don't need to be specified as using SEW=8.  They can use
the current SEW from vtype - this will reduce, though not eliminate,
incorrect microarchitectural SEW tag settings.  They still use VLEN as
the number of bits saved for each register.

For callee-saved under this "whole-register moves imply vtype.sew tag"
model, the paradigm would be to restore vtype, including sew, before
restoring the callee-saved registers.

I still understand the desire to include expected sew tag in load
instructions, but am trying to find a solution that doesn't require a
microarchitectural hint.
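Under that model, the save/restore paradigm might be sketched as follows (assuming vl and vtype are saved alongside the registers):

```asm
    # Save path: capture vtype and vl with the callee-saved registers.
    csrr   t0, vtype
    csrr   t1, vl
    # ... store t0, t1, and the vector registers ...

    # Restore path: reinstate vtype (including SEW) first, so the
    # whole-register loads can tag with the current vtype.sew.
    vsetvl x0, t1, t0
    vl1r.v v8, (a0)
```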

Krste


On Thu, 18 Jun 2020 02:58:21 +0000, "Bill Huffman" <huffman@cadence.com> said:
| Hi Andrew,
| I've been thinking about this some more. It seems to me there's value in pursuing both element sized whole vector loads as well as
| predictors. Taking the cases that seem to matter here, one at a time:

| Intra-procedure spills:

| • Here the compiler should know the element size and can use it on the load. If element sized whole register loads are available,
| predictors can be left to work for cases the compiler doesn't know about.
| • Not wanting to use the predictors for this case leads to a desire for two store types as well as a set of sized load types and a
| unknown size load type. A store and a set of sized loads that work with no prediction and a store and load pair that it makes
| sense to predict.
| • We mentioned that using normal length loads and stores for intra-preciedure spills is an issue with software pipelining and
| short chimes.
| • This case can be quite involved and I'm not convinced that predictors won't often have trouble in complex cases.
| • Caller-saved registers are similar from the compiler point of view, but not from the predictor point of view. The prediction is
| likely to be ruined during the called function

| Callee-saved registers (where such things exist for vectors):

| • Here, there's no way for the compiler to know what size to use in a library. So prediction is better.
| • If the store/load pair that depends on prediction are different instructions from the ones where the compiler knows, the
| prediction will work better. Same reason we use an absolute jump and not BEQ x0,x0 at the end of an "if" to branch around the
| "then."

| Context switch:

| • It would be nice to solve, but it doesn't cost enough in cycles to be worth adding architectural state.

| In the end, I see three categories for whole register load/store:

| 1. Intra-procedure spills, which need a single store and a set of loads per element size which the compiler can use.
| 2. A separate store and load pair, which is used when the compiler doesn't know the size. These can be predicted at some
| percentage of correctness.
| 3. The context switch case which can use anything and hardware will spend cycles fixing it up later because it's not worth adding
| architectural state.

| I'd like to see it possible to cover both #1 and #2. I think they will help not only in-order machines but OoO vector machines as well.
| The penalty for the OoO machines won't be as high, but they will still insert a uop and it will still cost time.

| You've expressed concern about size in a 48-/64-bit instruction encoding. But the entire set of stride-1 cases is much
| smaller than the strided and indexed ones, and the whole-register cases are smaller still, as they don't use the nf field. To compare to
| indexed:

| • The sized whole register loads have 5-bits for an address register, 5-bits for vector register, and 3-bits for a size - total 13
| bits.
| • Indexed loads have 5-bits for an address register, 5-bits for an index register, 5-bits for a vector result register, 3-bits for
| a size, 3-bits for segment number - total 21 bits (and indexed stores have 22 bits because of ordered or not ordered).
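| A quick back-of-envelope script can check the tally above (the field widths are the ones quoted in the message itself, not drawn from the spec's actual encoding tables):

```python
# Back-of-envelope tally of the operand-field bit counts cited above.
# These are just the field widths quoted in the message, not a full
# instruction encoding.
sized_whole_reg_load = 5 + 5 + 3     # address reg + vector reg + size
indexed_load = 5 + 5 + 5 + 3 + 3     # address + index + result regs + size + segment
indexed_store = indexed_load + 1     # one extra bit: ordered vs. unordered

assert sized_whole_reg_load == 13
assert indexed_load == 21
assert indexed_store == 22
```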

| So, I'm having a hard time thinking this will matter much even in the 48-/64-bit encoding. And simple machines can implement both
| stores and all loads exactly alike.

| Bill

| On 6/15/20 11:56 PM, Andrew Waterman wrote:

| EXTERNAL MAIL

| On Mon, Jun 15, 2020 at 11:38 PM Bill Huffman <huffman@cadence.com> wrote:

| On 6/15/20 11:14 PM, Andrew Waterman wrote:


| On Mon, Jun 15, 2020 at 9:55 PM Bill Huffman <huffman@cadence.com> wrote:

| On 6/15/20 6:54 PM, Andrew Waterman wrote:


| I guess the main use case is intra-procedure spills, since callee-save code and context-switch code can’t know
| the previous tag?

| Intra-procedure spills were my first concern. I assume "callee-save code" means the code that saves callee-save
| registers before using them and restores them after. If so, they may be helped in an IPA context, I guess, so they're a
| reason as well, though not as strong. Context-switch code can't know, but it is much less important.

| I'm thinking about wide SIMD and multiple instructions per cycle. We currently complete several instructions per
| cycle and more than one memory transfer per cycle. Each SIMD instruction is 512 bits wide or more. SIMD will
| probably get wider, and the number of instructions per cycle will probably increase. And then an inserted cast
| instruction will take multiple cycles because of the physical arrangement. So the cost of inserting a cast
| instruction in hardware (in an in-order machine) is many times the cost of having one put in by the compiler.

| If the main use case is intra-procedure spills and caller-saved registers, is it implausible that the regular
| unit-stride loads and stores could be used instead? In many of these cases, it’s safe to use the current VL
| rather than saving the whole register. So the concern about extra setvls can often be avoided, and furthermore,
| using the runtime VL will sometimes reduce memory traffic versus loading and storing all VLEN bits. (Perhaps I’m
| being too optimistic about the implementability of this compiler analysis.)

| I've thought about that. I don't know how often the current length will differ from the one needed for the registers
| that must be saved (and so need two vsetvli instructions). With realistic latencies and high dispatch rates,
| software pipelining often overlaps two or more iterations - the more so when there's enough code in a loop that
| spilling and restoring is an issue. In general, every iteration has a different vector length, so it's very common
| for the register you're spilling to have a different length than the one you're going to load. And whole-register
| operations cost the same as other memory operations in a wide SIMD machine.

| Yeah, the existing unit-stride loads and stores are probably an unsuitable solution for this problem on statically
| scheduled wide-issue machines with short chimes.

| Is a microarchitectural solution out of the question? It's obviously possible to predict the tag for stack reloads with
| high accuracy at low cost. The repacking still needs to occur, but instead of the uop being inserted when the register
| is accessed, it can happen as a part of the load. This approach reduces issue bandwidth and incorporates the latency
| into the load writeback, making it much easier to schedule.

| I've also thought about prediction. If it works, it's just like a whole-register load always of the correct size. The
| inserted cast will almost never be needed, and no cycles are added. I think that prediction could be more accurate than
| encoding the size in the load instruction. Or less. I don't know enough about how well compilers can do at knowing the
| element widths in different circumstances. Or how often spills and restores can be only, so to speak, one deep - where a
| predictor that remembered the most recent spill of any particular register and used that size for the fill might do
| extremely well.
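| The one-deep predictor described here can be written out concretely. This is an illustrative model only - the class and method names are invented for the example, and EEW values are element widths in bits:

```python
# Hypothetical sketch of a per-register "last spill" EEW predictor.
# On a whole-register spill, record the register's current EEW tag;
# on a fill, predict that the reloaded data has the recorded EEW.
class LastSpillPredictor:
    def __init__(self, num_vregs=32):
        self.last_eew = [None] * num_vregs  # EEW in bits, or None if unseen

    def on_spill(self, vreg, eew):
        self.last_eew[vreg] = eew

    def predict_fill(self, vreg, default_eew=8):
        eew = self.last_eew[vreg]
        return eew if eew is not None else default_eew

p = LastSpillPredictor()
p.on_spill(3, 32)                 # spill v3 while it holds 32-bit elements
assert p.predict_fill(3) == 32    # non-nested spill/fill: prediction correct
assert p.predict_fill(7) == 8     # never-spilled register: fall back to default
```

| As noted, a one-deep structure like this mispredicts exactly when the same register is spilled at two different widths in a nested fashion.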

| For example, if a callee-save register were spilled and used without spilling again, such a mechanism would have the restore
| before return working correctly. And perhaps if the compiler could be convinced to avoid spilling any register in a
| "nested" fashion, predicting the fill that way would always work.

| I agree your proposed solution suffices, but I'm reluctant to spend opcode space on a problem we might be able to
| obviate or solve some other way.

| It's not much opcode space. There are 32 lumop codes and only three are used. This wouldn't increase the number of used
| codes, just make the whole register load code apply to more widths - unless there are code combinations I'm not realizing
| are there. And simpler implementations would be identical for every width.

| Yeah. I'm thinking into the future where pressure to avoid spilling over into 48-/64-bit instruction encodings will use up more
| code points in the 32b load/store encoding space, reducing their orthogonality.

| Still, more is more. And if a predictor is necessary anyway because of too many cases like callee-saved registers, then we
| don't want additional codes.

| The callee-saved register case might not be a red herring, because of vector function calls for transcendentals etc. Even
| though the standard C ABI must eschew callee-saved vector registers for compatibility reasons, these millicode routines will
| spill and fill temporaries.

| Bill

| If we had a load instruction that was capable of the whole register load with the desired size expectation, I think
| it would help. Now that I'm thinking about it, I think a store of that sort is useless. All sizes have the same
| effect for store. :-)

| Bill

| On Mon, Jun 15, 2020 at 6:13 PM Bill Huffman <huffman@cadence.com> wrote:

| The whole register loads and stores in section 7.9 of the spec are
| currently specified as having an element size of 8-bits. Could they be
| extended to cover all sizes instead of just the 8-bit size? It looks
| like the encoding space is there.

| The different sizes would do the same thing functionally, but they allow
| software to avoid the requirement for hardware to insert a cast
| operation in many circumstances by defining a byte arrangement and
| setting the corresponding tag for element size.

| Bill

|


Krste Asanovic
 

The more I think through the options, the more I'm convinced we have
to support SLEN=VLEN, at least as an extension if not in all cases,
primarily for software.

Working through the design challenges of SLEN=VLEN:

For narrower datapaths (<=128b), this is the simplest solution.

For wider datapaths (>128b), the implementation complexity is for the
widening/narrowing operations.

The simple brute-force approach is to store bits in registers in
memory order and deal with cross-datapath communication on
widening/narrowing operations. The communication is always
constrained to be within a single-cycle element group (= the bits
accessed in one beat of the vector functional unit), and this
communication can be scheduled into the pipeline design statically, as it
is independent of microarchitectural state. Single-width operations don't
activate any long wires.

The sophisticated approach is to store elements in a
communication-optimized internal format (e.g., variants of our
SLEN<VLEN layouts) and tag each element group with its last-written
EEW. When instructions (both single-width and widening/narrowing)
access registers with the correct EEW, there is no cross-lane
communication. If an instruction tries to read an element group with
the wrong EEW, a dynamic microop is inserted to rearrange the element
group, using a separate permute network that crosses the datapath, and
write back the permuted pattern with the new EEW (except for stores, which
can always write any EEW out in memory order). Some instructions have 3
source operands, so in the worst case all three might need to be
reformatted.
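
The tag-check behavior of this scheme can be modeled abstractly. Everything below is illustrative - the function names and the dict-based register file are invented, and the repack stand-in does not model the actual byte permutation:

```python
# Illustrative model of an EEW-tagged register read. Each register holds
# raw bytes plus the EEW (in bits) it was last written with. A read at a
# different EEW inserts a repack step; here the repack is a placeholder
# that re-tags the bytes, standing in for the permute-network microop.
def repack(raw_bytes, from_eew, to_eew):
    # Stand-in for the cross-datapath permute; real hardware would
    # rearrange bytes between the layouts for from_eew and to_eew.
    return raw_bytes

def read_operand(regfile, vreg, eew, uops):
    raw, tag = regfile[vreg]
    if tag != eew:
        uops.append(("repack", vreg, tag, eew))  # dynamic microop inserted
        raw = repack(raw, tag, eew)
        regfile[vreg] = (raw, eew)               # write back with new tag
    return raw

regfile = {1: (bytes(16), 8)}   # v1 last written with EEW=8
uops = []
read_operand(regfile, 1, 8, uops)
assert uops == []                         # matching EEW: no extra uop
read_operand(regfile, 1, 32, uops)
assert uops == [("repack", 1, 8, 32)]     # mismatch: one repack uop
assert regfile[1][1] == 32                # tag updated after repack
```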

The worry for the sophisticated scheme (apart from complexity) is
whether there are common cases where values are written with one EEW and
read with another.

1) Interrupt/OS save/restore

The save/restore code does not know the EEW of each register, so there
must be some penalty for a load with incorrect EEW. Restoring with
vtype.SEW might capture some cases better than a fixed EEW in the load
instruction. The expected overall penalty is low compared to the other
cases below, as context-switch vector save/restore is a rarer operation.

2) Callee-save in vector millicode routines

The vector millicode routine will know the EEW of the argument types and
the return value, which are not themselves callee-save but which can
give a strong hint as to the EEW of the caller's registers (e.g., a
single-precision exp millicode routine can probably expect quite a few
EEW=32 registers in the caller). Library software can therefore try to
reduce the performance impact by simply assuming the expected EEW of the
caller. Millicode routines are passed mask and length information, and can
save only the needed length and use tail-undisturbed instructions to avoid
having to save/restore the whole callee-saved register.

3) Spill code inside loop

This is the most problematic case. I wonder how often the
compiler does not know the type and length of the values to be
restored. I agree adding EEW to the whole-register move could help
here, and it doesn't add complexity to simpler implementations, which
can ignore it.



Separately, I'm also wondering if the whole-register-move instructions
actually make sense for case 2 and 3, as some temporal vector machines
might have registers that are 64+ beats deep, more with LMUL, making a
whole-register store/load very expensive in these cases when AVL is
actually shorter.
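
A rough cost comparison makes this concern concrete. The parameter values below are invented for illustration, not drawn from any real design:

```python
# Rough cost model for the temporal-machine concern above.
# Parameter values are illustrative only.
def beats_whole_register(vlen, dlen, lmul=1):
    # A whole-register move always touches all VLEN*LMUL bits,
    # regardless of the active vector length.
    return (vlen * lmul) // dlen

def beats_avl(avl, sew, dlen):
    # A regular unit-stride access touches only AVL elements of SEW bits.
    bits = avl * sew
    return max(1, (bits + dlen - 1) // dlen)

# Example: VLEN=4096 bits on a 64-bit datapath is 64 beats per whole
# register, while restoring only AVL=16 elements of SEW=32 takes 8 beats.
assert beats_whole_register(4096, 64) == 64
assert beats_avl(16, 32, 64) == 8
```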

Case 1 does not really need the whole register move instructions, as
regular load/stores are fine.

Krste


On Sun, 21 Jun 2020 23:55:32 -0700, "Krste Asanovic via lists.riscv.org" <krste=berkeley.edu@lists.riscv.org> said:
| If SLEN=VLEN layout is in force, then whole vector register
| load/stores don't need to be specified as using SEW=8. They can use
| current SEW from vtype - this will reduce, though not eliminate,
| incorrect microarchitectural SEW tag settings. They still use VLEN as
| number of bits saved for each register.

| For callee-saved under this "whole-register moves imply vtype.sew tag"
| model, the paradigm would be to restore vtype, including sew, before
| restoring the callee-saved registers.

| I still understand the desire to include expected sew tag in load
| instructions, but am trying to find a solution that doesn't require a
| microarchitectural hint.

| Krste




Bill Huffman
 

On 6/21/20 11:55 PM, krste@berkeley.edu wrote:



If SLEN=VLEN layout is in force, then whole vector register
load/stores don't need to be specified as using SEW=8. They can use
current SEW from vtype - this will reduce, though not eliminate,
incorrect microarchitectural SEW tag settings. They still use VLEN as
number of bits saved for each register.
I've thought about this as a solution and I don't believe it is enough.
This will require an extra pair of vsetvli instructions around many
restores of a spilled register. I think that's too costly.


For callee-saved under this "whole-register moves imply vtype.sew tag"
model, the paradigm would be to restore vtype, including sew, before
restoring the callee-saved registers.
That means the compiler can't intermix the restores with the final
operations of the function. Or it has to add two vsetvli instructions
for each one. That seems poor to me.


I still understand the desire to include expected sew tag in load
instructions, but am trying to find a solution that doesn't require a
microarchitectural hint.

Krste
Well, if you can think of one, great. But I don't think either of the
above is usable.

Why is there an issue putting the size in the whole register load
instructions? Seems trivial to me. Especially if it enables removing
the SLEN parameter entirely from the spec.

Bill



On Thu, 18 Jun 2020 02:58:21 +0000, "Bill Huffman" <huffman@cadence.com> said:
| Hi Andrew,
| I've been thinking about this some more. It seems to me there's value in pursuing both element sized whole vector loads as well as
| predictors. Taking the cases that seem to matter here, one at a time:

| Intra-procedure spills:

| • Here the compiler should know the element size and can use it on the load. If element sized whole register loads are available,
| predictors can be left to work for cases the compiler doesn't know about.
| • Not wanting to use the predictors for this case leads to a desire for two store types as well as a set of sized load types and a
| unknown size load type. A store and a set of sized loads that work with no prediction and a store and load pair that it makes
| sense to predict.
| • We mentioned that using normal length loads and stores for intra-preciedure spills is an issue with software pipelining and
| short chimes.
| • This case can be quite involved and I'm not convinced that predictors won't often have trouble in complex cases.
| • Caller-saved registers are similar from the compiler point of view, but not from the predictor point of view. The prediction is
| likely to be ruined during the called function

| Callee-saved registers (where such things exist for vectors):

| • Here, there's no way for the compiler to know what size to use in a library. So prediction is better.
| • If the store/load pair that depends on prediction are different instructions from the ones where the compiler knows, the
| prediction will work better. Same reason we use an absolute jump and not BEQ x0,x0 at the end of an "if" to branch around the
| "then."

| Context switch:

| • It would be nice to solve, but it doesn't cost enough in cycles to be worth adding architectural state.

| In the end, I see three categories for whole register load/store:

| 1. Intra-procedure spills, which need a single store and a set of loads per element size which the compiler can use.
| 2. A separate store and load pair, which is used when the compiler doesn't know the size. These can be predicted at some
| percentage of correctness.
| 3. The context switch case which can use anything and hardware will spend cycles fixing it up later because it's not worth adding
| architectural state.

| I'd like to see #1 and #2 possible to cover. I think they will help not only in-order machines but OoO vector machines as well.
| The penalty for the OoO machines won't be as high, but they will still insert a uop and it will still cost time.

| You've expressed concern for size in a 48-/64-bit instruction encoding. But given that the entire set of stride-1 cases is much
| smaller than strided and indexed. The whole register cases, are even smaller as they don't use the nf field. To compare to
| indexed:

| • The sized whole register loads have 5-bits for an address register, 5-bits for vector register, and 3-bits for a size - total 13
| bits.
| • Indexed loads have 5-bits for an address register, 5-bits for an index register, 5-bits for a vector result register, 3-bits for
| a size, 3-bits for segment number - total 21 bits (and indexed stores have 22 bits because of ordered or not ordered).

| So, I'm having a hard time thinking this will matter much even in the 48-/64-bit encoding. And simple machines can implement both
| stores and all loads exactly alike.

| Bill

| On 6/15/20 11:56 PM, Andrew Waterman wrote:

| EXTERNAL MAIL

| On Mon, Jun 15, 2020 at 11:38 PM Bill Huffman <huffman@cadence.com> wrote:

| On 6/15/20 11:14 PM, Andrew Waterman wrote:

| EXTERNAL MAIL

| On Mon, Jun 15, 2020 at 9:55 PM Bill Huffman <huffman@cadence.com> wrote:

| On 6/15/20 6:54 PM, Andrew Waterman wrote:

| EXTERNAL MAIL

| I guess the main use case is intra-procedure spills, since callee-save code and context-switch code can’t know
| the previous tag?

| Intra-procedure spills were my first concern. I assume "callee-save code" means the code that saves callee-save
| registers before using them and restores after. If so, they may be helped in an ipa context, I guess, so they're a
| reason as well, though not as strong. Context-switch code can't know, but is much less important.

| I'm thinking about wide SIMD and multiple instructions per cycle. We currently complete several instructions per
| cycle and more than one memory transfer per cycle. Each SIMD instruction is 512-bits wide or more. SIMD will
| probably get wider and the number of instructions per cycle will probably increase. And then, an inserted cast
| instruction will take multiple cycles because of the physical arrangement. So the cost of inserting a cast
| instruction in hardware (in an in-order machine) is many times the cost of having one put in by the compiler.

| If the main use case is intra-procedure spills and caller-saved registers, is it implausible that the regular
| unit-stride loads and stores could be used instead? In many of these cases, it’s safe to use the current VL
| rather than saving the whole register. So the concern about extra setvls can often be avoided, and furthermore,
| using the runtime VL will sometimes reduce memory traffic versus loading and storing all VLEN bits. (Perhaps I’m
| being too optimistic about the implementability of this compiler analysis.)

| I've thought about that. I don't know how often the length will be different currently than it is for the registers
| that need to be saved (and so need two vsetvli instructions). With realistic latencies and high dispatch rates,
| software pipelining often overlaps two and more iterations - the more so when there's enough code in a loop that
| spilling and restoring is an issue. In general, every iteration has a different vector length so it's very common
| for the register you're spilling to have a different length than the one you're going to load. And whole register
| operations cost the same as other memory operations in a wide SIMD machine.

| Yeah, the existing unit-stride loads and stores are probably an unsuitable solution for this problem on statically
| scheduled wide-issue machines with short chimes.

| Is a microarchitectural solution out of the question? It's obviously possible to predict the tag for stack reloads with
| high accuracy at low cost. The repacking still needs to occur, but instead of the uop being inserted when the register
| is accessed, it can happen as a part of the load. This approach reduces issue bandwidth and incorporates the latency
| into the load writeback, making it much easier to schedule.

| I've also thought about prediction. If it works, it's just like a whole register load always of the correct size. The
| inserted cast will almost never be needed. And no cycles are added. I think that prediction could be more accurate than
| having the amount in the load instruction. Or less. I don't know enough about how well compilers can do at knowing the
| element widths in different circumstances. Or how often spills and restores can be only, so to speak, one deep - where a
| predictor that remembered the most recent spill of any particular register and used that size for the fill might do
| extremely well.

| For example, if a callee-save register were spilled and used without spilling again, such a mechanism would have the restore
| before return working correctly. And perhaps if the compiler could be convinced to avoid spilling any register in a
| "nested" fashion, predicting the fill that way would always work.

| I agree your proposed solution suffices, but I'm reluctant to spend opcode space on a problem we might be able to
| obviate or solve some other way.

| It's not much opcode space. There are 32 lumop codes and only three are used. This wouldn't increase the number of used
| codes, just make the whole register load code apply to more widths - unless there are code combinations I'm not realizing
| are there. And simpler implementations would be identical for every width.

| Yeah. I'm thinking into the future where pressure to avoid spilling over into 48-/64-bit instruction encodings will use up more
| code points in the 32b load/store encoding space, reducing their orthogonality.

| Still, more is more. And if a predictor is necessary anyway because of too many cases like callee-saved registers, then we
| don't want additional codes.

| The callee-saved register case might not be a red herring, because of vector function calls for transcendentals etc. Even
| though the standard C ABI must eschew callee-saved vector registers for compatibility reasons, these millicode routines will
| spill and fill temporaries.

| Bill

| If we had a load instruction that was capable of the whole register load with the desired size expectation, I think
| it would help. Now that I'm thinking about it, I think a store of that sort is useless. All sizes have the same
| effect for store. :-)

| Bill

| On Mon, Jun 15, 2020 at 6:13 PM Bill Huffman <huffman@cadence.com> wrote:

| The whole register loads and stores in section 7.9 of the spec are
| currently specified as having an element size of 8-bits. Could they be
| extended to cover all sizes instead of just the 8-bit size? It looks
| like the encoding space is there.

| The different sizes would do the same thing functionally, but they allow
| software to avoid the requirement for hardware to insert a cast
| operation in many circumstances by defining a byte arrangement and
| setting the corresponding tag for element size.

| Bill



Krste Asanovic
 

On Jun 22, 2020, at 4:56 PM, Bill Huffman <huffman@cadence.com> wrote:

On 6/21/20 11:55 PM, krste@berkeley.edu wrote:

I still understand the desire to include expected sew tag in load
instructions, but am trying to find a solution that doesn't require a
microarchitectural hint.

Krste
Well, if you can think of one, great. But I don't think either of the
above is usable.

Why is there an issue putting the size in the whole register load
instructions? Seems trivial to me. Especially if it enables removing
the SLEN parameter entirely from the spec.

Bill
I agree it’s difficult to find an alternative, and I am OK with having this as an architected hint.

Dropping SLEN completely is a major win.

Krste


Bill Huffman
 

On 6/22/20 5:26 PM, Krste Asanovic wrote:
I agree it’s difficult to find an alternative, and I am OK with having this as an architected hint.

Dropping SLEN completely is a major win.

Krste
It's a lesser issue, as you said, but the millicode case might want a
single store and single load of whole registers that expect prediction,
in addition to the single whole register store and hinted whole
register load that don't expect prediction.

That may depend on how much use of millicode routines is expected - or
at least millicode routines that need to use callee-saved registers.

Bill






Kito Cheng
 

Hi

3) Spill code inside loop

This is the most problematic case. I wonder about how often the
compiler does not know the type and length of the values to be
restored? I agree adding EEW to the whole-register move could help
here, and doesn't add complexity to simpler implementations which can
ignore it.
A few points from a compiler developer's view: we've implemented spill
code generation with whole register loads/stores in GCC.

The compiler (GCC) knows the type when spilling a register, but the
length (AVL) is unknown when generating spill code. I've also confirmed
with the LLVM folks that the situation is the same there. The compiler
also knows the whole register move/load/store won't use vtype and vl,
so no extra vsetvl[i] is generated around spill code.

I think EEW=8 for whole register loads/stores doesn't matter to the
compiler, since the compiler only cares that the content can be saved
and restored, so I'm not sure of the use for (another) EEW-tagged
whole-register move/load/store on the compiler side.

My only concern is debugging scenarios: when a value has been spilled
to memory and the debugger wants to print its contents from memory, the
debugger might not know how to interpret the bytes if VLEN != SLEN. The
solution I can imagine is to load them into a vector register and then
set vtype correctly.
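[As a rough illustration of this concern — using a simplified SLEN-style striping invented for this sketch, NOT the exact draft-spec byte mapping:]

```python
# Invented SLEN-style striping for illustration only: with SLEN < VLEN,
# element i lands in datapath section (i mod number_of_sections), so the
# raw register bytes written to memory by an EEW=8 whole-register store
# are not in element order, and a debugger cannot read the spilled image
# back as a sequence of elements directly.

def register_bytes(elements, vlen_b, slen_b, sew_b):
    """Lay SEW-byte elements into a VLEN-byte register striped across
    SLEN-byte sections, and return the raw register byte image."""
    sections = vlen_b // slen_b
    reg = bytearray(vlen_b)
    for i, el in enumerate(elements):
        sec, slot = i % sections, i // sections
        off = sec * slen_b + slot * sew_b
        reg[off:off + sew_b] = el
    return bytes(reg)

# Eight 4-byte elements; VLEN = 32 bytes.
els = [bytes([i] * 4) for i in range(8)]
striped = register_bytes(els, 32, 16, 4)  # SLEN=16: bytes are interleaved
flat = register_bytes(els, 32, 32, 4)     # SLEN=VLEN: plain element order
assert flat == b"".join(els)              # debugger can read this directly
assert striped != flat                    # this spilled image needs decoding
```

[With SLEN=VLEN the register image and the in-memory element order coincide, which is why fixing on that format makes the spilled bytes directly interpretable.]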


Krste Asanovic
 

On Tue, 23 Jun 2020 11:22:44 +0800, Kito Cheng <kito.cheng@sifive.com> said:
| Hi

Hi Kito,

|| 3) Spill code inside loop
||
|| This is the most problematic case. I wonder about how often the
|| compiler does not know the type and length of the values to be
|| restored? I agree adding EEW to the whole-register move could help
|| here, and doesn't add complexity to simpler implementations which can
|| ignore it.

| A few points from a compiler developer's view: we've implemented spill
| code generation with whole register loads/stores in GCC.

| The compiler (GCC) knows the type when spilling a register, but the
| length (AVL) is unknown when generating spill code. I've also confirmed
| with the LLVM folks that the situation is the same there. The compiler
| also knows the whole register move/load/store won't use vtype and vl,
| so no extra vsetvl[i] is generated around spill code.

| I think EEW=8 for whole register loads/stores doesn't matter to the
| compiler, since the compiler only cares that the content can be saved
| and restored, so I'm not sure of the use for (another) EEW-tagged
| whole-register move/load/store on the compiler side.

The EEW in the whole register load is a hint to aid wide
microarchitectures in organizing data internally. If the compiler can
put the correct EEW on the whole register load, then the microarch can
avoid an internal rearrangement on the first use with different EEW.
It is only a hint, which doesn't change functional behavior but could
affect performance. If the compiler (or other software) doesn't know
the value then everything still works, just possibly with an internal
performance hiccup.
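[A toy model of these hint semantics (class and method names are illustrative, not from the spec): the hint never changes the architectural value, only whether the first use pays an internal rearrangement.]

```python
# The EEW on a whole-register load is only a hint: it never changes the
# architectural value, but a mismatch with the first use's EEW forces the
# microarchitecture to rearrange data internally.

class VReg:
    def __init__(self):
        self.data = b""
        self.layout_eew = 8      # internal arrangement tag
        self.rearrangements = 0  # performance counter, not architectural state

    def whole_reg_load(self, mem, eew_hint=8):
        self.data = mem
        self.layout_eew = eew_hint  # the hint picks the internal arrangement

    def use(self, eew):
        # First use with a different EEW pays an internal cast; the value
        # observed by software is the same either way.
        if eew != self.layout_eew:
            self.rearrangements += 1
            self.layout_eew = eew
        return self.data

mem = bytes(range(16))
a, b = VReg(), VReg()
a.whole_reg_load(mem, eew_hint=32)  # compiler knew the type: hint matches use
b.whole_reg_load(mem)               # hint unknown: default EEW=8
assert a.use(32) == b.use(32)       # identical architectural result
assert (a.rearrangements, b.rearrangements) == (0, 1)  # only perf differs
```

[This is why the hint is safe for simple implementations to ignore entirely: dropping it only removes the chance to skip the rearrangement.]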

| My only concern is debugging scenarios: when a value has been spilled
| to memory and the debugger wants to print its contents from memory, the
| debugger might not know how to interpret the bytes if VLEN != SLEN. The
| solution I can imagine is to load them into a vector register and then
| set vtype correctly.

I think the debugger issue is yet another reason to fix on the
SLEN=VLEN format.

Krste
