Whole Register Loads and Stores
Bill Huffman
The whole register loads and stores in section 7.9 of the spec are currently specified as having an element size of 8 bits. Could they be extended to cover all sizes instead of just the 8-bit size? It looks like the encoding space is there.

The different sizes would do the same thing functionally, but they would allow software to avoid the requirement for hardware to insert a cast operation in many circumstances, by defining a byte arrangement and setting the corresponding tag for the element size.

Bill
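A rough sketch of what this could look like at the assembly level. The vs1r.v/vl1r.v forms follow the draft's whole-register mnemonics; the width-annotated fills are hypothetical here, since only the 8-bit form exists in the current draft:

    # Today: whole-register spill/fill, always treated as EEW=8
    vs1r.v    v8, (a0)       # store all VLEN bits of v8
    vl1r.v    v8, (a0)       # reload; microarch tags v8 as 8-bit elements

    # Proposed: width-annotated fills so the tag matches the data's real EEW
    vl1re16.v v8, (a0)       # hypothetical: reload and tag as 16-bit elements
    vl1re32.v v8, (a0)       # hypothetical: reload and tag as 32-bit elements
    vl1re64.v v8, (a0)       # hypothetical: reload and tag as 64-bit elements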
andrew@...
I guess the main use case is intra-procedure spills, since callee-save code and context-switch code can’t know the previous tag?

If the main use case is intra-procedure spills and caller-saved registers, is it implausible that the regular unit-stride loads and stores could be used instead? In many of these cases, it’s safe to use the current VL rather than saving the whole register. So the concern about extra setvls can often be avoided, and furthermore, using the runtime VL will sometimes reduce memory traffic versus loading and storing all VLEN bits. (Perhaps I’m being too optimistic about the implementability of this compiler analysis.)
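For concreteness, a sketch of the two spill idioms being compared, assuming the spilled value was produced at SEW=32 and that vl/vtype are unchanged between spill and fill (registers and widths are illustrative):

    # Whole-register spill/fill: moves all VLEN bits, ignores vl/vtype
    vs1r.v  v8, (a0)
    ...
    vl1r.v  v8, (a0)

    # Unit-stride spill/fill at the current vl: may move fewer bytes and
    # keeps the EEW=32 "shape", but depends on vl/vtype still being right
    vse32.v v8, (a0)
    ...
    vle32.v v8, (a0)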
Bill Huffman
On 6/15/20 6:54 PM, Andrew Waterman wrote:
Intra-procedure spills were my first concern. I assume "callee-save code" means the code that saves callee-save registers before using them and restores them after. If so, that code may be helped in an inter-procedural analysis context, I guess, so it's a reason as well, though not as strong. Context-switch code can't know, but is much less important.

I'm thinking about wide SIMD and multiple instructions per cycle. We currently complete several instructions per cycle and more than one memory transfer per cycle, and each SIMD instruction is 512 bits wide or more. SIMD will probably get wider and the number of instructions per cycle will probably increase. An inserted cast instruction will then take multiple cycles because of the physical arrangement, so the cost of inserting a cast instruction in hardware (in an in-order machine) is many times the cost of having one put in by the compiler.

I've thought about that. I don't know how often the current vector length will differ from the length used by the registers that need to be saved (and so require two vsetvli instructions). With realistic latencies and high dispatch rates, software pipelining often overlaps two or more iterations - the more so when there's enough code in a loop that spilling and restoring is an issue. In general, every iteration has a different vector length, so it's very common for the register you're spilling to have a different length than the one you're going to load. And whole register operations cost the same as other memory operations in a wide SIMD machine.

If we had a load instruction that was capable of the whole register load with the desired size expectation, I think it would help. Now that I'm thinking about it, I think a store of that sort is useless - all sizes have the same effect for a store. :-)

Bill
andrew@...
On Mon, Jun 15, 2020 at 9:55 PM Bill Huffman <huffman@...> wrote:
Yeah, the existing unit-stride loads and stores are probably an unsuitable solution for this problem on statically scheduled wide-issue machines with short chimes. Is a microarchitectural solution out of the question? It's obviously possible to predict the tag for stack reloads with high accuracy at low cost. The repacking still needs to occur, but instead of the uop being inserted when the register is accessed, it can happen as a part of the load. This approach reduces issue bandwidth and incorporates the latency into the load writeback, making it much easier to schedule. I agree your proposed solution suffices, but I'm reluctant to spend opcode space on a problem we might be able to obviate or solve some other way.
Bill Huffman
On 6/15/20 11:14 PM, Andrew Waterman wrote:
I've also thought about prediction. If it works, it's just like a whole register load that is always of the correct size: the inserted cast will almost never be needed, and no cycles are added. I think prediction could be more accurate than carrying the width in the load instruction - or less accurate. I don't know enough about how well compilers can do at knowing the element widths in different circumstances, or how often spills and restores can be only, so to speak, one deep - where a predictor that remembered the most recent spill of any particular register and used that size for the fill might do extremely well. For example, if a callee-save register were spilled and used without spilling again, such a mechanism would have the restore before return working correctly. And perhaps if the compiler could be convinced to avoid spilling any register in a "nested" fashion, predicting the fill that way would always work.

It's not much opcode space. There are 32 lumop codes and only three are used. This wouldn't increase the number of used codes, just make the whole register load code apply to more widths - unless there are code combinations I'm not realizing are there. And simpler implementations would be identical for every width. Still, more is more. And if a predictor is necessary anyway because of too many cases like callee-saved registers, then we don't want additional codes.

Bill
andrew@...
On Mon, Jun 15, 2020 at 11:38 PM Bill Huffman <huffman@...> wrote:
Yeah. I'm thinking into the future where pressure to avoid spilling over into 48-/64-bit instruction encodings will use up more code points in the 32b load/store encoding space, reducing their orthogonality.
The callee-saved register case might not be a red herring, because of vector function calls for transcendentals etc. Even though the standard C ABI must eschew callee-saved vector registers for compatibility reasons, these millicode routines will spill and fill temporaries.
Bill Huffman
Hi Andrew,

I've been thinking about this some more. It seems to me there's value in pursuing both element-sized whole vector loads and predictors. Taking the cases that seem to matter here one at a time:

Intra-procedure spills:
• Here the compiler should know the element size and can use it on the load. If element-sized whole register loads are available, predictors can be left to work for cases the compiler doesn't know about.
• Not wanting to use predictors for this case leads to a desire for two store types as well as a set of sized load types and an unknown-size load type: a store and a set of sized loads that work with no prediction, and a store and load pair that it makes sense to predict.
• We mentioned that using normal-length loads and stores for intra-procedure spills is an issue with software pipelining and short chimes.
• This case can be quite involved, and I'm not convinced that predictors won't often have trouble in complex cases.
• Caller-saved registers are similar from the compiler point of view, but not from the predictor point of view. The prediction is likely to be ruined during the called function.

Callee-saved registers (where such things exist for vectors):
• Here, there's no way for the compiler to know what size to use in a library, so prediction is better.
• If the store/load pair that depends on prediction are different instructions from the ones where the compiler knows the size, the prediction will work better - the same reason we use an absolute jump and not BEQ x0,x0 at the end of an "if" to branch around the "then."

Context switch:
• It would be nice to solve, but it doesn't cost enough in cycles to be worth adding architectural state.

In the end, I see three categories for whole register load/store:
1. Intra-procedure spills, which need a single store and a set of loads per element size which the compiler can use.
2. A separate store and load pair, used when the compiler doesn't know the size. These can be predicted at some percentage of correctness.
3. The context switch case, which can use anything; hardware will spend cycles fixing it up later because it's not worth adding architectural state.

I'd like to see #1 and #2 covered. I think they will help not only in-order machines but OoO vector machines as well. The penalty for the OoO machines won't be as high, but they will still insert a uop and it will still cost time.

You've expressed concern about size in a 48-/64-bit instruction encoding. But the entire set of stride-1 cases is much smaller than the strided and indexed sets, and the whole register cases are smaller still because they don't use the nf field. To compare to indexed:
• The sized whole register loads have 5 bits for an address register, 5 bits for a vector register, and 3 bits for a size - 13 bits total.
• Indexed loads have 5 bits for an address register, 5 bits for an index register, 5 bits for a vector result register, 3 bits for a size, and 3 bits for a segment number - 21 bits total (and indexed stores have 22 bits because of ordered versus unordered).

So I'm having a hard time thinking this will matter much even in the 48-/64-bit encoding. And simple machines can implement both stores and all loads exactly alike.

Bill
If SLEN=VLEN layout is in force, then whole vector register loads/stores don't need to be specified as using SEW=8. They can use the current SEW from vtype - this will reduce, though not eliminate, incorrect microarchitectural SEW tag settings. They still use VLEN as the number of bits saved for each register.

For callee-saved registers under this "whole-register moves imply vtype.sew tag" model, the paradigm would be to restore vtype, including sew, before restoring the callee-saved registers (a sketch follows below).

I still understand the desire to include the expected sew tag in load instructions, but am trying to find a solution that doesn't require a microarchitectural hint.

Krste
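A sketch of that callee-saved paradigm under the "tag comes from vtype" model - restore the caller's vl/vtype first, then do the whole-register fills so they pick up the caller's SEW. Register choices and the save-area layout are illustrative; vl1r.v is the draft whole-register load:

    # Function epilogue restoring the caller's vector state
    ld      t0, 0(sp)          # caller's vl, saved by the prologue
    ld      t1, 8(sp)          # caller's vtype, saved by the prologue
    vsetvl  x0, t0, t1         # restore vtype (and vl) before the fills
    vl1r.v  v1, (a0)           # whole-register fills now imply the caller's SEW
    vl1r.v  v2, (a1)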
The more I think through the options, the more I'm convinced we have to support SLEN=VLEN, at least as an extension if not in all cases, primarily for software.

Working through the design challenges of SLEN=VLEN: for narrower datapaths (<=128b), this is the simplest solution. For wider datapaths (>128b), the implementation complexity is in the widening/narrowing operations.

The simple brute-force approach is to store bits in registers in memory order and deal with cross-datapath communication on widening/narrowing operations. The communication is always constrained to be within a single-cycle element group (= the bits accessed in one beat of the vector functional unit), and this communication can be scheduled into the pipeline design statically, as it is independent of microarchitectural state. Single-width operations don't activate any long wires.

The sophisticated approach is to store elements in a communication-optimized internal format (e.g., variants of our SLEN<VLEN layouts) and tag each element group with its last-written EEW. When instructions (whether single-width or widening/narrowing) access registers with the correct EEW, there is no cross-lane communication. If an instruction tries to read an element group with the wrong EEW, a dynamic microop is inserted to rearrange the element group, using a separate permute network that crosses the datapath, and to write back the permuted pattern with the new EEW (except for stores, which can always write any EEW out in memory order). Some instructions have three source operands, so in the worst case all three might need to be reformatted.

The worry for the sophisticated scheme (apart from complexity) is whether there are common cases where values are written with one EEW and read with another:

1) Interrupt/OS save/restore
The save/restore code does not know the EEW of each register, so there must be some penalty for a load with incorrect EEW. Restoring with vtype.SEW might capture some cases better than a fixed EEW in the load instruction. The expected overall penalty is low compared to the other cases below, as context-switch vector save/restore is a rarer operation.

2) Callee-save in vector millicode routines
The vector millicode routine will know the EEW of the argument types and the return value, which are not themselves callee-save but which can give a strong hint as to the EEW of the caller's registers (e.g., a single-precision exp millicode routine can probably expect quite a few EEW=32 registers in the caller). Library software can therefore try to reduce the performance impact by just using the expected EEW of the caller. Millicode routines are passed mask and length information, and can save only the needed length and use tail-undisturbed instructions to avoid having to save/restore the whole callee-saved register (see the sketch after this message).

3) Spill code inside a loop
This is the most problematic case. I wonder how often the compiler does not know the type and length of the values to be restored? I agree adding EEW to the whole-register move could help here, and it doesn't add complexity to simpler implementations, which can ignore it.

Separately, I'm also wondering whether the whole-register-move instructions actually make sense for cases 2 and 3, as some temporal vector machines might have registers that are 64+ beats deep (more with LMUL), making a whole-register store/load very expensive in these cases when AVL is actually shorter. Case 1 does not really need the whole register move instructions, as regular loads/stores are fine.
Krste
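One possible reading of the case-2 mitigation, sketched as a millicode prologue/epilogue that preserves a callee-saved v24 without touching all VLEN bits. It assumes the caller's length arrives in a0, that EEW=32 is the expected caller width, that the body only writes v24 with tail-undisturbed operations (so elements past vl never change), and that the tu/mu policy syntax and stack space are available:

    # Millicode prologue
    csrr    t0, vl
    csrr    t1, vtype
    vsetvli x0, a0, e32,m1,tu,mu   # caller's length, expected caller EEW
    vse32.v v24, (sp)              # save only the first vl elements
    ...                            # body: uses v24 with tail-undisturbed ops
    # Millicode epilogue
    vsetvli x0, a0, e32,m1,tu,mu   # re-establish the save length
    vle32.v v24, (sp)              # restore only the first vl elements
    vsetvl  x0, t0, t1             # restore caller's vl/vtype (assumes vl <= VLMAX)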
Bill Huffman
On 6/21/20 11:55 PM, krste@... wrote:
I've thought about this as a solution, and I don't believe it is enough. It will require an extra pair of vsetvli instructions around many restores of a spilled register, and I think that's too costly: either the compiler can't intermix the restores with the final operations of the function, or it has to add two vsetvli instructions for each one. That seems poor to me.

Well, if you can think of a solution that doesn't require a microarchitectural hint, great - but I don't think either of the above is usable. Why is there an issue with putting the size in the whole register load instructions? It seems trivial to me, especially if it enables removing the SLEN parameter entirely from the spec.

Bill
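To make the cost concrete, a sketch of an in-loop fill under the "SEW comes from vtype" model when the spilled value was produced at e16 but the loop is currently running at e32 (t2 holds the application vector length; all choices are illustrative):

    vsetvli x0, t2, e16,m1      # switch SEW so the fill is tagged correctly
    vl1r.v  v8, (a0)            # whole-register fill
    vsetvli x0, t2, e32,m1      # switch back to the loop's SEW

    # versus a single width-annotated fill, which needs no vtype dance:
    #   vl1re16.v v8, (a0)      # hypothetical EEW-carrying form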
On Jun 22, 2020, at 4:56 PM, Bill Huffman <huffman@...> wrote:

I agree it’s difficult to find an alternative, and I am OK with having this as an architected hint.

Dropping SLEN completely is a major win.

Krste
Bill Huffman
On 6/22/20 5:26 PM, Krste Asanovic wrote:
It's a lesser issue, as you said, but the millicode case might want a single store and single load of whole registers that expect prediction, in addition to the single whole register store and hinted whole register load that don't expect prediction.

That may depend on how much use of millicode routines is expected - or at least millicode routines that need to use callee-saved registers.

Bill
Kito Cheng
Hi
Regarding "3) Spill code inside loop" - some points from a compiler developer's view: we've implemented spill code generation with the whole register load/store on GCC. The compiler/GCC knows the type when spilling a register, but the length (AVL) is unknown when generating spill code (also confirmed with the LLVM folks - the situation is the same for them). The compiler also knows that the whole register move/load/store doesn't use vtype and vl, so it won't generate extra vsetvl[i] around spill code.

I think EEW=8 for the whole register load/store doesn't matter for the compiler, since the compiler only cares that the content can be saved and restored, so I am not sure about the usage of (another) EEW whole-register move/load/store on the compiler side.

The only concern is debugging scenarios: when a value is spilled into memory and the debugger wants to print its content from memory, the debugger might not know how to interpret it if VLEN != SLEN. The solution I can imagine is to load it into a vector register and then set vtype correctly.
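For reference, a sketch of the kind of spill/fill sequence described here, using the EEW=8 whole-register ops - no vsetvli is required because these instructions ignore vl and vtype (the frame offset and registers are illustrative):

    addi    t0, sp, 64          # spill slot for v8 (VLEN/8 bytes)
    vs1r.v  v8, (t0)            # spill: all VLEN bits, no vl/vtype dependence
    ...
    vl1r.v  v8, (t0)            # fill: likewise independent of vl/vtype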
On Tue, 23 Jun 2020 11:22:44 +0800, Kito Cheng <kito.cheng@...> said:

Hi Kito,

The EEW in the whole register load is a hint to aid wide microarchitectures in organizing data internally. If the compiler can put the correct EEW on the whole register load, then the microarchitecture can avoid an internal rearrangement on the first use with a different EEW. It is only a hint, which doesn't change functional behavior but could affect performance. If the compiler (or other software) doesn't know the right value, then everything still works, just possibly with an internal performance hiccup.

I think the debugger issue is yet another reason to fix on the SLEN=VLEN format.

Krste