Vector TG meeting minutes 2020/4/03
Date: 2020/4/03
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~15
Current issues on github: https://github.com/riscv/riscv-v-spec
Issues discussed: #354/362

The following issues were discussed.

Closing on version v0.9. A list of proposed changes to form v0.9 was presented. The main dispute was around dropping byte/halfword/word vector load/stores.

#354/362 Drop byte/halfword/word vector load/stores

Most of the meeting time was spent discussing this issue, which was contentious. Participants in favor of retaining these instructions were concerned about the code-size and performance impact of dropping them. Proponents of dropping them noted that the main impact was only on integer code (floating-point code does not benefit from these instructions), that performance might be lower using these instructions rather than widening, and that there was a large benefit in reducing memory-pipeline complexity. The group was going to consider some examples to be supplied by the members, including some mixed floating-point/integer code.

Discussion to continue on the mailing list.
|
On Sat, Apr 4, 2020 at 1:43 PM Krste Asanovic <krste@...> wrote:
#354/362 Drop byte/halfword/word vector load/stores

On GitHub, this is Issue #362. While I'm generally in favor of dropping them, I am aware it will pose a challenge to several applications if, additionally, indexed loads and stores switch to XLEN-width indices (see Issue #306, Issue #381, and PR #401 for background).

My particular concern is related to "index compression", a general software optimization to reduce storage and memory bandwidth costs. For example, the digit-reversal permutations in Cooley-Tukey fast Fourier transforms are typically index-compressed. Without widening loads, we'll need to manually widen the permutation indices out to width XLEN before using them in a gather or scatter.

An analogous example appears in sparse matrix codes. However, as Nagendra Gulur commented (in email to this list on 10 March), this example is more complex since the indices typically also need to be left-shifted by sizeof(matrix_element_t) to obtain byte offsets. Thus, we'll need a combination of widening multiplies and widening adds. (Note that there's currently no widening left shift.)

A third example, concerning only the XLEN-width indices in indexed loads/stores so less relevant here, was mentioned here on GitHub.

Best,
Nick Knight
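To make the index-compression pattern concrete, here is a small C sketch (assumed code, not taken from any of the applications above; the function and type choices are illustrative). The gather indices are stored as 16-bit values to save memory bandwidth, and each one must be widened to XLEN and scaled to a byte offset before it can feed an indexed load, which is exactly where the widening arithmetic discussed above appears:

    #include <stddef.h>
    #include <stdint.h>

    /* Index compression: gather indices kept as 16-bit values (assumed sketch). */
    void gather_f64(double *dst, const double *src, const uint16_t *idx, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            /* Widen the 16-bit index to a full-width offset and scale it by the
               element size.  With XLEN-width indexed loads and no widening
               byte/halfword loads, a vector version needs explicit widening
               multiplies/adds here before the gather. */
            size_t byte_off = (size_t)idx[i] * sizeof(double);
            dst[i] = *(const double *)((const char *)src + byte_off);
        }
    }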
|
Thang Tran
There are real application codes (mixed integer/FP, where the convert instructions are used) written with byte/halfword/word loads/stores. There is a huge performance impact from adding a widening instruction in a small critical loop, where every additional instruction costs more than 10% in performance.
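As a purely illustrative sketch (assumed code and names, not the confidential customer code mentioned later in the thread), a mixed integer/FP loop of this kind might be a dequantization kernel; with a vector halfword load the vectorized body is roughly load, convert, multiply, store, so one extra widening instruction in a loop this short is a 20-25% increase in instruction count:

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed example: dequantize int16_t samples to float.  With a widening
       halfword load the vector loop body is load/convert/multiply/store;
       without it, a separate widening step joins this short loop. */
    void dequantize(float *out, const int16_t *in, float scale, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = scale * (float)in[i];
    }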
I am strongly against dropping the byte/halfword/word loads/stores.

Thanks, Thang

-----Original Message-----
From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
Sent: Saturday, April 4, 2020 1:43 PM
To: tech-vector-ext@...
Subject: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03
|
Hi Thang,

Can you, and anyone else who responds, please be concrete about the applications you have in mind? I tried to do so in my email.

In my opinion, concrete examples are crucial to making an informed decision. I hope you agree.

Best, Nick Knight

On Sat, Apr 4, 2020 at 4:56 PM Thang Tran <thang@...> wrote:
|
Thang Tran
Hi Nick,
It is confidential customer application code.
Thanks, Thang
From: Nick Knight [mailto:nick.knight@...]
Sent: Saturday, April 4, 2020 5:04 PM
To: Thang Tran <thang@...>
Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03
|
|
Alex Solomatnikov
Do you really have a 2x or 4x wider write port to the vector register file to make vlb and the like work at full memory bandwidth?

If yes, what is the impact on PPA, i.e. clock frequency, area, power?

If not, then extra widening instruction would not matter because vlb itself is the bottleneck.

Alex

On Sat, Apr 4, 2020 at 5:19 PM Thang Tran <thang@...> wrote:
|
|
Thang Tran
In scalar code, there is always signed/zero extension of the data, and alignment. I do not see a difference with vector loads/stores. If alignment is needed, there is not much additional cost for signed/zero extension, and an extra pipeline stage is added.

Depending on how the load is pipelined, the load-to-use penalty may be zero. So, widening loads are much preferred in our design.
Thanks, Thang
From: tech-vector-ext@... [mailto:tech-vector-ext@...]
On Behalf Of Alex Solomatnikov
Sent: Saturday, April 4, 2020 7:09 PM
To: Thang Tran <thang@...>
Cc: Nick Knight <nick.knight@...>; Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03
Bob Dreyer said he would share an example code.
Do you really have a 2x or 4x wider write port to the vector register file to make vlb and the like work at full memory bandwidth?
If yes, what is the impact on PPA, i.e. clock frequency, area, power?
If not, then extra widening instruction would not matter because vlb itself is the bottleneck.
Alex
On Sat, Apr 4, 2020 at 5:19 PM Thang Tran <thang@...> wrote:
|
|
David Horner
I agree Nick. So here is a suggestion, not completely facetiously:
For load byte/half/word, for example when SEW = 64, an implementation can optimize the sequence (a per-element C sketch of this sequence follows below):

    strided load by 1/2/4
    shift left 56/48/32
    arith right 56/48/32

but a sign-extend byte/half/word-to-SEW instruction would make fusing/chaining simpler. And these without widening.

For stores: a “pack” SEW (of byte/half/word) instruction by SLEN into the appropriate LMUL = 1/8, 1/4 or 1/2 would allow a standard unit-strided store to work. A fractional LMUL that uses interleave (rather than right-justified SLEN chunks) would not need this pack instruction.
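A per-element C sketch of the load-emulation sequence above (a sketch, not from the thread; it assumes a little-endian target and an arithmetic right shift on signed integers, and note that fetching a full 64-bit element at a byte address can over-read past the end of the source array, which is part of why a fused or optimized form is attractive):

    #include <stdint.h>
    #include <string.h>

    /* Scalar semantics of "strided load by 1, shift left 56, arith right 56"
       for a sign-extending byte load into an SEW=64 element (assumed sketch). */
    int64_t sext_byte_element(const uint8_t *p)
    {
        uint64_t raw;
        memcpy(&raw, p, sizeof raw);        /* 64-bit element fetched at a byte address */
        int64_t x = (int64_t)(raw << 56);   /* target byte now in the most-significant byte */
        return x >> 56;                     /* arithmetic right shift sign-extends it */
    }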
On 2020-04-04 8:04 p.m., Nick Knight wrote:
|
|
These are basic operations, not application kernels.
It's easy to call out missing instructions when considering individual operations. It's more important to gather and evaluate actual application kernels.

Krste

On Sat, 4 Apr 2020 23:25:13 -0400, "David Horner" <ds2horner@...> said:
|
David Horner
I agree, it's more important to gather and evaluate actual application kernels.
Is there such an effort on-going?

I further agree with the implicit idea that much, even most, of the processing in any given kernel can occur in fractional and the lower LMUL>=1 modes. Fractional LMUL in these cases is mostly "set up" for the inevitable but usually deferred widen, performed only as needed and no earlier. RISC-V-tuned kernels can incorporate these efficiencies. However, there will be a lag before such kernels are developed and widely used.

Further, existing code ported to RISC-V cannot be expected to be optimal in this way. Consider Coremark with RV64I, not the only program that uses poor coding practices. We can expect other programs with such biases to be used to challenge RVV.

Not only that, but there is much other code in the wild that is adversely affected by not having an efficient load-to-double from byte, half or word, especially bit-manipulation logic. Consider a memory structure of a word (to load) accompanied by a byte holding decode, sign and scale factors:

    vsetvli t1,t2,e64
    vlwu.v v4,(xarray)      /* word data, zero-extended to SEW=64 */
    vlb.v v5,(xarryscale)   /* byte decode/sign/scale, sign-extended to SEW=64 */
    vxor.vv v4,v4,v5        /* apply sign bit and "decode" shift bits */
    vsll.vv v4,v4,v5        /* scale by lower 6 bits */
    /* (one bit is unused; one-bit shifts for each of v5 and v4 could insert it, but you have the idea) */

Granted, the program could use a different memory layout, with 8-way interleaved shift-decode bytes that are loaded with a byte offset and arithmetic-right-shifted by 56, processed in sets of 8. Without an efficient mechanism to load to double from word and byte, this type of operation (general bit manipulation, scaling, etc.) is substantially hampered.

The vlb.v could be replaced with the strided vlse.v plus shift left and arithmetic shift right as previously described (and quoted by Krste below). And the vlwu.v by an LMUL=1/2 word load followed by a widening vwaddu.wv into a previously zeroed double. Neither of these is efficient.

(See #411 for elucidation of clustering in fractional LMUL.)
https://github.com/riscv/riscv-v-spec/issues/413
https://github.com/riscv/riscv-v-spec/issues/411

CLSTR and clstr: width specifiers for the data cluster in each SLEN chunk (when LMUL<=1/2). Expand only when needed; compensate for the implementation-specific clustering. The above becomes:

    vsetvli t1,t2,e32,lf2   /* word in LMUL=1/2 */
    vle.v v4,(xarray)
    dclstru.v v4,v4         /* unsigned extend to double (unclustered) */
    vsetvli t1,t2,e8,lf8    /* byte in LMUL=1/8 */
    vle.v v5,(xarryscale)
    dclstr.v v5,v5          /* sign extend to double (unclustered) */
    vsetvli t1,t2,e64
    vxor.vv v4,v4,v5        /* apply sign bit and "decode" shift bits */
    vsll.vv v4,v4,v5        /* scale by lower 6 bits */

For convenience I post #413 here:

cluster/decluster instructions: with LMUL<1, loads/stores provide byte/half/word support.

Two new vector-vector unary unmasked instructions, vdclstr and vclstr, undo/apply the clustering specified in clstr.

***

Using the given SEW width as the cluster element width and LMUL<1 as the expansion factor: for each SLEN chunk in vs2, the vclstr instruction selects an SEW-width field from each SEW/LMUL element, concatenates the SEW elements into an LMUL<1 cluster, and stores the result into vd. vdclstr does the reverse: it changes vs2 from LMUL clustering into SEW/LMUL-width interleaved elements (effectively CLSTR=1) and stores the result into vd. This operation can be performed in place with vd = vs2, as each operation is on an SLEN chunk. See #411 for the specifics of cluster fields and gaps.

For vdclstr, the available options for gap fill are undisturbed, agnostic, zero fill and sign extend.

I am proposing this as unmasked only; vm=0 is reserved. There are potential difficulties with disparate mask structure between clustered and interleaved when effective CLSTR > 1.

An obvious use of the vdclstr instruction is coupled with a load to emulate byte/half/word to SEW. It is low overhead and can be chained/fused. When vd = vs2, clstr is zero (CLSTR=1) and fill is undisturbed, the sequence vclstr and vdclstr with vd = vs2 can be fused to provide byte, half and word sign- or zero-extend up to double-word SEW. If not constrained by the strawman model, zero- or sign-extension of SEW to 2*SEW, 4*SEW or 8*SEW is possible. There are numerous other uses, especially if SEW is not constrained by the strawman model.

***

The current vector-vector unary class only supports float. It is possible that these could live in that encoding, but they would likely be allocated their own unary group. I will model the encoding after VFUNARY1.

****

I don't see much value in fill1s, and even less for sign-extend: which element's sign to extend? And it would lend itself to CLSTR-specific idiosyncratic coding even more so than zero fill does.

Also relates to #362.
On 2020-04-12 6:28 a.m., krste@... wrote:
|