I agree with your point. But the spec is not yet stable, and there is no target SoC on which to verify it. So, in my opinion, instruction count is an important indicator at present.

From: Bruce Hoult <bruce@...>
Sent: April 6, 2020 11:14
To: 俞林杰 <linjie.ylj@...>
Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03

The number of instructions does not necessarily correspond to speed, and especially not to PPA or efficiency. Making the load/store unit simpler might save enough area to make it as energy efficient, or more so, at the same clock rate. It might also allow a higher clock rate (less likely).
On Mon, Apr 6, 2020 at 3:05 PM Linjie Yu <linjie.ylj@...> wrote:

Hi all,
I have some applications that use byte/halfword/word vector load/stores, such as GEMM, direct convolution, and so on. For a 3x3 direct convolution, the code without byte/halfword/word vector load/stores can be:
    int gvl = vsetvli(16, RVV_E32, RVV_M4);
    int32xm4_t out = vmvvi_int32xm4(0, gvl);
    for (unsigned int r = 0; r < 3; ++r) {
        gvl = vsetvli(16, RVV_E8, RVV_M1);
        const uint8xm1_t data = vle_uint8xm1(input_ptrs[r], gvl);
        convolve_row3x1(out, data, conv + r * cols);
    }
    inline void convolve_row3x1(int32xm4_t &out, const uint8xm1_t &row_data,
                                const int16_t *convolution)
    {
        const int16_t mat0 = *(convolution);
        const int16_t mat1 = *(convolution + 1);
        const int16_t mat2 = *(convolution + 2);

        // Widen the unsigned bytes to 16-bit elements.
        unsigned int gvl = vsetvli(16, RVV_E8, RVV_M1);
        int16xm2_t row = (int16xm2_t)vwadduvx_uint16xm2_uint8xm1(row_data, 0, gvl);

        gvl = vsetvli(8, RVV_E16, RVV_M2);
        int16xm2_t row_03 = vslidedownvx_int16xm2(row, 1, gvl);
        int16xm2_t row_47 = vslidedownvx_int16xm2(row, 2, gvl);

        out = vwmaccvx_int32xm4_int16xm2(mat0, row, out, gvl);
        out = vwmaccvx_int32xm4_int16xm2(mat1, row_03, out, gvl);
        out = vwmaccvx_int32xm4_int16xm2(mat2, row_47, out, gvl);
    }
The code with byte/halfword/word vector load/stores can be:

    int gvl = vsetvli(16, RVV_E32, RVV_M4);
    int32xm4_t out = vmvvi_int32xm4(0, gvl);
    for (unsigned int r = 0; r < 3; ++r) {
        gvl = vsetvli(16, RVV_E16, RVV_M2);
        // Byte load that widens directly to 16-bit elements.
        const uint16xm2_t data = vlbv_uint16xm2(input_ptrs[r], gvl);
        convolve_row3x1(out, data, conv + r * cols);
    }
    inline void convolve_row3x1(int32xm4_t &out, const uint16xm2_t &row_data,
                                const int16_t *convolution)
    {
        const int16_t mat0 = *(convolution);
        const int16_t mat1 = *(convolution + 1);
        const int16_t mat2 = *(convolution + 2);

        // The widening load already produced 16-bit elements,
        // so no separate widening step is needed here.
        unsigned int gvl = vsetvli(8, RVV_E16, RVV_M2);
        int16xm2_t row = (int16xm2_t)row_data;
        int16xm2_t row_03 = vslidedownvx_int16xm2(row, 1, gvl);
        int16xm2_t row_47 = vslidedownvx_int16xm2(row, 2, gvl);

        out = vwmaccvx_int32xm4_int16xm2(mat0, row, out, gvl);
        out = vwmaccvx_int32xm4_int16xm2(mat1, row_03, out, gvl);
        out = vwmaccvx_int32xm4_int16xm2(mat2, row_47, out, gvl);
    }
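For reference, both vector versions compute the same thing: a 3-tap horizontal convolution accumulating into 32-bit sums. A plain scalar model of that computation (my own sketch, not part of the original code, with a hypothetical `convolve_row3x1_ref` name) is:

```cpp
#include <cstdint>
#include <cstddef>

// Scalar reference for convolve_row3x1: a 3-tap horizontal convolution
// accumulating into 32-bit sums, as the vector code does with vwmacc.
// Produces n outputs from n+2 input bytes.
inline void convolve_row3x1_ref(int32_t *out, const uint8_t *row_data,
                                const int16_t *convolution, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] += convolution[0] * (int32_t)row_data[i]
                + convolution[1] * (int32_t)row_data[i + 1]
                + convolution[2] * (int32_t)row_data[i + 2];
    }
}
```

The slide-down by 1 and 2 in the vector code corresponds to the `i + 1` and `i + 2` offsets here.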
With byte/halfword/word vector load/stores, the instruction count can be reduced by about 15%. But as the kernel size becomes larger, this gap shrinks: when the kernel size is 9x9, the gap is about 7%. So, in my opinion, these load/store instructions are useful.
Yours,
Damon

-----Original Message-----
From: tech-vector-ext@... <tech-vector-ext@...> on behalf of Krste Asanovic
Sent: April 5, 2020 4:43
To: tech-vector-ext@...
Subject: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03
Date: 2020/4/03
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~15
Current issues on github: https://github.com/riscv/riscv-v-spec
Issues discussed: #354/362
The following issues were discussed.
Closing on version v0.9. A list of proposed changes to form v0.9 was presented. The main dispute was around dropping byte/halfword/word vector load/stores.
#354/362 Drop byte/halfword/word vector load/stores
Most of the meeting time was spent discussing this issue, which was contentious.
Participants in favor of retaining these instructions were concerned about the code size and performance impact of dropping them. Those in favor of dropping them noted that the main impact is only on integer code (floating-point code does not benefit from these instructions), that performance might actually be lower using these instructions rather than separate widening operations, and that there is a large benefit in reduced memory pipeline complexity. The group agreed to consider some examples to be supplied by the members, including some mixed floating-point/integer code.
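To illustrate the trade-off under discussion: a dedicated zero-extending byte load (e.g. vlbu.v at SEW=16) can be replaced by a plain byte load (vle8.v) plus a separate widening step, costing one extra instruction but producing identical results. The following is a scalar C++ model of the two sequences (my own sketch, not actual RVV code; function names are hypothetical):

```cpp
#include <cstdint>
#include <cstddef>

// Model of a dedicated widening byte load (e.g. vlbu.v at SEW=16):
// load bytes and zero-extend to 16 bits in a single operation.
void load_bytes_widening(uint16_t *dst, const uint8_t *src, size_t vl)
{
    for (size_t i = 0; i < vl; ++i)
        dst[i] = (uint16_t)src[i];
}

// Model of the two-instruction alternative: a plain byte load (vle8.v)...
void load_bytes(uint8_t *dst, const uint8_t *src, size_t vl)
{
    for (size_t i = 0; i < vl; ++i)
        dst[i] = src[i];
}

// ...followed by a separate zero-extension (e.g. a widening add of zero).
void zero_extend(uint16_t *dst, const uint8_t *src, size_t vl)
{
    for (size_t i = 0; i < vl; ++i)
        dst[i] = (uint16_t)src[i];
}
```

The debate is whether the saved instruction justifies the extra width-conversion cases the memory pipeline must handle.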
Discussion to continue on mailing list.