Re: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03
Hi Nick,
That is a good suggestion for my code, thank you very much. But I developed my code against spec 0.7.1, and the quad-widening instructions had not yet been added to that spec. I am also confused about the spec updates. How can I make my code compatible with the new spec?
Best
Damon
From: tech-vector-ext@... <tech-vector-ext@...> on behalf of Nick Knight
Sent: April 6, 2020 12:49
To: 俞林杰 <linjie.ylj@...>
Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03
Hi Damon,
Thanks for providing a concrete example!
I think you can improve the performance of your first example (non-widening loads). Instead of immediately widening, you could perform the subsequent slidedowns on the narrower 8-bit data (which goes faster), and then use quad-widening MACCs in place of the double-widening MACCs; a rough sketch follows the caveats below.
Two caveats:
-- I'm not claiming this optimization will result in something faster than your second example (with widening loads): this depends on the hardware implementation.
-- These quad-widening instructions are still considered to be a (sub-) extension, so perhaps they won't be available on your platform. However, for the application domain I'm certain you have in mind, I think implementing Zvqmac is a very smart idea ;)
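As a rough sketch of the restructuring (with the outer loop of your first example unchanged; the intrinsic names below are hypothetical, modelled on the 0.7.1-style names in your example, since no toolchain exposes Zvqmac intrinsics yet):
inline void convolve_row3x1(int32xm4_t &out, const uint8xm1_t &row_data,
                            const int16_t *convolution)
{
    // Assumes the coefficients fit in 8 bits, since the quad-widening
    // MACC takes SEW-wide (here 8-bit) operands.
    const int8_t mat0 = (int8_t)*(convolution);
    const int8_t mat1 = (int8_t)*(convolution + 1);
    const int8_t mat2 = (int8_t)*(convolution + 2);
    unsigned int gvl = vsetvli(16, RVV_E8, RVV_M1);
    // Slidedowns stay at SEW=8, so they move half as much data.
    uint8xm1_t row_03 = vslidedownvx_uint8xm1(row_data, 1, gvl);
    uint8xm1_t row_47 = vslidedownvx_uint8xm1(row_data, 2, gvl);
    // Quad-widening MACC (vqmaccsu.vx from Zvqmac): signed 8-bit coefficient
    // times unsigned 8-bit data, accumulated straight into 32-bit elements.
    out = vqmaccsuvx_int32xm4_uint8xm1(mat0, row_data, out, gvl);
    out = vqmaccsuvx_int32xm4_uint8xm1(mat1, row_03, out, gvl);
    out = vqmaccsuvx_int32xm4_uint8xm1(mat2, row_47, out, gvl);
}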
Best,
Nick Knight
Hi all,
I have some applications that use byte/halfword/word vector loads/stores, such as gemm, direct convolution and so on.
For a 3x3 direct convolution, the code without byte/halfword/word vector load/stores can be:
int gvl = vsetvli(16, RVV_E32, RVV_M4);
int32xm4_t out = vmvvi_int32xm4(0, gvl);        // zero the 32-bit accumulators
for (unsigned int r = 0; r < 3; ++r)
{
    gvl = vsetvli(16, RVV_E8, RVV_M1);
    const uint8xm1_t data = vle_uint8xm1(input_ptrs[r], gvl);   // plain byte load at SEW=8
    convolve_row3x1(out, data, conv + r * cols);
}

inline void convolve_row3x1(int32xm4_t &out, const uint8xm1_t &row_data,
                            const int16_t *convolution)
{
    const int16_t mat0 = *(convolution);
    const int16_t mat1 = *(convolution + 1);
    const int16_t mat2 = *(convolution + 2);
    unsigned int gvl = vsetvli(16, RVV_E8, RVV_M1);
    // Zero-extend the bytes to 16 bits with a widening add of 0.
    int16xm2_t row = (int16xm2_t)vwadduvx_uint16xm2_uint8xm1(row_data, 0, gvl);
    gvl = vsetvli(8, RVV_E16, RVV_M2);
    int16xm2_t row_03 = vslidedownvx_int16xm2(row, 1, gvl);
    int16xm2_t row_47 = vslidedownvx_int16xm2(row, 2, gvl);
    // Double-widening MACCs: 16-bit data times 16-bit coefficient into 32-bit accumulators.
    out = vwmaccvx_int32xm4_int16xm2(mat0, row, out, gvl);
    out = vwmaccvx_int32xm4_int16xm2(mat1, row_03, out, gvl);
    out = vwmaccvx_int32xm4_int16xm2(mat2, row_47, out, gvl);
}
The code with byte/halfword/word vector load/stores can be:
int gvl = vsetvli(16, RVV_E32, RVV_M4);
int32xm4_t out = vmvvi_int32xm4(0, gvl);        // zero the 32-bit accumulators
for (unsigned int r = 0; r < 3; ++r)
{
    gvl = vsetvli(16, RVV_E16, RVV_M2);
    // vlbu.v: load bytes and zero-extend to SEW=16 during the load
    const uint16xm2_t data = vlbuv_uint16xm2(input_ptrs[r], gvl);
    convolve_row3x1(out, data, conv + r * cols);
}

inline void convolve_row3x1(int32xm4_t &out, const uint16xm2_t &row_data,
                            const int16_t *convolution)
{
    const int16_t mat0 = *(convolution);
    const int16_t mat1 = *(convolution + 1);
    const int16_t mat2 = *(convolution + 2);
    unsigned int gvl = vsetvli(8, RVV_E16, RVV_M2);
    int16xm2_t row = (int16xm2_t)row_data;      // already widened by the load
    int16xm2_t row_03 = vslidedownvx_int16xm2(row, 1, gvl);
    int16xm2_t row_47 = vslidedownvx_int16xm2(row, 2, gvl);
    out = vwmaccvx_int32xm4_int16xm2(mat0, row, out, gvl);
    out = vwmaccvx_int32xm4_int16xm2(mat1, row_03, out, gvl);
    out = vwmaccvx_int32xm4_int16xm2(mat2, row_47, out, gvl);
}
The instruction count of the code with byte/halfword/word vector load/stores is reduced by about 15%. But as the kernel size becomes larger, this gap gets smaller: when the size is 9x9, the gap is about 7%. So, in my opinion, these load/store instructions are useful.
Yours
Damon
-----Original Message-----
From: tech-vector-ext@... <tech-vector-ext@...> on behalf of Krste Asanovic
Sent: April 5, 2020 4:43
To: tech-vector-ext@...
Subject: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03
Date: 2020/4/03
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~15
Current issues on github: https://github.com/riscv/riscv-v-spec
Issues discussed: #354/362
The following issues were discussed.
Closing on version v0.9. A list of proposed changes to form v0.9 was presented. The main dispute was around dropping byte/halfword/word vector load/stores.
#354/362 Drop byte/halfword/word vector load/stores
Most of the meeting time was spent discussing this issue, which was
contentious.
Participants in favor of retaining these instructions were concerned about
the code size and performance impact of dropping them.
Proponents of dropping them noted that the main impact was only on integer code (floating-point code does not benefit from these instructions), that performance might be lower using these instructions rather than widening separately, and that there was a large benefit in reduced memory-pipeline complexity. The group planned to consider some examples to be supplied by the members, including some mixed floating-point/integer code.
Discussion to continue on mailing list.
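For reference, the two alternatives at issue, sketched with the 0.7.1-style intrinsic names used earlier in this thread (ptr and n are placeholders for a byte pointer and an element count, and the vlbuv name is illustrative):
// With the byte load (vlbu.v): widening happens during the load itself.
unsigned int gvl = vsetvli(n, RVV_E16, RVV_M2);
uint16xm2_t a = vlbuv_uint16xm2(ptr, gvl);                 // load bytes, zero-extend to 16 bits

// Without it: load at SEW=8, then widen with a separate instruction.
gvl = vsetvli(n, RVV_E8, RVV_M1);
uint8xm1_t b8 = vle_uint8xm1(ptr, gvl);                    // plain byte load
uint16xm2_t b = vwadduvx_uint16xm2_uint8xm1(b8, 0, gvl);   // zero-extend via widening add of 0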