RE: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03


Linjie Yu

I agree with your point. But nowadays the spec is not stable, and there is no target SoC on which to verify it. So, in my opinion, the number of instructions is an important indicator at present.

From: Bruce Hoult <bruce@...>
Sent: April 6, 2020 11:14
To: Linjie Yu <linjie.ylj@...>
Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03

The number of instructions does not necessarily correspond to speed, and especially not to PPA or efficiency. Making the load/stores simpler might save enough area to make the design as energy-efficient at the same clock rate, or more so. It might also allow a higher clock rate (less likely).

On Mon, Apr 6, 2020 at 3:05 PM Linjie Yu <linjie.ylj@...> wrote:

Hi all,

I have some applications that use byte/halfword/word vector load/stores,
such as GEMM, direct convolution, and so on.
For a 3x3 direct convolution, the code without byte/halfword/word vector
load/stores can be:

int gvl = vsetvli(16, RVV_E32, RVV_M4);
int32xm4_t out = vmvvi_int32xm4(0, gvl);   /* zero the accumulator */
for(unsigned int r = 0; r < 3; ++r)
{
    gvl = vsetvli(16, RVV_E8, RVV_M1);
    const uint8xm1_t data = vle_uint8xm1(input_ptrs[r], gvl);
    convolve_row3x1(out, data, conv + r * cols);
}

inline void convolve_row3x1(int32xm4_t &out, const uint8xm1_t &row_data,
                            const int16_t *convolution)
{
    const int16_t mat0 = *(convolution);
    const int16_t mat1 = *(convolution + 1);
    const int16_t mat2 = *(convolution + 2);

    unsigned int gvl = vsetvli(16, RVV_E8, RVV_M1);
    /* widen the unsigned bytes to 16-bit elements (vwaddu.vx with 0) */
    int16xm2_t row = (int16xm2_t)vwadduvx_uint16xm2_uint8xm1(row_data, 0, gvl);

    gvl = vsetvli(8, RVV_E16, RVV_M2);
    int16xm2_t row_03 = vslidedownvx_int16xm2(row, 1, gvl);
    int16xm2_t row_47 = vslidedownvx_int16xm2(row, 2, gvl);

    out = vwmaccvx_int32xm4_int16xm2(mat0, row, out, gvl);
    out = vwmaccvx_int32xm4_int16xm2(mat1, row_03, out, gvl);
    out = vwmaccvx_int32xm4_int16xm2(mat2, row_47, out, gvl);
}

The code with byte/halfword/word vector load/stores can be:

int gvl = vsetvli(16, RVV_E32, RVV_M4);
int32xm4_t out = vmvvi_int32xm4(0, gvl);   /* zero the accumulator */
for(unsigned int r = 0; r < 3; ++r)
{
    gvl = vsetvli(16, RVV_E16, RVV_M2);
    /* byte load that zero-extends directly into 16-bit elements */
    const uint16xm2_t data = vlbuv_uint16xm2(input_ptrs[r], gvl);
    convolve_row3x1(out, data, conv + r * cols);
}

inline void convolve_row3x1(int32xm4_t &out, const uint16xm2_t &row_data,
                            const int16_t *convolution)
{
    const int16_t mat0 = *(convolution);
    const int16_t mat1 = *(convolution + 1);
    const int16_t mat2 = *(convolution + 2);

    unsigned int gvl = vsetvli(8, RVV_E16, RVV_M2);
    /* the data is already 16-bit; no widening step is needed */
    const int16xm2_t row = (int16xm2_t)row_data;
    int16xm2_t row_03 = vslidedownvx_int16xm2(row, 1, gvl);
    int16xm2_t row_47 = vslidedownvx_int16xm2(row, 2, gvl);

    out = vwmaccvx_int32xm4_int16xm2(mat0, row, out, gvl);
    out = vwmaccvx_int32xm4_int16xm2(mat1, row_03, out, gvl);
    out = vwmaccvx_int32xm4_int16xm2(mat2, row_47, out, gvl);
}

The instruction count of the code with byte/halfword/word vector
load/stores can be reduced by about 15%. But as the kernel size becomes
larger, this gap shrinks: when the kernel is 9x9, the gap is about 7%.
So, in my opinion, these load/store instructions are useful.
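
A rough per-row count (ignoring loop overhead) shows where the saving comes
from: the first version issues vsetvli + vle + vsetvli + vwaddu + vsetvli +
two vslidedown + three vwmacc, ten vector-related instructions per row plus
three scalar coefficient loads, while the second drops the widening vwaddu
and one vsetvli, two instructions out of roughly thirteen, which lines up
with the ~15% figure above.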

Yours
Damon
-----Original Message-----
From: tech-vector-ext@... <tech-vector-ext@...> on behalf of Krste Asanovic
Sent: April 5, 2020 4:43
To: tech-vector-ext@...
Subject: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03


Date: 2020/4/03
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~15
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed: #354/362

The following issues were discussed.

Closing on version v0.9. A list of proposed changes to form v0.9 was
presented.  The main dispute was around dropping byte/halfword/word vector
load/stores.

#354/362 Drop byte/halfword/word vector load/stores

Most of the meeting time was spent discussing this issue, which was
contentious.

Participants in favor of retaining these instructions were concerned about
the code size and performance impact of dropping them.
Proponents of dropping them noted that the main impact is only on integer
code (floating-point code does not benefit from these instructions), that
performance might be lower using these instructions than using widening
operations, and that there is a large benefit in reducing memory pipeline
complexity.  The group was going to consider some examples to be supplied by
the members, including some mixed floating-point/integer code.
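
For reference, a minimal sketch of the two styles under discussion, written
with the same EPI-style intrinsic names as the example earlier in the thread
(the helper names here, and the vlbuv form from the pre-v0.9 draft, are
assumptions for illustration, not a fixed API):

/* With the fixed-size loads (pre-v0.9): vlbu.v zero-extends each
   byte directly into a 16-bit element in a single instruction. */
uint16xm2_t load_widen_with_vlbu(const uint8_t *ptr, int n)
{
    unsigned int gvl = vsetvli(n, RVV_E16, RVV_M2);
    return vlbuv_uint16xm2(ptr, gvl);   /* assumed pre-v0.9 intrinsic */
}

/* Without them (proposed v0.9): load at SEW=8, then widen with
   vwaddu.vx against zero, costing one extra instruction and an
   extra vsetvli per load. */
uint16xm2_t load_widen_without_vlbu(const uint8_t *ptr, int n)
{
    unsigned int gvl = vsetvli(n, RVV_E8, RVV_M1);
    uint8xm1_t b = vle_uint8xm1(ptr, gvl);
    return vwadduvx_uint16xm2_uint8xm1(b, 0, gvl);
}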

Discussion to continue on mailing list.