Recently I optimized a GEMM kernel for int8 data, and I could not find a good way to do it with the present vector ISA.
The main difficulty is that the accumulator is 32 bits, so the int8 inputs have to be widened 4 times (vqmacc, or vwmul + vwmacc, or vwmul + vwadd), which leaves too few registers to work with.
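For reference, here is a minimal scalar sketch of the computation in question, to show where the 4x widening comes from. The function name `gemm_s8_ref` and the row-major layout are assumptions made for illustration, not part of the kernel described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: C[M x N] = A[M x K] * B[K x N], all row-major.
 * Inputs are int8, but the running sum is kept in int32, which is
 * four times the input width. (Name and layout are hypothetical.) */
void gemm_s8_ref(size_t M, size_t N, size_t K,
                 const int8_t *A, const int8_t *B, int32_t *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < N; j++) {
            int32_t acc = 0;  /* 32-bit accumulator */
            for (size_t k = 0; k < K; k++) {
                /* Widen both int8 operands before multiplying. */
                acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```

A single product such as (-128) * (-128) = 16384 still fits in 16 bits, but the sum of two such products already does not, which is why the accumulator ends up at 32 bits. In vector code, every int8 input register then corresponds to four registers' worth of int32 accumulators.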
I tried 2 different ways to optimize it with the present vector ISA:
1. vdot.vv + vredsum.vs (the tail handling is very complex)
2. vwmul + vwredsum.vs (vwredsum.vs is used inside the for loop)
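As a rough illustration of the second approach, the sketch below models it in scalar C for one dot product. It is only a sketch under assumptions: `VL`, `model_vwmul`, `model_vwredsum` and `dot_s8` are hypothetical names, and the helpers model only the arithmetic of vwmul.vv and vwredsum.vs, not masking or tail policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VL 16  /* hypothetical number of int8 elements per vector */

/* Models vwmul.vv: int8 x int8 -> int16, element by element. */
static void model_vwmul(const int8_t *a, const int8_t *b,
                        int16_t *prod, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        prod[i] = (int16_t)((int16_t)a[i] * (int16_t)b[i]);
}

/* Models vwredsum.vs: acc0 plus the sum of the sign-extended int16
 * elements. The hardware writes this value to element 0 of vd. */
static int32_t model_vwredsum(const int16_t *src, size_t vl, int32_t acc0)
{
    int32_t sum = acc0;
    for (size_t i = 0; i < vl; i++)
        sum += src[i];
    return sum;
}

/* One int8 dot product of length k with an int32 result. */
int32_t dot_s8(const int8_t *a, const int8_t *b, size_t k)
{
    assert(a != NULL && b != NULL);
    int16_t prod[VL];
    int32_t acc = 0;
    for (size_t i = 0; i < k; i += VL) {
        size_t vl = (k - i < VL) ? (k - i) : VL;  /* like vsetvli */
        model_vwmul(a + i, b + i, prod, vl);
        acc = model_vwredsum(prod, vl, acc);      /* reduction in the loop */
    }
    return acc;
}
```

The products only need int16 registers, which eases the register pressure, but a reduction is issued on every loop iteration and each one yields a single scalar in element 0.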
To solve this, I came up with a new instruction; call it vwredsum.vs (new).
The old vwredsum.vs always puts the result in the first element. The new one can put the result at any position, selected by an index. It could be used like this: vwredsum.vs v2, v1, v1, #2
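To make the difference concrete, here is a scalar C model of both forms. The existing instruction is modeled as vd[0] = vs1[0] + sum(vs2[*]). For the proposed form, the post does not say where the scalar input comes from, so the sketch assumes it is still read from vs1[0] and that only the destination element changes. The function names and the `vd_len` parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Existing vwredsum.vs (int16 -> int32 case):
 * vd[0] = vs1[0] + sum of sign-extended vs2[0..vl-1]. */
void model_vwredsum_vs(int32_t *vd, const int16_t *vs2,
                       const int32_t *vs1, size_t vl)
{
    int32_t sum = vs1[0];
    for (size_t i = 0; i < vl; i++)
        sum += vs2[i];
    vd[0] = sum;
}

/* Proposed form: the same sum, written to vd[idx] instead of vd[0].
 * Assumption: the scalar input is still vs1[0]; other elements of vd
 * are left unchanged in this model. */
void model_vwredsum_vs_idx(int32_t *vd, size_t vd_len, const int16_t *vs2,
                           const int32_t *vs1, size_t vl, size_t idx)
{
    assert(idx < vd_len);
    int32_t sum = vs1[0];
    for (size_t i = 0; i < vl; i++)
        sum += vs2[i];
    vd[idx] = sum;
}
```

Under these assumptions, several reductions can be collected into consecutive elements of one destination register (idx = 0, 1, 2, ...), so the results can be stored with a single vector store instead of being moved out one scalar at a time.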
But none of these is good enough. Does anyone have a better solution?