On 2020-12-11 3:34 a.m., Linjie Yu wrote:
Recently, I optimized
the kernel of GEMM for int8 data.
Can we see the git of your work?
I found that there was
no good solution to do it using the present vector
The main difficulty I
met is: the accumulator is 32 bits, so it needs to be widened 4
times (vqmacc, or vwmul + vwmacc, or vwmul + vwadd), which
means there are not enough registers to use.
Does this mean the 32 vector registers are not enough,
or that the number of elements for the given input vector length
is not enough?
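
For reference, the register pressure in question can be sketched with a small scalar model. This is only an illustration under stated assumptions, not the code being discussed: gemm_s8_ref and acc_group_regs are hypothetical names, the matrices are assumed row-major, and the factor of 4 assumes int8 sources with an int32 accumulator, so that the accumulator register group is four times the source LMUL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference int8 GEMM (hypothetical helper): C[m x n] = A[m x k] * B[k x n].
 * All matrices are row-major. Every product is widened to 32 bits before
 * it is accumulated, which is the widening the vector kernel must also do. */
void gemm_s8_ref(const int8_t *a, const int8_t *b, int32_t *c,
                 size_t m, size_t n, size_t k)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t acc = 0;                       /* 32-bit accumulator */
            for (size_t p = 0; p < k; p++) {
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* Number of vector registers one int32 accumulator group occupies when the
 * int8 sources use LMUL = lmul_src. The accumulator is 4x wider, so its
 * EMUL is 4 * LMUL; EMUL may not exceed 8, hence lmul_src is 1 or 2 here. */
unsigned acc_group_regs(unsigned lmul_src)
{
    assert(lmul_src == 1 || lmul_src == 2);
    return 4u * lmul_src;
}
```

Under these assumptions, int8 sources at LMUL=1 make each accumulator occupy 4 of the 32 vector registers, so at most 8 accumulator groups could exist even before any registers are set aside for the inputs.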
I used 2 different
ways to optimize it with the present vector ISA.
1. vdot.vv+vredsum.vs (the tail process is so
2. vwmul + vwredsum.vs (vwredsum.vs used in the
Note vdot.vv is experimental. It is not planned for the v1.0
To solve this, I came
up with a new instruction, called vwredsum.vs(new)
Unlike the old
vwredsum.vs, which puts the result in the first element, the new
one can put the result at any position, selected by index. It can be
used like this: vwredsum.vs v2, v1, v1, #2
With a "temporary working vector" this new instruction is a
combination of the old with any "insert scalar into element"
instruction [such as vrgather.vv splat with mask].
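
A scalar sketch of that decomposition may help. It is a model only, not an implementation of either instruction: vwredsum_model and vwredsum_idx_model are hypothetical names, the source elements are assumed to be 16 bits wide with a 32-bit result, the scalar input is assumed to come from element 0 of vs1, the other elements of the destination are assumed to be left unchanged, and overflow is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the existing widening reduction: the value that would land in
 * element 0 of the destination is vs1[0] plus the sum of vs2[0..vl-1],
 * with each 16-bit source element widened to 32 bits first. */
int32_t vwredsum_model(const int16_t *vs2, size_t vl, int32_t vs1_0)
{
    int32_t sum = vs1_0;
    for (size_t i = 0; i < vl; i++) {
        sum += (int32_t)vs2[i];
    }
    return sum;
}

/* Model of the proposed indexed form (hypothetical semantics): reduce into
 * a temporary, then insert that scalar into element idx of vd. The other
 * elements of vd are left untouched, which is an assumption of this sketch. */
void vwredsum_idx_model(int32_t *vd, size_t vd_len,
                        const int16_t *vs2, size_t vl,
                        int32_t vs1_0, size_t idx)
{
    assert(idx < vd_len);
    int32_t tmp = vwredsum_model(vs2, vl, vs1_0); /* temporary working value */
    vd[idx] = tmp;                                /* insert scalar into element */
}
```

In this model the indexed form does nothing that the existing reduction followed by a single element insert cannot do, so the cost being discussed is the extra insert instruction and the temporary register.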
But none of them is
good enough. Does anyone have a better solution?
I would be happy to look at your current work to make suggestions if
you could direct me to the code.