#### Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression

Krste Asanovic

On Thu, 07 May 2020 11:43:27 -0700, "Nagendra Gulur" <nagendra.gd@...> said:
| I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplications (spMV) using RISCV vectors using bit-vector format of
| compressing the sparse matrix. The inner loop of sequential spMV algorithm simply multiply-accumulates non-zeroes of a row of M with corresponding values of the
| dense vector (say V[]) to form an output element. To implement this efficiently where bit-vectors are used to compress M is where I have a question/observation.
| In bit-vector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is a 0. A 1 indicates that the matrix value is
| non-zero. A values array -- say M_Vals[.] -- stores these non-zero values.

| The basic idea of my implementation is to treat the bit vector as a mask for vector load and vector-multiply. But I am running into a couple of efficiency issues
| doing so. Stating them with an example vector size of 8..
| 1. When I load a byte of the bit vector -- I need to convert this byte to a suitable mask that I can store in V0. I need to perform a widening of the bits that are
| loaded into V0. For eg: if the SEW is 4 bytes, then the bits need to be loaded into V0 at SEW byte gaps of 2 bits. Is this something doable in the current or
| upcoming specifications?

Masks are now stored as a packed bit vector in registers, in same
format as a packed bit vector would be in memory.

| 2. Even if the above is doable with some widening control, I can not use the bit vector or the mask to load from M_Vals[.]. Since M_Vals[.] is densely storing the
| non-zeroes of M[][], I just need to access X consecutive array elements where X = number of 1s in the mask. I will need to do a 1s-count on the mask (using vpopc
| instruction), and then convert that count to a new mask which will allow me to do a vector load of X adjacent elements. If the original bit vector has 10011001,
| then the 1s-count = 4 and I need to convert this to a new mask 00001111. I am not sure how to efficiently do this.

The viota.m instruction is what you need. Look at section 16.8 in
spec.

Krste

| 3. Even after I do the above, the loaded values of M_Vals[.] will now line up into a vector register as adjacent elements while the bit-vector masked load of V[.]
| has loaded these elements into corresponding unmasked element positions. I can not multiply these registers element-wise. I have to align them before I can do the
| multiplies.

| Given these overheads, it appears that vectorizing a bit-vector compressed spMV is rather hopeless.

| Thinking about useful vector instructions in this context, I could come up with a couple of suggestions:
| 1. Load a bit vector into V0 expanding each bit into its own element mask in V0. For eg: if the original bit vector is the byte "10011001" and SEW = 4 bytes, then
| V0 is loaded with "01 00 00 01 01 00 00 01". That is, we need the instruction to perform an automatic mask-element-sized expansion of bits.

| 2. After doing vpopc to get the 1s count into a scalar, to convert the scalar back into a vector mask for that many adjacent 1s in the mask. With the above
| example, vpopc would return 4 into a scalar register. Now we need an instruction that loads "00 00 00 00 01 01 01 01" back to V0 converting the 4 in the scalar to
| 4 successive elements that are un-masked.

| 3. For the alignment problem between M_Vals[.] and V[.], it might just be sufficient to add a 1-bit control to the vector loads. This bit can control whether the
| an indexed vector load but load into adjacent positions. That would make the elements pairwise aligned and compatible for pairwise multiplication. With this load
| semantic, I am separating controlling the load sources (via the mask) from where the load values are stored in the register (using the same mask to load at