#### Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression

Nagendra Gulur

I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplications (spMV) using RISCV vectors using bit-vector format of compressing the sparse matrix. The inner loop of sequential spMV algorithm simply multiply-accumulates non-zeroes of a row of M with corresponding values of the dense vector (say V[]) to form an output element.  To implement this efficiently where bit-vectors are used to compress M is where I have a question/observation. In bit-vector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is a 0. A 1 indicates that the matrix value is non-zero. A values array -- say M_Vals[.] -- stores these non-zero values.

The basic idea of my implementation is to treat the bit vector as a mask for vector load and vector-multiply. But I am running into a couple of efficiency issues doing so. Stating them with an example vector size of 8..
1. When I load a byte of the bit vector -- I need to convert this byte to a suitable mask that I can store in V0. I need to perform a widening of the bits that are loaded into V0. For eg: if the SEW is 4 bytes, then the bits need to be loaded into V0 at SEW byte gaps of 2 bits. Is this something doable in the current or upcoming specifications?

2. Even if the above is doable with some widening control, I can not use the bit vector or the mask to load from M_Vals[.]. Since M_Vals[.] is densely storing the non-zeroes of M[][], I just need to access X consecutive array elements where X = number of 1s in the mask. I will need to do a 1s-count on the mask (using vpopc instruction), and then convert that count to a new mask which will allow me to do a vector load of X adjacent elements. If the original bit vector has 10011001, then the 1s-count = 4 and I need to convert this to a new mask 00001111. I am not sure how to efficiently do this.

3. Even after I do the above, the loaded values of M_Vals[.] will now line up into a vector register as adjacent elements while the bit-vector masked load of V[.] has loaded these elements into corresponding unmasked element positions. I can not multiply these registers element-wise. I have to align them before I can do the multiplies.

Given these overheads, it appears that vectorizing a bit-vector compressed spMV is rather hopeless.

Thinking about useful vector instructions in this context, I could come up with a couple of suggestions:
1. Load a bit vector into V0 expanding each bit into its own element mask in V0. For eg: if the original bit vector is the byte "10011001" and SEW = 4 bytes, then V0 is loaded with "01 00 00 01 01 00 00 01". That is, we need the instruction to perform an automatic mask-element-sized expansion of bits.

2. After doing vpopc to get the 1s count into a scalar, to convert the scalar back into a vector mask for that many adjacent 1s in the mask. With the above example, vpopc would return 4 into a scalar register. Now we need an instruction that loads "00 00 00 00 01 01 01 01" back to V0 converting the 4 in the scalar to 4 successive elements that are un-masked.