I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplications (spMV) using RISCV vectors using bit-vector format of compressing the sparse matrix. The inner loop of sequential spMV algorithm simply multiply-accumulates non-zeroes of a row of M with corresponding values of the dense vector (say V) to form an output element. To implement this efficiently where bit-vectors are used to compress M is where I have a question/observation. In bit-vector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is a 0. A 1 indicates that the matrix value is non-zero. A values array -- say M_Vals[.] -- stores these non-zero values.
The basic idea of my implementation is to treat the bit vector as a mask for vector load and vector-multiply. But I am running into a couple of efficiency issues doing so. Stating them with an example vector size of 8..
1. When I load a byte of the bit vector -- I need to convert this byte to a suitable mask that I can store in V0. I need to perform a widening of the bits that are loaded into V0. For eg: if the SEW is 4 bytes, then the bits need to be loaded into V0 at SEW byte gaps of 2 bits. Is this something doable in the current or upcoming specifications?
2. Even if the above is doable with some widening control, I can not use the bit vector or the mask to load from M_Vals[.]. Since M_Vals[.] is densely storing the non-zeroes of M, I just need to access X consecutive array elements where X = number of 1s in the mask. I will need to do a 1s-count on the mask (using vpopc instruction), and then convert that count to a new mask which will allow me to do a vector load of X adjacent elements. If the original bit vector has 10011001, then the 1s-count = 4 and I need to convert this to a new mask 00001111. I am not sure how to efficiently do this.
3. Even after I do the above, the loaded values of M_Vals[.] will now line up into a vector register as adjacent elements while the bit-vector masked load of V[.] has loaded these elements into corresponding unmasked element positions. I can not multiply these registers element-wise. I have to align them before I can do the multiplies.
Given these overheads, it appears that vectorizing a bit-vector compressed spMV is rather hopeless.
Thinking about useful vector instructions in this context, I could come up with a couple of suggestions:
1. Load a bit vector into V0 expanding each bit into its own element mask in V0. For eg: if the original bit vector is the byte "10011001" and SEW = 4 bytes, then V0 is loaded with "01 00 00 01 01 00 00 01". That is, we need the instruction to perform an automatic mask-element-sized expansion of bits.
2. After doing vpopc to get the 1s count into a scalar, to convert the scalar back into a vector mask for that many adjacent 1s in the mask. With the above example, vpopc would return 4 into a scalar register. Now we need an instruction that loads "00 00 00 00 01 01 01 01" back to V0 converting the 4 in the scalar to 4 successive elements that are un-masked.
3. For the alignment problem between M_Vals[.] and V[.], it might just be sufficient to add a 1-bit control to the vector loads. This bit can control whether the masked-loads load into element-positions or load into consecutive positions starting at position 0. If such a control was available, then one could load V[.] using an indexed vector load but load into adjacent positions. That would make the elements pairwise aligned and compatible for pairwise multiplication. With this load semantic, I am separating controlling the load sources (via the mask) from where the load values are stored in the register (using the same mask to load at unmasked elements or to load consecutively into adjacent positions).
Either I am missing some trick or bit-vector based compression is indeed tricky!
Appreciate some feedback about this topic and if there's a possibility to review the issue in greater detail at one of the upcoming meetings, I can prepare some slides.