On Thu, 07 May 2020 11:43:27 0700, "Nagendra Gulur" <nagendra.gd@...> said:
 I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplications (spMV) using RISCV vectors using bitvector format of
 compressing the sparse matrix. The inner loop of sequential spMV algorithm simply multiplyaccumulates nonzeroes of a row of M with corresponding values of the
 dense vector (say V[]) to form an output element. To implement this efficiently where bitvectors are used to compress M is where I have a question/observation.
 In bitvector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is a 0. A 1 indicates that the matrix value is
 nonzero. A values array  say M_Vals[.]  stores these nonzero values.
 The basic idea of my implementation is to treat the bit vector as a mask for vector load and vectormultiply. But I am running into a couple of efficiency issues
 doing so. Stating them with an example vector size of 8..
 1. When I load a byte of the bit vector  I need to convert this byte to a suitable mask that I can store in V0. I need to perform a widening of the bits that are
 loaded into V0. For eg: if the SEW is 4 bytes, then the bits need to be loaded into V0 at SEW byte gaps of 2 bits. Is this something doable in the current or
 upcoming specifications?
Masks are now stored as a packed bit vector in registers, in same
format as a packed bit vector would be in memory.
 2. Even if the above is doable with some widening control, I can not use the bit vector or the mask to load from M_Vals[.]. Since M_Vals[.] is densely storing the
 nonzeroes of M[][], I just need to access X consecutive array elements where X = number of 1s in the mask. I will need to do a 1scount on the mask (using vpopc
 instruction), and then convert that count to a new mask which will allow me to do a vector load of X adjacent elements. If the original bit vector has 10011001,
 then the 1scount = 4 and I need to convert this to a new mask 00001111. I am not sure how to efficiently do this.
The viota.m instruction is what you need. Look at section 16.8 in
spec.
Krste
 3. Even after I do the above, the loaded values of M_Vals[.] will now line up into a vector register as adjacent elements while the bitvector masked load of V[.]
 has loaded these elements into corresponding unmasked element positions. I can not multiply these registers elementwise. I have to align them before I can do the
 multiplies.
 Given these overheads, it appears that vectorizing a bitvector compressed spMV is rather hopeless.
 Thinking about useful vector instructions in this context, I could come up with a couple of suggestions:
 1. Load a bit vector into V0 expanding each bit into its own element mask in V0. For eg: if the original bit vector is the byte "10011001" and SEW = 4 bytes, then
 V0 is loaded with "01 00 00 01 01 00 00 01". That is, we need the instruction to perform an automatic maskelementsized expansion of bits.
 2. After doing vpopc to get the 1s count into a scalar, to convert the scalar back into a vector mask for that many adjacent 1s in the mask. With the above
 example, vpopc would return 4 into a scalar register. Now we need an instruction that loads "00 00 00 00 01 01 01 01" back to V0 converting the 4 in the scalar to
 4 successive elements that are unmasked.
 3. For the alignment problem between M_Vals[.] and V[.], it might just be sufficient to add a 1bit control to the vector loads. This bit can control whether the
 maskedloads load into elementpositions or load into consecutive positions starting at position 0. If such a control was available, then one could load V[.] using
 an indexed vector load but load into adjacent positions. That would make the elements pairwise aligned and compatible for pairwise multiplication. With this load
 semantic, I am separating controlling the load sources (via the mask) from where the load values are stored in the register (using the same mask to load at
 unmasked elements or to load consecutively into adjacent positions).
 Either I am missing some trick or bitvector based compression is indeed tricky!
 Appreciate some feedback about this topic and if there's a possibility to review the issue in greater detail at one of the upcoming meetings, I can prepare some
 slides.
 Best Regards
 Nagendra
