Re: Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression

Nick Knight

Hi Nagendra,

The goal of the Zvediv extension is to add some support for sub-byte datatypes. So my hope is that applications that use bitvectors to encode nonzero structure will be supported via Zvediv. However, Zvediv is currently a very rough draft and likely won't be included in the base V-extension.

In the meanwhile, I suspect there are some tricks you could use. The following code fragment is a rough sketch of the idea I have in mind (I expect there are bugs.)

# a0 = number of matrix columns left in current row
# a1 = pointer to bitvector (current offset)
# a2 = pointer to matrix nonzero array (current offset)
# a3 = pointer to (dense) right-hand side array (current offset)
  vsetvli       t0, a0, e32, m8 # stripmine loop
    addi        t1, t0, 7
    srli        t1, t1, 3       # t1 = ceil(t0/8), the number of bitvector bytes that
                                # encode the nonzero structure of the next t0 columns
  vsetvli       x0, t1, e8, m8
  vle8          v0, (a1)        # load bitvector chunk for next t0 columns
  vsetvli       x0, t0, e8, m8
  vpopc.m       t2, v0          # t2 = number of nonzeros in chunk;
                                # **DANGER** sketchy type-punning on v0
  vid.v         v8
  vcompress.vm  v16, v8, v0
  vsetvli       x0, t2, e32, m8
  vle32         v0, (a2)        # load matrix nonzeros
  vlxei8        v8, (a3), v16   # gather RHS entries;
                                # **DANGER** sketchy type-punning on v16
# do dot product
# bump pointers and loop to next column chunk (or next row).

My hope was that the "sketchy type-punning" should succeed when SLEN = VLEN, but it will presumably fail when SLEN < VLEN. In the latter case, in the absence of typecast instructions, we'll need to use loads/stores (or uarch-dependent vrgathers) to permute data in the VRF.

Nick Knight

On Thu, May 7, 2020 at 11:43 AM Nagendra Gulur <> wrote:
I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplications (spMV) using RISCV vectors using bit-vector format of compressing the sparse matrix. The inner loop of sequential spMV algorithm simply multiply-accumulates non-zeroes of a row of M with corresponding values of the dense vector (say V[]) to form an output element.  To implement this efficiently where bit-vectors are used to compress M is where I have a question/observation. In bit-vector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is a 0. A 1 indicates that the matrix value is non-zero. A values array -- say M_Vals[.] -- stores these non-zero values. 

The basic idea of my implementation is to treat the bit vector as a mask for vector load and vector-multiply. But I am running into a couple of efficiency issues doing so. Stating them with an example vector size of 8..
1. When I load a byte of the bit vector -- I need to convert this byte to a suitable mask that I can store in V0. I need to perform a widening of the bits that are loaded into V0. For eg: if the SEW is 4 bytes, then the bits need to be loaded into V0 at SEW byte gaps of 2 bits. Is this something doable in the current or upcoming specifications? 

2. Even if the above is doable with some widening control, I can not use the bit vector or the mask to load from M_Vals[.]. Since M_Vals[.] is densely storing the non-zeroes of M[][], I just need to access X consecutive array elements where X = number of 1s in the mask. I will need to do a 1s-count on the mask (using vpopc instruction), and then convert that count to a new mask which will allow me to do a vector load of X adjacent elements. If the original bit vector has 10011001, then the 1s-count = 4 and I need to convert this to a new mask 00001111. I am not sure how to efficiently do this.

3. Even after I do the above, the loaded values of M_Vals[.] will now line up into a vector register as adjacent elements while the bit-vector masked load of V[.] has loaded these elements into corresponding unmasked element positions. I can not multiply these registers element-wise. I have to align them before I can do the multiplies.

Given these overheads, it appears that vectorizing a bit-vector compressed spMV is rather hopeless.

Thinking about useful vector instructions in this context, I could come up with a couple of suggestions:
1. Load a bit vector into V0 expanding each bit into its own element mask in V0. For eg: if the original bit vector is the byte "10011001" and SEW = 4 bytes, then V0 is loaded with "01 00 00 01 01 00 00 01". That is, we need the instruction to perform an automatic mask-element-sized expansion of bits.

2. After doing vpopc to get the 1s count into a scalar, to convert the scalar back into a vector mask for that many adjacent 1s in the mask. With the above example, vpopc would return 4 into a scalar register. Now we need an instruction that loads "00 00 00 00 01 01 01 01" back to V0 converting the 4 in the scalar to 4 successive elements that are un-masked.

3. For the alignment problem between M_Vals[.] and V[.], it might just be sufficient to add a 1-bit control to the vector loads. This bit can control whether the masked-loads load into element-positions or load into consecutive positions starting at position 0. If such a control was available, then one could load V[.] using an indexed vector load but load into adjacent positions. That would make the elements pairwise aligned and compatible for pairwise multiplication. With this load semantic, I am separating controlling the load sources (via the mask) from where the load values are stored in the register (using the same mask to load at unmasked elements or to load consecutively into adjacent positions).

Either I am missing some trick or bit-vector based compression is indeed tricky! 

Appreciate some feedback about this topic and if there's a possibility to review the issue in greater detail at one of the upcoming meetings, I can prepare some slides.

Best Regards

Join to automatically receive all group messages.