Date
1  13 of 13
Sparse MatrixVector Multiply (again) and BitVector Compression
lidawei14@...
Hi all,
If I use EDIV to compute SpMV y = A * x as size r * c blocks, I might have to load size r of y and size c of x, these are shorter than VL = r * c, is there an efficient way to do this by current support? If I would like to use a mask to compress, for VL = 16, I can store 16bit value to memory and load it to GPR, then how can I transform it to a vector mask? Thank you, Dawei


Krste Asanovic
The main reason not to include in v1.0 is that it has many details to work through, and resolving these would delay 1.0 by many months. Krste


lidawei14@...
Hi all,
Thank you Nick for the reply. I saw EDIV will not be included in v1.0, any issues to be resolved? Can I have a look at the discussion page on EDIV? Thanks a lot, Dawei


Nick Knight
Hi Dawei, Unfortunately the Vextension does not currently feature blockmatrix multiplication instructions, including the bitmasked versions that you're describing. Some of these operations are intended to be addressed by a future (sub) extension (perhaps EDIV). Presently, you can apply your scheme directly to the restricted class of BCSR matrices with 1dimensional blocks. You can imagine various (nonzerostructuredependent) tradeoffs between using rby1 and 1byc blocks. More generally you can recursively compose (singly or doubly compressed) blockCS{R,C} formats, encoding indices using bitvectors or lists at any level of the recursion. Best, Nick
Hi，


lidawei14@...
Hi，
Perhaps instead of using bit vector to encode an entire matrix, we can encode a sub block. There is a common sparse matrix format called BCSR that blocks the nonzero values of CSR, so that we can reduce col_ind[] storage and reused vector x. The main disadvantage of BCSR is we have to pad zeros, where we can actually use a bit mask to encode nonzeros of a sub block as Nagendra's bit vector implementation so that the overhead can be avoided. I could not find good reduction instructions for tiled matrix vector multiplications if we have multiple rows in a block. One sub block: A =
a b 0 d
Corresponding x:
x =
e f
Bit vector: 1 1 0 1 Computation:
a b 0 d
e f e f
fmul = ae bf 0e df
accumulate (reduction) ae+bf,0e+df
Thanks,(Note we can skip that zero computation using bit mask). Dawei


Krste Asanovic
For the code segment given, Blelloch's loop raking approach would be
worth exploring for the V extension. This approach involves large constant stride accesses to A[] and col[j] array but will keep even big vector units completely busy when nrows is big, and doesn't need reduction instructions. http://www.cs.cmu.edu/~scandal/papers/CMUCS93173.html Using strided segment operations and some inner loop unrolling can increase the spatial locality of accessing the A[] and col[] array. Brute force vectorization of the inner loop might work better on short vector machines using reduction operations. Krste  here is dongarra’s take on HPCG. hope this helps.On Thu, 9 Jul 2020 08:38:00 0400, swallach <steven.wallach@...> said:  For the SpMV, the vector lengths in the reference kernel average about 25, with indexed reads (gathers) and summation in to a single value. This is the inner loop of the sparse row format MV. This is the basic pseudo code:   for (i=0; i<nrow; ++i) { // Number of rows on this MPI process (big)  double sum = 0.0;  for (j=ptr[i]; j<ptr[i+1]; ++j) sum += A[j] * x[col[j]]; // This loop is on average length 25  y[i] = sum;  }   For the SymGS, the vector lengths is on average 12, with the same index read pattern, except that the loop does not naturally vectorize since the SymGS operation is like a triangular solve, recursive.   Optimized implementations of SpMV tend to reorder the matrix data structure so that the SpMV loops are still indexed reads, but are of length nrow (same as axpy), using, for example, the Jagged Diagonal or Ellpack format.   Optimized implementations of SymGS are also reordered to get longer vector lengths, but typically are a fraction of nrow, or much smaller, depending on whether targeted for the GPU or a CPU. The GPU approach uses multicoloring, so that vector lengths are approximately nrow/8. The CPU approach will use levelscheduling where vector lengths will vary a lot, but are typically in the range of 15  100. The GPU approach takes more iterations, which is penalized by HPCG. All optimized SymGS approaches use indexed reads.   I hope this is helpful.  Jack  ——————————  I believe that the (rough) idea I sketched earlier in this thread (May 8) still works with the latest version of the spec  please correct me if I'm wrong  what I called "sketchy typepunning" (of the mask register) is now kosher. A potential issue in my sketch is the vectortoscalar dependence caused by vpopc.m: I can imagine this performing poorly on implementations with decoupled scalar and vector pipes. If I correctly understand Krste's suggestion of using viota.m, it would avoid this vpopc.m and a few other mask operations at the cost of converting the unitstride load of matrix nonzeros into a gather.  Please keep in mind this is in the context of the "bitvector" sparse matrix representation that Nagendra was considering. I'm not aware of anyone using this representation for the SpMVs appearing in the HPCG benchmark, or more generally in HPC applications. (In a private email thread, Nagendra told me he had machine learning applications in mind.) The "standard" CSR SpMV algorithm is much simpler to express in RVV. Also regarding HPCG, I think it would be more interesting (and challenging) to study the preconditioner ("SymGS") than the SpMV.  Best,  Nick  http://bsc.es/disclaimer


swallach
here is dongarra’s take on HPCG. hope this helps.
For the SpMV, the vector lengths in the reference kernel average about 25, with indexed reads (gathers) and summation in to a single value. This is the inner loop of the sparse row format MV. This is the basic pseudo code: —————————— I believe that the (rough) idea I sketched earlier in this thread (May 8) still works with the latest version of the spec  please correct me if I'm wrong  what I called "sketchy typepunning" (of the mask register) is now kosher. A potential issue in my sketch is the vectortoscalar dependence caused by vpopc.m: I can imagine this performing poorly on implementations with decoupled scalar and vector pipes. If I correctly understand Krste's suggestion of using viota.m, it would avoid this vpopc.m and a few other mask operations at the cost of converting the unitstride load of matrix nonzeros into a gather. Please keep in mind this is in the context of the "bitvector" sparse matrix representation that Nagendra was considering. I'm not aware of anyone using this representation for the SpMVs appearing in the HPCG benchmark, or more generally in HPC applications. (In a private email thread, Nagendra told me he had machine learning applications in mind.) The "standard" CSR SpMV algorithm is much simpler to express in RVV. Also regarding HPCG, I think it would be more interesting (and challenging) to study the preconditioner ("SymGS") than the SpMV. Best, Nick http://bsc.es/disclaimer


Nick Knight
I believe that the (rough) idea I sketched earlier in this thread (May 8) still works with the latest version of the spec  please correct me if I'm wrong  what I called "sketchy typepunning" (of the mask register) is now kosher. A potential issue in my sketch is the vectortoscalar dependence caused by vpopc.m: I can imagine this performing poorly on implementations with decoupled scalar and vector pipes. If I correctly understand Krste's suggestion of using viota.m, it would avoid this vpopc.m and a few other mask operations at the cost of converting the unitstride load of matrix nonzeros into a gather. Please keep in mind this is in the context of the "bitvector" sparse matrix representation that Nagendra was considering. I'm not aware of anyone using this representation for the SpMVs appearing in the HPCG benchmark, or more generally in HPC applications. (In a private email thread, Nagendra told me he had machine learning applications in mind.) The "standard" CSR SpMV algorithm is much simpler to express in RVV. Also regarding HPCG, I think it would be more interesting (and challenging) to study the preconditioner ("SymGS") than the SpMV. Best, Nick
On Wed, Jul 8, 2020 at 3:11 PM swallach <steven.wallach@...> wrote:


swallach
please share the asm for spmv, the key kernel (s),
in any case, the execution time for operations using a mask, is very implementation/machine dependent it is a function on how aggressive, in hardware, addressing the vector register elements is. is addressing always sequential or addresses generated out of sequence based on the bit mask (in V0 as a result of the increasing performance of HPCG, spmv is an extremely important computation kernel. in my experience, the dominant time for HPCG, is the memory latency time (short vectors). also node to node networking ———  I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplications (spMV) using RISCV vectors using bitvector format of
 compressing the sparse matrix. The inner loop of sequential spMV algorithm simply multiplyaccumulates nonzeroes of a row of M with corresponding values of the  dense vector (say V[]) to form an output element. To implement this efficiently where bitvectors are used to compress M is where I have a question/observation.  In bitvector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is a 0. A 1 indicates that the matrix value is  nonzero. A values array  say M_Vals[.]  stores these nonzero values. WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received. http://www.bsc.es/disclaimer


Krste Asanovic
 I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplications (spMV) using RISCV vectors using bitvector format ofOn Thu, 07 May 2020 11:43:27 0700, "Nagendra Gulur" <nagendra.gd@...> said:  compressing the sparse matrix. The inner loop of sequential spMV algorithm simply multiplyaccumulates nonzeroes of a row of M with corresponding values of the  dense vector (say V[]) to form an output element. To implement this efficiently where bitvectors are used to compress M is where I have a question/observation.  In bitvector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is a 0. A 1 indicates that the matrix value is  nonzero. A values array  say M_Vals[.]  stores these nonzero values.  The basic idea of my implementation is to treat the bit vector as a mask for vector load and vectormultiply. But I am running into a couple of efficiency issues  doing so. Stating them with an example vector size of 8..  1. When I load a byte of the bit vector  I need to convert this byte to a suitable mask that I can store in V0. I need to perform a widening of the bits that are  loaded into V0. For eg: if the SEW is 4 bytes, then the bits need to be loaded into V0 at SEW byte gaps of 2 bits. Is this something doable in the current or  upcoming specifications? Masks are now stored as a packed bit vector in registers, in same format as a packed bit vector would be in memory.  2. Even if the above is doable with some widening control, I can not use the bit vector or the mask to load from M_Vals[.]. Since M_Vals[.] is densely storing the  nonzeroes of M[][], I just need to access X consecutive array elements where X = number of 1s in the mask. I will need to do a 1scount on the mask (using vpopc  instruction), and then convert that count to a new mask which will allow me to do a vector load of X adjacent elements. If the original bit vector has 10011001,  then the 1scount = 4 and I need to convert this to a new mask 00001111. I am not sure how to efficiently do this. The viota.m instruction is what you need. Look at section 16.8 in spec. Krste  3. Even after I do the above, the loaded values of M_Vals[.] will now line up into a vector register as adjacent elements while the bitvector masked load of V[.]  has loaded these elements into corresponding unmasked element positions. I can not multiply these registers elementwise. I have to align them before I can do the  multiplies.  Given these overheads, it appears that vectorizing a bitvector compressed spMV is rather hopeless.  Thinking about useful vector instructions in this context, I could come up with a couple of suggestions:  1. Load a bit vector into V0 expanding each bit into its own element mask in V0. For eg: if the original bit vector is the byte "10011001" and SEW = 4 bytes, then  V0 is loaded with "01 00 00 01 01 00 00 01". That is, we need the instruction to perform an automatic maskelementsized expansion of bits.  2. After doing vpopc to get the 1s count into a scalar, to convert the scalar back into a vector mask for that many adjacent 1s in the mask. With the above  example, vpopc would return 4 into a scalar register. Now we need an instruction that loads "00 00 00 00 01 01 01 01" back to V0 converting the 4 in the scalar to  4 successive elements that are unmasked.  3. For the alignment problem between M_Vals[.] and V[.], it might just be sufficient to add a 1bit control to the vector loads. This bit can control whether the  maskedloads load into elementpositions or load into consecutive positions starting at position 0. If such a control was available, then one could load V[.] using  an indexed vector load but load into adjacent positions. That would make the elements pairwise aligned and compatible for pairwise multiplication. With this load  semantic, I am separating controlling the load sources (via the mask) from where the load values are stored in the register (using the same mask to load at  unmasked elements or to load consecutively into adjacent positions).  Either I am missing some trick or bitvector based compression is indeed tricky!  Appreciate some feedback about this topic and if there's a possibility to review the issue in greater detail at one of the upcoming meetings, I can prepare some  slides.  Best Regards  Nagendra 


Nick Knight
PS: Instead of loading the RHS entries with an indexed load, we could also use a unitstride load (possibly masked), followed by a vcompress. This didn't seem like a clear win, and required additional mask trickery, so I didn't pursue it.


Nick Knight
Hi Nagendra, The goal of the Zvediv extension is to add some support for subbyte datatypes. So my hope is that applications that use bitvectors to encode nonzero structure will be supported via Zvediv. However, Zvediv is currently a very rough draft and likely won't be included in the base Vextension. In the meanwhile, I suspect there are some tricks you could use. The following code fragment is a rough sketch of the idea I have in mind (I expect there are bugs.) # a0 = number of matrix columns left in current row # a1 = pointer to bitvector (current offset) # a2 = pointer to matrix nonzero array (current offset) # a3 = pointer to (dense) righthand side array (current offset) vsetvli t0, a0, e32, m8 # stripmine loop addi t1, t0, 7 srli t1, t1, 3 # t1 = ceil(t0/8), the number of bitvector bytes that # encode the nonzero structure of the next t0 columns vsetvli x0, t1, e8, m8 vle8 v0, (a1) # load bitvector chunk for next t0 columns vsetvli x0, t0, e8, m8 vpopc.m t2, v0 # t2 = number of nonzeros in chunk; # **DANGER** sketchy typepunning on v0 vid.v v8 vcompress.vm v16, v8, v0 vsetvli x0, t2, e32, m8 vle32 v0, (a2) # load matrix nonzeros vlxei8 v8, (a3), v16 # gather RHS entries; # **DANGER** sketchy typepunning on v16 # do dot product # bump pointers and loop to next column chunk (or next row). My hope was that the "sketchy typepunning" should succeed when SLEN = VLEN, but it will presumably fail when SLEN < VLEN. In the latter case, in the absence of typecast instructions, we'll need to use loads/stores (or uarchdependent vrgathers) to permute data in the VRF. Best, Nick Knight
On Thu, May 7, 2020 at 11:43 AM Nagendra Gulur <nagendra.gd@...> wrote: I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplications (spMV) using RISCV vectors using bitvector format of compressing the sparse matrix. The inner loop of sequential spMV algorithm simply multiplyaccumulates nonzeroes of a row of M with corresponding values of the dense vector (say V[]) to form an output element. To implement this efficiently where bitvectors are used to compress M is where I have a question/observation. In bitvector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is a 0. A 1 indicates that the matrix value is nonzero. A values array  say M_Vals[.]  stores these nonzero values.


Nagendra Gulur
I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplications (spMV) using RISCV vectors using bitvector format of compressing the sparse matrix. The inner loop of sequential spMV algorithm simply multiplyaccumulates nonzeroes of a row of M with corresponding values of the dense vector (say V[]) to form an output element. To implement this efficiently where bitvectors are used to compress M is where I have a question/observation. In bitvector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is a 0. A 1 indicates that the matrix value is nonzero. A values array  say M_Vals[.]  stores these nonzero values.
The basic idea of my implementation is to treat the bit vector as a mask for vector load and vectormultiply. But I am running into a couple of efficiency issues doing so. Stating them with an example vector size of 8.. 1. When I load a byte of the bit vector  I need to convert this byte to a suitable mask that I can store in V0. I need to perform a widening of the bits that are loaded into V0. For eg: if the SEW is 4 bytes, then the bits need to be loaded into V0 at SEW byte gaps of 2 bits. Is this something doable in the current or upcoming specifications? 2. Even if the above is doable with some widening control, I can not use the bit vector or the mask to load from M_Vals[.]. Since M_Vals[.] is densely storing the nonzeroes of M[][], I just need to access X consecutive array elements where X = number of 1s in the mask. I will need to do a 1scount on the mask (using vpopc instruction), and then convert that count to a new mask which will allow me to do a vector load of X adjacent elements. If the original bit vector has 10011001, then the 1scount = 4 and I need to convert this to a new mask 00001111. I am not sure how to efficiently do this. 3. Even after I do the above, the loaded values of M_Vals[.] will now line up into a vector register as adjacent elements while the bitvector masked load of V[.] has loaded these elements into corresponding unmasked element positions. I can not multiply these registers elementwise. I have to align them before I can do the multiplies. Given these overheads, it appears that vectorizing a bitvector compressed spMV is rather hopeless. Thinking about useful vector instructions in this context, I could come up with a couple of suggestions: 1. Load a bit vector into V0 expanding each bit into its own element mask in V0. For eg: if the original bit vector is the byte "10011001" and SEW = 4 bytes, then V0 is loaded with "01 00 00 01 01 00 00 01". That is, we need the instruction to perform an automatic maskelementsized expansion of bits. 2. After doing vpopc to get the 1s count into a scalar, to convert the scalar back into a vector mask for that many adjacent 1s in the mask. With the above example, vpopc would return 4 into a scalar register. Now we need an instruction that loads "00 00 00 00 01 01 01 01" back to V0 converting the 4 in the scalar to 4 successive elements that are unmasked. 3. For the alignment problem between M_Vals[.] and V[.], it might just be sufficient to add a 1bit control to the vector loads. This bit can control whether the maskedloads load into elementpositions or load into consecutive positions starting at position 0. If such a control was available, then one could load V[.] using an indexed vector load but load into adjacent positions. That would make the elements pairwise aligned and compatible for pairwise multiplication. With this load semantic, I am separating controlling the load sources (via the mask) from where the load values are stored in the register (using the same mask to load at unmasked elements or to load consecutively into adjacent positions). Either I am missing some trick or bitvector based compression is indeed tricky! Appreciate some feedback about this topic and if there's a possibility to review the issue in greater detail at one of the upcoming meetings, I can prepare some slides. Best Regards Nagendra

