Re: Sparse MatrixVector Multiply (again) and BitVector Compression
Krste Asanovic
For the code segment given, Blelloch's loop raking approach would be
worth exploring for the V extension. This approach involves large constant stride accesses to A[] and col[j] array but will keep even big vector units completely busy when nrows is big, and doesn't need reduction instructions. http://www.cs.cmu.edu/~scandal/papers/CMUCS93173.html Using strided segment operations and some inner loop unrolling can increase the spatial locality of accessing the A[] and col[] array. Brute force vectorization of the inner loop might work better on short vector machines using reduction operations. Krste  here is dongarra’s take on HPCG. hope this helps.On Thu, 9 Jul 2020 08:38:00 0400, swallach <steven.wallach@bsc.es> said:  For the SpMV, the vector lengths in the reference kernel average about 25, with indexed reads (gathers) and summation in to a single value. This is the inner loop of the sparse row format MV. This is the basic pseudo code:   for (i=0; i<nrow; ++i) { // Number of rows on this MPI process (big)  double sum = 0.0;  for (j=ptr[i]; j<ptr[i+1]; ++j) sum += A[j] * x[col[j]]; // This loop is on average length 25  y[i] = sum;  }   For the SymGS, the vector lengths is on average 12, with the same index read pattern, except that the loop does not naturally vectorize since the SymGS operation is like a triangular solve, recursive.   Optimized implementations of SpMV tend to reorder the matrix data structure so that the SpMV loops are still indexed reads, but are of length nrow (same as axpy), using, for example, the Jagged Diagonal or Ellpack format.   Optimized implementations of SymGS are also reordered to get longer vector lengths, but typically are a fraction of nrow, or much smaller, depending on whether targeted for the GPU or a CPU. The GPU approach uses multicoloring, so that vector lengths are approximately nrow/8. The CPU approach will use levelscheduling where vector lengths will vary a lot, but are typically in the range of 15  100. The GPU approach takes more iterations, which is penalized by HPCG. All optimized SymGS approaches use indexed reads.   I hope this is helpful.  Jack  ——————————  I believe that the (rough) idea I sketched earlier in this thread (May 8) still works with the latest version of the spec  please correct me if I'm wrong  what I called "sketchy typepunning" (of the mask register) is now kosher. A potential issue in my sketch is the vectortoscalar dependence caused by vpopc.m: I can imagine this performing poorly on implementations with decoupled scalar and vector pipes. If I correctly understand Krste's suggestion of using viota.m, it would avoid this vpopc.m and a few other mask operations at the cost of converting the unitstride load of matrix nonzeros into a gather.  Please keep in mind this is in the context of the "bitvector" sparse matrix representation that Nagendra was considering. I'm not aware of anyone using this representation for the SpMVs appearing in the HPCG benchmark, or more generally in HPC applications. (In a private email thread, Nagendra told me he had machine learning applications in mind.) The "standard" CSR SpMV algorithm is much simpler to express in RVV. Also regarding HPCG, I think it would be more interesting (and challenging) to study the preconditioner ("SymGS") than the SpMV.  Best,  Nick  http://bsc.es/disclaimer

