#### Re: Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression

Krste Asanovic

For the code segment given, Blelloch's loop raking approach would be
worth exploring for the V extension. This approach involves large
constant stride accesses to A[] and col[j] array but will keep even
big vector units completely busy when nrows is big, and doesn't need
reduction instructions.

http://www.cs.cmu.edu/~scandal/papers/CMU-CS-93-173.html

Using strided segment operations and some inner loop unrolling can
increase the spatial locality of accessing the A[] and col[] array.

Brute force vectorization of the inner loop might work better on short
vector machines using reduction operations.

Krste

On Thu, 9 Jul 2020 08:38:00 -0400, swallach <steven.wallach@...> said:
| here is dongarra’s take on HPCG. hope this helps.

||| For the SpMV, the vector lengths in the reference kernel average about 25, with indexed reads (gathers) and summation in to a single value. This is the inner loop of the sparse row format MV. This is the basic pseudo code:
|||
||| for (i=0; i<nrow; ++i) { // Number of rows on this MPI process (big)
||| double sum = 0.0;
||| for (j=ptr[i]; j<ptr[i+1]; ++j) sum += A[j] * x[col[j]]; // This loop is on average length 25
||| y[i] = sum;
||| }
|||
||| For the SymGS, the vector lengths is on average 12, with the same index read pattern, except that the loop does not naturally vectorize since the SymGS operation is like a triangular solve, recursive.
|||
||| Optimized implementations of SpMV tend to re-order the matrix data structure so that the SpMV loops are still indexed reads, but are of length nrow (same as axpy), using, for example, the Jagged Diagonal or Ellpack format.
|||
||| Optimized implementations of SymGS are also reordered to get longer vector lengths, but typically are a fraction of nrow, or much smaller, depending on whether targeted for the GPU or a CPU. The GPU approach uses multicoloring, so that vector lengths are approximately nrow/8. The CPU approach will use level-scheduling where vector lengths will vary a lot, but are typically in the range of 15 - 100. The GPU approach takes more iterations, which is penalized by HPCG. All optimized SymGS approaches use indexed reads.
|||
||| I hope this is helpful.
||| Jack

| ——————————

| I believe that the (rough) idea I sketched earlier in this thread (May 8) still works with the latest version of the spec --- please correct me if I'm wrong --- what I called "sketchy type-punning" (of the mask register) is now kosher. A potential issue in my sketch is the vector-to-scalar dependence caused by vpopc.m: I can imagine this performing poorly on implementations with decoupled scalar and vector pipes. If I correctly understand Krste's suggestion of using viota.m, it would avoid this vpopc.m and a few other mask operations at the cost of converting the unit-stride load of matrix nonzeros into a gather.

| Please keep in mind this is in the context of the "bitvector" sparse matrix representation that Nagendra was considering. I'm not aware of anyone using this representation for the SpMVs appearing in the HPCG benchmark, or more generally in HPC applications. (In a private email thread, Nagendra told me he had machine learning applications in mind.) The "standard" CSR SpMV algorithm is much simpler to express in RVV. Also regarding HPCG, I think it would be more interesting (and challenging) to study the preconditioner ("SymGS") than the SpMV.

| Best,
| Nick

| http://bsc.es/disclaimer

Join tech-vector-ext@lists.riscv.org to automatically receive all group messages.