Re: Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression
here is dongarra’s take on HPCG. hope this helps.
For the SpMV, the vector lengths in the reference kernel average about 25, with indexed reads (gathers) and summation in to a single value. This is the inner loop of the sparse row format MV. This is the basic pseudo code:
I believe that the (rough) idea I sketched earlier in this thread (May 8) still works with the latest version of the spec --- please correct me if I'm wrong --- what I called "sketchy type-punning" (of the mask register) is now kosher. A potential issue in my sketch is the vector-to-scalar dependence caused by vpopc.m: I can imagine this performing poorly on implementations with decoupled scalar and vector pipes. If I correctly understand Krste's suggestion of using viota.m, it would avoid this vpopc.m and a few other mask operations at the cost of converting the unit-stride load of matrix nonzeros into a gather.
Please keep in mind this is in the context of the "bitvector" sparse matrix representation that Nagendra was considering. I'm not aware of anyone using this representation for the SpMVs appearing in the HPCG benchmark, or more generally in HPC applications. (In a private email thread, Nagendra told me he had machine learning applications in mind.) The "standard" CSR SpMV algorithm is much simpler to express in RVV. Also regarding HPCG, I think it would be more interesting (and challenging) to study the preconditioner ("SymGS") than the SpMV.