Perhaps instead of using a bit vector to encode an entire matrix, we can use one to encode each sub block.
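There is a common sparse matrix format called BCSR (block CSR) that groups the nonzero values of CSR into small dense blocks, so that we can reduce col_ind storage (one column index per block instead of one per nonzero) and reuse entries of vector x.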
The main disadvantage of BCSR is that we have to pad each block with explicit zeros. Instead, we could use a bit mask to encode the nonzeros of a sub block, as in Nagendra's bit vector implementation, so that the padding overhead can be avoided.
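To make the idea concrete, here is a minimal Python sketch of one possible encoding. The names encode_block and decode_block are hypothetical, and the bit order (bit 0 = first element in row-major order) is an assumption of mine, not necessarily what Nagendra's implementation uses.

```python
def encode_block(block):
    """Encode a small dense block (list of rows) as (mask, packed).

    Assumed layout: bit i of mask is set when element i (row-major,
    bit 0 = first element) is nonzero. packed holds only the nonzero
    values, in the same order, so no padding zeros are stored.
    """
    mask = 0
    packed = []
    flat = [v for row in block for v in row]
    for i, v in enumerate(flat):
        if v != 0:
            mask |= 1 << i
            packed.append(v)
    return mask, packed


def decode_block(mask, packed, rows, cols):
    """Rebuild the dense rows x cols block from (mask, packed)."""
    it = iter(packed)
    flat = [next(it) if (mask >> i) & 1 else 0 for i in range(rows * cols)]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


# A 2x2 block with one zero: BCSR would store 4 values,
# the masked form stores 3 values plus a 4-bit mask.
mask, packed = encode_block([[3, 5], [0, 7]])
print(bin(mask), packed)  # → 0b1011 [3, 5, 7]
```

Note that the mask reads 1 1 0 1 in element order, which prints as 0b1011 because bit 0 is the rightmost digit.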
However, I could not find good reduction instructions for tiled matrix-vector multiplication when a block contains multiple rows.
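One sub block (2 rows x 2 columns, stored row by row):
mask   = 1 1 0 1
values = a b 0 d
x      = e f e f   (the x entries e, f repeated for each row)
fmul   = ae bf 0e df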
accumulate (reduction) = ae+bf, 0e+df
Thanks,
(Note that we can skip the zero computation 0e by using the bit mask.)
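The example above can be sketched in scalar Python as follows. This only illustrates the arithmetic (multiply, then reduce per row, skipping masked-out positions); it is not a vectorized kernel and does not solve the question of which SIMD reduction instruction to use. The function name masked_block_matvec and the bit order (bit 0 = first element in row-major order) are assumptions for illustration.

```python
def masked_block_matvec(mask, packed, x, rows, cols):
    """Compute y = B * x for one bit-mask-encoded block B (rows x cols).

    mask:   bit (r * cols + c) is set when B[r][c] is nonzero (assumed layout)
    packed: the nonzero values of B in row-major order, without padding
    x:      the cols entries of the input vector that this block touches
    """
    y = [0] * rows
    k = 0  # index of the next unread value in packed
    for r in range(rows):
        acc = 0
        for c in range(cols):
            if (mask >> (r * cols + c)) & 1:
                acc += packed[k] * x[c]  # the "fmul" step
                k += 1
            # A cleared bit means a zero entry: no multiply, nothing read.
        y[r] = acc  # the per-row reduction, e.g. ae+bf
    return y


# mask 1 1 0 1 in element order, with a=2, b=3, d=5 and e=7, f=11
print(masked_block_matvec(0b1011, [2, 3, 5], [7, 11], 2, 2))  # → [47, 55]
```

The first result is ae+bf = 14+33 and the second is df = 55, since the 0e term is never computed.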