Hi,

Perhaps instead of using a bit vector to encode the entire matrix, we can encode each sub-block.

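There is a common sparse matrix format called BCSR (block CSR) that groups the nonzero values of CSR into small dense blocks, so that we can reduce col_ind[] storage (one column index per block instead of one per nonzero) and reuse entries of vector x.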

The main disadvantage of BCSR is that we have to pad each block with explicit zeros. Instead, we could use a bit mask to encode the nonzeros of each sub-block, as in Nagendra's bit vector implementation, so that the padding overhead is avoided.
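To make the storage idea concrete, here is a minimal sketch of packing one dense block into a bit mask plus its nonzero values only. The function name `pack_block` and the bit order (row-major, first element in the most significant bit) are my own assumptions for illustration; they are not taken from Nagendra's implementation.

```python
def pack_block(block):
    """Pack a dense block (list of rows) into (mask, values).

    Assumed convention: elements are visited in row-major order and the
    first element ends up in the most significant bit of the mask.
    Only nonzero elements are stored in `values`, so no zero padding is kept.
    """
    mask = 0
    values = []
    for row in block:
        for v in row:
            # Shift in one bit per element: 1 for a nonzero, 0 for a zero.
            mask = (mask << 1) | (1 if v != 0 else 0)
            if v != 0:
                values.append(v)
    return mask, values


# Example with a=1, b=2, d=4 in the 2x2 block [[a, b], [0, d]].
mask, values = pack_block([[1.0, 2.0], [0.0, 4.0]])
```

With this convention the mask for a 2x2 block costs 4 bits, while the padded BCSR block would store one extra floating-point zero.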

However, I could not find good reduction instructions for tiled matrix-vector multiplication when a block has multiple rows.

One sub block:

A =

a b

0 d

Corresponding x:

x =

e

f

Bit vector (row-major, 1 = nonzero):

1 1 0 1

Computation:

a b 0 d

e f e f

fmul = ae bf 0e df

accumulate (reduction): ae+bf, 0e+df

(Note that we can skip the zero computation using the bit mask.)
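The computation above can be sketched in scalar form as follows. This is only an illustration of the indexing, not a vectorized kernel: the function name `block_matvec`, its parameters, and the bit order (row-major, first element in the most significant bit) are assumptions of mine, and the per-row accumulation stands in for the reduction step.

```python
def block_matvec(mask, values, x, rows, cols):
    """Compute y = A_block * x for a block stored as (mask, values).

    Assumed convention: bit (rows*cols - 1 - pos) of `mask` is set when the
    element at row-major position `pos` is nonzero; `values` holds only the
    nonzeros, in row-major order.
    """
    nbits = rows * cols
    y = [0.0] * rows
    k = 0  # index of the next packed nonzero
    for i in range(rows):
        for j in range(cols):
            pos = i * cols + j
            if (mask >> (nbits - 1 - pos)) & 1:
                # Multiply only where the mask marks a nonzero,
                # then accumulate into the row's partial sum.
                y[i] += values[k] * x[j]
                k += 1
    return y


# Block [[a, b], [0, d]] with a=1, b=2, d=4 and x = [e, f] = [5, 6].
y = block_matvec(0b1101, [1.0, 2.0, 4.0], [5.0, 6.0], 2, 2)
# y[0] = a*e + b*f, y[1] = d*f; the 0*e product is never formed.
```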

Thanks,

Dawei