Sparse Matrix-Vector Multiply (again) and Bit-Vector Compression
lidawei14@...
Hi all,
If I use EDIV to compute SpMV y = A * x with r * c blocks, I have to load r elements of y and c elements of x; both are shorter than VL = r * c. Is there an efficient way to do this with the current support? Also, if I want to use a mask to compress, then for VL = 16 I can store a 16-bit value to memory and load it into a GPR, but how can I transform it into a vector mask? Thank you, Dawei
The main reason not to include EDIV in v1.0 is that it has many details to work through, and resolving these would delay v1.0 by many months. Krste
lidawei14@...
Hi all,
Thank you, Nick, for the reply. I saw that EDIV will not be included in v1.0; what issues remain to be resolved? Could I have a look at the discussion page on EDIV? Thanks a lot, Dawei
Hi Dawei, Unfortunately the V-extension does not currently feature block-matrix multiplication instructions, including the bit-masked versions that you're describing. Some of these operations are intended to be addressed by a future (sub-)extension (perhaps EDIV). Presently, you can apply your scheme directly to the restricted class of BCSR matrices with 1-dimensional blocks. You can imagine various (nonzero-structure-dependent) tradeoffs between using r-by-1 and 1-by-c blocks. More generally, you can recursively compose (singly or doubly compressed) block-CS{R,C} formats, encoding indices using bitvectors or lists at any level of the recursion. Best, Nick
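For concreteness, here is a minimal C sketch of what such a bit-masked BCSR container might look like. All names and field choices are illustrative assumptions, not part of any existing library or spec:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical BCSR container with r-by-c blocks.  When `mask` is
 * non-NULL, each block stores only its nonzeros packed in row-major
 * order (bit i of mask[b] set => element i of block b is nonzero),
 * which avoids the zero-padding of plain BCSR. */
typedef struct {
    size_t    nblockrows;  /* number of block rows                    */
    size_t    r, c;        /* block dimensions (r*c <= 16 here)       */
    size_t   *ptr;         /* block-row offsets, length nblockrows+1  */
    size_t   *col;         /* block-column index of each block        */
    double   *vals;        /* padded blocks, or packed nonzeros       */
    uint16_t *mask;        /* per-block nonzero bitmaps (optional)    */
} bcsr_t;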
lidawei14@...
Hi,
Perhaps instead of using a bit vector to encode an entire matrix, we can encode a sub-block. There is a common sparse matrix format called BCSR that blocks the nonzero values of CSR, so that we can reduce col_ind[] storage and reuse the vector x. The main disadvantage of BCSR is that we have to pad with zeros; instead, we can use a bit mask to encode the nonzeros of a sub-block, as in Nagendra's bit-vector implementation, so that this overhead is avoided. However, I could not find good reduction instructions for tiled matrix-vector multiplication when a block has multiple rows. One 2x2 sub-block:

A =
a b
0 d

Corresponding x:

x = e f

Bit vector: 1 1 0 1

Computation:

A (flattened):  a  b  0  d
x (replicated): e  f  e  f
fmul =          ae bf 0e df
accumulate (reduction): ae+bf, 0e+df

(Note we can skip the zero computation using the bit mask.) Thanks, Dawei
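A scalar C sketch of the masked sub-block kernel described above (a model of the intent, not RVV code; the function name and the row-major bit ordering are illustrative assumptions):

#include <stdint.h>

/* Multiply one r-by-c sub-block into y, skipping zeros via the bitmap.
 * vals[] stores only the block's nonzeros, packed row-major; bit
 * (i*c + j) of mask is set iff block element (i,j) is nonzero.
 * For the 2x2 example above, mask = 0b1011 and vals = {a, b, d}. */
static void block_spmv(int r, int c, uint16_t mask,
                       const double *vals, const double *x, double *y)
{
    int nz = 0;                        /* index into packed vals[] */
    for (int i = 0; i < r; i++)
        for (int j = 0; j < c; j++)
            if (mask & (UINT16_C(1) << (i * c + j)))
                y[i] += vals[nz++] * x[j];
}

A vectorized version would replicate x across the block rows, use the bit mask to skip the zero products in the fmul, and finish with a per-row reduction, as in the worked example.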
For the code segment given, Blelloch's loop-raking approach would be worth exploring for the V extension. This approach involves large constant-stride accesses to the A[] and col[] arrays, but it will keep even big vector units completely busy when nrows is big, and it doesn't need reduction instructions. http://www.cs.cmu.edu/~scandal/papers/CMU-CS-93-173.html

Using strided segment operations and some inner-loop unrolling can increase the spatial locality of accessing the A[] and col[] arrays. Brute-force vectorization of the inner loop might work better on short-vector machines using reduction operations.

Krste
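As a rough scalar model of the raking idea (an illustrative sketch under simplifying assumptions, not Blelloch's exact formulation; VL and the chunking are assumptions), the following C mirrors how each of VL lanes would walk its own contiguous chunk of the nonzero stream:

/* Scalar model of loop raking for CSR SpMV: the nnz-long nonzero
 * stream is cut into VL equal chunks, and "lane" k walks chunk k.
 * At vector step t, lane k touches element k*chunk + t, so the
 * accesses to A[] and col[] across lanes form a large constant
 * stride, and no reduction instruction is needed: partial sums are
 * flushed into y[] (with +=) at row boundaries.  Assumes y[] is
 * zero-initialized. */
void spmv_raked(int nrow, const int *ptr, const int *col,
                const double *A, const double *x, double *y)
{
    enum { VL = 8 };                        /* illustrative lane count */
    int nnz = ptr[nrow];
    int chunk = (nnz + VL - 1) / VL;
    for (int k = 0; k < VL; k++) {          /* each lane ... */
        int lo = k * chunk;
        int hi = lo + chunk < nnz ? lo + chunk : nnz;
        if (lo >= hi) continue;
        int r = 0;                          /* row containing nonzero lo */
        while (ptr[r + 1] <= lo) r++;
        double sum = 0.0;
        for (int j = lo; j < hi; j++) {     /* ... walks its chunk */
            sum += A[j] * x[col[j]];
            while (r < nrow && j + 1 == ptr[r + 1]) {  /* row boundary
                                               (loop handles empty rows) */
                y[r++] += sum;
                sum = 0.0;
            }
        }
        if (r < nrow) y[r] += sum;          /* flush partial last row;
                                               the next lane adds the rest */
    }
}

The vector version replaces the k loop with VL parallel lanes and handles the row-boundary bookkeeping with masks, which is where the segmented operations from the paper come in.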
Here is Dongarra's take on HPCG. Hope this helps.

For the SpMV, the vector lengths in the reference kernel average about 25, with indexed reads (gathers) and summation into a single value. This is the inner loop of the sparse-row-format MV. This is the basic pseudo-code:

for (i=0; i<nrow; ++i) {   // Number of rows on this MPI process (big)
  double sum = 0.0;
  for (j=ptr[i]; j<ptr[i+1]; ++j) sum += A[j] * x[col[j]];   // This loop is on average length 25
  y[i] = sum;
}

For the SymGS, the vector lengths are on average 12, with the same indexed-read pattern, except that the loop does not naturally vectorize since the SymGS operation is like a triangular solve: recursive.

Optimized implementations of SpMV tend to reorder the matrix data structure so that the SpMV loops are still indexed reads, but are of length nrow (same as axpy), using, for example, the Jagged Diagonal or Ellpack format.

Optimized implementations of SymGS are also reordered to get longer vector lengths, but these are typically a fraction of nrow, or much smaller, depending on whether the target is a GPU or a CPU. The GPU approach uses multicoloring, so that vector lengths are approximately nrow/8. The CPU approach will use level scheduling, where vector lengths vary a lot but are typically in the range of 15 - 100. The GPU approach takes more iterations, which is penalized by HPCG. All optimized SymGS approaches use indexed reads.

I hope this is helpful.
Jack
I believe that the (rough) idea I sketched earlier in this thread (May 8) still works with the latest version of the spec --- please correct me if I'm wrong --- what I called "sketchy type-punning" (of the mask register) is now kosher. A potential issue in my sketch is the vector-to-scalar dependence caused by vpopc.m: I can imagine this performing poorly on implementations with decoupled scalar and vector pipes. If I correctly understand Krste's suggestion of using viota.m, it would avoid this vpopc.m and a few other mask operations, at the cost of converting the unit-stride load of matrix nonzeros into a gather.

Please keep in mind this is in the context of the "bitvector" sparse matrix representation that Nagendra was considering. I'm not aware of anyone using this representation for the SpMVs appearing in the HPCG benchmark, or more generally in HPC applications. (In a private email thread, Nagendra told me he had machine-learning applications in mind.) The "standard" CSR SpMV algorithm is much simpler to express in RVV. Also, regarding HPCG, I think it would be more interesting (and challenging) to study the preconditioner ("SymGS") than the SpMV.

Best,
Nick
Please share the asm for spmv, the key kernel(s).

In any case, the execution time for operations using a mask is very implementation/machine dependent: it is a function of how aggressive the hardware is in addressing the vector register elements. Is addressing always sequential, or are addresses generated out of sequence based on the bit mask in v0?

Given the increasing importance of HPCG, spmv is an extremely important computation kernel. In my experience, the dominant time for HPCG is memory latency (short vectors), and also node-to-node networking.
On Thu, 07 May 2020 11:43:27 -0700, "Nagendra Gulur" <nagendra.gd@...> said:

| I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplications (spMV) using RISCV vectors using bit-vector format of
| compressing the sparse matrix. The inner loop of sequential spMV algorithm simply multiply-accumulates non-zeroes of a row of M with corresponding values of the
| dense vector (say V[]) to form an output element. To implement this efficiently where bit-vectors are used to compress M is where I have a question/observation.
| In bit-vector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is a 0. A 1 indicates that the matrix value is
| non-zero. A values array -- say M_Vals[.] -- stores these non-zero values.
| The basic idea of my implementation is to treat the bit vector as a mask for vector load and vector-multiply. But I am running into a couple of efficiency issues
| doing so. Stating them with an example vector size of 8..
| 1. When I load a byte of the bit vector -- I need to convert this byte to a suitable mask that I can store in V0. I need to perform a widening of the bits that are
| loaded into V0. For eg: if the SEW is 4 bytes, then the bits need to be loaded into V0 at SEW byte gaps of 2 bits. Is this something doable in the current or
| upcoming specifications?

Masks are now stored as a packed bit vector in registers, in the same format as a packed bit vector would be in memory.

| 2. Even if the above is doable with some widening control, I can not use the bit vector or the mask to load from M_Vals[.]. Since M_Vals[.] is densely storing the
| non-zeroes of M[][], I just need to access X consecutive array elements where X = number of 1s in the mask. I will need to do a 1s-count on the mask (using vpopc
| instruction), and then convert that count to a new mask which will allow me to do a vector load of X adjacent elements. If the original bit vector has 10011001,
| then the 1s-count = 4 and I need to convert this to a new mask 00001111. I am not sure how to efficiently do this.

The viota.m instruction is what you need. Look at section 16.8 in the spec.

Krste
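For readers unfamiliar with viota.m: it writes into each destination element the count of set mask bits strictly below that position, i.e. an exclusive prefix sum of the mask. A scalar model of its semantics (illustrative; one mask bit per byte for clarity):

#include <stdint.h>

/* Scalar model of viota.m: out[i] = number of set bits among mask
 * positions 0..i-1 (exclusive prefix sum of the mask bits).  For mask
 * bits 1,0,0,1,1,0,0,1 (LSB first) it produces 0,1,1,1,2,3,3,3. */
static void viota_m(int vl, const uint8_t *mask_bits, uint32_t *out)
{
    uint32_t count = 0;
    for (int i = 0; i < vl; i++) {
        out[i] = count;
        count += mask_bits[i] & 1;
    }
}

Under the bit-vector mask, the viota result gives each active column its packed index into M_Vals[.], which is how it replaces the count-to-mask conversion asked about in point 2.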
PS: Instead of loading the RHS entries with an indexed load, we could also use a unit-stride load (possibly masked), followed by a vcompress. This didn't seem like a clear win, and required additional mask trickery, so I didn't pursue it.
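For reference, a scalar model of the vcompress.vm semantics being leaned on here (illustrative; one mask bit per byte for clarity):

#include <stdint.h>

/* Scalar model of vcompress.vm: copy src[i] for each set mask bit,
 * packing the selected elements contiguously from out[0].  Returns
 * the number of elements written (the popcount of the mask). */
static int vcompress(int vl, const uint8_t *mask_bits,
                     const double *src, double *out)
{
    int n = 0;
    for (int i = 0; i < vl; i++)
        if (mask_bits[i] & 1)
            out[n++] = src[i];
    return n;
}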
Hi Nagendra, The goal of the Zvediv extension is to add some support for sub-byte datatypes. So my hope is that applications that use bitvectors to encode nonzero structure will be supported via Zvediv. However, Zvediv is currently a very rough draft and likely won't be included in the base V-extension. In the meanwhile, I suspect there are some tricks you could use. The following code fragment is a rough sketch of the idea I have in mind (I expect there are bugs.)

# a0 = number of matrix columns left in current row
# a1 = pointer to bitvector (current offset)
# a2 = pointer to matrix nonzero array (current offset)
# a3 = pointer to (dense) right-hand side array (current offset)

vsetvli t0, a0, e32, m8   # stripmine loop
addi t1, t0, 7
srli t1, t1, 3            # t1 = ceil(t0/8), the number of bitvector bytes that
                          # encode the nonzero structure of the next t0 columns
vsetvli x0, t1, e8, m8
vle8 v0, (a1)             # load bitvector chunk for next t0 columns
vsetvli x0, t0, e8, m8
vpopc.m t2, v0            # t2 = number of nonzeros in chunk;
                          # **DANGER** sketchy type-punning on v0
vid.v v8
vcompress.vm v16, v8, v0
vsetvli x0, t2, e32, m8
vle32 v0, (a2)            # load matrix nonzeros
vlxei8 v8, (a3), v16      # gather RHS entries;
                          # **DANGER** sketchy type-punning on v16
# do dot product
# bump pointers and loop to next column chunk (or next row).

My hope was that the "sketchy type-punning" should succeed when SLEN = VLEN, but it will presumably fail when SLEN < VLEN. In the latter case, in the absence of typecast instructions, we'll need to use loads/stores (or uarch-dependent vrgathers) to permute data in the VRF. Best, Nick Knight
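A scalar C model of the same chunk-wise trick may make the data movement clearer (illustrative names; it mirrors the asm sketch: compress the active column offsets, then unit-stride load the nonzeros and gather the RHS):

#include <stdint.h>

/* Scalar model of the asm sketch above, for one chunk of `vl` columns
 * (assumes vl <= 64).  bits[i] is the bit-vector entry for column
 * (col0 + i); M_vals is the packed nonzero array at its current
 * offset; V is the dense RHS vector.  Returns the chunk's dot-product
 * contribution and the number of nonzeros consumed, so the caller can
 * bump the M_vals pointer. */
static double chunk_dot(int vl, const uint8_t *bits, int col0,
                        const double *M_vals, const double *V,
                        int *nnz_out)
{
    /* vid.v + vcompress.vm: pack the active column offsets */
    int idx[64];
    int nnz = 0;
    for (int i = 0; i < vl; i++)
        if (bits[i] & 1)
            idx[nnz++] = i;
    /* vle32 (unit-stride nonzeros) + vlxei8 (gather RHS) + dot product */
    double sum = 0.0;
    for (int k = 0; k < nnz; k++)
        sum += M_vals[k] * V[col0 + idx[k]];
    *nnz_out = nnz;
    return sum;
}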
I am now investigating how to efficiently implement sparse matrix X (dense) vector multiplication (spMV) using RISC-V vectors, with a bit-vector format for compressing the sparse matrix. The inner loop of the sequential spMV algorithm simply multiply-accumulates the non-zeroes of a row of M with the corresponding values of the dense vector (say V[]) to form an output element. How to implement this efficiently when bit-vectors are used to compress M is where I have a question/observation. In bit-vector compression, the sparse matrix metadata is a bit map of 0s and 1s. A 0 indicates that the matrix value is 0; a 1 indicates that the matrix value is non-zero. A values array -- say M_Vals[.] -- stores these non-zero values.
The basic idea of my implementation is to treat the bit vector as a mask for vector load and vector-multiply. But I am running into a couple of efficiency issues doing so. Stating them with an example vector size of 8:

1. When I load a byte of the bit vector, I need to convert this byte to a suitable mask that I can store in V0. I need to perform a widening of the bits that are loaded into V0. For example: if the SEW is 4 bytes, then the bits need to be loaded into V0 at SEW byte gaps of 2 bits. Is this something doable in the current or upcoming specifications?

2. Even if the above is doable with some widening control, I cannot use the bit vector or the mask to load from M_Vals[.]. Since M_Vals[.] densely stores the non-zeroes of M[][], I just need to access X consecutive array elements, where X = number of 1s in the mask. I will need to do a 1s-count on the mask (using the vpopc instruction), and then convert that count to a new mask which will allow me to do a vector load of X adjacent elements. If the original bit vector is 10011001, then the 1s-count = 4 and I need to convert this to a new mask 00001111. I am not sure how to do this efficiently.

3. Even after I do the above, the loaded values of M_Vals[.] will line up in a vector register as adjacent elements, while the bit-vector masked load of V[.] has loaded its elements into the corresponding unmasked element positions. I cannot multiply these registers element-wise; I have to align them before I can do the multiplies.

Given these overheads, it appears that vectorizing a bit-vector compressed spMV is rather hopeless. Thinking about useful vector instructions in this context, I could come up with a couple of suggestions:

1. Load a bit vector into V0, expanding each bit into its own element mask in V0. For example: if the original bit vector is the byte "10011001" and SEW = 4 bytes, then V0 is loaded with "01 00 00 01 01 00 00 01". That is, we need the instruction to perform an automatic mask-element-sized expansion of bits.

2. After doing vpopc to get the 1s-count into a scalar, convert the scalar back into a vector mask with that many adjacent 1s. With the above example, vpopc would return 4 into a scalar register. Now we need an instruction that loads "00 00 00 00 01 01 01 01" back into V0, converting the 4 in the scalar into 4 successive unmasked elements.

3. For the alignment problem between M_Vals[.] and V[.], it might be sufficient to add a 1-bit control to the vector loads. This bit can control whether masked loads load into element positions or into consecutive positions starting at position 0. If such a control were available, then one could load V[.] using an indexed vector load but into adjacent positions. That would make the elements pairwise aligned and compatible for pairwise multiplication. With this load semantic, I am separating control of the load sources (via the mask) from where the loaded values are stored in the register (using the same mask to load at unmasked elements, or to load consecutively into adjacent positions).

Either I am missing some trick or bit-vector based compression is indeed tricky! I would appreciate some feedback about this topic, and if there's a possibility to review the issue in greater detail at one of the upcoming meetings, I can prepare some slides.

Best Regards
Nagendra
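For reference, a minimal scalar version of the sequential inner loop described above (illustrative names; assumes the bitmap is packed LSB-first):

#include <stdint.h>

/* Sequential bit-vector spMV for one row: walk the row's bitmap,
 * consuming the next packed nonzero of M_vals[] for every set bit.
 * Bit j of the bitmap (packed LSB-first) covers column j. */
static double row_dot(int ncols, const uint8_t *bitmap,
                      const double *M_vals, const double *V)
{
    double sum = 0.0;
    int nz = 0;                        /* index into packed M_vals[] */
    for (int j = 0; j < ncols; j++)
        if ((bitmap[j >> 3] >> (j & 7)) & 1)
            sum += M_vals[nz++] * V[j];
    return sum;
}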