I am developing sparse matrix codes using the vector extension on RISCV32 using SPIKE simulator. Based on my understanding of the spec thus far, I wanted to ask a couple of questions about the spec. I hope that this is the correct group to post such queries to.
1. Vector reductions (such as vector single-width integer reduction instructions) write their reductions to vd[0]. This results in committing vd as destination and makes it hard to use other elements of vd (vd[1], vd[2], ..) unless some shift/mask operations are employed. Given the need to efficiently use vector registers, I was wondering if a variant of these instructions where the destination is a scalar register could be defined. In most configs, a single scalar register for destination should suffice. In rare cases, a scalar register pair may act as destination. If the common cases of 8/16/32 bit SEW based reductions could be supported to use scalar dest, that would free up a vector register. That would be very helpful in codes that need to retain as many sub-blocks of data as possible inside registers.
2. Many common sparse matrix formats (such as CSR, CSC, COO, etc) use metadata in the form of non-zero column (CSR) or row (CSC) indices. However the actual element address offsets are in terms of element widths. For eg: column indices 0, 1 and 2 in a matrix with 32-bit elements correspond to address offsets 0, 4 and 8 bytes. Thus, the code requires the use of a scaling instruction to scale the indices to address offsets. This instruction has to run inside innermost loops. One way to avoid such a separate scale instruction is to embed the common cases of shifting left by 0/1/2/3 inside the vector load instruction itself. I am referring to the vector load that loads the indices from memory to a vector. With this, the vector load would load the indices AND perform scaling (1B /2B/ 4B/ 8B left shift of each loaded element). That way, the vector register would directly contain address offsets after loading and the code will not need to include another scaling instruction. I have not looked at the full details of instruction format details to see how a 2-bit shift field could be incorporated but perhaps some of the lumop field reserved values could be used to encode a shift?
Best Regards
Nagendra