A couple of questions about the vector spec
I am developing sparse matrix codes using the vector extension on RISC-V (RV32), running on the Spike simulator. Based on my understanding of the spec so far, I have a couple of questions. I hope this is the correct group for such queries.
1. Vector reductions (such as the vector single-width integer reduction instructions) write their result to element 0 of vd. This commits all of vd as the destination and makes it hard to use the other elements of vd (vd[1], vd[2], ...) unless some shift/mask operations are employed. Given the need to use vector registers efficiently, I was wondering whether a variant of these instructions with a scalar destination register could be defined. In most configs, a single scalar register should suffice as the destination. In rare cases, a scalar register pair may be needed. If the common cases of reductions with SEW of 8/16/32 bits could use a scalar destination, that would free up a vector register. That would be very helpful in codes that need to retain as many sub-blocks of data as possible in registers.
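To make the request concrete, here is a minimal C model of the two behaviours, offered as a sketch rather than as spec text. The type vreg_t, the constant VLMAX, and both function names are hypothetical, and the tail elements are simply left undisturbed. The first function models the existing sum reduction (vd[0] = vs1[0] + sum of the active elements of vs2); the second models the proposed scalar-destination variant, which is not part of the spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8 /* assumed number of 32-bit elements per vector register */

/* Hypothetical model of one vector register holding 32-bit elements. */
typedef struct { uint32_t e[VLMAX]; } vreg_t;

/* Existing behaviour: the sum lands in vd[0], so vd as a whole is tied up
   as a destination. Elements 1..VLMAX-1 are left undisturbed in this model. */
static void model_vredsum_vs(vreg_t *vd, const vreg_t *vs2,
                             const vreg_t *vs1, size_t vl)
{
    assert(vl <= VLMAX);
    if (vl == 0)
        return; /* with vl = 0 the destination is not updated */
    uint32_t acc = vs1->e[0]; /* unsigned arithmetic wraps on overflow */
    for (size_t i = 0; i < vl; i++)
        acc += vs2->e[i];
    vd->e[0] = acc;
}

/* Proposed variant (an assumption, not in the spec): same sum, but the
   result goes to a scalar register, so no vector register is consumed. */
static uint32_t model_redsum_to_scalar(const vreg_t *vs2,
                                       const vreg_t *vs1, size_t vl)
{
    assert(vl <= VLMAX);
    uint32_t acc = vs1->e[0];
    for (size_t i = 0; i < vl; i++)
        acc += vs2->e[i];
    return acc;
}
```

With the existing form, the result still has to be moved out of element 0 before the scalar side can use it; the proposed form would fold that move into the reduction.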
2. Many common sparse matrix formats (such as CSR, CSC, COO, etc.) use metadata in the form of non-zero column (CSR) or row (CSC) indices. However, the actual address offsets are multiples of the element width. For example, column indices 0, 1 and 2 in a matrix with 32-bit elements correspond to address offsets of 0, 4 and 8 bytes. Thus, the code needs a separate instruction to scale the indices to address offsets, and this instruction has to run inside the innermost loops. One way to avoid the separate scaling instruction is to embed the common cases of shifting left by 0/1/2/3 bits inside the vector load instruction itself. I am referring to the vector load that loads the indices from memory into a vector register. With this, the vector load would load the indices AND perform the scaling (a left shift of each loaded element by 0/1/2/3 bits, for 1B/2B/4B/8B elements). That way, the vector register would directly contain address offsets after the load, and the code would not need a separate scaling instruction. I have not looked at the full details of the instruction format to see how a 2-bit shift field could be incorporated, but perhaps some of the reserved values of the lumop field could be used to encode a shift?
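The inner loop in question can be sketched in plain C as follows. This is only a scalar model under stated assumptions: CHUNK stands in for the vector length, and scale_indices, gather_f32 and csr_row_dot are hypothetical names. It shows one CSR row dot product where the column indices are loaded, scaled to byte offsets in a separate step, and then used for a gather by byte offset, which is how the indexed vector loads interpret their index operand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 4 /* assumed vector length in elements for this model */

/* The separate step needed today: turn element indices into byte offsets.
   shift is 0/1/2/3 for 1B/2B/4B/8B elements. */
static void scale_indices(uint32_t *offsets, const uint32_t *col_idx,
                          size_t n, unsigned shift)
{
    assert(shift <= 3);
    for (size_t i = 0; i < n; i++)
        offsets[i] = col_idx[i] << shift;
}

/* Gather 32-bit floats from base using byte offsets. */
static void gather_f32(float *dst, const float *base,
                       const uint32_t *byte_off, size_t n)
{
    for (size_t i = 0; i < n; i++)
        memcpy(&dst[i], (const unsigned char *)base + byte_off[i],
               sizeof(float));
}

/* Dot product of one CSR row (vals, col_idx, nnz) with the dense vector x. */
static float csr_row_dot(const float *vals, const uint32_t *col_idx,
                         size_t nnz, const float *x)
{
    float sum = 0.0f;
    for (size_t done = 0; done < nnz; done += CHUNK) {
        size_t vl = (nnz - done < CHUNK) ? nnz - done : CHUNK;
        uint32_t off[CHUNK];
        float xg[CHUNK];
        /* This is the step the proposal would fold into the index load. */
        scale_indices(off, col_idx + done, vl, 2);
        gather_f32(xg, x, off, vl);
        for (size_t i = 0; i < vl; i++)
            sum += vals[done + i] * xg[i];
    }
    return sum;
}
```

If the index load itself applied the shift, the call to scale_indices would disappear from the loop body and the loaded vector could feed the gather directly.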