1. A vector register is deliberately used as the destination of
reductions. If the destination is a scalar register, then tight
coupling between the vector and scalar units would be necessary, and
concurrency would be reduced (because the scalar unit might have to
stall until the vector reduction is completed).
2. Yes, vector-indexed-load instructions such as vlxe.v currently
treat offsets as byte offsets. I could see this issue being debated,
but it would require a shift by (0,1,2,3) for up to 64-bit SEWs. If
there is a way you can use vrgather.vv instead, it indexes by element
rather than by byte.
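
To make the difference in point 2 concrete, here is a minimal scalar C sketch of the two addressing conventions. It is only a model, not the spec's definition: the function names (indexed_load_bytes, gather_elements, scale_indices) are hypothetical, 32-bit elements are assumed, and the vlmax parameter stands in for VLMAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a byte-offset indexed load from memory:
   dst[i] is read from the address (base + byte_off[i]). */
static void indexed_load_bytes(int32_t *dst, const int32_t *base,
                               const uint32_t *byte_off, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        memcpy(&dst[i], (const unsigned char *)base + byte_off[i],
               sizeof dst[i]);
    }
}

/* Hypothetical model of a register gather: dst[i] = src[idx[i]].
   Indices are element numbers; out-of-range indices produce 0. */
static void gather_elements(int32_t *dst, const int32_t *src,
                            const uint32_t *idx, size_t vl, size_t vlmax) {
    for (size_t i = 0; i < vl; i++) {
        dst[i] = (idx[i] < vlmax) ? src[idx[i]] : 0;
    }
}

/* The extra step the byte-offset convention needs: shift each element
   index left by log2(element size in bytes), i.e. 0, 1, 2 or 3. */
static void scale_indices(uint32_t *byte_off, const uint32_t *idx,
                          size_t vl, unsigned log2_bytes) {
    for (size_t i = 0; i < vl; i++) {
        byte_off[i] = idx[i] << log2_bytes;
    }
}
```

With 32-bit data, element indices {3, 0, 2} become byte offsets {12, 0, 8} after scale_indices(..., 2). Both routines then return the same elements, but the gather only applies when the source data already sits in a vector register.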
On Tue, Mar 10, 2020 at 2:44 PM Nagendra Gulur <nagendra.gd@...> wrote:
I am developing sparse matrix codes using the vector extension on RISCV32 with the Spike simulator. Based on my understanding of the spec so far, I have a couple of questions about it. I hope this is the correct group for such queries.
1. Vector reductions (such as the vector single-width integer reduction instructions) write their result to vd[0]. This commits vd as the destination and makes it hard to use the other elements of vd (vd[1], vd[2], ..) unless some shift/mask operations are employed. Given the need to use vector registers efficiently, I was wondering whether a variant of these instructions with a scalar destination register could be defined. In most configs, a single scalar register should suffice as the destination. In rare cases, a scalar register pair may act as the destination. If the common cases of 8/16/32-bit SEW reductions could use a scalar destination, that would free up a vector register. That would be very helpful in codes that need to retain as many sub-blocks of data as possible inside registers.
2. Many common sparse matrix formats (such as CSR, CSC, COO, etc.) use metadata in the form of non-zero column (CSR) or row (CSC) indices. However, the actual element address offsets are multiples of the element width. For example, column indices 0, 1 and 2 in a matrix with 32-bit elements correspond to address offsets of 0, 4 and 8 bytes. Thus, the code needs a scaling instruction to convert the indices to address offsets, and this instruction has to run inside the innermost loops. One way to avoid a separate scaling instruction is to embed the common cases of shifting left by 0/1/2/3 in the vector load instruction itself. I am referring to the vector load that loads the indices from memory into a vector. With this, the vector load would load the indices AND perform the scaling (a left shift of each loaded element by 0/1/2/3 bits for 1B/2B/4B/8B elements). That way, the vector register would directly contain address offsets after loading, and the code would not need another scaling instruction. I have not looked at the full instruction format to see how a 2-bit shift field could be incorporated, but perhaps some of the reserved lumop field values could be used to encode a shift?
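
As an illustration of where the scaling step and the reduction sit in such a kernel, here is a rough scalar C model of one CSR row dot product, written strip by strip the way a vector loop would be. It is a sketch under assumptions, not real vector code: csr_row_dot and VL_MODEL are hypothetical names, elements and indices are 32 bits wide, and the strip length of 4 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { VL_MODEL = 4 }; /* assumed strip length, for illustration only */

/* One CSR row: returns the sum over k of val[k] * x[col[k]]. */
static int32_t csr_row_dot(const int32_t *val, const uint32_t *col,
                           size_t nnz, const int32_t *x) {
    int32_t acc[VL_MODEL] = {0}; /* models a vector accumulator register */

    for (size_t k = 0; k < nnz; k += VL_MODEL) {
        size_t vl = (nnz - k < VL_MODEL) ? nnz - k : VL_MODEL;
        uint32_t off[VL_MODEL];
        int32_t xv[VL_MODEL];

        /* Extra step in the inner loop: column index -> byte offset
           (shift by 2 because the elements are 4 bytes wide). */
        for (size_t i = 0; i < vl; i++) {
            off[i] = col[k + i] << 2;
        }
        /* Byte-offset indexed load of the x entries. */
        for (size_t i = 0; i < vl; i++) {
            memcpy(&xv[i], (const unsigned char *)x + off[i], sizeof xv[i]);
        }
        /* Element-wise multiply-accumulate. */
        for (size_t i = 0; i < vl; i++) {
            acc[i] += val[k + i] * xv[i];
        }
    }

    /* Reduction: the sum lands in element 0 of a second "vector register";
       the remaining elements of red are left unused, as in point 1. */
    int32_t red[VL_MODEL] = {0};
    for (size_t i = 0; i < VL_MODEL; i++) {
        red[0] += acc[i];
    }
    return red[0]; /* models the final move from element 0 to a scalar */
}
```

In this model the shift appears once per strip inside the loop, which is the cost point 2 is asking to fold into the index load, and the red array shows a whole register being committed to hold a single result.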