A couple of questions about the vector spec


Nagendra Gulur
 

I am developing sparse matrix codes using the vector extension on RISCV32 using the SPIKE simulator. Based on my understanding of the spec thus far, I wanted to ask a couple of questions about it. I hope that this is the correct group to post such queries to.

1. Vector reductions (such as the vector single-width integer reduction instructions) write their result to vd[0]. This commits all of vd as a destination and makes it hard to use the other elements of vd (vd[1], vd[2], ...) unless some shift/mask operations are employed. Given the need to use vector registers efficiently, I was wondering if a variant of these instructions with a scalar destination register could be defined. In most configurations, a single scalar register should suffice as the destination; in rare cases, a scalar register pair may be needed. If the common cases of 8/16/32-bit SEW reductions could use a scalar destination, that would free up a vector register, which would be very helpful in codes that need to retain as many sub-blocks of data as possible inside registers.
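To make the register-pressure concern concrete, here is a tiny scalar model (a Python sketch, not spec text) of the vredsum.vs semantics: the reduction lands only in vd[0]. The remaining elements are modeled as undisturbed here; their exact behavior is governed by the spec's tail policy, so software cannot count on keeping live data in them across the reduction.

```python
# Scalar model of vredsum.vs: the reduction result lands only in
# element 0 of vd. Remaining elements are modeled as undisturbed
# here; the exact behavior is governed by the spec's tail policy.
def vredsum_vs(vd, vs2, vs1, vl):
    vd = list(vd)
    vd[0] = vs1[0] + sum(vs2[:vl])
    return vd
```

Collecting several reduction results into one vector therefore takes extra move/slide operations, which is the register-usage concern raised above.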

2. Many common sparse matrix formats (such as CSR, CSC, COO, etc.) use metadata in the form of non-zero column (CSR) or row (CSC) indices. However, the actual address offsets are in units of the element width. For example, column indices 0, 1 and 2 in a matrix with 32-bit elements correspond to address offsets 0, 4 and 8 bytes. Thus the code requires a scaling instruction to convert the indices into address offsets, and this instruction has to run inside the innermost loop. One way to avoid a separate scale instruction is to embed the common cases of shifting left by 0/1/2/3 bits (i.e., scaling by 1/2/4/8 bytes) inside the vector load instruction itself, specifically the vector load that loads the indices from memory into a vector register. With this, the vector load would load the indices AND perform the scaling, so the vector register would directly contain address offsets after loading and the code would not need another scaling instruction. I have not looked at the full instruction format details to see how a 2-bit shift field could be incorporated, but perhaps some of the reserved lumop field values could be used to encode a shift?
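As a minimal sketch of the scaling step in question (Python pseudocode; the helper name is invented), converting CSR column indices into byte offsets is a left shift by log2 of the element size:

```python
# Hypothetical sketch of the per-iteration scaling step: CSR column
# indices must be shifted left by log2(element_bytes) to become byte
# offsets before they can feed an indexed (byte-offset) vector load.
def scale_indices(col_indices, element_bytes):
    shift = element_bytes.bit_length() - 1  # 1/2/4/8 bytes -> shift 0/1/2/3
    return [idx << shift for idx in col_indices]
```

Folding this shift into the index-loading instruction is exactly what the proposal above asks for.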

Best Regards
Nagendra


Guy Lemieux
 

1. A vector register is deliberately used as the destination of
reductions. If the destination is a scalar register, then tight
coupling between the vector and scalar units would be necessary, and
concurrency would be reduced (because the scalar unit might have to
stall until the vector reduction is completed).

2. Yes, vector-indexed-load instructions such as vlxe.v currently
treat offsets as byte offsets. I could see this issue being debated,
but it would require a shift by (0,1,2,3,4) for up to 64-bit SEWs. If
there is a way you can use vrgather.vv instead, it uses element-size
offsets.

Guy

On Tue, Mar 10, 2020 at 2:44 PM Nagendra Gulur <nagendra.gd@...> wrote:



Andrew Waterman
 



On Tue, Mar 10, 2020 at 3:07 PM Guy Lemieux <glemieux@...> wrote:
> 2. Yes, vector-indexed-load instructions such as vlxe.v currently
> treat offsets as byte offsets. I could see this issue being debated,
> but it would require a shift by (0,1,2,3,4) for up to 64-bit SEWs. If
> there is a way you can use vrgather.vv instead, it uses element-size
> offsets.

Indices must be able to represent general pointers in some cases (e.g. vectorizing (*(x[i]))++ instead of y[x[i]]++), so implicitly scaling the indices causes problems, particularly when misaligned addresses are legal.

The 64-bit instruction encoding could offer another variant of these instructions that scales indices by SEW/8. In the meantime, I don't think the extra shift is too much to ask.
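Andrew's point about general pointers can be illustrated with a toy byte-addressed memory (a Python sketch; nothing here is real RVV code). When the index vector already holds raw byte addresses, as when vectorizing (*(x[i]))++, the indexed load must use them unscaled, so a variant that always scaled by SEW/8 would be unusable for this case:

```python
import struct

# Toy byte-addressed memory. The "index" vector holds raw byte
# addresses (pointers), so the indexed load must use them as-is;
# an implicitly scaled variant would compute the wrong addresses.
def indexed_load_32(mem, base, byte_offsets):
    return [struct.unpack_from('<I', mem, base + off)[0] for off in byte_offsets]

mem = bytearray(16)
struct.pack_into('<I', mem, 4, 111)   # 32-bit value at byte address 4
struct.pack_into('<I', mem, 8, 222)   # 32-bit value at byte address 8

vals = indexed_load_32(mem, 0, [4, 8])   # pointers used unscaled
```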








Nagendra Gulur
 

I am not sure I replied correctly (I hit reply-to-sender but am not sure who received it; I am still learning to work with the list). In any case, thanks for the quick replies.

1. Yes, understood about the scalar destination complication. I will have to figure out a better use of vd[1], vd[2], etc.

2. vrgather (per the 0.8 spec) does vd[i] = vs2[vs1[i]], and I am not sure how this helps with the conversion from index to address offset. I need the bit shift to happen somewhere in the code.

I am not suggesting implicit scaling but a programmer-specified scaling amount in the instruction (a 0/1/2/3-bit shift). Based on knowledge of the matrix element data type, the programmer can certainly specify a shift amount.
Note that in my sparse matrix-vector multiply code, the innermost loop is 9 instructions including the scaling instruction. Removing it would reduce the dynamic instruction count by about 10%, which seems a valuable saving.
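For reference, a scalar Python sketch of a CSR sparse matrix-vector multiply (illustrative only; this is not the actual vectorized loop, and the names are invented) shows where the per-iteration scaling would sit:

```python
# Scalar sketch of CSR sparse matrix-vector multiply. In the vector
# version of the inner loop, col_indices[k] must first be shifted
# left by log2(elem_bytes) to form a byte offset for the indexed
# load of x -- that shift is the extra instruction in question.
def csr_spmv(values, col_indices, row_ptr, x):
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_indices[k]]
    return y
```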

Best Regards
Nagendra


Guy Lemieux
 

> 1. Yes, understood about the scalar destination complication. I have to figure a better use of vd[1], vd[2] etc.
Possibly vslide1up after every reduction, producing a vector of reductions (possibly in backwards order, unless you rearrange your outer loop order).
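This suggestion can be modeled as follows (a Python sketch, assuming vslide1up's vd[0] = scalar, vd[i] = vs2[i-1] behavior): each row's reduction is slid in at element 0, so the collected sums end up in reverse row order.

```python
# Toy model of vslide1up: every element of the accumulator shifts up
# by one position and the newest reduction result enters at element 0,
# so successive row sums collect in reverse order.
def vslide1up(vec, scalar):
    return [scalar] + vec[:-1]

acc = [0, 0, 0]
for row_sum in [10, 20, 30]:          # reductions from three rows
    acc = vslide1up(acc, row_sum)
```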

> 2. The vrgather (per 0.8 spec) does vd[i] = vs2[vs1[i]] -- I am not sure how this fixes the conversion from index to address offset. I need the bit shift to happen somewhere in the code.
I'm not suggesting that you use vrgather to convert indices to byte
offsets. I'm wondering if there is a way to handle sparse rows/columns
entirely differently that uses vrgather instead of vlx (note: I have
no idea if it's possible, as I've never tried to implement sparse
matrix code). However, vlx and vrgather are very similar (one applies
to memory byte addresses, the other applies to vector elements, so
obviously there is some difference).
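For comparison, the vrgather.vv semantics cited above can be modeled directly (a Python sketch; it assumes the spec's rule that out-of-range indices return 0). The key contrast with vlx is that the indices here are in units of elements, not bytes:

```python
# Minimal model of vrgather.vv: vd[i] = vs2[vs1[i]], with indices in
# element units; out-of-range indices yield 0 (assumed per the spec).
def vrgather_vv(vs2, vs1):
    return [vs2[i] if i < len(vs2) else 0 for i in vs1]
```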

> I am not suggesting implicit scaling but programmer specified scaling amount in the instruction (0/1/2/3 bit shift). Based on knowledge of the matrix element data type, the programmer can certainly specify a shift amount.
You are overthinking this. A well-designed vector unit may be able to fuse/chain a vsll.vv instruction with a vlx instruction. You shouldn't assume that one instruction must run to completion before the next one starts.

> Note that in my sparse matrix vector multiply code, the innermost loop is 9 instructions including the scaling instruction. If this were removed, it reduces dynamic instruction count by about 10%. It seems to be a valuable saving.
Yes, it would save instruction issue bandwidth. On simple vector implementations, it may speed things up. On complex ones, it shouldn't make a difference, as this will be an anticipated, frequently occurring pattern.

Guy


Nagendra Gulur
 

Thank you for the vslide1up suggestion. I might use this with loop unrolling.

Best Regards
Nagendra


Roger Espasa
 

On #2, it would require finding a 2-bit field in the vector load format to encode "no-scaling/2/4/8". Not trivial within the 32-bit format.

On Wed, Mar 11, 2020 at 1:30 AM Nagendra Gulur <nagendra.gd@...> wrote:



Nagendra Gulur
 

Yes, agreed. I believe it is going to be considered for support in the 64-bit encoding.

Best regards 
Nagendra 


On Thu, Apr 2, 2020 at 6:54 AM Roger Espasa <roger.espasa@...> wrote: