I am not sure I replied right (I hit reply to sender but not sure who I responded to, learning to work with the list). But thanks for quick replies.
1. Yes, understood about the scalar destination complication. I have to figure a better use of vd[1], vd[2] etc.
2. The vrgather (per 0.8 spec) does vd[i] = vs2[vs1[i]] -- I am not sure how this fixes the conversion from index to address offset. I need the bit shift to happen somewhere in the code.
I am not suggesting implicit scaling but programmer specified scaling amount in the instruction (0/1/2/3 bit shift). Based on knowledge of the matrix element data type, the programmer can certainly specify a shift amount.
Note that in my sparse matrix vector multiply code, the innermost loop is 9 instructions including the scaling instruction. If this were removed, it reduces dynamic instruction count by about 10%. It seems to be a valuable saving.
Best Regards
Nagendra