Multiple accesses required to the same location for strided memory accesses
I see that section 7.5 of the vector spec currently says:
When rs2=x0, then an implementation is allowed, but not required, to perform fewer memory operations than the number of active elements, and may perform different numbers of memory operations across different dynamic executions of the same static instruction.
Note Compilers must be aware to not use the x0 form for rs2 when the immediate stride is 0 if the intent is to require all memory accesses are performed.
When rs2!=x0 and the value of x[rs2]=0, the implementation must perform one memory access for each active element (but these accesses will not be ordered).
Note When repeating ordered vector accesses to the same memory address are required, then an ordered indexed operation can be used.
I’m not sure from reading this whether strided accesses that overlap are required to access the memory location once per element. The first three paragraphs read as if they are. The fourth paragraph (the note) reads as if they are not: if one wants multiple accesses to the same memory location, one should use an ordered indexed operation (with a constant index).
I thought we had said that the ordered indexed operations were the only ones that were constrained to access memory as many times as the naïvely interpreted instruction said. That seems to mean the first three paragraphs should be changed.
It would seem quite unfortunate to require strided memory operations to be special cased for zero stride (but not x0). If so, we also need to say what happens for positive and negative strides with absolute value less than the element size being accessed – or, for segmented accesses, less than the multiple segment size.
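To make the overlap cases concrete, here is a rough Python model (not anything from the spec) of the byte ranges a strided access touches. The function names and the choice of a 4-byte element with vl=4 are my own assumptions for illustration. It only shows that zero stride is one point on a continuum: any stride with absolute value below the element size also makes neighbouring elements share bytes.

```python
# Illustrative model only: names and parameters are hypothetical, not spec terms.

def element_ranges(base, stride, eew_bytes, vl):
    """Byte range [start, end) touched by each element of a strided access."""
    return [(base + i * stride, base + i * stride + eew_bytes) for i in range(vl)]

def overlapping_pairs(ranges):
    """Index pairs (i, j), i < j, whose byte ranges share at least one byte."""
    return [(i, j)
            for i in range(len(ranges))
            for j in range(i + 1, len(ranges))
            if ranges[i][0] < ranges[j][1] and ranges[j][0] < ranges[i][1]]

def distinct_bytes(ranges):
    """Number of distinct byte addresses covered by all elements together."""
    return len({b for start, end in ranges for b in range(start, end)})

BASE, EEW, VL = 0x1000, 4, 4  # assumed example values

for stride in (0, 2, -2, 4):
    r = element_ranges(BASE, stride, EEW, VL)
    print(stride, len(overlapping_pairs(r)), distinct_bytes(r))
```

With these assumed values, stride 0 has every element on the same 4 bytes, strides 2 and -2 each have adjacent elements sharing 2 bytes, and stride 4 has no overlap at all.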
If strided, segmented loads where the stride is one segment are required to do multiple accesses, that would be even more unfortunate as it would keep them from being used efficiently for stencil operations.
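As a sketch of the stencil use I have in mind, here is a small Python model of a strided segment load. It assumes the interesting case is a stride smaller than the segment (here one element, with three fields per segment), and it counts in elements rather than bytes. The function name and data are made up for illustration. The point is that consecutive segments overlap, so the naive element-read count exceeds the number of distinct locations.

```python
# Illustrative model only: segment_load is a hypothetical helper, and
# addresses are element indices, not bytes.

def segment_load(mem, base, stride, nf, vl):
    """Return nf field vectors; field f, element i comes from mem[base + i*stride + f]."""
    return [[mem[base + i * stride + f] for i in range(vl)] for f in range(nf)]

a = [1, 4, 9, 16, 25, 36]  # assumed sample data (squares)

# Stride of one element with three fields: each segment is a 3-wide window.
left, mid, right = segment_load(a, base=0, stride=1, nf=3, vl=4)

# Three-point stencil (second difference) computed from the field vectors.
stencil = [l - 2 * m + r for l, m, r in zip(left, mid, right)]

naive_reads = 3 * 4                      # one read per field per element
distinct_locations = len({i + f for i in range(4) for f in range(3)})
print(stencil, naive_reads, distinct_locations)
```

In this model the load touches 6 distinct locations but would need 12 reads if every element access had to be performed separately.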
Bill