vector strided stores when rs1=x0


Krste Asanovic
 

Also on github as issue #595

In our earlier TG discussion in 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements. This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason to not also allow strided stores
(which are now unordered), to also perform fewer memory operations (in
effect, picking a random active element to write back). The behavior
is indistinguishable from a possible legal execution of prior scheme,
and has potential niche use of storing element value to memory when it
is known all elements have same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry. Software could then get a
similar effect by settng vl=1 before the store.

Krste


Guy Lemieux
 

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO. (I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason to not also allow strided stores
(which are now unordered), to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of prior scheme,
and has potential niche use of storing element value to memory when it
is known all elements have same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by settng vl=1 before the store.

Krste






Krste Asanovic
 


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason to not also allow strided stores
(which are now unordered), to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of prior scheme,
and has potential niche use of storing element value to memory when it
is known all elements have same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by settng vl=1 before the store.

Krste







Krste Asanovic
 

I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value),

Krste

On Nov 9, 2020, at 9:12 AM, Krste Asanovic via lists.riscv.org <krste=berkeley.edu@...> wrote:


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason to not also allow strided stores
(which are now unordered), to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of prior scheme,
and has potential niche use of storing element value to memory when it
is known all elements have same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by settng vl=1 before the store.

Krste








Nick Knight
 

Sorry, slightly off topic, but what was the rationale for

When `rs2!=x0` and the value of `x[rs2]=0`, the implementation must perform one memory access for each active element (but these accesses will not be ordered).

I guess I'm thinking about the possibility of a toolchain relaxing `li, x1, 0; inst x1` into `inst x0`. 


On Mon, Nov 9, 2020 at 10:09 AM Krste Asanovic <krste@...> wrote:
I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value),

Krste

On Nov 9, 2020, at 9:12 AM, Krste Asanovic via lists.riscv.org <krste=berkeley.edu@...> wrote:


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason to not also allow strided stores
(which are now unordered), to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of prior scheme,
and has potential niche use of storing element value to memory when it
is known all elements have same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by settng vl=1 before the store.

Krste








Krste Asanovic
 

There’s a comment about this in spec already.

But note that this would be in a case where you're relying on having multiple accesses in a non-deterministic order to one memory location, which is probably fraught for other reasons.

Krste

On Nov 9, 2020, at 11:38 AM, Nick Knight <nick.knight@...> wrote:

Sorry, slightly off topic, but what was the rationale for 

When `rs2!=x0` and the value of `x[rs2]=0`, the implementation must perform one memory access for each active element (but these accesses will not be ordered).

I guess I'm thinking about the possibility of a toolchain relaxing `li, x1, 0; inst x1` into `inst x0`.  


On Mon, Nov 9, 2020 at 10:09 AM Krste Asanovic <krste@...> wrote:
I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value),

Krste

On Nov 9, 2020, at 9:12 AM, Krste Asanovic via lists.riscv.org <krste=berkeley.edu@...> wrote:


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason to not also allow strided stores
(which are now unordered), to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of prior scheme,
and has potential niche use of storing element value to memory when it
is known all elements have same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by settng vl=1 before the store.

Krste









Nick Knight
 

I understand now. I'm on board iff the memory consistency model experts assent.

On Mon, Nov 9, 2020 at 11:41 AM Krste Asanovic <krste@...> wrote:
There’s a comment about this in spec already.

But note that this would be in a case where you're relying on having multiple accesses in a non-deterministic order to one memory location, which is probably fraught for other reasons.

Krste

On Nov 9, 2020, at 11:38 AM, Nick Knight <nick.knight@...> wrote:

Sorry, slightly off topic, but what was the rationale for 

When `rs2!=x0` and the value of `x[rs2]=0`, the implementation must perform one memory access for each active element (but these accesses will not be ordered).

I guess I'm thinking about the possibility of a toolchain relaxing `li, x1, 0; inst x1` into `inst x0`.  


On Mon, Nov 9, 2020 at 10:09 AM Krste Asanovic <krste@...> wrote:
I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value),

Krste

On Nov 9, 2020, at 9:12 AM, Krste Asanovic via lists.riscv.org <krste=berkeley.edu@...> wrote:


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason to not also allow strided stores
(which are now unordered), to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of prior scheme,
and has potential niche use of storing element value to memory when it
is known all elements have same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by settng vl=1 before the store.

Krste









Bill Huffman
 

This sounds right to me as well.  No use making a special case for strided stores with rs2=x0.

      Bill

On 11/9/20 12:04 PM, Nick Knight wrote:

EXTERNAL MAIL

I understand now. I'm on board iff the memory consistency model experts assent.

On Mon, Nov 9, 2020 at 11:41 AM Krste Asanovic <krste@...> wrote:
There’s a comment about this in spec already.

But note that this would be in a case where you're relying on having multiple accesses in a non-deterministic order to one memory location, which is probably fraught for other reasons.

Krste

On Nov 9, 2020, at 11:38 AM, Nick Knight <nick.knight@...> wrote:

Sorry, slightly off topic, but what was the rationale for 

When `rs2!=x0` and the value of `x[rs2]=0`, the implementation must perform one memory access for each active element (but these accesses will not be ordered).

I guess I'm thinking about the possibility of a toolchain relaxing `li, x1, 0; inst x1` into `inst x0`.  


On Mon, Nov 9, 2020 at 10:09 AM Krste Asanovic <krste@...> wrote:
I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value),

Krste

On Nov 9, 2020, at 9:12 AM, Krste Asanovic via lists.riscv.org <krste=berkeley.edu@...> wrote:


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason to not also allow strided stores
(which are now unordered), to also perform fewer memory operations (in
effect, picking a random active element to write back).  The behavior
is indistinguishable from a possible legal execution of prior scheme,
and has potential niche use of storing element value to memory when it
is known all elements have same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by settng vl=1 before the store.

Krste