Effective element width encoding in vector load/stores


Krste Asanovic
 

There are two separate issues noted with the proposal to remove
fixed-size vector load/stores.  One is the additional vsetvli instructions
needed, and the second is the additional widening instructions
required.  We've discussed adding more widening instructions to help
with the latter.  I have a proposal below to help with the former in a
way that also improves FP, and which provides a solution to the
indexed vector index size wart we've had for a while.

This proposal still only supports packed load/stores, as opposed to
unpacked load/stores with sign/zero extension.  However, the
problematic instruction overhead of many additional vsetvli
instructions when simply removing fixed-size load/stores is avoided by
repurposing the width field to encode the "effective" element width
(EEW) for the current vector load/store instruction.

Using the width field, EEW is encoded as one of {8,16,32,SEW}.  This
now determines *both* the register element size and the memory element
size, where previously it only set the memory element size and
sign/zero extended this into the SEW-width register element.

Effective LMUL (ELMUL) is now calculated as ELMUL = (EEW/SEW)*LMUL to
keep SEW/LMUL constant. If this results in a bad LMUL value, an
illegal instruction exception is raised.
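
As a sanity check on that arithmetic, here is a minimal C sketch of the
per-instruction calculation; representing LMUL in eighths and bounding legal
ELMUL to the 1/8..8 range are assumptions for illustration, not proposal text.

#include <stdbool.h>

/* lmul_eighths carries LMUL*8, so 1 => LMUL=1/8, 8 => LMUL=1, 64 => LMUL=8. */
static bool effective_lmul(unsigned eew, unsigned sew, unsigned lmul_eighths,
                           unsigned *elmul_eighths)
{
    /* ELMUL = (EEW/SEW)*LMUL, keeping SEW/LMUL constant. */
    unsigned long scaled = (unsigned long)lmul_eighths * eew;
    if (scaled % sew)
        return false;              /* not representable -> illegal instruction */
    scaled /= sew;
    if (scaled < 1 || scaled > 64)
        return false;              /* outside supported LMUL range -> illegal */
    *elmul_eighths = (unsigned)scaled;
    return true;
}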

The effective EEW/ELMUL setting is only in effect for the single
instruction and does not change values in the vtype CSR.

This doesn’t add any real hardware complexity, as the underlying
microarchitectural operations were all supported anyway; it just
streamlines loading/storing of different element widths with the same
SEW/LMUL ratio in the same loop.  Our widening/narrowing operations
already sort of do this.

So, for example, for unit-stride loads/stores we can reduce the original code:

# Widening C[i]+=A[i]*B[i], where A and B are FP16, C is FP32
vsetvli t0, a0, e16,m4  # vtype SEW=16b
vle.v v4, (a1)
vle.v v8, (a2)
   vsetvli x0, x0, e32,m8
   vle.v v16, (a3)
   vsetvli x0, x0, e16,m4
vfwmacc.vv v16, v4, v8  # EEW,ELMUL of result is 2*SEW,2*LMUL
   vsetvli x0, x0, e32,m8
   vse.v v16, (a3)         # store 32b values

down to:

vsetvli t0, a0, e16,m4  # vtype SEW=16b
vle.v v4, (a1)
vle.v v8, (a2)
   vle32.v v16, (a3)       # Get previous 32b values, no vsetvli needed.
vfwmacc.vv v16, v4, v8  # EEW,ELMUL of result is 2*SEW,2*LMUL
   vse32.v v16, (a3)       # EEW=32b, no vsetvli needed

removing three vsetvli instructions from the inner loop.
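
For reference, a scalar C equivalent of what both sequences above compute;
the function name is just for illustration, and it assumes a compiler that
provides the _Float16 type.

/* Scalar reference for the loop body above: C[i] += A[i]*B[i],
   where A and B are FP16 arrays and C is an FP32 array. */
void widening_macc_f16(float *c, const _Float16 *a, const _Float16 *b, long n)
{
    for (long i = 0; i < n; i++)
        c[i] += (float)a[i] * (float)b[i];
}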

Note this approach also helps floating-point code, whereas
byte/halfword/word load/stores do not.

I'm using vle32 syntax to mirror the assembler syntax for vsetvli e32 etc.

I think this also solves our indexed load/store problem.  We use
vtype.SEW to encode the data elements, but use the width-field-encoded
EEW for indices.  One wrinkle is that the largest EEW encoding
now indicates 64b not SEW, i.e., index EEW is {8,16,32,64}.

# Load 32b values using 8b offsets:
vsetvli t0, a0, e32,m4
vlx8.v v8, (a1), v7  # Load 32b values into v8-11, using EEW=8,ELMUL=1 vector in v7

# Load 8b values using 64b offsets:
vsetvli t0, a0, e8,fl2
vlx64.v v1, (a1), v8  # EEW=64,ELMUL=4 indices in v8-v11, SEW=8,LMUL=1/2 in v1, 
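
As a plain-C reference for the two indexed examples above, treating the
index elements as unscaled byte offsets from the base address, which is how I
read the examples; function names are only for illustration.

#include <stdint.h>
#include <string.h>

/* First example: gather 32b values using 8b byte offsets. */
void gather32_index8(uint32_t *dst, const uint8_t *base,
                     const uint8_t *offsets, long n)
{
    for (long i = 0; i < n; i++)
        memcpy(&dst[i], base + offsets[i], sizeof dst[i]);  /* dst[i] = M[base+off[i]] */
}

/* Second example: gather 8b values using 64b byte offsets. */
void gather8_index64(uint8_t *dst, const uint8_t *base,
                     const uint64_t *offsets, long n)
{
    for (long i = 0; i < n; i++)
        dst[i] = base[offsets[i]];
}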


Krste


Bruce Hoult
 

Nice.




David Horner
 


On 2020-04-16 11:02 p.m., Krste Asanovic wrote:

> Using the width field, EEW is encoded as one of {8,16,32,SEW}.  This
> now determines *both* the register element size and the memory element
> size, where previously it only set the memory element size and
> sign/zero extended this into the SEW-width register element.

What about a SEW scaling factor instead: 1/4, 1/2, 1, and 2?  This allows a much expanded dynamic range and addresses most scaling concerns.

It allows a 2*SEW load for the vwop.wv wide source, and a 2*SEW store of all widened results.

And it allows source loads for 4x and 2x widening to the current SEW, and even 8x widening to 2*SEW, which as noted above can be the source and destination width for the widening instructions.
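
A minimal C sketch of the decode arithmetic under this SEW-scaling idea; the
two-bit encoding of the scale and the legality bounds are assumptions for
illustration only.

/* Width field selects a scale of SEW rather than an absolute EEW.
   Assumed encoding: 0 -> SEW/4, 1 -> SEW/2, 2 -> SEW, 3 -> 2*SEW.
   LMUL is carried in eighths so fractional values stay integral. */
static const struct { unsigned num, den; } scale[4] =
    { {1, 4}, {1, 2}, {1, 1}, {2, 1} };

static int scaled_widths(unsigned code, unsigned sew, unsigned lmul_eighths,
                         unsigned *eew, unsigned *elmul_eighths)
{
    unsigned e = sew * scale[code].num;
    unsigned l = lmul_eighths * scale[code].num;
    if (e % scale[code].den || l % scale[code].den)
        return 0;                                   /* not representable */
    *eew = e / scale[code].den;
    *elmul_eighths = l / scale[code].den;
    /* Values outside the supported ranges would raise illegal instruction. */
    return *eew >= 8 && *eew <= 64 && *elmul_eighths >= 1 && *elmul_eighths <= 64;
}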


> Effective LMUL (ELMUL) is now calculated as ELMUL = (EEW/SEW)*LMUL to
> keep SEW/LMUL constant. If this results in a bad LMUL value, an
> illegal instruction exception is raised.

With SEW scaling this becomes ELMUL = scale*LMUL.

> The effective EEW/ELMUL setting is only in effect for the single
> instruction and does not change values in the vtype CSR.


yes.

> Note this approach also helps floating-point code, whereas
> byte/halfword/word load/stores do not.

yes.

> I'm using vle32 syntax to mirror the assembler syntax for vsetvli e32 etc.

For SEW scaling I don't have any solid nomenclature suggestions, but it could parallel lmul: lf4, lf2, l1, l2 (like I said, no good ideas).

> I think this also solves our indexed load/store problem.  We use
> vtype.SEW to encode the data elements, but use the width-field-encoded
> EEW for indices.  One wrinkle is that the largest EEW encoding
> now indicates 64b not SEW, i.e., index EEW is {8,16,32,64}.

I don't believe removing SEW from the index encoding is problematic for indexed load/stores.

The program will in almost all cases know the precision of its offsets.

Indeed, it is arguable whether dynamically SEW-sized indices have any practical application.

Rather, I see the wrinkle as being that indexed load/stores would not support the scaled (SEW-relative) element mode present in the others.

Given the field has been repurposed for the index width only, it is even less of a problematic wrinkle that SEW is dropped.





Bill Huffman
 

Agree.  It seems quite nice.

     Bill



Bill Huffman
 

I wonder if we might encode EEW/SEW in the instruction instead of encoding EEW.

We might encode SEW/4, SEW/2, SEW, and 2*SEW as the EEW widths. 

Another alternative would be to have an additional state which gave us a choice of encoding for EEW:

  • SEW/8, SEW/4, SEW/2, SEW;
  • SEW/4, SEW/2, SEW, 2*SEW;
  • SEW/2, SEW, 2*SEW, 4*SEW; or
  • SEW, 2*SEW, 4*SEW, 8*SEW

This would allow working with either narrower or wider variations.

We have generally had SEW match the narrowest of the active widths in widening or narrowing instructions.  Does that suggest that we would do well to have EEW usually be wider than SEW?  And that having only one set would argue for the fourth bullet above?
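
A small C sketch of how such a mode-selected decode might look; the mode
state itself, the field widths, and the table layout are all assumptions for
illustration, not part of the proposal.

/* A hypothetical 2-bit mode selects one of the four EEW sets above, and the
   instruction's 2-bit width field indexes into that set.  Entries hold
   log2(EEW/SEW): -3 means SEW/8, +3 means 8*SEW. */
static const int eew_log2_ratio[4][4] = {
    { -3, -2, -1, 0 },   /* SEW/8, SEW/4, SEW/2, SEW   */
    { -2, -1,  0, 1 },   /* SEW/4, SEW/2, SEW,   2*SEW */
    { -1,  0,  1, 2 },   /* SEW/2, SEW,   2*SEW, 4*SEW */
    {  0,  1,  2, 3 },   /* SEW,   2*SEW, 4*SEW, 8*SEW */
};

static unsigned decode_eew(unsigned mode, unsigned width_field, unsigned sew)
{
    int r = eew_log2_ratio[mode & 3][width_field & 3];
    return r >= 0 ? sew << r : sew >> -r;   /* legality checked separately */
}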

      Bill



Roger Espasa
 

I really like the fact that this solves indexed load/stores.
It's a pretty good proposal overall, and I think it beats adding a "quad" extension and then having a debate over "quad-with-add", "quad-with-sub", "quad-with-<insert your favorite>".

roger



David Horner
 

On Fri, Apr 17, 2020, 03:03 Bill Huffman, <huffman@...> wrote:

> I wonder if we might encode EEW/SEW in the instruction instead of encoding EEW.
>
> We might encode SEW/4, SEW/2, SEW, and 2*SEW as the EEW widths.

I believe these widths are the appropriate ones; see the explanation above.
The rationale is that widening (and narrowing) instructions are already SEW-based, as are single-width instructions, so expanding narrowed data encodings up to SEW is the common case, with 1/2 and 1/4 the most common ratios.  SEW/8 can often be supported by using a SEW/4 load, the 4x convert from SEW/4 to SEW, and vwop.wv operations with a 2*SEW store.



Thang Tran
 

I sent this to the wrong email list, so I am resending it here.

 

The cost of narrow-to-wide conversion is much easier to handle with vlb than with vector arithmetic instructions.

In Bob's example, it is one instruction to load the data and expand it to 32b.

Example:

1.  LMUL=8, SEW=32, VLEN=128: loading 32 bytes -> 32 elements of 32b.  In an implementation, it is still 8 vector registers of load data.

2.  With vlb8, data is loaded into 2 vector registers, plus a vw8to32 which is another 8 vector registers of operations.

 

I see 2 issues:

-  For LMUL=8, the number of vector registers is very limited; there are no 2 extra registers to use without seriously affecting performance with all the renaming.

-  Two extra writes to vector registers will kill the performance.

 

Thanks, Thang

 
