I sent this to the wrong email list, so I am resending it here.
The cost of narrow-to-wide conversion is much easier to handle with vlb than with vector arithmetic instructions. In Bob's example, it is one instruction to load the data and expand it to 32b.
Example:
1. LMUL=8, SEW=32, VLEN=128: loading 32 bytes gives 32 elements of 32b. In the implementation, it still takes 8 vector registers to hold the loaded data.
2. With vlb8, the data is loaded into 2 vector registers, plus a vw8to32, which is another 8 vector-register operations (see the sketch below).
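A rough sketch of the two sequences being compared, assuming VLEN=128 and vl=32; vlb.v is the draft spec's widening byte load, while vle8.v (Thang's "vlb8") and vw8to32.v are just the mnemonics used in this discussion, not defined instructions:
vsetvli t0, a0, e32,m8 # SEW=32, LMUL=8: the 32b result group is v8-v15
# (1) One widening load: 32 bytes in memory become 32 elements of 32b in v8-v15
vlb.v v8, (a1)
# (2) EEW=8 load plus separate widen: the same 32 bytes first land in v6-v7 (EEW=8, ELMUL=2),
# then the 8-to-32 widen writes another 8 registers
vle8.v v6, (a1)
vw8to32.v v8, v6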
I see 2 issues:
- For LMUL=8, the number of available vector registers is very limited; there are no 2 extra registers to spare without seriously affecting performance, given all the renaming.
- The two extra writes to vector registers will kill performance.
Thanks, Thang
From: tech-vector-ext@... [mailto:tech-vector-ext@...]
On Behalf Of Bill Huffman
Sent: Thursday, April 16, 2020 11:22 PM
To: Bruce Hoult <bruce@...>; Krste Asanovic <krste@...>
Cc: tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Effective element width encoding in vector load/stores
Agree. It seems quite nice.
Bill
|
|
On Fri, Apr 17, 2020, 03:03 Bill Huffman <huffman@...> wrote:
I wonder if we might encode EEW/SEW in the instruction instead of encoding EEW.
We might encode SEW/4, SEW/2, SEW, and 2*SEW as the EEW widths.
I believe these widths are the appropriate ones; see the explanation above. The rationale is that widening (and narrowing) instructions are already SEW-based, as are single-SEW instructions, so expanding narrowed data encodings to SEW is the common case, with 1/2 and 1/4 the most common ratios. The 1/8 case can often be supported by using vwop.wv operations in conjunction with a SEW/4 load and the by-4 convert from SEW/4 to SEW, with a 2*SEW store.
Another alternative would be to have an additional state which gave us a choice of encoding for EEW:
- SEW/8, SEW/4, SEW/2, SEW;
- SEW/4, SEW/2, SEW, 2*SEW;
- SEW/2, SEW, 2*SEW, 4*SEW; or
- SEW, 2*SEW, 4*SEW, 8*SEW
This would allow working with either narrower or wider variations.
We have generally had SEW match the narrowest of the active widths in widening or narrowing instructions. Does that suggest that we would do well to have EEW usually be wider than SEW, and that having only one set would favor the fourth bullet above?
Bill
Roger Espasa
I really like the fact that this solves indexed accesses. It's a pretty good proposal overall, and I think it beats adding a "quad" extension and then a debate over "quad-with-add", "quad-with-sub", "quad-with-<insert your favorite>".
roger
|
|
I wonder if we might encode EEW/SEW in the instruction instead of encoding EEW.
We might encode SEW/4, SEW/2, SEW, and 2*SEW as the EEW widths.
Another alternative would be to have an additional state which gave us a choice of encoding for EEW:
- SEW/8, SEW/4, SEW/2, SEW;
- SEW/4, SEW/2, SEW, 2*SEW;
- SEW/2, SEW, 2*SEW, 4*SEW; or
- SEW, 2*SEW, 4*SEW, 8*SEW
This would allow working with either narrower or wider variations.
We have generally had SEW match the narrowest of the active widths in widening or narrowing instructions. Does that suggest that we would do well to have EEW usually be wider than SEW, and that having only one set would favor the fourth bullet above?
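(A concrete illustration, not from the thread: with vtype SEW=32, the SEW/4, SEW/2, SEW, 2*SEW set selects EEW = 8, 16, 32, or 64, while with SEW=8 the fourth bullet's SEW, 2*SEW, 4*SEW, 8*SEW set selects the same four widths.)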
Bill
|
|
Agree. It seems quite nice.
Bill
|
|
On 2020-04-16 11:02 p.m., Krste Asanovic wrote:
Using the width field, EEW is encoded as one of {8,16,32,SEW}. This now determines *both* the register element size and the memory element size, where previously it only set the memory element size and sign/zero extended this into the SEW-width register element.
What about a SEW scaling factor instead: 1/4, 1/2, 1, and 2? This allows a much expanded dynamic range and addresses most scaling concerns. It allows 2*SEW for vwop.wv source loads and for storing all widened results. And it allows source loads for 4x widening and 2x widening to the current SEW, and even 8x widening to 2*SEW, which as noted above can be the source and destination for the widening instructions.
Effective LMUL (ELMUL) is now calculated as ELMUL = (EEW/SEW)*LMUL to keep SEW/LMUL constant.
With SEW scaling this becomes ELMUL = EEW*LMUL.
If this results in a bad LMUL value, an illegal instruction exception is raised.
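(A concrete illustration, not from the message: with vtype SEW=32 and LMUL=4, a 1/4 scale gives EEW=8 and ELMUL=1, while a 2x scale gives EEW=64 and ELMUL=8; EEW/ELMUL stays equal to SEW/LMUL in both cases.)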
The effective EEW/ELMUL setting is only in effect for the single instruction and does not change values in the vtype CSR.
yes.
Note this approach also helps floating-point code, whereas byte/halfword/word load/stores do not.
yes.
I'm using vle32 syntax to mirror the assembler syntax for vsetvli e32 etc.
For SEW scaling I don't have any solid nomenclature suggestions, but it could parallel lmul: lf4, lf2, l1, l2 (like I said, no good ideas).
I think this also solves our indexed load/store problem. We use vtype.SEW to encode the data elements, but use the width-field-encoded EEW for indices. One wrinkle is that the largest EEW encoding now indicates 64b not SEW, i.e., index EEW is {8,16,32,64}.
I don't believe removing SEW from the index encoding is problematic for indexed load/stores. The program will in almost all cases know the precision of its offsets. Indeed, it is arguable whether dynamic SEW has any practical application here. Rather, I see the wrinkle as being that indexed load/stores do not support the scaled element mode present in the others. Given the field has been repurposed for the index width only, it is even less of a problematic wrinkle that SEW is dropped.
Krste Asanovic
There are two separate issues noted with the proposal to remove fixed-size vector load/stores. One is the additional vsetvli instructions needed, and the second is the additional widening instructions required. We've discussed adding more widening instructions to help with the latter. I have a proposal below to help with the former in a way that also improves FP, and which provides a solution to the indexed vector index-size wart we've had for a while.
This proposal still only supports packed load/stores, as opposed to unpacked load/stores with sign/zero extension. However, the problematic instruction overhead of many additional vsetvli instructions when simply removing fixed-size load/stores is avoided by repurposing the width field to encode the "effective" element width (EEW) for the current vector load/store instruction.
Using the width field, EEW is encoded as one of {8,16,32,SEW}. This now determines *both* the register element size and the memory element size, where previously it only set the memory element size and sign/zero extended this into the SEW-width register element.
Effective LMUL (ELMUL) is now calculated as ELMUL = (EEW/SEW)*LMUL to keep SEW/LMUL constant. If this results in a bad LMUL value, an illegal instruction exception is raised.
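(A worked example using the numbers from the loop below: with vtype SEW=16 and LMUL=4, a vle32 load has EEW=32, so ELMUL = (32/16)*4 = 8, the same e32,m8 setting that the explicit vsetvli instructions supply in the original code; an EEW=8 load under the same vtype would get ELMUL = (8/16)*4 = 2.)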
The effective EEW/ELMUL setting is only in effect for the single instruction and does not change values in the vtype CSR.
This doesn't add any real hardware complexity, as the underlying microarchitectural operations were all supported anyway; it just streamlines loading/storing of different element widths with the same SEW/LMUL ratio in the same loop. Our widening/narrowing operations already sort of do this.
So for example for unit-stride, we can reduce the original code:
# Widening C[i]+=A[i]*B[i], where A and B are FP16, C is FP32
vsetvli t0, a0, e16,m4 # vtype SEW=16b
vle.v v4, (a1)
vle.v v8, (a2)
vsetvli x0, x0, e32,m8
vle.v v16, (a3)
vsetvli x0, x0, e16,m4
vfwmacc.vv v16, v4, v8 # EEW,ELMUL of result is 2*SEW,2*LMUL
vsetvli x0, x0, e32,m8
vse.v v16, (a3) # EEW=32b, no vsetvli needed
down to:
vsetvli t0, a0, e16,m4 # vtype SEW=16b
vle.v v4, (a1)
vle.v v8, (a2)
vle32.v v16, (a3) # Get previous 32b values, no vsetvli needed.
vfwmacc.vv v16, v4, v8 # EEW,ELMUL of result is 2*SEW,2*LMUL
vse32.v v16, (a3) # EEW=32b, no vsetvli needed
removing three vsetvli instructions from the inner loop.
Note this approach also helps floating-point code, whereas byte/halfword/word load/stores do not.
I'm using vle32 syntax to mirror the assembler syntax for vsetvli e32 etc.
I think this also solves our indexed load/store problem. We use vtype.SEW to encode the data elements, but use the width-field-encoded EEW for indices. One wrinkle is that the largest EEW encoding now indicates 64b not SEW, i.e., index EEW is {8,16,32,64}.
# Load 32b values using 8b offsets:
vsetvli t0, a0, e32,m4
vlx8.v v8, (a1), v7 # Load 32b values into v8-11, using EEW=8,ELMUL=1 vector in v7
# Load 8b values using 64b offsets:
vsetvli t0, a0, e8,fl2
vlx64.v v1, (a1), v8 # EEW=64,ELMUL=4 indices in v8-v11, SEW=8,LMUL=1/2 in v1,
Krste