Vector TG meeting minutes 2020/4/03


Krste Asanovic
 

Date: 2020/4/03
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~15
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed: #354/362

The following issues were discussed.

Closing on version v0.9. A list of proposed changes to form v0.9 was
presented. The main dispute was around dropping byte/halfword/word
vector load/stores.

#354/362 Drop byte/halfword/word vector load/stores

Most of the meeting time was spent discussing this issue, which was
contentious.

Participants in favor of retaining these instructions were concerned
about the code size and performance impact of dropping them.
Proponents of dropping them noted that the main impact was
only for integer code (floating-point code does not benefit from these
instructions), that performance might be lower using these
instructions rather than widening, and that there was a large benefit
in reducing memory pipeline complexity. The group was going to
consider some examples to be supplied by the members, including some
mixed floating-point/integer code.

Discussion to continue on mailing list.


Nick Knight
 

On Sat, Apr 4, 2020 at 1:43 PM Krste Asanovic <krste@...> wrote:
#354/362 Drop byte/halfword/word vector load/stores

[...]

Participants in favor of retaining these instructions were concerned
about the code size and performance impact of dropping them.

On GitHub, this is Issue #362.

While I'm generally in favor of dropping them, I am aware it will pose a challenge to several applications if, additionally, indexed loads and stores switch to XLEN-width indices (see Issue #306, Issue #381, and PR #401 for background).

My particular concern is related to "index compression", a general software optimization to reduce storage and memory bandwidth costs.

For example, the digit-reversal permutations in Cooley-Tukey fast Fourier transforms are typically index-compressed. Without widening loads, we'll need to manually widen the permutation indices out to width XLEN before using them in a gather or scatter.

An analogous example appears in sparse matrix codes. However, as Nagendra Gulur commented (in email to this list on 10 March), this example is more complex since the indices typically also need to be left-shifted by sizeof(matrix_element_t) to obtain byte offsets. Thus, we'll need a combination of widening multiplies and widening adds. (Note that there's currently no widening left shift.)
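
To sketch this (roughly; v0.8-era mnemonics, hypothetical register and array names, and a widening multiply standing in for the missing widening shift), turning 16-bit compressed indices into 64-bit byte offsets for a gather of 8-byte elements might look like:

    li        t0, 8               # sizeof(matrix_element_t), assumed to be 8
    vsetvli   t1, a1, e16,m1      # a1 = number of indices in this strip
    vle.v     v2, (a0)            # a0 = compressed 16-bit index array
    vwmulu.vx v4, v2, t0          # widening multiply: 32-bit byte offsets in v4-v5
    vsetvli   t2, t1, e32,m2      # same vl; view the offsets as 32-bit elements
    vwaddu.vx v8, v4, x0          # widening add of zero: 64-bit offsets in v8-v11
    vsetvli   t2, t1, e64,m4
    vlxe.v    v16, (a2), v8       # indexed gather of 64-bit data; a2 = base pointer

The repeated vsetvli calls keep the same vl while the element width (and LMUL) grows at each widening step.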

A third example, which concerns only the XLEN-width indices in indexed loads/stores and so is less relevant here, was mentioned on GitHub.

Best,
Nick Knight


Thang Tran
 

There are real application codes (mixed integer/FP, where convert instructions are used) written with byte/halfword/word loads/stores. There is a huge performance impact from adding a widening instruction in a small critical loop, where every additional instruction causes a >10% impact on performance.

I am strongly against dropping the byte/halfword/word loads/stores.

Thanks, Thang

-----Original Message-----
From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
Sent: Saturday, April 4, 2020 1:43 PM
To: tech-vector-ext@...
Subject: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03


[...]


Nick Knight
 

Hi Thang,

Can you, and anyone else who responds, please be concrete about the applications you have in mind? I tried to do so in my email.

In my opinion, concrete examples are crucial to making an informed decision. I hope you agree.

Best,
Nick Knight


On Sat, Apr 4, 2020 at 4:56 PM Thang Tran <thang@...> wrote:
[...]


Thang Tran
 

Hi Nick,

 

It is confidential customer application code.

 

Thanks, Thang

 

From: Nick Knight [mailto:nick.knight@...]
Sent: Saturday, April 4, 2020 5:04 PM
To: Thang Tran <thang@...>
Cc: Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03

 


[...]


Alex Solomatnikov
 


Bob Dreyer said he would share some example code.

Do you really have a 2x or 4x wider write port to the vector register file to make vlb and the like work at full memory bandwidth?

If yes, what is the impact on PPA, i.e. clock frequency, area, power?

If not, then the extra widening instruction would not matter, because vlb itself is the bottleneck.

Alex

On Sat, Apr 4, 2020 at 5:19 PM Thang Tran <thang@...> wrote:

[...]


Thang Tran
 

In scalar code, there is always sign/zero extension of the data, as well as alignment. I do not see a difference with vector loads/stores. If alignment is needed, sign/zero extension adds little cost, and an extra pipeline stage is added.

 

Depending on how the load is pipelined, the load-to-use penalty may be zero. So, widening is much preferred in our design.

 

Thanks, Thang

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Alex Solomatnikov
Sent: Saturday, April 4, 2020 7:09 PM
To: Thang Tran <thang@...>
Cc: Nick Knight <nick.knight@...>; Krste Asanovic <krste@...>; tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03

 

 

[...]


David Horner
 

I agree, Nick.

So here is a suggestion, not completely facetious:

For a byte/half/word load (for example, when SEW = 64), an implementation can optimize the sequence:

    strided load by 1/2/4
    shift left by 56/48/32
    arithmetic shift right by 56/48/32

but a sign-extend byte/half/word to SEW instruction would make fusing/chaining simpler.
And these work without widening.
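
For the byte case at SEW = 64, a rough sketch (v0.8-era mnemonics; register names illustrative; note that the overlapping 64-bit loads read up to 7 bytes past the last element):

    li      t3, 1              # byte stride
    li      t4, 56             # shift amount: 64 - 8
    vsetvli t1, t2, e64
    vlse.v  v4, (a0), t3       # overlapping 64-bit loads, one per source byte
    vsll.vx v4, v4, t4         # move each target byte to the top of its element
    vsra.vx v4, v4, t4         # arithmetic shift back: sign-extended bytes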


For stores:

A "pack" instruction that packs the SEW (byte/half/word) elements by SLEN into the appropriate LMUL = 1/8, 1/4, or 1/2 would allow a standard unit-stride store to work.

A fractional LMUL that uses interleave (rather than right-justified SLEN chunks) would not need this pack instruction.


On 2020-04-04 8:04 p.m., Nick Knight wrote:

Hi Thang,

Can you, and anyone else who responds, please be concrete about the applications you have in mind? I tried to do so in my email.

In my opinion, concrete examples are crucial to making an informed decision. I hope you agree.

Best,
Nick Knight

On Sat, Apr 4, 2020 at 4:56 PM Thang Tran <thang@...> wrote:
There are real application (mixed integer/FP - convert instruction is used) codes written with load/store byte/halfword/word. There is a huge performance impact by adding widening instruction in a small critical loop where every additional instruction causes > 10% impact on performance.

I am strongly against dropping the byte/halfword/word for load/store.

Thanks, Thang

-----Original Message-----
From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
Sent: Saturday, April 4, 2020 1:43 PM
To: tech-vector-ext@...
Subject: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03


Date: 2020/4/03
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~15
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed: #354/362

The following issues were discussed.

Closing on version v0.9. A list of proposed changes to form v0.9 were presented.  The main dispute was around dropping byte/halfword/word vector load/stores.

#354/362 Drop byte/halfword/word vector load/stores

Most of the meeting time was spent discussing this issue, which was contentious.

Participants in favor of retaining these instructions were concerned about the code size and performance impact of dropping them.
Proponents in favor of dropping them noted that the main impact was only for integer code (floating-point code does not benefit from these instructions), that performance might be lower using these instructions rather than widening, and that there was a large benefit in reducing memory pipeline complexity.  The group was going to consider some examples to be supplied by the members, including some mixed floating-point/integer code.

Discussion to continue on mailing list.








Krste Asanovic
 

These are basic operations, not application kernels.

It's easy to call out missing instructions when considering individual
operations.

It's more important to gather and evaluate actual application kernels.

Krste

On Sat, 4 Apr 2020 23:25:13 -0400, "David Horner" <ds2horner@...> said:
| [...]


David Horner
 

I agree: it's more important to gather and evaluate actual application kernels.
Is there such an effort ongoing?

I further agree with the implicit idea that much, even most, of the processing in any given kernel can occur in fractional and the lower LMUL >= 1 modes.
Fractional LMUL in these cases is mostly "set up" for the inevitable but usually deferred widening, performed only as needed and no earlier.
RISC-V-tuned kernels can incorporate these efficiencies.

However, there will be a lag before such kernels are developed and widely used.
Further, existing code ported to RISC-V cannot be expected to be optimal in this way.

Consider CoreMark with RV64I; it is not the only program that uses poor coding practices.

We can expect other programs with such biases to be used to challenge RVV.
Not only that, but there is much other code in the wild that is adversely affected by not having an efficient load to double from byte, half, or word.

Especially bit-manipulation logic.

Consider a memory structure of a word (to load) accompanied by a byte carrying decode, sign, and scale bits.

    vsetvli t01, t02, e64

    vlwu.v  v4, (xarray)        # load words, zero-extended to 64 bits
    vlb.v   v5, (xarryscale)    # load sign/scale bytes, sign-extended to 64 bits
    vxor.vv v4, v4, v5          # apply sign bit and "decode" shift bits
    vsll.vv v4, v4, v5          # scale by the lower 6 bits
                                # (one bit is unused; 1-bit shifts for each of v5
                                #  and v4 could insert it, but you get the idea)

Granted, the program could use a different memory layout, with 8-way interleaved shift-decode bytes that are loaded with a byte offset and arithmetic-right-shifted by 56, processed in sets of 8.
But that is a fundamental change to the memory layout, which may also be committed to disk and shared with other processes or archives.

Without an efficient mechanism to load to double from word and byte, this type of operation (general bit manipulation, scaling, etc.) is substantially hampered.

The vlb.v could be replaced with the strided vlse.v plus the shift-left and arithmetic-shift-right sequence previously described (and quoted by Krste below).

And the vlwu.v could be replaced by an LMUL=1/2 word load followed by a widening vwaddu.wv into a previously zeroed double-width group.

Neither of these is efficient.
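
For illustration, the vwaddu.wv alternative might look something like this (a rough sketch only; the lf2 fractional-LMUL notation follows the example below, and register names are illustrative):

    vsetvli   t01, t02, e64        # 64-bit elements
    vmv.v.i   v8, 0                # zero a 64-bit register group to add into
    vsetvli   t03, t01, e32, lf2   # same vl; 32-bit words at LMUL=1/2
    vle.v     v4, (xarray)         # unit-stride word load
    vwaddu.wv v12, v8, v4          # v12 (64-bit) = 0 + zero-extended word

That is several extra instructions per strip just to recover what vlwu.v did in one.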

That is why I proposed #413.

(see #411, which elucidates clustering in fractional LMUL)

https://github.com/riscv/riscv-v-spec/issues/413
https://github.com/riscv/riscv-v-spec/issues/411
CLSTR and clstr: width specifiers for the data cluster in each SLEN chunk (when LMUL <= 1/2).

Expand only when needed; compensate for the implementation-specific clustering.

The above becomes:

    vsetvli t01, t02, e32, lf2   # word elements in LMUL=1/2
    vle.v   v4, (xarray)
    dclstru.v v4, v4             # zero- (unsigned) extend to double (unclustered)

    vsetvli t01, t02, e8, lf8    # byte elements in LMUL=1/8
    vle.v   v5, (xarryscale)
    dclstr.v  v5, v5             # sign-extend to double (unclustered)

    vsetvli t01, t02, e64
    vxor.vv v4, v4, v5           # apply sign bit and "decode" shift bits
    vsll.vv v4, v4, v5           # scale by the lower 6 bits

For convenience I post #413 here:

cluster/decluster instructions: with LMUL<1 loads/stores provide byte/half/word support.

Two new vector-vector unary unmasked instructions, vdclstr and vclstr, undo/apply the clustering specified in clstr. ***

Using the given SEW width as cluster element width and LMUL<1 as the expansion factor:

For each SLEN chunk in vs2, the vclstr instruction selects an SEW-width field from each SEW/LMUL element, concatenates the SEW elements into LMUL<1 clusters, and stores the result into vd.

vdclstr does the reverse: it changes vs2 from LMUL clustering into SEW/LMUL-width interleaved elements (effectively CLSTR=1) and stores the result into vd.

This operation can be performed in place with vd = vs2, as each operation is on an SLEN chunk.
When CLSTR=1, the elements can be operated upon one at a time.

See #411 for the specifics of cluster fields and gaps.

For vdclstr, the available options for gap fill are undisturbed, agnostic, zero fill and sign extend.
For vclstr the options are undisturbed, agnostic and zero fill. ****

I am proposing this as unmasked only; vm=0 is reserved. There are potential difficulties with disparate mask structures between clustered and interleaved layouts when the effective CLSTR > 1.

An obvious use of the vdclstr instruction is coupling it with a load to emulate byte/half/word loads to SEW. It is low overhead and can be chained/fused.

When vd = vs2, clstr is zero (CLSTR=1) and fill is undisturbed:
vclstr and vdclstr are nops.

The sequence vclstr and vdclstr with vd = vs2 can be fused to provide byte, half, and word sign- or zero-extension up to double-word SEW. If not constrained by the strawman model, zero- or sign-extension of SEW to 2*SEW, 4*SEW, or 8*SEW is possible.

There are numerous other uses, especially if SEW is not constrained by the strawman model,
e.g. a splat of 1s followed by a vdclstr will NaN-box a float into a double float.

*** The current vector-vector unary class only supports float. It is possible that these could live in that encoding, but likely would be allocated in their own unary group. I will model the encoding after VFUNARY1.

**** I don't see much value in fill-1s, and less for sign-extend (which element's sign would be extended?), and it would lend itself to CLSTR-specific idiosyncratic coding even more than zero fill does.

Also relates to #362


On 2020-04-12 6:28 a.m., krste@... wrote:

[...]