Re: Vector TG meeting minutes 2020/4/03

David Horner

I agree, it's more important to gather and evaluate actual application kernels.
Is there such an effort ongoing?

I further agree with the implicit idea that much, even most, of the processing in any given kernel can occur in fractional LMUL and the lower LMUL>=1 modes.
Fractional LMUL in these cases is mostly "set up" for the inevitable but usually deferred widening, performed only as needed and no earlier.
RISC-V-tuned kernels can incorporate these efficiencies.

However, there will be a lag before such kernels are developed and widely used.
Further, existing code ported to RISC-V cannot be expected to be optimal in this way.

Consider CoreMark on RV64I; it is not the only program that uses poor coding practices.

We can expect other programs with such biases to be used to challenge RVV.
Not only that, but there is much other code in the wild that is adversely affected by not having an efficient load to double from byte, half or word.

Especially bit manipulation logic.

Consider a memory structure of a word (to load) accompanied by a byte holding decode, sign and scale-factor bits.

    vsetvli t01,t02,e64

    vlwu.v   v4,(xarray)
    vlb.v  v5,(xarryscale)
    vxor.vv v4,v4,v5          /* apply sign bit and "decode" shift bits */
    vsll.vv v4,v4,v5          /* scale by lower 6 bits */
                              /* (one bit is unused; 1-bit shifts for each of v5 and v4 could insert it, but you have the idea) */
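
As a scalar reference for what the sequence above computes per element, here is a minimal Python sketch. The function names (`sign_extend`, `decode_and_scale`) are made up for illustration, and the sketch assumes SEW=64, a zero-extended 32-bit word (vlwu.v), a sign-extended control byte (vlb.v), and a shift amount taken from the low 6 bits of the control value, as the comments in the assembly describe.

```python
MASK64 = (1 << 64) - 1  # models a 64-bit (SEW=64) element


def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode_and_scale(words, scales):
    """Per-element model of: vlwu.v, vlb.v, vxor.vv, vsll.vv at SEW=64."""
    out = []
    for w, s in zip(words, scales):
        wide = w & 0xFFFFFFFF               # vlwu.v: zero-extend word to 64 bits
        ctl = sign_extend(s, 8) & MASK64    # vlb.v: sign-extend byte to 64 bits
        x = wide ^ ctl                      # vxor.vv: apply sign and decode bits
        out.append((x << (ctl & 63)) & MASK64)  # vsll.vv: shift by low 6 bits
    return out
```

For instance, a word of 1 with a control byte of 0x02 gives (1 ^ 2) << 2, and a control byte of 0x80 flips bit 7 and every bit above it while shifting by zero.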

Granted, the program could use a different memory layout, with 8-way interleaved shift-decode bytes that are loaded with a byte offset and arithmetic right shifted by 56, processed in sets of 8.
But that is a fundamental change to the memory layout, which may also be committed to disk, other processes or archives.

Without an efficient mechanism to load to double from word and byte this type of operation (general bit manipulation, scaling, etc.) is substantially hampered.

The vlb.v could be replaced with the strided vlse.v followed by a shift left and an arithmetic shift right, as previously described (and quoted by Krste below).

And the vlwu.v could be replaced by an LMUL=1/2 word load followed by a widening vwaddu.wv into a previously zeroed double.

Neither of these is efficient.
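
To make the two emulations concrete, here is a small Python sketch of what each computes per 64-bit element. The helper names are hypothetical, and the model assumes the strided load leaves the wanted byte/half/word in the low bits of the element with unrelated memory contents above it.

```python
MASK64 = (1 << 64) - 1  # models a 64-bit (SEW=64) element


def sext_by_shifts(raw, bits, sew=64):
    """Sign-extend the low `bits` of `raw` via shift left, then arithmetic shift right."""
    shift = sew - bits                       # 56 / 48 / 32 for byte / half / word
    shifted = (raw << shift) & ((1 << sew) - 1)
    if shifted >> (sew - 1):                 # reinterpret the element as signed
        shifted -= 1 << sew
    return shifted >> shift                  # Python's >> on negative ints is arithmetic


def zext_by_widening_add(word):
    """Zero-extend a 32-bit word by adding it into a zeroed 64-bit accumulator."""
    acc = 0                                  # the previously zeroed double
    return (acc + (word & 0xFFFFFFFF)) & MASK64  # models vwaddu.wv
```

Both produce the right values; the cost is the extra instructions (two shifts, or a zeroing plus a widening add) on top of the load in every iteration.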

That is why I proposed #413.

(see #411 for elucidating clustering in fractional LMUL)
CLSTR and clstr: width specifiers for data cluster in each SLEN chunk. (when LMUL<=1/2) 

Expand only when needed, and compensate for the implementation-specific clustering.

The above becomes:

    vsetvli t01,t02,e32,lf2  /* word in LMUL=1/2 */
    vle.v   v4,(xarray)
    dclstru.v v4,v4          /* unsigned extend to double (unclustered) */

    vsetvli t01,t02,e8,lf8   /* byte in LMUL=1/8 */
    vle.v   v5,(xarryscale)
    dclstr.v v5,v5           /* sign extend to double (unclustered) */

    vsetvli t01,t02,e64
    vxor.vv v4,v4,v5         /* apply sign bit and "decode" shift bits */
    vsll.vv v4,v4,v5         /* scale by lower 6 bits */

For convenience I post #413 here:

cluster/decluster instructions: with LMUL<1 loads/stores provide byte/half/word support.

Two new vector-vector unary unmasked instructions, vdclstr and vclstr, undo / apply the clustering specified in clstr. ***

Using the given SEW width as cluster element width and LMUL<1 as the expansion factor:

For each SLEN chunk in vs2, the vclstr instruction selects the SEW-width field from each SEW/LMUL-width element, concatenates the SEW elements into an LMUL<1 cluster, and stores the result into vd.

vdclstr does the reverse: it changes vs2 from LMUL clustering into SEW/LMUL-width interleaved elements (effectively CLSTR=1) and stores the result into vd.

This operation can be performed in place with vd = vs2, as each operation is on an SLEN chunk.
When CLSTR=1 then the elements can be operated upon one at a time.

See #411 for the specifics of cluster fields and gaps.

For vdclstr, the available options for gap fill are undisturbed, agnostic, zero fill and sign extend.
For vclstr the options are undisturbed, agnostic and zero fill. ****
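
A toy Python model of the two operations on a single SLEN chunk may help. It is only a sketch under simplifying assumptions: the chunk is represented as a list of fields rather than as bit positions (the actual field placement and gaps are defined in #411 and are not modelled), only the zero-fill and sign-extend options are shown, and the function names and `fill` parameter are invented for illustration.

```python
def vdclstr_chunk(cluster, sew, factor, fill="zero"):
    """Spread sew-bit clustered fields into (sew * factor)-bit elements.

    `factor` is 1/LMUL (2, 4 or 8); `fill` is "zero" or "sign".
    """
    wide = sew * factor
    narrow_mask = (1 << sew) - 1
    upper_mask = ((1 << wide) - 1) ^ narrow_mask  # the gap bits above the field
    out = []
    for field in cluster:
        field &= narrow_mask
        if fill == "sign" and field >> (sew - 1):
            field |= upper_mask                   # replicate the sign bit into the gap
        out.append(field)
    return out


def vclstr_chunk(elements, sew):
    """Select the low sew-bit field of each wide element and pack them together."""
    return [e & ((1 << sew) - 1) for e in elements]
```

In this model vclstr after vdclstr returns the original fields regardless of the fill chosen, which is the round trip that makes a byte load at LMUL=1/8 followed by vdclstr behave like a load-byte-to-double.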

I am proposing this as unmasked only. vm=0 is reserved. There are potential difficulties with disparate mask structure between clustered and interleaved when effective CLSTR > 1.

An obvious use of the vdclstr instruction is coupled with a load to emulate byte/half/word to SEW. It is low overhead and can be chained/fused.

When vd = vs2, clstr is zero (CLSTR=1) and fill is undisturbed:
vclstr and vdclstr are nops.

The sequence vclstr and vdclstr with vd = vs2 can be fused to provide byte, half and word sign- or zero-extend up to double word SEW. If not constrained by the strawman model, zero- or sign-extension of SEW to 2 * SEW, 4 * SEW or 8 * SEW are possible.

There are numerous other uses especially if SEW is not constrained by the strawman model.
e.g. a 1 splat followed by a vdclstr will NaN-box float into double float.

*** The current vector-vector unary class only supports float. It is possible that these could live in that encoding, but likely would be allocated in their own unary group. I will model the encoding after VFUNARY1.

**** I don’t see much value in fill1s and less for sign-extend: which element’s sign to extend? and it would lend itself to specific CLSTR idiosyncratic coding even more so than zero fill does.

Also relates to #362

On 2020-04-12 6:28 a.m., krste@... wrote:

These are basic operations, not application kernels.

It's easy to call out missing instructions when considering individual

It's more important to gather and evaluate actual application kernels.


On Sat, 4 Apr 2020 23:25:13 -0400, "David Horner" <ds2horner@...> said:
| I agree Nick.
| So here is a suggestion, not completely facetiously:

| For load byte/half/word

| example when SEW = 64

| An implementation can optimize the sequence

| strided load by 1/2/4

| shift left 56/48/32

| arith right 56/48/32

| but a sign extend byte/half/word to SEW would make fusing/chaining simpler.

| And these without widening.

| For stores:

| a “pack” SEW (of byte/half/word) instruction by SLEN into appropriate LMUL=1/8, 1/4 or 1/2 would allow standard unit strided store to work.

| A fractional LMUL that uses interleave (rather than right justified SLEN chunks) would not need this pack instruction.

| On 2020-04-04 8:04 p.m., Nick Knight wrote:

|     Hi Thang,
|     Can you, and anyone else who responds, please be concrete about the applications you have in mind? I tried to do so in my email.
|     In my opinion, concrete examples are crucial to making an informed decision. I hope you agree.
|     Best,
|     Nick Knight
|     On Sat, Apr 4, 2020 at 4:56 PM Thang Tran <thang@...> wrote:
|         There are real application (mixed integer/FP - convert instruction is used) codes written with load/store byte/halfword/word. There is a huge performance impact by adding widening instruction in a small
|         critical loop where every additional instruction causes > 10% impact on performance.
|         I am strongly against dropping the byte/halfword/word for load/store.
|         Thanks, Thang
|         -----Original Message-----
|         From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
|         Sent: Saturday, April 4, 2020 1:43 PM
|         To: tech-vector-ext@...
|         Subject: [RISC-V] [tech-vector-ext] Vector TG meeting minutes 2020/4/03

|         Date: 2020/4/03
|         Task Group: Vector Extension
|         Chair: Krste Asanovic
|         Number of Attendees: ~15
|         Current issues on github:
