
Re: [RISC-V] [tech-vector-ext] The scenarios of GEMM for u/int8 data

Linjie Yu
 

Hi David,

 

Can we see the git of your work?

My code has not been uploaded to a git repository, so I will show it in this mail.

            Does this mean the 32 vector registers are not enough, or that the number of elements for the given input vector length is not enough?

Yes: because the accumulator must be widened 4x relative to the int8 data, the 32 vector registers are not enough.
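
To make the register pressure concrete (assuming VLEN=128, as implied by the 16-byte loads below): each int8 operand row fits in one e8 register, but an e32 accumulator for it is an m4 group, so the 8 output channels alone would need 8 x 4 = 32 registers, the entire register file, before counting the input and filter registers.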

 

       With a "temporary working vector" this new instruction is a combination of the old with any "insert scalar into element" instruction [such as vrgather.vv splat with mask].

 

To use vrgather.vv, the 128-bit constant is complex to initialize.
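
For reference, a minimal sketch of the splat-with-mask combination described above (my own illustration in the draft-era mnemonics used below; t0 holding the scalar sum and the one-hot mask in v0 are assumptions, and setting up that mask is exactly the constant-init cost in question):

    vmv.v.x    v24, t0            # splat the scalar partial sum into a temp
    vmerge.vvm v2, v2, v24, v0    # write it only at the one masked position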

 

Next, I will show my code:

 

First, the C code:

        int sum[8];
        for (int j = 0; j < 8; j++) {
            sum[j] = bias_ptr[j];
        }
        for (int j = 0; j < inch_16; j++) {
            for (int k = 0; k < 16; k++) {
                for (int x = 0; x < 8; x++) {
                    sum[x] += in_ptr[k] * f0[k + 16 * x];
                }
            }
            in_ptr += 16;
            f0 += 16 * 8;
        }
        for (int j = 0; j < 8; j++) {
            int lshift = -shift_value[j + i];
            if (lshift > 0) {
                sum[j] = (sum[j] + (1 << (lshift - 1))) >> lshift;
            } else {
                sum[j] = sum[j] << (-lshift);
            }
            out_ptr[j] = (char)sum[j];
        }
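
As a concrete check of the shift step: with sum[j] = 100 and lshift = 3, the rounded form gives (100 + (1 << 2)) >> 3 = 104 >> 3 = 13, while a plain shift would give 100 >> 3 = 12, so the added term implements round-to-nearest before the requantizing shift.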

 

1.     vdot.vv + vredsum.vs (the tail processing is very complex)

                    "vsetvli        zero, zero, e64, m8\n\t"

                    "vxor.vv        v0, v0, v0\n\t"

                    "beqz           %4, 1f\n\t"

 

                    "0: \n\t"

                    "vsetvli        zero, zero, e8, m1\n\t"

                    "vle.v          v8, (%0)\n\t"

                    "addi           %0, %0, 16\n\t"

                    "vle.v          v9, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v10, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v11, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v12, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v13, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v14, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v15, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v16, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

 

                    "vsetvli        zero, zero, e32, m1, d4\n\t"

                    "vdot.vv        v0, v8, v9\n\t"

                    "vdot.vv        v1, v8, v10\n\t"

                    "vdot.vv        v2, v8, v11\n\t"

                    "vdot.vv        v3, v8, v12\n\t"

                    "vdot.vv        v4, v8, v13\n\t"

                    "addi           %4, %4, -1\n\t"

                    "vdot.vv        v5, v8, v14\n\t"

                    "vdot.vv        v6, v8, v15\n\t"

                    "vdot.vv        v7, v8, v16\n\t"

                    "bnez           %4, 0b\n\t"

 

                  "1: \n\t"

                    "vsetvli        zero, zero, e64, m8\n\t"

                    "vxor.vv        v8, v8, v8\n\t"

                    "vsetvli        zero, zero, e32, m1\n\t"

                    "vwredsum.vs    v8, v0, v8\n\t"

                    "vwredsum.vs    v9, v1, v9\n\t"

                    "vwredsum.vs    v10, v2, v10\n\t"

                    "vwredsum.vs    v11, v3, v11\n\t"

                    "vwredsum.vs    v12, v4, v12\n\t"

                    "vwredsum.vs    v13, v5, v13\n\t"

                    "vwredsum.vs    v14, v6, v14\n\t"

                    "vwredsum.vs    v15, v7, v15\n\t"

 

 

2.     vwmul + vwredsum.vs (vwredsum.vs used in the for loop)

                   "vsetvli        zero, zero, e64, m8\n\t"

                    "vxor.vv        v0, v0, v0\n\t"

                    "beqz           %4, 1f\n\t"

 

                    "0: \n\t"

                    "vsetvli        zero, zero, e8, m1\n\t"

                    "vle.v          v8, (%0)\n\t"

                    "addi           %0, %0, 16\n\t"

                    "vle.v          v9, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v10, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v11, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v12, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v13, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

 

                    "vwmul.vv       v14, v8, v9\n\t"

                    "vwmul.vv       v16, v8, v10\n\t"

                    "vle.v          v9, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vwmul.vv       v18, v8, v11\n\t"

                    "vle.v          v10, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vwmul.vv       v20, v8, v12\n\t"

                    "vle.v          v11, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

 

                    "vwmul.vv       v22, v8, v13\n\t"

                    "vwmul.vv       v24, v8, v9\n\t"

                    "vwmul.vv       v26, v8, v10\n\t"

                    "vwmul.vv       v28, v8, v11\n\t"

 

                    "vsetvli        zero, zero, e16, m2\n\t"

                    "vwredsum.vs    v0, v14, v0\n\t"

                    "vwredsum.vs    v1, v16, v1\n\t"

                    "vwredsum.vs    v2, v18, v2\n\t"

                    "addi           %4, %4, -1\n\t"

                    "vwredsum.vs    v3, v20, v3\n\t"

                    "vwredsum.vs    v4, v22, v4\n\t"

                    "vwredsum.vs    v5, v24, v5\n\t"

                    "vwredsum.vs    v6, v26, v6\n\t"

                    "vwredsum.vs    v7, v28, v7\n\t"

"bnez           %4, 0b\n\t"

 

3.     vwmul + vwredsum.vs (new)

                   "vsetvli        zero, zero, e16, m2\n\t"

                    "vxor.vv        v2, v2, v2\n\t"

                    "beqz           %4, 1f\n\t"

 

                    "0: \n\t"

                    "vsetvli        zero, zero, e8, m1\n\t"

                    "vle.v          v8, (%0)\n\t"

                    "addi           %0, %0, 16\n\t"

                    "vle.v          v9, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v10, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v11, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v12, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vle.v          v13, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

 

                    "vwmul.vv       v14, v8, v9\n\t"

                    "vwmul.vv       v16, v8, v10\n\t"

                    "vle.v          v9, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vwmul.vv       v18, v8, v11\n\t"

                    "vle.v          v10, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

                    "vwmul.vv       v20, v8, v12\n\t"

                    "vle.v          v11, (%1)\n\t"

                    "addi           %1, %1, 16\n\t"

 

                    "vwmul.vv       v22, v8, v13\n\t"

                    "vwmul.vv       v24, v8, v9\n\t"

                    "vwmul.vv       v26, v8, v10\n\t"

                    "vwmul.vv       v28, v8, v11\n\t"

 

                    "vsetvli        zero, zero, e16, m2\n\t"

                    "vwredsum.vs    v2, v14, v2, 0\n\t"

                    "vwredsum.vs    v2, v16, v2, 1\n\t"

                    "vwredsum.vs    v2, v18, v2, 2\n\t"

                    "addi           %4, %4, -1\n\t"

                    "vwredsum.vs    v2, v20, v2, 3\n\t"

                    "vwredsum.vs    v3, v22, v3, 0\n\t"

                    "vwredsum.vs    v3, v24, v3, 1\n\t"

                    "vwredsum.vs    v3, v26, v3, 2\n\t"

                    "vwredsum.vs    v3, v28, v3, 3\n\t"

 

                    "bnez           %4, 0b\n\t"

 

 

All of them are shown above. Any suggestions are welcome.

 

Yours

Damon

 

 

 

 

From: tech-vector-ext@... <tech-vector-ext@...> on behalf of David Horner
Sent: 2020-12-11 17:32
To: tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] The scenarios of GEMM for u/int8 data

 

 

On 2020-12-11 3:34 a.m., Linjie Yu wrote:

Hi all,

 

Recently, I optimized the kernel of GEMM for int8 data.

Can we see the git of your work?

I found that there was no good solution using the present vector ISA.

The main difficulty I meet is that the accumulator is 32 bits, so the data must be widened 4x (vqmacc, or vwmul + vwmacc, or vwmul + vwadd), which means the registers are not enough.

Does this mean the 32 vector registers are not enough, or that the number of elements for the given input vector length is not enough?

 

There are 2 different ways I used to optimize it by the present vector ISA.

1.     vdot.vv + vredsum.vs (the tail processing is very complex)

2.     vwmul + vwredsum.vs (vwredsum.vs used in the for loop)

Note vdot.vv is experimental. It is not planned for the v1.0 ratification proposal.

 

To solve this, I came up with a new instruction, called vwredsum.vs (new).

Unlike the old vwredsum.vs, which puts the result in the first element, the new one can put the result at any position, selected by an index. It can be used like this: vwredsum.vs v2, v1, v1, #2

With a "temporary working vector" this new instruction is a combination of the old with any "insert scalar into element" instruction [such as vrgather.vv splat with mask].

 

But none of them is good enough. Does someone have a better solution?

I would be happy to look at your current work to make suggestions if you could direct me to the code.

 

Yours

Damon


Re: The scenarios of GEMM for u/int8 data

Zakk Chen
 



Linjie Yu <linjie.ylj@...> wrote on Friday, December 11, 2020 at 4:34 PM:

Hi all,

 

Recently, I optimized the kernel of GEMM for int8 data. I found that there was no good solution using the present vector ISA.

The main difficulty I meet is that the accumulator is 32 bits, so the data must be widened 4x (vqmacc, or vwmul + vwmacc, or vwmul + vwadd), which means the registers are not enough.


Have you considered using fractional LMUL?
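
For example (my illustration, assuming VLEN=128): loading the int8 operands at e8, mf4 leaves the 4x-widened accumulators at e32, m1, so each of the 8 output channels costs one register instead of an m4 group of four, at the price of processing only VLEN/32 = 4 elements per operation.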
 

 

There are 2 different ways I used to optimize it by the present vector ISA.

1.     vdot.vv + vredsum.vs (the tail processing is very complex)

2.     vwmul + vwredsum.vs (vwredsum.vs used in the for loop)

 

To solve this, I came up with a new instruction, called vwredsum.vs (new).

Unlike the old vwredsum.vs, which puts the result in the first element, the new one can put the result at any position, selected by an index. It can be used like this: vwredsum.vs v2, v1, v1, #2

 

But none of them is good enough. Does someone have a better solution?

 

Yours

Damon


Re: The scenarios of GEMM for u/int8 data

David Horner
 


On 2020-12-11 3:34 a.m., Linjie Yu wrote:

Hi all,

 

Recently, I optimized the kernel of GEMM for int8 data.

Can we see the git of your work?

I found that there was no good solution using the present vector ISA.

The main difficulty I meet is that the accumulator is 32 bits, so the data must be widened 4x (vqmacc, or vwmul + vwmacc, or vwmul + vwadd), which means the registers are not enough.

Does this mean the 32 vector registers are not enough, or that the number of elements for the given input vector length is not enough?

 

There are 2 different ways I used to optimize it by the present vector ISA.

1.     vdot.vv + vredsum.vs (the tail processing is very complex)

2.     vwmul + vwredsum.vs (vwredsum.vs used in the for loop)

Note vdot.vv is experimental. It is not planned for the v1.0 ratification proposal.

 

To solve this, I came up with a new instruction, called vwredsum.vs (new).

Unlike the old vwredsum.vs, which puts the result in the first element, the new one can put the result at any position, selected by an index. It can be used like this: vwredsum.vs v2, v1, v1, #2

With a "temporary working vector" this new instruction is a combination of the old with any "insert scalar into element" instruction [such as vrgather.vv splat with mask].

 

But none of them is good enough. Does someone have a better solution?

I would be happy to look at your current work to make suggestions if you could direct me to the code.

 

Yours

Damon


The scenarios of GEMM for u/int8 data

Linjie Yu
 

Hi all,

 

Recently, I optimized the kernel of GEMM for int8 data. I found that there was no good solution using the present vector ISA.

The main difficulty I meet is that the accumulator is 32 bits, so the data must be widened 4x (vqmacc, or vwmul + vwmacc, or vwmul + vwadd), which means the registers are not enough.

 

There are 2 different ways I used to optimize it by the present vector ISA.

1.     vdot.vv + vredsum.vs (the tail processing is very complex)

2.     vwmul + vwredsum.vs (vwredsum.vs used in the for loop)

 

To solve this, I came up with a new instruction, called vwredsum.vs (new).

Unlike the old vwredsum.vs, which puts the result in the first element, the new one can put the result at any position, selected by an index. It can be used like this: vwredsum.vs v2, v1, v1, #2
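
My reading of the proposed semantics, as pseudocode (the treatment of the untouched destination elements is an assumption inferred from the example):

    # vwredsum.vs vd, vs2, vs1, #idx    (SEW-wide source, 2*SEW-wide result)
    # vd[idx] = vs1[idx] + sum(vs2[0..vl-1])
    # all other elements of vd are left undisturbed (assumed)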

 

But none of them is good enough. Does someone have a better solution?

 

Yours

Damon


Re: Vector Task Group minutes 2020/12/04

David Horner
 



On Thu, Dec 10, 2020, 04:44 Bill Huffman, <huffman@...> wrote:
On the issue of what bits to load for vle1.v, we need to decide whether
these are byte loads of length ceil(vl/8) or whether they are bit loads
of length vl.  Bit loads _can_ have the additional bits as tail-agnostic
but must not have them as tail-undisturbed. 
I concur.
Software can effect tail-undisturbed by:
(a) preconditioning the load,
(b) loading into a temp register, then using bitwise logic to merge into the target, or
(c) saving the last byte of the target, doing the vle1.v, reading back the last byte, and writing the merge of the two saved bytes.
In most cases this 'need' could be avoided by other means.
It would be nice if these
were bit loads, but it will be a little more complex for implementation
and I expect we may run into other issues down the line.  I think I lean
toward byte loads.
+1

We have a similar issue for vse1.v as the remaining bits in the memory
byte _must_ be stored with something.  Here it seems simpler and perhaps
more logical to say this is a byte store with length ceil(vl/8) - which
helps reinforce the choice of byte load for vle1.v.
+again

      Bill



Re: Vector Task Group minutes 2020/12/04

Bill Huffman
 

On the issue of what bits to load for vle1.v, we need to decide whether
these are byte loads of length ceil(vl/8) or whether they are bit loads
of length vl. Bit loads _can_ have the additional bits as tail-agnostic
but must not have them as tail-undisturbed. It would be nice if these
were bit loads, but it will be a little more complex for implementation
and I expect we may run into other issues down the line. I think I lean
toward byte loads.

We have a similar issue for vse1.v as the remaining bits in the memory
byte _must_ be stored with something. Here it seems simpler and perhaps
more logical to say this is a byte store with length ceil(vl/8) - which
helps reinforce the choice of byte load for vle1.v.

Bill


Re: Vector Task Group minutes 2020/12/04

lidawei14@...
 

Hi Krste,

This mask loading instruction is exactly the one we are looking forward to.

I am a bit confused about the hiccups: why do machines with internal dynamic data striping require hiccups whenever load data is used as a mask?
Does it mean the mask registers and normal vector registers have different internal arrangements, so we have to distinguish them while loading?
How do the proposed instructions help reduce these hiccups?

Thank you,
Dawei


Vector Task Group minutes 2020/12/04

Krste Asanovic
 

Date: 2020/12/04
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~12
Current issues on github: https://github.com/riscv/riscv-v-spec

Note: No meeting week Dec 11 due to Summit week.

Issues discussed:

# Memory ordering

Most of the meeting was spent discussing a strategy to handle vector
memory ordering wrt the RISC-V MCM (RVWMO, RVTSO), i.e., ordering
as observed/influenced by other harts in the system.

A big concern is ordering of younger scalar loads after older vector
loads when both are the same address, as this complicates
high-performance in-order implementations (OoO implementations already
have to deal with ordering around unknown addresses in any case, so
not considered a significant additional burden there). This load-load
ordering is required for the existing MCM, and the discussion was
around how it would be difficult to remove this ordering guarantee on
current vector load instructions while preserving the existing software view
of memory, possibly either complicating mapping of standard languages
or requiring software to add fences that would hurt performance on a
large class of machines.

One possible approach that was discussed was to add separate vector
memory instructions with weaker memory ordering, either encoded as new
opcodes or with some CSR field that modifies behavior of existing
instruction encodings. This might only be required for gather
operations, but some discussion was whether even greater weakening,
including intra-thread ordering should be considered.

It was felt defining and experimenting with these variants on
memory ordering would delay the vector spec even further, and so the
consensus was to enter public review with the current PoR that follows
standard RVWMO (or really, the standard MCM including TSO) at the
instruction level with the current instructions (intra-instruction
ordering was already relaxed per current draft spec), and consider
weaker instruction forms as a later extension.

# Mask handling

We further discussed the challenges of distributing mask register
values for machines with spatial wide datapaths using internal dynamic
data striping. In particular, all common instructions used to produce
mask values are explicitly encoded in the ISA, except for loads from
memory. Machines with internal dynamic data striping will therefore require
hiccups (additional microops) in the pipeline to rearrange load data
whenever used as a mask (heuristics/predictors might be possible to
reduce hiccups).

The most important case is that of mask register spill/refill, but
another important case is loading of packed bit vectors from memory
for use as masks.

Oblivious context save/restore would still likely require hiccups as
the save/restore code would not know data type assumption for next use
of a register, but these hiccups would be rare.

To help reduce these hiccups, we discussed the addition of new
unit-stride loads and stores that would use the lumop/sumop field to
encode EEW=1, and also use effective vl = ceil(vl/8) (implying
effectively EMUL<=1). Proposed instructions would be:

vle1.v vd, (rs1) # Byte load with effective vl = ceil(vl/8)
vse1.v vs2, (rs1) # Byte store with effective vl = ceil(vl/8)

For context
switch, or where multiple vector lengths are present in a loop, whole
register versions would also be useful, and might be simpler to
provide, and could be the only alternative.

vl1re1.v vd, (rs1) # Whole register load

These options, and whether any should be added to v1.0 for public
review, are to be discussed further on email. How to treat the extra bits
in a byte loaded from memory is an open issue: (1) 0s, (2) 1s to match
tail-agnostic, or (3) use the whole byte from memory (3 is probably the
simplest for implementations).
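
As a concrete example of the proposed sizing: with vl = 19 mask bits,
vle1.v would load ceil(19/8) = 3 bytes, i.e. 24 bits, and the 5 bits
beyond vl are exactly the extra bits in question.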


Vector Task Group minutes 2020/11/20 meeting

Krste Asanovic
 

Next meeting today in usual time slot as on calendar,
Krste


Date: 2020/11/20
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~17
Current issues on github: https://github.com/riscv/riscv-v-spec

No meeting Nov 27 due to US Thanksgiving.

Issues discussed:

# Mask handling

We discussed the challenges of distributing mask register values for
machines with spatial wide datapaths. Machines that perform dynamic
microarchitectural striping to reduce datapath wiring can do the same
for masks when it is known that a mask value is being written.
Currently mask writes are explicit in the ISA, except for loads from
memory (e.g., in a mask register spill/refill sequence).

One suggestion was that a heuristic could be used for loads, where a load
would be assumed not to be a mask if LMUL>1 or SEW!=8, but this
would not work for many cases.

Another suggestion was to add new instructions to support load/store
of masks. This could be encoded as unit-stride instructions using
reserved space in lumop[4:0]. These would look like EEW=8 load/stores
except with vector length of ceil(vl/8) (implying effectively
EMUL<=1). The detailed behavior was not discussed.


Re: [RISC-V] [tech-vector-ext] What is the plan for rvv v1.0

Wang Weiwei
 

That is exactly what I want. Thanks Mark.

 

Weiwei

 

 

 

 

From: tech-vector-ext@... <tech-vector-ext@...> on behalf of mark
Sent: 2020-11-25 23:13
To: Wang Weiwei <Weiwei.Wang@...>
Cc: vector <tech-vector-ext@...>
Subject: Re: [RISC-V] [tech-vector-ext] What is the plan for rvv v1.0

 

If you are looking for expected dates, they are always in the spec status spreadsheet at:

 

 

if it is something else, please let us know what specifically you are looking for.

 

thanks

Mark

 

 

On Wed, Nov 25, 2020 at 12:51 AM <weiwei.wang@...> wrote:

Hi Krste and Andrew, 

What is the rough plan for rvv v1.0 release? I searched the vector-ext mailing list but can't find the info I want.

 

Thanks

Weiwei  


Re: What is the plan for rvv v1.0

mark
 

If you are looking for expected dates, they are always in the spec status spreadsheet at:


if it is something else, please let us know what specifically you are looking for.

thanks
Mark


On Wed, Nov 25, 2020 at 12:51 AM <weiwei.wang@...> wrote:
Hi Krste and Andrew, 

What is the rough plan for rvv v1.0 release? I searched the vector-ext mailing list but can't find the info I want.

 

Thanks

Weiwei  


What is the plan for rvv v1.0

Wang Weiwei
 

Hi Krste and Andrew, 

What is the rough plan for rvv v1.0 release? I searched the vector-ext mailing list but can't find the info I want.

 

Thanks

Weiwei  


next vector meeting in 7 hours

Krste Asanovic
 

I think we'll be spending a chunk of time on mask layout and
implementation issues.

See you then,

Krste


Re: rename vfrece7/vfrsqrte7 to vfrec7 and vfrsqrt7

Andrew Waterman
 



On Sun, Nov 15, 2020 at 3:08 PM Krste Asanovic <krste@...> wrote:


This is issue #601.



It was pointed out that the *e7 (estimate to 7 bits) suffix on a mnemonic is
easily confused with e32 (element size 32) on other mnemonics.



This is probably one we can handle on the email thread. I'm in favor of the
change (we can keep the old names as aliases in the toolchain for now to avoid churn).

👍

Krste

rename vfrece7/vfrsqrte7 to vfrec7 and vfrsqrt7

Krste Asanovic
 

This is issue #601.

It was pointed out that the *e7 (estimate to 7 bits) suffix on a mnemonic is
easily confused with e32 (element size 32) on other mnemonics.

This is probably one we can handle on the email thread. I'm in favor of the
change (we can keep the old names as aliases in the toolchain for now to avoid churn).

Krste


Vector TG minutes 2020/11/13 meeting

Krste Asanovic
 

Date: 2020/11/13
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~22
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

# Removing half-precision FP as mandate in V extension.

The group discussed a proposal to remove half-precision floating-point
operations from the mandated set for the V extension, moving FP16
support to an additional standard extension. BF16 support could be
added in the same way at a later date. For both 16b FP formats, a standard
subextension supporting only conversion up/down to/from FP32 was
agreed to be defined (no integer conversions, and no .rod. variant).
This approach enables implementations to support either, both, or
neither of these 16-bit floating-point formats. The impact on the
software stack is expected to be small, given that half-precision FP
use is not yet standard.

Related, there is a discussion on what level of support should be
mandated in the next application-processor profile (RVA21). There were
three options considered:
1) no FP16 supported mandate
2) RVA21 mandates FP16 convert operations only
3) V mandates FP16 convert operations only

This is to be discussed further, but the group seemed to favor option 1 or 2.

# Mask handling

There was initial discussion of concerns over mask handling for wide
spatial implementations and/or implementations with vector register
renaming.

One proposal was to not allow "tail undisturbed" on mask results, as
this complicated renamed implementations and those with specialized
mask handling, while not being particularly useful to software.
Created new issue #602 for this proposal.


Half-Precision, BFloat16, and Other Float Encoding: Reference Model Recommendations from Task Group

Krste Asanovic
 

Because there is no "official" BF16 standard (beyond interchange
format) and because other vendors have made incompatible choices (so
no de-facto standard either), we will need to define the RISC-V BF16
arithmetic standard.

Some people are working towards this as part of the alternate FP group
proposal - there are a few details that need some thought.

Once this is specified, there can be reference implementations
produced.

Krste

On Fri, 13 Nov 2020 09:11:18 -0800, "CDS" <cohen.steed@wdc.com> said:
| In support of Open Source Software and publicly released modeling schemes, does the Vector Task Group have a recommendation for arithmetic
| reference? The published ISSs can provide checking results from a heavy-lifting simulation perspective, but even they must rely on something to
| model and calculate arithmetic results. The IEEE-754 encodings are handled by some easily-found solutions - what about BFloat16 and other encodings?
|


Half-Precision, BFloat16, and Other Float Encoding: Reference Model Recommendations from Task Group

CDS
 

In support of Open Source Software and publicly released modeling schemes, does the Vector Task Group have a recommendation for arithmetic reference? The published ISSs can provide checking results from a heavy-lifting simulation perspective, but even they must rely on something to model and calculate arithmetic results. The IEEE-754 encodings are handled by some easily-found solutions - what about BFloat16 and other encodings?


Vector TG minutes from 2020/11/6 meeting

Krste Asanovic
 

Also, a reminder that we'll be meeting tomorrow (Friday, Nov 13) as per the
calendar entry (7 hours from now),

Krste

Date: 2020/11/06
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~18
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

#592 "Heritage" - old prior art

We discussed how to add notes showing known older designs that had the
same features as included in the spec, and spent some time covering all the
details in the spec. We cannot discuss live patents in meetings, but we
can describe public domain techniques from 20+ years ago. The proposal is
to accept pull requests with details incorporated as NOTE comments for
now. These should ideally be formatted differently from other commentary,
but this can be done later.

(#529) Whole register load/store misalignment exceptions

While we had previously agreed (#529) that whole register load/stores
could report misalignment exceptions if the base address was not
aligned with the encoded hint EEW, we had not considered cases where
the machine did not support smaller EEWs. In particular, stores are
always encoded with EEW=8.

We rejected allowing machines to report exceptions if not
VLEN-aligned, as this would complicate stack save/restore in ABIs with
smaller stack alignments.

We decided to stay with current text that allows misaligned exceptions
to be reported according to the greater of the smallest supported EEW or
the encoded EEW. Profiles can mandate support for certain EEWs.
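
For example: on a machine whose smallest supported EEW is 32, a whole
register store (always encoded with EEW=8) may still raise a misaligned
exception for a base address that is not 4-byte aligned.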


Re: vector strided stores when rs1=x0

Bill Huffman
 

This sounds right to me as well.  No use making a special case for strided stores with rs2=x0.

      Bill

On 11/9/20 12:04 PM, Nick Knight wrote:

I understand now. I'm on board iff the memory consistency model experts assent.

On Mon, Nov 9, 2020 at 11:41 AM Krste Asanovic <krste@...> wrote:
There's a comment about this in the spec already.

But note that this would be in a case where you're relying on having multiple accesses in a non-deterministic order to one memory location, which is probably fraught for other reasons.

Krste

On Nov 9, 2020, at 11:38 AM, Nick Knight <nick.knight@...> wrote:

Sorry, slightly off topic, but what was the rationale for 

When `rs2!=x0` and the value of `x[rs2]=0`, the implementation must perform one memory access for each active element (but these accesses will not be ordered).

I guess I'm thinking about the possibility of a toolchain relaxing `li x1, 0; inst x1` into `inst x0`.


On Mon, Nov 9, 2020 at 10:09 AM Krste Asanovic <krste@...> wrote:
I made an error copied from my meeting notes - this should be when rs2=x0 (i.e., the stride value),

Krste

On Nov 9, 2020, at 9:12 AM, Krste Asanovic via lists.riscv.org <krste=berkeley.edu@...> wrote:


On Nov 9, 2020, at 8:57 AM, Guy Lemieux <guy.lemieux@...> wrote:

I think this is a bad idea for both loads and stores. If the intent is a single load or single store, then there should be another way to do it.

Using vector loads/stores with stride=0 is one way to read/write a vector from/to a memory-mapped FIFO.


(I think we also discussed a way to do ordered writes for such cases earlier, which is necessary for FIFO-based communication; I don't recall whether this was discussed around strides. If there is a special way to declare ordered writes, then I'm only concerned with using a FIFO with that mode.)


These are all supported with ordered scatters/gathers to/from a single address.

We wanted to remove ordering requirements from all other vector load/store types.

Krste

Guy


On Mon, Nov 9, 2020 at 8:38 AM Krste Asanovic <krste@...> wrote:

Also on github as issue #595

In our earlier TG discussion in the 9/18 meeting, we were in favor of
allowing vector strided load instructions with rs1=x0 to perform fewer
memory accesses than the number of active elements.  This allows
higher-performing splats of a scalar memory value into a vector.

In writing this up, I inadvertently made this true for stores too.
But on review, I can't see a reason not to also allow strided stores
(which are now unordered) to perform fewer memory operations (in
effect, picking a random active element to write back). The behavior
is indistinguishable from a possible legal execution of the prior scheme,
and has a potential niche use of storing an element value to memory when
it is known that all elements have the same value.

https://github.com/riscv/riscv-v-spec/commit/398d453e3592efbac77cc8f6658009759901185a#diff-ea57dd7a8daf0aa62f553688c1970c8e6608945d25597f8661c5ea6670fb509c

I suppose we could also reserve the encoding with strided stores of
rs1=x0, but this would add some asymmetry.  Software could then get a
similar effect by setting vl=1 before the store.
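
A minimal sketch of that alternative (my illustration, in the draft-era
mnemonics used elsewhere in this thread):

    li       t0, 1
    vsetvli  zero, t0, e32, m1    # vl = 1
    vse.v    v8, (a0)             # stores exactly one element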

Krste







