#### Vector extension TG meeting minutes 2020/5/1

Krste Asanovic

I pushed an update to the spec that includes the sign/zero-extension
instructions, so now I can hopefully answer this question more
concretely.

The point of fractional LMUL is to reduce register usage, so the
example could use LMUL=1 for the 32b values:

vsetvli t0, a0, e32, m1 # SEW=32b, LMUL=1
loop:
[...]
vle8.v v1, (x12) # Load bytes EEW=8b, EMUL=1/4
vsext.vf4 v1, v1 # Quad-widening sign-extension, dest EEW=32b, EMUL=1
vfcvt.f.x.v v1, v1 # Convert to float
[...]

I use v1 for the working register here to make it clear that it can be
any of the 32 registers since LMUL=1.
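For contrast, here is a sketch of the same loop written at LMUL=8
(register numbers illustrative), showing how much more of the register
file gets tied up:

vsetvli t0, a0, e32, m8 # SEW=32b, LMUL=8
loop:
[...]
vle8.v v0, (x12) # Load bytes, EEW=8b, EMUL=2, occupies v0-v1
vsext.vf4 v8, v0 # Quad-widening sign-extension, dest EEW=32b, EMUL=8, occupies v8-v15
vfcvt.f.x.v v8, v8 # Convert to float
[...]

At LMUL=8 the destination must be one of the four aligned group bases
(v0, v8, v16, v24) and cannot overlap the v0-v1 source group, versus
any of the 32 registers at LMUL=1.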

Theoretically, an implementation could fuse:

vle8.v v1, (x12) # dest EEW=8b, EMUL=1/4
vsext.vf4 v1, v1 # dest EEW=32b, EMUL=1

into
vle8_sextvf4 v1, (x12) # dest EEW=32b, EMUL=1

which effectively would restore the sign-extending byte load
instruction.

Another option would be to add widening arithmetic operations that
extend from SEW/4 to SEW (in addition to the current dual widening
that extends from SEW to 2*SEW), but for your particular example, this
would have to allow the wide value to be a scalar

vadd.vf4x vd, vs2, rs1 # vd (EEW=SEW,EMUL=LMUL), vs2 (EEW=SEW/4,EMUL=LMUL/4), rs1 (EEW=SEW)

A fully orthogonal encoding would allow each source and destination
operand of any arithmetic operation to have its own EEW encoding, with
one of these set by the SEW vtype setting, but this can't fit into a
32b encoding (never mind the implementation and verification
challenge).

Krste

On Sun, 3 May 2020 16:01:50 +0000, <thang@...> said:
| BTW, even if overlapping of source/destination registers is not an issue, if fractional LMUL is used to load data, would 2 extra instructions still be needed in the loop to change from fractional LMUL to full LMUL?
| Thanks, Thang

| -----Original Message-----
| From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Thang Tran
| Sent: Sunday, May 3, 2020 8:49 AM
| To: krste@...; tech-vector-ext@...
| Subject: Re: [RISC-V] [tech-vector-ext] Vector extension TG meeting minutes 2020/5/1

| Hi All,

| The issue I brought up was that loading data with LMUL=8 and then doing a signed/unsigned extension would need an extra register set. I am still puzzled/confused about the code.

| There was a suggestion to use fractional LMUL to load the data and then use the same register for the signed/unsigned extension, so that no extra register set is needed; but this would overwrite the data in v0, which is why v16 was used in Krste's comment from #425:

| vsetvli t0, a0, e32, m8
| loop:
| [...]
| vle8.v v0, (x12) # Load bytes
| vqcvt.x.x.v v16, v0 # Quad-widening sign-extension
| vfcvt.f.x.v v16, v16 # Convert to float
| [...]

| I have another question about the below example by Krste with the vqadd.vx instruction. Is it an illegal instruction with LMUL=8? Should this instruction instead use LMUL=2 to expand from v0-v1 to v16-v23? In that case, 2 vsetvli instructions are inserted into the loop.

| vsetvli t0, a0, e32, m8
| loop:
| [...]
| vle8.v v0, (x12) # Load bytes
| vfcvt.f.x.v v16, v16 # Convert to float
| [...]

| Thanks, Thang


Bill Huffman

Hi Krste,

On the SLEN=VLEN discussion, I agree with you that using a cast
instruction might well make for a de facto split in the code base. But
I don't think the solution is to force a de jure split instead. It
seems possible to me that if such a de jure split is forced, wide
machines might use a different vector instruction set entirely because
the losses may then outweigh the advantages.

I would rather pursue a mechanism that doesn't induce a split. One
possibility is a mode that can be implemented in one lane on wide
machines. Performance would be lower when the mode was enabled, but for
the codes in question, that might be fine. Compilers might be a bit
overzealous at first in setting the mode, but I think that would be
fixed over time as a performance issue.

A possible way to have a mode would be a vtype bit that is set by vsetvl
but not by vsetvli (unless there are more available immediate bits than
I'm thinking there are). The bit survives vsetvli. Machines that did
not care too much about performance on these codes could do something
simple - probably reduce VLEN to SLEN.
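A minimal sketch of what setting such a mode might look like (MEM_FMT
is a hypothetical vtype bit, not part of the current spec):

csrr t1, vtype # read current vtype
ori t1, t1, MEM_FMT # set hypothetical in-memory-format bit
vsetvl t0, a0, t1 # only the register form can set the bit
...
vsetvli t0, a0, e32, m1 # under this proposal, the bit would survive vsetvli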

I would also be interested in seeing some example code. It might be
that wide machines would slow almost as much because of other aspects of
the code as with a reduced VLEN. Example code might also prompt a
better thought for how to solve this.

It might also help if someone could give some perspective on the kinds
of codes that would use SLEN=VLEN and how much it would help them.

Bill


Thang Tran

BTW, even if overlapping of source/destination registers is not an issue, if fractional LMUL is used to load data, would 2 extra instructions still be needed in the loop to change from fractional LMUL to full LMUL?

Thanks, Thang


Thang Tran

Hi All,

The issue I brought up was that loading data with LMUL=8 and then doing a signed/unsigned extension would need an extra register set. I am still puzzled/confused about the code.

There was a suggestion to use fractional LMUL to load the data and then use the same register for the signed/unsigned extension, so that no extra register set is needed; but this would overwrite the data in v0, which is why v16 was used in Krste's comment from #425:

vsetvli t0, a0, e32, m8
loop:
[...]
vle8.v v0, (x12) # Load bytes
vqcvt.x.x.v v16, v0 # Quad-widening sign-extension
vfcvt.f.x.v v16, v16 # Convert to float
[...]

I have another question about the below example by Krste with the vqadd.vx instruction. Is it an illegal instruction with LMUL=8? Should this instruction instead use LMUL=2 to expand from v0-v1 to v16-v23? In that case, 2 vsetvli instructions are inserted into the loop.

vsetvli t0, a0, e32, m8
loop:
[...]
vle8.v v0, (x12) # Load bytes
vfcvt.f.x.v v16, v16 # Convert to float
[...]

Thanks, Thang


David Horner

On 2020-05-01 10:27 p.m., Krste Asanovic wrote:
#423 transient vtype modifier

Briefly discussed adding a transient vtype modifier, but no decisions.
Note: to help understand #423, since the meeting introduction only mentioned transient SEW and LMUL overrides:
The proposal is more general and includes extending modifiers beyond those mapped by vsetvli.
It also includes a persistent variant (which is described first).

Krste Asanovic

Date: 2020/5/1
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~20
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

#440 LMUL contiguous?

Discussed the recent change to add an extra bit for fractional LMUL.
Explained that the reason to revert to the backwards-compatible layout
was to reduce implementation churn. Not intended as a precedent that
pre-freeze specs have to ensure backwards compatibility. Can be
changed in a later or the 1.0 version.

#418 Drop LMUL

Agreed to retain LMUL given discussion on github, but to bring up some
of the detailed suggestions as separate issues.

#424/425 New explicit EEW scheme in load/stores

Group was generally favorable on the new scheme that encodes effective
element width in loads/stores.

The spec still has not been updated to add 4* widening and 2* widening
operations that would sign/zero extend from 1/4*SEW and 1/2*SEW to
SEW.

Some concern expressed that dropping fixed-width sign/zero-extending
byte/halfword load/stores will hurt performance of some key loops.
Also noted that bringing in smaller data items as fractional LMUL will
avoid tying up as many architectural registers as bringing them in as
full-width elements. Will keep studying application kernels.
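As a sketch of the encoding under discussion (register numbers
illustrative), the EEW comes from the load opcode while vtype supplies
SEW/LMUL, so mixed-width loads need no vtype change in between:

vsetvli t0, a0, e32, m2 # SEW=32b, LMUL=2
vle32.v v8, (a1) # EEW=32b, EMUL=2
vle16.v v4, (a2) # EEW=16b, EMUL=1
vle8.v v2, (a3) # EEW=8b, EMUL=1/2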

#434 SLEN=VLEN as optional extension

General sentiment was that it would be bad to fragment and we should
find a way to retain SLEN<=VLEN as a single standard. Also, that
SLEN=VLEN would be too expensive even on large cores, with only a
subset of workloads needing the in-memory format in registers.

Proposals suggested included adding another bit to vsetvli to change
to the in-memory format in vector registers (potentially reducing
VLEN), or adding "cast" instructions to change into/out of memory byte
order. The latter could be in-place overwrites that turn into NOPs in
SLEN=VLEN implementations.

Group to provide proposals.

#423 transient vtype modifier

Briefly discussed adding a transient vtype modifier, but no decisions.
