
Vector Task Group meeting Friday March 26

Krste Asanovic
 

We'll meet again in the usual slot.

The main discussion topic will be #545. Please read the issue thread
on github.

Summary: The proposal is to move vector AMOs from their current
encoding to leave space for scalar subword AMOs, and to drop vector
AMOs from the base application processor vector profile (they were
already excluded from the Zve* subset profiles). We'll rework the
vector AMO encoding, but this work is not on the path for v1.0 ratification.

Krste


Vector Extension Task Group Minutes 2021/03/19

Krste Asanovic
 

Date: 2021/03/19
Task Group: Vector Extension
Chair: Krste Asanovic
Vice-Chair: Roger Espasa
Number of Attendees: ~16
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed

#640 Bound on VLMAX/VLEN

Previously, we'd discussed making the upper bound on VLMAX part of a
profile, but the realization was that the bound cannot later be increased
in a software-compatible way without adding a new instruction, so it is
effectively part of the ISA spec.

We discussed having the more general case of VLMAX being the bound,
but the consensus was that having the bound be a function of VLEN
(VLEN <= 65536) was simpler to specify and had no great effect on the
range of supported systems.
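
(For reference, a sketch of the arithmetic behind this bound, assuming the
usual VLMAX = LMUL*VLEN/SEW relation: the worst case for element indices is
SEW=8 with LMUL=8, giving

    VLMAX = 8 * 65536 / 8 = 65536 elements

so capping VLEN at 65536 bits keeps every element index representable in
16 bits, e.g., as a vrgatherei16 index.)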

The extension to add independent control of the data input size for
vrgather, proposed in #655, was briefly discussed, but it will not
be included in v1.0.

#651 expanding tail-agnostic to allow result values (masks only, or data)

The discussion was around expanding the set of allowable tail-agnostic
values to include the results of the computation.

The consensus was to expand this for mask register writes (except
loads), where only tail-agnostic behavior is required.

But support was not as clear for data register writes, where
tail-undisturbed behavior must be supported and where FP operations
require masking off exception flags even for tail-agnostic elements.

The PoR is to expand mask register writes to allow results to be written
in the tail, while continuing discussion on further relaxation for data
register writes.

#457 Ordering in vector AMOs

Current vector AMOs have no facility to order writes to the same
address, whereas indexed stores have an ordered option.
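
(For comparison, the two indexed-store flavors referred to above, using the
v1.0-era mnemonics; register choices are illustrative:

    vsuxei32.v v8, (a0), v16   # indexed-unordered store
    vsoxei32.v v8, (a0), v16   # indexed-ordered store: overlapping element
                               #   stores are performed in element order
)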

Discussion centered on a proposal to tie address-based ordering to the wd
(write result data) bit. One concern was that this seemed to hamper
some cases, including where software wanted the results but knew the
addresses were disjoint. Providing ordering only on the same address
would likely require a slow implementation on an out-of-order machine,
where addresses can be produced out of order for different element groups.

The decision was to maintain the PoR and consider post-v1.0 ways to support
ordered vector AMOs.


Next Vector TG Meeting, Friday March 19

Krste Asanovic
 

There are a few issues to discuss, so we’ll meet in the regular time slot on the calendar,
Krste


cancel Mar 12 Vector TG meeting

Krste Asanovic
 

I'm cancelling the meeting again, as I still have not been able to clean
up the spec. I realize it will be more efficient for folks to wait for a
clean version for a complete read-through. Few issues are being
reported or found, so I do not anticipate any substantive changes.

One realization on issue #640 is that the VLEN limit (<=64Kib) has to be
part of the ISA spec, not just a profile, to allow backwards compatibility.
Details are in the github issue.

One update is that RIOS Lab has agreed to help with architecture tests
and the SAIL model - thank you, RIOS!

Krste


cancel next Vector TG meeting, Friday March 5

Krste Asanovic
 

I'm still working through spec cleanup.

The list and github have been quiet, and I have no new issues to raise,
so I suggest we cancel this meeting and push out by a week.

Krste


Vector Task Group meeting minutes for 2021/2/19

Krste Asanovic
 

Date: 2021/02/19
Task Group: Vector Extension
Chair: Krste Asanovic
Vice-Chair: Roger Espasa
Number of Attendees: ~23
Current issues on github: https://github.com/riscv/riscv-v-spec

# Next Meeting/Freezing

The schedule is to meet again in two weeks (Friday March 5). The plan
is to have all pending updates and cleanups in the spec by that date, to
be able to agree to freeze and move forward into public review (v1.0),
which should happen soon after this meeting. Please continue to send
PRs for any small typos and clarifications, and use the mailing list for
larger issues.

Issues discussed

#640 Bound on VLMAX

The major issue raised was that software would otherwise have to cope
with indices that might not fit in 16 bits. The group agreed that
profiles and/or platform specs can set the upper bound, with a current
recommendation that all profiles limit VLMAX to 64K elements
(VLEN=64Kib, or 8KiB per vector register). The current ISA spec can
already support larger VLMAX, but a vrgatherei32 instruction
would be a useful addition (post-v1.0) if architectural vector
register files >256KiB become common.
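
(As a rough check on the numbers above, assuming the standard
VLMAX = LMUL*VLEN/SEW relation and 32 architectural vector registers:

    VLEN = 65536 bits = 8 KiB per vector register
    32 registers * 8 KiB = 256 KiB of architectural vector register state
    worst-case VLMAX (SEW=8, LMUL=8) = 8*65536/8 = 65536 elements, the
    largest count whose element indices still fit in 16 bits

so register files beyond 256KiB are where 16-bit gather indices run out and
a vrgatherei32 would be needed.)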

It was also discussed that a privileged setting will be desired to
modulate visible VLEN to support thread migration, or just different
application vector profiles with different VLMAX in general.


Re: Zfinx + Vector

Tariq Kurd
 

Thanks Krste, I’ve put exactly that in the spec.

 

Tariq

 

From: tech-vector-ext@... [mailto:tech-vector-ext@...] On Behalf Of Krste Asanovic
Sent: 18 February 2021 19:09
To: Tariq Kurd <tariq.kurd@...>
Cc: tech-vector-ext@...
Subject: Re: [RISC-V] [tech-vector-ext] Zfinx + Vector

 

If you check over the vector instruction listing table, it’s all the instructions in funct3=OPFVF with an F in the operand column. Most of these are missing.

 

Krste



On Feb 18, 2021, at 10:44 AM, Tariq Kurd via lists.riscv.org <tariq.kurd=huawei.com@...> wrote:

 

Hi everyone,

 

I’ve updated the Zfinx spec to show which V-extension instructions are affected.

 

 

Please review the list, and tell me of any impact on the vector spec which I’ve overlooked.

 

Thanks

 

Tariq

 

Tariq Kurd

Processor Design | RISC-V Cores, Bristol

E-mail: Tariq.Kurd@...

Company: Huawei technologies R&D (UK) Ltd | Address: 290 Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32 4SY, UK

 


 

 


Vector task group meeting, Friday Feb 19

Krste Asanovic
 

We’ll meet today in the usual slot; details are on the Google calendar.

The agenda is to discuss any issues found while reading over the v0.10 spec.
The list and GitHub have been quite quiet, so this might be a short meeting.

Krste


Re: Zfinx + Vector

Krste Asanovic
 

If you check over the vector instruction listing table, it’s all the instructions in funct3=OPFVF with an F in the operand column. Most of these are missing.
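
(For concreteness, a few representative OPFVF instructions that source a
scalar F register and so are affected by Zfinx; the list is illustrative,
not exhaustive:

    vfadd.vf      v1, v2, fa0    # vector-scalar FP add
    vfmacc.vf     v1, fa0, v2    # vector-scalar FP multiply-accumulate
    vfmv.v.f      v1, fa0        # splat FP scalar across body elements
    vfmv.s.f      v1, fa0        # write FP scalar to element 0
    vfslide1up.vf v1, v2, fa0    # slide up, inserting FP scalar
)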

Krste

On Feb 18, 2021, at 10:44 AM, Tariq Kurd via lists.riscv.org <tariq.kurd=huawei.com@...> wrote:

Hi everyone,
 
I’ve updated the Zfinx spec to show which V-extension instructions are affected.
 
 
Please review the list, and tell me of any impact on the vector spec which I’ve overlooked.
 
Thanks
 
Tariq
 
Tariq Kurd
Processor Design | RISC-V Cores, Bristol
Company: Huawei technologies R&D (UK) Ltd | Address: 290 Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32 4SY, UK
 
 


Zfinx + Vector

Tariq Kurd
 

Hi everyone,

 

I’ve updated the Zfinx spec to show which V-extension instructions are affected.

 

https://github.com/riscv/riscv-zfinx/blob/master/Zfinx_spec.adoc#vector

 

Please review the list, and tell me of any impact on the vector spec which I’ve overlooked.

 

Thanks

 

Tariq

 

Tariq Kurd

Processor Design | RISC-V Cores, Bristol

E-mail: Tariq.Kurd@...

Company: Huawei technologies R&D (UK) Ltd | Address: 290 Park Avenue, Aztec West, Almondsbury, Bristol, Avon, BS32 4SY, UK

 


 


Re: Vector TG minutes for 2020/12/18 meeting

Bill Huffman
 

For hardware with very long vector registers, the same effect might be accomplished by having a custom way to change VLMAX dynamically (across all harts, etc.). It would seem that would cover a larger set of useful cases for what Guy is thinking about - if I'm following him.

Bill

-----Original Message-----
From: tech-vector-ext@lists.riscv.org <tech-vector-ext@lists.riscv.org> On Behalf Of Krste Asanovic
Sent: Tuesday, February 16, 2021 3:21 PM
To: Guy Lemieux <guy.lemieux@gmail.com>
Cc: krste@berkeley.edu; Zalman Stern <zalman@google.com>; tech-vector-ext@lists.riscv.org
Subject: Re: [RISC-V] [tech-vector-ext] Vector TG minutes for 2020/12/18 meeting

On Tue, 16 Feb 2021 15:12:46 -0800, Guy Lemieux <guy.lemieux@gmail.com> said:
| in terms of overlap with that case — that case normally selects
| maximally sized AVL. the implied goals there are to make best use of
| vector register capacity and throughput. l

| i’m suggesting a case where a minimally sized AVL is used, as chosen by the architect.

| this allows a programmer to optimize for minimum latency while still
| getting good throughput. in some cases, the full VLMAX state may still
| be used to hold data, but operations are chunked down to minimally sized AVL (eg for latency reasons).

I still don't see how hardware can set a <VLMAX value that will work well for any code in a loop.

Your latency comment seems to imply an external observer sees the individual strips go by (e.g., in a DSP application where data comes in and goes out in chunks), as otherwise only the total time to finish the loop matters.

In these situations, I also can't see having the microarchitecture pick the chunk size - usually the I/O latency constraint sets the chunk size, and the goal of vector execution is to execute the chunks as efficiently as possible.

Krste

| i’m not sure of the portability concerns. if an implementation is free
| to set VLMAX, and software must be written for any possible AVL that is returned, then it appears to me that deliberately returning a smaller implementation-defined AVL should still be portable.

| programming for min-latency isn’t common in HPC, but can be useful in real-time systems.

| g

| On Tue, Feb 16, 2021 at 3:01 PM <krste@berkeley.edu> wrote:

| There's a large overlap here with the (rd!=x0,rs1=x0) case that

| selects AVL=VLMAX.  If migration is intended, then VLMAX should be

| same across harts.

| Machines with long temporal vector registers might benefit from
| using

| less than VLMAX, but this is highly dependent on specifics of the

| interaction of the microarchitecture and the scheduled application

| kernel (otherwise, the long vector registers were a waste of

| resources).  I can't see how to do this portably beyond selecting

| VLMAX.

| Krste

| | Of course, the implementation-defined value must be fixed across all harts, so thread migration doesn't break software.

| | Guy

| | On Mon, Feb 15, 2021 at 11:30 PM <krste@berkeley.edu> wrote:

| |     Replying to old thread to add rationale for current choice.

| |||||| On Mon, 21 Dec 2020 13:52:07 -0800, Zalman Stern <zalman@google.com> said:

| |     | Does it get easier if the specification is just the immediate value plus one?

| |     No - this costs more gates on critical path.  Mapping 00000 => 32 is

| |     simpler in area and delay.

| |     | I really don't understand how this encoding is particularly great for immediates as many of the values are likely very rarely or even never used and it seems

| |     | like one can't get long enough values even for existing SIMD hardware in some data types. Compare to e.g.:

| |     |     (first_bit ? 3 : 1) << rest_of_the_bits

| |     | or:

| |     |     map[] = { 1, 3, 5, 8 }; // Or maybe something else for 5 and 8

| |     |     map[first_two_bits] << rest_of_the_bits;

| |     | I.e. get a lot of powers of two, multiples of three-vecs for graphics, maybe something else.

| |     As a counter-example for this particular example, one code I looked at

| |     recently related to AR/VR used 9 as one dimension.

| |     The challenge is agreeing on the best mapping from the 32 immediate

| |     encodings to the most commonly used AVL values.

| |     More creative mappings do consume some incremental logic and path

| |     delay (as well as adding some complexity to software toolchain).

| |     While they can provide small gains in some cases, this is offset by

| |     small losses in other cases (someone will want AVL=17 somewhere, and

| |     it's not clear that say AVL=40 is a substantially better use of

| |     encoding).  There is not huge penalty if the immediate does not fit,

| |     at most a li instruction, which might be hoisted out of the loop.

| |     The current v0.10 definition uses the obvious mapping of the immediate.

| |     Simplicity is a virtue, and any potential gains are small for AVL >

| |     31, where most implementation costs are amortized over the longer

| |     vector and many implementations won't support longer lengths for a

| |     given datatype in any case.

| |     Krste

| |     | -Z-

| |     | On Mon, Dec 21, 2020 at 10:47 AM Guy Lemieux <guy.lemieux@gmail.com> wrote:

| |     |     for vsetivli, with the uimm=00000 encoding, rather than setting vl to 32, how setting it to some other meaning?

| |     |     one option is to set vl=VLMAX. i have some concerns about software using this safely (eg, if VLMAX turns out to be much larger than software anticipated,

| |     |     then it would fail; correcting this requires more instructions than just using the regular vsetvl/vsetvli would have used).

| |     |     another option is to allow an implementation-defined vl to be chosen by hardware; this could be anywhere between 1 and VLMAX. for example, implementations

| |     |     may just choose vl=32, or they may choose something else. it allows the CPU architect to devise a scheme that best fits the implementation. this may

| |     |     consider factors like the effective width of the execution engine, the pipeline depth (to reduce likelihood of stalls from dependent instructions), or

| |     |     that the vector register file is actually a multi-level memory hierarchy where some smaller values may operate with greater efficiency (lower power), or

| |     |     matching VL to the optimal memory system burst length. perhaps some guidance by the spec could be given here for the default scheme, eg whether the

| |     |     implementation optimizes for best performance or power (while still allowing implementations to modify this default via an implementation-defined CSR).

| |     |     software using a few extra cycles to check the returned vl against AVL should not a big problem (the simplest solution being vsetvli followed by vsetivli)

| |     |     g

| |     |     On Fri, Dec 18, 2020 at 6:13 PM Krste Asanovic <krste@berkeley.edu> wrote:

| |     |         # vsetivli

| |     |         A new variant of vsetvl was proposed providing an immediate as the AVL

| |     |         in rs1[4:0].  The immediate encoding is the same as for CSR immediate

| |     |         instructions. The instruction would have bit 31:30 = 11 and bits 29:20

| |     |         would be encoded same as vsetvli.

| |     |         This would be used when AVL was statically known, and known to fit

| |     |         inside vector register group.  Compared with existing PoR, it removes

| |     |         need to load immediate into a spare scalar register before executing

| |     |         vsetvli, and is useful for handling scalar values in vector register

| |     |         (vl=1) and other cases where short fixed-sized vectors are the

| |     |         datatype (e.g., graphics).

| |     |         There was discussion on whether uimm=00000 should represent 32 or be

| |     |         reserved.  32 is more useful, but adds a little complexity to

| |     |         hardware.

| |     |         There was also discussion on whether instruction should set vill if

| |     |         selected AVL is not supported, or whether should clip vl to VLMAX as

| |     |         with other instructions, or if behavior should be reserved.  Group

| |     |         generally favored writing vill to expose software errors.

| |     |


Re: Vector TG minutes for 2020/12/18 meeting

Krste Asanovic
 

On Tue, 16 Feb 2021 15:12:46 -0800, Guy Lemieux <guy.lemieux@gmail.com> said:
| in terms of overlap with that case — that case normally selects maximally sized AVL. the implied goals there are to make best use of vector register capacity and
| throughput. l

| i’m suggesting a case where a minimally sized AVL is used, as chosen by the architect.

| this allows a programmer to optimize for minimum latency while still
| getting good throughput. in some cases, the full VLMAX state may still be used to hold data, but operations are chunked down to minimally sized AVL (eg for
| latency reasons).

I still don't see how hardware can set a <VLMAX value that will work
well for any code in a loop.

Your latency comment seems to imply an external observer sees the
individual strips go by (e.g., in a DSP application where data comes in
and goes out in chunks), as otherwise only the total time to finish the
loop matters.

In these situations, I also can't see having the microarchitecture
pick the chunk size - usually the I/O latency constraint sets the
chunk size, and the goal of vector execution is to execute the chunks as
efficiently as possible.
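
(For reference, the stripmining pattern assumed in this discussion, where
each trip through the loop simply takes whatever vl the vsetvli returns;
register choices and the e32/m8 settings are purely illustrative:

    # a0 = remaining element count (AVL), a1 = src pointer, a2 = dst pointer
    loop:
      vsetvli t0, a0, e32, m8, ta, ma   # t0 = vl granted (at most VLMAX)
      vle32.v v8, (a1)                  # load one strip
      vse32.v v8, (a2)                  # ...process and store one strip
      slli    t1, t0, 2                 # bytes consumed = vl * 4 for e32
      add     a1, a1, t1
      add     a2, a2, t1
      sub     a0, a0, t0                # retire vl elements from AVL
      bnez    a0, loop
)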

Krste

| i’m not sure of the portability concerns. if an implementation is free to set VLMAX, and software must be written for any possible AVL that is returned, then it
| appears to me that deliberately returning a smaller implementation-defined AVL should still be portable.

| programming for min-latency isn’t common in HPC, but can be useful in real-time systems.

| g

| On Tue, Feb 16, 2021 at 3:01 PM <krste@berkeley.edu> wrote:

| There's a large overlap here with the (rd!=x0,rs1=x0) case that

| selects AVL=VLMAX.  If migration is intended, then VLMAX should be

| same across harts.

| Machines with long temporal vector registers might benefit from using

| less than VLMAX, but this is highly dependent on specifics of the

| interaction of the microarchitecture and the scheduled application

| kernel (otherwise, the long vector registers were a waste of

| resources).  I can't see how to do this portably beyond selecting

| VLMAX.

| Krste

| | Of course, the implementation-defined value must be fixed across all harts, so thread migration doesn't break software.

| | Guy

| | On Mon, Feb 15, 2021 at 11:30 PM <krste@berkeley.edu> wrote:

| |     Replying to old thread to add rationale for current choice.

| |||||| On Mon, 21 Dec 2020 13:52:07 -0800, Zalman Stern <zalman@google.com> said:

| |     | Does it get easier if the specification is just the immediate value plus one?

| |     No - this costs more gates on critical path.  Mapping 00000 => 32 is

| |     simpler in area and delay.

| |     | I really don't understand how this encoding is particularly great for immediates as many of the values are likely very rarely or even never used and
| it

| |     seems

| |     | like one can't get long enough values even for existing SIMD hardware in some data types. Compare to e.g.:

| |     |     (first_bit ? 3 : 1) << rest_of_the_bits

| |     | or:

| |     |     map[] = { 1, 3, 5, 8 }; // Or maybe something else for 5 and 8

| |     |     map[first_two_bits] << rest_of_the_bits;

| |     | I.e. get a lot of powers of two, multiples of three-vecs for graphics, maybe something else.

| |     As a counter-example for this particular example, one code I looked at

| |     recently related to AR/VR used 9 as one dimension.

| |     The challenge is agreeing on the best mapping from the 32 immediate

| |     encodings to the most commonly used AVL values.

| |     More creative mappings do consume some incremental logic and path

| |     delay (as well as adding some complexity to software toolchain).

| |     While they can provide small gains in some cases, this is offset by

| |     small losses in other cases (someone will want AVL=17 somewhere, and

| |     it's not clear that say AVL=40 is a substantially better use of

| |     encoding).  There is not huge penalty if the immediate does not fit,

| |     at most a li instruction, which might be hoisted out of the loop.

| |     The current v0.10 definition uses the obvious mapping of the immediate.

| |     Simplicity is a virtue, and any potential gains are small for AVL >

| |     31, where most implementation costs are amortized over the longer

| |     vector and many implementations won't support longer lengths for a

| |     given datatype in any case.

| |     Krste

| |     | -Z-

| |     | On Mon, Dec 21, 2020 at 10:47 AM Guy Lemieux <guy.lemieux@gmail.com> wrote:

| |     |     for vsetivli, with the uimm=00000 encoding, rather than setting vl to 32, how setting it to some other meaning?

| |     |     one option is to set vl=VLMAX. i have some concerns about software using this safely (eg, if VLMAX turns out to be much larger than software

| |     anticipated,

| |     |     then it would fail; correcting this requires more instructions than just using the regular vsetvl/vsetvli would have used). 

| |     |     another option is to allow an implementation-defined vl to be chosen by hardware; this could be anywhere between 1 and VLMAX. for example,

| |     implementations

| |     |     may just choose vl=32, or they may choose something else. it allows the CPU architect to devise a scheme that best fits the implementation. this
| may

| |     |     consider factors like the effective width of the execution engine, the pipeline depth (to reduce likelihood of stalls from dependent
| instructions), or

| |     |     that the vector register file is actually a multi-level memory hierarchy where some smaller values may operate with greater efficiency (lower
| power),

| |     or

| |     |     matching VL to the optimal memory system burst length. perhaps some guidance by the spec could be given here for the default scheme, eg whether
| the

| |     |     implementation optimizes for best performance or power (while still allowing implementations to modify this default via an implementation-defined
| CSR).

| |     |     software using a few extra cycles to check the returned vl against AVL should not a big problem (the simplest solution being vsetvli followed by

| |     vsetivli)

| |     |     g

| |     |     On Fri, Dec 18, 2020 at 6:13 PM Krste Asanovic <krste@berkeley.edu> wrote:

| |     |         # vsetivli

| |     |         A new variant of vsetvl was proposed providing an immediate as the AVL

| |     |         in rs1[4:0].  The immediate encoding is the same as for CSR immediate

| |     |         instructions. The instruction would have bit 31:30 = 11 and bits 29:20

| |     |         would be encoded same as vsetvli.

| |     |         This would be used when AVL was statically known, and known to fit

| |     |         inside vector register group.  Compared with existing PoR, it removes

| |     |         need to load immediate into a spare scalar register before executing

| |     |         vsetvli, and is useful for handling scalar values in vector register

| |     |         (vl=1) and other cases where short fixed-sized vectors are the

| |     |         datatype (e.g., graphics).

| |     |         There was discussion on whether uimm=00000 should represent 32 or be

| |     |         reserved.  32 is more useful, but adds a little complexity to

| |     |         hardware.

| |     |         There was also discussion on whether instruction should set vill if

| |     |         selected AVL is not supported, or whether should clip vl to VLMAX as

| |     |         with other instructions, or if behavior should be reserved.  Group

| |     |         generally favored writing vill to expose software errors.

| |     |


Re: Vector TG minutes for 2020/12/18 meeting

Guy Lemieux
 

in terms of overlap with that case — that case normally selects maximally sized AVL. the implied goals there are to make best use of vector register capacity and throughput. l

i’m suggesting a case where a minimally sized AVL is used, as chosen by the architect. this allows a programmer to optimize for minimum latency while still getting good throughput. in some cases, the full VLMAX state may still be used to hold data, but operations are chunked down to minimally sized AVL (eg for latency reasons).

i’m not sure of the portability concerns. if an implementation is free to set VLMAX, and software must be written for any possible AVL that is returned, then it appears to me that deliberately returning a smaller implementation-defined AVL should still be portable.

programming for min-latency isn’t common in HPC, but can be useful in real-time systems.

g



On Tue, Feb 16, 2021 at 3:01 PM <krste@...> wrote:
There's a large overlap here with the (rd!=x0,rs1=x0) case that

selects AVL=VLMAX.  If migration is intended, then VLMAX should be

same across harts.



Machines with long temporal vector registers might benefit from using

less than VLMAX, but this is highly dependent on specifics of the

interaction of the microarchitecture and the scheduled application

kernel (otherwise, the long vector registers were a waste of

resources).  I can't see how to do this portably beyond selecting

VLMAX.



Krste





| Of course, the implementation-defined value must be fixed across all harts, so thread migration doesn't break software.



| Guy



| On Mon, Feb 15, 2021 at 11:30 PM <krste@...> wrote:



|     Replying to old thread to add rationale for current choice.



|||||| On Mon, 21 Dec 2020 13:52:07 -0800, Zalman Stern <zalman@...> said:



|     | Does it get easier if the specification is just the immediate value plus one?



|     No - this costs more gates on critical path.  Mapping 00000 => 32 is

|     simpler in area and delay.



|     | I really don't understand how this encoding is particularly great for immediates as many of the values are likely very rarely or even never used and it

|     seems

|     | like one can't get long enough values even for existing SIMD hardware in some data types. Compare to e.g.:

|     |     (first_bit ? 3 : 1) << rest_of_the_bits

|     | or:

|     |     map[] = { 1, 3, 5, 8 }; // Or maybe something else for 5 and 8

|     |     map[first_two_bits] << rest_of_the_bits;



|     | I.e. get a lot of powers of two, multiples of three-vecs for graphics, maybe something else.



|     As a counter-example for this particular example, one code I looked at

|     recently related to AR/VR used 9 as one dimension.



|     The challenge is agreeing on the best mapping from the 32 immediate

|     encodings to the most commonly used AVL values.



|     More creative mappings do consume some incremental logic and path

|     delay (as well as adding some complexity to software toolchain).

|     While they can provide small gains in some cases, this is offset by

|     small losses in other cases (someone will want AVL=17 somewhere, and

|     it's not clear that say AVL=40 is a substantially better use of

|     encoding).  There is not huge penalty if the immediate does not fit,

|     at most a li instruction, which might be hoisted out of the loop.



|     The current v0.10 definition uses the obvious mapping of the immediate.

|     Simplicity is a virtue, and any potential gains are small for AVL >

|     31, where most implementation costs are amortized over the longer

|     vector and many implementations won't support longer lengths for a

|     given datatype in any case.



|     Krste



|     | -Z-



|     | On Mon, Dec 21, 2020 at 10:47 AM Guy Lemieux <guy.lemieux@...> wrote:



|     |     for vsetivli, with the uimm=00000 encoding, rather than setting vl to 32, how setting it to some other meaning?



|     |     one option is to set vl=VLMAX. i have some concerns about software using this safely (eg, if VLMAX turns out to be much larger than software

|     anticipated,

|     |     then it would fail; correcting this requires more instructions than just using the regular vsetvl/vsetvli would have used). 



|     |     another option is to allow an implementation-defined vl to be chosen by hardware; this could be anywhere between 1 and VLMAX. for example,

|     implementations

|     |     may just choose vl=32, or they may choose something else. it allows the CPU architect to devise a scheme that best fits the implementation. this may

|     |     consider factors like the effective width of the execution engine, the pipeline depth (to reduce likelihood of stalls from dependent instructions), or

|     |     that the vector register file is actually a multi-level memory hierarchy where some smaller values may operate with greater efficiency (lower power),

|     or

|     |     matching VL to the optimal memory system burst length. perhaps some guidance by the spec could be given here for the default scheme, eg whether the

|     |     implementation optimizes for best performance or power (while still allowing implementations to modify this default via an implementation-defined CSR).

|     |     software using a few extra cycles to check the returned vl against AVL should not a big problem (the simplest solution being vsetvli followed by

|     vsetivli)



|     |     g



|     |     On Fri, Dec 18, 2020 at 6:13 PM Krste Asanovic <krste@...> wrote:



|     |         # vsetivli



|     |         A new variant of vsetvl was proposed providing an immediate as the AVL

|     |         in rs1[4:0].  The immediate encoding is the same as for CSR immediate

|     |         instructions. The instruction would have bit 31:30 = 11 and bits 29:20

|     |         would be encoded same as vsetvli.



|     |         This would be used when AVL was statically known, and known to fit

|     |         inside vector register group.  Compared with existing PoR, it removes

|     |         need to load immediate into a spare scalar register before executing

|     |         vsetvli, and is useful for handling scalar values in vector register

|     |         (vl=1) and other cases where short fixed-sized vectors are the

|     |         datatype (e.g., graphics).



|     |         There was discussion on whether uimm=00000 should represent 32 or be

|     |         reserved.  32 is more useful, but adds a little complexity to

|     |         hardware.



|     |         There was also discussion on whether instruction should set vill if

|     |         selected AVL is not supported, or whether should clip vl to VLMAX as

|     |         with other instructions, or if behavior should be reserved.  Group

|     |         generally favored writing vill to expose software errors.




Re: Vector TG minutes for 2020/12/18 meeting

Krste Asanovic
 

On Tue, 16 Feb 2021 01:48:57 -0800, Guy Lemieux <guy.lemieux@gmail.com> said:
| I agree with you.
| I had suggested the mapping of 00000 to an implementation-defined value (chosen by the CPU architect). For some architectures, this may be 16, for others it may
| be 32, or even 2.

| The value selected should be selected as the minimum recommended vector length that can achieve good performance (high FU utilization or good memory bandwidth,
| or a balance) on the underlying hardware.

| This would greatly simplify software that just wants to get "reasonable" acceleration without writing code to measure performance of the underlying hardware.
| Such code may select poor values if harts are heterogeneous and a thread migrates. By making this implementation-defined, a value suitable for all harts can be
| selected by the processor architect.

There's a large overlap here with the (rd!=x0,rs1=x0) case that
selects AVL=VLMAX. If migration is intended, then VLMAX should be the
same across harts.

Machines with long temporal vector registers might benefit from using
less than VLMAX, but this is highly dependent on specifics of the
interaction of the microarchitecture and the scheduled application
kernel (otherwise, the long vector registers were a waste of
resources). I can't see how to do this portably beyond selecting
VLMAX.

Krste


| Of course, the implementation-defined value must be fixed across all harts, so thread migration doesn't break software.

| Guy

| On Mon, Feb 15, 2021 at 11:30 PM <krste@berkeley.edu> wrote:

| Replying to old thread to add rationale for current choice.

|||||| On Mon, 21 Dec 2020 13:52:07 -0800, Zalman Stern <zalman@google.com> said:

| | Does it get easier if the specification is just the immediate value plus one?

| No - this costs more gates on critical path.  Mapping 00000 => 32 is
| simpler in area and delay.

| | I really don't understand how this encoding is particularly great for immediates as many of the values are likely very rarely or even never used and it
| seems
| | like one can't get long enough values even for existing SIMD hardware in some data types. Compare to e.g.:
| |     (first_bit ? 3 : 1) << rest_of_the_bits
| | or:
| |     map[] = { 1, 3, 5, 8 }; // Or maybe something else for 5 and 8
| |     map[first_two_bits] << rest_of_the_bits;

| | I.e. get a lot of powers of two, multiples of three-vecs for graphics, maybe something else.

| As a counter-example for this particular example, one code I looked at
| recently related to AR/VR used 9 as one dimension.

| The challenge is agreeing on the best mapping from the 32 immediate
| encodings to the most commonly used AVL values.

| More creative mappings do consume some incremental logic and path
| delay (as well as adding some complexity to software toolchain).
| While they can provide small gains in some cases, this is offset by
| small losses in other cases (someone will want AVL=17 somewhere, and
| it's not clear that say AVL=40 is a substantially better use of
| encoding).  There is not huge penalty if the immediate does not fit,
| at most a li instruction, which might be hoisted out of the loop.

| The current v0.10 definition uses the obvious mapping of the immediate.
| Simplicity is a virtue, and any potential gains are small for AVL >
| 31, where most implementation costs are amortized over the longer
| vector and many implementations won't support longer lengths for a
| given datatype in any case.

| Krste

| | -Z-

| | On Mon, Dec 21, 2020 at 10:47 AM Guy Lemieux <guy.lemieux@gmail.com> wrote:

| |     for vsetivli, with the uimm=00000 encoding, rather than setting vl to 32, how setting it to some other meaning?

| |     one option is to set vl=VLMAX. i have some concerns about software using this safely (eg, if VLMAX turns out to be much larger than software
| anticipated,
| |     then it would fail; correcting this requires more instructions than just using the regular vsetvl/vsetvli would have used). 

| |     another option is to allow an implementation-defined vl to be chosen by hardware; this could be anywhere between 1 and VLMAX. for example,
| implementations
| |     may just choose vl=32, or they may choose something else. it allows the CPU architect to devise a scheme that best fits the implementation. this may
| |     consider factors like the effective width of the execution engine, the pipeline depth (to reduce likelihood of stalls from dependent instructions), or
| |     that the vector register file is actually a multi-level memory hierarchy where some smaller values may operate with greater efficiency (lower power),
| or
| |     matching VL to the optimal memory system burst length. perhaps some guidance by the spec could be given here for the default scheme, eg whether the
| |     implementation optimizes for best performance or power (while still allowing implementations to modify this default via an implementation-defined CSR).
| |     software using a few extra cycles to check the returned vl against AVL should not a big problem (the simplest solution being vsetvli followed by
| vsetivli)

| |     g

| |     On Fri, Dec 18, 2020 at 6:13 PM Krste Asanovic <krste@berkeley.edu> wrote:

| |         # vsetivli

| |         A new variant of vsetvl was proposed providing an immediate as the AVL
| |         in rs1[4:0].  The immediate encoding is the same as for CSR immediate
| |         instructions. The instruction would have bit 31:30 = 11 and bits 29:20
| |         would be encoded same as vsetvli.

| |         This would be used when AVL was statically known, and known to fit
| |         inside vector register group.  Compared with existing PoR, it removes
| |         need to load immediate into a spare scalar register before executing
| |         vsetvli, and is useful for handling scalar values in vector register
| |         (vl=1) and other cases where short fixed-sized vectors are the
| |         datatype (e.g., graphics).

| |         There was discussion on whether uimm=00000 should represent 32 or be
| |         reserved.  32 is more useful, but adds a little complexity to
| |         hardware.

| |         There was also discussion on whether instruction should set vill if
| |         selected AVL is not supported, or whether should clip vl to VLMAX as
| |         with other instructions, or if behavior should be reserved.  Group
| |         generally favored writing vill to expose software errors.

| |


Re: Vector TG minutes for 2020/12/18 meeting

Guy Lemieux
 

I agree with you.

I had suggested the mapping of 00000 to an implementation-defined value (chosen by the CPU architect). For some architectures, this may be 16, for others it may be 32, or even 2.

The value selected should be selected as the minimum recommended vector length that can achieve good performance (high FU utilization or good memory bandwidth, or a balance) on the underlying hardware.

This would greatly simplify software that just wants to get "reasonable" acceleration without writing code to measure performance of the underlying hardware. Such code may select poor values if harts are heterogeneous and a thread migrates. By making this implementation-defined, a value suitable for all harts can be selected by the processor architect.

Of course, the implementation-defined value must be fixed across all harts, so thread migration doesn't break software.

Guy



On Mon, Feb 15, 2021 at 11:30 PM <krste@...> wrote:

Replying to old thread to add rationale for current choice.

>>>>> On Mon, 21 Dec 2020 13:52:07 -0800, Zalman Stern <zalman@...> said:

| Does it get easier if the specification is just the immediate value plus one?

No - this costs more gates on critical path.  Mapping 00000 => 32 is
simpler in area and delay.

| I really don't understand how this encoding is particularly great for immediates as many of the values are likely very rarely or even never used and it seems
| like one can't get long enough values even for existing SIMD hardware in some data types. Compare to e.g.:
|     (first_bit ? 3 : 1) << rest_of_the_bits
| or:
|     map[] = { 1, 3, 5, 8 }; // Or maybe something else for 5 and 8
|     map[first_two_bits] << rest_of_the_bits;

| I.e. get a lot of powers of two, multiples of three-vecs for graphics, maybe something else.

As a counter-example for this particular example, one code I looked at
recently related to AR/VR used 9 as one dimension.

The challenge is agreeing on the best mapping from the 32 immediate
encodings to the most commonly used AVL values.

More creative mappings do consume some incremental logic and path
delay (as well as adding some complexity to software toolchain).
While they can provide small gains in some cases, this is offset by
small losses in other cases (someone will want AVL=17 somewhere, and
it's not clear that say AVL=40 is a substantially better use of
encoding).  There is not huge penalty if the immediate does not fit,
at most a li instruction, which might be hoisted out of the loop.

The current v0.10 definition uses the obvious mapping of the immediate.
Simplicity is a virtue, and any potential gains are small for AVL >
31, where most implementation costs are amortized over the longer
vector and many implementations won't support longer lengths for a
given datatype in any case.

Krste


| -Z-

| On Mon, Dec 21, 2020 at 10:47 AM Guy Lemieux <guy.lemieux@...> wrote:

|     for vsetivli, with the uimm=00000 encoding, rather than setting vl to 32, how setting it to some other meaning?

|     one option is to set vl=VLMAX. i have some concerns about software using this safely (eg, if VLMAX turns out to be much larger than software anticipated,
|     then it would fail; correcting this requires more instructions than just using the regular vsetvl/vsetvli would have used). 

|     another option is to allow an implementation-defined vl to be chosen by hardware; this could be anywhere between 1 and VLMAX. for example, implementations
|     may just choose vl=32, or they may choose something else. it allows the CPU architect to devise a scheme that best fits the implementation. this may
|     consider factors like the effective width of the execution engine, the pipeline depth (to reduce likelihood of stalls from dependent instructions), or
|     that the vector register file is actually a multi-level memory hierarchy where some smaller values may operate with greater efficiency (lower power), or
|     matching VL to the optimal memory system burst length. perhaps some guidance by the spec could be given here for the default scheme, eg whether the
|     implementation optimizes for best performance or power (while still allowing implementations to modify this default via an implementation-defined CSR).
|     software using a few extra cycles to check the returned vl against AVL should not a big problem (the simplest solution being vsetvli followed by vsetivli)

|     g

|     On Fri, Dec 18, 2020 at 6:13 PM Krste Asanovic <krste@...> wrote:

|         # vsetivli

|         A new variant of vsetvl was proposed providing an immediate as the AVL
|         in rs1[4:0].  The immediate encoding is the same as for CSR immediate
|         instructions. The instruction would have bit 31:30 = 11 and bits 29:20
|         would be encoded same as vsetvli.

|         This would be used when AVL was statically known, and known to fit
|         inside vector register group.  Compared with existing PoR, it removes
|         need to load immediate into a spare scalar register before executing
|         vsetvli, and is useful for handling scalar values in vector register
|         (vl=1) and other cases where short fixed-sized vectors are the
|         datatype (e.g., graphics).

|         There was discussion on whether uimm=00000 should represent 32 or be
|         reserved.  32 is more useful, but adds a little complexity to
|         hardware.

|         There was also discussion on whether instruction should set vill if
|         selected AVL is not supported, or whether should clip vl to VLMAX as
|         with other instructions, or if behavior should be reserved.  Group
|         generally favored writing vill to expose software errors.


Re: Vector TG minutes for 2020/12/18 meeting

Krste Asanovic
 

Replying to old thread to add rationale for current choice.

On Mon, 21 Dec 2020 13:52:07 -0800, Zalman Stern <zalman@google.com> said:
| Does it get easier if the specification is just the immediate value plus one?

No - this costs more gates on critical path. Mapping 00000 => 32 is
simpler in area and delay.

| I really don't understand how this encoding is particularly great for immediates as many of the values are likely very rarely or even never used and it seems
| like one can't get long enough values even for existing SIMD hardware in some data types. Compare to e.g.:
|     (first_bit ? 3 : 1) << rest_of_the_bits
| or:
|     map[] = { 1, 3, 5, 8 }; // Or maybe something else for 5 and 8
|     map[first_two_bits] << rest_of_the_bits;

| I.e. get a lot of powers of two, multiples of three-vecs for graphics, maybe something else.

As a counter-example for this particular example, one code I looked at
recently related to AR/VR used 9 as one dimension.

The challenge is agreeing on the best mapping from the 32 immediate
encodings to the most commonly used AVL values.

More creative mappings do consume some incremental logic and path
delay (as well as adding some complexity to software toolchain).
While they can provide small gains in some cases, this is offset by
small losses in other cases (someone will want AVL=17 somewhere, and
it's not clear that say AVL=40 is a substantially better use of
encoding). There is no huge penalty if the immediate does not fit,
at most a li instruction, which might be hoisted out of the loop.
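
(For concreteness, the two sequences being compared, using the v0.10
mnemonics; the AVL values and register choices are just examples:

    # AVL fits the 5-bit immediate:
    vsetivli t0, 9, e32, m1, ta, ma

    # AVL does not fit (e.g., 40): one extra li, hoistable out of the loop
    li       a1, 40
    vsetvli  t0, a1, e32, m1, ta, ma
)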

The current v0.10 definition uses the obvious mapping of the immediate.
Simplicity is a virtue, and any potential gains are small for AVL >
31, where most implementation costs are amortized over the longer
vector and many implementations won't support longer lengths for a
given datatype in any case.

Krste


| -Z-

| On Mon, Dec 21, 2020 at 10:47 AM Guy Lemieux <guy.lemieux@gmail.com> wrote:

| for vsetivli, with the uimm=00000 encoding, rather than setting vl to 32, how setting it to some other meaning?

| one option is to set vl=VLMAX. i have some concerns about software using this safely (eg, if VLMAX turns out to be much larger than software anticipated,
| then it would fail; correcting this requires more instructions than just using the regular vsetvl/vsetvli would have used). 

| another option is to allow an implementation-defined vl to be chosen by hardware; this could be anywhere between 1 and VLMAX. for example, implementations
| may just choose vl=32, or they may choose something else. it allows the CPU architect to devise a scheme that best fits the implementation. this may
| consider factors like the effective width of the execution engine, the pipeline depth (to reduce likelihood of stalls from dependent instructions), or
| that the vector register file is actually a multi-level memory hierarchy where some smaller values may operate with greater efficiency (lower power), or
| matching VL to the optimal memory system burst length. perhaps some guidance by the spec could be given here for the default scheme, eg whether the
| implementation optimizes for best performance or power (while still allowing implementations to modify this default via an implementation-defined CSR).
| software using a few extra cycles to check the returned vl against AVL should not a big problem (the simplest solution being vsetvli followed by vsetivli)

| g

| On Fri, Dec 18, 2020 at 6:13 PM Krste Asanovic <krste@berkeley.edu> wrote:

| # vsetivli

| A new variant of vsetvl was proposed providing an immediate as the AVL
| in rs1[4:0].  The immediate encoding is the same as for CSR immediate
| instructions. The instruction would have bit 31:30 = 11 and bits 29:20
| would be encoded same as vsetvli.

| This would be used when AVL was statically known, and known to fit
| inside vector register group.  Compared with existing PoR, it removes
| need to load immediate into a spare scalar register before executing
| vsetvli, and is useful for handling scalar values in vector register
| (vl=1) and other cases where short fixed-sized vectors are the
| datatype (e.g., graphics).

| There was discussion on whether uimm=00000 should represent 32 or be
| reserved.  32 is more useful, but adds a little complexity to
| hardware.

| There was also discussion on whether instruction should set vill if
| selected AVL is not supported, or whether should clip vl to VLMAX as
| with other instructions, or if behavior should be reserved.  Group
| generally favored writing vill to expose software errors.

|


Re: New member request for participation info

Jim Wilson
 

On Sun, Feb 7, 2021 at 9:48 PM ghost <ghost@...> wrote:
been overlooked.  Could someone please tell me (1) where to find the
current spec drafts, and (2) the best way to share any observations?

Try github.com/riscv/riscv-v-spec.  There is also software stuff like riscv/rvv-intrinsic-doc that is defining compiler intrinsics for the vector spec.

Like bitmanip, you can file issues or pull requests, which is probably the best approach.  Or you can send email to this list.  Or raise issues in the meeting.

You can find the "Tech Groups Calendar" on the wiki, along with a lot of other useful info like specifications status.

Jim


New member request for participation info

ghost
 

Hello,

I've just joined the RISC-V technical community and the V Extension
Task group. I have very substantial experience in careful technical
documentation (I wrote the RFCs for gzip and DEFLATE, and was one of
the few non-Adobe reviewers for the PostScript and PDF reference
manuals). What I'm hoping to contribute to the RISC-V community is
primarily documentation review.

I know that the V extension(s) are quite close to release for public
comment, but I would still like the opportunity to review them in
detail -- another pair of eyes can sometimes spot things that have
been overlooked. Could someone please tell me (1) where to find the
current spec drafts, and (2) the best way to share any observations?

Thanks -

L Peter Deutsch <ghost@major2nd.com> :: Aladdin Enterprises :: Healdsburg, CA

Was your vote really counted? http://www.verifiedvoting.org


Vector Task Group minutes for 2021/02/05 meeting

Krste Asanovic
 

Date: 2021/02/05
Task Group: Vector Extension
Chair: Krste Asanovic
Vice-Chair: Roger Espasa
Number of Attendees: ~16
Current issues on github: https://github.com/riscv/riscv-v-spec

# Next Meeting

It was decided to meet again in two weeks (Feb 19) to allow time for
everyone to digest and comment on the v0.10 release version. Please
send PRs for any small typos and clarifications, and use mailing list
for larger issues.

Issues discussed

# Assembly syntax

There was a desire to move away from allowing vsetvl to imply
"undisturbed" behavior by default, to ensure maximum use of "agnostic"
by software. The assembler can issue errors instead of warnings when
the ta/tu/ma/mu fields are not explicitly given, perhaps with an
option to accept the old forms with only a warning so that older code
can still be compiled.
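
As a concrete illustration of the difference (assuming the v0.10 mnemonics):

    vsetvli t0, a0, e32, m2, ta, ma   # policy bits explicit: accepted
    vsetvli t0, a0, e32, m2           # ta/tu/ma/mu omitted: assembler error,
                                      #   or warning under a legacy option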

# Spec formatting

There was some discussion on the use of Wavedrom formatting tools. New
tools to generate diagrams for the register layouts will be added. There
was also a promise of somewhat faster build times for the doc. There is
apparently a central flow for running document generation on
commits at riscv.org, and we need to sync up with that process.

# Extend agnostic behavior of mask logical operations

There was a request to extend tail-agnostic behavior of mask logical
instructions to allow the tail to be overwritten with values
corresponding to the logical operation (as opposed to agnostic values
that currently can only be all 1s or the previous destination value).
This is a relaxation of requirements, so it would not affect
compatibility of existing implementations. To be discussed.


Next RISC-V Vector Task Group Meeting reminder

Krste Asanovic
 

We’ll meet tomorrow in the usual slot per the TG calendar.

The agenda is to review any feedback on the 0.10 spec and then to proceed through any outstanding issues,

Krste
