Vector TG minutes for 2020/12/18 meeting


Krste Asanovic
 

Date: 2020/12/18
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~10
Current issues on github: https://github.com/riscv/riscv-v-spec

Note: No more meetings are scheduled until next year. The time slot may
have to change.

Issues discussed:

# Freeze process

We are close to freezing the spec. There is a waiver from chairs for
SAIL model and compatibility tests, but we will need to complete these
before ratification.

# auto pdf generation

There was a request to have the repo automatically generate a pdf
version on commits to avoid users having to install formatting tools.

# Mask handling

Continuing discussion, the concrete proposal is to add new unit-stride
loads and stores that would use the lumop/sumop field to encode byte
load/stores used for masks, and also use effective vl = ceil(vl/8)
(implying effectively EMUL<=1). Proposed instructions would be:

vle1.v vd, (rs1) # Byte load with effective vl = ceil(vl/8)
vse1.v vs2, (rs1) # Byte store with effective vl = ceil(vl/8)

Encoded with lumop/sumop = 00011.
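
As a rough illustration of the effective-length rule above, this Python sketch computes how many bytes a mask load/store would transfer for a given vl. The function name `mask_bytes` is made up for this example; only the ceil(vl/8) formula comes from the proposal.

```python
def mask_bytes(vl: int) -> int:
    """Bytes moved by the proposed vle1.v / vse1.v: ceil(vl / 8)."""
    if vl < 0:
        raise ValueError("vl must be non-negative")
    # One mask bit per element, eight mask bits per byte, rounded up.
    return (vl + 7) // 8


if __name__ == "__main__":
    for vl in (0, 1, 8, 9, 64):
        print(f"vl={vl:3d} -> {mask_bytes(vl)} byte(s)")
```

Since vl cannot exceed VLEN elements (the SEW=8, LMUL=8 case), the byte count should never exceed VLEN/8, i.e. a single vector register, which is one way to read the "EMUL<=1" remark.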

We discussed adding whole vector register load version with
lumop=01011, which would only be a mask hint, but for now, this seems
less necessary so is not on PoR.

vl1re1.v vd, (rs1) # Whole register load

# vsetivli

A new variant of vsetvl was proposed providing an immediate as the AVL
in rs1[4:0]. The immediate encoding is the same as for CSR immediate
instructions. The instruction would have bits 31:30 = 11, and bits 29:20
would be encoded the same as in vsetvli.
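
A minimal sketch of how such an instruction word could be assembled is shown below. The positions of bits 31:30, bits 29:20 and the rs1 field come from the description above; the rd position, funct3 = 111 and the OP-V major opcode are assumed to carry over unchanged from vsetvli. The function name `encode_vsetivli` is hypothetical.

```python
OPCODE_OP_V = 0b1010111  # major opcode of vsetvli (assumed shared)
FUNCT3_CFG = 0b111       # funct3 of vsetvli (assumed shared)


def encode_vsetivli(rd: int, uimm: int, zimm: int) -> int:
    """Pack rd, the 5-bit AVL immediate and the 10-bit vtype immediate."""
    if not 0 <= rd < 32:
        raise ValueError("rd must be a register number 0..31")
    if not 0 <= uimm < 32:
        raise ValueError("uimm must fit in 5 bits")
    if not 0 <= zimm < 1024:
        raise ValueError("zimm must fit in 10 bits")
    return (
        (0b11 << 30)          # bits 31:30 = 11 select the immediate-AVL form
        | (zimm << 20)        # bits 29:20, same layout as vsetvli
        | (uimm << 15)        # AVL immediate in the rs1 field
        | (FUNCT3_CFG << 12)
        | (rd << 7)
        | OPCODE_OP_V
    )


if __name__ == "__main__":
    print(hex(encode_vsetivli(rd=5, uimm=4, zimm=0)))
```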

This would be used when AVL was statically known, and known to fit
inside vector register group. Compared with existing PoR, it removes
need to load immediate into a spare scalar register before executing
vsetvli, and is useful for handling scalar values in vector register
(vl=1) and other cases where short fixed-sized vectors are the
datatype (e.g., graphics).
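
The saving can be sketched as a small code-generation helper that picks between the two sequences. This is only an illustration: the helper name, the default register names and the vtype string are assumptions, and AVL values 0 and 32 are deliberately sent down the register path because their immediate encoding is the open question discussed next.

```python
from typing import List


def set_vl_sequence(avl: int, vtype: str, rd: str = "t0",
                    scratch: str = "t1") -> List[str]:
    """Return assembly lines that request `avl` elements with `vtype`."""
    if avl < 0:
        raise ValueError("avl must be non-negative")
    if 1 <= avl <= 31:
        # AVL fits the 5-bit immediate: one instruction, no scratch register.
        return [f"vsetivli {rd}, {avl}, {vtype}"]
    # Otherwise materialize AVL in a scalar register first.
    return [f"li {scratch}, {avl}", f"vsetvli {rd}, {scratch}, {vtype}"]


if __name__ == "__main__":
    for avl in (1, 4, 100):
        print(avl, set_vl_sequence(avl, "e32,m1"))
```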

There was discussion on whether uimm=00000 should represent 32 or be
reserved. 32 is more useful, but adds a little complexity to
hardware.
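
The two options can be sketched as a decode function. The names here are hypothetical, and the bitwise form is just one way to picture the extra hardware (detecting an all-zero field and setting a sixth bit), not a statement about any real implementation.

```python
from typing import Optional


def decode_avl(uimm: int, zero_means_32: bool) -> Optional[int]:
    """AVL for a 5-bit uimm; None models a reserved encoding."""
    if not 0 <= uimm <= 31:
        raise ValueError("uimm must fit in 5 bits")
    if uimm == 0:
        return 32 if zero_means_32 else None
    return uimm


def decode_avl_bitwise(uimm: int) -> int:
    """Same as the 0 -> 32 option: bit 5 is set when all five bits are zero."""
    return ((uimm == 0) << 5) | uimm


if __name__ == "__main__":
    for u in (0, 1, 31):
        print(u, decode_avl(u, True), decode_avl(u, False))
```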

There was also discussion on whether the instruction should set vill if
the selected AVL is not supported, whether it should clip vl to VLMAX as
with other instructions, or whether the behavior should be reserved. The
group generally favored writing vill to expose software errors.
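
The difference between the first two options can be sketched as follows. This is a simplified model under stated assumptions: VLMAX is taken as VLEN*LMUL/SEW with integer LMUL only, "clip" is modeled as a plain min() (the real vsetvli rule allows more freedom when AVL is between VLMAX and 2*VLMAX), and setting vl to 0 alongside vill is assumed from how unsupported vtype settings are handled. All names are hypothetical.

```python
from typing import Tuple


def vlmax(vlen_bits: int, sew_bits: int, lmul: int = 1) -> int:
    """Elements per register group, integer LMUL only."""
    return (vlen_bits * lmul) // sew_bits


def apply_vsetivli(avl: int, vlmax_elems: int, policy: str) -> Tuple[int, bool]:
    """Return (vl, vill) for an immediate AVL under the given policy."""
    if avl <= vlmax_elems:
        return avl, False          # request fits: both policies agree
    if policy == "vill":
        return 0, True             # flag the software error
    if policy == "clip":
        return vlmax_elems, False  # silently shorten the vector
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    limit = vlmax(vlen_bits=128, sew_bits=32)  # 4 elements
    for policy in ("vill", "clip"):
        print(policy, apply_vsetivli(8, limit, policy))
```

Under "clip", code that assumed a fixed 8-element vector would keep running with only 4 elements active, which is the kind of silent error the group wanted vill to expose.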


Guy Lemieux
 

for vsetivli, with the uimm=00000 encoding, rather than setting vl to 32, how about giving it some other meaning?

one option is to set vl=VLMAX. i have some concerns about software using this safely (eg, if VLMAX turns out to be much larger than software anticipated, then it would fail; correcting this requires more instructions than just using the regular vsetvl/vsetvli would have used). 

another option is to allow an implementation-defined vl to be chosen by hardware; this could be anywhere between 1 and VLMAX. for example, implementations may just choose vl=32, or they may choose something else. it allows the CPU architect to devise a scheme that best fits the implementation. this may consider factors like the effective width of the execution engine, the pipeline depth (to reduce likelihood of stalls from dependent instructions), or that the vector register file is actually a multi-level memory hierarchy where some smaller values may operate with greater efficiency (lower power), or matching VL to the optimal memory system burst length. perhaps some guidance by the spec could be given here for the default scheme, eg whether the implementation optimizes for best performance or power (while still allowing implementations to modify this default via an implementation-defined CSR). software using a few extra cycles to check the returned vl against AVL should not be a big problem (the simplest solution being vsetvli followed by vsetivli)

g




Zalman Stern
 

Does it get easier if the specification is just the immediate value plus one?

I really don't understand how this encoding is particularly great for immediates as many of the values are likely very rarely or even never used and it seems like one can't get long enough values even for existing SIMD hardware in some data types. Compare to e.g.:
    (first_bit ? 3 : 1) << rest_of_the_bits
or:
    map[] = { 1, 3, 5, 8 }; // Or maybe something else for 5 and 8
    map[first_two_bits] << rest_of_the_bits;

I.e. get a lot of powers of two, multiples of three-vecs for graphics, maybe something else.
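
To make the two suggested mappings concrete, this sketch enumerates the AVL values each would produce from a 5-bit immediate. It assumes "first_bit" and "first_two_bits" mean the most significant bit(s) of the immediate, which the message does not spell out; the function names are made up.

```python
def shift_1_or_3(imm: int) -> int:
    """(first_bit ? 3 : 1) << rest_of_the_bits, with first_bit = imm[4]."""
    base = 3 if (imm >> 4) & 1 else 1
    return base << (imm & 0xF)


def table_shift(imm: int, table=(1, 3, 5, 8)) -> int:
    """map[first_two_bits] << rest_of_the_bits, selector = imm[4:3]."""
    return table[(imm >> 3) & 0b11] << (imm & 0b111)


if __name__ == "__main__":
    for name, fn in (("1-or-3 shift", shift_1_or_3),
                     ("table shift", table_shift)):
        values = sorted({fn(i) for i in range(32)})
        print(f"{name}: {len(values)} distinct values, max {values[-1]}")
        print("  ", values)
```

With these assumptions the table entry 8 repeats several values already produced by entry 1, which lines up with the "maybe something else for 5 and 8" comment.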

-Z-




lidawei14@...
 

Perhaps, to make the naming convention for mask operations explicit, we could rename "vle1.v" to "vmle1.v".


Krste Asanovic
 

Replying to old thread to add rationale for current choice.

On Mon, 21 Dec 2020 13:52:07 -0800, Zalman Stern <zalman@google.com> said:
| Does it get easier if the specification is just the immediate value plus one?

No - this costs more gates on critical path. Mapping 00000 => 32 is
simpler in area and delay.

| I really don't understand how this encoding is particularly great for immediates as many of the values are likely very rarely or even never used and it seems
| like one can't get long enough values even for existing SIMD hardware in some data types. Compare to e.g.:
|     (first_bit ? 3 : 1) << rest_of_the_bits
| or:
|     map[] = { 1, 3, 5, 8 }; // Or maybe something else for 5 and 8
|     map[first_two_bits] << rest_of_the_bits;

| I.e. get a lot of powers of two, multiples of three-vecs for graphics, maybe something else.

As a counter-example for this particular example, one code I looked at
recently related to AR/VR used 9 as one dimension.

The challenge is agreeing on the best mapping from the 32 immediate
encodings to the most commonly used AVL values.

More creative mappings do consume some incremental logic and path
delay (as well as adding some complexity to software toolchain).
While they can provide small gains in some cases, this is offset by
small losses in other cases (someone will want AVL=17 somewhere, and
it's not clear that say AVL=40 is a substantially better use of
encoding). There is no huge penalty if the immediate does not fit,
at most a li instruction, which might be hoisted out of the loop.
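
The trade-off can be checked for the specific AVL values mentioned in this message. The sketch below tests 9, 17 and 40 against the plain immediate (taken here as 0..31) and against the two quoted mappings, again assuming the selector bits are the most significant bits of the 5-bit field.

```python
# Value sets reachable from a 5-bit immediate under each scheme.
LINEAR = set(range(32))
ONE_OR_THREE = {(3 if i >> 4 else 1) << (i & 0xF) for i in range(32)}
TABLE = {(1, 3, 5, 8)[i >> 3] << (i & 0b111) for i in range(32)}


def fits(avl: int) -> dict:
    """Which schemes can express `avl` without an extra li instruction."""
    return {
        "linear": avl in LINEAR,
        "1-or-3 shift": avl in ONE_OR_THREE,
        "table shift": avl in TABLE,
    }


if __name__ == "__main__":
    for avl in (9, 17, 40):
        print(avl, fits(avl))
```

Under these assumptions 9 and 17 are only expressible with the plain immediate, while 40 is only expressible with the table-based mapping, so each scheme wins some cases and loses others.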

The current v0.10 definition uses the obvious mapping of the immediate.
Simplicity is a virtue, and any potential gains are small for AVL >
31, where most implementation costs are amortized over the longer
vector and many implementations won't support longer lengths for a
given datatype in any case.

Krste




Guy Lemieux
 

I agree with you.

I had suggested the mapping of 00000 to an implementation-defined value (chosen by the CPU architect). For some architectures, this may be 16, for others it may be 32, or even 2.

The value should be selected as the minimum recommended vector length that can achieve good performance (high FU utilization or good memory bandwidth, or a balance) on the underlying hardware.

This would greatly simplify software that just wants to get "reasonable" acceleration without writing code to measure performance of the underlying hardware. Such code may select poor values if harts are heterogeneous and a thread migrates. By making this implementation-defined, a value suitable for all harts can be selected by the processor architect.

Of course, the implementation-defined value must be fixed across all harts, so thread migration doesn't break software.

Guy





Krste Asanovic
 

On Tue, 16 Feb 2021 01:48:57 -0800, Guy Lemieux <guy.lemieux@gmail.com> said:
| I agree with you.
| I had suggested the mapping of 00000 to an implementation-defined value (chosen by the CPU architect). For some architectures, this may be 16, for others it may
| be 32, or even 2.

| The value selected should be selected as the minimum recommended vector length that can achieve good performance (high FU utilization or good memory bandwidth,
| or a balance) on the underlying hardware.

| This would greatly simplify software that just wants to get "reasonable" acceleration without writing code to measure performance of the underlying hardware.
| Such code may select poor values if harts are heterogeneous and a thread migrates. By making this implementation-defined, a value suitable for all harts can be
| selected by the processor architect.

There's a large overlap here with the (rd!=x0,rs1=x0) case that
selects AVL=VLMAX. If migration is intended, then VLMAX should be
same across harts.
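
For context, the register-operand cases of vsetvli/vsetvl can be modeled roughly as below. Only the (rd!=x0, rs1=x0) case is stated in this thread; the rd=x0, rs1=x0 behavior (keep the current vl) is recalled from the v0.10 draft and should be treated as an assumption, as should the function and parameter names.

```python
from typing import Sequence


def select_avl(rd: int, rs1: int, xregs: Sequence[int],
               current_vl: int, vlmax: int) -> int:
    """Pick the AVL that vsetvli/vsetvl would use, by register numbers."""
    if rs1 != 0:
        return xregs[rs1]  # normal case: AVL comes from x[rs1]
    if rd != 0:
        return vlmax       # rs1 = x0, rd != x0: request the maximum length
    return current_vl      # rs1 = x0, rd = x0: keep the existing vl (assumed)


if __name__ == "__main__":
    regs = [0] * 32
    regs[10] = 7
    print(select_avl(rd=5, rs1=10, xregs=regs, current_vl=3, vlmax=16))
    print(select_avl(rd=5, rs1=0, xregs=regs, current_vl=3, vlmax=16))
    print(select_avl(rd=0, rs1=0, xregs=regs, current_vl=3, vlmax=16))
```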

Machines with long temporal vector registers might benefit from using
less than VLMAX, but this is highly dependent on specifics of the
interaction of the microarchitecture and the scheduled application
kernel (otherwise, the long vector registers were a waste of
resources). I can't see how to do this portably beyond selecting
VLMAX.

Krste




Guy Lemieux
 

in terms of overlap with that case: that case normally selects maximally sized AVL. the implied goals there are to make best use of vector register capacity and throughput.

i’m suggesting a case where a minimally sized AVL is used, as chosen by the architect. this allows a programmer to optimize for minimum latency while still getting good throughput. in some cases, the full VLMAX state may still be used to hold data, but operations are chunked down to minimally sized AVL (eg for latency reasons).

i’m not sure of the portability concerns. if an implementation is free to set VLMAX, and software must be written for any possible AVL that is returned, then it appears to me that deliberately returning a smaller implementation-defined AVL should still be portable.
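
The portability argument can be illustrated with a strip-mining loop that makes no assumption about the vl it is granted. In this sketch `grant_vl` stands in for whatever the hardware returns; the lengths 64 and 8 are arbitrary example values, not taken from any implementation.

```python
from typing import Callable, List, Tuple


def strip_mine(n: int, grant_vl: Callable[[int], int]) -> List[Tuple[int, int]]:
    """Split n elements into (start, vl) strips using the granted lengths."""
    strips = []
    done = 0
    while done < n:
        remaining = n - done
        vl = grant_vl(remaining)
        if not 1 <= vl <= remaining:
            raise ValueError("granted vl must be between 1 and the request")
        strips.append((done, vl))  # a real loop would process vl elements here
        done += vl
    return strips


if __name__ == "__main__":
    full = strip_mine(100, lambda avl: min(avl, 64))  # grant up to 64
    small = strip_mine(100, lambda avl: min(avl, 8))  # grant a smaller length
    print(len(full), full)
    print(len(small), small[:3], "...")
```

Both runs cover all 100 elements; only the number of strips changes, which is the sense in which a smaller granted length would remain correct for code written this way.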

programming for min-latency isn’t common in HPC, but can be useful in real-time systems.

g





Krste Asanovic
 

On Tue, 16 Feb 2021 15:12:46 -0800, Guy Lemieux <guy.lemieux@gmail.com> said:
| in terms of overlap with that case — that case normally selects maximally sized AVL. the implied goals there are to make best use of vector register capacity and
| throughput.

| i’m suggesting a case where a minimally sized AVL is used, as chosen by the architect.

| this allows a programmer to optimize for minimum latency while still
| getting good throughput. in some cases, the full VLMAX state may still be used to hold data, but operations are chunked down to minimally sized AVL (eg for
| latency reasons).

I still don't see how hardware can set a <VLMAX value that will work
well for any code in loop.

Your latency comment seems to imply an external observer sees the
individual strips go by (e.g., in DSP application where data comes in
and goes out in chunks), as otherwise only total time to finish loop
matters.

In these situations, I also can't see having the microarchitecture
pick the chunk size - usually the I/O latency constraint sets the
chunk size and goal of vector execution is to execute the chunks as
efficiently as possible.

Krste
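
As a rough illustration of the chunk-size point, the sketch below models a strip-mined loop in Python. It assumes the simplified rule vl = min(AVL, VLMAX), which ignores any latitude an implementation may have in choosing vl; `strip_mine` and its `chunk` parameter are hypothetical names used only for this example.

```python
def strip_mine(n, vlmax, chunk=None):
    """Return the vl used by each strip of a loop over n elements.

    Assumption (simplified model): vl = min(AVL, VLMAX).
    If chunk is given, software requests at most chunk elements per
    strip, e.g. to meet an externally imposed I/O latency bound.
    """
    if vlmax <= 0 or (chunk is not None and chunk <= 0):
        raise ValueError("vlmax and chunk must be positive")
    strips = []
    remaining = n
    while remaining > 0:
        # AVL requested by software for this strip.
        avl = remaining if chunk is None else min(remaining, chunk)
        # Hardware grants at most VLMAX elements.
        vl = min(avl, vlmax)
        strips.append(vl)
        remaining -= vl
    return strips


if __name__ == "__main__":
    print(strip_mine(100, vlmax=32))           # throughput-oriented strips
    print(strip_mine(100, vlmax=32, chunk=8))  # latency-bounded strips
```

In this model the chunk size is an input chosen by software, which matches the argument that the I/O constraint, rather than the microarchitecture, would normally set it.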

| i’m not sure of the portability concerns. if an implementation is free to set VLMAX, and software must be written for any possible AVL that is returned, then it
| appears to me that deliberately returning a smaller implementation-defined AVL should still be portable.

| programming for min-latency isn’t common in HPC, but can be useful in real-time systems.

| g

| On Tue, Feb 16, 2021 at 3:01 PM <krste@berkeley.edu> wrote:

| There's a large overlap here with the (rd!=x0,rs1=x0) case that

| selects AVL=VLMAX.  If migration is intended, then VLMAX should be

| same across harts.

| Machines with long temporal vector registers might benefit from using

| less than VLMAX, but this is highly dependent on specifics of the

| interaction of the microarchitecture and the scheduled application

| kernel (otherwise, the long vector registers were a waste of

| resources).  I can't see how to do this portably beyond selecting

| VLMAX.

| Krste

| | Of course, the implementation-defined value must be fixed across all harts, so thread migration doesn't break software.

| | Guy

| | On Mon, Feb 15, 2021 at 11:30 PM <krste@berkeley.edu> wrote:

| |     Replying to old thread to add rationale for current choice.

| |||||| On Mon, 21 Dec 2020 13:52:07 -0800, Zalman Stern <zalman@google.com> said:

| |     | Does it get easier if the specification is just the immediate value plus one?

| |     No - this costs more gates on critical path.  Mapping 00000 => 32 is

| |     simpler in area and delay.

| |     | I really don't understand how this encoding is particularly great for immediates as many of the values are likely very rarely or even never used and it seems like one can't get long enough values even for existing SIMD hardware in some data types. Compare to e.g.:

| |     |     (first_bit ? 3 : 1) << rest_of_the_bits

| |     | or:

| |     |     map[] = { 1, 3, 5, 8 }; // Or maybe something else for 5 and 8

| |     |     map[first_two_bits] << rest_of_the_bits;

| |     | I.e. get a lot of powers of two, multiples of three-vecs for graphics, maybe something else.

| |     As a counter-example for this particular example, one code I looked at

| |     recently related to AR/VR used 9 as one dimension.

| |     The challenge is agreeing on the best mapping from the 32 immediate

| |     encodings to the most commonly used AVL values.

| |     More creative mappings do consume some incremental logic and path

| |     delay (as well as adding some complexity to software toolchain).

| |     While they can provide small gains in some cases, this is offset by

| |     small losses in other cases (someone will want AVL=17 somewhere, and

| |     it's not clear that say AVL=40 is a substantially better use of

| |     encoding).  There is not huge penalty if the immediate does not fit,

| |     at most a li instruction, which might be hoisted out of the loop.

| |     The current v0.10 definition uses the obvious mapping of the immediate.

| |     Simplicity is a virtue, and any potential gains are small for AVL >

| |     31, where most implementation costs are amortized over the longer

| |     vector and many implementations won't support longer lengths for a

| |     given datatype in any case.

| |     Krste

| |     | -Z-

| |     | On Mon, Dec 21, 2020 at 10:47 AM Guy Lemieux <guy.lemieux@gmail.com> wrote:

| |     |     for vsetivli, with the uimm=00000 encoding, rather than setting vl to 32, how about setting it to some other meaning?

| |     |     one option is to set vl=VLMAX. i have some concerns about software using this safely (eg, if VLMAX turns out to be much larger than software anticipated, then it would fail; correcting this requires more instructions than just using the regular vsetvl/vsetvli would have used).

| |     |     another option is to allow an implementation-defined vl to be chosen by hardware; this could be anywhere between 1 and VLMAX. for example, implementations may just choose vl=32, or they may choose something else. it allows the CPU architect to devise a scheme that best fits the implementation. this may consider factors like the effective width of the execution engine, the pipeline depth (to reduce likelihood of stalls from dependent instructions), or that the vector register file is actually a multi-level memory hierarchy where some smaller values may operate with greater efficiency (lower power), or matching VL to the optimal memory system burst length. perhaps some guidance by the spec could be given here for the default scheme, eg whether the implementation optimizes for best performance or power (while still allowing implementations to modify this default via an implementation-defined CSR).

| |     |     software using a few extra cycles to check the returned vl against AVL should not be a big problem (the simplest solution being vsetvli followed by vsetivli)

| |     |     g

| |     |     On Fri, Dec 18, 2020 at 6:13 PM Krste Asanovic <krste@berkeley.edu> wrote:

| |     |         # vsetivli

| |     |         A new variant of vsetvl was proposed providing an immediate as the AVL

| |     |         in rs1[4:0].  The immediate encoding is the same as for CSR immediate

| |     |         instructions. The instruction would have bit 31:30 = 11 and bits 29:20

| |     |         would be encoded same as vsetvli.

| |     |         This would be used when AVL was statically known, and known to fit

| |     |         inside vector register group.  Compared with existing PoR, it removes

| |     |         need to load immediate into a spare scalar register before executing

| |     |         vsetvli, and is useful for handling scalar values in vector register

| |     |         (vl=1) and other cases where short fixed-sized vectors are the

| |     |         datatype (e.g., graphics).

| |     |         There was discussion on whether uimm=00000 should represent 32 or be

| |     |         reserved.  32 is more useful, but adds a little complexity to

| |     |         hardware.

| |     |         There was also discussion on whether instruction should set vill if

| |     |         selected AVL is not supported, or whether should clip vl to VLMAX as

| |     |         with other instructions, or if behavior should be reserved.  Group

| |     |         generally favored writing vill to expose software errors.

| |     |
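
To make the trade-off in the quoted exchange concrete, the following Python sketch enumerates the AVL values that each candidate 5-bit immediate mapping could express. The function names are hypothetical, and treating the "first bit" as the most-significant bit of the immediate is an assumption (the set of reachable values does not depend on that choice).

```python
def plain(uimm):
    """Direct mapping: AVL is the 5-bit immediate itself (0..31)."""
    return uimm


def zero_as_32(uimm):
    """Discussed alternative: encoding 00000 means 32, others unchanged."""
    return 32 if uimm == 0 else uimm


def plus_one(uimm):
    """Suggested alternative: AVL is the immediate plus one (1..32)."""
    return uimm + 1


def one_or_three_shifted(uimm):
    """(first_bit ? 3 : 1) << rest_of_the_bits, first bit assumed to be MSB."""
    first_bit = (uimm >> 4) & 0x1
    rest = uimm & 0xF
    return (3 if first_bit else 1) << rest


def table_shifted(uimm):
    """map[first_two_bits] << rest_of_the_bits with map = {1, 3, 5, 8}."""
    table = [1, 3, 5, 8]
    first_two = (uimm >> 3) & 0x3
    rest = uimm & 0x7
    return table[first_two] << rest


def reachable(mapping):
    """Sorted list of distinct AVL values over all 32 encodings."""
    return sorted({mapping(u) for u in range(32)})


if __name__ == "__main__":
    for fn in (plain, zero_as_32, plus_one, one_or_three_shifted, table_shifted):
        print(fn.__name__, reachable(fn))
```

Under these assumptions, neither shifted scheme can express 9 or 17, the values mentioned in the reply, while the direct mapping covers every AVL from 0 to 31.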


Bill Huffman
 

For hardware with very long vector registers, the same effect might be accomplished by having a custom way to change VLMAX dynamically (across all harts, etc.). It would seem that would cover a larger set of useful cases for what Guy is thinking about - if I'm following him.

Bill

-----Original Message-----
From: tech-vector-ext@lists.riscv.org <tech-vector-ext@lists.riscv.org> On Behalf Of Krste Asanovic
Sent: Tuesday, February 16, 2021 3:21 PM
To: Guy Lemieux <guy.lemieux@gmail.com>
Cc: krste@berkeley.edu; Zalman Stern <zalman@google.com>; tech-vector-ext@lists.riscv.org
Subject: Re: [RISC-V] [tech-vector-ext] Vector TG minutes for 2020/12/18 meeting

