
updates to vector spec repo on github

Krste Asanovic
 

I added the change to add a vcsr including FP fields (this came out
much cleaner I think) - please check over.

Another minor change was to fold in the suggestion to move VS to a
two-bit wide gap lower down in *status to leave the higher wider field
alone.

I also bumped the draft date on the doc, and also added the meeting
minutes to the repo.

Krste


Re: RISC-V Vector Task Group: fractional LMUL

David Horner
 

Special lmul code in vsetvl{i} to derive LMUL from the existing SEW/LMUL and the provided sew code.

As mentioned in the TG, we suggested widening LMUL to 3 bits with 7 explicit states.

I suggest we name them l1, l2, l4, l8 as before, and lf2, lf4, lf8 for the fractional 1/2, 1/4, 1/8 values respectively.

lf16 may be useful in larger machines, but already lf8 cannot be used on minimal RV32 machines.

(I recommend we use lf2, lf4 and lf8 rather than f2, f4, f8 to avoid confusion with floating-point and to associate the value with lmul as a fraction.)

I propose the remaining value be used to derive LMUL from the existing SEW/LMUL and provided sew code.

Specifically, new LMUL = ( new sew / old SEW ) * old LMUL, which retains the initial SEW/LMUL ratio.


Tentatively I use lx as the mnemonic for the lmul code.

I recommend the immediate variant using lx have the pseudo-opcode  vsetlmuli.

As a shorthand, I refer to a vsetvl that uses the lx code in rs1 as vsetlmul.


Removing the fixed-size load/stores motivates the increased lmul range and increases the need to issue additional instructions.

This facility complements the increased lmul range and mitigates the need to explicitly specify the correct lmul in the associated additional vsetlmul{i} instructions.

As a result the first vsetvl{i} instruction establishes the SEW/LMUL ratio and subsequent vsetlmul{i} instructions maintain it.
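
As a rough sketch (the l1/lx/vsetlmuli mnemonics are the ones proposed here; the exact assembly syntax is not settled), a mixed-width loop body might look like:

     vsetvli    t0, a0, e8,l1     # establish vl and the SEW/LMUL ratio (8/1)
     vle.v      v0, (a1)          # unit-stride load of byte elements
     ...
     vsetlmuli  x0, x0, e32       # sew=32, lmul=lx: derived LMUL = (32/8)*1 = 4
     ...                          # vl, VLMAX and the SEW/LMUL ratio are all unchanged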


Notably, values of vl that  were legal in the original vsetvl{i} are legal in subsequent (legal) vsetlmul{i} instructions.


The additional fractional lmul values provide a substantial dynamic range and reduce the possibility of an illegal calculated LMUL.

Even so, the calculated LMUL can be out of range for vsetlmul{i} which will set vill.


Other proposals have the objective of changing SEW and LMUL and enforcing that vl and/or the SEW/LMUL ratio not change.

This proposal complements those recommendations.

In particular it addresses issue #365 (What is the case about vsetvl{i} x0, x0, {vtypei|rs2} without keeping VLMAX the same):

when rs1 is x0 (to use the current value of vl), the use of lx ensures that vl, VLMAX and the SEW/LMUL ratio will all remain the same if possible; if not possible, vill will be set.


To have our lf16 and lx too.

As I recommended initially, the lmul code in vsetvl{i} that would otherwise be used for lf16 is instead used for lx.

That means the vtype lmul field cannot be set to lf16 by an explicit lmul code.

However, we could allow lf16 to be calculated and stored by an instruction using lx.

A read of vtype will correctly show the lf16 code.

Restoring vtype from a value previously read from vtype is a multi-step process:

    If the lmul code is not lf16, then issue vsetvl with the saved vtype in rs2.

    If the lmul code is lf16:

        change the sew code in rs2 to one step larger than saved (e.g. use e16 if the saved vtype has e8)

        set lmul in rs2 to lf8

        issue vsetvl with the revised sew and lmul vtype in rs2

        change the sew code in rs2 back to the original (in this example, from e16 back to e8)

        change lmul in rs2 to lx

        issue vsetvl with the original sew and lmul=lx vtype in rs2

        lmul will now be lf16 and the sew code the same as saved.

This approach will work for all values of sew and lf16 that are valid in vtype.

The question is whether lf16 is worth the extra effort, or whether it potentially causes confusion/failures for application code.
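
A sketch of the restore sequence in assembly (assuming the saved vl is in a0 and the saved vtype in a1; VLMUL_MASK, VSEW_LSB, LF16, LF8 and LX are hypothetical constants standing in for the draft field encodings):

     andi   t0, a1, VLMUL_MASK       # extract the lmul code from the saved vtype
     li     t1, LF16
     beq    t0, t1, 1f
     vsetvl x0, a0, a1               # common case: restore the saved vtype directly
     j      2f
1:   addi   t2, a1, (1 << VSEW_LSB)  # lf16 case: sew one step wider (e.g. e8 -> e16)
     andi   t2, t2, ~VLMUL_MASK
     ori    t2, t2, LF8              # lmul = lf8, so the SEW/LMUL ratio matches the target
     vsetvl x0, a0, t2
     andi   t3, a1, ~VLMUL_MASK      # original sew ...
     ori    t3, t3, LX               # ... with lmul = lx
     vsetvl x0, a0, t3               # derived LMUL = (sew / (2*sew)) * 1/8 = 1/16
2: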


Additional suggestion:

This is standalone from the proposal, but complements it, so I submit it here.

vsetvl is defined not to trap on illegal vtype codes.

It sets vill instead and clears the remainder of vtype, which triggers an illegal-instruction exception in subsequent vector instructions that rely on it.

This facilitates low-overhead interrogation of the hardware to determine supported SEW sizes and ediv support (and any future functionality in vtype).
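
For example, a sketch of such a probe (assuming, as in the current draft, that vill occupies the most-significant bit of vtype):

     vsetvli t0, x0, e64           # request SEW=64
     csrr    t1, vtype
     bltz    t1, no_e64            # vill set: e64 is not supported at this setting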

Arguably, vsetlmul{i} should trap if the original vtype has vill set. But I recommend more than that.

An implementation can trap on any instruction or sub-instruction in order to emulate it, take corrective action, trigger a context switch, report an anomaly, etc.

As a result an implementation could choose to provide the behaviour I suggest below and still be compliant.

However, I recommend we make trap mandatory on certain cases of vsetvl{i} use.

These traps would leave vl and vtype undisturbed, as they were before instruction execution.

Specifically trap if:

     1) lmul = lx and original vtype has vill set.

     2) lmul = lx and calculated lmul is not valid 

     3) rd = x0 and rs1 = x0 (use existing vl for rs1) and VLMAX changes (that is SEW/LMUL ratio is not maintained)

            (this only occurs with explicit sew/lmul. if lmul=lx it is either case 1 or 2.)


A precise trap is extremely helpful to debug these cases.

Case 1. Without a precise trap, it requires traceback not only to the last vsetvl{i} but also to at least the previous one to determine the cause of vill.

Case 2. Without a precise trap, it requires traceback not only to the last vsetvl{i} but also to at least the previous one to determine the probable values of sew and lmul.

Case 3. There is no compelling reason to use rd = x0 and rs1 = x0 to interrogate the hardware, so maintaining VLMAX is the probable intent given that the existing vl is provided.

                Without a precise trap, it again requires traceback not only to the last vsetvl{i} but also to at least the previous one to determine the probable values of sew and lmul.

     

On 2020-02-07 1:40 a.m., Krste Asanovic wrote:

I'm realizing the idea doesn't quite work unless the machine actually
stores the fractional LMUL value (to cope with SLEN and widening
instruction behavior), and so would need the new instruction and an
extra bit of state in vtype.

But with this change, yes.

For floating-point code, where there is no way to load a narrower type
into a wider element, this can be used to increase the number of
registers of a wider width. For example, a matrix multiply accumulating
double-precision values that are the product of single-precision
values, using widening muladds, can be arranged as something like:

     vsetvli    x0, x0, e32,f2     # Fractional Lmul
     vle.v      v0, (bmatp)        # Get row of matrix
     flw f1, (x15)                 # Get scalar
     vfwmacc.vf v1, f1, v0         # One row of outer-product
     add x15, x15, acolstride      # Bump pointer
     ...
     flw f31, (x15)                # Get scalar
     vfwmacc.vf v31, f31, v0       # Last row of outer-product
     ...

which is holding 31 rows of the destination matrix accumulators as
doubles while performing widening muladds from a single-precision
vector load held in v0.  This is probably overkill register blocking
for this particular example, but shows the general improvement.

Krste


Re: RISC-V Vector Task Group: fractional LMUL

David Horner
 

I  left the encoding unspecified in the proposal.

That was intentional as I saw various tradeoffs.

However, I now recommend the codes be in order of increasing VLMAX value as so:


vlmul | mnemonic | LMUL | VLMAX       | #groups | Grouped registers
------+----------+------+-------------+---------+-----------------------------------
 000  |  lf16    | 1/16 | VLEN/SEW/16 |   32    | vn (single register, lower 1/16th)
 001  |  lf8     | 1/8  | VLEN/SEW/8  |   32    | vn (single register, lower 1/8th)
 010  |  lf4     | 1/4  | VLEN/SEW/4  |   32    | vn (single register, lower 1/4th)
 011  |  lf2     | 1/2  | VLEN/SEW/2  |   32    | vn (single register, lower 1/2)
 100  |  l1      | 1    | VLEN/SEW    |   32    | vn (single register in group)
 101  |  l2      | 2    | 2*VLEN/SEW  |   16    | vn, vn+1
 110  |  l4      | 4    | 4*VLEN/SEW  |    8    | vn, ..., vn+3
 111  |  l8      | 8    | 8*VLEN/SEW  |    4    | vn, ..., vn+7



This allows a quick "andi and beqz" check for the "special" LMUL 1/16 code.

It also allows the formula
        new LMUL = ( new sew / old SEW )  * old LMUL
to be implemented as
      vlmul = new vsew - old vsew + old vlmul
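
For example, stepping from SEW=e8 (vsew=000) at LMUL=lf2 (vlmul=011) to e32 (vsew=010):

      vlmul = 010 - 000 + 011 = 101, i.e. l2

which preserves the SEW/LMUL ratio (8 / (1/2) = 32 / 2 = 16).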

The most-significant bit of the encoding could be inverted to allow the original code values with no appreciable hardware cost, but it complicates software for the two cases above.



On 2020-02-09 12:44 a.m., David Horner via Lists.Riscv.Org wrote:

Special lmul code in vsetvl{i} to derive LMUL from the existing SEW/LMUL and the provided sew code.

As mentioned in the TG, we suggested widening LMUL to 3 bits with 7 explicit states.

I suggest we name them l1, l2, l4, l8 as before, and lf2, lf4, lf8 for the fractional 1/2, 1/4, 1/8 values respectively.

lf16 may be useful in larger machines, but already lf8 cannot be used on minimal RV32 machines.

(I recommend we use lf2, lf4 and lf8 rather than f2, f4, f8 to avoid confusion with floating-point and to associate the value with lmul as a fraction.)

I propose the remaining value be used to derive LMUL from the existing SEW/LMUL and provided sew code.

Specifically, new LMUL = ( new sew / old SEW ) * old LMUL, which retains the initial SEW/LMUL ratio.


EPI intrinsics reference and compiler updated to 0.8

Roger Ferrer Ibanez
 

Hi all,

we have updated our LLVM-based experimental compiler and the intrinsics reference to version 0.8 of the V-extension.

In this version we have integrated the assembler/disassembler patch from Hsiangkai Wang[1] to replace our own implementation of this part of the compiler. Our goal is to stay aligned as much as possible with LLVM upstream. On top of that we have built our existing code generation strategy.

You can find a recent source drop of the compiler at https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi-0.8

Kind regards,
Roger

[1] https://reviews.llvm.org/D69987

-- 
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación




Minutes from 2/21 meeting

Krste Asanovic
 

Date: 2020/2/21
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~13
Current issues on github: https://github.com/riscv/riscv-v-spec

Note, the Zoom meeting details have changed. Please view the member
calendar entry for updated details.

Issues discussed: #341/#377, #367, #362/#354, #381

The following issues were discussed.

#341/#377 Separate vcsr from fcsr

Last time, the agreement was to add a new vcsr but to shadow fcsr
fields in the new vcsr to reduce context swap time. Due to concerns
raised in #377, the agreement was now to keep the separate vcsr but
to not shadow fcsr fields.

This change will be made to the spec for v0.9.

#367 Tail agnostic as option

Extensive discussion continued in the group about allowing this as an
option. The group proposing this change will describe the
implementation challenges of the current tail-undisturbed scheme in
the context of a renamed vector register machine with long temporal
vectors.

#362/#354 Remove non-SEW vector load/store

Discussion covered the new proposal to provide a fractional LMUL to
avoid one drawback of dropping these memory operations. The general
consensus was in favor, as the scheme also aids mixed-width
floating-point arithmetic (for which there is no equivalent to the
widening/narrowing load/stores). The proposal requires adding a new
bit to LMUL in vtype, so that a new instruction will not be required.
The detailed fractional LMUL proposal was only recently presented, so
the group will continue the discussion online before next meeting.

#381 Index size in indexed load/store and AMO

The group discussed the issue of providing XLEN-sized indices in
indexed load/stores and AMOs, if non-SEW vector load/stores are
dropped. This was also previously a concern for floating-point codes,
as there were no fixed-precision floating-point loads/stores. The two
proposals were 1) to provide only XLEN-width indices, or 2) to provide
both XLEN and 32b indices; also discussed was whether these should be
present only in the extended 64b encoding. Discussion to continue.


Reminder vector TG meeting Friday March 6th, details on members calendar <eom>

Krste Asanovic
 


Minutes of 2020/3/6 vector task group meeting

Krste Asanovic
 

Date: 2020/3/6
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~21
Current issues on github: https://github.com/riscv/riscv-v-spec

Note, the Zoom meeting details have changed. Please view the member
calendar entry for updated details.

Issues discussed: #362,#354,#382,

The following issues were discussed.

#362,#354,#382 Fractional LMUL additional registers

There was considerable discussion trying to understand the design of
the scheme to pack multiple fractional LMUL registers into the base
vector registers. The conclusion was that we first had to understand
how fractional LMUL would map into the base vector registers,
including interaction with SLEN, before attempting to add fractional
LMUL register packing as another optimization with another level of
complexity.

Waiting on receiving the fractional LMUL mapping.

#367 Tail Agnostic

The discussion reviewed the proposal that long temporal vector
registers with renaming could be handled using vector length trimming.

It was then proposed that masking should also be given the option
of being agnostic, giving three options:
1) tail-undisturbed + masking-undisturbed
2) tail-agnostic + masking-undisturbed
3) tail-agnostic + masking-agnostic

All implementations would have to support all options.

Simple implementations can ignore the setting and treat all options as 1).

Option 2) could be implemented as option 1) even if option 3) was
supported.

The recommendation was that agnostic destination elements could only
be either zeroed or left undisturbed.

Discussion to continue with this suggestion.


Re: Minutes of 2020/3/6 vector task group meeting

Bill Huffman
 

On 3/6/20 6:22 PM, Krste Asanovic wrote:
EXTERNAL MAIL



Date: 2020/3/6
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~21
Current issues on github: https://github.com/riscv/riscv-v-spec
...

#367 Tail Agnostic

The discussion reviewed the proposal that long temporal vector
registers with renaming could be handled using vector length trimming.

The proposal was then added that masking should also be given option
of being agnostic giving three options:
1) tail-undisturbed + masking-undisturbed
2) tail-agnostic + masking-undisturbed
3) tail-agnostic + masking-agnostic
We've discussed before, but allowing the *-agnostic options means code
can be written and tested on an implementation that supports them and
then fail on an implementation that maps them to #1. And vice-versa.

Bill


All implementations would have to support all options.

Simple implementations can ignore setting and treat all as 1).

Option 2) could be implemented as option 1) even if option 3) was
supported.

The recommendation was that agnostic destination elements had to only
be either zero or undisturbed.

Discussion to continue with this suggestion.



Re: Minutes of 2020/3/6 vector task group meeting

Krste Asanovic
 

On Sat, 7 Mar 2020 02:32:24 +0000, Bill Huffman <huffman@...> said:
| On 3/6/20 6:22 PM, Krste Asanovic wrote:
[...]
|| #367 Tail Agnostic
|| The discussion reviewed the proposal that long temporal vector
|| registers with renaming could be handled using vector length trimming.
||
|| The proposal was then added that masking should also be given option
|| of being agnostic giving three options:
|| 1) tail-undisturbed + masking-undisturbed
|| 2) tail-agnostic + masking-undisturbed
|| 3) tail-agnostic + masking-agnostic

| We've discussed before, but allowing the *-agnostic options means code
| can be written and tested on an implementation that supports them and
| then fail on an implementation that maps them to #1. And vice-versa.

Yes, this is the disadvantage of having implementation-dependent
options. Along with making portable
functional/verification/compliance models more complex (though this
case is not as bad as some others).

This has to be balanced against the advantage of greater efficiency on
a wider range of microarchitectures.

Unordered vector reductions and vector stores already cause issues of
this kind, as does the memory model for multithreaded code, but we
obviously want to reduce the number of cases where this shows up.

Krste


[tech-vector-ext] Some proposals

Krste Asanovic
 

It doesn't look like all these issues were addressed either on mailing
list or on github tracker. In general, it is much better to split
feedback into individual topics. I'm responding all-in-one below, but
if you want to continue any of these threads, please add to github
tracker as independent issues.

Several of these issues might be addressed with anticipated longer
64-bit vector instructions (ILEN=64).

On Fri, 22 Nov 2019 16:17:33 +0100, Dr. Adrián Cristal <adrian.cristal@...> said:
| Dear all,
| We have been involved in RTL design of an accelerator featuring the 0.7.1 standard; in parallel, we were running some vectorized HPC kernels on
| a simulator platform which mimics the accelerator. We have come across the following:

| 1. With respect to new proposed spill instructions, the cost could be high in case we need to spill few elements of a large vector.  Instead we
| propose the following: spill v1, rs1, [rs2], recoverspill v2,rs2 and unspill rs1.

| The semantics are the following: spill v1, rs1, [rs2] will store v1 at the address in rs1, up to rs2 elements. There is no guarantee that the
| values are stored, but there is the guarantee that if they are not stored they will be completely recovered by recoverspill v2, rs2
| (otherwise recoverspill v2, rs2 will read the values from memory if they were at some time written to memory). The unspill rs1 will disable the
| capability of the recoverspill operation at address rs1. In the case of an OoO processor, spill can delay the freeing of the vector physical
| register and assign it to the logical register again on recoverspill. So the cost will be much less, but the register can still be saved if the processor
| needs more physical registers. For in-order implementations, it will save the registers to memory.

This seems like a very difficult-to-implement optimization, with
strange effects on the memory model and page-fault/trap handling if the
memory operation is deferred until later in program execution. What
happens on a context switch if the physical register is pushed out in a
different context?

Having more arch vector registers should reduce spill frequency, and
this is one part of the 64-bit instruction proposal.

| 2. At this moment the mask is only for true values, so if we have an if-then-else structure, first we have to do the 'if', then we have to
| negate the mask (in our case an expensive operation, up to 32
| cycles), and then execute the 'else'.

This is functionality intended for the 64-bit encoding. With the
32-bit encoding, you should be able to chain and/or fuse this mask
negation and run it in parallel in the mask unit.

| It would be great to add a CSR which controls
| the functionality of the masked operations; it could be beneficial in many use cases. Also, using the same mechanism, it may be possible to
| add other mask-related functionality that can simplify some algorithms. This functionality could be to set elements to 0 or a constant, or sign
| exchange, set positive or set negative. Another option that we can manage is to determine which bit is the mask bit of the register v0 (when
| LMUL=1); it can be convenient to set it to the sign bit instead of the LSB, for example if we want to calculate the abs value or the RELU
| operation.

FP ABS (vfsgnjx.vv vd, vs2, vs2) and integer/FP RELU (vmax.vx vd, vs2,
x0, or vfmax.vf vd, vs2, f0 with f0=0.0) are already directly supported.
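
For reference, a sketch of those idioms (v0.8-style mnemonics; register names are illustrative):

     vfsgnjx.vv v1, v2, v2        # FP absolute value: sign(v2) xor sign(v2) gives positive sign
     vmax.vx    v1, v2, x0        # integer RELU: max(x, 0)
     vfmax.vf   v1, v2, f0        # FP RELU, assuming f0 holds +0.0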

| 3- A vrscatter instruction could be useful (vdest[vsrc2[i]] = vsrc1[i]). The implementation cost could be less than vrgather since in the
| vrgather case it is needed to send the index and then the value, however for the vrscatter both index and the value could be sent together, for
| our implementation this is faster than vrgather.

I don't think vrscatter can replace vrgather, except for strict
permutes. Vrscatter I believe is strictly less powerful. The data
movement from a single vrscatter can be replaced by a single vrgather,
but the converse is not true.

The vrscatter instruction might be useful in some cases, but has
complex cases to consider in specification: a) multiple writes to same
element, ordered either by element ordering (which will be expensive
with multiple lanes), or by allowing unordered scatters; b) no writes
to some elements, independent of masking.

| 4- If a register is defined with a VL, then the values after VL (between VL+1 to VLMAX) could be set to “undefined” or 0, it would benefit
| implementations with longer vectors considerably in an OoO processor. The current text states that the previous values should be preserved.

Ongoing discussion.

| 5- Proposal to add some semantic information to the vectors. In particular it is important to know the VL and the SEW at the time vector was
| created, so when the context is saved we can have a special load/store operation that will only use the minimal storage that is needed in
| memory (plus the length and sew). Also this allows the system to know that after the VL, we do not have to preserve any element.

I had a brief discussion of some save/restore optimization techniques
in Section 5.1.5 of my thesis, and VS bits are at least a start there
to reduce amount of save/restore if not the size of the memory buffer
(though most OS will need memory space for full hart register context
anyway). Some form of microarchitectural tracking can be done,
observing undisturbed (VL=VLMAX) or agnostic (VL) options of valid
length.

I'd guess we could possibly have some custom visible state (visible where
privilege allows) recording the active vector length in bytes to
save/restore. But for these techniques, hardware can't always do the
right thing on vector save/restore, so software needs to know that it's
worth interrogating the custom state to figure out what to do. But
this could penalize simple implementations where software is OK doing a
naive save/restore and doesn't want to spend cycles checking.

Regards,
Krste


| Best regards, Adrian and Osman



[tech-vector-ext] Feedback on RISC-V V-extension (FFTW3 and Chacha20)

Krste Asanovic
 

We have seen very efficient FFT execution without adding new
instructions, but there is interest in a layer of DSP extension on top
of the base vector ISA. One goal of the EDIV extension would be to
support real+imaginary types and possibly operations.

I haven't looked at the Chacha20 code - but either crypto extensions
or EDIV-based interleaving might help with the interleaving it needs.

Krste

On Tue, 10 Dec 2019 16:16:17 +0100, "Roger Ferrer Ibanez" <roger.ferrer@...> said:
| Hi all,
| Romain Dolbeau is a HPC expert from Atos and participates in the EPI project. Romain provided the following feedback as the result of himself
| porting FFTW3 and Chacha20 to the V-extension. He kindly agreed to share his feedback with the Vext list.

| If you have comments, try to keep him in the loop when replying the email as Romain is not subscribed to this mailing list. I'm also attaching
| the feedback in the original markdown format.

| Thank you very much,
| Roger

| Introduction

| * EPI is implementing a VPU supporting RISC-V “V” extension, and doing some associated compiler & software work

| * This is some feedback on the current version of the “V” extension draft (0.7.1)

| * Issues found while developing some algorithms & implementations to test and validate both software and hardware layers

| In-vector data management issues

| * Two different codes: FFTW3 (complex arithmetic, floating-point) & Chacha20 (cryptographic, integer)

| * Some significant overheads due to lack of in-vector data management instructions in “V”

| * Major issue for efficient software

| Please note: FFTW3 is about the implementation in the library, not the high-level algorithm.

| Current situation

| * Like most (all?) SIMD instruction sets, most “V” instructions specify semantic “per-lane”

| * This is fine for most operations on elementary data types (integer, FP)

| * There are also some reduction operations for common reduction patterns across lanes

| * Indexed memory operations (scatter/gather) can take care of many data organization patterns in memory

| + structures, multi-dimensional arrays, indirect access, …

| * However, some algorithms require data in one lane to interact with data in a different lane

| + here we consider two examples: complex arithmetic & a crypto algorithm

| * Such patterns are usually quite simple

| + Locally, lane with an immediate neighbor

| + Overall, regular patterns like (de)interleaving

| * “V” can make those work, but at a high cost

| + Need to manufacture indices vector to gather/merge data

| * Other SIMD instruction sets have dedicated instructions

| * Issue is with porting legacy codes, or management-intensive algorithms

| + High-level view of algorithms, allowing for memory reorganisation to suit “V”, are mostly fine

| First example, FFTW3 library

| * FFTW3 implements SIMD as a two-step process

| 1. Abstract SIMD code generation for “codelets”, using C macro

| 2. An “implementation file” containing SIMD ISA-specific implementation

| * This mostly hardwires some assumptions in the implementation, such as interleaved real/imaginary parts in vector registers

| * Some macros are trivial (e.g. per-element sum), some less so (full complex multiplication)

| * Currently, the “V” implementation has a lot of overhead to manage data

| + Explanations below, code can be seen in https://github.com/rdolbeau/fftw3 (branch risc-v)

| * (a + bi) × (a’ + b’i) = (a×a’ - b×b’) + (a×b’ + a’×b)i

| * If stored in ‘interleaved’ format …

| * (v[even] + v[odd]i) × (v’[even] + v’[odd]i) = (v[even]×v’[even] - v[odd]×v’[odd]) + (v[even]×v’[odd] + v’[even]×v[odd])i

| 1. Interaction between ‘odd’ and ‘even’ lanes

| 2. ‘even’ lanes last operation is ‘subtraction’, but ‘odd’ lanes last operation is ‘addition’

| * Of course, at the algorithm level, de-interleaving/re-interleaving is possible from memory with Zvlsseg (segment load/store), but not from
| registers. And for FFTW3, the SIMD infrastructure is not designed to support de-interleaving of data.

| Not-so-readable example “V” code

| * To implement a full complex multiplication in “V”, we need…

| + Generate the indices for the ‘register gather’ operations (vid, vbroadcast, logical operations, …)

| + Gather the data for the first multiplication

| + Do the first multiplication,

| + Do an explicit conjugate for the add/sub

| + Gather more data

| + Do the final fmacc

| + In the example below, VL is the vector length in complex element, tx & sr are interleaved (real/imag) vector data

| * ‘vbroadcast’ should be merged in a subsequent operation (currently the EPI compiler doesn’t implement scalar/vector built-ins)

| * Full code visible on the GitHub referenced earlier

| static inline V VZMUL(V tx, V sr) // fixme: improve
| {
|     V tr;
|     V ti;
|     Vint idx = TYPEINT(vid)(2*VL);                            // (0, 1, 2, 3, ...)
|     Vint vnotone = TYPEINT(vbroadcast)(DS(~1ull,~1), 2*VL);
|     Vint vone = TYPEINT(vbroadcast)(1, 2*VL);
|     Vint hidx;
|     hidx = TYPEINT(vand)(idx, vnotone, 2*VL);                 // (0, 0, 2, 2, ...)
|     tr = TYPE(vrgather)(tx, hidx, 2*VL);                      // Real, Real of tx
|     hidx = TYPEINT(vor)(idx, vone, 2*VL);                     // (1, 1, 3, 3, ...) // could be hidx + vone, ...
|     ti = TYPE(vrgather)(tx, hidx, 2*VL);                      // Imag, Imag of tx
|     tr = TYPE(vfmul)(sr, tr, 2*VL);                           // (Real, Real)[tx] * (Real, Imag)[sr]
|     hidx = TYPEINT(vand)(idx, vone, 2*VL);                    // (0, 1, 0, 1, ...)
|     sr = TYPEMASK(vfsgnjn)(sr, sr, sr, INT2MASK(hidx), 2*VL); // conjugate of sr
|     hidx = TYPEINT(vxor)(idx, vone, 2*VL);                    // (1, 0, 3, 2, ...)
|     sr = TYPE(vrgather)(sr, hidx, 2*VL);                      // Imag, Real of (conjugate of) sr
|     return TYPE(vfmacc)(tr, ti, sr, 2*VL);                    // (-Imag, Real)[sr] * (Imag, Imag)[tx] + (Real, Real)[tx] * (Real, Imag)[sr]
| }

| Other SIMD instruction set

| * Also visible in GitHub

| * AVX-512 has ‘fmaddsub’ to avoid the explicit conjugate (branch master)

| + Some data gathering only need a single instruction (mov[lh]?dup, unpackhi)

| + Only 5 instructions vs. 11/13 for V

| * SVE has complex arithmetic support (branch arm-sve-alt)

| + ‘CMLA’ instruction designed to be used in pair, implementing complex mul

| + With free conjugation for e.g. VZMULJ macro

| + Only 2 instructions

| Possible solutions

| 1. Problem-specific: add some complex-arithmetic instructions

| + addsub depending on lane parity

| + complex mul similar to CMLA

| + complex mul similar to BlueGene/Q instructions

| + …

| * Very efficient for FFTW3, C/C++/Fortran complex types, etc.

| * Might need more than a few opcodes, hardware might be a bit complex

| * Performance on par with SVE

| 1. Generic: add some data management operations

| + Similar to AVX-512 unpack/movdup/…

| + Similar to SVE zip/uzp/trn

| + …

| + and perhaps a cheap ‘addsub’ as well

| * Less efficient for complex arithmetic, but also useful for other problems

| * Chacha20 further down in this document

| * Performance on par with AVX-512

| 1. Not mutually exclusive :-)

| Second example, Chacha20

| * Chacha20 is a ‘stream’ crypto algorithm

| + https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant

| + Supercop infrastructure with many implementations in crypto_stream/chacha20: https://bench.cr.yp.to/supercop.html

| + “V” source code for the supercop infrastructure: http://www.dolbeau.name/dolbeau/crypto/crypto_stream_chacha20_dolbeau_riscv-v.tgz

| * Essentially, a pseudo-random function

| + From (a) a secret key, (b) an input value (‘nonce’) and (c) a counter, generate a block of pseudo-random bits

| * Purely integer operations

| + Add, Xor, Rotation - rotation is very useful in crypto, missing in V (… as in most SIMD ISA other than AVX-512)

| * Embarrassingly parallel across blocks

| + The counter grows monotonically, so any number of blocks can be generated simultaneously

| + But - amount of data really needed (i.e. truly available parallelism) is use-case/application-dependent

| * Vector implementation tends to distribute a block over a single lane in multiple vector register

| + 4 Blocks in 128 bits, 16 in 512 bits

| * But the data has to be stored contiguously in memory at the end

| * Effectively, need to do a ‘transpose’ before storing data

| * Examples

| + AVX2 doesn’t support scatter so partial vector (128 bits) load/xor/store are used at the final step, after a partial transpose using
| unpack{hi,lo}_epi{32,64}

| + AVX512 can scatter 128-bits block to memory after a partial transpose, or do a full transpose and store dense 512-bits vector

| + SVE doesn’t specify register width, currently stick to partial transpose and 128-bits scatter

| * RISC-V “V” can implement the algorithm for variable-width register in a way similar to SVE

| * Because “V” allows specifying the register width, it can adapt better to the level of parallelism

| + SVE requires masking if the amount of data required is smaller than what a full vector would produce – not implemented (yet)

| + V can simply specify a narrower vector width

| + Both could specialize for e.g. 512 bits with full transpose – not implemented, optimization might be hardware-dependent

| * Problem is, “V” lacks the data management instruction for the transpose

| * It has only ‘register gather’ and ‘merge’, which are very general-purpose

| * “register gather” can use any lane as input, so probably very costly in hardware and high producer-consumer latency

| + Cost confirmed by discussion with hardware implementer

| * AVX and SVE use specialized instructions with known, fixed inter-lane dependencies

| + Somewhat costly, but not prohibitively so

| * Currently, implemented using SVE semantic (zip, uzp, trn) & an implementation file to emulate SVE’s behavior using the more general
| instructions

| + Quick and dirty but illustrates the point

| + Current EPI LLVM compiler for V good at CSE and hoisting so overhead is a good approximation of ‘better’ code

| * Each emulated SVE instruction required at least 2 gather & 1 merge to be implemented in “V”, plus support instructions to generate the
| indices for gather / the mask for merge

| + Many of the support instructions can be/are eliminated by the compiler’s CSE & hoisting passes

| * That’s a significant overhead at the end of the algorithm

| + And is worse for round-reduced version Chacha12 and Chacha8 (fewer computations for the same amount of data)

| * Adding some fixed-pattern data management instruction would remove this overhead

| 1. Fewer instructions by removing support instructions

| 2. Fewer instructions by combining two registers in a specific way instead of gather/gather/merge

| 3. Lower-latency instructions from simpler semantic

| o e.g. ZIP1 only uses lower halves of registers as input, so no need to wait for higher halves, unlike “register gather” which can
| pick from anywhere

| ​

| --
| Roger Ferrer Ibáñez - roger.ferrer@...
| Barcelona Supercomputing Center - Centro Nacional de Supercomputación



Re: [tech-vector-ext] Some proposals

Guy Lemieux
 

On Sun, Mar 8, 2020 at 1:42 AM Krste Asanovic <krste@...> wrote:


It doesn't look like all these issues were addressed either on mailing
list or on github tracker. In general, it is much better to split
feedback into individual topics. I'm responding all-in-one below, but
if you want to continue any of these threads, please add to github
tracker as independent issues.

Several of these issues might be addressed with anticipated longer
64-bit vector instructions (ILEN=64).

On Fri, 22 Nov 2019 16:17:33 +0100, Dr. Adrián Cristal <adrian.cristal@...> said:
| Dear all,
| We have been involved in RTL design of an accelerator featuring the 0.7.1 standard; in parallel, we were running some vectorized HPC kernels on
| a simulator platform which mimics the accelerator. We have come across the following:

| 1. With respect to new proposed spill instructions, the cost could be high in case we need to spill few elements of a large vector. Instead we
| propose the following: spill v1, rs1, [rs2], recoverspill v2,rs2 and unspill rs1.

| The semantic is the following: spill v1, rs1, [rs2], will store v1 in the address rs1 up to rs2 elements. There is not a warranty that the
| values are stored, but there is the warranty that if they are not stored they will be completely recovered by the recoverspill v2, rs2 (
| otherwise recoverspill v2,rs2 will read the values from memory if they at some time were written to memory). The unspill rs1, will disable the
| capability to recoverspill operation at address rs1. In the case of OoO processor, for spill it can delay the free of the vector physical
| register on spill and assign to the logical register again on recovespill. So the cost will be much less, but it can be saved if the processor
| needs more physical registers. For in order implemenetations, it will save the registers on memory.
On a context switch, the underlying physical registers that hold spill
values need to be saved/restored to memory, as well as the associated
rs1 values. This means we need extra instructions to iterate through
the physical registers, and their associated rs1 values, to also save
as part of the context switch. Not impossible, but it is more complex
than just adding the 3 proposed instructions. Instead, you could just
force the spill to memory of all tracked registers, but then you run
into the delayed memory page faults etc brought up by Krste.

Instead, at the system level you could have a tightly-coupled memory
(TCM) as an addressable scratchpad where vector registers get written
during a spill? (This TCM could be used for any purpose.)


| 4- If a register is defined with a VL, then the values after VL (between VL+1 to VLMAX) could be set to “undefined” or 0, it would benefit
| implementations with longer vectors considerably in an OoO processor. The current text states that the previous values should be preserved.

Ongoing discussion.

| 5- Proposal to add some semantic information to the vectors. In particular it is important to know the VL and the SEW at the time vector was
| created, so when the context is saved we can have a special load/store operation that will only use the minimal storage that is needed in
| memory (plus the length and sew). Also this allows the system to know that after the VL, we do not have to preserve any element.
Except that all values after VL do have to be stored on a context
switch. Just because a VL was used to modify head elements of a
vector, it doesn't mean the tail elements can be ignored (under the
current tail-undisturbed model).

Guy


issue #393 - Towards a simple fractional LMUL design.

David Horner
 

I'm sending out to the correct mailing list a copy of the revised issue #393.

(link: https://github.com/riscv/riscv-v-spec/issues/393 )

This was requested at the last TG meeting.

I believe it is consistent with casual discussions of fractional LMUL and it is intended to formalize a design.

To follow is the consideration of alternate register overlap to improve usability.

The issue #393 update adds to the Glossary and notes that mask registers and operations are unchanged from the plan of record.


  Towards a simple fractional LMUL design.

Background:

Prior to LMUL, an elaborate mapping of register numbers to elements of various widths under different configuration settings, allowing polymorphic operations, was proposed.

LMUL was introduced in a pre-v0.5 draft (Nov 2018) in conjunction with widening operations and SEW widths.
The LMUL>1 mapping of a register group is one group to a power-of-2 number of consecutive, non-overlapping base-arch registers. The naming uses the lowest base-arch register participating in the register group.
The number of LMUL registers is diminished by the same power of 2.
This design was substantially less complex than its predecessor, with simple constructs like

  • LMUL in powers of 2 aligning with the widening by 2 operations.
    Abandoning previous ideas of sequences like 1,2,3,4,5,6,8,10,16,32

  • consecutive registers in register groups, aligned and addressed on multiples of LMUL

This issue will look at the simplest implementations of fractional LMUL.

Glossary:

base-arch registers* – the 32 registers addressable when LMUL=1
register group – consecutive registers determined by LMUL>1
register sub-group* – portion of a physical register used by LMUL<1
SLEN – the striping distance in bits
VLEN – the number of bits in a vector register
VLMAX – LMUL * VLEN / SEW
(no name is given to the effective VLEN at different values of LMUL)
vstart – read-write CSR specifying the index of the first element to be executed by a vector instruction
(* whereas the other terms are from the spec, these * terms are added for this discussion)

Guidance.
Fractional LMUL follows the same rules as for LMUL>=1.
VLMAX applies the same.

The simplest extensions to the base retain the fundamental characteristics.
Specifically then, for this proposal, ELEN, SEW (and its encoding in vtype), VLEN and, mask register zero and mask operation behaviour are not changed.

The simplest extension of LMUL to “fractional” is that the observed effects continue predictably.
Specifically,

  • for changes in LMUL there is a corresponding change in VLMAX and

  • fractional LMUL changes by a factor of 2 from adjacent settings.

For LMUL >=1, VLMAX = LMUL * VLEN/SEW
Note: if SEW is unchanged, with variation of LMUL there is a proportional change in VLMAX.
We can multiply both sides by SEW to get LMUL * VLEN = VLMAX * SEW.

This table exhaustively represents this simplest extension effect when SEW is unchanged throughout:

LMUL       VLMAX * SEW        

8               8*VLEN
4               4*VLEN
2               2*VLEN
1                 VLEN
1/2             VLEN/2
1/4             VLEN/4
1/8             VLEN/8

Fractional registers then have diminished capacity, 1/2 to 1/8th of a base-arch register.

The simplest mapping of fractional LMUL registers is one to one (and only one) of the base-arch registers.
All 32 base-arch-registers can participate and register numbering can be the same.

The simplest overlay (analogous to the register group overlay of consecutive base-arch registers) is with zero elements overlaying.
That is, the fractional register sub-group occupies the lowest consecutive bytes in the base-arch register. The bytes are in the same ascending order.
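
For example, with VLEN=128 bits, SEW=32 and LMUL=1/4, the sub-group for register vn occupies the lowest 32 bits (bytes 0-3) of base-arch register vn, giving VLMAX = (1/4)*128/32 = 1.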

I call this iteration zero of the simplest fractional LMUL designs.

Note: Mask behaviour does not change. Mask operations read and write to a base-arch register. Base-arch register zero remains the default mask register. With this "iteration zero" design, as with LMUL>=1, fractional LMUL “register zero”s are substantially limited in their use.

There are some undesirable characteristic of this design.

  • Use of any fractional sub-group is destructive to the underlying base-arch register.
    As sub-groups have less capacity than the underlying base-arch register, overall usable capacity is also diminished, by up to 7/8ths of VLEN for each active sub-group.

  • Such sub-groups are not optimized for widening operations.
    There is no equivalent to SLEN to align single with widened operands.






A couple of questions about the vector spec

Nagendra Gulur
 

I am developing sparse matrix codes using the vector extension on RISCV32 using SPIKE simulator. Based on my understanding of the spec thus far, I wanted to ask a couple of questions about the spec. I hope that this is the correct group to post such queries to.

1. Vector reductions (such as vector single-width integer reduction instructions) write their reductions to vd[0]. This results in committing vd as destination and makes it hard to use other elements of vd (vd[1], vd[2], ..) unless some shift/mask operations are employed. Given the need to efficiently use vector registers, I was wondering if a variant of these instructions where the destination is a scalar register could be defined. In most configs, a single scalar register for destination should suffice. In rare cases, a scalar register pair may act as destination. If the common cases of 8/16/32 bit SEW based reductions could be supported to use scalar dest, that would free up a vector register. That would be very helpful in codes that need to retain as many sub-blocks of data as possible inside registers.

2. Many common sparse matrix formats (such as CSR, CSC, COO, etc) use metadata in the form of non-zero column (CSR) or row (CSC) indices. However the actual element address offsets are in terms of element widths. For eg: column indices 0, 1 and 2 in a matrix with 32-bit elements correspond to address offsets 0, 4 and 8 bytes. Thus, the code requires the use of a scaling instruction to scale the indices to address offsets. This instruction has to run inside innermost loops. One way to avoid such a separate scale instruction is to embed the common cases of shifting left by  0/1/2/3 inside the vector load instruction itself. I am referring to the vector load that loads the indices from memory to a vector. With this, the vector load would load the indices AND perform scaling (1B /2B/ 4B/ 8B left shift of each loaded element). That way, the vector register would directly contain address offsets after loading and the code will not need to include another scaling instruction. I have not looked at the full details of instruction format details to see how a 2-bit shift field could be incorporated but perhaps some of the lumop field reserved values could be used to encode a shift?

Best Regards
Nagendra


Re: A couple of questions about the vector spec

Guy Lemieux
 

1. A vector register is deliberately used as the destination of
reductions. If the destination is a scalar register, then tight
coupling between the vector and scalar units would be necessary, and
concurrency would be reduced (because the scalar unit might have to
stall until the vector reduction is completed).
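
For context, a sketch of how a reduction result is moved to a scalar today (assuming the v0.8-era mnemonics vredsum.vs and vext.x.v; register choices are illustrative):

     vredsum.vs v4, v8, v4        # v4[0] = v4[0] + sum of the elements of v8
     vext.x.v   t0, v4, x0        # copy element 0 of v4 into scalar t0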

2. Yes, vector-indexed-load instructions such as vlxe.v currently
treat offsets as byte offsets. I could see this issue being debated,
but it would require a shift by (0,1,2,3,4) for up to 64-bit SEWs. If
there is a way you can use vrgather.vv instread, it uses element-size
offsets.

Guy



Re: A couple of questions about the vector spec

Andrew Waterman
 



On Tue, Mar 10, 2020 at 3:07 PM Guy Lemieux <glemieux@...> wrote:
1. A vector register is deliberately used as the destination of
reductions. If the destination is a scalar register, then tight
coupling between the vector and scalar units would be necessary, and
concurrency would be reduced (because the scalar unit might have to
stall until the vector reduction is completed).

2. Yes, vector-indexed-load instructions such as vlxe.v currently
treat offsets as byte offsets. I could see this issue being debated,
but it would require a shift by (0,1,2,3,4) for up to 64-bit SEWs. If
there is a way you can use vrgather.vv instead, it uses element-size
offsets.

Indices must be able to represent general pointers in some cases (e.g. vectorizing (*(x[i]))++ instead of y[x[i]]++), so implicitly scaling the indices causes problems, particularly when misaligned addresses are legal.

The 64-bit instruction encoding could offer another variant of these instructions that scales indices by SEW/8.  In the meantime, I don't think the extra shift is too much to ask.






Re: A couple of questions about the vector spec

Nagendra Gulur
 

I am not sure I replied correctly (I hit reply-to-sender but am not sure who actually received it; I am still learning to work with the list). Thanks for the quick replies.

1. Yes, understood about the scalar-destination complication. I will have to figure out a better use of vd[1], vd[2], etc.

2. The vrgather (per the 0.8 spec) does vd[i] = vs2[vs1[i]] -- I am not sure how that addresses the conversion from index to address offset; I still need the bit shift to happen somewhere in the code.

I am not suggesting implicit scaling, but a programmer-specified scaling amount in the instruction (a left shift by 0/1/2/3). Based on knowledge of the matrix element data type, the programmer can certainly specify the shift amount (see the sketch below).
Note that in my sparse matrix-vector multiply code, the innermost loop is 9 instructions including the scaling instruction. Removing it would reduce the dynamic instruction count by about 10% (1 of 9 instructions), which seems a worthwhile saving.
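
Purely as a sketch of what I am proposing (the "sl2" operand below is invented syntax, not anything in the current spec):

    # today: load the indices, then scale them with a separate instruction
    vle.v   v2, (a1)
    vsll.vi v2, v2, 2

    # hypothetical: the load itself scales each loaded element, e.g. by encoding
    # a 2-bit shift amount in one of the reserved lumop values
    vle.v   v2, (a1), sl2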

Best Regards
Nagendra


Re: A couple of questions about the vector spec

Guy Lemieux
 

1. Yes, understood about the scalar destination complication. I have to figure a better use of vd[1], vd[2] etc.
Possibly vslide1up after every reduction, producing a vector of reductions (possibly in reverse order, unless you rearrange your outer-loop order).
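
A rough sketch of the idiom, using v0.8-era mnemonics (in 0.8 the element-0 read is vext.x.v; later drafts rename it vmv.x.s, so treat the exact spelling as an assumption):

    # v8 = partial products for the current row, v9[0] = 0
    # v16 = the per-row results assembled so far (in reverse row order)
    vredsum.vs   v1, v8, v9       # v1[0] = sum over this row
    vext.x.v     t1, v1, x0       # read element 0 into a scalar register
    vslide1up.vx v17, v16, t1     # v17[0] = row sum, v17[i] = v16[i-1]
    # vd may not overlap vs2 for slides, so alternate v16/v17 between rows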

2. The vrgather (per 0.8 spec) does vd[i] = vs2[vs1[i]] -- I am not sure how this fixes the conversion from index to address offset. I need the bit shift to happen somewhere in the code.
I'm not suggesting that you use vrgather to convert indices to byte
offsets. I'm wondering whether there is a way to handle sparse
rows/columns entirely differently, one that uses vrgather instead of
vlx (note: I have no idea whether that is possible, as I've never
tried to implement sparse matrix code). However, vlx and vrgather are
very similar: one applies to memory byte addresses, the other to
vector register elements, so obviously there is some difference.

I am not suggesting implicit scaling but programmer specified scaling amount in the instruction (0/1/2/3 bit shift). Based on knowledge of the matrix element data type, the programmer can certainly specify a shift amount.
You are overthinking this. A well-designed vector unit may be able to
fuse/chain a vsll instruction with a vlx instruction; you shouldn't
assume one instruction must run to completion before the next one
starts.

Note that in my sparse matrix vector multiply code, the innermost loop is 9 instructions including the scaling instruction. If this were removed, it reduces dynamic instruction count by about 10%. It seems to be a valuable saving.
Yes, it would save instruction issue bandwidth. On simple vector
implementations it may speed things up; on complex ones it shouldn't
make a difference, since this will be an anticipated, frequently
occurring pattern.

Guy


Re: A couple of questions about the vector spec

Nagendra Gulur
 

Thank you for the vslide1up suggestion. I might use this with loop unrolling.

Best Regards
Nagendra


64-bit instruction encoding wish list

Andrew Waterman
 

Guy pointed out to me that, since several V ISA-design issues have been punted to an eventual 64-bit instruction encoding, we should consider recording them somewhere. I've set up the GitHub wiki for recording design rationale that doesn't belong in the spec proper, and have seeded it with a very short list of hypothetical 64b features. Feel free to edit directly if you have write permissions. https://github.com/riscv/riscv-v-spec/wiki
