[tech-vector-ext] Feedback on RISC-V V-extension (FFTW3 and Chacha20)


Krste Asanovic
 

We have seen very efficient FFT execution without adding new
instructions, but there is interest in a layer of DSP extension on top
of the base vector ISA. One goal of the EDIV extension would be to
support real+imaginary types and possibly operations.

I haven't looked at the Chacha20 code - but either crypto extensions
or EDIV-based interleaving might help with the interleaving it needs.

Krste

On Tue, 10 Dec 2019 16:16:17 +0100, "Roger Ferrer Ibanez" <roger.ferrer@...> said:
| Hi all,
| Romain Dolbeau is an HPC expert from Atos and participates in the EPI project. Romain provided the following feedback as a result of
| porting FFTW3 and Chacha20 to the V-extension himself. He kindly agreed to share his feedback with the Vext list.

| If you have comments, please try to keep him in the loop when replying to this email, as Romain is not subscribed to this mailing list. I'm also
| attaching the feedback in the original markdown format.

| Thank you very much,
| Roger

| Introduction

| * EPI is implementing a VPU supporting RISC-V “V” extension, and doing some associated compiler & software work

| * This is some feedback on the current version of the “V” extension draft (0.7.1)

| * Issues found while developing some algorithms & implementations to test and validate both software and hardware layers

| In-vector data management issues

| * Two different codes: FFTW3 (complex arithmetic, floating-point) & Chacha20 (cryptographic, integer)

| * Some significant overheads due to lack of in-vector data management instructions in “V”

| * Major issue for efficient software

| Please note: FFTW3 is about the implementation in the library, not the high-level algorithm.

| Current situation

| * Like most (all?) SIMD instruction sets, most “V” instructions specify per-lane semantics

| * This is fine for most operations on elementary data types (integer, FP)

| * There are also some reduction operations for common reduction patterns across lanes

| * Indexed memory operations (scatter/gather) can take care of many data organization patterns in memory

| + structures, multi-dimensional arrays, indirect access, …

| * However, some algorithms require data in one lane to interact with data in a different lane

| + here we consider two examples: complex arithmetic & a crypto algorithm

| * Such patterns are usually quite simple

| + Locally, a lane interacts with an immediate neighbor

| + Overall, regular patterns like (de)interleaving

| * “V” can make those work, but at a high cost

| + Need to manufacture an index vector to gather/merge data

| * Other SIMD instruction sets have dedicated instructions

| * The issue is with porting legacy codes, or with management-intensive algorithms

| + High-level views of algorithms, which allow memory reorganisation to suit “V”, are mostly fine

| First example, FFTW3 library

| * FFTW3 implements SIMD as a two-step process

| 1. Abstract SIMD code generation for “codelets”, using C macros

| 2. An “implementation file” containing SIMD ISA-specific implementation

| * This mostly hardwires some assumptions in the implementation, such as interleaved real/imaginary parts in vector registers

| * Some macros are trivial (e.g. per-element sum), some less so (full complex multiplication)

| * Currently, the “V” implementation has a lot of overhead to manage data

| + Explanations below, code can be seen in https://github.com/rdolbeau/fftw3 (branch risc-v)

| * (a + bi) × (a’ + b’i) = (a * a’ - b * b’) + (a * b’ + a’ * b)i

| * If stored in ‘interleaved’ format …

| * (v[even] + v[odd]i) * (v’[even] + v’[odd]i) = (v[even] * v’[even] - v[odd] * v’[odd]) + (v[even] * v’[odd] + v’[even] * v[odd])i

| 1. Interaction between ‘odd’ and ‘even’ lanes

| 2. The last operation in ‘even’ lanes is a subtraction, but in ‘odd’ lanes it is an addition

| * Of course, at the algorithm level, de-interleaving/re-interleaving is possible from memory with Zvlsseg (segment load/store), but not from
| registers. And for FFTW3, the SIMD infrastructure is not designed to support de-interleaving of data.
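To make the lane pattern concrete, here is a minimal scalar sketch of the interleaved multiply (an illustrative helper, not FFTW3 code): every lane needs its immediate neighbor's value, and the final operation alternates between subtraction and addition by lane parity.

```c
#include <assert.h>

/* Scalar model of an interleaved complex multiply (illustrative, not
 * FFTW3 code): v[2k] holds the real part and v[2k+1] the imaginary
 * part of complex element k. Each lane reads its neighbor; even lanes
 * end in a subtraction, odd lanes in an addition. */
static void zmul_interleaved(const double *v, const double *w,
                             double *out, int vl /* complex elements */)
{
    for (int k = 0; k < vl; k++) {
        double a = v[2*k], b = v[2*k + 1];   /* a + bi */
        double c = w[2*k], d = w[2*k + 1];   /* c + di */
        out[2*k]     = a * c - b * d;        /* even lane: subtract */
        out[2*k + 1] = a * d + c * b;        /* odd lane: add       */
    }
}
```

A per-lane SIMD ISA has no trouble with the multiplies; it is the neighbor access and the parity-dependent add/sub that need extra instructions.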

| Not-so-readable example “V” code

| * To implement a full complex multiplication in “V”, we need…

| + Generate the indices for the ‘register gather’ operations (vid, vbroadcast, logical operations, …)

| + Gather the data for the first multiplication

| + Do the first multiplication,

| + Do an explicit conjugate for the add/sub

| + Gather more data

| + Do the final fmacc

| + In the example below, VL is the vector length in complex elements; tx & sr are interleaved (real/imag) vector data

| * ‘vbroadcast’ should be merged in a subsequent operation (currently the EPI compiler doesn’t implement scalar/vector built-ins)

| * Full code visible on the GitHub referenced earlier

| static inline V VZMUL(V tx, V sr) // fixme: improve
| {
|   V tr;
|   V ti;
|   Vint idx = TYPEINT(vid)(2*VL); // (0, 1, 2, 3, ...)
|   Vint vnotone = TYPEINT(vbroadcast)(DS(~1ull,~1), 2*VL);
|   Vint vone = TYPEINT(vbroadcast)(1, 2*VL);
|   Vint hidx;
|   hidx = TYPEINT(vand)(idx, vnotone, 2*VL); // (0, 0, 2, 2, ...)
|   tr = TYPE(vrgather)(tx, hidx, 2*VL); // Real, Real of tx
|   hidx = TYPEINT(vor)(idx, vone, 2*VL); // (1, 1, 3, 3, ...) // could be hidx + vone, ...
|   ti = TYPE(vrgather)(tx, hidx, 2*VL); // Imag, Imag of tx
|   tr = TYPE(vfmul)(sr, tr, 2*VL); // (Real, Real)[tx] * (Real, Imag)[sr]
|   hidx = TYPEINT(vand)(idx, vone, 2*VL); // (0, 1, 0, 1, ...)
|   sr = TYPEMASK(vfsgnjn)(sr, sr, sr, INT2MASK(hidx), 2*VL); // conjugate of sr
|   hidx = TYPEINT(vxor)(idx, vone, 2*VL); // (1, 0, 3, 2, ...)
|   sr = TYPE(vrgather)(sr, hidx, 2*VL); // Imag, Real of (conjugate of) sr
|   return TYPE(vfmacc)(tr, ti, sr, 2*VL); // (Imag, Imag)[tx] * (-Imag, Real)[sr] + (Real, Real)[tx] * (Real, Imag)[sr]
| }
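A scalar re-expression of the sequence above may help untangle it. The sketch below (illustrative names, not the EPI built-ins) performs the same gather-even, gather-odd, multiply, masked-negate, pair-swap and fmacc steps one lane at a time:

```c
#include <assert.h>

/* Scalar walk-through of the VZMUL sequence (illustrative only).
 * tx, sr: interleaved real/imag arrays of 2*vl doubles. */
static void vzmul_scalar(const double *tx, const double *sr,
                         double *out, int vl)
{
    for (int i = 0; i < 2 * vl; i++) {
        double tr = tx[i & ~1] * sr[i];        /* gather even of tx, vfmul   */
        double ti = tx[i | 1];                 /* gather odd of tx           */
        int j = i ^ 1;                         /* swap within the pair       */
        double sc = (j & 1) ? -sr[j] : sr[j];  /* masked vfsgnjn (conjugate) */
        out[i] = tr + ti * sc;                 /* vfmacc                     */
    }
}
```

In the vector version, each of these per-lane expressions costs a full-width instruction, plus the index-vector setup that the scalar code gets for free from `i`.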

| Other SIMD instruction sets

| * Also visible in GitHub

| * AVX-512 has ‘fmaddsub’ to avoid the explicit conjugate (branch master)

| + Some data-gathering steps need only a single instruction (mov[lh]?dup, unpackhi)

| + Only 5 instructions vs. 11/13 for V

| * SVE has complex arithmetic support (branch arm-sve-alt)

| + ‘CMLA’ instruction designed to be used in pairs, implementing complex mul

| + With free conjugation for e.g. VZMULJ macro

| + Only 2 instructions
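As a rough scalar model of that two-instruction pair (based on my reading of the SVE CMLA semantics; hypothetical helper, not ACLE code): rotation #0 accumulates n.re * m and rotation #90 accumulates i * n.im * m, so together they form a full complex multiply-accumulate.

```c
#include <assert.h>

/* Rough scalar model of SVE CMLA over interleaved lanes (based on my
 * reading of the SVE spec; illustrative only). acc, n, m each hold
 * 2*vl doubles: real parts in even lanes, imaginary parts in odd. */
static void cmla(double *acc, const double *n, const double *m,
                 int rot, int vl)
{
    for (int k = 0; k < vl; k++) {
        if (rot == 0) {                          /* #0:  acc += n.re * m     */
            acc[2*k]     += n[2*k] * m[2*k];
            acc[2*k + 1] += n[2*k] * m[2*k + 1];
        } else {                                 /* #90: acc += i * n.im * m */
            acc[2*k]     -= n[2*k + 1] * m[2*k + 1];
            acc[2*k + 1] += n[2*k + 1] * m[2*k];
        }
    }
}
```

The pair `cmla(acc, n, m, 0, vl); cmla(acc, n, m, 90, vl);` accumulates the full product n × m into acc, which is why two instructions suffice where the “V” version needs around a dozen.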

| Possible solutions

| 1. Problem-specific: add some complex-arithmetic instructions

| + addsub depending on lane parity

| + complex mul similar to CMLA

| + complex mul similar to BlueGene/Q instructions

| + …

| * Very efficient for FFTW3, C/C++/Fortran complex types, etc.

| * Might need more than a few opcodes, hardware might be a bit complex

| * Performance on par with SVE

| 2. Generic: add some data management operations

| + Similar to AVX-512 unpack/movdup/…

| + Similar to SVE zip/uzp/trn

| + …

| + and perhaps a cheap ‘addsub’ as well

| * Less efficient for complex arithmetic, but also useful for other problems

| * See Chacha20 further down in this document

| * Performance on par with AVX-512

| 3. Not mutually exclusive :-)

| Second example, Chacha20

| * Chacha20 is a ‘stream’ crypto algorithm

| + https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant

| + Supercop infrastructure with many implementations in crypto_stream/chacha20: https://bench.cr.yp.to/supercop.html

| + “V” source code for the supercop infrastructure: http://www.dolbeau.name/dolbeau/crypto/crypto_stream_chacha20_dolbeau_riscv-v.tgz

| * Essentially, a pseudo-random function

| + From (a) a secret key, (b) an input value (‘nonce’) and (c) a counter, generate a block of pseudo-random bits

| * Purely integer operations

| + Add, Xor, Rotation - rotation is very useful in crypto, but missing in V (… as in most SIMD ISAs other than AVX-512)
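For illustration, here is the core of the ChaCha quarter-round in plain C: without a rotate instruction, every rotation must be synthesized from two shifts and an OR in each lane.

```c
#include <assert.h>
#include <stdint.h>

/* Without a vector rotate, each 32-bit rotation costs two shifts and
 * an OR (per lane): */
static inline uint32_t rotl32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}

/* The ChaCha quarter-round: only add, xor and rotate. */
static void quarterround(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

With four rotations per quarter-round and eight quarter-rounds per double-round, the missing rotate adds a substantial per-round instruction overhead on “V”.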

| * Embarrassingly parallel across blocks

| + The counter grows monotonically, so any number of blocks can be generated simultaneously

| + But - amount of data really needed (i.e. truly available parallelism) is use-case/application-dependent

| * Vector implementations tend to distribute each block over a single lane across multiple vector registers

| + 4 blocks in 128 bits, 16 in 512 bits

| * But the data has to be stored contiguously in memory at the end

| * Effectively, a ‘transpose’ is needed before storing the data

| * Examples

| + AVX2 doesn’t support scatter, so partial-vector (128-bit) load/xor/store operations are used at the final step, after a partial transpose
| using unpack{hi,lo}_epi{32,64}

| + AVX-512 can scatter 128-bit blocks to memory after a partial transpose, or do a full transpose and store dense 512-bit vectors

| + SVE doesn’t specify the register width; the implementation currently sticks to a partial transpose and 128-bit scatters
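What the final transpose has to accomplish can be modeled in a few scalar lines: state word w of block k, computed in lane k of vector register w, must land contiguously in memory as word w of block k. An illustrative four-lane model:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the final transpose (illustrative): in[w][k] is
 * state word w of block k, held in lane k of vector register w;
 * out[k][w] is the contiguous memory layout of block k. AVX2 and SVE
 * do this with a few stages of fixed-pattern unpack/zip instructions
 * rather than a general gather. */
static void transpose4x4(const uint32_t in[4][4], uint32_t out[4][4])
{
    for (int w = 0; w < 4; w++)       /* register index */
        for (int k = 0; k < 4; k++)   /* lane index     */
            out[k][w] = in[w][k];
}
```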

| * RISC-V “V” can implement the algorithm for variable-width registers in a way similar to SVE

| * Because “V” allows software to specify the active vector length, it can adapt better to the level of parallelism

| + SVE requires masking if the amount of data required is smaller than what a full vector would produce – not implemented (yet)

| + V can simply specify a narrower vector width

| + Both could specialize for e.g. 512 bits with full transpose – not implemented, optimization might be hardware-dependent

| * The problem is, “V” lacks the data management instructions for the transpose

| * It has only ‘register gather’ and ‘merge’, which are very general-purpose

| * “register gather” can use any lane as input, so it is probably very costly in hardware, with a high producer-consumer latency

| + Cost confirmed by discussion with hardware implementer

| * AVX and SVE use specialized instructions with known, fixed inter-lane dependencies

| + Somewhat costly, but not prohibitively so

| * Currently implemented using SVE semantics (zip, uzp, trn) & an implementation file that emulates SVE’s behavior using the more general
| instructions

| + Quick and dirty but illustrates the point

| + The current EPI LLVM compiler for V is good at CSE and hoisting, so the overhead is a good approximation of ‘better’ code

| * Each emulated SVE instruction requires at least 2 gathers & 1 merge to be implemented in “V”, plus support instructions to generate the
| indices for the gathers / the mask for the merge

| + Many of the support instructions can be/are eliminated by the compiler’s CSE & hoisting passes

| * That’s a significant overhead at the end of the algorithm

| + And it is worse for the round-reduced versions Chacha12 and Chacha8 (fewer computations for the same amount of data)

| * Adding some fixed-pattern data management instructions would remove this overhead

| 1. Fewer instructions by removing support instructions

| 2. Fewer instructions by combining two registers in a specific way instead of gather/gather/merge

| 3. Lower-latency instructions from simpler semantics

| o e.g. ZIP1 only uses the lower halves of its input registers, so there is no need to wait for the higher halves, unlike “register
| gather”, which can pick from anywhere
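The point about ZIP1 can be made concrete with a scalar sketch (hypothetical helper, not the actual port): emulating it on “V” takes an index computation, two gathers and a parity merge, yet the gathers only ever touch the lower halves of the inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Emulating SVE ZIP1 (interleave the lower halves of two registers)
 * with general gather + merge, as the "V" port must (illustrative).
 * n = lanes per register; note that only indices < n/2 are ever read,
 * which is why a dedicated ZIP1 can have lower latency than a general
 * gather. */
static void zip1_emulated(const uint32_t *a, const uint32_t *b,
                          uint32_t *out, int n)
{
    for (int i = 0; i < n; i++) {
        int idx = i / 2;               /* index vector (support insns) */
        uint32_t ga = a[idx];          /* vrgather #1                  */
        uint32_t gb = b[idx];          /* vrgather #2                  */
        out[i] = (i & 1) ? gb : ga;    /* vmerge on the parity mask    */
    }
}
```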


| --
| Roger Ferrer Ibáñez - roger.ferrer@...
| Barcelona Supercomputing Center - Centro Nacional de Supercomputación

