[tech-vector-ext] Feedback on RISC-V V-extension (FFTW3 and Chacha20)
We have seen very efficient FFT execution without adding new
instructions, but there is interest in a layer of DSP extension on top
of the base vector ISA. One goal of the EDIV extension would be to
support real+imaginary types and possibly operations.
I haven't looked at the Chacha20 code - but either crypto extensions
or EDIV-based interleaving might help with the interleaving it needs.
Krste
On Tue, 10 Dec 2019 16:16:17 +0100, "Roger Ferrer Ibanez" <roger.ferrer@...> said:
| Hi all,
| Romain Dolbeau is an HPC expert from Atos and participates in the EPI project. Romain provided the following feedback after himself
| porting FFTW3 and Chacha20 to the V-extension. He kindly agreed to share his feedback with the Vext list.
| If you have comments, try to keep him in the loop when replying to this email, as Romain is not subscribed to this mailing list. I'm also
| attaching the feedback in the original markdown format.
| Thank you very much,
| Roger
| --
| Roger Ferrer Ibáñez - roger.ferrer@...
| Barcelona Supercomputing Center - Centro Nacional de Supercomputación
|
| # Introduction #
| * EPI is implementing a VPU supporting RISC-V "V" extension, and doing some associated compiler & software work
| * This is some feedback on the current version of the "V" extension draft (0.7.1)
| * Issues found while developing some algorithms & implementations to test and validate both software and hardware layers
| # In-vector data management issues #
| * Two different codes: FFTW3 (complex arithmetic, floating-point) & Chacha20 (cryptographic, integer)
| * Some significant overheads due to lack of in-vector data management instructions in "V"
| * Major issue for efficient software
| Please note: FFTW3 is about the _implementation_ in the library, not the high-level algorithm.
| ## Current situation ##
| * Like most (all?) SIMD instruction sets, most "V" instructions specify per-lane semantics
| * This is fine for most operations on elementary data types (integer, FP)
| * There are also some reduction operations for common reduction patterns across lanes
| * Indexed memory operations (scatter/gather) can take care of many data organization patterns in memory
| * structures, multi-dimensional arrays, indirect access, ...
| * However, some algorithms require data in one lane to interact with data in a different lane
| * here we consider two examples: complex arithmetic & a crypto algorithm
| * Such patterns are usually quite simple
| * Locally, lane with an immediate neighbor
| * Overall, regular patterns like (de)interleaving
| * "V" can make those work, but at a high cost
| * Need to manufacture indices vector to gather/merge data
| * Other SIMD instruction sets have dedicated instructions for these patterns
| * The issue arises when porting legacy codes, or with data-management-intensive algorithms
| * High-level formulations of algorithms, which allow reorganising data in memory to suit "V", are mostly fine
| ## First example, FFTW3 library ##
| * FFTW3 implements SIMD as a two-step process
| 1. Abstract SIMD code generation for "codelets", using C macros
| 2. An "implementation file" containing the SIMD ISA-specific implementation
| * This mostly hardwires some assumptions into the implementation, such as interleaved real/imaginary parts in vector registers
| * Some macros are trivial (e.g. per-element sum), some less so (full complex multiplication)
| * Currently, the "V" implementation has a lot of data-management overhead
| * Explanations below, code can be seen in https://github.com/rdolbeau/fftw3 (branch risc-v)
| * (a + bi) × (a' + b'i) = (a * a' - b * b') + (a * b' + a' * b)i
| * If stored in 'interleaved' format ...
| * (v[even] + v[odd]i) * (v'[even] + v'[odd]i) = (v[even] * v'[even] - v[odd] * v'[odd]) + (v[even] * v'[odd] + v'[even] * v[odd])i
| 1. Interaction between 'odd' and 'even' lanes
| 2. The 'even' lanes' last operation is a subtraction, but the 'odd' lanes' last operation is an addition
| * Of course, at the algorithm level, de-interleaving/re-interleaving is possible from memory with Zvlsseg (segment load/store), but not from registers. And for FFTW3, the SIMD infrastructure is not designed to support de-interleaving of data.
| ## Not-so-readable example "V" code ##
| * To implement a full complex multiplication in "V", we need to...
| * Generate the indices for the 'register gather' operations (vid, vbroadcast, logical operations, ...)
| * Gather the data for the first multiplication
| * Do the first multiplication
| * Do an explicit conjugate for the add/sub
| * Gather more data
| * Do the final fmacc
| * In the example below, VL is the vector length in _complex_ elements; tx & sr are interleaved (real/imag) vector data
| * 'vbroadcast' should be merged into a subsequent operation (currently the EPI compiler doesn't implement scalar/vector built-ins)
| * The full code is visible in the GitHub repository referenced earlier
| ```
| static inline V VZMUL(V tx, V sr) // fixme: improve
| {
|     V tr;
|     V ti;
|     Vint idx = TYPEINT(vid)(2*VL); // (0, 1, 2, 3, ...)
|     Vint vnotone = TYPEINT(vbroadcast)(DS(~1ull,~1), 2*VL);
|     Vint vone = TYPEINT(vbroadcast)(1, 2*VL);
|     Vint hidx;
|     hidx = TYPEINT(vand)(idx, vnotone, 2*VL); // (0, 0, 2, 2, ...)
|     tr = TYPE(vrgather)(tx, hidx, 2*VL); // Real, Real of tx
|     hidx = TYPEINT(vor)(idx, vone, 2*VL); // (1, 1, 3, 3, ...) // could be hidx + vone, ...
|     ti = TYPE(vrgather)(tx, hidx, 2*VL); // Imag, Imag of tx
|     tr = TYPE(vfmul)(sr, tr, 2*VL); // (Real, Real)[tx] * (Real, Imag)[sr]
|     hidx = TYPEINT(vand)(idx, vone, 2*VL); // (0, 1, 0, 1, ...)
|     sr = TYPEMASK(vfsgnjn)(sr, sr, sr, INT2MASK(hidx), 2*VL); // conjugate of sr
|     hidx = TYPEINT(vxor)(idx, vone, 2*VL); // (1, 0, 3, 2, ...)
|     sr = TYPE(vrgather)(sr, hidx, 2*VL); // Imag, Real of (conjugate of) sr
|     return TYPE(vfmacc)(tr, ti, sr, 2*VL); // (-Imag, Real)[sr] * (Imag, Imag)[tx] + (Real, Real)[tx] * (Real, Imag)[sr]
| }
| ```
| ## Other SIMD instruction sets ##
| * Also visible on GitHub
| * AVX-512 has 'fmaddsub' to avoid the explicit conjugate (branch master)
| * Some of the data gathering needs only a single instruction (mov[lh]?dup, unpackhi)
| * Only 5 instructions vs. 11/13 for "V"
| * SVE has complex arithmetic support (branch arm-sve-alt)
| * The 'CMLA' instruction is designed to be used in pairs, implementing a complex multiplication
| * With free conjugation for e.g. the VZMULJ macro
| * Only 2 instructions
| ## Possible solutions ##
| 1. Problem-specific: add some complex-arithmetic instructions
| * addsub depending on lane parity
| * complex mul similar to CMLA
| * complex mul similar to the BlueGene/Q instructions
| * ...
| * Very efficient for FFTW3, C/C++/Fortran complex types, etc.
| * Might need more than a few opcodes; the hardware might be a bit complex
| * Performance on par with SVE
| 2. Generic: add some data management operations
| * Similar to AVX-512 unpack/movdup/...
| * Similar to SVE zip/uzp/trn
| * ...
| * and perhaps a cheap 'addsub' as well
| * Less efficient for complex arithmetic, but also useful for other problems
| * See Chacha20 further down in this document
| * Performance on par with AVX-512
| 3. The two options are not mutually exclusive :-)
| # Second example, Chacha20 #
| * Chacha20 is a 'stream' cipher
| * https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant
| * Supercop infrastructure with many implementations in `crypto_stream/chacha20`: https://bench.cr.yp.to/supercop.html
| * "V" source code for the supercop infrastructure: http://www.dolbeau.name/dolbeau/crypto/crypto_stream_chacha20_dolbeau_riscv-v.tgz
| * Essentially, a pseudo-random function
| * From (a) a secret key, (b) an input value ('nonce'), and (c) a counter, generate a block of pseudo-random bits
| * Purely integer operations
| * Add, Xor, Rotation - rotation is _very_ useful in crypto, missing in V (... as in most SIMD ISAs other than AVX-512)
| * Embarrassingly parallel across blocks
| * The counter grows monotonically, so any number of blocks can be generated simultaneously
| * But the amount of data really needed (i.e. the truly available parallelism) is use-case/application-dependent
| * Vector implementations tend to distribute each block over a single lane across multiple vector registers
| * 4 blocks in 128 bits, 16 in 512 bits
| * But the data has to be stored contiguously in memory at the end
| * Effectively, a 'transpose' is needed before storing the data
| * Examples
| * AVX2 doesn't support scatter, so partial-vector (128-bit) load/xor/store is used at the final step, after a partial transpose using unpack{hi,lo}_epi{32,64}
| * AVX512 can scatter 128-bit blocks to memory after a partial transpose, or do a full transpose and store dense 512-bit vectors
| * SVE doesn't specify the register width; the implementation currently sticks to a partial transpose and 128-bit scatters
| * RISC-V "V" can implement the algorithm for variable-width registers in a way similar to SVE
| * Because "V" allows software to _specify_ the vector length, it can adapt better to the available level of parallelism
| * SVE requires masking if the amount of data required is smaller than what a full vector would produce – not implemented (yet)
| * V can simply specify a narrower vector length
| * Both could specialize for e.g. 512 bits with a full transpose – not implemented; the optimization might be hardware-dependent
| * The problem is, "V" lacks the data management instructions for the transpose
| * It has only 'register gather' and 'merge', which are _very_ general-purpose
| * "register gather" can use _any_ lane as input, so it is probably very costly in hardware, with a high producer-consumer latency
| * Cost confirmed by discussion with a hardware implementer
| * AVX and SVE use specialized instructions with known, fixed inter-lane dependencies
| * Somewhat costly, but not prohibitively so
| * Currently implemented using SVE semantics (zip, uzp, trn) & an implementation file that emulates SVE's behavior using the more general instructions
| * Quick and dirty, but it illustrates the point
| * The current EPI LLVM compiler for V is good at CSE and hoisting, so the overhead is a good approximation of 'better' code
| * Each emulated SVE instruction requires at least 2 gathers & 1 merge to be implemented in "V", plus support instructions to generate the indices for the gathers / the mask for the merge
| * Many of the support instructions can be/are eliminated by the compiler's CSE & hoisting passes
| * That's a significant overhead at the end of the algorithm
| * And it is worse for the round-reduced versions Chacha12 and Chacha8 (fewer computations for the same amount of data)
| * Adding some fixed-pattern data management instructions would remove this overhead
| 1. Fewer instructions by removing support instructions
| 2. Fewer instructions by combining two registers in a specific way instead of gather/gather/merge
| 3. Lower-latency instructions from simpler semantics
| * e.g. ZIP1 only uses the lower halves of its input registers, so there is no need to wait for the higher halves, unlike "register gather", which can pick from anywhere
instructions, but there is interest in a layer of DSP extension on top
of the base vector ISA. One goal of the EDIV extension would be to
support real+imaginary types and possibly operations.
I haven't looked at the Chacha20 code - but either crypto extensions
or EDIV-based interleaving might help with the interleaving it needs.
Krste
| Hi all,On Tue, 10 Dec 2019 16:16:17 +0100, "Roger Ferrer Ibanez" <roger.ferrer@...> said:
| Romain Dolbeau is a HPC expert from Atos and participates in the EPI project. Romain provided the following feedback as the result of himself
| porting FFTW3 and Chacha20 to the V-extension. He kindly agreed to share his feedback with the Vext list.
| If you have comments, try to keep him in the loop when replying the email as Romain is not subscribed to this mailing list. I'm also attaching
| the feedback in the original markdown format.
| Thank you very much,
| Roger
| Introduction
| * EPI is implementing a VPU supporting RISC-V “V” extension, and doing some associated compiler & software work
| * This is some feedback on the current version of the “V” extension draft (0.7.1)
| * Issues found while developing some algorithms & implementations to test and validate both software and hardware layers
| In-vector data management issues
| * Two different codes: FFTW3 (complex arithmetic, floating-point) & Chacha20 (cryptographic, integer)
| * Some significant overheads due to lack of in-vector data management instructions in “V”
| * Major issue for efficient software
| Please note: FFTW3 is about the implementation in the library, not the high-level algorithm.
| Current situation
| * Like most (all?) SIMD instruction sets, most “V” instructions specify semantic “per-lane”
| * This is fine for most operations on elementary data types (integer, FP)
| * There’s also some reduction operations for common reduction patterns accross lanes
| * Indexed memory operations (scatter/gather) can take care of many data organization patterns in memory
| + structures, multi-dimensional arrays, indirect access, …
| * However, some algorithms require data in one lane to interact with data in a different lane
| + here we consider two examples: complex arithmetic & a crypto algorithm
| * Such patterns are usually quite simple
| + Locally, lane with an immediate neighbor
| + Overall, regular patterns like (de)interleaving
| * “V” can make those work, but at a high cost
| + Need to manufacture indices vector to gather/merge data
| * Other SIMD instruction set have dedicated instructions
| * Issue is with porting legacy codes, or management-intensive algorithms
| + High-level view of algorithms, allowing for memory reorganisation to suit “V”, are mostly fine
| First example, FFTW3 library
| * FFTW3 implements SIMD as a two-step process
| 1. Abstract SIMD code generation for “codelets”, using C macro
| 2. An “implementation file” containing SIMD ISA-specific implementation
| * This mostly hardwire some assumptions in the implementation, such as interleaved real/imaginary parts in vector registers
| * Some macro are trivial (e.g. per-element sum), some less so (full complex multiplication)
| * Currently, the “V” implementation has a lot of overhead to manage data
| + Explanations below, code can be seen in https://github.com/rdolbeau/fftw3 (branch risc-v)
| * (a + bi) × (a’ + b’i) = (a a’ - b b’) + (a b’ + a’ b)i
| * If stored in ‘interleaved’ format …
| * (v[even]+ v[odd]i) (v’[even] + v’[odd]i) = (v[even] v’[even] - v[odd] v’[odd]) + (v[even] v’[odd] + v’[even] * v[odd])i
| 1. Interaction between ‘odd’ and ‘even’ lanes
| 2. ‘even’ lanes last operation is ‘subtraction’, but ‘odd’ lanes last operation is ‘addition’
| * Of course, at the algorithm level, de-interleaving/re-interleaving is possible from memory with Zvlsseg (segment load/store), but not from
| registers. And for FFTW3, the SIMD infrastructure is not designed to support de-interleaving of data.
| Not-so-readable example “V” code
| * To implement a full complex multiplication in “V”, we need…
| + Generate the indices for the ‘register gather’ operations (vid, vbroadcast, logical operations, …)
| + Gather the data for the first multiplication
| + Do the first multiplication,
| + Do an explicit conjugate for the add/sub
| + Gather more data
| + Do the final fmacc
| + In the example below, VL is the vector length in complex element, tx & sr are interleaved (real/imag) vector data
| * ‘vbroadcast’ should be merged in a subsequent operation (currently the EPI compiler doesn’t implement scalar/vector built-ins)
| * Full code visible on the GitHub referenced earlier
| static inline V VZMUL(V tx, V sr) // fixme: improve
| {
| V tr;
| V ti;
| Vint idx = TYPEINT(vid)(2*VL); // (0, 1, 2, 3, ...)
| Vint vnotone = TYPEINT(vbroadcast)(DS(~1ull,~1), 2*VL);
| Vint vone = TYPEINT(vbroadcast)(1, 2*VL);
| Vint hidx;
| hidx = TYPEINT(vand)(idx, vnotone, 2*VL); // (0, 0, 2, 2, ...)
| tr = TYPE(vrgather)(tx, hidx, 2*VL); // Real, Real of tx
| hidx = TYPEINT(vor)(idx, vone, 2*VL); // (1, 1, 3, 3, ...) // could be hidx + vone, ...
| ti = TYPE(vrgather)(tx, hidx, 2*VL); // Imag, Imag of tx
| tr = TYPE(vfmul)(sr,tr,2*VL); // (Real, Real)[tx] * (Real,Imag)[sr]
| hidx = TYPEINT(vand)(idx, vone, 2*VL); // (0, 1, 0, 1, 0, 1)
| sr = TYPEMASK(vfsgnjn)(sr, sr, sr, INT2MASK(hidx), 2*VL); // conjugate of sr
| hidx = TYPEINT(vxor)(idx, vone, 2*VL); // (1, 0, 3, 2, ...)
| sr = TYPE(vrgather)(sr, hidx, 2*VL); // Imag, Real of (conjugate of) sr
| return TYPE(vfmacc)(tr,ti,sr,2*VL); // (-Imag, Real)[sr] * (Imag, Imag)[tx] + (Real, Real)[tx] * (Real,Imag)[sr]
| }
| Other SIMD instruction set
| * Also visible in GitHub
| * AVX-512 has ‘fmaddsub’ to avoid the explicit conjugate (branch master)
| + Some data gathering only need a single instruction (mov[lh]?dup, unpackhi)
| + Only 5 instructions vs. 11/13 for V
| * SVE has complex arithmetic support (branch arm-sve-alt)
| + ‘CMLA’ instruction designed to be used in pair, implementing complex mul
| + With free conjugation for e.g. VZMULJ macro
| + Only 2 instructions
| Possible solutions
| 1. Problem-specific: add some complex-arithmetic instructions
| + addsub depending on lane parity
| + complex mul similar to CMLA
| + complex mul similar to BlueGene/Q instructions
| + …
| * Very efficient for FFTW3, C/C++/Fortran complex types, etc.
| * Might need more than a few opcodes, hardware might be a bit complex
| * Performance on par with SVE
| 1. Generic: add some data management operations
| + Similar to AVX-512 unpack/movdup/…
| + Similar to SVE zip/uzp/trn
| + …
| + and perhaps a cheap ‘addsub’ as well
| * Less efficient for complex arithmetic, but also useful for other problems
| * Chacha20 further down in this document
| * Performance on par with AVX-512
| 1. Not mutually exclusive :-)
| Second example, Chacha20
| * Chacha20 is a ‘stream’ crypto algorithm
| + https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant
| + Supercop infrastucture with many implementations in crypto_stream/chacha20: https://bench.cr.yp.to/supercop.html
| + “V” source code for the supercop infrastructure: http://www.dolbeau.name/dolbeau/crypto/crypto_stream_chacha20_dolbeau_riscv-v.tgz
| * Essentially, a pseudo-random function
| + From (a) a secret key (b) a input value (‘nonce’) and (c) a counter, generate a block of pseudo-random bits
| * Purely integer operations
| + Add, Xor, Rotation - rotation is very useful in crypto, missing in V (… as in most SIMD ISA other than AVX-512)
| * Embarrassingly parallel across blocks
| + The counter grows monotonously, so any number of blocks can be generated simultaneously
| + But - amount of data really needed (i.e. truly available parallelism) is use-case/application-dependent
| * Vector implementation tends to distribute a block over a single lane in multiple vector register
| + 4 Blocks in 128 bits, 16 in 512 bits
| * But the data has to be stored contiguously in memory at the end
| * Effectively, need to do a ‘transpose’ before storing data
| * Examples
| + AVX2 doesn’t support scatter so partial vector (128 bits) load/xor/store are used at the final step, after a partial transpose using
| unpack{hi,lo}_epi{32,64}
| + AVX512 can scatter 128-bits block to memory after a partial transpose, or do a full transpose and store dense 512-bits vector
| + SVE doesn’t specify register width, currently stick to partial transpose and 128-bits scatter
| * RISC-V “V” can implement the algorithm for variable-width register in a way similar to SVE
| * Because “V” allows to specify the register width, it can adapts better to the level of parallelism
| + SVE requires masking if the amount of data required is smaller than what a full vector would produce – not implemented (yet)
| + V can simply specify a narrower vector width
| + Both could specialize for e.g. 512 bits with full transpose – not implemented, optimization might be hardware-dependent
| * Problem is, “V” lacks the data management instruction for the transpose
| * It has only ‘register gather’ and ‘merge’, which are very general-purpose
| * “register gather” can use any lane as input, so probably very costly in hardware and high producer-consumer latency
| + Cost confirmed by discussion with hardware implementer
| * AVX, SVE uses specialized instructions with known, fixed inter-lane dependencies
| + Somewhat costly, but not prohibitively so
| * Currently, implemented using SVE semantic (zip, uzp, trn) & an implementation file to emulate SVE’s behavior using the more general
| instructions
| + Quick and dirty but illustrates the point
| + Current EPI LLVM compiler for V good at CSE and hoisting so overhead is a good approximation of ‘better’ code
| * Each emulated SVE instruction required at least 2 gather & 1 merge to be implemented in “V”, plus support instructions to generate the
| indices for gather / the mask for merge
| + Many of the support instructions can be/are eliminated by the compiler’s CSE & hoisting passes
| * That’s a significant overhead at the end of the algorithm
| + And is worse for round-reduced version Chacha12 and Chacha8 (fewer computations for the same amount of data)
| * Adding some fixed-pattern data management instruction would remove this overhead
| 1. Fewer instructions by removing support instructions
| 2. Fewer instructions by combining two registers in a specific way instead of gather/gather/merge
| 3. Lower-latency instructions from simpler semantic
| o e.g. ZIP1 only uses lower halves of registers as input, so no need to wait for higher halves, unlike “register gather” which can
| pick from anywhere
|
| --
| Roger Ferrer Ibáñez - roger.ferrer@...
| Barcelona Supercomputing Center - Centro Nacional de Supercomputación
|
| # Introduction #
| * EPI is implementing a VPU supporting RISC-V "V" extension, and doing some associated compiler & software work
| * This is some feedback on the current version of the "V" extension draft (0.7.1)
| * Issues found while developing some algorithms & implementations to test and validate both software and hardware layers
| # In-vector data management issues #
| * Two different codes: FFTW3 (complex arithmetic, floating-point) & Chacha20 (cryptographic, integer)
| * Some significant overheads due to lack of in-vector data management instructions in "V"
| * Major issue for efficient software
| Please note: FFTW3 is about the _implementation_ in the library, not the high-level algorithm.
| ## Current situation ##
| * Like most (all?) SIMD instruction sets, most "V" instructions specify "per-lane" semantics
| * This is fine for most operations on elementary data types (integer, FP)
| * There are also some reduction operations for common reduction patterns across lanes
| * Indexed memory operations (scatter/gather) can take care of many data organization patterns in memory
| * structures, multi-dimensional arrays, indirect access, ...
| * However, some algorithms require data in one lane to interact with data in a different lane
| * here we consider two examples: complex arithmetic & a crypto algorithm
| * Such patterns are usually quite simple
| * Locally, a lane interacts with an immediate neighbor
| * Overall, regular patterns like (de)interleaving
| * "V" can make those work, but at a high cost
| * Need to manufacture indices vector to gather/merge data
| * Other SIMD instruction sets have dedicated instructions
| * Issue is with porting legacy codes, or management-intensive algorithms
| * High-level reworkings of algorithms, which allow memory reorganisation to suit "V", are mostly fine
| ## First example, FFTW3 library ##
| * FFTW3 implements SIMD as a two-step process
| 1. Abstract SIMD code generation for "codelets", using C macro
| 2. An "implementation file" containing SIMD ISA-specific implementation
| * This mostly hardwires some assumptions in the implementation, such as interleaved real/imaginary parts in vector registers
| * Some macros are trivial (e.g. per-element sum), some less so (full complex multiplication)
| * Currently, the “V” implementation has a lot of overhead to manage data
| * Explanations below, code can be seen in https://github.com/rdolbeau/fftw3 (branch risc-v)
| * (a + bi) × (a' + b'i) = (a * a' - b * b') + (a * b' + a' * b)i
| * If stored in ‘interleaved’ format …
| * (v[even] + v[odd]i) * (v'[even] + v'[odd]i) = (v[even] * v'[even] - v[odd] * v'[odd]) + (v[even] * v'[odd] + v'[even] * v[odd])i
| 1. Interaction between ‘odd’ and ‘even’ lanes
| 2. The last operation in ‘even’ lanes is a ‘subtraction’, but in ‘odd’ lanes it is an ‘addition’
| * Of course, at the algorithm level, de-interleaving/re-interleaving is possible from memory with Zvlsseg (segment load/store), but not from registers. And for FFTW3, the SIMD infrastructure is not designed to support de-interleaving of data.
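[Editor's note: to make the lane coupling above concrete, here is a scalar C sketch of a complex multiply on interleaved data. This is not FFTW3 code; the function name is hypothetical.]

```c
#include <stddef.h>

/* Multiply two arrays of complex numbers stored interleaved as
 * (re0, im0, re1, im1, ...). Each even lane needs the neighboring
 * odd lane of both inputs, and the final operation differs by
 * parity: subtraction in even lanes, addition in odd lanes. */
static void zmul_interleaved(const double *v, const double *w,
                             double *out, size_t n_complex)
{
    for (size_t i = 0; i < 2 * n_complex; i += 2) {
        out[i]     = v[i] * w[i]     - v[i + 1] * w[i + 1]; /* real: sub */
        out[i + 1] = v[i] * w[i + 1] + v[i + 1] * w[i];     /* imag: add */
    }
}
```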
| ## Not-so-readable example "V" code ##
| * To implement a full complex multiplication in “V”, we need…
| * Generate the indices for the ‘register gather’ operations (vid, vbroadcast, logical operations, …)
| * Gather the data for the first multiplication
| * Do the first multiplication,
| * Do an explicit conjugate for the add/sub
| * Gather more data
| * Do the final fmacc
| * In the example below, VL is the vector length in _complex_ elements; tx & sr are interleaved (real/imag) vector data
| * ‘vbroadcast’ should be merged into a subsequent operation (currently the EPI compiler doesn’t implement scalar/vector built-ins)
| * Full code visible on the GitHub referenced earlier
| ```
| static inline V VZMUL(V tx, V sr) // fixme: improve
| {
| V tr;
| V ti;
| Vint idx = TYPEINT(vid)(2*VL); // (0, 1, 2, 3, ...)
| Vint vnotone = TYPEINT(vbroadcast)(DS(~1ull,~1), 2*VL);
| Vint vone = TYPEINT(vbroadcast)(1, 2*VL);
| Vint hidx;
| hidx = TYPEINT(vand)(idx, vnotone, 2*VL); // (0, 0, 2, 2, ...)
| tr = TYPE(vrgather)(tx, hidx, 2*VL); // Real, Real of tx
| hidx = TYPEINT(vor)(idx, vone, 2*VL); // (1, 1, 3, 3, ...) // could be hidx + vone, ...
| ti = TYPE(vrgather)(tx, hidx, 2*VL); // Imag, Imag of tx
| tr = TYPE(vfmul)(sr,tr,2*VL); // (Real, Real)[tx] * (Real,Imag)[sr]
| hidx = TYPEINT(vand)(idx, vone, 2*VL); // (0, 1, 0, 1, ...)
| sr = TYPEMASK(vfsgnjn)(sr, sr, sr, INT2MASK(hidx), 2*VL); // conjugate of sr
| hidx = TYPEINT(vxor)(idx, vone, 2*VL); // (1, 0, 3, 2, ...)
| sr = TYPE(vrgather)(sr, hidx, 2*VL); // Imag, Real of (conjugate of) sr
| return TYPE(vfmacc)(tr,ti,sr,2*VL); // (-Imag, Real)[sr] * (Imag, Imag)[tx] + (Real, Real)[tx] * (Real,Imag)[sr]
| }
| ```
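[Editor's note: for readers unfamiliar with the "V" primitive used above, the element semantics of 'register gather' can be modeled in scalar C. This is a sketch only; it ignores masking, grouping, and out-of-range indices.]

```c
#include <stddef.h>

/* Scalar model of vrgather element semantics: out[i] = src[idx[i]].
 * Any output lane may read any input lane, which is exactly what
 * makes the real instruction general-purpose but expensive: every
 * result potentially depends on the whole source register. */
static void vrgather_model(const double *src, const size_t *idx,
                           double *out, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        out[i] = src[idx[i]];
}
```

With idx = (0, 0, 2, 2, ...), as manufactured in VZMUL above, this duplicates the real parts across lane pairs.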
| ## Other SIMD instruction set ##
| * Also visible in GitHub
| * AVX-512 has ‘fmaddsub’ to avoid the explicit conjugate (branch master)
| * Some data gathering needs only a single instruction (mov[lh]?dup, unpackhi)
| * Only 5 instructions vs. 11/13 for V
| * SVE has complex arithmetic support (branch arm-sve-alt)
| * ‘CMLA’ instruction designed to be used in pairs, implementing complex mul
| * With free conjugation for e.g. VZMULJ macro
| * Only 2 instructions
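[Editor's note: a scalar sketch of my reading of the SVE CMLA rotation pair, on one interleaved complex element. Function and parameter names are hypothetical; only the 0°/90° rotations used for a plain complex multiply are shown.]

```c
/* Scalar model of an SVE-style CMLA on one interleaved complex
 * element (real at index 0, imaginary at index 1). rot is 0 or 90;
 * issuing the #0 and #90 variants back-to-back accumulates a full
 * complex multiply acc += a * b, with no explicit conjugate step. */
static void cmla_model(double acc[2], const double a[2],
                       const double b[2], int rot)
{
    if (rot == 0) {          /* acc += a.re * (b.re + b.im i) */
        acc[0] += a[0] * b[0];
        acc[1] += a[0] * b[1];
    } else {                 /* rot == 90: acc += i * a.im * (b.re + b.im i) */
        acc[0] -= a[1] * b[1];
        acc[1] += a[1] * b[0];
    }
}
```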
| ## Possible solutions ##
| 1. Problem-specific: add some complex-arithmetic instructions
| * addsub depending on lane parity
| * complex mul similar to CMLA
| * complex mul similar to BlueGene/Q instructions
| * ...
| * Very efficient for FFTW3, C/C++/Fortran complex types, etc.
| * Might need more than a few opcodes, hardware might be a bit complex
| * Performance on par with SVE
| 2. Generic: add some data management operations
| * Similar to AVX-512 unpack/movdup/…
| * Similar to SVE zip/uzp/trn
| * ...
| * and perhaps a cheap ‘addsub’ as well
| * Less efficient for complex arithmetic, but also useful for other problems
| * Chacha20 further down in this document
| * Performance on par with AVX-512
| 3. Not mutually exclusive :-)
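[Editor's note: the 'addsub depending on lane parity' primitive mentioned in both options could behave as in this scalar sketch; the name is hypothetical.]

```c
#include <stddef.h>

/* Scalar model of a parity addsub: even lanes subtract, odd lanes
 * add. Combined with a swap of real/imaginary lanes in one operand,
 * this would replace the masked-vfsgnjn conjugate step in the
 * interleaved complex multiply shown earlier. */
static void addsub_model(const double *a, const double *b,
                         double *out, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        out[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}
```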
| # Second example, Chacha20 #
| * Chacha20 is a 'stream' crypto algorithm
| * https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant
| * Supercop infrastructure with many implementations in `crypto_stream/chacha20`: https://bench.cr.yp.to/supercop.html
| * "V" source code for the supercop infrastructure: http://www.dolbeau.name/dolbeau/crypto/crypto_stream_chacha20_dolbeau_riscv-v.tgz
| * Essentially, a pseudo-random function
| * From (a) a secret key, (b) an input value ('nonce'), and (c) a counter, generate a block of pseudo-random bits
| * Purely integer operations
| * Add, Xor, Rotation - rotation is _very_ useful in crypto, missing in V (... as in most SIMD ISAs other than AVX-512)
| * Embarrassingly parallel across blocks
| * The counter grows monotonically, so any number of blocks can be generated simultaneously
| * But - amount of data really needed (i.e. truly available parallelism) is use-case/application-dependent
| * Vector implementations tend to distribute each block over a single lane across multiple vector registers
| * 4 blocks in 128 bits, 16 in 512 bits
| * But the data has to be stored contiguously in memory at the end
| * Effectively, need to do a 'transpose' before storing data
| * Examples
| * AVX2 doesn't support scatter, so partial-vector (128-bit) load/xor/store operations are used at the final step, after a partial transpose using unpack{hi,lo}_epi{32,64}
| * AVX512 can scatter 128-bit blocks to memory after a partial transpose, or do a full transpose and store dense 512-bit vectors
| * SVE doesn't specify the register width; the current code sticks to a partial transpose and 128-bit scatters
| * RISC-V "V" can implement the algorithm for variable-width registers in a way similar to SVE
| * Because "V" allows software to _specify_ the register width, it can adapt better to the level of parallelism
| * SVE requires masking if the amount of data required is smaller than what a full vector would produce – not implemented (yet)
| * V can simply specify a narrower vector width
| * Both could specialize for e.g. 512 bits with full transpose – not implemented, optimization might be hardware-dependent
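[Editor's note: for reference, the per-block work being distributed one-block-per-lane is the ChaCha quarter-round, built only from add, xor, and rotate; the rotate is the operation "V" lacks and must emulate with two shifts and an OR.]

```c
#include <stdint.h>

/* 32-bit left rotation, emulated with shifts and an OR as a
 * SIMD ISA without a rotate instruction would have to do. */
static uint32_t rotl32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}

/* The ChaCha quarter-round: each round applies this to four of the
 * sixteen 32-bit state words, 20 (or 12, or 8) rounds per block. */
static void quarter_round(uint32_t *a, uint32_t *b,
                          uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```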
| * Problem is, "V" lacks the data management instructions needed for the transpose
| * It has only 'register gather' and 'merge', which are _very_ general-purpose
| * "register gather" can use _any_ lane as input, so it is probably very costly in hardware, with high producer-consumer latency
| * Cost confirmed by discussion with hardware implementer
| * AVX and SVE use specialized instructions with known, fixed inter-lane dependencies
| * Somewhat costly, but not prohibitively so
| * Currently, implemented using SVE semantics (zip, uzp, trn) & an implementation file that emulates SVE’s behavior using the more general instructions
| * Quick and dirty but illustrates the point
| * The current EPI LLVM compiler for V is good at CSE and hoisting, so the overhead is a good approximation of 'better' code
| * Each emulated SVE instruction required at least 2 gather & 1 merge to be implemented in "V", plus support instructions to generate the indices for gather / the mask for merge
| * Many of the support instructions can be/are eliminated by the compiler’s CSE & hoisting passes
| * That's a significant overhead at the end of the algorithm
| * And is worse for the round-reduced versions Chacha12 and Chacha8 (fewer computations for the same amount of data)
| * Adding some fixed-pattern data management instruction would remove this overhead
| 1. Fewer instructions by removing support instructions
| 2. Fewer instructions by combining two registers in a specific way instead of gather/gather/merge
| 3. Lower-latency instructions from simpler semantics
| * e.g. ZIP1 only uses lower halves of registers as input, so no need to wait for higher halves, unlike “register gather” which can pick from anywhere
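[Editor's note: to make the fixed-pattern point concrete, here is a scalar model of an SVE-style ZIP1. It only ever reads the lower halves of its inputs, so each output lane has a known, local dependency, unlike a general register gather.]

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of SVE-style ZIP1: interleave the lower halves of
 * two vectors. Output lane 2i reads a[i] and lane 2i+1 reads b[i],
 * a fixed pattern that needs no index vector and no mask. A pair
 * of such instructions replaces the gather/gather/merge sequence
 * (plus index/mask setup) required to emulate it in "V". */
static void zip1_model(const uint32_t *a, const uint32_t *b,
                       uint32_t *out, size_t vl)
{
    for (size_t i = 0; i < vl / 2; i++) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}
```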