[tech-vector-ext] Feedback on RISC-V V-extension (FFTW3 and Chacha20)
We have seen very efficient FFT execution without adding new instructions, but there is interest in a layer of DSP extension on top of the base vector ISA. One goal of the EDIV extension would be to support real+imaginary types and possibly operations. I haven't looked at the Chacha20 code, but either crypto extensions or EDIV-based interleaving might help with the interleaving it needs.

Krste

On Tue, 10 Dec 2019 16:16:17 +0100, "Roger Ferrer Ibanez" <roger.ferrer@...> said:

Hi all,

Romain Dolbeau is an HPC expert from Atos and a participant in the EPI project. Romain provided the following feedback as a result of porting FFTW3 and Chacha20 to the V-extension himself. He kindly agreed to share his feedback with the Vext list. If you have comments, try to keep him in the loop when replying, as Romain is not subscribed to this mailing list. I'm also attaching the feedback in the original markdown format.

Thank you very much,
Roger

--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación
http://www.bsc.es/disclaimer

# Introduction #

* EPI is implementing a VPU supporting the RISC-V "V" extension, and doing some associated compiler & software work
* This is feedback on the current version of the "V" extension draft (0.7.1)
* Issues were found while developing algorithms & implementations to test and validate both the software and hardware layers

# In-vector data management issues #

* Two different codes: FFTW3 (complex arithmetic, floating-point) & Chacha20 (cryptographic, integer)
* Some significant overheads due to the lack of in-vector data management instructions in "V"
* Major issue for efficient software

Please note: FFTW3 is about the _implementation_ in the library, not the high-level algorithm.

## Current situation ##

* Like most (all?) SIMD instruction sets, most "V" instructions specify their semantics "per-lane"
* This is fine for most operations on elementary data types (integer, FP)
* There are also reduction operations for common reduction patterns across lanes
* Indexed memory operations (scatter/gather) can take care of many data organization patterns in memory
  * structures, multi-dimensional arrays, indirect access, ...
* However, some algorithms require data in one lane to interact with data in a different lane
  * here we consider two examples: complex arithmetic & a crypto algorithm
* Such patterns are usually quite simple
  * Locally, a lane interacts with an immediate neighbor
  * Overall, regular patterns like (de)interleaving
* "V" can make those work, but at a high cost
  * Need to manufacture index vectors to gather/merge data
* Other SIMD instruction sets have dedicated instructions
* The issue is with porting legacy codes, or management-intensive algorithms
  * High-level views of algorithms, allowing memory reorganisation to suit "V", are mostly fine

## First example, FFTW3 library ##

* FFTW3 implements SIMD as a two-step process
  1. Abstract SIMD code generation for "codelets", using C macros
  2. An "implementation file" containing the SIMD ISA-specific implementation
* This mostly hardwires some assumptions into the implementation, such as interleaved real/imaginary parts in vector registers
* Some macros are trivial (e.g. per-element sum), some less so (full complex multiplication)
* Currently, the "V" implementation has a lot of overhead to manage data
  * Explanations below; the code can be seen in https://github.com/rdolbeau/fftw3 (branch risc-v)
* (a + bi) * (a' + b'i) = (a * a' - b * b') + (a * b' + a' * b)i
* If stored in 'interleaved' format ...
* (v[even] + v[odd]i) * (v'[even] + v'[odd]i) = (v[even] * v'[even] - v[odd] * v'[odd]) + (v[even] * v'[odd] + v'[even] * v[odd])i
  1. Interaction between 'odd' and 'even' lanes
  2. The 'even' lanes' last operation is a subtraction, but the 'odd' lanes' last operation is an addition
* Of course, at the algorithm level, de-interleaving/re-interleaving is possible from memory with Zvlsseg (segment load/store), but not from registers. And for FFTW3, the SIMD infrastructure is not designed to support de-interleaving of data.
## Not-so-readable example "V" code ##

* To implement a full complex multiplication in "V", we need to...
  * Generate the indices for the 'register gather' operations (vid, vbroadcast, logical operations, ...)
  * Gather the data for the first multiplication
  * Do the first multiplication
  * Do an explicit conjugate for the add/sub
  * Gather more data
  * Do the final fmacc
* In the example below, VL is the vector length in _complex_ elements; tx & sr are interleaved (real/imag) vector data
* 'vbroadcast' should be merged into a subsequent operation (currently the EPI compiler doesn't implement scalar/vector built-ins)
* The full code is visible in the GitHub repository referenced earlier

```
static inline V VZMUL(V tx, V sr) // fixme: improve
{
    V tr;
    V ti;
    Vint idx = TYPEINT(vid)(2*VL);                  // (0, 1, 2, 3, ...)
    Vint vnotone = TYPEINT(vbroadcast)(DS(~1ull,~1), 2*VL);
    Vint vone = TYPEINT(vbroadcast)(1, 2*VL);
    Vint hidx;
    hidx = TYPEINT(vand)(idx, vnotone, 2*VL);       // (0, 0, 2, 2, ...)
    tr = TYPE(vrgather)(tx, hidx, 2*VL);            // Real, Real of tx
    hidx = TYPEINT(vor)(idx, vone, 2*VL);           // (1, 1, 3, 3, ...) // could be hidx + vone, ...
    ti = TYPE(vrgather)(tx, hidx, 2*VL);            // Imag, Imag of tx
    tr = TYPE(vfmul)(sr, tr, 2*VL);                 // (Real, Real)[tx] * (Real, Imag)[sr]
    hidx = TYPEINT(vand)(idx, vone, 2*VL);          // (0, 1, 0, 1, ...)
    sr = TYPEMASK(vfsgnjn)(sr, sr, sr, INT2MASK(hidx), 2*VL); // conjugate of sr
    hidx = TYPEINT(vxor)(idx, vone, 2*VL);          // (1, 0, 3, 2, ...)
    sr = TYPE(vrgather)(sr, hidx, 2*VL);            // Imag, Real of (conjugate of) sr
    return TYPE(vfmacc)(tr, ti, sr, 2*VL);          // (-Imag, Real)[sr] * (Imag, Imag)[tx] + (Real, Real)[tx] * (Real, Imag)[sr]
}
```

## Other SIMD instruction sets ##

* Also visible on GitHub
* AVX-512 has 'fmaddsub' to avoid the explicit conjugate (branch master)
  * Some of the data gathering needs only a single instruction (mov[lh]?dup, unpackhi)
  * Only 5 instructions vs. 11/13 for "V"
* SVE has complex arithmetic support (branch arm-sve-alt)
  * The 'CMLA' instruction is designed to be used in pairs, implementing a complex mul
  * With free conjugation for e.g. the VZMULJ macro
  * Only 2 instructions

## Possible solutions ##

1. Problem-specific: add some complex-arithmetic instructions
   * addsub depending on lane parity
   * complex mul similar to CMLA
   * complex mul similar to the BlueGene/Q instructions
   * ...
   * Very efficient for FFTW3, C/C++/Fortran complex types, etc.
   * Might need more than a few opcodes; the hardware might be a bit complex
   * Performance on par with SVE
2. Generic: add some data management operations
   * Similar to AVX-512 unpack/movdup/...
   * Similar to SVE zip/uzp/trn
   * ...
   * and perhaps a cheap 'addsub' as well
   * Less efficient for complex arithmetic, but also useful for other problems
   * Chacha20 further down in this document
   * Performance on par with AVX-512
3. Not mutually exclusive :-)

# Second example, Chacha20 #

* Chacha20 is a 'stream' crypto algorithm
  * https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant
  * Supercop infrastructure with many implementations in `crypto_stream/chacha20`: https://bench.cr.yp.to/supercop.html
  * "V" source code for the supercop infrastructure: http://www.dolbeau.name/dolbeau/crypto/crypto_stream_chacha20_dolbeau_riscv-v.tgz
* Essentially, a pseudo-random function
  * From (a) a secret key, (b) an input value ('nonce') and (c) a counter, generate a block of pseudo-random bits
* Purely integer operations
  * Add, Xor, Rotation - rotation is _very_ useful in crypto, missing in "V" (... as in most SIMD ISAs other than AVX-512)
* Embarrassingly parallel across blocks
  * The counter grows monotonically, so any number of blocks can be generated simultaneously
  * But the amount of data really needed (i.e. the truly available parallelism) is use-case/application-dependent
* Vector implementations tend to distribute a block over a single lane in multiple vector registers
  * 4 blocks in 128 bits, 16 in 512 bits
* But the data has to be stored contiguously in memory at the end
* Effectively, a 'transpose' is needed before storing the data
* Examples
  * AVX2 doesn't support scatter, so partial-vector (128-bit) load/xor/store is used at the final step, after a partial transpose using unpack{hi,lo}_epi{32,64}
  * AVX512 can scatter 128-bit blocks to memory after a partial transpose, or do a full transpose and store dense 512-bit vectors
  * SVE doesn't specify the register width; the implementation currently sticks to a partial transpose and 128-bit scatters
* RISC-V "V" can implement the algorithm for variable-width registers in a way similar to SVE
* Because "V" allows software to _specify_ the vector width, it can adapt better to the available level of parallelism
  * SVE requires masking if the amount of data required is smaller than what a full vector would produce - not implemented (yet)
  * "V" can simply specify a narrower vector width
  * Both could specialize for e.g. 512 bits with a full transpose - not implemented; the optimization might be hardware-dependent
* The problem is, "V" lacks the data management instructions for the transpose
* It has only 'register gather' and 'merge', which are _very_ general-purpose
* "register gather" can use _any_ lane as input, so it is probably very costly in hardware, with high producer-consumer latency
  * Cost confirmed by discussion with a hardware implementer
* AVX and SVE use specialized instructions with known, fixed inter-lane dependencies
  * Somewhat costly, but not prohibitively so
* Currently implemented using SVE semantics (zip, uzp, trn) & an implementation file that emulates SVE's behavior using the more general "V" instructions
  * Quick and dirty, but it illustrates the point
  * The current EPI LLVM compiler for "V" is good at CSE and hoisting, so the overhead is a good approximation of 'better' code
* Each emulated SVE instruction required at least 2 gathers & 1 merge to implement in "V", plus support instructions to generate the indices for the gathers / the mask for the merge
  * Many of the support instructions can be/are eliminated by the compiler's CSE & hoisting passes
* That's a significant overhead at the end of the algorithm
  * And it is worse for the round-reduced versions Chacha12 and Chacha8 (fewer computations for the same amount of data)
* Adding some fixed-pattern data management instructions would remove this overhead
  1. Fewer instructions by removing support instructions
  2. Fewer instructions by combining two registers in a specific way instead of gather/gather/merge
  3. Lower-latency instructions from simpler semantics
     * e.g. ZIP1 uses only the lower halves of its input registers, so there is no need to wait for the higher halves, unlike "register gather", which can pick from anywhere