[tech-vector-ext] Feedback on RISC-V V extension (FFTW3 and Chacha20)
Krste Asanovic
We have seen very efficient FFT execution without adding new instructions, but there is interest in a layer of DSP extensions on top of the base vector ISA. One goal of the EDIV extension would be to support real+imaginary types, and possibly operations on them. I haven't looked at the Chacha20 code, but either crypto extensions or EDIV-based interleaving might help with the interleaving it needs.

Krste

On Tue, 10 Dec 2019 16:16:17 +0100, "Roger Ferrer Ibanez" <roger.ferrer@...> said:

Hi all,

Romain Dolbeau is an HPC expert from Atos and participates in the EPI project. Romain provided the following feedback as the result of himself porting FFTW3 and Chacha20 to the V extension. He kindly agreed to share his feedback with the V-ext list.

If you have comments, try to keep him in the loop when replying to the email, as Romain is not subscribed to this mailing list. I'm also attaching the feedback in the original markdown format.

Thank you very much,

Roger

Roger Ferrer Ibáñez
roger.ferrer@...
Barcelona Supercomputing Center
Centro Nacional de Supercomputación

# Introduction #

* EPI is implementing a VPU supporting the RISC-V "V" extension, and doing some associated compiler & software work
* This is some feedback on the current version of the "V" extension draft (0.7.1)
* Issues found while developing some algorithms & implementations to test and validate both software and hardware layers

# In-vector data management issues #

* Two different codes: FFTW3 (complex arithmetic, floating-point) & Chacha20 (cryptographic, integer)
* Some significant overheads due to the lack of in-vector data management instructions in "V"
* Major issue for efficient software

Please note: FFTW3 is about the _implementation_ in the library, not the high-level algorithm.

## Current situation ##

* Like most (all?) SIMD instruction sets, most "V" instructions specify semantics "per lane"
* This is fine for most operations on elementary data types (integer, FP)
* There are also some reduction operations for common reduction patterns across lanes
* Indexed memory operations (scatter/gather) can take care of many data organization patterns in memory
    * structures, multidimensional arrays, indirect access, ...
* However, some algorithms require data in one lane to interact with data in a different lane
    * here we consider two examples: complex arithmetic & a crypto algorithm
* Such patterns are usually quite simple
    * Locally, a lane interacts with an immediate neighbour
    * Overall, regular patterns like (de)interleaving
* "V" can make those work, but at a high cost
    * Need to manufacture an indices vector to gather/merge data
* Other SIMD instruction sets have dedicated instructions
* The issue is with porting legacy codes, or management-intensive algorithms
    * High-level views of algorithms, allowing memory reorganisation to suit "V", are mostly fine

## First example, the FFTW3 library ##

* FFTW3 implements SIMD as a two-step process
    1. Abstract SIMD code generation for "codelets", using C macros
    2. An "implementation file" containing the SIMD ISA-specific implementation
* This mostly hardwires some assumptions in the implementation, such as interleaved real/imaginary parts in vector registers
* Some macros are trivial (e.g. per-element sum), some less so (full complex multiplication)
* Currently, the "V" implementation has a lot of overhead to manage data
    * Explanations below; the code can be seen in https://github.com/rdolbeau/fftw3 (branch riscv)
* (a + bi) × (a' + b'i) = (a * a' - b * b') + (a * b' + a' * b)i
* If stored in 'interleaved' format ...
* (v[even] + v[odd]i) * (v'[even] + v'[odd]i) = (v[even] * v'[even] - v[odd] * v'[odd]) + (v[even] * v'[odd] + v'[even] * v[odd])i
    1. Interaction between 'odd' and 'even' lanes
    2. The 'even' lanes' last operation is a _subtraction_, but the 'odd' lanes' last operation is an _addition_
* Of course, at the algorithm level, de-interleaving/re-interleaving is possible from memory with Zvlsseg (segment load/store), but not from registers. And for FFTW3, the SIMD infrastructure is not designed to support de-interleaving of data.
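As a plain-C reference for the dataflow the vector code has to reproduce, here is a minimal scalar sketch of the interleaved complex multiply (the function name and array layout are ours, for illustration only, not from the FFTW3 sources): each output element reads both members of its pair, the even lane ends in a subtraction, and the odd lane in an addition.

```c
#include <stddef.h>

/* Scalar reference for the interleaved complex multiply discussed above.
 * v and w hold complex numbers as {re0, im0, re1, im1, ...}.  Note the
 * cross-lane interaction: each output lane reads both the even and the
 * odd element of its pair; the even lane ends in a subtraction and the
 * odd lane in an addition. */
static void zmul_interleaved(const double *v, const double *w,
                             double *out, size_t n_complex)
{
    for (size_t i = 0; i < 2 * n_complex; i += 2) {
        double a = v[i], b = v[i + 1];   /* a  + b i  */
        double c = w[i], d = w[i + 1];   /* a' + b' i */
        out[i]     = a * c - b * d;      /* even lane: a*a' - b*b' */
        out[i + 1] = a * d + c * b;      /* odd lane:  a*b' + a'*b */
    }
}
```

A per-lane SIMD ISA cannot express this loop directly, because `out[i]` and `out[i + 1]` each consume both `v[i]` and `v[i + 1]` and the final operation differs by lane parity.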
## Not-so-readable example "V" code ##

* To implement a full complex multiplication in "V", we need to...
    * Generate the indices for the 'register gather' operations (vid, vbroadcast, logical operations, ...)
    * Gather the data for the first multiplication
    * Do the first multiplication
    * Do an explicit conjugate for the add/sub
    * Gather more data
    * Do the final fmacc
* In the example below, VL is the vector length in _complex_ elements; tx & sr are interleaved (real/imag) vector data
* 'vbroadcast' should be merged into a subsequent operation (currently the EPI compiler doesn't implement scalar/vector builtins)
* Full code visible in the GitHub repository referenced earlier

```
static inline V VZMUL(V tx, V sr) // fixme: improve
{
  V tr;
  V ti;
  Vint idx = TYPEINT(vid)(2*VL); // (0, 1, 2, 3, ...)
  Vint vnotone = TYPEINT(vbroadcast)(DS(~1ull,~1), 2*VL);
  Vint vone = TYPEINT(vbroadcast)(1, 2*VL);
  Vint hidx;
  hidx = TYPEINT(vand)(idx, vnotone, 2*VL); // (0, 0, 2, 2, ...)
  tr = TYPE(vrgather)(tx, hidx, 2*VL); // Real, Real of tx
  hidx = TYPEINT(vor)(idx, vone, 2*VL); // (1, 1, 3, 3, ...) // could be hidx + vone, ...
  ti = TYPE(vrgather)(tx, hidx, 2*VL); // Imag, Imag of tx
  tr = TYPE(vfmul)(sr, tr, 2*VL); // (Real, Real)[tx] * (Real, Imag)[sr]
  hidx = TYPEINT(vand)(idx, vone, 2*VL); // (0, 1, 0, 1, ...)
  sr = TYPEMASK(vfsgnjn)(sr, sr, sr, INT2MASK(hidx), 2*VL); // conjugate of sr
  hidx = TYPEINT(vxor)(idx, vone, 2*VL); // (1, 0, 3, 2, ...)
  sr = TYPE(vrgather)(sr, hidx, 2*VL); // Imag, Real of (conjugate of) sr
  return TYPE(vfmacc)(tr, ti, sr, 2*VL); // (Imag, Real)[sr] * (Imag, Imag)[tx] + (Real, Real)[tx] * (Real, Imag)[sr]
}
```

## Other SIMD instruction sets ##

* Also visible on GitHub
* AVX-512 has 'fmaddsub' to avoid the explicit conjugate (branch master)
    * Some data gathering needs only a single instruction (mov[lh]?dup, unpackhi)
    * Only 5 instructions vs.
11/13 for "V"
* SVE has complex arithmetic support (branch armsvealt)
    * 'CMLA' instruction designed to be used in pairs, implementing a complex mul
    * With free conjugation for e.g. the VZMULJ macro
    * Only 2 instructions

## Possible solutions ##

1. Problem-specific: add some complex-arithmetic instructions
    * addsub depending on lane parity
    * complex mul similar to CMLA
    * complex mul similar to the BlueGene/Q instructions
    * ...
    * Very efficient for FFTW3, C/C++/Fortran complex types, etc.
    * Might need more than a few opcodes; the hardware might be a bit complex
    * Performance on par with SVE
2. Generic: add some data management operations
    * Similar to AVX-512 unpack/movdup/...
    * Similar to SVE zip/uzp/trn
    * ...
    * and perhaps a cheap 'addsub' as well
    * Less efficient for complex arithmetic, but also useful for other problems
    * Chacha20, further down in this document
    * Performance on par with AVX-512
3. These are not mutually exclusive :)

# Second example, Chacha20 #

* Chacha20 is a 'stream' crypto algorithm
    * https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant
    * Supercop infrastructure with many implementations in `crypto_stream/chacha20`: https://bench.cr.yp.to/supercop.html
    * "V" source code for the supercop infrastructure: http://www.dolbeau.name/dolbeau/crypto/crypto_stream_chacha20_dolbeau_riscvv.tgz
* Essentially, a pseudo-random function
    * From (a) a secret key, (b) an input value ('nonce'), and (c) a counter, generate a block of pseudo-random bits
* Purely integer operations
    * Add, Xor, Rotation - rotation is _very_ useful in crypto, missing in "V" (... as in most SIMD ISAs other than AVX-512)
* Embarrassingly parallel across blocks
    * The counter grows monotonically, so any number of blocks can be generated simultaneously
    * But the amount of data really needed (i.e.
truly available parallelism) is use-case/application-dependent
* Vector implementations tend to distribute a block over a single lane in multiple vector registers
    * 4 blocks in 128 bits, 16 in 512 bits
* But the data has to be stored contiguously in memory at the end
* Effectively, a 'transpose' is needed before storing the data
* Examples
    * AVX2 doesn't support scatter, so partial-vector (128-bit) load/xor/store operations are used at the final step, after a partial transpose using unpack{hi,lo}_epi{32,64}
    * AVX-512 can scatter 128-bit blocks to memory after a partial transpose, or do a full transpose and store dense 512-bit vectors
    * SVE doesn't specify the register width; the code currently sticks to a partial transpose and 128-bit scatters
* RISC-V "V" can implement the algorithm for variable-width registers in a way similar to SVE
* Because "V" allows the register width to be _specified_, it can adapt better to the level of parallelism
    * SVE requires masking if the amount of data required is smaller than what a full vector would produce – not implemented (yet)
    * "V" can simply specify a narrower vector width
    * Both could specialize for e.g.
512 bits with a full transpose – not implemented; the optimization might be hardware-dependent
* The problem is, "V" lacks the data management instructions for the transpose
* It has only 'register gather' and 'merge', which are _very_ general-purpose
* "register gather" can use _any_ lane as input, so it is probably very costly in hardware, with a high producer-consumer latency
    * Cost confirmed by discussion with a hardware implementer
* AVX and SVE use specialized instructions with known, fixed inter-lane dependencies
    * Somewhat costly, but not prohibitively so
* Currently implemented using SVE semantics (zip, uzp, trn) & an implementation file to emulate SVE's behavior using the more general instructions
    * Quick and dirty, but it illustrates the point
    * The current EPI LLVM compiler for "V" is good at CSE and hoisting, so the overhead is a good approximation of 'better' code
* Each emulated SVE instruction required at least 2 gathers & 1 merge to be implemented in "V", plus support instructions to generate the indices for the gathers / the mask for the merge
    * Many of the support instructions can be/are eliminated by the compiler's CSE & hoisting passes
* That's a significant overhead at the end of the algorithm
    * And it is worse for the round-reduced versions Chacha12 and Chacha8 (fewer computations for the same amount of data)
* Adding some fixed-pattern data management instructions would remove this overhead
    1. Fewer instructions, by removing support instructions
    2. Fewer instructions, by combining two registers in a specific way instead of gather/gather/merge
    3. Lower-latency instructions from simpler semantics
        * e.g. ZIP1 only uses the lower halves of its input registers, so there is no need to wait for the higher halves, unlike "register gather", which can pick from anywhere
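To make the fixed-pattern argument concrete, here is a minimal scalar model of an SVE-style ZIP1 (plain C, with arrays standing in for vector registers; the helper name `zip1_lower` is ours, for illustration only). Its access pattern is fixed and touches only the lower half of each source, whereas emulating the same result with the general "V" primitives takes two register gathers plus a merge and the index/mask setup counted above.

```c
#include <stddef.h>

/* Illustrative scalar model of an SVE-style ZIP1, interleaving the
 * *lower* halves of two vectors (vlen = number of 32-bit lanes).
 * The pattern is fixed and never reads the upper halves, so hardware
 * need not wait for them -- unlike a general register-gather, whose
 * index vector may select any lane of the source. */
static void zip1_lower(const unsigned *a, const unsigned *b,
                       unsigned *out, size_t vlen)
{
    for (size_t i = 0; i < vlen / 2; i++) {
        out[2 * i]     = a[i];   /* even lanes from a's low half */
        out[2 * i + 1] = b[i];   /* odd lanes from b's low half  */
    }
}
```

A dedicated instruction with this semantic replaces a gather/gather/merge triple (plus index generation) with a single, lower-latency operation.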
