updates to vector spec repo on github
I added the change to add a vcsr including FP fields (this came out much cleaner, I think) - please check it over. Another minor change was to fold in the suggestion to move VS to a two-bit-wide gap lower down in *status, leaving the higher, wider field alone. I also bumped the draft date on the doc and added the meeting minutes to the repo. Krste
------------------------------------------------------------------------
Re: RISC-V Vector Task Group: fractional LMUL
David Horner
Special lmul code in vsetvl{i} to derive LMUL from the existing SEW/LMUL ratio and the provided sew code.
As mentioned in the TG, we suggested widening LMUL to 3 bits with 7 explicit states. I suggest we name them l1, l2, l4, l8 as before, and lf2, lf4, lf8 for the fractional 1/2, 1/4, 1/8 respectively. lf16 may be useful in larger machines, but lf8 already cannot be used on minimal RV32 machines. (I recommend we use lf2, lf4, and lf8 rather than f2, f4, f8 to avoid confusion with float and to associate the value with lmul as a fraction.) I propose the remaining value be used to derive LMUL from the existing SEW/LMUL ratio and the provided sew code.
Tentatively, I use lx as the mnemonic for this lmul code. I recommend the immediate variant using lx have the pseudo-opcode vsetlmuli. As a shorthand, I refer to vsetvl that uses the lx code in rs1 as vsetlmul.
Removing the fixed-size load/stores motivates the increased lmul range and increases the need to issue additional vsetvl{i} instructions. This facility complements the increased lmul range and mitigates the need to explicitly specify the correct lmul in those additional vsetlmul{i} instructions. As a result, the first vsetvl{i} instruction establishes the SEW/LMUL ratio and subsequent vsetlmul{i} instructions maintain it (see the sketch below). Notably, values of vl that were legal in the original vsetvl{i} remain legal in subsequent (legal) vsetlmul{i} instructions.
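As a concrete illustration (not part of the proposal text; all names here are invented), a small C model of the ratio-preserving derivation, with LMUL held as a rational so fractional values fall out naturally:

```
#include <stdio.h>

typedef struct { int num, den; } lmul_t;   /* LMUL = num/den */

/* new LMUL = (new SEW / old SEW) * old LMUL keeps SEW/LMUL,
 * and hence VLMAX = LMUL * VLEN / SEW, unchanged. */
static lmul_t derive_lmul(int old_sew, lmul_t old_lmul, int new_sew)
{
    lmul_t r = { old_lmul.num * new_sew, old_lmul.den * old_sew };
    while (r.num % 2 == 0 && r.den % 2 == 0) { r.num /= 2; r.den /= 2; }
    return r;
}

int main(void)
{
    lmul_t l = { 1, 1 };               /* after: vsetvli ..., e8,l1  */
    lmul_t d = derive_lmul(8, l, 32);  /* after: vsetlmuli ..., e32 (lx) */
    printf("derived LMUL = %d/%d\n", d.num, d.den);  /* prints 4/1 */
    return 0;
}
```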
The additional fractional lmul values provide a substantial dynamic range and reduce the possibility of an illegal calculated LMUL. Even so, the calculated LMUL can be out of range for vsetlmul{i}, which will then set vill.
Other proposals have the objective of changing SEW and LMUL while enforcing that vl and/or the SEW/LMUL ratio not change. This proposal complements those recommendations. In particular, it addresses issue #365 (what happens with vsetvl{i} x0, x0, {vtypei|rs2} without keeping VLMAX the same): when rs1 is x0, meaning the current value of vl is to be used, the use of lx ensures that vl, VLMAX, and the SEW/LMUL ratio will, if possible, remain the same, and that vill will be set if not.
To have our lf16 and lx too. As I recommended initially, the lmul code in vsetvl{i} that would otherwise be used for lf16 is used for lx. That means the vtype lmul field will never be set to lf16 by an explicit lmul code. However, we could allow lf16 to be calculated and stored by an instruction using lx, and a read of vtype would then correctly show the lf16 code. Restoring vtype from a value read from vtype becomes a multi-step process:

1. If the lmul code in the saved vtype is not lf16, issue vsetvl with the saved vtype in rs2.
2. If the lmul code is lf16, change the sew code in rs2 to one step larger than the saved one (e.g., use e16 if the saved vtype has e8), set lmul in rs2 to lf8, and issue vsetvl with this revised sew and lmul.
3. Then change the sew code in rs2 back to the original (in this example, from e16 back to e8), change lmul in rs2 to lx, and issue vsetvl again; lmul will now be lf16 with the sew code as saved.

This approach works for all values of sew and lf16 that are valid in vtype. The question is whether lf16 is worth the extra effort, or whether it potentially causes confusion or failures for application code.
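The restore sequence can be sketched in C against hypothetical vtype accessors; every name below (the field helpers, the LMUL codes, the vsetvl wrapper) is invented for illustration and is not spec-defined:

```
/* Hypothetical helpers standing in for vtype field access and the
 * vsetvl instruction.  Illustration only, not spec text. */
enum lmul_code { LMUL_F16, LMUL_F8, LMUL_X /* ...other codes... */ };
extern enum lmul_code get_lmul(unsigned long vtype);
extern unsigned get_sew(unsigned long vtype);            /* in bits */
extern void set_lmul(unsigned long *vtype, enum lmul_code c);
extern void set_sew(unsigned long *vtype, unsigned sew);
extern void vsetvl(unsigned long vl, unsigned long vtype);

void restore_vtype(unsigned long saved_vl, unsigned long saved_vtype)
{
    if (get_lmul(saved_vtype) != LMUL_F16) {
        vsetvl(saved_vl, saved_vtype);        /* step 1: direct restore */
        return;
    }
    unsigned long t = saved_vtype;
    set_sew(&t, get_sew(saved_vtype) * 2);    /* step 2: e8 -> e16, etc. */
    set_lmul(&t, LMUL_F8);                    /* same SEW/LMUL ratio as e8/lf16 */
    vsetvl(saved_vl, t);
    set_sew(&t, get_sew(saved_vtype));        /* step 3: back to saved SEW */
    set_lmul(&t, LMUL_X);                     /* lx then derives lmul = lf16 */
    vsetvl(saved_vl, t);
}
```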
Additional suggestion: this stands alone from the proposal, but complements it, so I submit it here. vsetvl is defined not to trap on illegal vtype codes. It sets vill instead and clears the remainder of vtype, which triggers an illegal-operation fault in subsequent vector instructions that rely on it. This facilitates low-overhead interrogation of the hardware to determine supported SEW sizes and ediv support (and any future functionality in vtype).
Arguably, vsetlmul{i} should trap if the original vtype has vill set, but I recommend more than that. An implementation can trap on any instruction or sub-instruction to emulate it, take corrective action, trigger a context switch, report an anomaly, etc. As a result, an implementation could choose to provide the behaviour I suggest below and still be compliant. However, I recommend we make the trap mandatory in certain cases of vsetvl{i} use. These traps would leave vl and vtype undisturbed, as they were before instruction execution. Specifically, trap if:

1) lmul = lx and the original vtype has vill set.
2) lmul = lx and the calculated lmul is not valid.
3) rd = x0 and rs1 = x0 (use existing vl for rs1) and VLMAX changes (that is, the SEW/LMUL ratio is not maintained; this only occurs with explicit sew/lmul, since with lmul = lx it is either case 1 or 2).
Case 1: without a precise trap, determining the cause of vill requires traceback not only to the last vsetvl{i} but also to at least the one before it. Case 2: likewise, determining the probable values of sew and lmul requires traceback beyond the last vsetvl{i}. Case 3: there is no compelling reason to use rd = x0 and rs1 = x0 to interrogate the hardware, so maintaining VLMAX is the probable intent given that the existing vl is provided; again, without a precise trap, diagnosis requires traceback beyond the last vsetvl{i}.

On 2020-02-07 1:40 a.m., Krste Asanovic wrote:
I'm realizing the idea doesn't quite work unless the machine actually stores the fractional LMUL value (to cope with SLEN and widening instruction behavior), and so it would need the new instruction and an extra bit of state in vtype. But with this change, yes.

For floating-point code, where there is no way to load a narrower type into a wider element, this can be used to increase the number of registers of a wider width. For example, a matrix multiply accumulating double-precision values that are the product of single-precision values, using widening muladds, can be arranged as something like:

    vsetvli x0, x0, e32,f2    # Fractional LMUL
    vle.v v0, (bmatp)         # Get row of matrix
    flw f1, (x15)             # Get scalar
    vfwmacc.vf v1, f1, v0     # One row of outer-product
    add x15, x15, acolstride  # Bump pointer
    ...
    flw f31, (x15)            # Get scalar
    vfwmacc.vf v31, f31, v0   # Last row of outer-product
    ...

which holds 31 rows of the destination matrix accumulators as doubles while performing widening muladds from a single-precision vector load held in v0. This is probably overkill register blocking for this particular example, but it shows the general improvement.

Krste
------------------------------------------------------------------------
Re: RISC-V Vector Task Group: fractional LMUL
David Horner
I left the encoding unspecified in the proposal. That was intentional, as I saw various tradeoffs. However, I now recommend the codes be in order of increasing VLMAX value, as so:

    vlmul  lmul
    000    lx   (the 1/16 slot, used as the special derived code)
    001    lf8
    010    lf4
    011    lf2
    100    l1
    101    l2
    110    l4
    111    l8

This allows a quick "andi and bz" check of the "special" LMUL code in the 1/16 position (lx).
It also allows the formula

    new LMUL = (new SEW / old SEW) * old LMUL

to be implemented as

    vlmul = new vsew - old vsew + old vlmul
The first column could be inverted to allow the original code values with no appreciable hardware cost, but it complicates software for the two cases above.
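A minimal sketch of that implementation, assuming the ordered encoding in the table above, i.e. vlmul = log2(LMUL) + 4 (lf8=1 ... l8=7, lx=0) and vsew = log2(SEW/8); these encodings are this sketch's assumption, not settled spec values:

```
#include <assert.h>

static int derive_vlmul(int old_vsew, int old_vlmul, int new_vsew)
{
    int v = new_vsew - old_vsew + old_vlmul;  /* log2-domain ratio rule */
    assert(v >= 1 && v <= 7);                 /* out of range: set vill */
    return v;
}
/* e.g. old e8 (vsew=0), l1 (vlmul=4); new e32 (vsew=2): derives 6 == l4,
 * matching new LMUL = (32/8) * 1 = 4. */
```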
On 2020-02-09 12:44 a.m., David Horner
via Lists.Riscv.Org wrote:
Special lmul code in vsetvl{i} to derive LMUL from existing SEW/LMUL and provided sew code.
------------------------------------------------------------------------
EPI intrinsics reference and compiler updated to 0.8
Hi all, we have updated our LLVM-based experimental compiler and the intrinsics reference to version 0.8 of the V-extension.

In this version we have integrated the assembler/disassembler patch from Hsiangkai Wang [1] to replace our own implementation of this part of the compiler. Our goal is to stay aligned as much as possible with LLVM upstream. On top of that we have built our existing code generation strategy.

You can find a recent source drop of the compiler at https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi-0.8

Kind regards,

[1] https://reviews.llvm.org/D69987

--
Roger Ferrer Ibáñez - roger.ferrer@...
Barcelona Supercomputing Center - Centro Nacional de Supercomputación
------------------------------------------------------------------------
Minutes from 2/21 meeting
Date: 2020/2/21
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~13
Current issues on github: https://github.com/riscv/riscv-v-spec

Note, the Zoom meeting details have changed. Please view the member calendar entry for updated details.

Issues discussed: #341/#377, #367, #362/#354, #381

#341/#377 Separate vcsr from fcsr
Last time, the agreement was to add a new vcsr but to shadow fcsr fields in the new vcsr to reduce context-swap time. Due to concerns raised in #377, the agreement is now to keep the separate vcsr but not to shadow fcsr fields. This change will be made to the spec for v0.9.

#367 Tail agnostic as option
Extensive discussion continued in the group about allowing this as an option. The group proposing this change will describe the implementation challenges of the current tail-undisturbed scheme in the context of a renamed vector register machine with long temporal vectors.

#362/#354 Remove non-SEW vector load/store
Discussion covered the new proposal to provide a fractional LMUL to avoid one drawback of dropping these memory operations. The general consensus was in favor, as the scheme also aids mixed-width floating-point arithmetic (for which there is no equivalent to the widening/narrowing load/stores). The proposal requires adding a new bit to LMUL in vtype, so that a new instruction will not be required. The detailed fractional LMUL proposal was only recently presented, so the group will continue the discussion online before the next meeting.

#381 Index size in indexed load/store and AMO
The group discussed the issue of providing XLEN-sized indices in indexed load/stores and AMOs if non-SEW vector load/stores are dropped. This was also previously a concern for floating-point codes, as there were no fixed-precision floating-point loads/stores. The two proposals were 1) to provide only XLEN-width indices, or 2) to provide XLEN and 32b indices, and whether these are only present in the extended 64b encoding. Discussion to continue.
------------------------------------------------------------------------
Reminder vector TG meeting Friday March 6th, details on members calendar <eom>
------------------------------------------------------------------------
Minutes of 2020/3/6 vector task group meeting
Date: 2020/3/6
Task Group: Vector Extension
Chair: Krste Asanovic
Number of Attendees: ~21
Current issues on github: https://github.com/riscv/riscv-v-spec

Note, the Zoom meeting details have changed. Please view the member calendar entry for updated details.

Issues discussed: #362/#354/#382, #367

#362/#354/#382 Fractional LMUL additional registers
There was considerable discussion trying to understand the design of the scheme to pack multiple fractional LMUL registers into the base vector registers. The conclusion was that we first have to understand how fractional LMUL would map into the base vector registers, including the interaction with SLEN, before attempting to add fractional LMUL register packing as another optimization with another level of complexity. Waiting on receiving the fractional LMUL mapping.

#367 Tail Agnostic
The discussion reviewed the proposal that long temporal vector registers with renaming could be handled using vector-length trimming. The proposal was then extended so that masking should also be given the option of being agnostic, giving three options:

1) tail-undisturbed + masking-undisturbed
2) tail-agnostic + masking-undisturbed
3) tail-agnostic + masking-agnostic

All implementations would have to support all options. Simple implementations can ignore the setting and treat all as 1). Option 2) could be implemented as option 1) even if option 3) was supported. The recommendation was that agnostic destination elements had to be either zero or undisturbed. Discussion to continue with this suggestion.
------------------------------------------------------------------------
Re: Minutes of 2020/3/6 vector task group meeting
Bill Huffman
On 3/6/20 6:22 PM, Krste Asanovic wrote:
We've discussed before, but allowing the *-agnostic options means code can be written and tested on an implementation that supports them and then fail on an implementation that maps them to #1. And vice-versa. Bill
------------------------------------------------------------------------
Re: Minutes of 2020/3/6 vector task group meeting
On Sat, 7 Mar 2020 02:32:24 +0000, Bill Huffman <huffman@...> said:

| On 3/6/20 6:22 PM, Krste Asanovic wrote:
|| [...]
|| #367 Tail Agnostic
|| The discussion reviewed the proposal that long temporal vector
|| registers with renaming could be handled using vector length trimming.
||
|| The proposal was then added that masking should also be given option
|| of being agnostic giving three options:
|| 1) tail-undisturbed + masking-undisturbed
|| 2) tail-agnostic + masking-undisturbed
|| 3) tail-agnostic + masking-agnostic
|
| We've discussed before, but allowing the *-agnostic options means code
| can be written and tested on an implementation that supports them and
| then fail on an implementation that maps them to #1. And vice-versa.

Yes, this is the disadvantage of having implementation-dependent options, along with making portable functional/verification/compliance models more complex (though this case is not as bad as some others). This has to be balanced against the advantage of greater efficiency on a wider range of microarchitectures. Unordered vector reductions and vector stores already cause issues of this kind, as does the memory model for multithreaded code, but we obviously want to reduce the number of cases where this shows up.

Krste
------------------------------------------------------------------------
[tech-vector-ext] Some proposals
It doesn't look like all these issues were addressed either on the mailing list or on the github tracker. In general, it is much better to split feedback into individual topics. I'm responding all-in-one below, but if you want to continue any of these threads, please add them to the github tracker as independent issues. Several of these issues might be addressed with the anticipated longer 64-bit vector instructions (ILEN=64).

On Fri, 22 Nov 2019 16:17:33 +0100, Dr. Adrián Cristal <adrian.cristal@...> said:

| Dear all,
| We have been involved in the RTL design of an accelerator featuring the 0.7.1 standard; in parallel, we were running some vectorized HPC kernels on a simulator platform which mimics the accelerator. We have come across the following:
|
| 1. With respect to the new proposed spill instructions, the cost could be high in case we need to spill a few elements of a large vector. Instead we propose the following: spill v1, rs1, [rs2]; recoverspill v2, rs2; and unspill rs1.
| The semantics are the following: spill v1, rs1, [rs2] will store v1 at the address rs1, up to rs2 elements. There is no guarantee that the values are stored, but there is the guarantee that if they are not stored they will be completely recovered by recoverspill v2, rs2 (otherwise recoverspill v2, rs2 will read the values from memory if they were at some time written to memory). unspill rs1 will disable the recoverspill capability at address rs1. In the case of an OoO processor, spill can delay the freeing of the vector physical register and reassign it to the logical register on recoverspill. So the cost will be much less, but the register can still be saved out if the processor needs more physical registers. In-order implementations will save the registers to memory.

This seems like a very difficult-to-implement optimization, with strange effects on the memory model and page-fault/trap handling if the memory operation is deferred to later in program execution. What happens on context switches if the physical register is pushed out in a different context? Having more architectural vector registers should reduce spill frequency, and this is one part of the 64-bit instruction proposal.

| 2. At this moment the mask is only for true values, so if we have an if-then-else structure, first we have to do the "if", then we have to negate the mask (in our case an expensive operation, up to 32 cycles), and then execute the "else".

This is functionality intended for the 64-bit encoding. You should be able to chain and/or fuse this mask negation and run it in parallel in the mask unit with the 32-bit encoding.

| It would be great to add a CSR which controls the functionality of the masked operations; it could be beneficial in many use cases. Also, using the same mechanism, it may be possible to add other mask-related functionality that can simplify some algorithms. This functionality could be to set masked-off elements to 0 or a constant, or sign exchange, set positive, or set negative. Another option is to determine which bit is the mask bit of register v0 (when LMUL=1); it can be convenient to set it to the sign bit instead of the LSB, for example if we want to calculate the absolute value or the RELU operation.

FP ABS (vfsgnjx.vv vd, vs2, vs2) and integer/FP RELU (vmax.vx vd, vs2, x0; vfmax.vf vd, vs2, f0 with f0=0.0) are already directly supported.

| 3. A vrscatter instruction could be useful (vdest[vsrc2[i]] = vsrc1[i]). The implementation cost could be less than vrgather, since in the vrgather case it is needed to send the index and then the value, whereas for vrscatter both index and value could be sent together; for our implementation this is faster than vrgather.

I don't think vrscatter can replace vrgather, except for strict permutes; vrscatter I believe is strictly less powerful. The data movement from a single vrscatter can be replaced by a single vrgather, but the converse is not true. The vrscatter instruction might be useful in some cases, but it has complex cases to consider in specification: a) multiple writes to the same element, ordered either by element ordering (which will be expensive with multiple lanes) or by allowing unordered scatters; b) no writes to some elements, independent of masking.

| 4. If a register is defined with a VL, then the values after VL (between VL+1 and VLMAX) could be set to "undefined" or 0; it would benefit implementations with longer vectors considerably in an OoO processor. The current text states that the previous values should be preserved.

Ongoing discussion.

| 5. Proposal to add some semantic information to the vectors. In particular it is important to know the VL and the SEW at the time the vector was created, so when the context is saved we can have a special load/store operation that will only use the minimal storage that is needed in memory (plus the length and sew). Also this allows the system to know that after the VL, we do not have to preserve any element.

I had a brief discussion of some save/restore optimization techniques in Section 5.1.5 of my thesis, and VS bits are at least a start there to reduce the amount of save/restore, if not the size of the memory buffer (though most OSes will need memory space for the full hart register context anyway). Some form of microarchitectural tracking can be done, observing the undisturbed (VL=VLMAX) or agnostic (VL) options of valid length. I'd guess we could have some custom visible state (visible where privilege allows) recording the active vector length in bytes to save/restore. But for these techniques, hardware can't always do the right thing on vector save/restore, so software needs to know that it's worth interrogating the custom state to figure out what to do. And this could penalize simple implementations where software is OK doing a naive save/restore and doesn't want to spend cycles checking.

Regards,
Krste

| Best regards, Adrian and Osman
------------------------------------------------------------------------
[tech-vector-ext] Feedback on RISC-V V-extension (FFTW3 and Chacha20)
We have seen very efficient FFT execution without adding new instructions, but there is interest in a layer of DSP extension on top of the base vector ISA. One goal of the EDIV extension would be to support real+imaginary types and possibly operations. I haven't looked at the Chacha20 code, but either crypto extensions or EDIV-based interleaving might help with the interleaving it needs.

Krste

On Tue, 10 Dec 2019 16:16:17 +0100, "Roger Ferrer Ibanez" <roger.ferrer@...> said:

| Hi all,
| Romain Dolbeau is an HPC expert from Atos and participates in the EPI project. Romain provided the following feedback as the result of porting FFTW3 and Chacha20 to the V-extension himself. He kindly agreed to share his feedback with the Vext list.
| If you have comments, try to keep him in the loop when replying, as Romain is not subscribed to this mailing list. I'm also attaching the feedback in the original markdown format.
| Thank you very much,
| Roger
|
| Introduction
| * EPI is implementing a VPU supporting the RISC-V "V" extension, and doing some associated compiler & software work
| * This is some feedback on the current version of the "V" extension draft (0.7.1)
| * Issues found while developing some algorithms & implementations to test and validate both software and hardware layers
|
| In-vector data management issues
| * Two different codes: FFTW3 (complex arithmetic, floating-point) & Chacha20 (cryptographic, integer)
| * Some significant overheads due to lack of in-vector data management instructions in "V"
| * Major issue for efficient software
| * Please note: FFTW3 is about the implementation in the library, not the high-level algorithm.
|
| Current situation
| * Like most (all?) SIMD instruction sets, most "V" instructions specify semantics "per-lane"
| * This is fine for most operations on elementary data types (integer, FP)
| * There are also some reduction operations for common reduction patterns across lanes
| * Indexed memory operations (scatter/gather) can take care of many data organization patterns in memory
|   + structures, multi-dimensional arrays, indirect access, ...
| * However, some algorithms require data in one lane to interact with data in a different lane
|   + here we consider two examples: complex arithmetic & a crypto algorithm
| * Such patterns are usually quite simple
|   + Locally, a lane with an immediate neighbor
|   + Overall, regular patterns like (de)interleaving
| * "V" can make those work, but at a high cost
|   + Need to manufacture index vectors to gather/merge data
| * Other SIMD instruction sets have dedicated instructions
| * The issue is with porting legacy codes, or management-intensive algorithms
|   + High-level views of algorithms, allowing for memory reorganisation to suit "V", are mostly fine
|
| First example, FFTW3 library
| * FFTW3 implements SIMD as a two-step process
|   1. Abstract SIMD code generation for "codelets", using C macros
|   2. An "implementation file" containing the SIMD ISA-specific implementation
| * This mostly hardwires some assumptions in the implementation, such as interleaved real/imaginary parts in vector registers
| * Some macros are trivial (e.g. per-element sum), some less so (full complex multiplication)
| * Currently, the "V" implementation has a lot of overhead to manage data
|   + Explanations below; code can be seen in https://github.com/rdolbeau/fftw3 (branch risc-v)
| * (a + bi) × (a' + b'i) = (a*a' - b*b') + (a*b' + a'*b)i
| * If stored in 'interleaved' format ...
| * (v[even] + v[odd]i) * (v'[even] + v'[odd]i) = (v[even]*v'[even] - v[odd]*v'[odd]) + (v[even]*v'[odd] + v'[even]*v[odd])i
|   1. Interaction between 'odd' and 'even' lanes
|   2. The 'even' lanes' last operation is 'subtraction', but the 'odd' lanes' last operation is 'addition'
| * Of course, at the algorithm level, de-interleaving/re-interleaving is possible from memory with Zvlsseg (segment load/store), but not from registers. And for FFTW3, the SIMD infrastructure is not designed to support de-interleaving of data.
|
| Not-so-readable example "V" code
| * To implement a full complex multiplication in "V", we need to...
|   + Generate the indices for the 'register gather' operations (vid, vbroadcast, logical operations, ...)
|   + Gather the data for the first multiplication
|   + Do the first multiplication
|   + Do an explicit conjugate for the add/sub
|   + Gather more data
|   + Do the final fmacc
|   + In the example below, VL is the vector length in complex elements; tx & sr are interleaved (real/imag) vector data
| * 'vbroadcast' should be merged into a subsequent operation (currently the EPI compiler doesn't implement scalar/vector built-ins)
| * Full code visible on the GitHub referenced earlier
|
| static inline V VZMUL(V tx, V sr) // fixme: improve
| {
|   V tr;
|   V ti;
|   Vint idx = TYPEINT(vid)(2*VL); // (0, 1, 2, 3, ...)
|   Vint vnotone = TYPEINT(vbroadcast)(DS(~1ull,~1), 2*VL);
|   Vint vone = TYPEINT(vbroadcast)(1, 2*VL);
|   Vint hidx;
|   hidx = TYPEINT(vand)(idx, vnotone, 2*VL); // (0, 0, 2, 2, ...)
|   tr = TYPE(vrgather)(tx, hidx, 2*VL); // Real, Real of tx
|   hidx = TYPEINT(vor)(idx, vone, 2*VL); // (1, 1, 3, 3, ...) // could be hidx + vone, ...
|   ti = TYPE(vrgather)(tx, hidx, 2*VL); // Imag, Imag of tx
|   tr = TYPE(vfmul)(sr, tr, 2*VL); // (Real, Real)[tx] * (Real, Imag)[sr]
|   hidx = TYPEINT(vand)(idx, vone, 2*VL); // (0, 1, 0, 1, ...)
|   sr = TYPEMASK(vfsgnjn)(sr, sr, sr, INT2MASK(hidx), 2*VL); // conjugate of sr
|   hidx = TYPEINT(vxor)(idx, vone, 2*VL); // (1, 0, 3, 2, ...)
|   sr = TYPE(vrgather)(sr, hidx, 2*VL); // Imag, Real of (conjugate of) sr
|   return TYPE(vfmacc)(tr, ti, sr, 2*VL); // (-Imag, Real)[sr] * (Imag, Imag)[tx] + (Real, Real)[tx] * (Real, Imag)[sr]
| }
|
| Other SIMD instruction sets
| * Also visible in GitHub
| * AVX-512 has 'fmaddsub' to avoid the explicit conjugate (branch master)
|   + Some data gathering only needs a single instruction (mov[lh]?dup, unpackhi)
|   + Only 5 instructions vs. 11/13 for V
| * SVE has complex arithmetic support (branch arm-sve-alt)
|   + The 'CMLA' instruction is designed to be used in pairs, implementing complex mul
|   + With free conjugation for e.g. the VZMULJ macro
|   + Only 2 instructions
|
| Possible solutions
| 1. Problem-specific: add some complex-arithmetic instructions
|   + addsub depending on lane parity
|   + complex mul similar to CMLA
|   + complex mul similar to BlueGene/Q instructions
|   + ...
|   * Very efficient for FFTW3, C/C++/Fortran complex types, etc.
|   * Might need more than a few opcodes; hardware might be a bit complex
|   * Performance on par with SVE
| 2. Generic: add some data management operations
|   + Similar to AVX-512 unpack/movdup/...
|   + Similar to SVE zip/uzp/trn
|   + ...
|   + and perhaps a cheap 'addsub' as well
|   * Less efficient for complex arithmetic, but also useful for other problems
|   * Chacha20 further down in this document
|   * Performance on par with AVX-512
| 3. Not mutually exclusive :-)
|
| Second example, Chacha20
| * Chacha20 is a 'stream' crypto algorithm
|   + https://en.wikipedia.org/wiki/Salsa20#ChaCha_variant
|   + Supercop infrastructure with many implementations in crypto_stream/chacha20: https://bench.cr.yp.to/supercop.html
|   + "V" source code for the supercop infrastructure: http://www.dolbeau.name/dolbeau/crypto/crypto_stream_chacha20_dolbeau_riscv-v.tgz
| * Essentially, a pseudo-random function
|   + From (a) a secret key, (b) an input value ('nonce'), and (c) a counter, generate a block of pseudo-random bits
| * Purely integer operations
|   + Add, Xor, Rotation - rotation is very useful in crypto, missing in V (... as in most SIMD ISAs other than AVX-512)
| * Embarrassingly parallel across blocks
|   + The counter grows monotonically, so any number of blocks can be generated simultaneously
|   + But the amount of data really needed (i.e. truly available parallelism) is use-case/application-dependent
| * Vector implementations tend to distribute a block over a single lane in multiple vector registers
|   + 4 blocks in 128 bits, 16 in 512 bits
| * But the data has to be stored contiguously in memory at the end
|   + Effectively, need to do a 'transpose' before storing data
| * Examples
|   + AVX2 doesn't support scatter, so partial-vector (128-bit) load/xor/store are used at the final step, after a partial transpose using unpack{hi,lo}_epi{32,64}
|   + AVX512 can scatter 128-bit blocks to memory after a partial transpose, or do a full transpose and store dense 512-bit vectors
|   + SVE doesn't specify the register width; currently sticks to partial transpose and 128-bit scatter
| * RISC-V "V" can implement the algorithm for variable-width registers in a way similar to SVE
| * Because "V" allows specifying the register width, it can adapt better to the level of parallelism
|   + SVE requires masking if the amount of data required is smaller than what a full vector would produce - not implemented (yet)
|   + V can simply specify a narrower vector width
|   + Both could specialize for e.g. 512 bits with full transpose - not implemented; the optimization might be hardware-dependent
| * Problem is, "V" lacks the data management instructions for the transpose
| * It has only 'register gather' and 'merge', which are very general-purpose
|   + "register gather" can use any lane as input, so it is probably very costly in hardware with high producer-consumer latency
|   + Cost confirmed by discussion with a hardware implementer
| * AVX and SVE use specialized instructions with known, fixed inter-lane dependencies
|   + Somewhat costly, but not prohibitively so
| * Currently implemented using SVE semantics (zip, uzp, trn) & an implementation file emulating SVE's behavior using the more general instructions
|   + Quick and dirty, but illustrates the point
|   + The current EPI LLVM compiler for V is good at CSE and hoisting, so the overhead is a good approximation of 'better' code
| * Each emulated SVE instruction required at least 2 gathers & 1 merge to implement in "V", plus support instructions to generate the indices for the gathers / the mask for the merge
|   + Many of the support instructions can be/are eliminated by the compiler's CSE & hoisting passes
| * That's a significant overhead at the end of the algorithm
|   + And it is worse for the round-reduced versions Chacha12 and Chacha8 (fewer computations for the same amount of data)
| * Adding some fixed-pattern data management instructions would remove this overhead:
|   1. Fewer instructions by removing support instructions
|   2. Fewer instructions by combining two registers in a specific way instead of gather/gather/merge
|   3. Lower-latency instructions from simpler semantics
|      + e.g. ZIP1 only uses the lower halves of registers as input, so there is no need to wait for the higher halves, unlike "register gather", which can pick from anywhere
|
| --
| Roger Ferrer Ibáñez - roger.ferrer@...
| Barcelona Supercomputing Center - Centro Nacional de Supercomputación
------------------------------------------------------------------------
Re: [tech-vector-ext] Some proposals
Guy Lemieux
On Sun, Mar 8, 2020 at 1:42 AM Krste Asanovic <krste@...> wrote:
On a context switch, the underlying physical registers that hold spill values need to be saved/restored to memory, as well as the associated rs1 values. This means we need extra instructions to iterate through the physical registers, and their associated rs1 values, to save them as part of the context switch. Not impossible, but it is more complex than just adding the 3 proposed instructions. Instead, you could just force the spill to memory of all tracked registers, but then you run into the delayed memory page faults etc. brought up by Krste. Alternatively, at the system level you could have a tightly-coupled memory (TCM) as an addressable scratchpad where vector registers get written during a spill. (This TCM could be used for any purpose.)

| 4. If a register is defined with a VL, then the values after VL (between VL+1 and VLMAX) could be set to "undefined" or 0, it would benefit...

Except that all values after VL do have to be stored on a context switch. Just because a VL was used to modify the head elements of a vector, it doesn't mean the tail elements can be ignored (under the current tail-undisturbed model).

Guy
------------------------------------------------------------------------
issue #393 - Towards a simple fractional LMUL design.
David Horner
I'm sending out to the correct mailing list a copy of the revised issue #393 (link: https://github.com/riscv/riscv-v-spec/issues/393 ). This was requested at the last TG meeting. I believe it is consistent with casual discussions of fractional LMUL, and it is intended to formalize a design. To follow is a consideration of alternate register overlap to improve usability. The issue #393 update adds to the Glossary and notes that mask registers and operations are unchanged from the plan of record.
Towards a simple fractional LMUL design.

Background: Prior to LMUL, an elaborate mapping of register numbers to various element widths under different configuration settings, allowing for polymorphic operations, was proposed. LMUL was introduced pre-v0.5 in Nov 2018 in conjunction with widening operations and SEW widths.
This issue will look at the simplest implementations of fractional LMUL.

Glossary: base-arch registers - the 32 registers addressable when LMUL=1.

Guidance: the simplest extensions to the base retain its fundamental characteristics. The simplest extension of LMUL to "fractional" is one where the observed effects continue predictably. For LMUL >= 1, VLMAX = LMUL * VLEN/SEW. This table exhaustively represents this simplest extension's effect when SEW is unchanged throughout:

    LMUL   1/8           1/4           1/2           1          2            4            8
    VLMAX  VLEN/(8*SEW)  VLEN/(4*SEW)  VLEN/(2*SEW)  VLEN/SEW   2*VLEN/SEW   4*VLEN/SEW   8*VLEN/SEW
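A one-line model of the extended rule, with LMUL carried as a num/den pair so the fractional rows above are exact (a sketch, not spec text):

```
/* VLMAX = LMUL * VLEN / SEW, with LMUL = lmul_num / lmul_den. */
unsigned vlmax(unsigned vlen, unsigned sew,
               unsigned lmul_num, unsigned lmul_den)
{
    return (vlen * lmul_num) / (sew * lmul_den);
}
/* e.g. VLEN=128, SEW=32, LMUL=1/4: vlmax(128, 32, 1, 4) == 1 */
```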
Fractional registers then have diminished capacity, 1/2 to 1/8th of a base-arch register. The simplest mapping of fractional LMUL registers is one-to-one onto (exactly one of) the base-arch registers. The simplest overlay (analogous to the register-group overlay of consecutive base-arch registers) is with element zero overlaying. I call this iteration zero of the simplest fractional LMUL designs.

Note: mask behaviour does not change. Mask operations read and write a base-arch register. Base-arch register zero remains the default mask register.

With this "iteration zero" design, as with LMUL>=1, fractional LMUL "register zero"s are substantially limited in their use. There are some undesirable characteristics of this design.
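To make the iteration-zero mapping concrete, a minimal sketch assuming the one-to-one layout described above (the helper and its names are invented for illustration):

```
/* Fractional register v maps onto base-arch register v, occupying its
 * low-order bytes starting at element zero.  LMUL = 1/den. */
typedef struct { unsigned base_reg; unsigned active_bytes; } frac_map_t;

frac_map_t frac_register(unsigned v, unsigned vlen_bytes, unsigned den)
{
    frac_map_t m = { v, vlen_bytes / den };  /* 1/2 .. 1/8 capacity */
    return m;
}
```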
------------------------------------------------------------------------
A couple of questions about the vector spec
I am developing sparse matrix codes using the vector extension on RISCV32 with the SPIKE simulator. Based on my understanding of the spec thus far, I wanted to ask a couple of questions about it. I hope that this is the correct group to post such queries to.
1. Vector reductions (such as the vector single-width integer reduction instructions) write their result to vd[0]. This commits vd as a destination and makes it hard to use the other elements of vd (vd[1], vd[2], ...) unless some shift/mask operations are employed. Given the need to use vector registers efficiently, I was wondering if a variant of these instructions could be defined where the destination is a scalar register. In most configs, a single scalar register should suffice as the destination; in rare cases, a scalar register pair may act as the destination. If the common cases of 8/16/32-bit-SEW reductions could use a scalar destination, that would free up a vector register, which would be very helpful in codes that need to retain as many sub-blocks of data as possible inside registers.

2. Many common sparse matrix formats (such as CSR, CSC, COO, etc.) use metadata in the form of non-zero column (CSR) or row (CSC) indices. However, the actual element address offsets are in terms of element widths. For example, column indices 0, 1, and 2 in a matrix with 32-bit elements correspond to address offsets 0, 4, and 8 bytes. Thus, the code requires a scaling instruction to convert the indices to address offsets, and this instruction has to run inside the innermost loops. One way to avoid a separate scale instruction is to embed the common cases of shifting left by 0/1/2/3 inside the vector load instruction itself (the vector load that loads the indices from memory into a vector). With this, the vector load would load the indices AND perform the scaling (1B/2B/4B/8B left shift of each loaded element). That way, the vector register would directly contain address offsets after loading, and the code would not need a separate scaling instruction. I have not looked at the full instruction format details to see how a 2-bit shift field could be incorporated, but perhaps some of the reserved lumop field values could encode a shift?

Best Regards,
Nagendra
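A scalar C sketch of the CSR inner loop in question 2, to show where the scaling lives (illustrative code, not from the original message):

```
/* One CSR row dot-product.  col_idx holds element indices; the scale
 * to a byte offset (here, implicitly * sizeof(double)) is free in C,
 * but becomes an explicit shift on the loaded index vector when
 * vectorized with byte-offset indexed loads. */
double csr_row_dot(const double *val, const int *col_idx,
                   int row_start, int row_end, const double *x)
{
    double dot = 0.0;
    for (int j = row_start; j < row_end; j++)
        dot += val[j] * x[col_idx[j]];
    return dot;
}
```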
------------------------------------------------------------------------
Re: A couple of questions about the vector spec
Guy Lemieux
1. A vector register is deliberately used as the destination of reductions. If the destination were a scalar register, then tight coupling between the vector and scalar units would be necessary, and concurrency would be reduced (because the scalar unit might have to stall until the vector reduction is completed).

2. Yes, vector-indexed-load instructions such as vlxe.v currently treat offsets as byte offsets. I could see this issue being debated, but it would require a shift by (0,1,2,3,4) for up to 64-bit SEWs. If there is a way you can use vrgather.vv instead, it uses element-size offsets.

Guy
------------------------------------------------------------------------
Re: A couple of questions about the vector spec
Andrew Waterman
On Tue, Mar 10, 2020 at 3:07 PM Guy Lemieux <glemieux@...> wrote:

| 1. A vector register is deliberately used as the destination of reductions. [...]

Indices must be able to represent general pointers in some cases (e.g. vectorizing (*(x[i]))++ instead of y[x[i]]++), so implicitly scaling the indices causes problems, particularly when misaligned addresses are legal. The 64-bit instruction encoding could offer another variant of these instructions that scales indices by SEW/8. In the meantime, I don't think the extra shift is too much to ask.
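The two cases can be contrasted in plain C (illustrative only):

```
/* Gather-increment via element indices: the offsets could in
 * principle be scaled by the element size. */
void bump_by_index(int *y, const long *idx, int n)
{
    for (int i = 0; i < n; i++)
        y[idx[i]]++;
}

/* Gather-increment via stored pointers: the "index" vector already
 * holds byte-exact addresses, so implicit scaling would be wrong. */
void bump_by_pointer(int *const *p, int n)
{
    for (int i = 0; i < n; i++)
        (*p[i])++;
}
```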
------------------------------------------------------------------------
Re: A couple of questions about the vector spec
I am not sure I replied correctly (I hit reply-to-sender but am not sure who I responded to; still learning to work with the list). But thanks for the quick replies.
1. Yes, understood about the scalar destination complication. I will have to figure out a better use of vd[1], vd[2], etc.

2. vrgather (per the 0.8 spec) does vd[i] = vs2[vs1[i]] -- I am not sure how this fixes the conversion from index to address offset. I need the bit shift to happen somewhere in the code. I am not suggesting implicit scaling but a programmer-specified scaling amount in the instruction (0/1/2/3-bit shift). Based on knowledge of the matrix element data type, the programmer can certainly specify a shift amount. Note that in my sparse matrix-vector multiply code, the innermost loop is 9 instructions including the scaling instruction. Removing it would reduce the dynamic instruction count by about 10%, which seems a valuable saving.

Best Regards,
Nagendra
------------------------------------------------------------------------
Re: A couple of questions about the vector spec
Guy Lemieux
| 1. Yes, understood about the scalar destination complication. I have to figure a better use of vd[1], vd[2] etc.

Possibly vslide1up after every reduction, producing a vector of reductions (possibly in backwards order, unless you rearrange your outer loop order).

| 2. The vrgather (per 0.8 spec) does vd[i] = vs2[vs1[i]] -- I am not sure how this fixes the conversion from index to address offset. I need the bit shift to happen somewhere in the code.

I'm not suggesting that you use vrgather to convert indices to byte offsets. I'm wondering if there is a way to handle sparse rows/columns entirely differently that uses vrgather instead of vlx (note: I have no idea if it's possible, as I've never tried to implement sparse matrix code). However, vlx and vrgather are very similar (one applies to memory byte addresses, the other to vector elements, so obviously there is some difference).

| I am not suggesting implicit scaling but programmer specified scaling amount in the instruction (0/1/2/3 bit shift). Based on knowledge of the matrix element data type, the programmer can certainly specify a shift amount.

You are overthinking this. Well-designed vector units may be able to fuse/chain a vsll.vv instruction with a vlx instruction. You shouldn't think one instruction must run to completion before the next one starts.

| Note that in my sparse matrix vector multiply code, the innermost loop is 9 instructions including the scaling instruction. If this were removed, it reduces dynamic instruction count by about 10%. It seems to be a valuable saving.

Yes, it would save instruction issue bandwidth. On simple vector implementations, it may speed things up. On complex ones, it shouldn't make a difference, as this will be an anticipated, frequently occurring pattern.

Guy
------------------------------------------------------------------------
Re: A couple of questions about the vector spec
Thank you for the vslide1up suggestion. I might use this with loop unrolling.
Best Regards Nagendra
------------------------------------------------------------------------
64-bit instruction encoding wish list
Andrew Waterman
Guy pointed out to me that, since several V ISA-design issues have been punted to an eventual 64-bit instruction encoding, we should consider recording them somewhere. I've set up the github wiki for the purpose of recording design rationale that doesn't belong in the spec proper, and have seeded it with a very short list of hypothetical 64b features. Feel free to edit directly if you have write permissions.

https://github.com/riscv/riscv-v-spec/wiki
------------------------------------------------------------------------