Re: Mask Register Value Mapping
David Horner
On Wed, Sep 23, 2020, 15:10 CDS, <cohen.steed@...> wrote:
Or, my preference: a similar annotation that explicitly identifies it as a mask bit, e.g. vs2[i] + vs1[i] + v0[i].m, or similar.
Re: Mask Register Value Mapping
CDS <cohen.steed@...>
Word of caution: there may be a utility/readability concern if only the ".LSB" text is removed. vs2[i] + vs1[i] + v0[i] can easily mislead the reader: while 'i' has the same value for all three terms, the first two denote a SEW-wide bit field, whereas the final term denotes a single bit. Suggestions: include a reminder that v0[i] entries are a single bit under the opening comment in the code block ("Produce sum with carry."); set a reminder at the bottom of the description section before the code text; or add a comment on the code line such as "# Vector-vector-bit".
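For readers following the notation discussion, here is a minimal C sketch of the per-element semantics being described, with the single-bit nature of the mask term called out in a comment. The function name, the SEW=32 element width, and passing the mask bit as a separate argument are illustrative assumptions, not spec text.

    #include <stdint.h>

    /* Illustrative model of one element of add-with-carry (SEW = 32 chosen
     * arbitrarily). vs2_i and vs1_i are full SEW-wide elements; carry_i is the
     * mask bit for element i taken from v0, and is a single bit (0 or 1), not
     * a SEW-wide field. */
    static inline uint32_t vadc_element(uint32_t vs2_i, uint32_t vs1_i,
                                        unsigned carry_i)
    {
        /* Produce sum with carry. */
        return vs2_i + vs1_i + (carry_i & 1u);
    }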
Re: Mask Register Value Mapping
I believe so: I am not aware of any proposals to reintroduce MLEN. On Tue, Sep 22, 2020 at 12:34 PM CDS <cohen.steed@...> wrote: Thank you Andrew and Nick.
Re: Mask Register Value Mapping
CDS <cohen.steed@...>
Thank you Andrew and Nick.
To avoid having to repeat this question later: is it the intent, moving forward (beyond "this version of the spec", i.e., 0.9 stable), that this will continue to hold in the same form - at least, as of today?
Re: Mask Register Value Mapping
Hi Cohen, I think the "LSB references" are carryovers from pre-0.9 versions, when MLEN > 1 was possible. I can put together a PR to fix this later tonight, unless someone else gets to it sooner. Best, Nick Knight On Tue, Sep 22, 2020 at 12:25 PM CDS <cohen.steed@...> wrote:
Re: Mask Register Value Mapping
andrew@...
It is the case that mask elements are always one bit wide in this version of the spec. Removing the “.LSB” holdovers will improve clarity. On Tue, Sep 22, 2020 at 12:25 PM CDS <cohen.steed@...> wrote:
Mask Register Value Mapping
CDS <cohen.steed@...>
From the 0.9 stable spec, section 5.3.1, the (unnumbered) table refers to vector masking as using an LSB. This suggests, though does not require, that the mask field for each element is wider than one bit. In the same spec, section 4.6.1, each element's mask bit is given an explicit location as a single bit. And yet, for individual operations, the LSB reference is still intact - such as in section 12.4 (Vector Integer Add-with-Carry / Subtract-with-Borrow Instructions).
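For reference, the single-bit mapping of section 4.6.1 amounts to: the mask value for element i is bit i of register v0, regardless of SEW or LMUL (assuming MLEN=1 as confirmed upthread). A minimal C sketch of that location computation, with an invented helper name and a byte-array view of v0 purely for illustration:

    #include <stdint.h>
    #include <stddef.h>

    /* Mask bit for element i lives at bit (i % 8) of byte (i / 8) of v0,
     * i.e., bit i of the mask register. */
    static inline int v0_mask_bit(const uint8_t *v0_bytes, size_t i)
    {
        return (v0_bytes[i / 8] >> (i % 8)) & 1;
    }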
Re: V-ext white paper?
Yes, but only after it heads into ratification.
There are at least two papers: 1) outline the design for people, 2) document the history and development process. Krste
Re: V-ext white paper?
count me in
V-ext white paper?
Hi team,
Do we have a plan to write a V-extension white paper? Is there any interest? I'm thinking along the lines of ARM's SVE paper in IEEE Micro '17. I don't know if this is feasible or appropriate for a RISC-V working group. And I imagine our organizations will write individually about our own implementations. But it might be nice to collaborate on a general paper, post-ratification, presumably. Best, Nick Knight
Vector TG minutes for 2020/9/18
Date: 2020/9/18
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~17
Current issues on github: https://github.com/riscv/riscv-v-spec

#551 Memory orderings scalar-vector
#534/5 Element ordering
-----------------------------------
The following proposal was put forward and was agreeable to the group. Vector unit-stride and constant-stride vector memory accesses (both load and store) would always be unordered by element. Indexed accesses would be supplied in both ordered and unordered forms. Where ordering is required, software has to use an ordered indexed instruction. The existing encoding is retained in mop[1:0], with the previously reserved load mop[1:0] encoding now allocated to the unordered gather. Strided operations are now treated as unordered.

Loads

  mop[1:0]  Old                              New
  0 0       unit-stride (ordered)    VLE     unit-stride (unordered)  VLE
  0 1       reserved                 ---     indexed (unordered)      VLUXEI
  1 0       strided (ordered)        VLSE    strided (unordered)      VLSE
  1 1       indexed (ordered)        VLXEI   indexed (ordered)        VLOXEI

Stores

  mop[1:0]  Old                              New
  0 0       unit-stride (unordered)  VSE     unit-stride (unordered)  VSE
  0 1       indexed (unordered)      VSUXEI  indexed (unordered)      VSUXEI
  1 0       strided (unordered)      VSSE    strided (unordered)      VSSE
  1 1       indexed (ordered)        VSXEI   indexed (ordered)        VSOXEI

(The mnemonics were not discussed at length in the meeting, and a slightly different scheme is given here. The indexed operations carry a "U" or "O" to distinguish unordered from ordered. This is a little less consistent, as one could argue that unordered indexed operations don't need a "U" to match the others, but this approach minimizes disruption to existing software.)

For unordered instructions (mop!=11) there is no guarantee on element access order. For segment loads and stores, the individual element accesses within each segment are unordered with respect to each other. If the accesses are to a strongly ordered IO region, the element accesses can happen in any order.

Stride-0 Optimizations
----------------------
We also discussed stride-0 optimizations. The proposed scheme is that if rs1=x0, an implementation is allowed to perform only one memory read and replicate the value to all destination elements, but may read the location more than once. Similarly, a store might combine one or more elements and do fewer writes to memory. With a zero-valued stride in a register rs1!=x0, i.e., x[rs1]==0, the implementation must perform all accesses (but these will be unordered). It was noted that the compiler must be aware not to convert a known stride of 0 into use of x0 if all memory accesses are required. If it is desired to perform multiple repeated ordered accesses to a non-idempotent memory region (e.g., popping a memory-mapped FIFO), then an ordered gather should be used targeting the single address.

When a segment straddles a PMA boundary, the segment accesses must obey the PMA constraints associated with each constituent element's address for accesses to that element.

Ordered AMOs
------------
The current PoR only has unordered AMOs. It was discussed whether ordered AMOs are desirable, but there were few clear examples where this would be useful, and so these are not planned for v1.0.

Element-Precise Exception Reporting
-----------------------------------
There was discussion around allowing vector stores to have updated some locations in memory corresponding to elements before the element that raises a synchronous exception. The proposal was to allow these additional stores to idempotent memory regions. Stores to non-idempotent memory regions must not occur for elements past an element reporting an exception.

#493/510/532 Opaque vstart
--------------------------
There was brief discussion around opaque vstart. Given the relaxation of memory access ordering, it was felt less critical to support opaque vstart, but it was suggested to allow this in a subset of the base architecture (treating non-opaque vstart in V as an extension). However, it was also noted that opaque vstart alone will rarely be sufficient to support resumable traps, and in general additional mechanism will be required to save and restore microarchitectural state in complex processors with imprecise traps.
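To make the stride-0 distinction concrete, here is a minimal C model of the rule as minuted. It is purely illustrative: the function name, the SEW=32 element type, and the boolean flag standing in for "the stride operand is the x0 register" are assumptions, not spec text.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    /* Illustrative model of a stride-0 vector load under the proposed rule:
     * - If the stride operand is the x0 register, the implementation may read
     *   the location as few as once (or more) and replicate the value.
     * - If the stride value is 0 but held in a non-x0 register, every element
     *   access must be performed, although their order is unspecified. */
    static void stride0_load_model(uint32_t *vd, const volatile uint32_t *addr,
                                   size_t vl, bool stride_reg_is_x0)
    {
        if (stride_reg_is_x0) {
            uint32_t v = *addr;                 /* one read is sufficient */
            for (size_t i = 0; i < vl; i++)
                vd[i] = v;
        } else {
            for (size_t i = 0; i < vl; i++)     /* all vl reads must occur */
                vd[i] = *addr;                  /* (in no particular order) */
        }
    }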
Please check new Google calendar for new vector TG meeting link
Krste
Vector Task Group minutes for 2020/9/4 meeting
Date: 2020/9/4
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~20
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

Spec formatting. The new formatting has been merged in, and the new build flow was discussed.

#551 Memory orderings scalar-vector
#534/5 Element ordering
Discussion continued around memory ordering and determining the correct set of instructions to provide, with discussion to continue on the list.
Vector task group minutes for 2020/8/21
Date: 2020/8/21
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~12
Current issues on github: https://github.com/riscv/riscv-v-spec

Issues discussed:

#501 Ordering of element loads

Discussion was around vector memory access ordering and its interaction with the global memory consistency model. The original proposal for vector memory ordering was that it behaved like a scalar loop over the elements; however, this would imply that a vector load accessing the same address multiple times would have to access that address in element order. This only affects stride-0 loads and indexed gathers. The proposal was to relax indexed gathers to always be unordered, and to use vl=1 if the program required ordering between loop iterations.

Another discussion was on access to ordered and/or non-idempotent memory regions, where initiating memory accesses out-of-order would not give the expected semantics, and where restarting instructions once this was detected could be problematic for implementations without load buffers or renamed registers, and when having to maintain precise-to-the-element exception semantics. One proposal was to weaken precise-to-element semantics and forbid overlap of the source index and destination vectors on gathers.

Implementing debug data watchpoints with unordered accesses is also problematic if watchpoints have to be triggered in element order. A proposal was to allow watchpoints to be reported out of element order.

Currently, strided segment stores where segments overlap are defined to occur in element order. Discussion was around loosening this constraint to allow out-of-order writes, or to trap if the overlap was detected.
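For context on why ordering only matters when addresses repeat, a minimal C rendering of the "scalar loop over the elements" reference semantics for an indexed gather. Names and the SEW=32 element type are illustrative assumptions; real RVV indexed loads use byte offsets, which this sketch omits for brevity.

    #include <stdint.h>
    #include <stddef.h>

    /* Reference 'scalar loop' semantics: elements are loaded in increasing
     * element order. If two indices are equal (the gather hits the same
     * address twice), this reference order fixes which access happens first;
     * the proposal above relaxes that, leaving the order of the individual
     * element accesses unspecified. */
    static void gather_reference(uint32_t *vd, const uint32_t *base,
                                 const uint32_t *vs2_indices, size_t vl)
    {
        for (size_t i = 0; i < vl; i++)
            vd[i] = base[vs2_indices[i]];   /* element order 0, 1, ..., vl-1 */
    }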
Re: poll on vstart management issues #493, #510 and #532
Attached is what we did at Convex, and it worked quite well - well in the context of compiler-generated code for stencils, and for runtime routines like convolution and correlation.
I am not sure this answers the questions you posed.
Hope this helps.
-------------------------
Vector first register - C4600

The vector register set of the C4600 Series CPUs contains an additional vector register called the vector first register (VF). VF specifies the first element of vector register Vi, Vj, or Vk accessed by a vector instruction, provided that the MSB of the corresponding 5-bit register select field of the instruction is set. VF cannot be applied to operations on VM. VF is seven bits in length and may contain a value between 0 and 127. If the value of VF plus the value of VL is greater than 128, the effective value of VL for vector instructions that use VF is 128 minus VF. This effective VL value determines the number of results written to a vector register or VM, or the number of elements stored to memory. If the value of VF plus Sj is greater than 127 in the mov
If Vi or Vj of an instruction specifies the same register as Vk of the instruction, and VF is applied to Vk, and VL is greater than VF, then elements of the shared register may be written (as Vk) before they are read (as Vi or Vj, depending on the hardware implementation). In this case, the result in Vk is architecturally undefined. The instruction merg.x Vi,Vj,Vk has the same behavior if Vi or Vj are the same as Vk.
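A minimal C sketch of the effective-VL arithmetic described above (purely illustrative, directly restating the text):

    /* Convex C4600 vector-first behavior as described above: VF selects the
     * first element, and when VF + VL would run past element 127 the
     * effective VL is clamped to 128 - VF. */
    static unsigned effective_vl(unsigned vf, unsigned vl)
    {
        return (vf + vl > 128) ? 128 - vf : vl;
    }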
Added details for vector TG meeting tomorrow
I believe I added the correct zoom info on the correct new calendar for tomorrow’s vector task group meeting.
Please check and advise if you’re not seeing it, Krste
poll on vstart management issues #493, #510 and #532
David Horner
Ahead of the vector meeting I would like to see if we can address, or at least get direction on, some of the issues flagged for pre-v1.0 resolution.
There are three related flagged issues that all deal with vstart:
#493 - unbind vstart from element index
#510 - add element index to vstart for segment loads and stores
#532 - define vstart as an opaque value (some values will have specific meanings)

These have in common redefining vstart.

#493 proposes that the vstart value not have a one-to-one mapping to the usual vl ordering. This was motivated in part by SLEN considerations that are now invisible to the ISA architecture. However, it also includes the consideration that some operations, e.g. cryptography, may want to restart part of the way through the operation, even if only one element is contained in the vlen*8 register set. Clearly, insufficient internal state is expressed if only zero and one are allowed values.

#510 proposes that the element position within a segmented load/store** is identified in vstart, not just the group position. Most of the discussion relates to which should be identified, element position or group position. POR is group position, and it was substantially defended as the incumbent definition. The POR substantially limits an implementation's options, intended for the greater good. Thus the question of where and how the additional "element within group" information should be stored did not progress. However, even if segmented loads/stores settle on a restart granularity of groups or elements, the larger question of an alternate representation of restart information for "special" circumstances has been raised, as it has in #493.

#532 proposes that vstart be defined as a value that will be treated as opaque at the ISA level. No intrinsic meaning should be inferred directly from the value in vstart. It is assured only to be a value used to instruct the implementation to restart the instruction from either a) the exception element, or b) by adding 1 to the vstart value, the next element. The proposal allows for implementations to provide a mechanism for additional trap information, including the element related to the exception. Escape mechanisms to convey that the index is simply embedded in the rightmost bits of vstart are discussed as a concession to the expectation that a plain-text identification of the element "active" at the time of the exception is valuable information, and sufficient in most cases for restart handling.

My request for the meeting is to poll the group to respond agree, disagree, or abstain on the following:

1) The POR [vstart is the plain-text number indicating the next element at which to resume] is sufficient for v1.0. Any augmenting of vstart or adding new facilities [e.g. a CSR] can be addressed later. The ecosystem changes can also be made later, as we expect the changes required to be specific to new functionality (crypto, ediv, etc.).

2) We identify specific special cases that we believe are desired by the ecosystem [element within segment group, phased restart of crypto ops, etc.] and make specific allowance for supporting fields in vstart. The low-order bits will still identify the item [element, segment, crypto term, etc.] specific to the instruction. Other fields will be populated according to the nature of the operation and the element type.

3) vstart is considered opaque as above, with the escape mechanism that then reduces to the current POR. The ecosystem will need to handle the general case in which the element of exception must be determined by other means than the low bits of vstart. Exception handlers can no longer resume at an arbitrary element in the instruction and have a reasonable expectation that the restart will work as expected.

4) Opaque vstart without an escape signature in vstart. In all cases an alternate mechanism will be required to identify the element "of exception".

5) Further investigation is still required before these decisions can be made.

** (and presumably any future segmented operation)
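As a toy illustration of the difference between option 1 and options 3/4, here is a minimal C sketch. All names are invented for illustration; this is not proposed spec text, only a restatement of the resume rules described above.

    #include <stdint.h>
    #include <stdbool.h>

    /* Under the current POR (option 1), a trap handler may interpret vstart
     * directly as the index of the element at which execution resumes. */
    static uint64_t faulting_element_por(uint64_t vstart)
    {
        return vstart;                  /* plain element index */
    }

    /* Under an opaque vstart (options 3/4), the handler may only hand the
     * value back unchanged (resume at the excepting element) or hand back
     * vstart + 1 (resume at the next element); the faulting element index
     * itself must come from some other mechanism. */
    static uint64_t resume_token_opaque(uint64_t vstart, bool skip_excepting_element)
    {
        return skip_excepting_element ? vstart + 1 : vstart;
    }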
Re: an interesting paper
I agree with your comment.
I got this paper from someone who applied their assessment to RISC-V vectors. On Sep 14, 2020, at 10:47 AM, krste@... wrote:
an interesting paper
They mention RISC-V vectors in the intro, but on a quick scan, the results are very ARM-specific, with no real implication for RISC-V vectors. They're pointing out a problem with ARM SVE where all elements are executed regardless of vector length, due to SVE using predication to implement vector length.

Krste

On Mon, 14 Sep 2020 08:46:05 -0500, "swallach" <steven.wallach@...> said:
| i was made aware of this paper. risc-v vectors are mentioned.
| one of the key conclusions are (from the abstract)
| Our experiments show that VLA code reaches about 90% of the performance of
| vector length specific code, i.e. a 10% overhead is inferred due to global
| predication of instructions. Furthermore, we show that code performance is not
| increasing proportionally with increasing vector lengths due to the higher
| memory demands.
| my experience is just the opposite. (based on memory system design)
| i am curious to hear other opinions
an interesting paper
I was made aware of this paper; RISC-V vectors are mentioned.
One of the key conclusions (from the abstract):
"Our experiments show that VLA code reaches about 90% of the performance of vector length specific code, i.e. a 10% overhead is inferred due to global predication of instructions. Furthermore, we show that code performance is not increasing proportionally with increasing vector lengths due to the higher memory demands."
My experience is just the opposite (based on memory system design). I am curious to hear other opinions.