tech-vector-ext got dropped somehow, adding again.
We have to keep technical discussion on tech-vector-ext public mailing list.
On Tue, 30 Jun 2020 07:59:17 -0700, Krste Asanovic <krste@...> said:
| On Jun 30, 2020, at 12:35 AM, Roger Espasa <roger.espasa@...> wrote:
| @Krste: you wrote "forall" but you did mean "for(i=0; i < VL; i-i++) right? In other words, the hash is a recurrence on ALL previous lanes, and,
| from a hardware point of view, you have to "pass up" the value of Lane i to lane i+1 which is going to take N clocks (or whatever)
| Yes, should be a sequential loop.
| The hardware view is that this is a binary associative operation that can be done with an parallel-prefix-style OR-reduction tree, so shouldn’t need
| whole cycle per lane.
| @Dawei: I read the code more carefully, and I'm not sure I understand the "vmsbf.m v7, v6, v0.t " output. In your vdup example, this
| instruction would yield a 1 in lanes 0,1,2.3, which I don't think it's what you want... You would want a 1 in lanes 1 and 3, which are the ones
| that have the duplicate, right?
| I couldn’t follow code either.
| If there’s a single potential store-load dependency, then just using vmsbf.m on hash result will enable the elements before any conflict.
| You then continue following vector instructions under that mask, while scalar unit uses either vpopc.m to find out how many elements were handled to
| update pointers,counters before going back around for next strip.
| For histogram
Fixing errors below:
| loop:
| vsetvli t0, a0, e64
| vle64.v v8, (a1) # Get indices
| vmhash.vv v1, v8, v8 # Where are there possible duplicates in index vector?
| vmsbf.m v0, v1 # Enable where no duplicates
| Vlxe64.v v12, (a2). v0.t # Get bucket elements
vlxe64.v v12, (a2), v8, v0.t # Get bucket elements
| vpopc.m t1, v0 # How many without duplicates?
| vadd.vi v13, v12, 1, v0.t # Increment elements
| Vsxe64.v v13, (a2), v0.t # Store bucket elements
vsxe64.v v13, (a2), v8, v0.t # Store bucket elements
| sub a0, a0, t1 # Decrement count
| sll t0, t1, 3 # Multiply byte counts
| add a1, a1, t1 # Bump index pointer
| bnez a0, loop
| Which of course, in this particular case, could be done with vector AMOs but using as example.
When there are multiple dependencies in loop, you just OR the hashes
together, e.g.:
a[b[i]] = a[c[i]] + a[d[i]];
loop:
andi t2, a0, 15 # Only try up to 15 elements
vsetvli x0, t2, e64,ta,ma
vle64.v v11, (a1) # Load b[i]
vle64.v v12, (a2) # Load c[i]
vmhash.v v22, v11, v12 # c[i] match b[0:i-1] ?
vle64.v v13, (a3) # load d[i]
vmhash.v v23, v11, v13 # d[i] match b[0:i-1] ?
vmor.mm v24, v22, v23 # c[i] or d[i] match b[0:1-1] ?
vmsbf.m v0, v24 # flag safe iterations
vlxe64.v v1, (a4), v12, v0.t # load a[c[i]]
vpopc.m t0, v0 # Iterations done
vlxe64.v v2, (a4), v13, v0.t # load a[d[i]]
sub a0, a0, t0 # Subtract count of iterations
slli t1, t0, 3 # 8-byte elements
vadd.vv v3, v1, v2, v0.t # add elemnts
vsxe64.v v3, (a4), v11, v0.t # store a[b[i]]
add a1, a1, t1 # Bump pointers
add a2, a2, t1 # Bump pointers
add a3, a3, t1 # Bump pointers
bnez a0, loop
This caps speculative AVL to 15 elements.
Krste
| The software should be more sophisticated about setting vl, as there is no point trying a vector larger than number of bits in EEW used, so should do
| a min(AVL, EEW), and then probably halve that, when setting vl.
| Krste
| roger.
| On Tue, Jun 30, 2020 at 4:22 AM lidawei <lidawei14@...> wrote:
| Hi Krste & Roger,
| Replacing "vdupcnt" with "vmhash" can be quite simple, we might just replace "vdupcnt" and" vmsne" with a single "vmhash", then resolve the
| memory harzards from lowest to highest one by one .
| This way, the accuracy drop does not affect correctness of the loop, but just introduce some additional effort to handle the mis-predicted
| duplicates.
| Original assembly code:
| la a0, N
| ld s1, a # a[]
| ld s2, b # b[]
| ld s3, c # c[]
| loop:
| vsetvli t0, a0, e64 # t0 = vl, a0 = N, 64-bit
| vmv.v.i v0, 1 # v0 = todo_list
| vlxe.v v3, s2, v1 # v3 = v_b
| vlxe.v v4, s3, v1 # v4 = v_c
----| vdupcnt.v v5, v3, v4 # duplicate counting
----| vmsne.vx v6, v5, zero # convert counts to dup_mask
| patch_up:
| vmsbf.m v7, v6, v0.t # v7 = safe_list, locate lanes can be safely updated
| vlxe.v v8, s1, v4, v7.t # v8 = v_a, indexed loads
| vadd.vi v8, v8, 1, v7.t # modify
| vsxei32.v v8, s1, v3, v7.t # write back to memory
| vxor.vi v7, v7, -1, v0.t # ~safe_list for removing done lanes
| vand.vv v0, v0, v7, v0.t # remove safe_list done lanes from todo_list
| vmsif.m v9, v6, v0 # find duplicate lanes for mask_off
| vxor.vi v9, v9, -1, v0.t # ~mask_off for removing
| vand.vv v6, v6, v9, v0.t # mask off this lane from dup_mask
| vpopc.m t0, v0, v0.t
| bne t0, zero, patch_up # proceed until no todo_list
| sub a0, a0, t0
| bnez a0, loop
| Dawei
| -----邮件原件-----
| 发件人: lidawei
| 发送时间: 2020年6月18日 11:17
| 收件人: 'krste@...' <krste@...>; Roger Espasa <roger.espasa@...>
| 抄送: sunwenbo (A) <sunwenbo1@...>; Reza Azimi <reza.azimi1@...>; liuyingying (F) <liuyingying19@...>; xuqingqing (C) <
| xuqingqing7@...>; gouyue <gouyue@...>
| 主题: RE: [RISC-V] [tech-vector-ext] Duplicate Counting Instruction
| Hi Krste & Roger,
| Thank you for your solution.
| I agree that the instruction does not scale to long vectors, and I believe a vmhash.vv will do the work correctly if we keep conservative when
| we are not sure if memory hazard exists.
| However, I designed the instruction as a count pattern because with the count values, I can re-organize the vectorization algorithm to resolve
| multiple dependences at once.
| Example:
| // VLEN= 8 in the figure
| // N= total elements
| while (i + VLEN < N) {
| v_index = vec_load(keys, i)
| v_dup_cnt = vdupcnt(v_index, v_index)
| v_updated_mask = 0
| while (~v_update_mask) {
| v_pend_mask = test_nez(v_dup_cnt)
| v_mod_mask = ~(v_updated_mask | v_pending_mask)
| v2 = vec_load(buckets, v_index, v_mod_mask)
| // modify v2
| vec_store(buckets, v_index, v2, v_mod_mask)
| v_updated_mask |= v_mod_mask
| v_dup_cnt = vec_add(v_dup_cnt, -1, v_pend_mask)
| }
| I += VLEN
| }
| Let us keep the design open and seek optimal solutions.
| Regards,
| Dawei
| -----邮件原件-----
| 发件人: krste@... [mailto:krste@...]
| 发送时间: 2020年6月17日 9:33
| 收件人: Roger Espasa <roger.espasa@...>
| 抄送: lidawei <lidawei14@...>; Krste Asanovic <krste@...>
| 主题: Re: [RISC-V] [tech-vector-ext] Duplicate Counting Instruction
| This instruction doesn't really scale to long vectors (consider the BSC design where VLEN=16384).
| There is a fully software gadget used to vectorize this type of code that uses a reversed-order element store of element index into a hash
| table, followed by load/compare of same to see if there was a conflict.
| Perhaps a hash-based hardware instruction that approximates the full check would be sufficient in practice (e.g., parallel-prefix of a hash
| function), though I guess the Intel folks must have found that simpler approach didn't help Spec as much before going all out on the vdup
| instruction.
| The hardware hash-match instruction would operate something like:
| vmhash.vv vd, vs2, vs1, vm // Keeps hash in SEW element, usually XLEN
| vhash[-1] = 0;
| forall (i in VL)
| vd[i] = is_present_in_hash( vs1[i], vhash[i-1] )
| vhash[i] = hash_op( vhash[i-1] , vs2[i] )
| so mask register element i is set where vs1[i] is possibly in set of preceding elements of vs2. Could make it take/produce a scalar of input/
| ouput hash to stripmine, but in the memory hazard detection we don't need to stripmine these as tthe next stripmine loop is done sequentially
| anyway. Input mask v0.t probably applies to the inclusion of vs2 values, and all vd values are written?
| The simplest hash is setting a bit based on some LSBs of data value, with presence check being equally easy. Then there's always Bloom
| filters...
| The hash filter accuracy drops rapidly with increasing vector length but at least operation is well defined. The hardware cost of the vdup
| approach grows rapidly with vector length, so maybe the feasible range of both overlaps.
| Krste
| | On Tue, Jun 16, 2020 at 3:10 PM <lidawei14@...> wrote:
| | Hi all,
| | For some certain cases such as histogram we might have duplicate runtime
| | memory dependences, and the current V extension may fail to vectorize such
| | cases. Therefore, I would like to propose duplicate detection instructions
| | to enable vectorization.
| | Let us consider the following example code first:
| | for (i=0; i<N; i++) {
| | a[c[i]] = a[b[i]] + 1;
| | }
| | the above loop will have loop-carried data dependence if c[i] == b[i] (duplicate),
| | and if we naively vectorize the loop, we got wrong a[] results.However, if we
| | have the knowledge of where the duplicate is, we can sequentially update a[]
| | from the previous lane and then safely execute the current and incoming
| | lanes.
| | Then let us check the duplicate counting instruction:
| |
| | vdupcnt.v vd, vs1, vs2, vm
| |
| | Efficient address/index duplicate detection instruction `vdupcnt` aims to
| | detect memory dependency. it performs an all-to-all comparison across all
| | lanes of the source vector registers vs1 from vs2 and outputs a vector of values
| | ``vd`` in which each element shows the count of all identical values in vs2
| | prior to each element in the corresponding vector lane.
| | 7 6 5 4 3 2 1 0 Element number
| | 0 0 1 1 1 1 1 0 v0 contents
| | 3 0 2 1 2 0 2 1 v2 contents
| | 3 0 2 0 2 2 2 2 v3 contents
| | vdupcnt.v v1, v2, v3, v0.t
| | 0 0 3 0 2 0 0 0 v1 contents
| | With vdupcnt.v, we can vectorize the loop (may not be optimal and I assume bugs exist):
| | la a0, N
| | ld s1, a # a[]
| | ld s2, b # b[]
| | ld s3, c # c[]
| | loop:
| | vsetvli t0, a0, e64 # t0 = vl, a0 = N, 64-bit
| | vmv.v.i v0, 1 # v0 = todo_list
| | vlxe.v v3, s2, v1 # v3 = v_b
| | vlxe.v v4, s3, v1 # v4 = v_c
| | vdupcnt.v v5, v3, v4 # duplicate counting
| | vmsne.vx v6, v5, zero # convert counts to dup_mask
| | patch_up:
| | vmsbf.m v7, v6, v0.t # v7 = safe_list, locate lanes can be safely updated
| | vlxe.v v8, s1, v4, v7.t # v8 = v_a, indexed loads
| | vadd.vi v8, v8, 1, v7.t # modify
| | vsxei32.v v8, s1, v3, v7.t # write back to memory
| | vxor.vi v7, v7, -1, v0.t # ~safe_list for removing done lanes
| | vand.vv v0, v0, v7, v0.t # remove safe_list done lanes from todo_list
| | vmsif.m v9, v6, v0 # find duplicate lanes for mask_off
| | vxor.vi v9, v9, -1, v0.t # ~mask_off for removing
| | vand.vv v6, v6, v9, v0.t # mask off this lane from dup_mask
| | vpopc.m t0, v0, v0.t
| | bne t0, zero, patch_up # proceed until no todo_list
| | sub a0, a0, t0
| | bnez a0, loop
| | To be efficient, this algorithm expects duplicates rarely happen.
| | For image processing, we have pixels range from 0 to 255, where a byte is enough to
| | represent a pixel, we could also use this instruction to directly count histograms rather
| | than detecting duplicates in this case.
| | Apart from histogram, I also found some duplicate data dependence loops in SPEC CPU,
| | and some other using cases for analytical database applications such as hash joins.
| |