Duplicate Counting Instruction


lidawei14@...
 

Hi all,

For certain cases such as histograms we may have duplicate runtime
memory dependences, and the current V extension can fail to vectorize such
cases. I would therefore like to propose a duplicate-detection instruction
to enable vectorization.

Let us consider the following example code first:

for (i=0; i<N; i++) {
    a[c[i]] = a[b[i]] + 1;
}

The above loop has a loop-carried data dependence whenever c[i] == b[i] (a duplicate),
and if we naively vectorize it we get wrong a[] results. However, if we
know where the duplicates are, we can sequentially apply the update to a[]
from the previous lane and then safely execute the current and following
lanes.
Now consider the duplicate counting instruction:
 
vdupcnt.v vd, vs1, vs2, vm
 
The address/index duplicate detection instruction `vdupcnt` aims to
detect memory dependences efficiently. It performs an all-to-all comparison
between the source vector registers vs1 and vs2 and outputs a vector ``vd``
in which element i holds the count of active elements vs2[j], j < i, that
are equal to vs1[i] (masked-off lanes yield 0 in the example below).

    7 6 5 4 3 2 1 0   Element number
    0 0 1 1 1 1 1 0   v0 contents
    3 0 2 1 2 0 2 1   v2 contents
    3 0 2 0 2 2 2 2   v3 contents
                      vdupcnt.v v1, v2, v3, v0.t
    0 0 3 0 2 0 0 0   v1 contents
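A small software model of this semantics (this is my reading of the example above; treating masked-off lanes as producing 0 is an assumption):

```python
def vdupcnt(vs1, vs2, mask):
    """Model of the proposed vdupcnt.v: vd[i] counts how many active
    elements vs2[j] (j < i) equal vs1[i]; masked-off lanes produce 0."""
    vd = []
    for i in range(len(vs1)):
        if mask[i]:
            vd.append(sum(1 for j in range(i)
                          if mask[j] and vs2[j] == vs1[i]))
        else:
            vd.append(0)
    return vd

# The example above, listed element 0 first (element 0 is rightmost
# in the register diagram):
v0 = [0, 1, 1, 1, 1, 1, 0, 0]   # mask
v2 = [1, 2, 0, 2, 1, 2, 0, 3]   # vs1
v3 = [2, 2, 2, 2, 0, 2, 0, 3]   # vs2
print(vdupcnt(v2, v3, v0))      # [0, 0, 0, 2, 0, 3, 0, 0]
```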

With vdupcnt.v, we can vectorize the loop as follows (this may not be optimal, and bugs may remain):

    ld          a0, N                # a0 = element count N
    la          s1, a                # a[]
    la          s2, b                # b[]
    la          s3, c                # c[]
loop:
    vsetvli     t0, a0, e64          # t0 = vl, a0 = remaining count, 64-bit
    vid.v       v1                   # v1 = 0,1,2,... element indices
    vsll.vi     v1, v1, 3            # scale indices to byte offsets (8B elements)
    vmv.v.i     v0, 1                # v0 = todo_list
    vlxe.v      v3, s2, v1           # v3 = v_b
    vlxe.v      v4, s3, v1           # v4 = v_c
    vdupcnt.v   v5, v3, v4           # duplicate counting
    vmsne.vx    v6, v5, zero         # convert counts to dup_mask
patch_up:
    vmsbf.m     v7, v6, v0.t         # v7 = safe_list, locate lanes can be safely updated
    vlxe.v      v8, s1, v4, v7.t     # v8 = v_a, indexed loads
    vadd.vi     v8, v8, 1, v7.t      # modify
    vsxe.v      v8, s1, v3, v7.t     # write back to memory
    vxor.vi     v7, v7, -1, v0.t     # ~safe_list for removing done lanes
    vand.vv     v0, v0, v7, v0.t     # remove safe_list done lanes from todo_list
    vmsif.m     v9, v6, v0.t         # find duplicate lanes to mask off
    vxor.vi     v9, v9, -1, v0.t     # ~mask_off for removing
    vand.vv     v6, v6, v9, v0.t     # mask off this lane from dup_mask  
    vpopc.m     t1, v0, v0.t         # t1 = count of remaining todo lanes
    bne         t1, zero, patch_up   # proceed until todo_list is empty
    sub         a0, a0, t0           # t0 still holds vl from vsetvli
    bnez        a0, loop

To be efficient, this algorithm expects duplicates to be rare.

For image processing, pixels range from 0 to 255, so a byte is enough to
represent a pixel; we could also use this instruction to count histograms directly
rather than only detecting duplicates.
Apart from histograms, I also found some duplicate-dependence loops in SPEC CPU,
and other use cases in analytical database applications such as hash joins.
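One way the direct histogram counting could work, sketched in software (the gather/scatter framing and the last-write-wins assumption for the indexed store are mine):

```python
def strip_histogram(hist, pixels):
    """Sketch: count one strip of pixels into hist using vdupcnt-style
    prior-duplicate counts, an indexed gather, and an indexed scatter
    where the highest lane writing an address wins (last write wins)."""
    cnt = [sum(1 for j in range(i) if pixels[j] == pixels[i])
           for i in range(len(pixels))]
    base = [hist[p] for p in pixels]             # indexed gather of old counts
    upd = [b + c + 1 for b, c in zip(base, cnt)]
    for p, u in zip(pixels, upd):                # indexed scatter; later lanes
        hist[p] = u                              # overwrite earlier ones
    return hist
```

The lane holding the last occurrence of each pixel value carries the full per-strip count, so its scatter write lands last and yields the correct total.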


Krste Asanovic
 

tech-vector-ext got dropped somehow, adding again.

We have to keep technical discussion on the tech-vector-ext public mailing list.

On Tue, 30 Jun 2020 07:59:17 -0700, Krste Asanovic <krste@...> said:
| On Jun 30, 2020, at 12:35 AM, Roger Espasa <roger.espasa@...> wrote:
| @Krste: you wrote "forall" but you did mean "for (i=0; i<VL; i++)" right? In other words, the hash is a recurrence on ALL previous lanes, and,
| from a hardware point of view, you have to "pass up" the value of lane i to lane i+1, which is going to take N clocks (or whatever)

| Yes, should be a sequential loop.

| The hardware view is that this is a binary associative operation that can be done with a parallel-prefix-style OR-reduction tree, so it shouldn't need a
| whole cycle per lane.

| @Dawei: I read the code more carefully, and I'm not sure I understand the "vmsbf.m v7, v6, v0.t" output. In your vdup example, this
| instruction would yield a 1 in lanes 0, 1, 2, 3, which I don't think is what you want... You would want a 1 in lanes 1 and 3, which are the ones
| that have the duplicate, right?

| I couldn't follow the code either.

| If there’s a single potential store-load dependency, then just using vmsbf.m on hash result will enable the elements before any conflict.

| You then continue following vector instructions under that mask, while the scalar unit uses vpopc.m to find out how many elements were handled to
| update pointers and counters before going back around for the next strip.

| For histogram

Fixing errors below:

| loop:
| vsetvli t0, a0, e64
| vle64.v v8, (a1) # Get indices
| vmhash.vv v1, v8, v8 # Where are there possible duplicates in index vector?
| vmsbf.m v0, v1 # Enable where no duplicates
| Vlxe64.v v12, (a2). v0.t # Get bucket elements
vlxe64.v v12, (a2), v8, v0.t # Get bucket elements

| vpopc.m t1, v0 # How many without duplicates?
| vadd.vi v13, v12, 1, v0.t # Increment elements
| Vsxe64.v v13, (a2), v0.t # Store bucket elements
vsxe64.v v13, (a2), v8, v0.t # Store bucket elements

| sub a0, a0, t1 # Decrement count
| sll t0, t1, 3 # Scale element count to byte offset
| add a1, a1, t0 # Bump index pointer
| bnez a0, loop

| Which of course, in this particular case, could be done with vector AMOs, but it serves as an example.



When there are multiple dependencies in a loop, you just OR the hashes
together, e.g.:

a[b[i]] = a[c[i]] + a[d[i]];

loop:
li t2, 15 # Cap speculative AVL to 15 elements
bgeu a0, t2, 1f # t2 = min(a0, 15)
mv t2, a0
1:
vsetvli x0, t2, e64, ta, ma
vle64.v v11, (a1) # Load b[i]
vle64.v v12, (a2) # Load c[i]
vmhash.vv v22, v11, v12 # c[i] match b[0:i-1] ?
vle64.v v13, (a3) # load d[i]
vmhash.vv v23, v11, v13 # d[i] match b[0:i-1] ?
vmor.mm v24, v22, v23 # c[i] or d[i] match b[0:i-1] ?
vmsbf.m v0, v24 # flag safe iterations
vlxe64.v v1, (a4), v12, v0.t # load a[c[i]]
vpopc.m t0, v0 # Iterations done
vlxe64.v v2, (a4), v13, v0.t # load a[d[i]]
sub a0, a0, t0 # Subtract count of iterations
slli t1, t0, 3 # 8-byte elements
vadd.vv v3, v1, v2, v0.t # add elements
vsxe64.v v3, (a4), v11, v0.t # store a[b[i]]
add a1, a1, t1 # Bump pointers
add a2, a2, t1 # Bump pointers
add a3, a3, t1 # Bump pointers
bnez a0, loop

This caps speculative AVL to 15 elements.
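For reference, the conflict filter used in the loop above can be modeled in software like this (the low-bits hash is a placeholder assumption for illustration, not a proposed definition):

```python
def vmhash(vs2, vs1, nbits=64):
    """Model of the sketched vmhash.vv vd, vs2, vs1: vd[i] is set when
    vs1[i] *may* equal some earlier vs2[j], j < i, tracked with a
    one-bit-per-hash filter.  Hashing on the low bits is an assumption."""
    vd, filt = [], 0
    for x2, x1 in zip(vs2, vs1):
        vd.append(1 if filt & (1 << (x1 % nbits)) else 0)  # presence check
        filt |= 1 << (x2 % nbits)                          # then insert vs2[i]
    return vd

# An index vector checked against itself, as in the histogram loop:
idx = [3, 0, 5, 3, 1, 7, 3, 6]
print(vmhash(idx, idx))   # lanes 3 and 6 flag the repeated index 3
```

Shrinking nbits makes the filter coarser: it may over-report conflicts (false positives), but it never under-reports, which is what correctness requires.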

Krste

| The software should be more sophisticated about setting vl, as there is no point trying a vector larger than number of bits in EEW used, so should do
| a min(AVL, EEW), and then probably halve that, when setting vl.

| Krste

| roger.

| On Tue, Jun 30, 2020 at 4:22 AM lidawei <lidawei14@...> wrote:

| Hi Krste & Roger,

| Replacing "vdupcnt" with "vmhash" can be quite simple: we might just replace "vdupcnt" and "vmsne" with a single "vmhash", then resolve the
| memory hazards from lowest to highest, one by one.
| This way, the accuracy drop does not affect the correctness of the loop; it just introduces some additional effort to handle mis-predicted
| duplicates.

| Original assembly code:

| la a0, N
| ld s1, a # a[]
| ld s2, b # b[]
| ld s3, c # c[]
| loop:
| vsetvli t0, a0, e64 # t0 = vl, a0 = N, 64-bit
| vmv.v.i v0, 1 # v0 = todo_list
| vlxe.v v3, s2, v1 # v3 = v_b
| vlxe.v v4, s3, v1 # v4 = v_c
----| vdupcnt.v v5, v3, v4 # duplicate counting
----| vmsne.vx v6, v5, zero # convert counts to dup_mask
| patch_up:
| vmsbf.m v7, v6, v0.t # v7 = safe_list, locate lanes can be safely updated
| vlxe.v v8, s1, v4, v7.t # v8 = v_a, indexed loads
| vadd.vi v8, v8, 1, v7.t # modify
| vsxei32.v v8, s1, v3, v7.t # write back to memory
| vxor.vi v7, v7, -1, v0.t # ~safe_list for removing done lanes
| vand.vv v0, v0, v7, v0.t # remove safe_list done lanes from todo_list
| vmsif.m v9, v6, v0 # find duplicate lanes for mask_off
| vxor.vi v9, v9, -1, v0.t # ~mask_off for removing
| vand.vv v6, v6, v9, v0.t # mask off this lane from dup_mask
| vpopc.m t0, v0, v0.t
| bne t0, zero, patch_up # proceed until no todo_list
| sub a0, a0, t0
| bnez a0, loop

| Dawei

| -----Original Message-----
| From: lidawei
| Sent: June 18, 2020 11:17
| To: 'krste@...' <krste@...>; Roger Espasa <roger.espasa@...>
| Cc: sunwenbo (A) <sunwenbo1@...>; Reza Azimi <reza.azimi1@...>; liuyingying (F) <liuyingying19@...>; xuqingqing (C) <
| xuqingqing7@...>; gouyue <gouyue@...>
| Subject: RE: [RISC-V] [tech-vector-ext] Duplicate Counting Instruction

| Hi Krste & Roger,

| Thank you for your solution.

| I agree that the instruction does not scale to long vectors, and I believe a vmhash.vv will do the work correctly if we remain conservative when
| we are not sure whether a memory hazard exists.
| However, I designed the instruction to produce counts because, with the count values, I can reorganize the vectorization algorithm to resolve
| multiple dependences at once.

| Example:

| // VLEN= 8 in the figure
| // N= total elements
| while (i + VLEN < N) {
| v_index = vec_load(keys, i)
| v_dup_cnt = vdupcnt(v_index, v_index)
| v_updated_mask = 0
| while (v_updated_mask != ALL_ONES) {
| v_pend_mask = test_nez(v_dup_cnt)
| v_mod_mask = ~(v_updated_mask | v_pend_mask)
| v2 = vec_load(buckets, v_index, v_mod_mask)
| // modify v2
| vec_store(buckets, v_index, v2, v_mod_mask)
| v_updated_mask |= v_mod_mask
| v_dup_cnt = vec_add(v_dup_cnt, -1, v_pend_mask)
| }
| i += VLEN
| }

| Let us keep the design open and seek optimal solutions.

| Regards,
| Dawei

| -----邮件原件-----
| 发件人: krste@... [mailto:krste@...]
| 发送时间: 2020年6月17日 9:33
| 收件人: Roger Espasa <roger.espasa@...>
| 抄送: lidawei <lidawei14@...>; Krste Asanovic <krste@...>
| 主题: Re: [RISC-V] [tech-vector-ext] Duplicate Counting Instruction

| This instruction doesn't really scale to long vectors (consider the BSC design where VLEN=16384).

| There is a fully software gadget used to vectorize this type of code that uses a reversed-order element store of element index into a hash
| table, followed by load/compare of same to see if there was a conflict.

| Perhaps a hash-based hardware instruction that approximates the full check would be sufficient in practice (e.g., parallel-prefix of a hash
| function), though I guess the Intel folks must have found that simpler approach didn't help Spec as much before going all out on the vdup
| instruction.

| The hardware hash-match instruction would operate something like:

| vmhash.vv vd, vs2, vs1, vm // Keeps hash in SEW element, usually XLEN

| vhash[-1] = 0;
| forall (i in VL)
| vd[i] = is_present_in_hash( vs1[i], vhash[i-1] )
| vhash[i] = hash_op( vhash[i-1] , vs2[i] )

| so mask register element i is set where vs1[i] is possibly in the set of preceding elements of vs2. Could make it take/produce a scalar input/
| output hash to stripmine, but in the memory hazard detection we don't need to stripmine these, as the next stripmine loop is done sequentially
| anyway. Input mask v0.t probably applies to the inclusion of vs2 values, and all vd values are written?

| The simplest hash is setting a bit based on some LSBs of data value, with presence check being equally easy. Then there's always Bloom
| filters...

| The hash filter accuracy drops rapidly with increasing vector length but at least operation is well defined. The hardware cost of the vdup
| approach grows rapidly with vector length, so maybe the feasible range of both overlaps.

| Krste

| | On Tue, Jun 16, 2020 at 3:10 PM <lidawei14@...> wrote:

| | [...]


lidawei14@...
 

Hi Krste,

I read through your code, and thanks for correcting my errors; 'or' is a good idea for handling multiple duplicates.
Here I'd like to explain why I made things a bit more complicated in my code.

In your code you are also fixing duplicates from least to most via the v0 mask. But if vhash
takes many more cycles to execute, then we can try to execute vhash only once, since it outputs all
duplicate lanes at one time. We can thus store the duplicates in a mask register and fix them using
this register as a reference. We resolve duplicates from least to most as usual, masking off
duplicate lanes as they are resolved. As a result, we can iterate only the patch-up loop without re-executing the vhash
instruction.

The code does not change the vhash instruction design at all; it just demonstrates
that the instruction works for memory hazard problems.


Krste Asanovic
 

vmhash should be cheap relative to the work you're doing on each loop.

Redoing vmhash in each stripmine could lead to better performance, as
you find longer non-conflicting index runs rather than always
stopping at each VLMAX point.

Krste

On Mon, 06 Jul 2020 04:14:55 -0700, "lidawei14 via lists.riscv.org" <lidawei14=huawei.com@...> said:
| [...]


lidawei14@...
 

Hi Krste,

I would just like to continue Roger's question on the hardware implementation. You said it can be done with a parallel-prefix-style OR-reduction tree, so could you please explain how we can avoid a whole cycle per lane? How many cycles are required for vmhash then? As I presented, for vdupcnt we can use a mask-manipulation algorithm to resolve memory hazards in parallel for all nonzero duplicate lanes (please see the histogram example figure in the picture).
I think we can compare and trade off the two designs.

Dawei


lidawei14@...
 

Hi,

Sorry, the picture was dropped somehow; here it is in plain text.
For the simple loop example:

for (i = 0; i < N; i++) {
    a[b[i]] = a[c[i]] + 1;
}

We can use the algorithm:

while (i + VLEN < N) { 
   v_b = load(b[], i)
   v_c = load(c[], i)
   vdupcnt = vdupcnt(v_c, v_b)
   vdone = 0
   while (vdone != ALL_ONES) {
      vdup = test_nez(vdupcnt)
      vmask = ~(vdone | vdup ) # update lanes with no dup and not done
      v_a = indexed_load(a[], v_c, vmask)
      v_a = vadd(v_a, 1)
      store(v_a, a[], v_b, vmask)
      vdone |= vmask
      vdupcnt = vsub(vdupcnt, 1, vdup)  # masked: decrement pending lanes only
   }
   i += VLEN
}

And for example:

vc:        3 0 5 3 1 7 3 6
vb:        1 2 3 4 3 2 1 0
vdone:     0 0 0 0 0 0 0 0
 
# First solve lanes with count being 0
vdupcnt:   0 0 0 1 1 0 2 0
vdup:      0 0 0 1 1 0 1 0
vmask:     1 1 1 0 0 1 0 1 # safely update a with mask
vdone:     1 1 1 0 0 1 0 1
 
# Then solve lanes with count being 1
vdupcnt:   0 0 0 0 0 0 1 0
vdup:      0 0 0 0 0 0 1 0
vmask:     0 0 0 1 1 0 0 0 # safely update a with mask
vdone:     1 1 1 1 1 1 0 1
 
# Then solve lanes with count being 2
vdupcnt:   0 0 0 0 0 0 0 0
vdup:      0 0 0 0 0 0 0 0
vmask:     0 0 0 0 0 0 0 0 # safely update a with mask
vdone:     1 1 1 1 1 1 1 1 # all done

The advantage is that we can simultaneously update all duplicate lanes with the same count, and we never need to re-execute vdupcnt.

However, when duplicates happen extremely rarely (the case we are actually interested in), the advantages of vmhash from our discussion can be summarized:
1. vmhash costs fewer cycles and less hardware, and accuracy is not a problem because duplicates are rare (Bloom filter).
2. The vmhash algorithm treats each duplicate as a pause: we simply reduce vl once and re-extend vl to MAX after the duplicate, so we are not bounded by the current vl (vdupcnt has to resolve everything within the current vl before moving to the next iteration).
3. Pausing to resolve the current duplicate may resolve later duplicates at the same time (since several lanes before the duplicate have already been updated); re-executing vmhash then eliminates the duplicates already resolved.

Such memory hazards occur in many real applications, so solving this problem as discussed could be really advantageous. Any suggestions for next steps?

Thanks,
Dawei