Re: Duplicate Counting Instruction

lidawei14@...
Hi,

Sorry, the picture was somehow dropped; here is the same content in plain text.
For the simple loop example:

for (i = 0; i < N; i++) {
    a[b[i]] = a[c[i]] + 1;
}
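For concreteness, here is a small Python model (using the same index vectors as the worked example below) of why this loop cannot be vectorized naively: a plain gather/add/scatter of all lanes at once lets later lanes read stale values of a[].

```python
# Scalar reference vs. a naive "all lanes at once" vectorization (Python model;
# the index vectors are the ones used in the worked example below).
N = 8
b = [1, 2, 3, 4, 3, 2, 1, 0]   # store indices (1, 2, 3 each occur twice)
c = [3, 0, 5, 3, 1, 7, 3, 6]   # load indices

# Scalar semantics: each iteration observes all earlier stores.
a_scalar = [0] * 8
for i in range(N):
    a_scalar[b[i]] = a_scalar[c[i]] + 1

# Naive SIMD semantics: gather every lane first, then scatter every lane.
a_simd = [0] * 8
gathered = [a_simd[c[i]] + 1 for i in range(N)]   # every load sees the old a[]
for i in range(N):
    a_simd[b[i]] = gathered[i]                    # later lanes simply overwrite

print(a_scalar == a_simd)   # False: the duplicates make the naive version wrong
```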

We can use the algorithm:

while (i + VLEN <= N) {
   v_b = load(b[], i)
   v_c = load(c[], i)
   vdupcnt = vdupcnt(v_c, v_b)   # per-lane count of earlier lanes this lane must wait for
   vdone = 0
   while (~vdone) {              # loop until every lane is done
      vdup = test_nez(vdupcnt)   # lanes still waiting on an earlier duplicate
      vmask = ~(vdone | vdup)    # lanes with no pending dup and not yet done
      v_a = indexed_load(a[], v_c, vmask)
      v_a = vadd(v_a, 1)
      store(v_a, a[], v_b, vmask)
      vdone |= vmask
      vdupcnt = vsub(vdupcnt, 1) # waiting lanes get one step closer
   }
   i += VLEN
}
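In Python, the masked inner loop can be modeled as follows. Note that the exact vdupcnt semantics here are my inference from the worked example below (each lane counts the length of its read-after-write chain through earlier lanes); the real instruction definition may differ.

```python
# Python model of the masked inner loop above. vdupcnt semantics are an
# assumption inferred from the worked example: lane i counts the longest
# chain of earlier lanes j < i whose store index b[j] feeds load index c[i].
VLEN = 8
b = [1, 2, 3, 4, 3, 2, 1, 0]   # store indices
c = [3, 0, 5, 3, 1, 7, 3, 6]   # load indices
a = [0] * 8

def dupcnt(c, b):
    cnt = [0] * len(b)
    for i in range(len(b)):
        # lanes j < i whose store b[j] feeds this lane's load c[i]
        cnt[i] = max((cnt[j] + 1 for j in range(i) if b[j] == c[i]), default=0)
    return cnt

cnt = dupcnt(c, b)             # -> [0, 0, 0, 1, 1, 0, 2, 0] for this input
done = [False] * VLEN
while not all(done):                                   # while (~vdone)
    mask = [not done[i] and cnt[i] == 0                # vmask = ~(vdone | vdup)
            for i in range(VLEN)]
    gathered = [a[c[i]] + 1 if mask[i] else None       # masked gather + add
                for i in range(VLEN)]
    for i in range(VLEN):                              # masked scatter, lane order
        if mask[i]:
            a[b[i]] = gathered[i]
            done[i] = True                             # vdone |= vmask
    cnt = [max(x - 1, 0) for x in cnt]                 # vdupcnt = vsub(vdupcnt, 1)

print(a)   # [1, 3, 1, 2, 2, 0, 0, 0] -- matches the scalar loop
```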

For example, with VLEN = 8 and input vectors:

vc:        3 0 5 3 1 7 3 6
vb:        1 2 3 4 3 2 1 0
vdone:     0 0 0 0 0 0 0 0
 
# First pass: resolve lanes whose count is 0
vdupcnt:   0 0 0 1 1 0 2 0
vdup:      0 0 0 1 1 0 1 0
vmask:     1 1 1 0 0 1 0 1 # safely update a with mask
vdone:     1 1 1 0 0 1 0 1
 
# Then resolve lanes whose count was 1
vdupcnt:   0 0 0 0 0 0 1 0
vdup:      0 0 0 0 0 0 1 0
vmask:     0 0 0 1 1 0 0 0 # safely update a with mask
vdone:     1 1 1 1 1 1 0 1
 
# Then resolve lanes whose count was 2
vdupcnt:   0 0 0 0 0 0 0 0
vdup:      0 0 0 0 0 0 0 0
vmask:     0 0 0 0 0 0 0 0 # safely update a with mask
vdone:     1 1 1 1 1 1 1 1 # all done

The advantage is that all duplicate lanes with the same count can be updated simultaneously, and vdupcnt does not need to be re-executed.

However, since duplicates happen extremely rarely (which is the case we are actually interested in), the advantages of using vmhash from our discussion can be summarized as:
1. vmhash costs fewer cycles and less hardware, and accuracy is not a problem because duplicates are rare (it behaves like a Bloom filter: false positives only cost performance).
2. The vmhash algorithm treats each duplicate as a pause: we simply reduce vl once and re-extend vl to MAX after the duplicate, so progress is not bounded by the current vl (vdupcnt has to resolve everything within the current vl before moving to the next iteration).
3. Pausing to resolve the current duplicate may resolve future duplicates at the same time (since several lanes have already been updated before the duplicate), and re-executing vmhash skips the duplicates that are already resolved.
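To make points 2 and 3 concrete, here is a rough Python model of the pause-and-re-extend strategy (my own sketch: vmhash is modeled as exact first-conflict detection, whereas a hashed implementation may pause spuriously, which costs performance but not correctness):

```python
# Rough Python model of the pause-and-re-extend strategy (my own sketch;
# vmhash is modeled here as exact first-conflict detection).
def first_conflict(c, b, vl):
    """First lane whose load index c[i] collides with an earlier store b[j]."""
    stored = set()
    for i in range(vl):
        if c[i] in stored:
            return i          # pause here: reduce vl to i for this iteration
        stored.add(b[i])
    return vl                 # no conflict: keep the full vl

def run(a, b, c, N, VLMAX=8):
    i = 0
    while i < N:
        vl = min(VLMAX, N - i)                          # re-extend vl to MAX
        vl = first_conflict(c[i:i+vl], b[i:i+vl], vl)   # shrink at a duplicate
        gathered = [a[c[i+k]] + 1 for k in range(vl)]   # gather + add
        for k in range(vl):                             # scatter, lane order
            a[b[i+k]] = gathered[k]
        i += vl               # next iteration re-checks only the remaining lanes

a = [0] * 8
run(a, [1, 2, 3, 4, 3, 2, 1, 0], [3, 0, 5, 3, 1, 7, 3, 6], 8)
print(a)   # [1, 3, 1, 2, 2, 0, 0, 0] -- same result as the scalar loop
```

Because each window after a pause starts fresh, duplicates already committed in the preceding prefix are never re-detected, which is exactly point 3.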

Such memory hazards appear in many real applications, so solving this problem as discussed could be a real advantage. Any suggestions for next steps?

Thanks,
Dawei
 
