Re: Duplicate Counting Instruction
lidawei14@...
Hi,
Sorry the picture was dropped somehow, here I can present using pure text:
For the simple loop example:
for (i = 0; i < N; i++) {
Sorry the picture was dropped somehow, here I can present using pure text:
For the simple loop example:
for (i = 0; i < N; i++) {
a[b[i]] = a[c[i]] + 1;
}
We can use the algorithm:
while (i + VLEN < N) {
v_b = load(b[], i)
v_c = load(c[], i)
vdupcnt = vdupcnt(v_c, v_b)
vdone = 0
while (~vdone) {
vdup = test_nez(vdupcnt)
vmask = ~(vdone | vdup ) # update lanes with no dup and not done
v_a = indexed_load(a[], v_c, vmask)
v_a = vadd(v_a, 1)
store(v_a, a[], v_b, vmask)
vdone |= vmask
vdupcnt = vsub(vdupcnt, 1)
}
i += VLEN
}
And for example:
vc: 3 0 5 3 1 7 3 6
vb: 1 2 3 4 3 2 1 0
vdone: 0 0 0 0 0 0 0 0
# First solve lanes with count being 0
vdupcnt: 0 0 0 1 1 0 2 0
vdup: 0 0 0 1 1 0 1 0
vmask: 1 1 1 0 0 1 0 1 # safely update a with mask
vdone: 1 1 1 0 0 1 0 1
# Then solve lanes with count being 1
vdupcnt: 0 0 0 0 0 0 1 0
vdup: 0 0 0 0 0 0 1 0
vmask: 0 0 0 1 1 0 0 0 # safely update a with mask
vdone: 1 1 1 1 1 1 0 1
# Then solve lanes with count being 2
vdupcnt: 0 0 0 0 0 0 0 0
vdup: 0 0 0 0 0 0 0 0
vmask: 0 0 0 0 0 0 0 0 # safely update a with mask
vdone: 1 1 1 1 1 1 1 1 # all done
The advantage is we can simultaneously update duplicate lanes with same count, and it does not need to re-execute vdupcnt.
However, while duplicates extreamly rarely happen (where we actually interested), advantages of using vmhash from our discussion can be summarized:
1. vmhash costs less cycles and less hardware, and accuracy is not a problem because duplicates are rare (bloom filter).
2. vmhash algorithm regards each duplicates as a pause, we simply reduce vl once and re-extend vl to MAX after this duplicate, which is not bounded by current vl (vdupcnt has to solve all within current vl before going into next iter).
3. pausing to solve current duplicate might solve future duplicates at the same time (since we have updated several lanes before duplicate), re-executing vmhash eliminates those duplicates solved.
Such memory hazards can be found at many of real applications, then solving such problem as discussed can be really advantageous, any suggestions for next steps?
Thanks,
Dawei
The advantage is we can simultaneously update duplicate lanes with same count, and it does not need to re-execute vdupcnt.
However, while duplicates extreamly rarely happen (where we actually interested), advantages of using vmhash from our discussion can be summarized:
1. vmhash costs less cycles and less hardware, and accuracy is not a problem because duplicates are rare (bloom filter).
2. vmhash algorithm regards each duplicates as a pause, we simply reduce vl once and re-extend vl to MAX after this duplicate, which is not bounded by current vl (vdupcnt has to solve all within current vl before going into next iter).
3. pausing to solve current duplicate might solve future duplicates at the same time (since we have updated several lanes before duplicate), re-executing vmhash eliminates those duplicates solved.
Such memory hazards can be found at many of real applications, then solving such problem as discussed can be really advantageous, any suggestions for next steps?
Thanks,
Dawei