Hi Krste,
I would like to follow up on Roger's question about the hardware implementation. You said it can be done with a parallel-prefix-style OR-reduction tree; could you please explain how that avoids spending a whole cycle per lane? How many cycles does vmhash require then? As I presented, for vdupcnt we can use a mask-manipulation algorithm to resolve memory hazards in parallel for all nonzero duplicate lanes (please see the histogram example in the picture).
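
To make sure we are talking about the same thing, here is how I understand a generic parallel-prefix OR. This is only a minimal Python sketch under my own assumptions (the function name prefix_or and the Kogge-Stone-style layout are illustrative, and this is not meant to describe the actual vmhash datapath). It shows that an inclusive prefix OR over n lanes needs about log2(n) combining steps rather than one step per lane:

```python
def prefix_or(bits):
    """Inclusive prefix OR over a list of 0/1 values.

    Models a Kogge-Stone-style scan: in each step every lane ORs in the
    value from `stride` lanes below it, and the stride doubles each step.
    Returns (result, number_of_steps).
    """
    out = list(bits)
    n = len(out)
    stride = 1
    steps = 0
    while stride < n:
        # In hardware all lanes update simultaneously; the snapshot
        # `prev` models that so no lane sees a value from this step.
        prev = out[:]
        for i in range(stride, n):
            out[i] = prev[i] | prev[i - stride]
        stride *= 2
        steps += 1
    return out, steps


if __name__ == "__main__":
    result, steps = prefix_or([0, 0, 1, 0, 0, 0, 1, 0])
    print(result, steps)
```

For 8 lanes this takes 3 combining steps. Whether those steps fit in one clock cycle or several depends on the lane count and the timing budget, which is what I am asking about.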
I think it would be useful to compare the two designs and weigh their trade-offs.
Dawei