Re: Duplicate Counting Instruction
vmhash should be cheap relative to the work you're doing in each loop
iteration. Redoing vmhash in each stripmine iteration could lead to
better performance, because you find longer non-conflicting index runs
rather than always stopping at each VLMAX point.
Krste
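
For readers following along, here is a minimal scalar sketch in Python of
the "redo the conflict check in every stripmine iteration" idea. It is not
the code discussed in this thread, and it assumes semantics the thread does
not spell out: first_conflict() is a hypothetical stand-in for vmhash
followed by a find-first-set on the resulting mask, and VLMAX = 4 is an
arbitrary illustrative value.

```python
VLMAX = 4  # hypothetical maximum vector length, in elements


def first_conflict(window):
    """Length of the longest prefix of `window` with no repeated index.

    Assumed stand-in for a conflict-detect instruction (vmhash) followed
    by a find-first-set on its output mask.
    """
    seen = set()
    for lane, idx in enumerate(window):
        if idx in seen:
            return lane
        seen.add(idx)
    return len(window)


def histogram_rescan(indices, nbins):
    """Count occurrences of each index, re-running the check every strip."""
    counts = [0] * nbins
    strips = 0
    pos = 0
    while pos < len(indices):
        window = indices[pos:pos + VLMAX]
        # Lane 0 never conflicts, so vl >= 1 and the loop always advances.
        vl = first_conflict(window)
        # Gather, add, scatter: safe because the first vl lanes are distinct.
        gathered = [counts[i] for i in window[:vl]]
        for i, g in zip(window[:vl], gathered):
            counts[i] = g + 1
        # The next strip starts right after this run, not at a VLMAX boundary.
        pos += vl
        strips += 1
    return counts, strips
```

Because each strip starts where the previous conflict-free run ended, a run
can straddle what would otherwise be a fixed VLMAX boundary.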
On Mon, 06 Jul 2020 04:14:55 -0700, "lidawei14 via lists.riscv.org" <lidawei14=huawei.com@...> said:
| Hi Krste,
| I read through your code, and thanks for correcting my errors; 'or' is a good idea for multiple duplicates.
| Here I'd like to explain why I made things a bit more complicated in my code.
| Your code also fixes duplicates from least to most using the v0 mask. But if we are sure that vhash
| takes many more cycles to execute, we can try to execute vhash only once, since it outputs all
| duplicate lanes at one time. We can then store the duplicates in a mask register and fix them using
| this register as a reference. We resolve duplicates from least to most as usual, masking off the
| duplicate lanes that have already been resolved. As a result, we only need to iterate over the patch-up
| loop, without re-executing the vhash instruction.
| The code does not require any change to the vhash instruction design; we just demonstrated
| that it works for memory hazard problems.
|
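
The "run vhash once, then patch up" scheme described in the quoted message
can be modelled in scalar Python as below. This is only a sketch under
stated assumptions, not the original code: duplicate_mask() is a
hypothetical model of vhash that flags a lane when its index matches any
lower-numbered lane, VLMAX = 4 is arbitrary, and the patch-up loop resolves
one lane per iteration for simplicity.

```python
VLMAX = 4  # hypothetical maximum vector length, in elements


def duplicate_mask(window):
    """Assumed model of vhash: lane i is flagged when its index equals
    the index held by some lower-numbered lane."""
    seen = set()
    mask = []
    for idx in window:
        mask.append(idx in seen)
        seen.add(idx)
    return mask


def histogram_patch_up(indices, nbins):
    """Count occurrences of each index with one conflict check per strip."""
    counts = [0] * nbins
    for pos in range(0, len(indices), VLMAX):
        window = indices[pos:pos + VLMAX]
        # Computed once per strip and kept as the reference mask.
        pending = duplicate_mask(window)
        # Vector pass over the non-duplicate lanes: gather, add, scatter.
        lanes = [lane for lane, dup in enumerate(pending) if not dup]
        gathered = [counts[window[lane]] for lane in lanes]
        for lane, g in zip(lanes, gathered):
            counts[window[lane]] = g + 1
        # Patch-up loop: resolve duplicates from lowest lane to highest,
        # clearing each lane in the mask once it has been handled.
        while any(pending):
            lane = pending.index(True)
            counts[window[lane]] += 1
            pending[lane] = False
    return counts
```

In this model the strips always start on fixed VLMAX boundaries, which is
the behaviour the reply above contrasts with recomputing the conflict check
in every stripmine iteration.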