I applaud the ordinal nature of the mask structure that is independent of SEW and LMUL.
The problem that I have is the mismatch between units of measure: element lengths are in bytes, while the mask is in bits. This unnecessarily compounds skew in the ordinal alignment.
Although the internal structure of the mask is simpler, and simpler to express, the byte/bit mismatch means that it does:
- not correlate with the reality of element units
- not provide a usable mapping of elements across physical registers
- not lend itself to element-to-mask analysis
- not optimize wiring for any element width; indeed it is poor for all element widths
It is however good for vfirst and related mask ops.
I botched the formula in #448.
I should have proposed:
bit_location_of_mask [i] = ( i * 8 ) % VLEN + floor (( i * 8 ) / VLEN) /8
This places the mask bits at the corresponding byte alignment, with each bit in the byte identifying one of the 8 physical registers.
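As a rough illustration of the mapping described above, here is a minimal Python sketch. It assumes SEW=8 and LMUL=8, and it takes the register index floor((i * 8) / VLEN) directly as the bit offset within the byte, which is an assumption about how the formula is meant to be read. The function name `byte_aligned_mask_bit` is hypothetical.

```python
def byte_aligned_mask_bit(i: int, vlen: int) -> int:
    """Mask bit position for byte element i (sketch, assumes SEW=8).

    Assumption: the register index is used directly as the bit offset
    within the byte, so each bit of a mask byte names one register.
    """
    bit_offset = i * 8
    byte_slot = bit_offset % vlen   # bit position of the element's byte within its register
    register = bit_offset // vlen   # which of the (up to 8) physical registers holds element i
    return byte_slot + register


VLEN = 128
# With LMUL=8 and SEW=8 there are VLEN elements in the register group.
positions = [byte_aligned_mask_bit(i, VLEN) for i in range(VLEN)]

for i in (0, 1, 15, 16, 17, 127):
    print(i, positions[i])
```

Under these assumptions, every element gets a distinct mask bit, and elements at the same byte position in different registers share one mask byte.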
Why choose byte for the alignment?
1) byte is the smallest unit for data alignment and
2) byte has the highest cardinality for max LMUL and a given VLEN.
For each successive element size, correlation drops off by a factor of 4.
Granted, muxing onto a bus and muxing to the operation units will substantially mitigate the skew, but will not eliminate it entirely, as the sub-byte clustering serves no masking purpose.
As noted above, it does have value in scanning for the first set mask bit.
What is the appropriate trade-off from a microarchitecture standpoint?
I will defer to others.
But the software benefits are mentioned in the points above.
I spent an inordinate amount of time deriving a wiring metric based on
SUM ( abs ( bit_location_of_element [i] - bit_location_of_mask [i] ) ), where bit location is modulo VLEN
No surprise that it shows a substantial benefit of the byte alignment over strict bit order.
And no surprise that the variation is proportional to VLEN**2.
Nor that, for an EEW of byte and the byte-optimized formula, the horizontal cost is effectively zero.
Half and word, etc. benefit too, but of course with lesser weighting.
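The wiring metric described above could be sketched as follows. This is only an approximation of the idea, not the derivation I actually used: it assumes LMUL=8, assumes strict bit order puts mask bit i at bit i, and reads the byte-aligned formula with the register index as the bit offset within the byte. All function names are hypothetical.

```python
def element_bit(i, sew, vlen):
    """Bit location of element i, modulo VLEN."""
    return (i * sew) % vlen


def strict_mask_bit(i, vlen):
    """Strict bit order: mask bit i sits at bit i (assumption)."""
    return i % vlen


def byte_aligned_mask_bit(i, vlen):
    """Byte-aligned layout: byte slot plus register index (assumed reading)."""
    return ((i * 8) % vlen + (i * 8) // vlen) % vlen


def wiring_cost(mask_bit, sew, vlen, lmul=8):
    """Sum of |element location - mask location| over all elements."""
    n = vlen * lmul // sew  # number of elements in the register group
    return sum(abs(element_bit(i, sew, vlen) - mask_bit(i, vlen))
               for i in range(n))


for vlen in (128, 256):
    strict = wiring_cost(strict_mask_bit, 8, vlen)
    aligned = wiring_cost(byte_aligned_mask_bit, 8, vlen)
    print(vlen, strict, aligned)
```

For SEW=8, the byte-aligned distance for each element is just its register index (0 to 7), so the total stays small. The strict-order total should grow roughly fourfold when VLEN doubles.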