My apologies, especially to those who have sent some feedback.

I had thought I had already sent this second iteration (it has
been on the GitHub issue since Monday).

**A slightly less simple design to partially address the
destructive nature of register overlay.**

There are some undesirable characteristics of iteration zero of
the simplest fractional LMUL design.

- Use of any fractional sub-group is destructive to the
underlying base-arch register.
- As sub-groups have less capacity than the underlying
base-arch register, overall usable capacity is also diminished,
by up to 7/8ths of VLEN for each active sub-group.

**Because the low (zero) elements of the sub-group align with
the low elements of the base-arch register in the overlay, the
sub-group sits in the active portion of the base-arch register,
making the destructive impact unavoidable.** Similarly, an
operation that writes to the base-arch register overwrites at
least some of the register sub-group.

I use the term “active” loosely. Technically the active
portion is only defined while an operation is acting on the
register. Regardless, most of the time vstart will be zero,
and so the active portion would start from zero on the *next
operation*.

**However, if instead the high (VLMAX) ends of the base-arch
register and the register sub-group are aligned, then judicious
use of vl can avoid mutually assured destruction.**
Register names would remain in one-to-one correlation.
However, the register sub-groups would start at 1/2 VLEN, 3/4 VLEN
and 7/8 VLEN, depending upon the fractional LMUL.

```
VLEN 1/1 7/8 3/4 1/2 0
---------------------------------------------------------------------
LMUL
1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1/2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1/4 xxxxxxxxxxxxxxxx
1/8 xxxxxxxx
```
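With the sub-groups aligned to the top of the register as in the
diagram, the starting element offset of each fractional sub-group
falls out of one expression. A minimal sketch (the function name
and use of `Fraction` are mine, purely illustrative):

```python
from fractions import Fraction

def sub_group_start(vlen_elems, frac):
    """Starting element index of a fractional sub-group aligned to
    the top (VLMAX end) of the base-arch register: an LMUL=frac
    sub-group occupies the top frac of the register, so it begins
    at (1 - frac) * VLEN elements."""
    return int((1 - Fraction(frac)) * vlen_elems)

# For a base register holding 64 elements (as in the diagram):
print(sub_group_start(64, Fraction(1, 2)))  # 32 -> starts at 1/2 VLEN
print(sub_group_start(64, Fraction(1, 4)))  # 48 -> starts at 3/4 VLEN
print(sub_group_start(64, Fraction(1, 8)))  # 56 -> starts at 7/8 VLEN
```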

**Consider when LMUL=1, tail-undisturbed is active, and
VLEN is a power of 2.**

If vl is less than or equal to 1/2 VLMAX, then an LMUL=1/2, 1/4,
or 1/8 register sub-group is fully in the tail of the base-arch
register.

Similarly, with vl of 3/4 VLMAX or less, the tail fully
encompasses an LMUL=1/4 (or 1/8) register sub-group.

The same applies for vl <= 7/8 VLMAX and an LMUL=1/8 register
sub-group.

```
VLEN 1/1 7/8 3/4 1/2 0
------------------------------------------------------------------------
LMUL
1/2reg xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx---------vl <= 1/2 VLMAX--------
1/4reg xxxxxxxxxxxxxxxx------------------vl <= 3/4 VLMAX---------------
1/8reg xxxxxxxx-------------------vl <= 7/8 VLMAX----------------------
```
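The three cases above are instances of a single condition: a
top-aligned LMUL=frac sub-group is safe whenever
vl <= (1 - frac) * VLMAX. A hedged sketch of that predicate (the
function name is mine, not from any proposal):

```python
from fractions import Fraction

def tail_protects(vl, vlmax, frac):
    """True when the tail of a tail-undisturbed operation with the
    given vl fully covers a top-aligned LMUL=frac sub-group: the
    sub-group occupies elements [(1 - frac)*VLMAX, VLMAX), so it
    is untouched whenever vl <= (1 - frac) * VLMAX."""
    return vl <= (1 - Fraction(frac)) * vlmax

vlmax = 64
assert tail_protects(32, vlmax, Fraction(1, 2))      # vl <= 1/2 VLMAX
assert tail_protects(48, vlmax, Fraction(1, 4))      # vl <= 3/4 VLMAX
assert tail_protects(56, vlmax, Fraction(1, 8))      # vl <= 7/8 VLMAX
assert not tail_protects(57, vlmax, Fraction(1, 8))  # one element too many
```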

**In the perfect scenario, all registers will be used to
their maximum with fractional LMUL support.**

Note: Up to 32 base-arch registers and 96 register sub-groups
(32 each at LMUL=1/2, 1/4, and 1/8) can be "alive" at a given
time.

Only 32 can be active at a time, with a single vsetvl[i]
instruction enabling each set of 32.

With appropriate values of SLEN, LMUL>1 can also use the
reduced vl to allow consecutive fractional register sub-groups to
co-exist.

Nor is the technique restricted to LMUL >= 1: LMUL=1/2 can
tail-protect 1/4 and 1/8, and LMUL=1/4 can tail-protect 1/8.

Note: **Masks can be better managed with this design.***
As with non-mask registers, appropriate vl allows the tail of
a mask register, including mask register zero, to be used for
fractional register sub-groups.

( * better than “iteration zero” design.)

Further, the fractional register sub-group can store the mask
bits significant at the current LMUL with a single instruction:

```
vsbc.vvm vn, v0, v0, vmx
# vn is the destination fractional register; vmx is any mask register.
# vn[i] = (vmx's mask bit i set) ? -1 : 0
# v0, v0 could be any register designations as long as they are the
# same register (or the two registers have the same contents).
```

and a single instruction re-enables it in mask v0:

```
vmsne.vi v0, vn, 0   # v0 mask bit i = (vn[i] != 0)
```
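As a plausibility check, here is a small Python model of the
element-wise semantics of this spill/restore pair (function names
are mine; this is a sketch of the element behaviour, not real RVV
code):

```python
def spill_mask(mask_bits):
    """Model of `vsbc.vvm vn, v0, v0, vmx`: x - x - borrow gives
    vn[i] = -1 where the mask bit is set, 0 where it is clear."""
    return [-1 if b else 0 for b in mask_bits]

def restore_mask(vn):
    """Model of `vmsne.vi v0, vn, 0`: mask bit i = (vn[i] != 0)."""
    return [1 if e != 0 else 0 for e in vn]

mask = [1, 0, 0, 1, 1, 0, 1, 0]
assert restore_mask(spill_mask(mask)) == mask  # round-trips exactly
```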

Note: **Not only algorithms that widen/narrow may
benefit**. Algorithms with non-power-of-two data
usage (consider Fibonacci-based structures) may benefit
especially. The fractional sub-groups allow additional data
(of any SEW) to reside, and operations on it to proceed, in the
unused tail sections of base-arch registers.

Note: **Implementations with a VLEN that is not a power
of 2 (say 3 * 2 ** n) could provide the best of both
worlds.** Algorithms work optimally with vl at a
power of 2, with fractional operations in the remaining tail.
(And of course one can mix and match on a base-arch or even
sub-group basis.)
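For concreteness, a sketch of that split (names mine, and the 2/3
vs 1/3 proportions are simply what VLMAX = 3 * 2 ** n implies):

```python
def split_vlmax(vlmax):
    """Split VLMAX into the largest power-of-2 vl that fits, plus
    the leftover tail available for fractional sub-group work."""
    pow2_vl = 1 << (vlmax.bit_length() - 1)
    return pow2_vl, vlmax - pow2_vl

# VLMAX = 3 * 2**n elements, e.g. n = 4 -> VLMAX = 48:
assert split_vlmax(48) == (32, 16)  # power-of-2 vl = 2/3, tail = 1/3
assert split_vlmax(24) == (16, 8)
```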

**However, this is still not fully ideal.**

- Whereas many algorithms may tolerate a halving of vl, some
will require substantial modification to support a 3/4 or 7/8
VLMAX.
- Hardware support is complicated by register sub-groups
starting at a non-zero offset. Implementations with comprehensive
vstart support can “just” adjust it appropriately.
- It does not address the SLEN optimization deficiency already
noted. Indeed, in some implementations it could aggravate the
situation, as propagation for widening and narrowing operations
runs in the opposite direction compared with LMUL > 1.
- Minimal implementations benefit least from this; consider, for
example, that 7/8ths VLMAX is only meaningful when VLMAX is
greater than 7.