issue #393  Towards a simple fractional LMUL design.
I'm sending out to the correct mailing list a copy of the
revised issue #393.
(link:
https://github.com/riscv/riscv-v-spec/issues/393 )
This was requested at the last TG meeting.
I believe it is consistent with casual discussions of fractional LMUL, and it is intended to formalize a design.
To follow is a consideration of alternate register overlap to improve usability.
The issue #393 update adds to the Glossary and notes that mask registers and operations are unchanged from the plan of record.
Towards a simple fractional LMUL design.
Background:
Prior to LMUL, an elaborate mapping of register numbers to various-width elements under different configuration settings, allowing for polymorphic operations, was proposed.
LMUL was introduced in a pre-0.5 draft (Nov 2018) in conjunction with
widening operations and SEW widths.
The LMUL>1 mapping of a register group is one to a power of 2
of consecutive, non-overlapping base-arch registers. The naming
uses the lowest base-arch register participating in the register
group.
The number of LMUL registers is diminished by the same power of
2.
This design was substantially less complex than the predecessor,
with simple constructs:

- LMUL in powers of 2, aligning with the widening-by-2 operations
- abandonment of previous ideas of sequences like 1,2,3,4,5,6,8,10,16,32
- consecutive registers in register groups, aligned and addressed on multiples of LMUL
This issue will look at the simplest implementations of fractional LMUL.
Glossary:
base-arch registers* – the 32 registers addressable when LMUL=1
register group – consecutive registers determined by LMUL>1
register subgroup* – portion of a physical register used by LMUL<1
SLEN – the striping distance in bits
VLEN – the number of bits in a vector register
VLMAX – LMUL * VLEN / SEW
(no name is given to the effective VLEN at different values of LMUL)
vstart – read/write CSR specifying the index of the first element to be
executed by a vector instruction.
( * whereas the other terms are from the spec, these * terms are added
for this discussion)
Guidance.
Fractional LMUL follows the same rules as LMUL>=1, and the VLMAX
formula applies unchanged.
The simplest extensions to the base retain its
fundamental characteristics.
Specifically then, for this proposal, ELEN, SEW (and its
encoding in vtype), VLEN, mask register zero, and mask
operation behaviour are not changed.
The simplest extension of LMUL to “fractional” is that
the observed effects continue predictably.
Specifically,

- for changes in LMUL there is a corresponding change in VLMAX, and
- fractional LMUL changes by a factor of 2 from adjacent settings.
For LMUL >=1, VLMAX = LMUL * VLEN/SEW
Note: if SEW is unchanged, with variation of LMUL there is a
proportional change in VLMAX.
We can multiply both sides by SEW to get LMUL * VLEN = VLMAX *
SEW.
This table exhaustively represents this simplest extension's effect when SEW is unchanged throughout:
LMUL    VLMAX * SEW
8       8 * VLEN
4       4 * VLEN
2       2 * VLEN
1       VLEN
1/2     VLEN/2
1/4     VLEN/4
1/8     VLEN/8
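To make the table concrete, here is a small Python sketch (my own illustration, not part of the proposal; the machine parameters are hypothetical) that evaluates VLMAX = LMUL * VLEN / SEW across these LMUL settings and checks the LMUL * VLEN = VLMAX * SEW identity:

```python
from fractions import Fraction

def vlmax(lmul: Fraction, vlen: int, sew: int) -> int:
    # VLMAX = LMUL * VLEN / SEW, exactly as in the table above
    return int(lmul * vlen / sew)

# Hypothetical machine parameters, chosen only for illustration.
VLEN, SEW = 512, 32
for lmul in (Fraction(8), Fraction(4), Fraction(2), Fraction(1),
             Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)):
    # With SEW unchanged, VLMAX scales proportionally with LMUL,
    # so LMUL * VLEN == VLMAX * SEW in every row of the table.
    assert lmul * VLEN == vlmax(lmul, VLEN, SEW) * SEW
```

For these example values (VLEN=512, SEW=32), VLMAX runs from 128 at LMUL=8 down to 2 at LMUL=1/8.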
Fractional registers then have diminished capacity, 1/2 to 1/8th of a base-arch register.
The simplest mapping of fractional LMUL registers is
one to one (and only one) of the base-arch registers.
All 32 base-arch registers can participate, and register
numbering can be the same.
The simplest overlay (analogous to the register group
overlay of consecutive base-arch registers) is with the low
(zero) elements overlaying.
That is, the fractional register subgroup occupies the lowest
consecutive bytes in the base-arch register. The bytes are in
the same ascending order.
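As a sketch of this iteration-zero overlay (my own model, with a hypothetical VLEN), the bytes of the base-arch register that a fractional subgroup occupies are just the lowest LMUL fraction of its VLEN/8 bytes:

```python
from fractions import Fraction

def subgroup_bytes(lmul: Fraction, vlen: int) -> range:
    # Iteration zero: the fractional subgroup occupies the lowest
    # consecutive bytes of the base-arch register, in the same
    # ascending order, so its byte range is [0, LMUL * VLEN/8).
    return range(0, int(lmul * vlen // 8))

VLEN = 128  # hypothetical implementation width, in bits
assert subgroup_bytes(Fraction(1, 2), VLEN) == range(0, 8)  # low 64 bits
assert subgroup_bytes(Fraction(1, 8), VLEN) == range(0, 2)  # low 16 bits
```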
I call this iteration zero of the simplest fractional LMUL designs.
Note: Mask behaviour does not change. Mask operations read and write a base-arch register. Base-arch register zero remains the default mask register. With this "iteration zero" design, as with LMUL>=1, fractional LMUL “register zero”s are substantially limited in their use.
There are some undesirable characteristics of this design.

- Use of any fractional subgroup is destructive to the underlying base-arch register.
- As subgroups have less capacity than the underlying base-arch register, overall usable capacity is also diminished, by up to 7/8ths of VLEN for each active subgroup.
- Such subgroups are not optimized for widening operations.
- There is no equivalent to SLEN to align single-width with widened operands.
My apologies, especially to those who have sent some feedback.
I had thought I had already sent this second iteration. (It has been on the GitHub issue since Monday.)
A slightly less simple design to partially address the destructive nature of the register overlay.
There are some undesirable characteristics of iteration zero of the simplest fractional LMUL design.
- Use of any fractional subgroup is destructive to the underlying base-arch register.
- As subgroups have less capacity than the underlying base-arch register, overall usable capacity is also diminished, by up to 7/8ths of VLEN for each active subgroup.
Because the low (zero) elements align in the overlay, the subgroup is in the active portion of the base-arch register, so the destructive impact is unavoidable. Similarly, an operation that writes to the base-arch register overwrites at least some of the register subgroup.
I use the term “active” loosely. Technically, the active portion is only defined while an operation is acting on the register. Regardless, most of the time vstart will be zero, and so the active portion would start from element zero on the next operation.
However, if instead the high ends of the base-arch register and the register subgroup are aligned, then judicious use of vl can avoid mutually assured destruction. Register names would remain in the one-to-one correlation. However, the register subgroups would start at 1/2 VLEN, 3/4 VLEN, or 7/8 VLEN, depending upon the fractional LMUL.
VLEN  1/1     7/8     3/4             1/2                             0
LMUL
1     xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1/2   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1/4   xxxxxxxxxxxxxxxx
1/8   xxxxxxxx
Consider when LMUL=1, tail-undisturbed is active, and
VLEN is a power of 2.
If vl is less than or equal to 1/2 VLMAX, then an LMUL=1/2, 1/4, or
1/8 register subgroup is fully in the tail of the base-arch
register.
Similarly, with vl of 3/4 VLMAX or less, the tail fully
encompasses an LMUL=1/4 (or 1/8) register subgroup.
The same applies for vl <= 7/8 VLMAX for an LMUL=1/8 register
subgroup.
VLEN    1/1     7/8     3/4             1/2                             0
LMUL
1/2reg  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                                  vl <= 1/2 VLMAX
1/4reg  xxxxxxxxxxxxxxxx                                                  vl <= 3/4 VLMAX
1/8reg  xxxxxxxx                                                          vl <= 7/8 VLMAX
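The vl thresholds in the diagram reduce to a single predicate. Here is a Python sketch of it (names are mine, not from the spec): a top-aligned subgroup of fraction sub_lmul is safe when the active elements [0, vl) stay below it.

```python
from fractions import Fraction

def tail_protected(vl: int, vlmax: int, sub_lmul: Fraction) -> bool:
    # With tail-undisturbed and VLEN a power of 2, a subgroup aligned
    # to the top of the base-arch register occupies its high sub_lmul
    # fraction; it sits fully in the tail when
    #   vl <= (1 - sub_lmul) * VLMAX.
    return vl <= (1 - sub_lmul) * vlmax

VLMAX = 16  # hypothetical, e.g. VLEN=512, SEW=32, LMUL=1
assert tail_protected(8, VLMAX, Fraction(1, 2))       # vl <= 1/2 VLMAX
assert tail_protected(12, VLMAX, Fraction(1, 4))      # vl <= 3/4 VLMAX
assert not tail_protected(15, VLMAX, Fraction(1, 8))  # 7/8 VLMAX = 14
```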
In the perfect scenario, registers will all be used to their maximum with fractional LMUL support.
Note: Up to 32 base-arch registers and 96 register subgroups can be "alive" at a given time.
Only 32 can be active at a time, with a single vsetvl[i] instruction enabling each set of 32.
With appropriate values of SLEN, LMUL>1 can also use the reduced vl to allow consecutive fractional register subgroups to coexist.
Nor is the technique restricted to LMUL >= 1: LMUL=1/2 can tail-protect 1/4 and 1/8; and LMUL=1/4 can tail-protect 1/8.
Note: Masks can be better managed with this design.* As with non-mask registers, an appropriate vl allows the tail of a mask register, including mask register zero, to be used for fractional register subgroups.
( * better than “iteration zero” design.)
Further, the fractional register subgroup can store the current LMUL significant mask bits with a single instruction:
vsbc.vvm vn, v0, v0, vmx
# where vn is the destination fractional register, vmx is any mask register.
# vn[i] = all 1s if vmx's mask bit i is set, else 0
# v0, v0 could be any register designation as long as they are the same
# (or the registers have the same contents).
and a single instruction to enable it in mask v0:
vmsne.vi v0, vn, 0  # v0[i] = (vn[i] != 0)
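To illustrate what the two-instruction sequence computes per element, here is a rough Python model (my own sketch; vsbc is paraphrased as subtract-with-borrow modulo 2**SEW, and the SEW value is hypothetical):

```python
SEW = 8  # hypothetical element width, in bits

def vsbc_vvm(vs2, vs1, mask):
    # vd[i] = (vs2[i] - vs1[i] - borrow_in) mod 2**SEW, borrow from mask.
    # With vs2 == vs1 the result is all-ones wherever the mask bit is
    # set, and zero elsewhere.
    return [(a - b - m) % (1 << SEW) for a, b, m in zip(vs2, vs1, mask)]

def vmsne_vi(vs, imm):
    # mask[i] = (vs[i] != imm)
    return [int(x != imm) for x in vs]

vmx = [1, 0, 1, 1]      # source mask bits held in vmx
zeros = [0, 0, 0, 0]    # stand-in for the identical v0, v0 operands
vn = vsbc_vvm(zeros, zeros, vmx)  # capture the mask into data elements
assert vmsne_vi(vn, 0) == vmx     # restore it into mask form
```

Note that the captured elements are all-ones (255 at SEW=8) rather than 1, which is why the restore compares against zero rather than against one.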
Note: Not only algorithms that widen/narrow may benefit. Algorithms with non-power-of-two data usage (consider Fibonacci-based structures) may especially. The fractional subgroups allow residence of additional data (of any SEW), and operations on them can proceed in the unused tail sections of a base-arch register.
Note: Implementations with a VLEN that is not a power of 2 (say 3 * 2**n) could provide the best of both worlds: algorithms working optimally with vl at a power of 2, and fractional operations in the remaining tail. (And of course one can mix and match on a base-arch or even subgroup basis.)
However, this is still not fully ideal.
- Whereas many algorithms may tolerate a halving of vl, some will require substantial modification to support a 3/4 or 7/8 VLMAX.
- Hardware support is complicated by register subgroups starting at a nonzero offset. Implementations with comprehensive vstart support can “just” adjust it appropriately.
- It does not address the SLEN optimization deficiency already noted. Indeed, it could in some implementations aggravate the situation, as widening and narrowing operations propagate in opposite directions for fractional LMUL versus LMUL > 1.
- Minimal implementations benefit least from this; consider, for example, that 7/8ths VLMAX is only meaningful when VLMAX is greater than 7.