issue #393 - Towards a simple fractional LMUL design.
I'm sending a copy of the revised issue #393 to the correct mailing list.
(link:
https://github.com/riscv/riscv-v-spec/issues/393 )
This was requested at the last TG meeting.
I believe it is consistent with casual discussions of fractional LMUL, and it is intended to formalize a design.
A consideration of alternate register overlap to improve usability will follow. The issue #393 update adds to the Glossary and notes that mask registers and operations are unchanged from the plan of record.
Towards a simple fractional LMUL design.
Background:
Prior to LMUL, an elaborate mapping of register numbers to various element widths under different configuration settings, allowing for polymorphic operations, was proposed.
LMUL was introduced in a pre-v0.5 draft (Nov 2018) in conjunction with widening operations and SEW widths.
For LMUL>1, a register group maps one-to-one onto a power-of-2 number of consecutive, non-overlapping base-arch registers. The group is named by the lowest-numbered base-arch register participating in it.
The number of LMUL register groups is diminished by the same power of 2.
This design was substantially less complex than its predecessor, with simple constructs like:
- LMUL in powers of 2, aligning with the widening-by-2 operations and abandoning previous ideas of sequences like 1,2,3,4,5,6,8,10,16,32
- consecutive registers in register groups, aligned and addressed on multiples of LMUL
This issue will look at the simplest implementations of fractional LMUL.
Glossary:
base-arch registers* – the 32 registers addressable when LMUL=1
register group – consecutive registers determined by LMUL>1
register sub-group* – portion of physical register used by LMUL<1
SLEN – the striping distance in bits
VLEN – the number of bits in a vector register
VLMAX – LMUL * VLEN / SEW
. . no name is given to the effective VLEN at different values of LMUL
vstart – read-write CSR that specifies the index of the first element to be executed by a vector instruction
( * whereas the other terms are from the spec, these * terms are added for this discussion)
Guidance.
Fractional LMUL follows the same rules as for LMUL>=1.
The VLMAX formula applies in the same way.
The simplest extensions to the base retain the
fundamental characteristics.
Specifically, for this proposal, ELEN, SEW (and its encoding in vtype), VLEN, mask register zero, and mask operation behaviour are not changed.
The simplest extension of LMUL to “fractional” is one in which the observed effects continue predictably.
Specifically,
- for changes in LMUL there is a corresponding change in VLMAX, and
- fractional LMUL changes by a factor of 2 from adjacent settings.
For LMUL >=1, VLMAX = LMUL * VLEN/SEW
Note: if SEW is unchanged, with variation of LMUL there is a
proportional change in VLMAX.
We can multiply both sides by SEW to get LMUL * VLEN = VLMAX *
SEW.
This table exhaustively represents the effect of this simplest extension when SEW is unchanged throughout:
LMUL | VLMAX * SEW
8    | 8*VLEN
4    | 4*VLEN
2    | 2*VLEN
1    | VLEN
1/2  | VLEN/2
1/4  | VLEN/4
1/8  | VLEN/8
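To make the proportionality concrete, here is a minimal sketch in C (my own illustration, not from the spec or any intrinsics API; the VLEN and SEW values are arbitrary example parameters) that evaluates VLMAX = LMUL * VLEN / SEW for each LMUL setting in the table, with fractional LMUL represented as a numerator/denominator pair:

#include <stdio.h>

/* VLMAX = LMUL * VLEN / SEW, with LMUL expressed as lmul_num/lmul_den so
 * the fractional settings 1/2, 1/4, 1/8 work in integer arithmetic. */
static unsigned vlmax(unsigned vlen, unsigned sew,
                      unsigned lmul_num, unsigned lmul_den)
{
    return (vlen * lmul_num) / (sew * lmul_den);
}

int main(void)
{
    const unsigned vlen = 512, sew = 32;          /* example parameters only */
    const unsigned num[] = {8, 4, 2, 1, 1, 1, 1}; /* LMUL numerators   */
    const unsigned den[] = {1, 1, 1, 1, 2, 4, 8}; /* LMUL denominators */

    for (int i = 0; i < 7; i++) {
        unsigned v = vlmax(vlen, sew, num[i], den[i]);
        /* VLMAX * SEW = LMUL * VLEN, matching the table row by row. */
        printf("LMUL=%u/%u  VLMAX=%3u  VLMAX*SEW=%4u bits\n",
               num[i], den[i], v, v * sew);
    }
    return 0;
}

With VLEN=512 and SEW=32 this prints VLMAX*SEW values from 4096 down to 64 bits, i.e. 8*VLEN down to VLEN/8, matching the table.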
Fractional registers then have diminished capacity, 1/2 to 1/8th of a base-arch register.
The simplest mapping of fractional LMUL registers is one-to-one: each maps onto one (and only one) of the base-arch registers.
All 32 base-arch registers can participate, and register numbering can be the same.
The simplest overlay (analogous to the register-group overlay of consecutive base-arch registers) is one that begins at element zero of the base-arch register.
That is, the fractional register sub-group occupies the lowest
consecutive bytes in the base-arch register. The bytes are in
the same ascending order.
I call this iteration zero of the simplest fractional LMUL designs.
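As an illustration of this "iteration zero" overlay, here is a small C sketch (the struct and helper are hypothetical, invented purely for this discussion) that models a base-arch register as a byte array and writes a fractional sub-group element into its lowest consecutive bytes, in the same ascending order as at LMUL=1:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define VLEN_BITS  512
#define VLEN_BYTES (VLEN_BITS / 8)

/* One base-arch vector register modelled as a plain byte array. */
typedef struct { uint8_t bytes[VLEN_BYTES]; } vreg_t;

/* Hypothetical helper: write element 'idx' (SEW = sew_bits) into the
 * fractional sub-group of register 'r' under LMUL = 1/frac_den.  In
 * "iteration zero" the sub-group simply occupies the lowest
 * VLEN/frac_den bytes of the base-arch register, so element offsets
 * need no remapping relative to LMUL=1. */
static void subgroup_write(vreg_t *r, unsigned frac_den,
                           unsigned sew_bits, unsigned idx,
                           const uint8_t *elem)
{
    unsigned sew_bytes      = sew_bits / 8;
    unsigned subgroup_bytes = VLEN_BYTES / frac_den;  /* sub-group capacity     */
    unsigned offset         = idx * sew_bytes;        /* same layout as LMUL=1  */

    /* Only indices below VLMAX for this fractional LMUL fit in the sub-group. */
    if (offset + sew_bytes <= subgroup_bytes)
        memcpy(&r->bytes[offset], elem, sew_bytes);
    /* Because the sub-group shares storage with the low bytes of the
     * base-arch register, any write to it overwrites part of that
     * register - the "destructive" property noted below. */
}

int main(void)
{
    vreg_t v = {0};
    uint32_t e = 0x11223344;                           /* one 32-bit element   */
    subgroup_write(&v, 8, 32, 1, (const uint8_t *)&e); /* LMUL=1/8, element 1  */
    printf("bytes 4..7 of the base-arch register: %02x %02x %02x %02x\n",
           v.bytes[4], v.bytes[5], v.bytes[6], v.bytes[7]);
    return 0;
}

With VLEN=512 and LMUL=1/8 the sub-group is the lowest 8 bytes of the base-arch register, so element 1 of a 32-bit SEW lands in bytes 4..7.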
Note: Mask behaviour does not change. Mask operations read and write to a base-arch register. Base-arch register zero remains the default mask register. With this "iteration zero" design, as with LMUL>=1, fractional LMUL “register zero”s are substantially limited in their use.
There are some undesirable characteristics of this design:
- Use of any fractional sub-group is destructive to the underlying base-arch register. As sub-groups have less capacity than the underlying base-arch register, overall usable capacity is also diminished, by up to 7/8ths of VLEN for each active sub-group.
- Such sub-groups are not optimized for widening operations. There is no equivalent to SLEN to align single-width with widened operands.