#### A simple fractional LMUL proposal

Krste Asanovic

I've been wading through the fractional LMUL discussion on GitHub but
believe the simple basic solution below meets the immediate needs,
without blocking possible reuse of unused register fields later. I
want to put this out there to provide a baseline strawman against
which to compare the other more exotic variants.

The proposed mapping is given below.

* For machines with SLEN=VLEN, the microarchitectural modification to
support fractional LMUL is very minor. The main changes are to add
the additional bit in vtype to support the additional LMUL values, and
to have setvl calculations take the fractional LMUL into account when
calculating VLMAX and setting vl. The only effect is to execute
instructions with shorter vl than, but otherwise identically to,
existing LMULs.

* For machines with SLEN<VLEN, the simple "reduce VL" doesn't quite
work. Instead each SLEN-wide partition has to reduce VL locally. This
is shown in the figures below. Even this is not too large a change as
datapath wiring stays the same and it's mainly an issue of turning off
unused portions of the datapath, though in new patterns.
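
As a rough illustration of the setvl change for the SLEN=VLEN case, the sketch below computes VLMAX as LMUL*VLEN/SEW and clamps the requested length to it. The function names and the plain min() rule are assumptions made for this example only; a real setvl may pick other legal values when the requested length exceeds VLMAX.

```python
from fractions import Fraction


def vlmax(vlen, sew, lmul):
    """VLMAX = LMUL * VLEN / SEW, with LMUL allowed to be a fraction."""
    return int(Fraction(lmul) * vlen / sew)


def setvl(avl, vlen, sew, lmul):
    """Simplified setvl: clamp the requested length (avl) to VLMAX."""
    return min(avl, vlmax(vlen, sew, lmul))


# VLEN=256b, SEW=8b: halving LMUL halves VLMAX.
for lmul in (1, Fraction(1, 2), Fraction(1, 4), Fraction(1, 8)):
    print(lmul, vlmax(256, 8, lmul))
```

With VLEN=256b and SEW=8b this gives 32, 16, 8 and 4 elements, which matches the number of element slots drawn in the SEW=8b rows of the figures.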

I'm not in favor of shifting the used portion to the top of the
register to enable scalar values or short vectors to use the space
below, as this would change the way fractional LMUL vector
instructions read out values and complicate chaining and interlock
checks for simple baselines. I believe there are cleaner
register-bit-scavenging schemes possible when we have a larger number
of architectural register names available.

The unused portions would be affected by the tail undisturbed/agnostic
setting.

LMUL[2:0] encoding

111 LMUL=8
110 LMUL=4
101 LMUL=2
100 LMUL=1
011 LMUL=1/2
010 LMUL=1/4
001 LMUL=1/8
000 (reserved)
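
The table above can be read as LMUL = 2^(code - 4) for the seven defined codes. A minimal decoding sketch, assuming the hypothetical helper name `decode_lmul` and using `None` to stand for the reserved code:

```python
from fractions import Fraction


def decode_lmul(code):
    """Decode the proposed 3-bit LMUL[2:0] field into an LMUL value."""
    if not 0 <= code <= 0b111:
        raise ValueError("LMUL[2:0] is a 3-bit field")
    if code == 0b000:
        return None  # reserved in the proposed encoding
    return Fraction(2) ** (code - 4)


for code in range(7, -1, -1):
    print(format(code, "03b"), decode_lmul(code))
```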

We limit the mandatory supported SEW at each fractional LMUL to the
following values:

LMUL = 1/2, SEW <= ELEN/2
LMUL = 1/4, SEW <= ELEN/4
LMUL = 1/8, SEW <= ELEN/8

i.e., SEW <= LMUL*ELEN, for LMUL<=1 and ELEN @ LMUL=1
(some systems can have different ELEN for LMUL>1)
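
The rule SEW <= LMUL*ELEN can be checked mechanically. The sketch below is an illustration only: the function name is hypothetical, and it deliberately refuses LMUL>1 because the rule above is stated only for LMUL<=1.

```python
from fractions import Fraction


def sew_support_is_mandatory(sew, lmul, elen):
    """True if SEW must be supported at this LMUL, per SEW <= LMUL*ELEN."""
    lmul = Fraction(lmul)
    if lmul > 1:
        raise ValueError("the rule is only stated for LMUL <= 1")
    return sew <= lmul * elen


# With ELEN=64b, LMUL=1/2 must support SEW up to 32b but not 64b.
print(sew_support_is_mandatory(32, Fraction(1, 2), 64))
print(sew_support_is_mandatory(64, Fraction(1, 2), 64))
```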

Example layout, drawn with two ASCII characters per byte
horizontally. This is drawn to show SLEN<VLEN (but just considering
the right 128b shows how SLEN=VLEN would look).

VLEN=256b, SLEN=128b

SEW/LMUL=4

2F2E2D2C2B2A29282726252423222120|0F0E0D0C0B0A09080706050403020100 SEW=8b, LMUL=2
3F3E3D3C3B3A39383736353433323130|1F1E1D1C1B1A19181716151413121110

--27--26--25--24--23--22--21--20|--07--06--05--04--03--02--01--00 SEW=16b, LMUL=4
--2F--2E--2D--2C--2B--2A--29--28|--0F--0E--0D--0C--0B--0A--09--08
--37--36--35--34--33--32--31--30|--17--16--15--14--13--12--11--10
--3F--3E--3D--3C--3B--3A--39--38|--1F--1E--1D--1C--1B--1A--19--18

------23------22------21------20|------03------02------01------00 SEW=32b, LMUL=8
------27------26------25------24|------07------06------05------04
------2B------2A------29------28|------0B------0A------09------08
------2F------2E------2D------2C|------0F------0E------0D------0C
....

SEW/LMUL=8

1F1E1D1C1B1A19181716151413121110|-F-E-D-C-B-A-9-8-7-6-5-4-3-2-1-0 SEW=8b, LMUL=1

--17--16--15--14--13--12--11--10|---7---6---5---4---3---2---1---0 SEW=16b, LMUL=2
--1F--1E--1D--1C--1B--1A--19--18|---F---E---D---C---B---A---9---8

------13------12------11------10|-------3-------2-------1-------0 SEW=32b, LMUL=4
------17------16------15------14|-------7-------6-------5-------4
...

--------------11--------------10|---------------1---------------0 SEW=64b, LMUL=8
...

SEW/LMUL=16

xxxxxxxxxxxxxxxx-F-E-D-C-B-A-9-8|xxxxxxxxxxxxxxxx-7-6-5-4-3-2-1-0 SEW=8b, LMUL=1/2

---F---E---D---C---B---A---9---8|---7---6---5---4---3---2---1---0 SEW=16b, LMUL=1

-------B-------A-------9-------8|-------3-------2-------1-------0 SEW=32b, LMUL=2
-------F-------E-------D-------C|-------7-------6-------5-------4

---------------9---------------8|---------------1---------------0 SEW=64b, LMUL=4
---------------B---------------A|---------------3---------------2
...

SEW/LMUL=32

xxxxxxxxxxxxxxxxxxxxxxxx-7-6-5-4|xxxxxxxxxxxxxxxxxxxxxxxx-3-2-1-0 SEW=8b, LMUL=1/4

xxxxxxxxxxxxxxxx---7---6---5---4|xxxxxxxxxxxxxxxx---3---2---1---0 SEW=16b, LMUL=1/2

-------7-------6-------5-------4|-------3-------2-------1-------0 SEW=32b, LMUL=1

---------------5---------------4|---------------1---------------0 SEW=64b, LMUL=2
---------------7---------------6|---------------3---------------2

SEW/LMUL=64

xxxxxxxxxxxxxxxxxxxxxxxxxxxx-3-2|xxxxxxxxxxxxxxxxxxxxxxxxxxxx-1-0 SEW=8b, LMUL=1/8

xxxxxxxxxxxxxxxxxxxxxxxx---3---2|xxxxxxxxxxxxxxxxxxxxxxxx---1---0 SEW=16b, LMUL=1/4

xxxxxxxxxxxxxxxx-------3-------2|xxxxxxxxxxxxxxxx-------1-------0 SEW=32b, LMUL=1/2

---------------3---------------2|---------------1---------------0 SEW=64b, LMUL=1
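
Reading the figures above, each SLEN-wide partition appears to hold SLEN*LMUL/SEW consecutive elements, filled from the low end of the partition and continuing into the next register of the group. The sketch below redraws one register row under that reading. The function name, the argument order and the choice of unpadded hex indices are assumptions for this illustration, not part of the proposal.

```python
from fractions import Fraction


def layout_row(vlen, slen, sew, lmul, reg):
    """Draw register `reg` of a group, two ASCII characters per byte."""
    lmul = Fraction(lmul)
    per_partition = int(slen * lmul / sew)  # elements one partition holds across the group
    slots = slen // sew                     # element slots per partition in one register
    width = 2 * (sew // 8)                  # characters used to draw one element
    parts = []
    for p in range(vlen // slen):
        cells = []
        for s in range(slots):
            j = reg * slots + s             # position within this partition's share
            if j < per_partition:
                index = p * per_partition + j
                cells.append(format(index, "X").rjust(width, "-"))
            else:
                cells.append("x" * width)   # unused at fractional LMUL
        parts.append("".join(reversed(cells)))  # lowest element drawn on the right
    return "|".join(reversed(parts))


print(layout_row(256, 128, 8, Fraction(1, 2), 0))
print(layout_row(256, 128, 64, 4, 1))
```

The two calls correspond to the SEW=8b, LMUL=1/2 row and to the second register of the SEW=64b, LMUL=4 group in the SEW/LMUL=16 figure.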

Krste

Bill Huffman

Hi Krste,

I agree this is the basic solution. And very likely all we should include.

Bill

On 3/24/20 8:40 PM, Krste Asanovic wrote:

[...]

andrew@...

On Tue, Mar 24, 2020 at 8:41 PM Krste Asanovic <krste@...> wrote:

[...]

LMUL[2:0] encoding

111 LMUL=8
110 LMUL=4
101 LMUL=2
100 LMUL=1
011 LMUL=1/2
010 LMUL=1/4
001 LMUL=1/8
000 (reserved)

I recommend flipping the polarity of LMUL[2] to ease the transition for existing assemblers.  I understand the aesthetic underpinning of this encoding (especially as it might pertain to expanded LMUL in future), but it really is only aesthetic.

[...]

Alex Solomatnikov

I suggest treating fractional LMUL as a negative power of 2: 1/2 = 2^-1, 1/4 = 2^-2.

LMUL == 2^vtype.vlmul already holds, so no changes are needed for the values of vtype.vlmul that are already in the spec.
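
A small sketch of this suggestion, assuming vlmul is read as a 3-bit two's-complement exponent (the helper name is hypothetical). Under that reading the resulting codes are the same as Krste's table with LMUL[2] inverted, which is the change Andrew recommends. Code 0b100 would decode to 1/16, a value the proposal does not define, so it is left out of the printed table.

```python
from fractions import Fraction


def lmul_from_signed_vlmul(vlmul):
    """Interpret a 3-bit vlmul field as a signed exponent: LMUL = 2**vlmul."""
    if not 0 <= vlmul <= 0b111:
        raise ValueError("vlmul is a 3-bit field")
    exponent = vlmul - 8 if vlmul & 0b100 else vlmul
    return Fraction(2) ** exponent


# Existing 2-bit encodings keep their meaning when the new bit is 0.
for code in (0b000, 0b001, 0b010, 0b011, 0b111, 0b110, 0b101):
    print(format(code, "03b"), lmul_from_signed_vlmul(code))
```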

Alex

On Tue, Mar 24, 2020 at 10:53 PM Andrew Waterman <andrew@...> wrote:

On Tue, Mar 24, 2020 at 8:41 PM Krste Asanovic <krste@...> wrote:

[...]

David Horner

On 2020-03-24 11:40 p.m., Krste Asanovic wrote:
```
I've been wading through the fractional LMUL discussion on github but
believe the simple basic solution below meets the immediate needs,
```
I attempt to summarize the needs here:

1) to reduce the register pressure that successive levels of LMUL invoke

this arising from the need/desire

3) to avoid the use of register numbers that are not a multiple of LMUL (at each level)

There may be more?

By identifying the needs and enumerating them we can compare with other possible conflicting needs.

This basic solution addresses the need by stipulating 3 additional modes that

1) provide padding between element groups,

2) reduce the number of elements per register by a factor of 2 at each additional level,

3) and thus align with the same SLEN parameter used by LMUL.

```
without blocking possible reuse of unused register fields later. I
want to put this out there to provide a baseline strawman against
which to compare the other more exotic variants.
```
Thank you.
```
The proposed mapping is given below.

* For machines with SLEN=VLEN, the microarchitectural modification to
support fractional LMUL is very minor. The main changes are to add
the additional bit in vtype to support the additional LMUL values, and
to have setvl calculations take the fractional LMUL into account when
calculating VLMAX and setting vl. The only effect is to execute
instructions with shorter vl than, but otherwise identically to,
existing LMULs.
```
In particular, otherwise identical to LMUL=1, yes?

SLEN=VLEN is a special case that also simplifies LMUL>1.

- the same input/output applies to each register in the register group.

- vl less than VLMAX allows the last registers in the register group to be avoided, specifically one for each VLMAX/LMUL reduction.

- For mixed LMUL operation this also reduces register pressure on lower LMUL levels.

- the commonality between LMUL levels allows LMUL<8 to use exactly the same process as LMUL=8.

setvl[i] will automatically restrict vl; only the register constraint need be checked on execution.

Thus fractional LMUL is identical to LMUL>=1; setvl[i] does the restriction.

All this to say what we already know, SLEN<VLEN is much more involved.
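
For the SLEN=VLEN case discussed above, the remark about avoiding the last registers of a group can be sketched as follows. This assumes elements are packed contiguously, VLEN/SEW per register, which is only claimed here for SLEN=VLEN; the function name is hypothetical.

```python
def registers_touched(vl, vlen, sew):
    """Number of registers of a group that hold active elements (SLEN=VLEN)."""
    per_register = vlen // sew
    return -(-vl // per_register)  # ceiling division


# VLEN=256b, SEW=32b, LMUL=8: VLMAX is 64, but vl=20 only reaches 3 of the 8 registers.
print(registers_touched(20, 256, 32))
```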

```
* For machines with SLEN<VLEN, the simple "reduce VL" doesn't quite
work. Instead each SLEN-wide partition has to reduce VL locally. This
is shown in the figures below. Even this is not too large a change as
datapath wiring stays the same and it's mainly an issue of turning off
unused portions of the datapath, though in new patterns.
```
Because the
```
I'm not in favor of shifting the used portion to the top of the
register to enable scalar values or short vectors to use the space
below, as this would change the way fractional LMUL vector
instructions read out values
```
and I suppose to a comparable extent write values
```
and complicate chaining
```
It is not clear to me how this is a separate point, but I see how chaining is complicated in relation to read out.
```
and interlock checks for simple baselines.
```
agreed.
```
I believe there are cleaner
register-bit-scavenging schemes possible when we have a larger number
of architectural register names available.
```
In principle I understand that 32-bit should not overly constrain 64-bit encoding/functionality.

Should we exclude a valuable scheme in 32 bits, unless we know that a specific scheme is overwhelmingly valuable in 64-bit and precluded by that 32-bit scheme?

And I agree overall that the shift to the top of the register is problematic, including for widening with matching lanes.

```
[...]

VLEN=256b, SLEN=128b

SEW/LMUL=4
```

I am totally puzzled by this mask layout.

Should the mask for LMUL>=1 not be identical throughout?

Specifically, should it not look like the mask layout under the SEW/LMUL=8 section?

```
1F1E1D1C1B1A19181716151413121110|-F-E-D-C-B-A-9-8-7-6-5-4-3-2-1-0     Mask
```

I understand the rationale, for LMUL<1, of aligning the mask with the elements;

however, they could equally be spread across the SLEN bits.

Krste, can you explain the SEW/LMUL=64 example more fully?

Is there intended to be a different layout per fractional level?

Or is it driven by the SEW/LMUL ratio?

```
[...]

Krste
```
