Thoughts on Git update (8a9fbce) Added fractional LMUL, including modifying vector data register and vector mask register layouts for SLEN<VLEN implementations.


David Horner
 

First, some observations on the revised LMUL.

*1 The format for a given SLEN and SEW is the same for all LMUL>=1
*2 LMUL=n is equivalent to LMUL=2 * n with vl < 1/2 vlmax at that level, for n=1,2,4.
*3 Doubling SEW halves the number of elements in the same number of register bits, and vice versa.
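A small numeric check of points *2 and *3 may help. It assumes VLMAX = LMUL*VLEN/SEW; the VLEN value and the helper name `vlmax` are illustrative assumptions, not part of the proposal.

```python
from fractions import Fraction

def vlmax(vlen_bits, sew_bits, lmul):
    """Elements in a register group, assuming VLMAX = LMUL * VLEN / SEW."""
    return int(Fraction(lmul) * vlen_bits / sew_bits)

VLEN = 128  # assumed value, for illustration only

# Point *2: VLMAX at LMUL=n is half of VLMAX at LMUL=2*n.
for n in (1, 2, 4):
    assert 2 * vlmax(VLEN, 32, n) == vlmax(VLEN, 32, 2 * n)

# Point *3: doubling SEW halves the element count for the same register bits.
for sew in (8, 16, 32):
    assert vlmax(VLEN, sew, 1) == 2 * vlmax(VLEN, 2 * sew, 1)
```

Passing a `Fraction` for `lmul` lets the same helper cover fractional settings such as LMUL=1/2.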

The first provides the benefit that quad or higher widening with ESEW <= SLEN stays in data lanes
    (resolving an ugly characteristic of quad widening).

Combined, these lead to the realization that vl is the determinant of the register group size.
If vsetvli were separately given the number of physical registers from which to calculate vl, LMUL>1 could be eliminated.
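As a rough sketch of that observation: if elements fill one physical register completely before moving on to the next (an assumption made here for illustration), the number of registers a group touches follows from vl alone. The function name and the VLEN value are hypothetical.

```python
def registers_spanned(vl, sew_bits, vlen_bits):
    """Physical registers touched by vl elements of width sew_bits.

    Assumes elements fill one register completely before the next
    (an illustrative assumption, not a statement about the spec).
    """
    # Ceiling division on integers; at least one register is always named.
    return max(1, -(-(vl * sew_bits) // vlen_bits))

VLEN = 128  # assumed value, for illustration only
for vl in (4, 5, 8, 16):
    print(vl, registers_spanned(vl, 32, VLEN))
```

Under this model the group size is a function of vl, SEW and VLEN, which is the sense in which vl rather than LMUL determines it.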


The format for LMUL=1/2
    - does not align with LMUL=1, complicating mixed-width instructions.
    - is wasteful of space
    - but it does reduce the active portion of registers, which could benefit renaming machines (if they rename at sufficiently low granularity).

Noting that point *2 could be extended to LMUL=1/2, and in conjunction with point *3:

     Widening operations to LMUL=1 can equivalently be sourced from LMUL=1 where
         the source is 1/2 the SEW of the widened result and
         vl is the length of the widened result.

Rephrased relative to source SEW:

     At LMUL=1, widening operations
         take a source of SEW-width elements and length vl,
         and create the widened result as LMUL=1 with 2*SEW and length vl.
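The arithmetic behind this can be sketched as follows, again assuming VLMAX = LMUL*VLEN/SEW. The helper names and the VLEN and SEW values are illustrative assumptions.

```python
from fractions import Fraction

def vlmax(vlen_bits, sew_bits, lmul):
    """Elements in a register group, assuming VLMAX = LMUL * VLEN / SEW."""
    return int(Fraction(lmul) * vlen_bits / sew_bits)

def widening_footprint(src_sew, vl):
    """Bits occupied by the source and by the 2*SEW result for the same vl."""
    return vl * src_sew, vl * 2 * src_sew

VLEN = 128     # assumed value, for illustration only
SRC_SEW = 16   # assumed source element width

# Choose vl so that the widened result exactly fills one register (LMUL=1).
vl = vlmax(VLEN, 2 * SRC_SEW, 1)
src_bits, dst_bits = widening_footprint(SRC_SEW, vl)

assert dst_bits == VLEN        # result fills a whole register
assert src_bits == VLEN // 2   # source occupies only half a register
# The same vl is what a source at LMUL=1/2 would allow.
assert vl == vlmax(VLEN, SRC_SEW, Fraction(1, 2))
```

In this model a source held at LMUL=1 with vl limited to the result length and a source held at LMUL=1/2 cover the same elements.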

I recommend this uniformity apply through "fractional modes" that allocate 1/2, 1/4, etc. of the physical register bits.

A specific optimization, such as dynamic VLEN, can address the efficiency issue for renaming micro-architectures.

Instead, I recommend "fractional modes" that fill 1/2 of each SLEN before moving on to the next physical register, with one mode using the first half and the other mode the other half.
This is similar to what is proposed in #412, "Fractional vtype field vfill – Fractional Fill order and Fractional Instruction eLement Location".

As that proposal was designed to be added to the previous LMUL modes, I am now working through the details of such an encoding for a revised proposal.

However, in the interim I thought these considerations might be helpful as is.


Krste Asanovic
 

On Sat, 25 Apr 2020 18:23:07 -0400, "David Horner" <ds2horner@...> said:
| First some observations from the revised LMUL.
| *1 The format for a given SLEN and SEW is the same for all LMUL>=1
| *2 LMUL=n is equivalent to LMUL=2 * n with vl < 1/2 vlmax at that level,
| for n=1,2,4.
| *3 Doubling SEW halves the number of elements in the same number of
| register bits, and vice versa.

| The first provides the benefit that quad or higher widening with ESEW <=
| SLEN stays in data lanes.
|     (resolving an ugly characteristic of quad widening.)

| Combined, these lead to the realization that vl is the determinant of
| the register group size.
| If vsetvli were separately provided the number of physical registers to
| calculate vl, LMUL>1 is eliminated.

Hmm, LMUL _is_ the number of architectural registers to allocate.

| The format for LMUL=1/2
|      - does not align with LMUL=1 complicating mixed width instructions.

I believe it does align.

I don't understand the issue you're trying to address below. I think
the new scheme already has the properties you're trying to attain.

Krste

|     - is wasteful of space
|     - but it does reduce the active portion of registers, that could
| benefit renaming machines (if they rename at sufficient low granularity).

| Noting that point *2 could be extended into LMUL=1/2 and in conjunction
| with point *3:

|      Widening operations to LMUL=1 can equivalently be sourced from
| LMUL=1 where
|          source is 1/2 SEW of widened result and
|         vl is length of widened result.

| Rephrased relative to source SEW:

|      At LMUL=1, widening operations
|             take source of SEW width elements and length vl,
|              and create widened result as LMUL=1 with 2*SEW and length
| of vl.

| I recommend this uniformity apply through "fractional modes" that
| allocate 1/2, 1/4, etc. of the physical registers bits.

| A specific optimization, such as dynamic VLEN can address the renaming
| micro-architectures efficiency issue.

| Instead I recommend "fractional modes" that fill 1/2 of each SLEN before
| moving on to next physical register, with one mode using the first half
| and the other mode the other half.
| Similar to proposed in #412  Fractional vtype field vfill – Fractional
| Fill order and Fractional Instruction eLement Location.

| As that proposal was designed to be added to the previous LMUL modes, I
| am working through details of such encoding now for a revised proposal.

| However, in the interim I thought these considerations might be helpful
| as is.

|


David Horner
 

On 2020-04-25 8:46 p.m., krste@... wrote:
On Sat, 25 Apr 2020 18:23:07 -0400, "David Horner" <ds2horner@...> said:
| First some observations from the revised LMUL.
| *1 The format for a given SLEN and SEW is the same for all LMUL>=1
| *2 LMUL=n is equivalent to LMUL=2 * n with vl < 1/2 vlmax at that level,
| for n=1,2,4.
| *3 Doubling SEW halves the number of elements in the same number of
| register bits, and vice versa.

| The first provides the benefit that quad or higher widening with ESEW <=
| SLEN stays in data lanes.
|     (resolving an ugly characteristic of quad widening.)

| Combined, these lead to the realization that vl is the determinant of
| the register group size.
| If vsetvli were separately provided the number of physical registers to
| calculate vl, LMUL>1 is eliminated.

Hmm, LMUL _is_ the number of architectural registers to allocate.
Yes, it is. However, it needn't be, and allowing it not to be gives greater flexibility at minimal cost.
This was a suggestion for implementing #418 [Introduce vlmt (vl multiplicative threshold) / VLMT Vector LiMiT].

| The format for LMUL=1/2
|      - does not align with LMUL=1 complicating mixed width instructions.

I believe it does align.
You are correct.

I don't understand the issue you're trying to address below. I think
the new scheme already has the properties you're trying to attain.
The new scheme does do what I propose.
I misread the application of the first LMUL*VLEN/SEW "elements" as "segments" (as described in 4.3; I worked back from 4.3 to 4.1b).
I then failed to check that misinterpretation against the diagrams (which I looked at, but misread with the wrong mindset in place).
Sorry for the noise.

The "wasteful of space" point still applies (as does dynamic VLEN, to have useful re-nameable tails).


Krste

|     - is wasteful of space
|     - but it does reduce the active portion of registers, that could
| benefit renaming machines (if they rename at sufficient low granularity).

| Noting that point *2 could be extended into LMUL=1/2 and in conjunction
| with point *3:

|      Widening operations to LMUL=1 can equivalently be sourced from
| LMUL=1 where
|          source is 1/2 SEW of widened result and
|         vl is length of widened result.

| Rephrased relative to source SEW:

|      At LMUL=1, widening operations
|             take source of SEW width elements and length vl,
|              and create widened result as LMUL=1 with 2*SEW and length
| of vl.
This is the multiplicative threshold vl proposal, #418, which eliminates LMUL>1.
(I will reapply to this proposal and suggest an alternate encoding for vsetvli that uses none of the immediate bits.)

| I recommend this uniformity apply through "fractional modes" that
| allocate 1/2, 1/4, etc. of the physical registers bits.
Wonderful to be in agreement here.
| A specific optimization, such as dynamic VLEN can address the renaming
| micro-architectures efficiency issue.

| Instead I recommend "fractional modes" that fill 1/2 of each SLEN before
| moving on to next physical register, with one mode using the first half
| and the other mode the other half.
This is to avoid wasting the space.
| Similar to proposed in #412  Fractional vtype field vfill – Fractional
| Fill order and Fractional Instruction eLement Location.

| As that proposal was designed to be added to the previous LMUL modes, I
| am working through details of such encoding now for a revised proposal.

| However, in the interim I thought these considerations might be helpful
| as is.
Greatly appreciate your quick response. Wonderful to not have to argue the uniformity position.


Krste Asanovic
 

On Sat, 25 Apr 2020 23:02:08 -0400, DSHORNER <ds2horner@...> said:
| On 2020-04-25 8:46 p.m., krste@... wrote:
|| On Sat, 25 Apr 2020 18:23:07 -0400, "David Horner" <ds2horner@...> said:
[...]
| The wasteful of space still applies (As does the dynamic VLEN to have
| useful re-nameable tails)

The space in a fractional LMUL register can be used in software.
E.g., by loading an SEW=8 vector with LMUL=1, then using LMUL=1/4 to
combine first quarter with SEW=32,LMUL=1 vectors in other registers.
Once the first quarter of the source SEW=8 vector register is
processed, a slidedown by vlenb/4 can be used to align the next
SEW=8,LMUL=1/4 vector of operands (though currently have to reset
SEW/LMUL around the slidedown to avoid zeros appearing).
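
The sequence described above can be modelled roughly in software. This is only a sketch: registers are Python lists, "combine" is taken to be an element-wise add, and VLEN=128 and all function names are assumptions made for illustration.

```python
VLEN = 128          # assumed value, for illustration only
VLENB = VLEN // 8   # register length in bytes

def slidedown(reg, offset):
    """Model of a slide down over the whole register; the vacated tail reads as 0."""
    return reg[offset:] + [0] * offset

def process_quarters(narrow_reg, wide_reg):
    """Combine each quarter of an SEW=8 register with an SEW=32 register."""
    quarter = VLENB // 4  # with SEW=8, a byte offset is also an element count
    results = []
    for _ in range(4):
        # SEW=8, LMUL=1/4 view: only the first quarter of the register is active.
        active = narrow_reg[:quarter]
        results.append([w + b for w, b in zip(wide_reg, active)])
        # The slide is modelled over the full register, standing in for the
        # temporary switch back to SEW=8, LMUL=1 around the slidedown.
        narrow_reg = slidedown(narrow_reg, quarter)
    return results

narrow = list(range(VLENB))        # 16 byte-sized elements
wide = [1000, 2000, 3000, 4000]    # 4 word-sized elements
out = process_quarters(narrow, wide)
```

Each pass consumes one quarter of the SEW=8 register, so the remaining three quarters hold live data for later passes rather than going unused.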

Krste


David Horner
 

On 2020-04-26 3:05 a.m., krste@... wrote:
On Sat, 25 Apr 2020 23:02:08 -0400, DSHORNER <ds2horner@...> said:
| On 2020-04-25 8:46 p.m., krste@... wrote:
|| On Sat, 25 Apr 2020 18:23:07 -0400, "David Horner" <ds2horner@...> said:
[...]
| The wasteful of space still applies (As does the dynamic VLEN to have
| useful re-nameable tails)

The space in a fractional LMUL register can be used in software.
E.g., by loading an SEW=8 vector with LMUL=1, then using LMUL=1/4 to
combine first quarter with SEW=32,LMUL=1 vectors in other registers.
Once the first quarter of the source SEW=8 vector register is
processed, a slidedown by vlenb/4 can be used to align the next
SEW=8,LMUL=1/4 vector of operands (though currently have to reset
SEW/LMUL around the slidedown to avoid zeros appearing).
Yes. The register is not tainted by having been used as fractional.
But what I meant was that there is no active use of that space during fractional mode.

That temporary switch for the slidedown can be assisted by the transient version of #423, additional instructions to set vtype fields.

Krste