#### On Vector Register Layout

David Horner

On 2020-06-24 10:18 p.m., Bill Huffman wrote:
> Hi David,
>
> If I try to compare this to the current proposal, it seems to me there
> are two major differences.
>
> ** A layout difference in the wide registers

The wide register being the destination, correct?

> where elements alternate between two registers

Not sure I follow this: which two registers? If you mean physical registers, then no.
That design was the v0.8 method of cycling vertically through physical registers.

> instead of going through first one and then the other.
>
> ** Two instructions to accomplish what one accomplishes today.

One v0.9 instruction could do the work of two under "Even In Odd In Out Wide" ("eioiow"),
but v0.9 would require masks to do what eioiow can do in one unmasked instruction.

A classic case of Your Mileage May Vary.

The strong advantage is that eioiow does not cross lanes, has a simple implementation-independent structure, and is applicable in common use cases.

> I've used even/odd arrangements a lot over the years and would certainly
>
> 5B seems to require twice as many register specifiers.

Not sure what you mean by register specifiers, but as the example below shows, there is no need for even/odd sets of opcodes.
Fusing slideup1/slidedown1 and reversing the vs1/vs2 designations replaces the need for them.

Example:

CompMult:   ; a0 length, a1 complex double float result addr, a2 and a3 complex single float inputs

... ; standard preamble and loop set up

add t0,a0,a0   ; - t0 twice complex arg length

loop:

vsetvli t1,t0,e4,m8,eioiow ; Even In Odd In Out Wide

;  note vsetvli with eioiow will ensure vl is even.

vlde v0,a2  ; load complex single as two consecutive floats
vlde v8,a3

vslideup1 v16,v0 ; make a2 real odd
;   op vd, vs2, vs1   ... widening uses vs2 even and vs1 odd
vfwmul.vv v16,v8,v16 ; note result can overwrite source
; above can fuse

vslidedown1 v24,v0 ; make a2 imaginary even, use imaginary result as temporary
vfwnmsac v16,v24,v8 ;
; chained ops can notice v24 does not need to be written due to vd of next op

vfwmul.vv v24,v0,v8 ; v0.real*v8.img
vfwmac.vv v24,v8,v0 ;v8.real*v0.img
; above can be chained, recognize the pattern and forward the other read elements.

;   optimizations may interleave even/odd read ports to match most frequent widths.

vsetvli t0,a0,e8,m8 ; halve vl

vssseg2e8 v16,a1

; .... housekeeping and loop
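
A minimal Python sketch of the data flow in the example above may make it easier to check. It rests on assumptions that are not a specification: a widening op under eioiow is taken to compute wide[i] = vs2[2i] op vs1[2i+1], slideup1/slidedown1 are taken to shift elements by one position (filling with 0), and vfwnmsac/vfwmac are taken to subtract/add the product into vd. All function names are hypothetical, and the results are returned as separate real and imaginary lists rather than stored interleaved.

```python
# Hypothetical model of "Even In Odd In Out Wide" (eioiow) widening ops.
# Assumption: wide[i] = op(vs2[2*i], vs1[2*i + 1]); names are illustrative only.

def slideup1(v):
    """Shift elements up one position; element 0 is filled with 0 (assumed)."""
    return [0] + v[:-1]

def slidedown1(v):
    """Shift elements down one position; the top element becomes 0 (assumed)."""
    return v[1:] + [0]

def wmul(vs2, vs1):
    """Widening multiply: even elements of vs2 times odd elements of vs1."""
    return [vs2[2 * i] * vs1[2 * i + 1] for i in range(len(vs2) // 2)]

def wmacc(vd, vs2, vs1, sign=1):
    """Widening multiply-accumulate into wide vd (sign=-1 models vfwnmsac)."""
    return [d + sign * p for d, p in zip(vd, wmul(vs2, vs1))]

def complex_mul(a, b):
    """a, b: interleaved [re0, im0, re1, im1, ...]; returns (real, imag)."""
    real = wmul(b, slideup1(a))                # b.re * a.re
    real = wmacc(real, slidedown1(a), b, -1)   # minus a.im * b.im
    imag = wmul(a, b)                          # a.re * b.im
    imag = wmacc(imag, b, a)                   # plus b.re * a.im
    return real, imag

a = [1.0, 2.0, 3.0, -1.0]   # (1+2j), (3-1j)
b = [0.5, 4.0, -2.0, 2.0]   # (0.5+4j), (-2+2j)
real, imag = complex_mul(a, b)
```

In this model every product pairs an even-indexed element with the odd-indexed element of the same pair, which is why the slides are needed to move a real or imaginary part into the other position.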


Bill Huffman

Hi David,

If I try to compare this to the current proposal, it seems to me there
are two major differences.

** A layout difference in the wide registers where elements alternate
between two registers instead of going through first one and then the other.

** Two instructions to accomplish what one accomplishes today.

I've used even/odd arrangements a lot over the years and would certainly

5B seems to require twice as many register specifiers.

Bill

On 6/24/20 7:05 PM, David Horner wrote:

David Horner

On 2020-06-12 7:05 a.m., Krste Asanovic wrote:
> The interesting cases are mixed-width operations, which are prevalent
> in low-precision multiply-accumulate kernels that dominate many
> existing and emerging compute areas, but there are plenty of other
> kernels that operate on mixed-width data items. Classic SIMD ISAs
> handle mixed-width operations in one of five ways (would be glad to
> add other known options to this list):

I will take a stab at an even and odd layout for widening.

5) Two versions of each widening op are defined, one for even and one for odd elements.
The registers are divided into even:odd pairs.
The full widened result is the result of the operation performed on the even (or odd) halves of the pairs.
The downsides of this approach are:
a) the need for two instructions.
b) only 1/2 of the input register bandwidth is used.
The widening operation is in lane.

Note: this approach is similar to the v0.8 LMUL=1 widening if SLEN were SEW wide.
Logically, V0.8 does both an even (to dest) and an odd (to dest+1) set of instructions.
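
As an illustration of option 5, the following Python sketch models the even and odd variants of a widening multiply. It is only a sketch under assumptions: elements are plain Python integers, the function names are hypothetical, and wide result i is taken to occupy the positions of narrow elements 2i and 2i+1 (which is what keeps the operation in lane).

```python
# Hypothetical model of option 5: separate even and odd widening instructions.
# Wide result slot i covers the positions of narrow elements 2*i and 2*i + 1,
# so each result stays within the lane that supplied its inputs.

def wmul_even(vs2, vs1):
    """Widening multiply using only the even-indexed element of each pair."""
    return [vs2[i] * vs1[i] for i in range(0, len(vs2), 2)]

def wmul_odd(vs2, vs1):
    """Widening multiply using only the odd-indexed element of each pair."""
    return [vs2[i] * vs1[i] for i in range(1, len(vs2), 2)]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 20, 30, 40, 50, 60, 70, 80]

even = wmul_even(x, y)   # products of elements 0, 2, 4, 6
odd = wmul_odd(x, y)     # products of elements 1, 3, 5, 7

# Two instructions are needed to cover all elements (downside a), and each
# one reads only half of the source elements (downside b).
all_products = [None] * len(x)
all_products[0::2] = even
all_products[1::2] = odd
```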

5B) A variation of this is possible for RVV: an even/odd widening op mode.
vs1 provides the odd elements, vs2 provides the even elements, and vd receives the double-width result.
This approach has a number of advantages:
a) When vs1 = vs2, a single input vector provides both arguments: single read port, reduced energy cost.
b) vd can also be either vs1 or vs2.
c) As a result, vd can be used as a temporary for a slideup1/slidedown1 of either input to emulate even or odd pair ops.
(this could be fused or to allow even/odd
d) As with the base even:odd approach, operations are in lane, and under the v0.9 model register groups of up to 8 physical registers can participate.
e) With v0.9, the ordinal masking interoperates unchanged.

Note: under v0.9, existing instructions provide supporting operations. E.g., for SEW>8, a load with a 1/2 unit stride can simulate an interleaved load.

I wanted to provide this option before the meeting because it clearly demonstrates another plausible approach to HPC independent of an SLEN parameter.

The presumption of SLEN, even when subsumed in the VLEN=SLEN, is not necessary for a base model.

Assuming a SLEN<=VLEN model when stipulating VLEN=SLEN is like mandating a rational (a/b) number set and then stipulating that the denominator (b) is 1.
Better to mandate integers, a conceptually simpler number set, and introduce rationals (or reals) if and when needed.

Krste Asanovic

Nick,

The issue is that in a wide SIMD datapath, the microarchitecture is
going to want bits to be spread across physical datapath bits
differently depending on SEW, though software's view of where the bits
are doesn't change.

Krste

On Mon, 15 Jun 2020 18:49:53 -0700, Nick Knight <nick.knight@...> said:
| Hi Bill,
| My understanding was that the whole register loads and stores work by reinterpreting the (VLEN) bits in a V-register as if SEW were
| 8; in particular, any bit-permutation induced by vs1r.v will be inverted by the matching vl1r.v, making them effectively agnostic to
| element width. When refilling a V-register, it's up to software to (re)interpret the bits correctly by recording the appropriate
| CSRs.

| However, I've fallen behind in tracking the latest developments, so I could just be plain wrong..

| Best,
| Nick Knight

| On Mon, Jun 15, 2020 at 2:06 PM Bill Huffman <huffman@...> wrote:

| I've not seen very many responses here.  I'll try to describe more
| precisely what's concerning me.

| In a wide, in-order SIMD core, I can expect two VLEN sized memory
| accesses per cycle.  So a spill/fill pair costs a cycle in a memory
| limited loop and less in an arithmetically limited loop.

| I assume that whole register stores and loads (section 7.9 of the spec)
| will be used for spills and fills when register allocation is
| oversubscribed.  With this proposal, the spill will know the current
| element width from the micro-architectural tag and so will adjust
| without extra cost.   But the fill will not know what element width will
| be needed.  When the fill element width is (often) not the right one,
| there will be several cycles lost to do the (unpipelined) fix-up.

| So, a spill-fill pair will increase in cost from a partial cycle to
| several cycles.  When the core executes several vector instructions per
| cycle, this is a worrisome cost.

| I'm wondering whether the compiler can do anything to alleviate this.
| Can the compiler know, even most of the time, what EEW was used for a
| particular register when it spills and fills the register so that most
| spill/fill pairs can avoid this overhead?

| Whole register loads in general will cause this problem, but I think
| fills after spill are the only one where performance matters.

|       Bill

| On 6/12/20 4:58 PM, Bill Huffman wrote:
|| Hi Krste,
||
|| I've been thinking about what happens with this proposal and whole
|| register loads and stores - under the assumption of adding uops
|| (out-of-order or in-order) when a micro-architectural tag doesn't match
|| the intended usage.  I think the stores are probably fine as they know
|| the EEW that the register arrangement belongs to and can store correctly
|| with the same byte movement as a store of that size would.
||
|| But whole register loads don't know where to put the bytes and so will
|| have to choose and count on being fixed later, which brings up two points:
||
|| First, it makes spills/fills more expensive because after a fill,
|| there's an extra small number of cycles to fix the expected EEW.
||
|| Second, it means it probably matters to re-write the register when the
|| current arrangement and the expected one don't match.  Otherwise after,
|| say, returning from an interrupt, a register that's used a large number
|| of times will cause a large number of lost cycles.
||
|| I think I heard the comment this morning that the register probably
|| shouldn't be re-written, just rearranged on the fly for the current use.
||     Did I hear that?  What is the argument for that?
||
||          Bill
||
|| On 6/12/20 4:05 AM, Krste Asanovic wrote:
|||
||| TL;DR: I'm leaning towards mandating SLEN=VLEN layout, at least for
||| application processor profiles.
|||
|||
||| Regarding register layout, I thought it would be good to lay out the
||| landscape and comparison with other SIMD ISAs before diving into a
||| proposal for RVV.
|||
|||
||| I think it's useful to distinguish "bitsliced" operations from
||| "bitcrossing" operations.
|||
||| It's also useful to define a separate term for physical datapath width
||| "DPW".  In sensible designs, VLEN is an integer power-of-2 multiple of
||| DPW.  If
|||
||| Bitsliced operations on elements of size EEW operate entirely within
||| an EEW region of DPW.
|||
||| Bitcrossing operations traverse more than (source/dest) EEW bits of
||| DPW.
|||
||| In all sane general-purpose SIMD designs, memory operations can move
||| vectors that are naturally aligned to element boundaries, not only to
||| VLEN boundaries, so all memory operations are bitcrossing operations
||| assuming DPW > smallest EEW and require at least a memory rotate if
||| not a full crossbar between memory ports and register file ports.
||| (Some specialized SIMD designs might retain a VLEN-alignment
||| constraint, but they're not of interest here).
|||
||| There are specialized register permute instructions that are
||| bitcrossing instructions, such as our slide, vrgather, and compress
||| instructions (reductions also).  All SIMD ISAs add some variants of
||| these.
|||
||| Many simple vector arithmetic operations are bitsliced.
|||
||| The interesting cases are mixed-width operations, which are prevalent
||| in low-precision multiply-accumulate kernels that dominate many
||| existing and emerging compute areas, but there are plenty of other
||| kernels that operate on mixed-width data items.  Classic SIMD ISAs
||| handle mixed-width operations in one of five ways (would be glad to
||| add other known options to this list):
|||
||| 0) Single-width elements.  All operations operate on same width, with
|||       variety of load/stores widening/narrowing into/out of this element
|||       size.  For lower precision arithmetic, this wastes a lot of
|||       register file capacity, regfile port bandwidth, and ALU throughput.
|||
||| 1) Specialized registers. Some SIMD machines have dedicated wider
|||      accumulator registers, so datapath remains effectively bitsliced.
|||      Many machines have dedicated predicate registers (treating
|||      predicates as example of mixed-width operation).
|||
||| 2) Pack/unpack.  Arithmetic instructions are all bitsliced, but with
|||      separate bitcrossing data movement operations to pack/unpack
|||      elements, e.g., register-register unpack will sign/zero-extend
|||      top/bottom of a vector to yield destination vector of 2*source-EEW
|||      elements.  pack will do reverse from parts of two vectors.  Unpacked
|||      memory loads/stores similarly sign/zero-extend or truncate on way
|||      in/out of memory.  The pack/unpack instructions tend to imply
|||      crossing the entire DPW, and also complicate software which has to
|||      unroll loop into hi/lo portions and issue separate instructions for
|||      each half.
|||
||| 3) Register pairs, where wider operand is created by pairing two
|||      existing registers within EEW so avoiding bitcrossing.  But this
|||      splits a single element's storage across two architectural
|||      registers, which doesn't support load/store to in-memory application
|||      formats without pack/unpack bitcrossing operations or additional
|||      load/store instructions (also effectively pack/unpack instructions).
|||
||| 4) EDIV-style, where mixed operations are handled by dividing an
|||      element width into subelements and accumulating multiple subelements
|||      into parent element size (e.g., 4 8b*8b multiplies accumulated into
|||      32b accumulator).  These provide mixed-width operations while
|||      avoiding bitcrossing.  However, they impose restrictions on
|||      application input and/or output data layouts to achieve high
|||      efficiency.
|||
||| With RVV we are trying to support mixed-width operations without
||| adding specialized registers, or splitting an element across
||| architectural registers, or requiring implicit or explicit bitcrossing
||| beyond min(DPW,SLEN) on ALU operands (bitcrossing for memory
||| load/stores cannot be avoided).  We're also trying to support vector
||| units with current implementation targets ranging from VLEN=32 to
||| VLEN=16384.
|||
||| ----------------------------------------------------------------------
|||
||| The SLEN parameter allows implementations to optimize their wire
||| length.  For narrower DPW (<=128b) regardless of VLEN, the SLEN=VLEN
||| layout where in-register format matches in-memory works OK as
||| bitcrossing cannot be outside DPW.  For wider DPW (>128b), SLEN<VLEN
||| layouts minimize bitcrossing.  When codes want to pun bytes between
||| different element widths, the SLEN<VLEN layout requires cast operations that
||| will shuffle bytes around (these become simple register moves on SLEN=VLEN
||| machines).
|||
||| I see two separate problems with the SLEN parameter.
|||
||| 1) SLEN=VLEN layout can be profitably exploited in some code,
||| encouraging programmers to ignore compatibility and drop cast
||| instructions.
|||
||| 2) Correct code varying in SLEN cannot be migrated between machines
||| with same VLEN.  This I view as a quite serious issue, not just for
||| migration but also verification and any case where we're working
||| across two implementations.
|||
||| We've worked through many alternatives, but at this point, I'm back to
||| proposing that SLEN=VLEN is an extension, and that this extension is
||| required for application processors (i.e., this is in "V").  The mode
||| bit idea (software indicates if SLEN=VLEN layout is needed) doesn't
||| solve thread migration, but we could add that to SLEN<VLEN profile
||| somehow instead of casting, in a way such that SLEN=VLEN could ignore
||| it and didn't have to implement the null cast instructions.
|||
||| For systems with DPW <=128b, this is simple to implement in all kinds
||| of system.
|||
||| For wider datapaths DPW>=256b, the SLEN<VLEN layout would be
||| preferable.
|||
||| For more complex microarchitectures that want wider DPW, SLEN=VLEN can
||| be software view, with internal layout hidden by the
||| microarchitectural tricks we previously discussed.  Any access using
||| the wrong EEW can shuffle the bytes microarchitecturally.  This does
||| add complexity, but assuming that casting is relatively rare, these
||| machines would benefit from the fact that the ISA does not require bit
||| crossing for general mixed-width code (not just for EDIV).  There is
||| clearly some complexity here, but I think the shuffling is thankfully
||| contained with a DPW-bit element group.
|||
||| ----------------------------------------------------------------------
|||
||| A different direction would be to say that SLEN=VLEN layout is
||| mandatory but bitcrossing instructions are an extension.  But we'd
||| still need to define something for mixed-width arithmetic (EDIV is
||| probably least objectionable choice out of above list of 0-4 options).
|||
||| ----------------------------------------------------------------------
|||
||| Finally, I think for the crypto extensions there is actually no need
||| to limit ELEN.  We can instead just limit bitcrossing arithmetic
||| instructions to ELEN<=128.  ELEN >128 need only be supported by a few
||| operations such as crypto.  We can use wider ELEN with EDIV, where
||| EDIV cuts size of sub-element to supported arithmetic EEW (e.g.,
||| ELEN=256 with EDIV=8 for 8*32b floats).
|||
|||
||| Krste
|||
|||
|

David Horner

These decisions are not made independently.

E.g. Removing expanding loads led to fractional register mode.

I believe there are other considerations that affect a definitive decision.

1) The even/odd approach (which I expect Krste will have available soon) also would benefit from a specified "interleave"  register structure.
Specifically, for widening operations designating even-even, odd-odd and even-odd variants allows full register utilization.
These variants need not be specified in the opcode, just as multiplier/fractional-multiplier is a vtype parameter.

2) As a more general case of fractional-mul, fill factor in conjunction with the original integral lmul
allows multiple physical registers to participate (rather than restricting to a single physical register).

3) Element Interleave is another major structure that is only partially addressed by the segmented memory operation.
This functionality dovetails with both above points.  1 above.

Each of these three approaches can be added on top of a base model that assumes only
a) integral lmul and
b) in-memory order in-register data (i.e. non-segmented register mapping, in v0.9 it is called VLEN=SLEN )

As Krste has outlined here, there are multiple legitimate approaches to widening ops.
I agree that some are not reasonable candidates to propose as base, nor even to ensure convenient future inclusion.

However, as I suggested previously, ensuring support for more than one is important to meet RISC-V's goal of being a base for extensions, and RVV's goal of supporting a wide variety of physical hardware and micro-architectures from IoT to HPC.

I have been reviewing past versions of RVV as I can find them.
The github riscv-v-spec goes back to Jul 27, 2018.
It predates register groups (lmul), which had a profound effect on the perception of structure.
Introduced originally for vertical alignment of the source/destination of widening/narrowing ops,
lmul also enhances minimal systems by effectively increasing VLEN.

Further, it continues to provide an alignment benefit even under v0.9 which abandons the strict vertical structure of v0.8.

Similarly, lmul and fractional-lmul appear to have been missing when the discussions on the even/odd approach occurred.
And perhaps value points have shifted since then.

I still have not located any extensive discussions,
e.g. on rejecting register pairs or pack/unpack, which are also closely related to register structure
and to the decision to remove sign/unsign-extending loads.

If anyone can direct me to these specific discussions around widening/narrowing approaches I would be quite grateful.

On 2020-06-12 7:05 a.m., Krste Asanovic wrote:

```
TL;DR: I'm leaning towards mandating SLEN=VLEN layout, at least for
application processor profiles.

Regarding register layout, I thought it would be good to lay out the
landscape and comparison with other SIMD ISAs before diving into a
proposal for RVV.

I think it's useful to distinguish "bitsliced" operations from
"bitcrossing" operations.

It's also useful to define a separate term for physical datapath width
"DPW".  In sensible designs, VLEN is an integer power-of-2 multiple of
DPW.  If

Bitsliced operations on elements of size EEW operate entirely within
an EEW region of DPW.

Bitcrossing operations traverse more than (source/dest) EEW bits of
DPW.

In all sane general-purpose SIMD designs, memory operations can move
vectors that are naturally aligned to element boundaries, not only to
VLEN boundaries, so all memory operations are bitcrossing operations
assuming DPW > smallest EEW and require at least a memory rotate if
not a full crossbar between memory ports and register file ports.
(Some specialized SIMD designs might retain a VLEN-alignment
constraint, but they're not of interest here).

There are specialized register permute instructions that are
bitcrossing instructions, such as our slide, vrgather, and compress
instructions (reductions also).  All SIMD ISAs add some variants of
these.

Many simple vector arithmetic operations are bitsliced.

The interesting cases are mixed-width operations, which are prevalent
in low-precision multiply-accumulate kernels that dominate many
existing and emerging compute areas, but there are plenty of other
kernels that operate on mixed-width data items.  Classic SIMD ISAs
handle mixed-width operations in one of five ways (would be glad to
add other known options to this list):

0) Single-width elements.
1) Specialized registers.
2) Pack/unpack.

3) Register pairs,

4) EDIV-style,
```

```
With RVV we are trying to support mixed-width operations without
adding specialized registers, or splitting an element across
architectural registers, or requiring implicit or explicit bitcrossing
beyond min(DPW,SLEN) on ALU operands (bitcrossing for memory
load/stores cannot be avoided).  We're also trying to support vector
units with current implementation targets ranging from VLEN=32 to
VLEN=16384.

```
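
For concreteness, option 4 in the list quoted above (EDIV-style) can be sketched in Python using the example Krste gives for it elsewhere in this thread: four 8b*8b multiplies accumulated into a 32b accumulator. This is an illustration only. The function names, the signed interpretation of the 8-bit subelements, the lowest-byte-first packing and the 32-bit wraparound are all assumptions, not taken from any spec.

```python
# Illustrative model of an EDIV-style dot-product step (assumed semantics).

def to_signed8(b):
    """Interpret an 8-bit value (0..255) as a signed integer."""
    return b - 256 if b >= 128 else b

def pack4(values):
    """Pack four signed 8-bit values into one 32-bit element, lowest first."""
    return sum((v & 0xFF) << (8 * k) for k, v in enumerate(values))

def ediv4_dot_acc(acc, x, y):
    """For each 32-bit element, multiply the four 8-bit subelements of x and y
    pairwise and add the products to the matching 32-bit accumulator."""
    out = []
    for a, xe, ye in zip(acc, x, y):
        total = a
        for k in range(4):
            xs = to_signed8((xe >> (8 * k)) & 0xFF)
            ys = to_signed8((ye >> (8 * k)) & 0xFF)
            total += xs * ys
        out.append(total & 0xFFFFFFFF)   # wrap to 32 bits
    return out

x = [pack4([1, 2, 3, 4]), pack4([-1, -2, -3, -4])]
y = [pack4([5, 6, 7, 8]), pack4([1, 1, 1, 1])]
acc = ediv4_dot_acc([0, 100], x, y)
```

Each accumulator only ever reads subelements from the element in its own position, which is how this style avoids bitcrossing; the cost is that inputs must already be laid out in groups of four.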

Nick Knight

Hi Bill,

My understanding was that the whole register loads and stores work by reinterpreting the (VLEN) bits in a V-register as if SEW were 8; in particular, any bit-permutation induced by vs1r.v will be inverted by the matching vl1r.v, making them effectively agnostic to element width. When refilling a V-register, it's up to software to (re)interpret the bits correctly by recording the appropriate CSRs.

However, I've fallen behind in tracking the latest developments, so I could just be plain wrong..

Best,
Nick Knight

On Mon, Jun 15, 2020 at 2:06 PM Bill Huffman <huffman@...> wrote:
I've not seen very many responses here.  I'll try to describe more
precisely what's concerning me.

In a wide, in-order SIMD core, I can expect two VLEN sized memory
accesses per cycle.  So a spill/fill pair costs a cycle in a memory
limited loop and less in an arithmetically limited loop.

I assume that whole register stores and loads (section 7.9 of the spec)
will be used for spills and fills when register allocation is
oversubscribed.  With this proposal, the spill will know the current
element width from the micro-architectural tag and so will adjust
without extra cost.   But the fill will not know what element width will
be needed.  When the fill element width is (often) not the right one,
there will be several cycles lost to do the (unpipelined) fix-up.

So, a spill-fill pair will increase in cost from a partial cycle to
several cycles.  When the core executes several vector instructions per
cycle, this is a worrisome cost.

I'm wondering whether the compiler can do anything to alleviate this.
Can the compiler know, even most of the time, what EEW was used for a
particular register when it spills and fills the register so that most
spill/fill pairs can avoid this overhead?

Whole register loads in general will cause this problem, but I think
fills after spill are the only one where performance matters.

Bill
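
The spill/fill cost described above can be illustrated with a toy Python model. Everything in it is hypothetical: the round-robin striping of elements across lanes is invented purely for illustration and is not the layout of any real implementation, and the class and function names are made up. The model only shows that a spill can use the tag, while a fill has to guess and may trigger a later fix-up.

```python
# Hypothetical model: a register whose internal byte arrangement depends on a
# micro-architectural EEW tag. The striping scheme is invented for illustration.

LANES = 2  # assumed number of datapath lanes

def to_physical(mem_bytes, eew):
    """Distribute eew-byte elements round-robin across LANES lanes."""
    elems = [mem_bytes[i:i + eew] for i in range(0, len(mem_bytes), eew)]
    return [b for lane in range(LANES) for e in elems[lane::LANES] for b in e]

def to_memory(phys_bytes, eew):
    """Inverse of to_physical for the same eew."""
    elems = [phys_bytes[i:i + eew] for i in range(0, len(phys_bytes), eew)]
    per_lane = len(elems) // LANES
    ordered = [elems[lane * per_lane + j]
               for j in range(per_lane) for lane in range(LANES)]
    return [b for e in ordered for b in e]

class TaggedReg:
    def __init__(self, mem_bytes, eew):
        self.tag = eew
        self.phys = to_physical(mem_bytes, eew)
        self.fixups = 0

    def spill(self):
        """Whole-register store: the tag is known, so no fix-up is needed."""
        return to_memory(self.phys, self.tag)

    @classmethod
    def fill(cls, mem_bytes, guessed_eew):
        """Whole-register load: the future EEW is unknown, so it is guessed."""
        return cls(mem_bytes, guessed_eew)

    def use(self, eew):
        """Read with a given EEW, re-arranging (and counting) on a mismatch."""
        if eew != self.tag:
            self.phys = to_physical(to_memory(self.phys, self.tag), eew)
            self.tag = eew
            self.fixups += 1
        return to_memory(self.phys, eew)

data = list(range(16))            # a 16-byte register image
reg = TaggedReg(data, eew=4)      # last written with 4-byte elements
saved = reg.spill()

wrong = TaggedReg.fill(saved, guessed_eew=1)   # fill guesses the wrong EEW
right = TaggedReg.fill(saved, guessed_eew=4)   # fill guesses the right EEW
wrong_view = wrong.use(4)
right_view = right.use(4)
```

In this model the software-visible bytes are the same either way; only the fix-up count differs, which is the overhead a compiler hint about the expected EEW would be trying to avoid.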

On 6/12/20 4:58 PM, Bill Huffman wrote:
> Hi Krste,
>
> I've been thinking about what happens with this proposal and whole
> register loads and stores - under the assumption of adding uops
> (out-of-order or in-order) when a micro-architectural tag doesn't match
> the intended usage.  I think the stores are probably fine as they know
> the EEW that the register arrangement belongs to and can store correctly
> with the same byte movement as a store of that size would.
>
> But whole register loads don't know where to put the bytes and so will
> have to choose and count on being fixed later, which brings up two points:
>
> First, it makes spills/fills more expensive because after a fill,
> there's an extra small number of cycles to fix the expected EEW.
>
> Second, it means it probably matters to re-write the register when the
> current arrangement and the expected one don't match.  Otherwise after,
> say, returning from an interrupt, a register that's used a large number
> of times will cause a large number of lost cycles.
>
> I think I heard the comment this morning that the register probably
> shouldn't be re-written, just rearranged on the fly for the current use.
>    Did I hear that?  What is the argument for that?
>
>         Bill
>
> On 6/12/20 4:05 AM, Krste Asanovic wrote:
>>
>> TL;DR: I'm leaning towards mandating SLEN=VLEN layout, at least for
>> application processor profiles.
>>
>>
>> Regarding register layout, I thought it would be good to lay out the
>> landscape and comparison with other SIMD ISAs before diving into a
>> proposal for RVV.
>>
>>
>> I think it's useful to distinguish "bitsliced" operations from
>> "bitcrossing" operations.
>>
>> It's also useful to define a separate term for physical datapath width
>> "DPW".  In sensible designs, VLEN is an integer power-of-2 multiple of
>> DPW.  If
>>
>> Bitsliced operations on elements of size EEW operate entirely within
>> an EEW region of DPW.
>>
>> Bitcrossing operations traverse more than (source/dest) EEW bits of
>> DPW.
>>
>> In all sane general-purpose SIMD designs, memory operations can move
>> vectors that are naturally aligned to element boundaries, not only to
>> VLEN boundaries, so all memory operations are bitcrossing operations
>> assuming DPW > smallest EEW and require at least a memory rotate if
>> not a full crossbar between memory ports and register file ports.
>> (Some specialized SIMD designs might retain a VLEN-alignment
>> constraint, but they're not of interest here).
>>
>> There are specialized register permute instructions that are
>> bitcrossing instructions, such as our slide, vrgather, and compress
>> instructions (reductions also).  All SIMD ISAs add some variants of
>> these.
>>
>> Many simple vector arithmetic operations are bitsliced.
>>
>> The interesting cases are mixed-width operations, which are prevalent
>> in low-precision multiply-accumulate kernels that dominate many
>> existing and emerging compute areas, but there are plenty of other
>> kernels that operate on mixed-width data items.  Classic SIMD ISAs
>> handle mixed-width operations in one of five ways (would be glad to
>> add other known options to this list):
>>
>> 0) Single-width elements.  All operations operate on same width, with
>>      variety of load/stores widening/narrowing into/out of this element
>>      size.  For lower precision arithmetic, this wastes a lot of
>>      register file capacity, regfile port bandwidth, and ALU throughput.
>>
>> 1) Specialized registers. Some SIMD machines have dedicated wider
>>     accumulator registers, so datapath remains effectively bitsliced.
>>     Many machines have dedicated predicate registers (treating
>>     predicates as example of mixed-width operation).
>>
>> 2) Pack/unpack.  Arithmetic instructions are all bitsliced, but with
>>     separate bitcrossing data movement operations to pack/unpack
>>     elements, e.g., register-register unpack will sign/zero-extend
>>     top/bottom of a vector to yield destination vector of 2*source-EEW
>>     elements.  pack will do reverse from parts of two vectors.  Unpacked
>>     memory loads/stores similarly sign/zero-extend or truncate on way
>>     in/out of memory.  The pack/unpack instructions tend to imply
>>     crossing the entire DPW, and also complicate software, which has to
>>     unroll loops into hi/lo portions and issue separate instructions for
>>     each half.
>>
>> 3) Register pairs, where wider operand is created by pairing two
>>     existing registers within EEW so avoiding bitcrossing.  But this
>>     splits a single element's storage across two architectural
>>     registers, which doesn't support load/store to in-memory application
>>     formats without pack/unpack bitcrossing operations or additional
>>     load/store instructions (also effectively pack/unpack instructions).
>>
>> 4) EDIV-style, where mixed operations are handled by dividing an
>>     element width into subelements and accumulating multiple subelements
>>     into parent element size (e.g., 4 8b*8b multiplies accumulated into
>>     32b accumulator).  These provide mixed-width operations while
>>     avoiding bitcrossing.  However, they impose restrictions on
>>     application input and/or output data layouts to achieve high
>>     efficiency.
>>
>> With RVV we are trying to support mixed-width operations without
>> adding specialized registers, or splitting an element across
>> architectural registers, or requiring implicit or explicit bitcrossing
>> beyond min(DPW,SLEN) on ALU operands (bitcrossing for memory
>> load/stores cannot be avoided).  We're also trying to support vector
>> units with current implementation targets ranging from VLEN=32 to
>> VLEN=16384.
>>
>> ----------------------------------------------------------------------
>>
>> The SLEN parameter allows implementations to optimize their wire
>> length.  For narrower DPW (<=128b) regardless of VLEN, the SLEN=VLEN
>> layout where in-register format matches in-memory works OK as
>> bitcrossing cannot be outside DPW.  For wider DPW (>128b), SLEN<VLEN
>> layouts minimize bitcrossing.  When codes want to pun bytes between
>> different element widths, the SLEN<VLEN layout requires cast operations that
>> will shuffle bytes around (become simple register moves on SLEN=VLEN
>> machines).
>>
>> I see two separate problems with the SLEN parameter.
>>
>> 1) SLEN=VLEN layout can be profitably exploited in some code,
>> encouraging programmers to ignore compatibility and drop cast
>> instructions.
>>
>> 2) Correct code varying in SLEN cannot be migrated between machines
>> with same VLEN.  This I view as a quite serious issue, not just for
>> migration but also verification and any case where we're working
>> across two implementations.
>>
>> We've worked through many alternatives, but at this point, I'm back to
>> proposing that SLEN=VLEN is an extension, and that this extension is
>> required for application processors (i.e., this is in "V").  The mode
>> bit idea (software indicates if SLEN=VLEN layout is needed) doesn't
>> solve thread migration, but we could add that to SLEN<VLEN profile
>> somehow instead of casting, in a way such that SLEN=VLEN could ignore
>> it and didn't have to implement the null cast instructions.
>>
>> For systems with DPW <=128b, this is simple to implement in all kinds
>> of systems.
>>
>> For wider datapaths DPW>=256b, the SLEN<VLEN layout would be
>> preferable.
>>
>> For more complex microarchitectures that want wider DPW, SLEN=VLEN can
>> be the software view, with internal layout hidden by the
>> microarchitectural tricks we previously discussed.  Any access using
>> the wrong EEW can shuffle the bytes microarchitecturally.  This does
>> add complexity, but assuming that casting is relatively rare, these
>> machines would benefit from the fact that the ISA does not require bit
>> crossing for general mixed-width code (not just for EDIV).  There is
>> clearly some complexity here, but I think the shuffling is thankfully
>> contained within a DPW-bit element group.
>>
>> ----------------------------------------------------------------------
>>
>> A different direction would be to say that SLEN=VLEN layout is
>> mandatory but bitcrossing instructions are an extension.  But we'd
>> still need to define something for mixed-width arithmetic (EDIV is
>> probably least objectionable choice out of above list of 0-4 options).
>>
>> ----------------------------------------------------------------------
>>
>> Finally, I think for the crypto extensions there is actually no need
>> to limit ELEN.  We can instead just limit bitcrossing arithmetic
>> instructions to ELEN<=128.  ELEN >128 need only be supported by a few
>> operations such as crypto.  We can use wider ELEN with EDIV, where
>> EDIV cuts size of sub-element to supported arithmetic EEW (e.g.,
>> ELEN=256 with EDIV=8 for 8*32b floats).
>>
>>
>> Krste
>>
>>

Bill Huffman

I've not seen very many responses here. I'll try to describe more
precisely what's concerning me.

In a wide, in-order SIMD core, I can expect two VLEN sized memory
accesses per cycle. So a spill/fill pair costs a cycle in a memory
limited loop and less in an arithmetically limited loop.

I assume that whole register stores and loads (section 7.9 of the spec)
will be used for spills and fills when register allocation is
oversubscribed. With this proposal, the spill will know the current
element width from the micro-architectural tag and so will adjust
without extra cost. But the fill will not know what element width will
be needed. When the fill element width is (often) not the right one,
there will be several cycles lost to do the (unpipelined) fix-up.

So, a spill-fill pair will increase in cost from a partial cycle to
several cycles. When the core executes several vector instructions per
cycle, this is a worrisome cost.

I'm wondering whether the compiler can do anything to alleviate this.
Can the compiler know, even most of the time, what EEW was used for a
particular register when it spills and fills the register so that most
spill/fill pairs can avoid this overhead?

Whole register loads in general will cause this problem, but I think
fills after spills are the only case where performance matters.

Bill

Bill Huffman

Hi Krste,

I've been thinking about what happens with this proposal and whole
(out-of-order or in-order) when a micro-architectural tag doesn't match
the intended usage. I think the stores are probably fine as they know
the EEW that the register arrangement belongs to and can store correctly
with the same byte movement as a store of that size would.

But whole register loads don't know where to put the bytes and so will
have to choose and count on being fixed later, which brings up two points:

First, it makes spills/fills more expensive because after a fill,
there's an extra small number of cycles to fix the expected EEW.

Second, it means it probably matters to re-write the register when the
current arrangement and the expected one don't match. Otherwise after,
say, returning from an interrupt, a register that's used a large number
of times will cause a large number of lost cycles.
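
As a rough illustration of the second point (not taken from the thread), the toy model below counts how many byte-shuffling fix-ups a register incurs after a fill that tagged it with the wrong EEW. The class and function names are hypothetical, and the model counts fix-up events only; it makes no claim about how many cycles each one costs.

```python
class TaggedRegister:
    """Toy vector register whose internal byte arrangement is tagged with
    the EEW it is currently arranged for (hypothetical model)."""

    def __init__(self, tag_eew):
        self.tag_eew = tag_eew
        self.fixups = 0

    def read(self, eew, rewrite_on_mismatch):
        """Access the register at `eew`, counting a fix-up on tag mismatch."""
        if eew != self.tag_eew:
            self.fixups += 1              # bytes must be shuffled for this access
            if rewrite_on_mismatch:
                self.tag_eew = eew        # keep the rearranged copy for next time


def fixups_after_fill(fill_eew, use_eew, uses, rewrite_on_mismatch):
    """Fix-ups seen when a register filled with tag `fill_eew` is then
    read `uses` times at `use_eew`."""
    reg = TaggedRegister(fill_eew)
    for _ in range(uses):
        reg.read(use_eew, rewrite_on_mismatch)
    return reg.fixups


if __name__ == "__main__":
    print(fixups_after_fill(8, 32, 1000, rewrite_on_mismatch=False))
    print(fixups_after_fill(8, 32, 1000, rewrite_on_mismatch=True))
```

Under these assumptions, rearranging on the fly pays the fix-up on every use, while re-writing the register pays it once.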

I think I heard the comment this morning that the register probably
shouldn't be re-written, just rearranged on the fly for the current use.
Did I hear that? What is the argument for that?

Bill

David Horner

On Fri, Jun 12, 2020, 07:05 Krste Asanovic, <krste@...> wrote:

TL;DR: I'm leaning towards mandating SLEN=VLEN layout, at least for
application processor profiles.
I agree that this should be default or base configuration.

The interesting cases are mixed-width operations, which are prevalent
in low-precision multiply-accumulate kernels that dominate many
existing and emerging compute areas, but there are plenty of other
kernels that operate on mixed-width data items.  Classic SIMD ISAs
handle mixed-width operations in one of five ways (would be glad to
add other known options to this list):

I believe the extended fractional layout is a 6th approach. Issue 465. Does it qualify as a known option?

As with option 2, loads and stores format the data for widening operations. However, the data remains in single width and the widening instructions.

This approach ensures software is aware, as explicit layout formatting is required in advance of the widening instructions.

Krste Asanovic

TL;DR: I'm leaning towards mandating SLEN=VLEN layout, at least for
application processor profiles.

Regarding register layout, I thought it would be good to lay out the
landscape and comparison with other SIMD ISAs before diving into a
proposal for RVV.

I think it's useful to distinguish "bitsliced" operations from
"bitcrossing" operations.

It's also useful to define a separate term for physical datapath width
"DPW". In sensible designs, VLEN is an integer power-of-2 multiple of
DPW. If

Bitsliced operations on elements of size EEW operate entirely within
an EEW region of DPW.

Bitcrossing operations traverse more than (source/dest) EEW bits of
DPW.
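
One rough way to make the distinction concrete (an illustration, not part of the original message) is to model each element as a bit range in the register and ask whether a result lands in the same range as its source. The sketch assumes a contiguous, in-memory-like packing where element i of width EEW starts at bit i*EEW; the function names are hypothetical.

```python
def element_span(index, eew):
    """Bit range [lo, hi) of element `index` when elements of width `eew`
    are packed contiguously (assumed SLEN=VLEN-style layout)."""
    lo = index * eew
    return lo, lo + eew


def is_bitsliced(index, src_eew, dst_eew):
    """True if the destination element occupies exactly the source
    element's bit range, so nothing moves sideways in the datapath."""
    return element_span(index, src_eew) == element_span(index, dst_eew)


def max_shift(vl, src_eew, dst_eew):
    """Largest distance in bits between where a source element starts and
    where its result starts, over the first `vl` elements."""
    return max(
        abs(element_span(i, dst_eew)[0] - element_span(i, src_eew)[0])
        for i in range(vl)
    )


if __name__ == "__main__":
    print(is_bitsliced(3, 32, 32))   # same-width op on element 3
    print(is_bitsliced(3, 8, 16))    # widening op on element 3
    print(max_shift(8, 8, 16))       # how far the last of 8 results moves
```

In this model a same-width add never moves data, while a widening operation on contiguously packed elements moves results further the higher the element index, which is the case the layout discussion is about.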

In all sane general-purpose SIMD designs, memory operations can move
vectors that are naturally aligned to element boundaries, not only to
VLEN boundaries, so all memory operations are bitcrossing operations
assuming DPW > smallest EEW and require at least a memory rotate if
not a full crossbar between memory ports and register file ports.
(Some specialized SIMD designs might retain a VLEN-alignment
constraint, but they're not of interest here).

There are specialized register permute instructions that are
bitcrossing instructions, such as our slide, vrgather, and compress
instructions (reductions also). All SIMD ISAs add some variants of
these.

Many simple vector arithmetic operations are bitsliced.

The interesting cases are mixed-width operations, which are prevalent
in low-precision multiply-accumulate kernels that dominate many
existing and emerging compute areas, but there are plenty of other
kernels that operate on mixed-width data items. Classic SIMD ISAs
handle mixed-width operations in one of five ways (would be glad to
add other known options to this list):

0) Single-width elements. All operations operate on same width, with
variety of load/stores widening/narrowing into/out of this element
size. For lower precision arithmetic, this wastes a lot of
register file capacity, regfile port bandwidth, and ALU throughput.

1) Specialized registers. Some SIMD machines have dedicated wider
accumulator registers, so datapath remains effectively bitsliced.
Many machines have dedicated predicate registers (treating
predicates as example of mixed-width operation).

2) Pack/unpack. Arithmetic instructions are all bitsliced, but with
separate bitcrossing data movement operations to pack/unpack
elements, e.g., register-register unpack will sign/zero-extend
top/bottom of a vector to yield destination vector of 2*source-EEW
elements. pack will do reverse from parts of two vectors. Unpacked
memory loads/stores similarly sign/zero-extend or truncate on way
in/out of memory. The pack/unpack instructions tend to imply
crossing the entire DPW, and also complicate software, which has to
unroll loops into hi/lo portions and issue separate instructions for
each half.

3) Register pairs, where wider operand is created by pairing two
existing registers within EEW so avoiding bitcrossing. But this
splits a single element's storage across two architectural
registers, which doesn't support load/store to in-memory application
formats without pack/unpack bitcrossing operations or additional
load/store instructions (also effectively pack/unpack instructions).

4) EDIV-style, where mixed operations are handled by dividing an
element width into subelements and accumulating multiple subelements
into parent element size (e.g., 4 8b*8b multiplies accumulated into
32b accumulator). These provide mixed-width operations while
avoiding bitcrossing. However, they impose restrictions on
application input and/or output data layouts to achieve high
efficiency.
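
To illustrate option 4 with the example given (four 8b*8b multiplies accumulated into a 32b accumulator), here is a minimal sketch. It assumes signed 8-bit subelements packed least-significant first into the parent element and a wrapping signed accumulator; the function names are hypothetical and this is not a statement of RVV EDIV semantics.

```python
def split_subelements(word, ediv=4, sub_bits=8):
    """Split a parent element into `ediv` signed subelements,
    least-significant subelement first (assumed packing)."""
    mask = (1 << sub_bits) - 1
    subs = []
    for k in range(ediv):
        raw = (word >> (k * sub_bits)) & mask
        if raw >= 1 << (sub_bits - 1):    # sign-extend the subelement
            raw -= 1 << sub_bits
        subs.append(raw)
    return subs


def ediv_dot_accumulate(acc, a_word, b_word, ediv=4, sub_bits=8):
    """Multiply matching subelements of two parent elements and add the
    products to `acc`, wrapping to the signed parent width."""
    parent_bits = ediv * sub_bits
    a = split_subelements(a_word, ediv, sub_bits)
    b = split_subelements(b_word, ediv, sub_bits)
    total = (acc + sum(x * y for x, y in zip(a, b))) & ((1 << parent_bits) - 1)
    if total >= 1 << (parent_bits - 1):
        total -= 1 << parent_bits
    return total


if __name__ == "__main__":
    # a = [1, 2, 3, 4], b = [1, 1, 2, -1]
    print(ediv_dot_accumulate(10, 0x04030201, 0xFF020101))
```

Every product stays inside the bit range of its own parent element, which is why no bitcrossing is needed; the price is that inputs must already be grouped this way in memory.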

With RVV we are trying to support mixed-width operations without
adding specialized registers, or splitting an element across
architectural registers, or requiring implicit or explicit bitcrossing
beyond min(DPW,SLEN) on ALU operands (bitcrossing for memory
load/stores cannot be avoided). We're also trying to support vector
units with current implementation targets ranging from VLEN=32 to
VLEN=16384.

----------------------------------------------------------------------

The SLEN parameter allows implementations to optimize their wire
length. For narrower DPW (<=128b) regardless of VLEN, the SLEN=VLEN
layout where in-register format matches in-memory works OK as
bitcrossing cannot be outside DPW. For wider DPW (>128b), SLEN<VLEN
layouts minimize bitcrossing. When codes want to pun bytes between
different element widths, the SLEN<VLEN layout requires cast operations that
will shuffle bytes around (become simple register moves on SLEN=VLEN
machines).
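
The need for casts can be sketched as follows. The mapping used here is an assumption, not quoted from this thread: elements are dealt round-robin across VLEN/SLEN stripes, with element i placed in stripe i mod (VLEN/SLEN). Function names are hypothetical. The sketch records, for each register byte, which memory byte a unit-stride load would put there at a given SEW.

```python
def element_byte_positions(index, sew, vlen, slen):
    """Register byte positions of element `index` (all sizes in bits).

    ASSUMPTION: elements are dealt round-robin across VLEN/SLEN stripes,
    element i landing in stripe i % stripes at slot i // stripes.
    """
    assert sew <= slen <= vlen
    stripes = vlen // slen
    stripe, slot = index % stripes, index // stripes
    start = (stripe * slen + slot * sew) // 8
    return range(start, start + sew // 8)


def register_image(sew, vlen, slen):
    """For each register byte, the memory byte offset that a unit-stride
    load of SEW-wide elements would place there."""
    image = [None] * (vlen // 8)
    for i in range(vlen // sew):
        for b, pos in enumerate(element_byte_positions(i, sew, vlen, slen)):
            image[pos] = i * (sew // 8) + b
    return image


if __name__ == "__main__":
    for slen in (256, 128):
        as_bytes = register_image(8, 256, slen)
        as_words = register_image(32, 256, slen)
        print(f"SLEN={slen}: layouts agree -> {as_bytes == as_words}")
```

Under this assumed mapping, SLEN=VLEN gives the same byte image at every SEW, so reinterpreting a register at another width is free. With SLEN=128 and VLEN=256 the images differ, and that difference is the byte movement a cast would have to perform.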

I see two separate problems with the SLEN parameter.

1) SLEN=VLEN layout can be profitably exploited in some code,
encouraging programmers to ignore compatibility and drop cast
instructions.

2) Correct code varying in SLEN cannot be migrated between machines
with same VLEN. This I view as a quite serious issue, not just for
migration but also verification and any case where we're working
across two implementations.

We've worked through many alternatives, but at this point, I'm back to
proposing that SLEN=VLEN is an extension, and that this extension is
required for application processors (i.e., this is in "V"). The mode
bit idea (software indicates if SLEN=VLEN layout is needed) doesn't
solve thread migration, but we could add that to SLEN<VLEN profile
somehow instead of casting, in a way such that SLEN=VLEN could ignore
it and didn't have to implement the null cast instructions.

For systems with DPW <=128b, this is simple to implement in all kinds
of systems.

For wider datapaths DPW>=256b, the SLEN<VLEN layout would be
preferable.

For more complex microarchitectures that want wider DPW, SLEN=VLEN can
be the software view, with internal layout hidden by the
microarchitectural tricks we previously discussed. Any access using
the wrong EEW can shuffle the bytes microarchitecturally. This does
add complexity, but assuming that casting is relatively rare, these
machines would benefit from the fact that the ISA does not require bit
crossing for general mixed-width code (not just for EDIV). There is
clearly some complexity here, but I think the shuffling is thankfully
contained within a DPW-bit element group.

----------------------------------------------------------------------

A different direction would be to say that SLEN=VLEN layout is
mandatory but bitcrossing instructions are an extension. But we'd
still need to define something for mixed-width arithmetic (EDIV is
probably least objectionable choice out of above list of 0-4 options).

----------------------------------------------------------------------

Finally, I think for the crypto extensions there is actually no need
to limit ELEN. We can instead just limit bitcrossing arithmetic
instructions to ELEN<=128. ELEN >128 need only be supported by a few
operations such as crypto. We can use wider ELEN with EDIV, where
EDIV cuts size of sub-element to supported arithmetic EEW (e.g.,
ELEN=256 with EDIV=8 for 8*32b floats).

Krste
