#### Re: On Vector Register Layout

David Horner

On 2020-06-24 10:18 p.m., Bill Huffman wrote:


> Hi David,
>
> If I try to compare this to the current proposal, it seems to me there are two major differences.
>
> ** A layout difference in the wide registers

The wide register being the destination, correct?

> where elements alternate between two registers

Not sure I follow this, which two registers? If physical registers, then no.

That design was the v0.8 method of vertical cycling through physical registers.

> instead of going through first one and then the other.

One of v0.9 instructions could do the work of two under "Even In Odd In Out Wide", "eioiow".

> ** Two instructions to accomplish what one accomplishes today.

But would require masks to do what eioiow can do in one unmasked.

A classic, Your Mileage May Vary.

The strong advantage is that eioiow does not cross lanes, has a simple implementation independent structure and is applicable in common use cases.

> I've used even/odd arrangements a lot over the years and would certainly consider them for advantage. But I'm not seeing the advantage here. 5B seems to require twice as many register specifiers.

Not sure what you mean by register specifiers, but as the examples below show, there is no need for even/odd sets of opcodes.

Slideup1/down1 fusing and reversing vs1/vs2 designations replaces the need for them.

Example:

CompMult: ; a0 length, a1 complex double float result addr, a2 and a3 complex single float inputs

... ; standard preamble and loop set up

add t0,a0,a0 ; - t0 twice complex arg length

loop:

vsetvli t1,t0,e4,m8,eioiow ; Even In Odd In Out Wide

; note vsetvli with eioiow will ensure vl is even.

vlde v0,a2 ; load complex single as two consecutive floats

vlde v8,a3

vslideup1 v16,v0 ; make a2 real odd

; op vd, vs2, vs1 ... widening uses vs2 even and vs1 odd

vfwmul.vv v16,v8,v16 ; note result can overwrite source

; above can fuse

vslidedown1 v24,v0 ; make a2 imaginary even, use imaginary result as temporary

vfwnmsac.vv v16,v24,v8 ; v16 -= v0.img*v8.img

; chained ops can notice v24 does not need to be written due to vd of next op

vfwmul.vv v24,v0,v8 ; v0.real*v8.img

vfwmacc.vv v24,v8,v0 ; v8.real*v0.img

; above can be chained, recognize the pattern and forward the other read elements.

; optimizations may interleave even/odd read ports to match most frequent widths.

vsetvli t0,a0,e8,m8 ; halve vl

vssseg2e8 v16,a1

; .... housekeeping and loop
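
To make the data flow of the loop body easier to check, here is a small Python model of it. This is only a sketch under my reading of the listing: `eioiow`, `slideup1`, `slidedown1` and `comp_mult` are hypothetical helper names, the first source operand is assumed to supply the even elements and the second the odd elements, and the element shifted in by a slide is assumed to be irrelevant (it is filled with 0.0 here). Python floats stand in for the widened results.

```python
def eioiow(op, even_src, odd_src):
    """Model of one eioiow op: pair element 2i of even_src with element 2i+1 of odd_src."""
    assert len(even_src) == len(odd_src) and len(even_src) % 2 == 0
    return [op(even_src[i], odd_src[i + 1]) for i in range(0, len(even_src), 2)]

def slideup1(v, fill=0.0):
    """Shift every element up by one position (fill value is an assumption)."""
    return [fill] + v[:-1]

def slidedown1(v, fill=0.0):
    """Shift every element down by one position (fill value is an assumption)."""
    return v[1:] + [fill]

def mul(x, y):
    return x * y

def comp_mult(a, b):
    """a, b: interleaved [re0, im0, re1, im1, ...]; returns interleaved products."""
    v0, v8 = a, b                                   # the two loads
    v16 = eioiow(mul, v8, slideup1(v0))             # b.re * a.re
    v24 = slidedown1(v0)                            # a.im moved to even slots
    v16 = [acc - p for acc, p in zip(v16, eioiow(mul, v24, v8))]  # -= a.im * b.im
    v24 = eioiow(mul, v0, v8)                       # a.re * b.im
    v24 = [acc + p for acc, p in zip(v24, eioiow(mul, v8, v0))]   # += b.re * a.im
    out = []
    for re, im in zip(v16, v24):                    # two-field segment store
        out += [re, im]
    return out

result = comp_mult([1.0, 2.0, 3.0, -1.0], [0.5, 4.0, -2.0, 1.5])
```

For the inputs (1+2j), (3-1j) and (0.5+4j), (-2+1.5j) the model gives the same values as ordinary complex multiplication, which is the property the listing relies on.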

Bill

On 6/24/20 7:05 PM, David Horner wrote:

On 2020-06-12 7:05 a.m., Krste Asanovic wrote:

> The interesting cases are mixed-width operations, which are prevalent in low-precision multiply-accumulate kernels that dominate many existing and emerging compute areas, but there are plenty of other kernels that operate on mixed-width data items. Classic SIMD ISAs handle mixed-width operations in one of five ways (would be glad to add other known options to this list):

I will make a stab at even and odd layout for widening.

5) Two versions of the widening ops are defined, one for even and one for odd.

The registers are divided into even:odd pairs.

The full widened result is the result of the operation performed on the even (or odd) halves of the pairs.

The downsides of this approach are:

a) the need for two instructions.

b) only 1/2 of the input register bandwidth is used.

The widening operation is in lane.

Note: this approach is similar to the v0.8 LMUL=1 widening if SLEN were SEW wide.

Logically, v0.8 does both an even (to dest) and an odd (to dest+1) set of instructions.
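
As a rough illustration of option 5, the Python sketch below models the two variants on element lists. The names `widen_even` and `widen_odd` are hypothetical, and Python integers stand in for the double-width results, so the widening itself is implicit.

```python
def widen_even(op, vs2, vs1):
    """Apply op to the even-indexed elements of both sources."""
    return [op(vs2[i], vs1[i]) for i in range(0, len(vs2), 2)]

def widen_odd(op, vs2, vs1):
    """Apply op to the odd-indexed elements of both sources."""
    return [op(vs2[i], vs1[i]) for i in range(1, len(vs2), 2)]

def mul(x, y):
    return x * y

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

even = widen_even(mul, a, b)   # products of elements 0 and 2
odd = widen_odd(mul, a, b)     # products of elements 1 and 3
```

Each call reads only half of the elements of its inputs, and both calls are needed to cover every element, which matches points a) and b) above.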

5B) A variation of this is possible for RVV: an even/odd widening op mode.

vs1 provides the odd elements and vs2 provides the even elements, and vd has a double-width result.

This approach has a number of advantages.

a) When vs1 = vs2 then a single input vector provides both arguments: single read port, reduced energy cost.

b) note that vd can also be either vs1 or vs2.

c) as a result vd can be used as a temp for a slideup1/down1 of either input to emulate even or odd pair ops.

(this could be fused or to allow even/odd

d) as with the base even:odd approach, operations are in lane, and with the v0.9 model register sets of up to 8 physical registers can participate.

e) with v0.9 the ordinal masking interoperates unchanged.
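
Points a) and c) can be sketched in Python as follows. This is a model under assumptions, not a definition of the mode: `eioiow`, `slideup1` and `slidedown1` are hypothetical helper names, the element shifted in by a slide is filled with 0 because it is never read, and Python integers stand in for the double-width results.

```python
def eioiow(op, vs2, vs1):
    """vs2 supplies the even elements, vs1 the odd elements; one wide result per pair."""
    assert len(vs2) == len(vs1) and len(vs2) % 2 == 0
    return [op(vs2[i], vs1[i + 1]) for i in range(0, len(vs2), 2)]

def slideup1(v, fill=0):
    """Shift elements up by one position (fill value is an assumption)."""
    return [fill] + v[:-1]

def slidedown1(v, fill=0):
    """Shift elements down by one position (fill value is an assumption)."""
    return v[1:] + [fill]

def mul(x, y):
    return x * y

x = [1, 2, 3, 4]
y = [10, 20, 30, 40]

same_src = eioiow(mul, x, x)               # a) x[2i] * x[2i+1] from a single input
even_pair = eioiow(mul, x, slideup1(y))    # c) emulates an even op: x[2i] * y[2i]
odd_pair = eioiow(mul, slidedown1(x), y)   # c) emulates an odd op: x[2i+1] * y[2i+1]
```

With one slide on either input, the same operation reproduces both the even and the odd variant of option 5, which is why no separate even/odd opcodes are needed in this model.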

Note: under v0.9 existing instructions provide supporting operations.

e.g. for SEW>8, a load with a 1/2 unit stride can simulate an interleaved load.

I wanted to provide this option before the meeting because it clearly demonstrates another plausible approach to HPC independent of an SLEN parameter.

The presumption of SLEN, even when subsumed in the VLEN=SLEN, is not necessary for a base model.

Assuming a SLEN<=VLEN model when stipulating VLEN=SLEN is like mandating a rational (a / b) number set and then stipulating the denominator (b) is 1.

Better to mandate integer, a conceptually simpler number set, and introduce rational (or reals) if and when needed.