#### Re: On Vector Register Layout

David Horner

On 2020-06-24 10:18 p.m., Bill Huffman wrote:
Hi David,

If I try to compare this to the current proposal, it seems to me there
are two major differences.

** A layout difference in the wide registers
The wide register being the destination, correct?
where elements alternate between two registers
Not sure I follow this, which two registers? If physical registers, then no.
That design was the v0.8 method of vertical cycling through physical registers.
instead of going through first one and then the other.

** Two instructions to accomplish what one accomplishes today.
One v0.9 instruction could do the work of two under "Even In Odd In Out Wide" ("eioiow"),
but it would require masks to do what eioiow can do in one unmasked instruction.

A classic, Your Mileage May Vary.

The strong advantage is that eioiow does not cross lanes, has a simple, implementation-independent structure, and is applicable in common use cases.
I've used even/odd arrangements a lot over the years and would certainly

5B seems to require twice as many register specifiers.
Not sure what you mean by register specifiers, but as the examples below show, there is no need for even/odd sets of opcodes.
Slideup1/down1 fusing and reversing the vs1/vs2 designations replace the need for them.

Example:

CompMult:   ; a0 length, a1 complex double float result addr, a2 and a3 complex single float inputs

... ; standard preamble and loop set up

add t0,a0,a0   ; - t0 twice complex arg length

loop:

vsetvli t1,t0,e4,m8,eioiow ; Even In Odd In Out Wide

;  note vsetvli with eioiow will ensure vl is even.

vlde v0,a2  ; load complex single as two consecutive floats
vlde v8,a3

vslideup1 v16,v0 ; make a2 real odd
;   op vd, vs2, vs1   ... widening uses vs2 even and vs1 odd
vfwmul.vv v16,v8,v16 ; note result can overwrite source
; above can fuse

vslidedown1 v24,v0 ; make a2 imaginary even, use imaginary result as temporary
vfwnmsac v16,v24,v8 ;
; chained ops can notice v24 does not need to be written due to vd of next op

vfwmul.vv v24,v0,v8 ; v0.real*v8.img
vfwmac.vv v24,v8,v0 ;v8.real*v0.img
; above can be chained, recognize the pattern and forward the other read elements.

;   optimizations may interleave even/odd read ports to match most frequent widths.

vsetvli t0,a0,e8,m8 ; halve vl

vssseg2e8 v16,a1

; .... housekeeping and loop
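
A minimal Python sketch of the arithmetic in the example above, to check that the four widening ops add up to a complex multiply. It assumes the convention stated in the comment "widening uses vs2 even and vs1 odd", with the first source operand supplying even elements and the second supplying odd ones. The helper names (`slideup1`, `slidedown1`, `eo_mul`, `complex_mul_eioiow`) are hypothetical models, not spec instructions, and widening is modelled only as "one result per even/odd pair".

```python
def slideup1(v, fill=0.0):
    """Model of vslideup1: element i moves to position i+1."""
    return [fill] + v[:-1]

def slidedown1(v, fill=0.0):
    """Model of vslidedown1: element i+1 moves to position i."""
    return v[1:] + [fill]

def eo_mul(even_src, odd_src):
    """Assumed eioiow widening multiply: out[i] = even_src[2i] * odd_src[2i+1]."""
    return [even_src[2 * i] * odd_src[2 * i + 1] for i in range(len(even_src) // 2)]

def complex_mul_eioiow(a, b):
    """a, b are interleaved [re0, im0, re1, im1, ...]; result is interleaved too."""
    # vslideup1 + vfwmul: b.re * a.re
    real = eo_mul(b, slideup1(a))
    # vslidedown1 + vfwnmsac: real -= a.im * b.im
    real = [r - p for r, p in zip(real, eo_mul(slidedown1(a), b))]
    # vfwmul: a.re * b.im
    imag = eo_mul(a, b)
    # vfwmac: imag += b.re * a.im
    imag = [m + p for m, p in zip(imag, eo_mul(b, a))]
    # segment store interleaves real and imaginary results
    out = []
    for r, m in zip(real, imag):
        out += [r, m]
    return out

a = [1.0, 2.0, 3.0, -1.0]   # (1+2j), (3-1j)
b = [0.5, 4.0, -2.0, 2.0]   # (0.5+4j), (-2+2j)
result = complex_mul_eioiow(a, b)
print(result)
```

Every product pairs an even position with the adjacent odd position, so under these assumptions no element moves further than one slot from where it was loaded.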

Bill

On 6/24/20 7:05 PM, David Horner wrote:

On 2020-06-12 7:05 a.m., Krste Asanovic wrote:
The interesting cases are mixed-width operations, which are prevalent
in low-precision multiply-accumulate kernels that dominate many
existing and emerging compute areas, but there are plenty of other
kernels that operate on mixed-width data items.  Classic SIMD ISAs
handle mixed-width operations in one of five ways (would be glad to
add other known options to this list):
I will take a stab at an even and odd layout for widening.

5) Two versions of the widening ops are defined, one for even and one for odd elements.
The registers are divided into even:odd pairs.
The full widened result is the result of the operation performed on the
even (or odd) halves of the pairs.
The downsides of this approach are:
a) the need for two instructions.
b) only 1/2 of the input register bandwidth is used.
The upside is that the widening operation is in lane.

Note: this approach is similar to the v0.8 LMUL=1 widening if SLEN were
SEW wide.
Logically, v0.8 does both an even (to dest) and an odd (to dest+1)
set of instructions.
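
A small Python sketch of option 5 as I read it, assuming elements are numbered from 0 and using a multiply as the widening op. The function names are hypothetical. It shows why two instructions are needed to cover all inputs and why each one reads only half of the source elements.

```python
def widen_mul_even(vs2, vs1):
    """Widening multiply on the even-indexed elements of both sources."""
    return [vs2[i] * vs1[i] for i in range(0, len(vs2), 2)]

def widen_mul_odd(vs2, vs1):
    """Widening multiply on the odd-indexed elements of both sources."""
    return [vs2[i] * vs1[i] for i in range(1, len(vs2), 2)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]

even = widen_mul_even(a, b)   # uses a[0], a[2], ... only
odd = widen_mul_odd(a, b)     # uses a[1], a[3], ... only

# Each result has half as many elements, each twice as wide, so it
# occupies the same number of bits as one source register.
SEW = 8  # assumed source element width in bits, for illustration only
for i in range(len(even)):
    # Wide result i starts at the same bit offset as narrow source element 2i.
    assert i * (2 * SEW) == (2 * i) * SEW

print(even, odd)
```

The offset check is the "in lane" property: a wide result element sits over the even:odd source pair it was computed from.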

5B) a variation of this is possible for RVV. An even/odd widening op mode.
vs1 provides the odd elements and vs2 provides the even elements
and vd has a double width result.
This approach has a number of advantages.
a) When vs1 = vs2 then a single input vector provides both
arguments: single read port, reduced energy cost.
b) note that vd can also be either vs1 or vs2.
c) as a result vd can be used as a temp for a slideup1/down1 either
input to emulate even or odd pair ops.
(this could be fused or to allow even/odd
d) as with the base even:odd approach, operations are in lane, and with the v0.9
model register sets of up to 8 physical registers can participate.
e) with v0.9 the ordinal masking interoperates unchanged.
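
A minimal Python sketch of the 5B semantics in points a) to c), assuming 0-based element numbering and a multiply as the widening op. `eo_widen_mul`, `slideup1` and `slidedown1` are hypothetical model names, and the fill value for the slid-in element is arbitrary here.

```python
def eo_widen_mul(vs2, vs1):
    """Assumed 5B op: vd[i] = vs2[2i] * vs1[2i+1], vd elements twice as wide."""
    assert len(vs2) == len(vs1) and len(vs2) % 2 == 0
    return [vs2[2 * i] * vs1[2 * i + 1] for i in range(len(vs2) // 2)]

def slideup1(v, fill=0):
    """Element i moves to position i+1 (even elements become odd)."""
    return [fill] + v[:-1]

def slidedown1(v, fill=0):
    """Element i+1 moves to position i (odd elements become even)."""
    return v[1:] + [fill]

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

mixed = eo_widen_mul(a, b)                   # a even * b odd
same = eo_widen_mul(a, a)                    # point a): one source feeds both operands
even_even = eo_widen_mul(a, slideup1(b))     # point c): emulate the "even" op of option 5
odd_odd = eo_widen_mul(slidedown1(a), b)     # point c): emulate the "odd" op of option 5

print(mixed, same, even_even, odd_odd)
```

In this model the slid copy is only ever read at the positions the widening op consumes, which is what would let an implementation fuse the slide with the op.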

Note: under v0.9 existing instructions provide supporting operations.
e.g. for SEW>8, a load with a 1/2 unit stride can simulate an interleaved load.

I wanted to provide this option before the meeting because it clearly
demonstrates another plausible approach to HPC independent of an SLEN
parameter.

The presumption of SLEN, even when subsumed in VLEN=SLEN, is not
necessary for a base model.

Assuming an SLEN<=VLEN model when stipulating VLEN=SLEN is like mandating
a rational (a/b) number set and then stipulating that the denominator (b)
is 1.
Better to mandate integers, a conceptually simpler number set, and
introduce rationals (or reals) if and when needed.
