On Vector Register Layout


Krste Asanovic
 

TL;DR: I'm leaning towards mandating SLEN=VLEN layout, at least for
application processor profiles.


Regarding register layout, I thought it would be good to lay out the
landscape and comparison with other SIMD ISAs before diving into a
proposal for RVV.


I think it's useful to distinguish "bitsliced" operations from
"bitcrossing" operations.

It's also useful to define a separate term for physical datapath width,
"DPW". In sensible designs, VLEN is an integer power-of-2 multiple of
DPW.

Bitsliced operations on elements of size EEW operate entirely within
an EEW region of DPW.

Bitcrossing operations traverse more than (source/dest) EEW bits of
DPW.
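As a rough illustration of the distinction, the sketch below measures how far an element has to travel between its source and destination positions within one DPW-wide pass. It assumes both operands are packed in element order starting at bit 0 (an SLEN=VLEN-style layout); the function names and the distance metric are hypothetical, chosen only for this example.

```python
def max_crossing_bits(src_eew, dst_eew, dpw_bits):
    """Largest offset, in bits, between where source element i starts and
    where destination element i starts, within one DPW-wide pass.

    Assumption: both vectors are packed in element order from bit 0.
    """
    n = dpw_bits // max(src_eew, dst_eew)  # elements produced per pass
    return max(abs(i * dst_eew - i * src_eew) for i in range(n))


def is_bitsliced(src_eew, dst_eew, dpw_bits):
    """An operation is treated as bitsliced here if no element moves at all."""
    return max_crossing_bits(src_eew, dst_eew, dpw_bits) == 0
```

Under these assumptions a same-width 32b add never leaves its own EEW region, while an 8b-to-16b widening operation moves data by a distance that grows with the element index but stays below DPW.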

In all sane general-purpose SIMD designs, memory operations can move
vectors that are naturally aligned to element boundaries, not only to
VLEN boundaries, so all memory operations are bitcrossing operations
assuming DPW > smallest EEW and require at least a memory rotate if
not a full crossbar between memory ports and register file ports.
(Some specialized SIMD designs might retain a VLEN-alignment
constraint, but they're not of interest here).
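The memory rotate can be pictured with a small sketch. It assumes a unit-stride load, a memory port that delivers DPW-aligned blocks, and byte-granular lanes; the helper names are hypothetical.

```python
def rotate_amount(addr, dpw_bytes):
    """Byte rotation needed so the byte at addr lands in register lane 0."""
    return addr % dpw_bytes


def lane_sources(addr, dpw_bytes):
    """For each register byte lane k, the memory-port lane that supplies it.

    Assumption: the byte at address a arrives on memory lane a % dpw_bytes.
    """
    return [(addr + k) % dpw_bytes for k in range(dpw_bytes)]
```

A DPW-aligned address gives the identity mapping, while any other element-aligned address gives a rotation, which is why these loads count as bitcrossing.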

There are specialized register permute instructions that are
bitcrossing instructions, such as our slide, vrgather, and compress
instructions (reductions also). All SIMD ISAs add some variants of
these.

Many simple vector arithmetic operations are bitsliced.

The interesting cases are mixed-width operations, which are prevalent
in low-precision multiply-accumulate kernels that dominate many
existing and emerging compute areas, but there are plenty of other
kernels that operate on mixed-width data items. Classic SIMD ISAs
handle mixed-width operations in one of five ways (would be glad to
add other known options to this list):

0) Single-width elements. All operations operate on same width, with
variety of load/stores widening/narrowing into/out of this element
size. For lower precision arithmetic, this wastes a lot of
register file capacity, regfile port bandwidth, and ALU throughput.

1) Specialized registers. Some SIMD machines have dedicated wider
accumulator registers, so datapath remains effectively bitsliced.
Many machines have dedicated predicate registers (treating
predicates as example of mixed-width operation).

2) Pack/unpack. Arithmetic instructions are all bitsliced, but with
separate bitcrossing data movement operations to pack/unpack
elements, e.g., register-register unpack will sign/zero-extend
top/bottom of a vector to yield destination vector of 2*source-EEW
elements. pack will do reverse from parts of two vectors. Unpacked
memory loads/stores similarly sign/zero-extend or truncate on way
in/out of memory. The pack/unpack instructions tend to imply
crossing the entire DPW, and also complicate software, which has to
unroll the loop into hi/lo portions and issue separate instructions for
each half.

3) Register pairs, where wider operand is created by pairing two
existing registers within EEW so avoiding bitcrossing. But this
splits a single element's storage across two architectural
registers, which doesn't support load/store to in-memory application
formats without pack/unpack bitcrossing operations or additional
load/store instructions (also effectively pack/unpack instructions).

4) EDIV-style, where mixed operations are handled by dividing an
element width into subelements and accumulating multiple subelements
into parent element size (e.g., 4 8b*8b multiplies accumulated into
32b accumulator). These provide mixed-width operations while
avoiding bitcrossing. However, they impose restrictions on
application input and/or output data layouts to achieve high
efficiency.
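To make option 4 concrete, here is a minimal sketch of the 4 8b*8b example on a single 32b element. The signed interpretation of the subelements and the wraparound on overflow are assumptions made for illustration, and the function names are hypothetical.

```python
def to_signed8(byte):
    """Interpret an 8-bit value as two's-complement signed."""
    return byte - 256 if byte >= 128 else byte


def ediv_dot_accumulate(acc32, a32, b32):
    """Multiply the four 8b subelements of a32 and b32 pairwise and
    accumulate the products into the 32b accumulator.

    All data stays inside one 32b element, so nothing crosses an
    element boundary. Signedness and 32-bit wraparound are assumptions.
    """
    total = acc32
    for k in range(4):
        a = to_signed8((a32 >> (8 * k)) & 0xFF)
        b = to_signed8((b32 >> (8 * k)) & 0xFF)
        total += a * b
    return total & 0xFFFFFFFF  # wrap to 32 bits
```

The layout restriction mentioned above shows up here as the requirement that the four 8b inputs to one accumulator already sit together in one 32b element.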

With RVV we are trying to support mixed-width operations without
adding specialized registers, or splitting an element across
architectural registers, or requiring implicit or explicit bitcrossing
beyond min(DPW,SLEN) on ALU operands (bitcrossing for memory
load/stores cannot be avoided). We're also trying to support vector
units with current implementation targets ranging from VLEN=32 to
VLEN=16384.

----------------------------------------------------------------------

The SLEN parameter allows implementations to optimize their wire
length. For narrower DPW (<=128b) regardless of VLEN, the SLEN=VLEN
layout where in-register format matches in-memory works OK as
bitcrossing cannot be outside DPW. For wider DPW (>128b), SLEN<VLEN
layouts minimize bitcrossing. When codes want to pun bytes between
different element widths, the SLEN<VLEN layout requires cast operations
that shuffle bytes around (these become simple register moves on
SLEN=VLEN machines).
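The punning problem can be sketched with a toy layout model. The round-robin striping used below (element i goes to section i mod VLEN/SLEN) is an assumption for illustration, not a statement of the draft spec's exact mapping, and the function names are hypothetical.

```python
def register_image(vlen_bits, slen_bits, eew_bits):
    """Map each register byte offset to the memory byte it holds after a
    unit-stride load of one full register.

    Assumed model: elements are dealt round-robin across VLEN/SLEN
    sections, and packed in order within each section.
    """
    vlen, slen, eew = vlen_bits // 8, slen_bits // 8, eew_bits // 8
    sections = vlen // slen
    image = [None] * vlen
    for m in range(vlen):  # m is the memory byte index
        elem, byte_in_elem = divmod(m, eew)
        section, slot = elem % sections, elem // sections
        image[section * slen + slot * eew + byte_in_elem] = m
    return image


def needs_cast(vlen_bits, slen_bits, eew_a, eew_b):
    """True if the same memory bytes land in different register bytes
    depending on which EEW was used to load them."""
    return (register_image(vlen_bits, slen_bits, eew_a)
            != register_image(vlen_bits, slen_bits, eew_b))
```

In this model, VLEN=256 with SLEN=256 gives the same register image for every EEW, so a pun is free. With SLEN=128 the 8b and 32b images differ, so reinterpreting the register needs a byte shuffle.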

I see two separate problems with the SLEN parameter.

1) SLEN=VLEN layout can be profitably exploited in some code,
encouraging programmers to ignore compatibility and drop cast
instructions.

2) Correct code cannot be migrated between machines that have the
same VLEN but different SLEN. I view this as a quite serious issue,
not just for migration but also for verification and any case where
we're working across two implementations.

We've worked through many alternatives, but at this point, I'm back to
proposing that SLEN=VLEN is an extension, and that this extension is
required for application processors (i.e., this is in "V"). The mode
bit idea (software indicates if SLEN=VLEN layout is needed) doesn't
solve thread migration, but we could add that to the SLEN<VLEN profile
somehow instead of casting, in a way such that SLEN=VLEN machines could
ignore it and would not have to implement the null cast instructions.

For DPW <=128b, this is simple to implement in all kinds of systems.

For wider datapaths DPW>=256b, the SLEN<VLEN layout would be
preferable.

For more complex microarchitectures that want wider DPW, SLEN=VLEN can
be the software view, with the internal layout hidden by the
microarchitectural tricks we previously discussed. Any access using
the wrong EEW can shuffle the bytes microarchitecturally. This does
add complexity, but assuming that casting is relatively rare, these
machines would benefit from the fact that the ISA does not require
bitcrossing for general mixed-width code (not just for EDIV). There is
clearly some complexity here, but I think the shuffling is thankfully
contained within a DPW-bit element group.

----------------------------------------------------------------------

A different direction would be to say that SLEN=VLEN layout is
mandatory but bitcrossing instructions are an extension. But we'd
still need to define something for mixed-width arithmetic (EDIV is
probably least objectionable choice out of above list of 0-4 options).

----------------------------------------------------------------------

Finally, I think for the crypto extensions there is actually no need
to limit ELEN. We can instead just limit bitcrossing arithmetic
instructions to ELEN<=128. ELEN >128 need only be supported by a few
operations such as crypto. We can use wider ELEN with EDIV, where
EDIV cuts size of sub-element to supported arithmetic EEW (e.g.,
ELEN=256 with EDIV=8 for 8*32b floats).


Krste
