Re: Vector Task Group minutes 2020/5/15

David Horner

On Tue, May 26, 2020, 04:38 , <krste@...> wrote:



I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting. 
I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 words of ELEN bits each (i.e., SLEN >= ELEN*ELEN/8).
I don't understand the reason for this constraint.
When ELEN=32, you can pack four contiguous bytes in an ELEN-wide word, but
then SLEN must have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).
Still not got it.

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8,
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b
CLSTR is not a count but a size.
When CLSTR is 32 this last row is
                7       5       3       1       6       4       2       0 SEW=ELEN=32b
If I understood your diagram correctly.
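As a sanity check on my reading of the diagram, here is a small sketch (illustrative only; `byte_of`, `layout`, and the `clstr_bytes` parameter are my names, not spec terms) of the clustered-striping idea: bytes stay contiguous within a CLSTR-sized cluster, and clusters are striped round-robin across the SLEN-wide partitions.

```python
def byte_of(elt_byte, vlen_bytes=32, slen_bytes=16, clstr_bytes=4):
    """Map a logical (element-order) byte offset to a physical register byte
    under clustered striping: contiguous within a cluster, clusters striped
    round-robin across SLEN partitions."""
    cluster, off = divmod(elt_byte, clstr_bytes)
    nparts = vlen_bytes // slen_bytes           # SLEN partitions per register
    part, slot = cluster % nparts, cluster // nparts
    return part * slen_bytes + slot * clstr_bytes + off

def layout(sew_bytes, **kw):
    """First physical byte of each element, listed in element order."""
    vlen_bytes = kw.get("vlen_bytes", 32)
    return [byte_of(i * sew_bytes, **kw) for i in range(vlen_bytes // sew_bytes)]
```

With VLEN=256b, SLEN=128b: `layout(4, clstr_bytes=4)` puts word element 1 at byte 0x10, reproducing the "7 5 3 1 / 6 4 2 0" row above, while `layout(4, clstr_bytes=16)` degenerates to the contiguous SEW=ELEN row, and `layout(1, clstr_bytes=4)` matches the SEW=8b row (elements 4..7 at bytes 0x10..0x13).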

See #461. It is effectively what SLEN was under v0.8. But potentially configurable.

I'm doing this from my cell phone. I'll work it up better when I'm at my laptop.

When ELEN=64, you'd need SLEN=512, which is too big.

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with a single-lane design, I guess).


Roger asked about a microarchitectural solution to hide difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally would want to
eventually rewrite the register to that EEW and avoid the wire
crossings (sounds like an ISCA paper...)
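A toy model of that idea (my sketch, not any real microarchitecture; all names and the rewrite threshold are invented for illustration): tag each vector register with the EEW that last wrote it, inject a cast micro-op on each mismatched read, and heuristically rewrite the register in the new layout once mismatches keep recurring.

```python
class TaggedVRF:
    """Toy EEW-tagged vector register file, tracking injected cast uops."""

    def __init__(self, nregs=32, rewrite_threshold=3):
        self.eew_tag = [None] * nregs   # EEW used by the last write, if any
        self.mismatches = [0] * nregs   # mismatched reads since last rewrite
        self.cast_uops = 0              # extra micro-ops injected so far
        self.rewrite_threshold = rewrite_threshold

    def write(self, vd, eew):
        self.eew_tag[vd] = eew
        self.mismatches[vd] = 0

    def read(self, vs, eew):
        tag = self.eew_tag[vs]
        if tag is not None and tag != eew:
            self.cast_uops += 1         # on-the-fly byte cast, cycle by cycle
            self.mismatches[vs] += 1
            if self.mismatches[vs] >= self.rewrite_threshold:
                # Heuristic: rewrite the register in the new EEW layout so
                # later reads stop paying for the wire crossing.
                self.write(vs, eew)
        return self.eew_tag[vs]
```

Reading v2 three times at the "wrong" EEW costs three cast uops before the model gives up and retags the register; the interesting open question is exactly the one above, i.e. when the rewrite pays for itself.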

The problem is determining when you would need to insert these micro-ops. That's what I had proposed the flagging solution for, in response to Bill. Again, more when I'm at my laptop.


I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
It is an approach that incorporates the clustering-of-consecutive-bytes concept but is more limiting.


Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of bottom half to full vector, which implies
cross-datapath communication and also more complex code to handle upper/lower
halves of vector separately, and even more complexity if there are more
than 2X widths in a loop.
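A toy illustration of that bookkeeping cost (not tied to any one ISA; both function names are invented): with fixed-width vectors, widening splits one logical element stream into `ratio` separate sub-vectors that software must track and process separately, and the count multiplies with each further doubling of width.

```python
def fixed_simd_subvectors(sew_in_bits, sew_out_bits):
    """How many separate full-width vectors hold the widened results of one
    source vector on a fixed-width SIMD machine."""
    return sew_out_bits // sew_in_bits

def widen_by_unpacking(src, ratio):
    """Split src into `ratio` chunks, mimicking repeated unpack-lo/unpack-hi
    steps; each chunk becomes its own full-width vector of wider elements
    that the loop body must handle with its own instruction sequence."""
    n = len(src) // ratio
    return [src[k * n:(k + 1) * n] for k in range(ratio)]
```

Widening 8b to 32b quadruples the number of live vectors per input, which is the ">2X widths in a loop" complexity above; a register-group widening op keeps the stream whole instead.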


>>>>> On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@...> said:

| that may be

| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me, the definitions in 1, 2 and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours described
| are provably inconsistent in aggregate.
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = VLEN * LMUL (for LMUL of 2, 4 or 8), cache lines can be
| requested in an order such that, when they arrive, the corresponding
| segments can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, it appears to me that a facility that clusters elements
| smaller than a certain size still allows behaviours 1 and 2.
| Further, for element widths up to that cluster size, in-register order
| matches in-memory order.
|| The SLEN solution we've had for some time allows for #1 and #2.  We're
|| discussing requiring "cast" operations in place of having property #3.
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree with reconsidering #2.
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here, I believe, is where you introduce implementation efficiency.
| Once implementation design considerations are introduced, the proof
| becomes much more complex: it is compounded by objectives and technical
| trade-offs, and becomes less a matter of mathematical rigour.

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that may result in less ecosystem fragmentation.
| However, anticipating a community's response to even the smallest of
| changes is crystal-ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.

| In the interim, I believe it may be a worthwhile exercise to consider
| equivalences of functionality.

| Specifically, vertical striping vs horizontal interleave for widening
| ops, and in-register vs in-memory order for element-width alignment.
| I hope that the more we identify the easier it will be to compare them
| and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian,
| with concerns about granularity (inherent in big-endian, obscured with
| little-endian; aligned vs unaligned is still relevant).

|| Bill
|| On 5/15/20 11:55 AM, Krste Asanovic wrote:
||| Date: 2020/5/15
||| Task Group: Vector Extension
||| Chair: Krste Asanovic
||| Co-Chair: Roger Espasa
||| Number of Attendees: ~20
||| Current issues on github:
||| Issues discussed:
||| # MLEN=1 change
||| The new layout of mask registers with fixed MLEN=1 was discussed.  The
||| group was generally in favor of the change, though there is a proposal
||| in flight to rearrange bits to align with bytes.  This might save some
||| wiring but could increase bits read/written for the mask in a
||| microarchitecture.
||| #434 SLEN=VLEN as optional extension
||| Most of the time was spent discussing the possible software
||| fragmentation from having code optimized for SLEN=VLEN versus
||| SLEN<VLEN, and how to avoid it.  The group was keen to prevent possible
||| fragmentation, so is going to consider several options:
||| - providing cast instructions that are mandatory, so at least
||| SLEN<VLEN code runs correctly on SLEN=VLEN machines.
||| - consider a different data layout that could allow casting up to ELEN
||| (<=SLEN), however these appear to result in even greater variety of
||| layouts or dynamic layouts
||| - invent a microarchitecture that can appear as SLEN=VLEN but
||| internally restrict datapath communication within SLEN width of
||| datapath, or prove this is impossible/expensive
||| # v0.9
||| The group agreed to declare the current version of the spec as 0.9,
||| representing a clear stable step for software and implementors.
