Re: Vector Task Group minutes 2020/5/15


David Horner
 

This is v0.8 with SLEN=8.



On Wed, May 27, 2020, 07:59 Grigorios Magklis, <grigorios.magklis@...> wrote:
Hi all,

I was wondering if the group has considered before (and rejected) the following
register layout proposal.

In this scheme, there is no SLEN parameter; instead, the layout is solely defined
by SEW and LMUL. For a given LMUL and SEW, the i-th element in the register
group starts at the n-th byte of the j-th register (both counted from 0), as follows:

  n = (i div LMUL)*SEW/8
  j = (i mod LMUL) when LMUL > 1, else j = 0

where 'div' is integer division, e.g., 7 div 4 = 1.
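
As a sanity check, the same mapping can be written as a few lines of Python
(the helper name element_location and the use of Fraction for fractional
LMUL are only for illustration; they are not part of the proposal):

    from fractions import Fraction

    def element_location(i, sew, lmul):
        # Byte offset n and register index j (within the group) of element
        # i, per the two expressions above.  sew is in bits; lmul may be a
        # Fraction such as Fraction(1, 2) for fractional LMUL.
        lmul = Fraction(lmul)
        n = int(i // lmul) * sew // 8           # (i div LMUL) * SEW/8
        j = int(i % lmul) if lmul > 1 else 0    # (i mod LMUL) when LMUL > 1
        return n, j

    # VLEN=128b, SEW=8b, LMUL=2: element 3 lands at byte 1 of the second
    # register of the group, matching the table below.
    assert element_location(3, 8, 2) == (1, 1)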

As shown in the examples below, with this scheme, when LMUL=1 the register
layout is the same as the memory layout (similar to SLEN=VLEN), when LMUL>1
elements are allocated "vertically" across the register group (similar to
SLEN=SEW), and when LMUL<1 elements are evenly spaced out across the register
(similar to SLEN=SEW/LMUL):

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


VLEN=128b, SEW=8b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[4*n]   3C 38 34 30 2C 28 24 20 1C 18 14 10  C  8  4  0
v[4*n+1] 3D 39 35 31 2D 29 25 21 1D 19 15 11  D  9  5  1
v[4*n+2] 3E 3A 36 32 2E 2A 26 22 1E 1A 16 12  E  A  6  2
v[4*n+3] 3F 3B 37 33 2F 2B 27 23 1F 1B 17 13  F  B  7  3

VLEN=128b, SEW=16b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]    1C  18  14  10   C   8   4   0
v[4*n+1]  1D  19  15  11   D   9   5   1
v[4*n+2]  1E  1A  16  12   E   A   6   2
v[4*n+3]  1F  1B  17  13   F   B   7   3

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]         C       8       4       0
v[4*n+1]       D       9       5       1
v[4*n+2]       E       A       6       2
v[4*n+3]       F       B       7       3


VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       1       -       0


VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   -   -   1   -   -   -   0

VLEN=128b, SEW=32b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           -       -       -       0
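
These tables can be generated mechanically from the two expressions; a rough
illustrative generator (building on the element_location sketch above, and
again not part of the proposal):

    def print_layout(vlen, sew, lmul):
        # For each register of the group, print which element index begins
        # at each byte ('-' for the remaining bytes), most significant byte
        # on the left, in the spirit of the tables above.
        lmul = Fraction(lmul)
        nregs = max(int(lmul), 1)
        nelems = int(Fraction(vlen, sew) * lmul)    # VLEN/SEW * LMUL
        rows = [['-'] * (vlen // 8) for _ in range(nregs)]
        for i in range(nelems):
            n, j = element_location(i, sew, lmul)
            rows[j][n] = format(i, 'X')
        for j, row in enumerate(rows):
            print('reg', j, ' '.join(reversed(row)))

    print_layout(128, 16, 2)   # element placement of the SEW=16b, LMUL=2 table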

The benefit of this scheme is that software always knows the layout of the
elements in the register just from SEW and LMUL, as the layout is now the same
for all implementations, and can thus optimize code accordingly if it so wishes.
Also, as long as we keep SEW/LMUL constant, mixed-width operations always stay
in-lane, i.e., inside their max(src_EEW, dst_EEW) <= ELEN container, as shown
below.

SEW/LMUL=32:

VLEN=128b, SEW=8b, LMUL=1/4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - - - 3 - - - 2 - - - 1 - - - 0

VLEN=128b, SEW=16b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       -   3   -   2   -   1   -   0

VLEN=128b, SEW=32b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]           3       2       1       0


SEW/LMUL=16:

VLEN=128b, SEW=8b, LMUL=1/2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     - 7 - 6 - 5 - 4 - 3 - 2 - 1 - 0

VLEN=128b, SEW=16b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]       7   6   5   4   3   2   1   0

VLEN=128b, SEW=32b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]         6       4       2       0
v[2*n+1]       7       5       3       1


SEW/LMUL=8:

VLEN=128b, SEW=8b, LMUL=1
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[n]     F E D C B A 9 8 7 6 5 4 3 2 1 0

VLEN=128b, SEW=16b, LMUL=2
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[2*n]     E   C   A   8   6   4   2   0
v[2*n+1]   F   D   B   9   7   5   3   1

VLEN=128b, SEW=32b, LMUL=4
Byte     F E D C B A 9 8 7 6 5 4 3 2 1 0
v[4*n]         C       8       4       0
v[4*n+1]       D       9       5       1
v[4*n+2]       E       A       6       2
v[4*n+3]       F       B       7       3


SEW/LMUL=4:

VLEN=128b, SEW=8b, LMUL=2
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[2*n]   1E 1C 1A 18 16 14 12 10  E  C  A  8  6  4  2  0
v[2*n+1] 1F 1D 1B 19 17 15 13 11  F  D  B  9  7  5  3  1

VLEN=128b, SEW=16b, LMUL=4
Byte      F  E  D  C  B  A  9  8  7  6  5  4  3  2  1  0
v[4*n]      1C    18    14    10     C     8     4     0
v[4*n+1]    1D    19    15    11     D     9     5     1
v[4*n+2]    1E    1A    16    12     E     A     6     2
v[4*n+3]    1F    1B    17    13     F     B     7     3
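
The in-lane claim can be checked directly from the same expressions (an
illustrative check, assuming ELEN=32b and using element_location from above):
for a fixed SEW/LMUL ratio, element i always falls at the same ELEN-wide byte
position of whichever register holds it, regardless of SEW.

    ELEN = 32
    for ratio in (32, 16, 8, 4):                # SEW/LMUL, in bits
        for sew in (8, 16, 32):
            lmul = Fraction(sew, ratio)
            for i in range(4):
                n, _ = element_location(i, sew, lmul)
                # The container position within the register depends only
                # on the ratio, never on SEW, so mixed-width operations
                # stay inside their ELEN-wide container.
                assert n // (ELEN // 8) == (i * ratio) // ELEN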

Additionally, because the in-register layout is the same as the in-memory
layout when LMUL=1, there is no need to shuffle bytes when moving data in and
out of memory, which may allow the implementation to optimize this case (and
for software to eliminate any typecast instructions). When LMUL>1 or LMUL<1,
loads and stores need to shuffle bytes around, but I think the cost of this is
similar to what v0.9 requires with SLEN<VLEN.
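
To make the shuffle concrete, the byte permutation a unit-stride load would
have to perform under this proposal can be sketched as follows (illustrative
only, reusing element_location from above):

    def memory_byte_to_reg_byte(b, sew, lmul):
        # Destination of memory byte b of a unit-stride load: the register
        # index within the group and the byte offset in that register.
        # For LMUL=1 this is the identity mapping (no shuffle); for LMUL>1
        # or LMUL<1 it is the shuffle the load/store path has to perform.
        ew_bytes = sew // 8
        i, k = divmod(b, ew_bytes)       # element index, byte within element
        n, j = element_location(i, sew, lmul)
        return j, n + k

    # VLEN=128b, SEW=16b, LMUL=2: memory byte 3 (the upper byte of element
    # 1) lands at byte 1 of the second register of the group.
    assert memory_byte_to_reg_byte(3, 16, 2) == (1, 1)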

So, I think this scheme has the same benefits as, and similar implementation
costs to, the v0.9 SLEN=VLEN scheme for machines with short vectors (except
that SLEN=VLEN does not need to space out elements when loading/storing with
LMUL<1).  It also has similar implementation costs to the v0.9 SLEN<VLEN
scheme for machines with long vectors (which is the whole point of avoiding
SLEN=VLEN), but with the benefit that the layout is architected and we do not
introduce fragmentation into the ecosystem.

Thanks,
Grigorios Magklis

On 27 May 2020, at 01:35, Bill Huffman <huffman@...> wrote:

I appreciate your agreement with my analysis, Krste.  However, I wasn't 
drawing a conclusion.  I lean toward the conclusion that we keep the 
"new v0.9 scheme" below and live with casts.  But I wasn't fully sure 
and wanted to see where the discussion might go.  I suspect each of the 
extra gates for memory access and the slower speed of short vectors is 
sufficient by itself to argue pretty strongly against the "v0.8 scheme" 
- which is the only one I can see that might have the desired properties.

I also agree that I don't think it's possible to find "a design where 
bytes are contiguous within ELEN."  I worked out what I think are the 
outlines of a proof that it's not possible, but I thought I'd suggest 
what I did at a high level first and only try to make the proof more 
rigorous if necessary.

      Bill

On 5/26/20 1:37 AM, Krste Asanovic wrote:

I think Bill is right in his analysis, but I disagree with his
conclusion.

The scheme Bill is describing is basically the v0.8 scheme:

  Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n           13      12      11      10       3       2       1       0
   v4*n+1         17      16      15      14       7       6       5       4
   v4*n+2         1B      1A      19      18       B       A       9       8
   v4*n+3         1F      1E      1D      1C       F       E       D       C

Some background.  We adopted this scheme in Hwacha, which had decoupled
lanes, where each SLEN partition could run independently at different
decoupled rates.  That meant it was difficult to load a unit-stride
vector and share the memory access across lanes without lots of
complex buffering, so we packed contiguous bytes into each lane.
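
For concreteness, the placement in the example above follows roughly the rule
below (a reconstruction from the table, written in Python only as shorthand,
not spec text): elements fill SLEN partition 0 of every register of the group
first, then partition 1, with bytes contiguous inside a partition.

    def v08_location(i, sew, lmul, slen=128):
        # Register within the group and starting byte of element i for the
        # v0.8-style example above (VLEN=256b, SLEN=128b, SEW=32b, LMUL=4).
        epp = slen // sew                  # elements per SLEN partition
        reg = (i // epp) % lmul            # register within the group
        part = i // (epp * lmul)           # which SLEN partition
        byte = part * (slen // 8) + (i % epp) * (sew // 8)
        return reg, byte

    # Element 0x10 starts at byte 0x10 of v4*n, as in the table above, and
    # elements 0..0xF all land in partition 0 (the lower half of the datapath).
    assert v08_location(0x10, 32, 4) == (0, 0x10)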

The "new" v0.9 scheme:

Eg: VLEN=256b, SLEN=128b, SEW=32b, LMUL=4

   Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
   v4*n            7       5       3       1       6       4       2       0
   v4*n+1          F       D       B       9       E       C       A       8
   v4*n+2         17      15      13      11      16      14      12      10
   v4*n+3         1F      1D      1B      19      1E      1C      1A      18

is actually a reversion to an older layout (Figure 7.2 in my
thesis).  This has the property that it enables simpler synchronous lanes
to stay maximally busy with a given application vector length.

My concern is that Bill's point #2 doesn't just apply to memory, but
also to arithmetic units.  For example, in the v0.8 layout above, if
AVL is less than 17, only half the datapath is used.  This is why I
don't agree with Bill's conclusion.  I think attaining high throughput
on shorter application vectors is going to be more important than
avoiding the cast operations, as I believe shorter vectors are going
to be much more common than casting.  The v0.9 layout is also simpler
for hardware design.

The cast operations can all be single-cycle occupancy and fully
pipelined/chained, as they all just rearrange bytes across one element
group.

----------------------------------------------------------------------

I think David is trying to find a design where bytes are contiguous
within ELEN (or some other unit < ELEN) but then striped above that to
avoid casting.  I don't think this can work.

First, SLEN has to be big enough to hold ELEN/8 words of ELEN bits each.
E.g., when ELEN=32, you can pack four contiguous bytes in ELEN, but then
require SLEN to have space for four ELEN words to avoid either wires
crossing SLEN partitions, or requiring multiple cycles to compute
small vectors (v0.8 design).

VLEN=256b, ELEN=32b, SLEN=(ELEN**2)/8=128b
Byte     1F1E1D1C1B1A19181716151413121110 F E D C B A 9 8 7 6 5 4 3 2 1 0
                                  7 6 5 4                         3 2 1 0 SEW=8b
                7       6       5       4       3       2       1       0 SEW=ELEN=32b

When ELEN=64, you'd need SLEN=512, which is too big.
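
In other words, the constraint is SLEN >= ELEN*ELEN/8 (a trivial check of the
arithmetic):

    for elen in (32, 64):
        print(elen, elen * elen // 8)    # -> SLEN of 128 and 512 bits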

Also, now that I draw it out, I don't actually see how even this can
work, given that bytes are in memory order within a word, but now
words are still scrambled relative to memory order (e.g., doing a byte
load wouldn't let you cast to words in memory order).

A random thought while thinking through this is that there is little
incentive to make SLEN!=ELEN when SLEN<VLEN, which might cut down on
variants (although someone might want to support various ELEN options
with a single-lane design, I guess).

----------------------------------------------------------------------

Roger asked about a microarchitectural solution to hide the difference
between SLEN<VLEN and SLEN=VLEN machines.  I think this is viable in a
complex microarch.  Basically, the microarchitecture tags the vector
register with the EEW used to write it.  Reading the register with a
different EEW requires inserting microops to cast bytes on the fly -
this can be done cycle-by-cycle.  Undisturbed writes to the register
with a different EEW will also require an on-the-fly cast for the
undisturbed elements, with a read-modify-write if not already being
renamed. Some issues are that if you keep reading with the wrong EEW
you'll generate a lot of additional uops, and ideally you'd want to
eventually rewrite the register in that EEW and avoid the wire
crossings (sounds like an ISCA paper...).
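
A rough sketch of the bookkeeping (purely illustrative; all names are
invented and nothing here is from the spec):

    class VRegFile:
        # Tag each architectural vector register with the EEW it was last
        # written with; a read, or a tail-undisturbed write, under a
        # different EEW inserts a cast uop that re-lays-out the bytes.
        def __init__(self, nregs=32):
            self.written_eew = [None] * nregs

        def on_read(self, vreg, eew, uops):
            tag = self.written_eew[vreg]
            if tag is not None and tag != eew:
                uops.append(('cast', vreg, tag, eew))   # on-the-fly relayout

        def on_write(self, vreg, eew, undisturbed, uops):
            tag = self.written_eew[vreg]
            if undisturbed and tag is not None and tag != eew:
                # undisturbed bytes must be recast too, so this becomes a
                # read-modify-write if the register is not already renamed
                uops.append(('cast', vreg, tag, eew))
            self.written_eew[vreg] = eew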

----------------------------------------------------------------------

I think EDIV might actually suffice for some common use cases without
needing casting.  The widest unit can be loaded as an element with
contiguous memory bytes, which is then subdivided into smaller pieces
for processing.  This might be what David is referring to as
clustering.

----------------------------------------------------------------------

Compared to other SIMD architectures, the V extension gives up on
memory format in registers in exchange for avoiding cross-datapath
interactions for mixed-precision loops. The other architectures
require explicit widening of the bottom half to a full vector, which implies
cross-datapath communication and also more complex code to handle the upper and
lower halves of a vector separately, and even more complexity if there are more
than 2X widths in a loop.

Krste


On Wed, 20 May 2020 07:42:00 -0400, "David Horner" <ds2horner@...> said:

| that may be


| On 2020-05-19 7:14 p.m., Bill Huffman wrote:
|| I believe it is provably not possible for our vectors to have more than
|| two of the following properties:
| For me the definitions contained in 1, 2 and 3 need to be more rigorously
| defined before I can agree that the constraints/behaviours described
| are provably inconsistent in aggregate.
||
|| 1. The datapath can be sliced into multiple slices to improve wiring
|| such that corresponding elements of different sizes reside in the
|| same slice.
|| 2. Memory accesses containing enough contiguous bytes to fill the
|| datapath width corresponding to one vector can spread evenly
|| across the slices when loading or storing a vector group of two,
|| four, or eight vector registers.
| This one is particularly difficult for me to formalize.
| When vl = vlen * lmul (for lmul of 2, 4 or 8), cache lines can be
| requested in an order such that, when they arrive, corresponding segments
| can be filled.
| So, I'm not sure if the focus here is an efficiency concern?
|| 3. The operation corresponding to storing a register group at one
|| element width and loading back the same number of bytes into a
|| register group of the same size but with a different element width
|| results in exactly the same register position for all bytes.
| What we can definitely prove is that a specific design has specific
| characteristics and eliminates other characteristics.
| I agree that the current design has the characteristics you describe.

| However, for #3, it appears to me that a facility that clusters elements
| smaller than a certain size still allows behaviours 1 and 2.
| Further, for element lengths up to that cluster size, in-register order
| matches the in-memory order.
||
|| The SLEN solution we've had for some time allows for #1 and #2.  We're
|| discussing requiring "cast" operations in place of having property #3.
||
|| I wonder whether we should look again at giving up property #2 instead.
| I also agree with reconsidering #2.
|| It would cost additional logic in wide, sliced datapaths to keep up
|| memory bandwidth.
| Here, I believe, is where you introduce implementation efficiency concerns.
| Once implementation design considerations are introduced, the proof
| becomes much more complex: it is compounded by objectives and technical
| tradeoffs, and becomes less a matter of mathematical rigor.

|| But the damage might be less than requiring casts and
|| the potential of splitting the ecosystem?
| I also agree with you that reconsidering #2 can lead to conceptually
| simpler designs that perhaps will result in less ecosystem fragmentation.
| However, anticipating the community's response to even the smallest of
| changes is crystal-ball material.

| There are many variations and approaches still open to us to address
| in-register and in-memory order agreement, and to address widening
| approaches (in particular, interleaving or striping with generalized
| SLEN parameters).

| I'm still waiting on the proposed casting details. If that resolves all
| our concerns, great.


| In the interim I believe it may be a worthwhile exercise to consider
| equivalences of functionality.

| Specifically, vertical striping vs horizontal interleave for widening
| ops, in-register vs in-memory order for element width alignment.
| I hope that the more we identify the easier it will be to compare them
| and evaluate trade-offs.

| I also think it constructive to consider big-endian vs little-endian,
| with concerns about granularity (inherent in big-endian and obscured
| with little-endian; aligned vs unaligned is still relevant).



||
|| Bill
||
|| On 5/15/20 11:55 AM, Krste Asanovic wrote:
|||
||| Date: 2020/5/15
||| Task Group: Vector Extension
||| Chair: Krste Asanovic
||| Co-Chair: Roger Espasa
||| Number of Attendees: ~20
||| Current issues on github: https://github.com/riscv/riscv-v-spec
|||
||| Issues discussed:
|||
||| # MLEN=1 change
|||
||| The new layout of mask registers with fixed MLEN=1 was discussed.  The
||| group was generally in favor of the change, though there is a proposal
||| in flight to rearrange bits to align with bytes.  This might save some
||| wiring but could increase bits read/written for the mask in a
||| microarchitecture.
|||
||| #434 SLEN=VLEN as optional extension
|||
||| Most of the time was spent discussing the possible software
||| fragmentation from having code optimized for SLEN=VLEN versus
||| SLEN<VLEN, and how to avoid it.  The group was keen to prevent possible
||| fragmentation, so is going to consider several options:
|||
||| - providing cast instructions that are mandatory, so at least
||| SLEN<VLEN code runs correctly on SLEN=VLEN machines.
|||
||| - considering a different data layout that could allow casting up to ELEN
||| (<=SLEN); however, these appear to result in an even greater variety of
||| layouts, or dynamic layouts
|||
||| - inventing a microarchitecture that can appear as SLEN=VLEN but
||| internally restricts datapath communication to within the SLEN width of the
||| datapath, or proving this is impossible/expensive
|||
|||
||| # v0.9
|||
||| The group agreed to declare the current version of the spec as 0.9,
||| representing a clear stable step for software and implementors.
|||

