Re: More thoughts on Git update (8a9fbce) Added fractional LMUL

David Horner

On 2020-04-27 3:56 p.m., Bill Huffman wrote:

On 4/27/20 12:32 PM, krste@... wrote:

On Mon, 27 Apr 2020 18:14:39 +0000, Bill Huffman <huffman@...> said:
| On 4/27/20 7:02 AM, Krste Asanovic wrote:
|| I created a github issue for this, #434 - text repeated below,
|| Krste
|| Should SLEN=VLEN be an extension?

| It might be the case that the machines where SLEN=VLEN would be the same
| machines where it would be attractive to use vectors for such code -
| machines where vectors provided larger registers and some parallelism
| rather than machines where vectors usually complete in one or a few
| cycles and wouldn't deal well with irregular operations.  That probably
| increases the value of an extension.

I think having vectors complete in one or a few cycles (shallow
temporal) is orthogonal to choice of SLEN=VLEN.

I think SLEN=VLEN is simply about how wide you want interactions
between arithmetic units.  I'm guessing e.g. 128-256b wide datapaths
are probably OK with SLEN=VLEN, whereas 512b and up datapaths are
probably starting to see issues, independent of VLEN in either case.

Sorry, I didn't say what I meant very well.  I agree that it's the width
that matters.  Machines with short vector registers are likely to be 
SLEN=VLEN even if they complete quickly.

In my experience 256b width is shaky and may well want SLEN=128.

In any case, I'm wondering if having cast instructions is better than an 
extension.  I think it avoids the potential fragmentation.

| On the other hand, adding casting operations would seem to decrease the
| value of an extension (see below).

|| A second issue either way is whether we should add "cast"
|| operations. They are primarily useful for the SLEN<VLEN machines
|| though are difficult to implement efficiently there; the SLEN=VLEN
|| implementation is just a register-register copy. We could choose to
|| add the cast operations as another optional extension, which is my
|| preference at this time.

| Where SLEN<VLEN, cast operations might be implemented as vector register
| gather operations with element index values determined by SLEN, VLEN and
| SEW.

Agree this is a sensible implementation strategy, but the pattern is
simpler than a general vrgather, and it can also be implemented as a
store (src SEW) + load (dest SEW) across the memory crossbar, given that
you need to materialize/parse in-memory formats there anyway.

OK.  That's also quite do-able.  Physical layout and control issues
could make for either implementation, I think.
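
For illustration, the gather-style cast discussed above can be sketched as a byte permutation. This is a minimal sketch, not text from the thread or the spec: it assumes the striped layout in which element i of a single register sits in SLEN section i mod (VLEN/SLEN) at slot i div (VLEN/SLEN), it only models SEW <= SLEN and LMUL=1, and the names phys_byte and cast_gather_indices are hypothetical.

```python
def phys_byte(m, sew_bits, vlen_bits, slen_bits):
    """Physical byte position of in-memory byte m within one register.

    Assumed striping model (not taken from the thread): element i lives in
    section i % n at slot i // n, with n = VLEN/SLEN. Requires SEW <= SLEN.
    """
    sew = sew_bits // 8
    slen = slen_bits // 8
    n = vlen_bits // slen_bits            # number of SLEN sections
    elem, byte_in_elem = divmod(m, sew)   # which element, which byte of it
    section, slot = elem % n, elem // n
    return section * slen + slot * sew + byte_in_elem


def cast_gather_indices(src_sew, dst_sew, vlen_bits, slen_bits):
    """gather[d] is the physical source byte that must land in physical byte d
    so that the in-memory view of the register is unchanged by the cast."""
    nbytes = vlen_bits // 8
    gather = [0] * nbytes
    for m in range(nbytes):
        dst = phys_byte(m, dst_sew, vlen_bits, slen_bits)
        gather[dst] = phys_byte(m, src_sew, vlen_bits, slen_bits)
    return gather


# SLEN < VLEN: the cast is a fixed permutation that depends only on
# SLEN, VLEN and the two SEW values.
perm = cast_gather_indices(8, 32, vlen_bits=256, slen_bits=128)

# SLEN = VLEN: the same cast degenerates to a plain register copy.
copy = cast_gather_indices(8, 32, vlen_bits=256, slen_bits=256)
```

Under these assumptions the index pattern is fixed once SLEN, VLEN and the SEW pair are known, which is why it is simpler than a general vrgather, and with SLEN=VLEN the permutation is the identity.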

| But where SLEN=VLEN, they would be moves.  If then, we add casts,
| would an SLEN=VLEN extension still be valuable?

Casting makes it possible to have a common interface, but given that
SLEN=VLEN will be common choice and it's easy for software to figure
this out, and there is a performance/complexity advantage to not using
the casts when SLEN=VLEN, I can't see mandating everyone use the
casting model as working in practice.  Also, I don't believe casting
provides an efficient solution for all the use cases.

Now, a SLEN<VLEN machine could provide a configuration switch to turn
off all but the first SLEN partition (maybe what David was alluding to)
and then support the SLEN=VLEN extension albeit at reduced
Agreed.  That's feasible.  It might be set by vsetvl, but unchanged by 
Perhaps you would be willing to comment on #410 Place stabler RVV control fields in bits [30:12] of vtype.
and implemented by reduction of VLMAX as you suggest.  That 
might be a reasonable tradeoff.
reduction of VLMAX is not sufficient.
Within each SLEN chunk the existing data will already be "scrambled".
It would be possible to load SEW=SLEN data (or load whole register) to prep the data, avoiding scrambling.
But otherwise, the new _source_ data will need to be loaded under the new mode.
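
A small sketch of the "scrambling" point, under the same kind of assumption as the striping model of the draft layout (element i in SLEN section i mod (VLEN/SLEN)); that model is an assumption of this example, not something stated in this message, and the helper name section_elements is hypothetical. It lists which element indices already sit in a given SLEN section of one register.

```python
def section_elements(sew_bits, vlen_bits, slen_bits, section=0):
    """Element indices held in one SLEN section of a single register.

    Assumed striping model: element i is placed in section i % n,
    where n = VLEN/SLEN. Only SEW <= SLEN is modelled.
    """
    n = vlen_bits // slen_bits
    num_elems = vlen_bits // sew_bits
    return [i for i in range(num_elems) if i % n == section]


# SEW=32 data loaded while SLEN=128 < VLEN=256: the first section holds
# every other element, not elements 0..3.
striped = section_elements(32, vlen_bits=256, slen_bits=128)

# SEW=SLEN data: each section holds exactly one element, in memory order.
whole = [section_elements(128, 256, 128, section=s) for s in range(2)]
```

With these parameters the first section holds elements 0, 2, 4 and 6, so shrinking VLMAX to the first SLEN section would not expose elements 0 to 3 of data loaded earlier, whereas data loaded with SEW=SLEN is laid out in memory order.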

And it does not address register groups (of more than 1 physical register).
To both limit to one SLEN group AND reduce register groups to a single physical register is a double whammy.

Maybe there's no cast and no extension.  Only a bit that may reduce 
performance, but makes SLEN=VLEN.


And an SLEN=VLEN machine could implement the cast extension to run
software that used those at no penalty.

