Re: More thoughts on Git update (8a9fbce) Added fractional LMUL

David Horner

In trying to make SEW level interleave work by augmenting the instruction set (including casting),
I have a few observations.
- Arithmetic operators need to function at a given SEW; there is no in-memory form requirement for them.
- Exploitation of the in-memory form of SEW * n elements consists substantially (if not completely)
        of SEW level bit-wise operations (and/or/xor/move/load and shift) and
        SEW level masking (not SEW * n masking).

My postulates are:
- That use of the in-memory form by an algorithm can be identified, and provided only when needed.
- That the need for such instructions will be neither statically nor dynamically frequent in common code.
- That the gather of SEW level elements to build a SEW * n result is not prohibitively expensive, given the "only when needed" and infrequent aspects of the algorithm.

If these are true, then we can provide augmented forms of the bitwise and shift instructions that
         source a SEW level set of n consecutive elements, and
         if another vector source is needed, take either
             another such SEW level set of n consecutive elements or
             a SEW * n element, and
        store a SEW * n element with the operator applied by each SEW level element in turn, under the mask at the SEW level.
The total number of SEW elements processed is determined by vl.
Let's say the value in vl is required to be a multiple of n, for now.
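As a sketch of the intended semantics (the function name, the little-endian chunk ordering, and the argument shapes are my assumptions, not settled proposal text):

```python
def aug_bitwise(op, group, wide, sew, n):
    # Hypothetical model of the augmented bitwise/shift form:
    # 'group' holds n consecutive SEW-level source elements,
    # 'wide' is one SEW*n-wide element (the inmem2=1 case).
    # The operator is applied SEW chunk by SEW chunk; little-endian
    # chunk order within the wide element is an assumption.
    mask = (1 << sew) - 1
    result = 0
    for i in range(n):
        chunk = (wide >> (i * sew)) & mask
        result |= (op(chunk, group[i]) & mask) << (i * sew)
    return result

# e.g. XOR four bytes (SEW=8, n=4) against one 32-bit element:
r = aug_bitwise(lambda a, b: a ^ b, [0xFF, 0x00, 0xFF, 0x00], 0x12345678, 8, 4)
```

Nothing here requires crossing SEW lane boundaries except the final placement into the SEW * n destination, which is the point of the proposal.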

The two needed data elements are
-    n (the aggregate level for the target; let's call it inmemn)
             2 bits would appear necessary for XLEN=64,
                with derived values of 1 (standard operation), 2, 4, 8 (allowing byte to double).
                However, 3 bits would additionally allow factors of 16 through 128, which might be useful for encryption.

- a single bit indicating whether the second source is at level SEW or SEW * n (let's call it inmem2).
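A minimal sketch of how inmemn might decode (the field width and the power-of-two encoding are assumptions on my part):

```python
def decode_inmemn(field, bits=2):
    # 2-bit field -> n in {1, 2, 4, 8}: standard operation through
    # byte-to-double aggregation for XLEN=64.  With bits=3 the same
    # encoding reaches factors of 16 through 128.
    assert 0 <= field < (1 << bits)
    return 1 << field
```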

If these were incorporated into the bitwise/shift variants directly, the opcodes would grow by a minimum of 3 bits.
It would appear the vtype opcode compression method should be leveraged again.

These two "parameters", inmem2 and inmemn, could be included in vtype as a persistent modifier.
However, it is fully conceivable that most of the data massaging can be done at the SEW level, with only the last operation required to place the result in a SEW * n destination.
This would be a good justification to allow a transient form of the vmodinstr prefix (issue #423 - additional instructions to set vtype fields).

Note, neither of these changes the vl of these instructions. And further, the execution of these is expected to be infrequent (one of the postulates).
  Therefore this is a candidate for the alternate vmodtype instruction, rather than further vsetvli immediate bit use (also issue #423).

On 2020-04-27 3:56 p.m., Bill Huffman wrote:

On 4/27/20 12:32 PM, krste@... wrote:

On Mon, 27 Apr 2020 18:14:39 +0000, Bill Huffman <huffman@...> said:
| On 4/27/20 7:02 AM, Krste Asanovic wrote:
|| I created a github issue for this, #434 - text repeated below,
|| Krste
|| Should SLEN=VLEN be an extension?

| It might be the case that the machines where SLEN=VLEN would be the same
| machines where it would be attractive to use vectors for such code -
| machines where vectors provided larger registers and some parallelism
| rather than machines where vectors usually complete in one or a few
| cycles and wouldn't deal well with irregular operations. That probably
| increases the value of an extension.

I think having vectors complete in one or a few cycles (shallow
temporal) is orthogonal to choice of SLEN=VLEN.

I think SLEN=VLEN is simply about how wide you want interactions
between arithmetic units. I'm guessing e.g. 128-256b wide datapaths
are probably OK with SLEN=VLEN, whereas 512b and up datapaths are
probably starting to see issues, independent of VLEN in either case.
Sorry, I didn't say what I meant very well. I agree that it's the width
that matters. Machines with short vector registers are likely to be
SLEN=VLEN even if they complete quickly.

In my experience 256b width is shaky and may well want SLEN=128.

In any case, I'm wondering if having cast instructions is better than an
extension. I think it avoids the potential fragmentation.

| On the other hand, adding casting operations would seem to decrease the
| value of an extension (see below).

|| A second issue either way is whether we should add "cast"
|| operations. They are primarily useful for the SLEN<VLEN machines
|| though are difficult to implement efficiently there; the SLEN=VLEN
|| implementation is just a register-register copy. We could choose to
|| add the cast operations as another optional extension, which is my
|| preference at this time.

| Where SLEN<VLEN, cast operations might be implemented as vector register
| gather operations with element index values determined by SLEN, VLEN and
| SEW.

Agree this is a sensible implementation strategy, but pattern is
simpler than general vrgather, and can also implement as a store(src
SEW)+load(dest SEW) across memory crossbar given that you need to
materialize/parse in-memory formats there anyway.
I expect the SEW level sourcing of the augmented bitwise/shift instructions could readily use this path via a passthrough to the execution units.
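To make the gather pattern concrete, here is one plausible model of the SLEN striping and the index map a cast would need (the round-robin placement is my assumed layout model, not quoted spec text); note it degenerates to the identity when SLEN=VLEN, which is exactly the register-register copy case:

```python
def reg_index(i, vlen, slen, sew):
    # Assumed layout model: memory-order element i is placed
    # round-robin across the vlen // slen SLEN-wide partitions,
    # filling each partition in order.
    parts = vlen // slen
    per_part = slen // sew   # SEW elements each partition holds
    return (i % parts) * per_part + (i // parts)

# SLEN=VLEN: the map is the identity, so a cast is a plain move.
# SLEN<VLEN: a fixed permutation of element indices, implementable
# as a constrained vrgather (or the store+load path above).
```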

OK. That's also quite do-able. Physical layout and control issues
could make for either implementation, I think.

| But where SLEN=VLEN, they would be moves. If then, we add casts,
| would an SLEN=VLEN extension still be valuable?

Casting makes it possible to have a common interface, but given that
SLEN=VLEN will be common choice and it's easy for software to figure
this out, and there is a performance/complexity advantage to not using
the casts when SLEN=VLEN, I can't see mandating everyone use the
casting model as working in practice. Also, I don't believe casting
provides an efficient solution for all the use cases.

Now, a SLEN<VLEN machine could provide a configuration switch to turn
off all but the first SLEN partition (maybe what David was alluding to)
yes it was.
and then support the SLEN=VLEN extension albeit at reduced
Agreed. That's feasible. It might be set by vsetvl, but unchanged by
vsetvli, and implemented by reduction of VLMAX as you suggest. That
might be a reasonable tradeoff.
I don't expect purchasers would accept that their 4096-bit vector accelerator is brain damaged down to 256 bits by what are infrequent but "essential" in-memory mapped transforms.
The high end market would look elsewhere than RISC-V for such dumbed-down support.
If we are going to have an industry-bucking non-standard internal register format, but provide in-memory format support, it had better be proficient.

Maybe there's no cast and no extension. Only a bit that may reduce
performance, but makes SLEN=VLEN.


And an SLEN=VLEN machine could implement the cast extension to run
software that used those at no penalty.
Because the augmented instructions are essential to the performance of the algorithm, there is even less penalty for SLEN=VLEN implementations.
That is, no extraneous moves.
The bits in vtype become no-ops, and the prefix becomes a nop that a linker could remove.

