Re: Thoughts for Vector TG Meeting Friday June 12

Bill Huffman

On 6/11/20 4:57 PM, David Horner wrote:

There hasn't been any traffic on this feed about vector layout, and
specifically about v0.9 SLEN size standards.
    So I will add my thoughts here with the hopes that I will not take
excessive time during the meeting.

A) I concur with the v0.9 approach for in register layout.

B) Specifically I consider VLEN=SLEN as the default layout
    It is quite appropriate for the minimalist implementations with
VLEN<128 and reasonable for 128 and 256.

C) The SLEN<VLEN mapping
            (successive elements are round-robin allocated to
successive SLEN chunks)
             is a reasonable extension to the default layout.
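
As a rough illustration of that round-robin mapping (a sketch only, not text from the spec): assuming LMUL=1, SEW<=SLEN, and SLEN dividing VLEN, the bit offset of element i within a register could be computed as below. The helper name element_bit_offset is made up for this example.

```python
def element_bit_offset(i, vlen, slen, sew):
    """Bit offset of element i within one vector register.

    Assumptions for this sketch (not taken from the spec text):
    LMUL=1, SEW <= SLEN, and SLEN divides VLEN.
    """
    if vlen % slen != 0 or sew > slen:
        raise ValueError("sketch only covers SLEN dividing VLEN and SEW <= SLEN")
    if not 0 <= i < vlen // sew:
        raise IndexError("element index out of range")
    n_chunks = vlen // slen   # number of SLEN-wide chunks in the register
    chunk = i % n_chunks      # successive elements go to successive chunks
    slot = i // n_chunks      # position of the element inside its chunk
    return chunk * slen + slot * sew
```

With VLEN=SLEN there is a single chunk and the offset reduces to i*SEW. With VLEN=256, SLEN=128, SEW=32, elements 0 through 7 land at bit offsets 0, 128, 32, 160, 64, 192, 96, 224.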
I think that it's important that SLEN=VLEN is the extension. For the
same reason that TSO memory ordering is the extension and RVWMO is the
default. Or even M is the extension rather than not-M being the
extension. An extension adds capabilities. SLEN=VLEN adds
capabilities. SLEN<VLEN would be an "extension" that takes away
capabilities.

Your suggestion for a field containing the mapping seems like a
generalization of the one bit mode we discussed earlier. I like the
idea. SLEN=VLEN would be implemented in the machines I imagine by
reducing VLEN to a single slice. If the performance didn't matter,
fine. If it did, some code rewriting would be needed (casts or a
different algorithm, or maybe even just setting the field differently).
Not too bad. Far better than if the code didn't work at all.

This has the advantage of making explicit the needs and making them
check-able. It makes the potential divergence more manageable, I think,
as almost all cores can support both modes - just sometimes with a loss
of performance, which may or may not need rectifying. It puts this
difference in the same class with many other significant performance
differences, with far less loss than instructions that are "implemented"
by trap and emulate, for example.

I think that sort of thing will happen a lot. Some implementations will
do gathers and scatters at one element per cycle. Others will do them
much faster. Some will do divides far more rapidly than others. And so on.


D) I don't agree that a specific SLEN be stipulated at the ISA level
(likely it is appropriate in use-case profiles)
        The most appropriate "stipulated" SLEN is already
implementation dependent, and it does and will change with the technology used.
        I consider it analogous to MIPS choosing a one-instruction
delayed branch
            (vs two or more instructions executed in the branch shadow),
        a choice that later technology/implementation changes have obsoleted.

E) A specified SLEN will not eliminate the in-register vs in-memory
program anomalies.
        None of the reasonable SLEN values (i.e. >=64) eliminate the
duality that exists
            between (at least) VLEN=32 (and 64) implementations vs
            SLEN<VLEN ones.
         Software may be sufficient to address the layout discrepancies.
         Hardware assist (e.g. casting instructions) may help.
         Hardware detection (e.g. trapping when the stored SEW mismatches
the read SEW) may ensure safety, and also allow optimization.
         I have a specific suggestion below in G.
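
To make the hardware-detection idea concrete, here is a toy model (purely an illustration; nothing like this is in any draft). It assumes each register remembers the SEW it was last written with and traps on a read at a different SEW. The names TaggedVReg and SewMismatchTrap are invented for the sketch.

```python
class SewMismatchTrap(Exception):
    """Stands in for the trap a hardware check might raise (hypothetical)."""


class TaggedVReg:
    """Toy vector register that records the SEW of its last write."""

    def __init__(self):
        self.written_sew = None  # no SEW recorded until the first write
        self.elements = []

    def write(self, sew, elements):
        # Remember which element width produced the current contents.
        self.written_sew = sew
        self.elements = list(elements)

    def read(self, sew):
        # Reading at a different SEW is where a layout mismatch could bite,
        # so the model traps instead of returning reinterpreted data.
        if self.written_sew is not None and sew != self.written_sew:
            raise SewMismatchTrap(
                f"written with SEW={self.written_sew}, read with SEW={sew}"
            )
        return list(self.elements)
```

A handler for such a trap could either report an error or perform the cast in software, which is the safety versus optimization trade-off mentioned above.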

F) I question whether we need to have only this extension.
        Rather than this being the only allowed approach, I believe others
are not only possible, but each has specific merit.
        a) The v0.8 approach has notable drawbacks and so I concur that
it should not be the espoused approach.
        b) Fractional register interleave is another viable approach
that resolves the widening/narrowing alignment issue
                and also provides for even/odd argument handling.
         c) Further variants of v0.9 (e.g. clustering size) provide
specific benefits.
        I believe these and other alternative mappings should not be
locked out of the standard.
        Rather, the standard should accommodate the availability of
non-standard and other extended layouts.
            (just as the RVI base architecture is a base for extensive
experimentation and adaptation)

G) To allow multiple register layouts I propose an application-writable
field (it could be in vcsr) that contains the mapping the code expects.
        I propose a 4 bit field that initially has two modes defined:
            zero (0) - reserved
            one (1) -  the default mapping (VLEN=SLEN)
            two (2)  - the v0.9 proposed SLEN<VLEN.
            three (3) - software manages with either of these modes.
            four through 11 are initially reserved with 12 through 15
user defined.
        If the hardware cannot run, or is not configured to run, in the
written mode, an exception occurs.
        This provides a mechanism to ensure software is aware of its
operating layout mode and
            to allow additional layout expansion.
        We can expect that the "official" software ecosystem will
initially only support the "standard" layouts,
             but will allow customization and, as demand occurs,
controlled expansion.
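
The field encoding in G could be modeled roughly as follows. This is a sketch under assumptions: the function names, the LayoutModeTrap exception, and the idea of passing the set of modes the hardware accepts are all invented for illustration, and how mode 3 relates to modes 1 and 2 in hardware is left open here.

```python
class LayoutModeTrap(Exception):
    """Stands in for the exception raised on an unsupported mode (hypothetical)."""


MODE_VLEN_EQ_SLEN = 1  # the default mapping (VLEN=SLEN)
MODE_SLEN_LT_VLEN = 2  # the v0.9 proposed SLEN<VLEN mapping
MODE_EITHER = 3        # software manages with either of these modes


def classify_mode(value):
    """Classify a 4-bit field value per the proposed encoding."""
    if not 0 <= value <= 15:
        raise ValueError("layout mode is a 4-bit field")
    if value in (MODE_VLEN_EQ_SLEN, MODE_SLEN_LT_VLEN, MODE_EITHER):
        return "standard"
    if value >= 12:
        return "user"      # 12 through 15 are user defined
    return "reserved"      # 0 and 4 through 11


def write_layout_mode(value, hw_modes):
    """Model a write to the field; hw_modes is the set the hardware accepts."""
    if classify_mode(value) == "reserved" or value not in hw_modes:
        raise LayoutModeTrap(f"layout mode {value} not available")
    return value
```

For example, a core that only implements the default layout would accept write_layout_mode(1, {1}) and raise on write_layout_mode(2, {1}), so code written for the striped layout fails loudly instead of silently misbehaving.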
