Thoughts for Vector TG Meeting Friday June 12
There hasn't been any traffic on this list about vector layout, and specifically about the v0.9 SLEN size standards.
So I will add my thoughts here with the hopes that I will not take excessive time during the meeting.
A) I concur with the v0.9 approach for in-register layout.
B) Specifically, I consider VLEN=SLEN the default layout.
It is quite appropriate for minimalist implementations with VLEN<128 and reasonable for VLEN=128 and 256.
C) The SLEN<VLEN mapping
(successive elements are round-robin allocated to successive SLEN chunks)
is a reasonable extension to the default layout.
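To make the round-robin allocation concrete, here is a small sketch (my own illustration, not spec text) of where element i lands under the SLEN<VLEN mapping; the function name and the VLEN/SLEN/SEW example values are assumptions chosen for illustration:

```python
# Sketch of the v0.9 SLEN<VLEN layout: successive elements are
# round-robin allocated to successive SLEN-bit chunks.

def element_position(i, vlen=256, slen=128, sew=32):
    """Return (chunk index, bit offset within chunk) for element i."""
    num_chunks = vlen // slen          # SLEN-bit chunks per register
    elems_per_chunk = slen // sew      # elements each chunk can hold
    chunk = i % num_chunks             # round-robin across chunks
    slot = i // num_chunks             # position within the chunk
    assert slot < elems_per_chunk, "element does not fit in one register"
    return chunk, slot * sew

# With VLEN=256, SLEN=128, SEW=32: even elements land in chunk 0,
# odd elements in chunk 1.
print([element_position(i) for i in range(8)])
```

Note that with VLEN=SLEN the mapping degenerates to a single chunk, i.e. the default sequential layout.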
D) I don't agree that a specific SLEN should be stipulated at the ISA level (though that is likely appropriate in use-case profiles).
The most appropriate "stipulated" SLEN is already implementation dependent, and it will change with the technology used.
I consider it analogous to MIPS choosing the one-instruction delayed branch
(vs. two or more instructions executed in the branch shadow),
a choice that technology/implementation changes have since obsoleted.
E) A specified SLEN will not eliminate the in-register vs. in-memory program anomalies.
None of the reasonable SLEN values (i.e. >=64) eliminate the duality that exists
between (at least) VLEN=32 (and 64) implementations and those with VLEN>SLEN.
Software may be sufficient to address the layout discrepancies.
Hardware assist (e.g. casting instructions) may help.
Hardware detection (e.g. trapping when the stored SEW mismatches the read SEW) may ensure safety, and also allow optimization.
I have a specific suggestion below in G.
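The hardware-detection idea above can be modeled in software. The following is a hypothetical sketch (class and exception names are my own, not from any spec): track the SEW each vector register was last written with, and trap when it is read with a different SEW.

```python
# Software model of SEW-mismatch detection: a real core could trap at
# the mismatch, letting software re-cast the layout (or an optimizer
# elide the check when layouts provably agree).

class SEWMismatch(Exception):
    pass

class RegFileModel:
    def __init__(self):
        self.stored_sew = {}           # vreg number -> SEW at last write

    def write(self, vreg, sew):
        self.stored_sew[vreg] = sew

    def read(self, vreg, sew):
        stored = self.stored_sew.get(vreg)
        if stored is not None and stored != sew:
            raise SEWMismatch(
                f"v{vreg} written with SEW={stored}, read with SEW={sew}")

rf = RegFileModel()
rf.write(2, 32)
rf.read(2, 32)                         # OK: SEW matches
try:
    rf.read(2, 16)                     # mismatched SEW -> detected
except SEWMismatch as e:
    print(e)
```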
F) I question whether this needs to be the only extension.
Rather than the sole allowed approach, I believe other layouts are not only possible, but that each has specific merit.
a) The v0.8 approach has notable drawbacks, and so I concur that it should not be the espoused approach.
b) Fractional register interleave is another viable approach that resolves the widening/narrowing alignment issue
and also provides for even/odd argument handling.
c) Further variants of v0.9 (e.g. clustering size) provide specific benefits.
I believe these and other alternative mappings should not be locked out of the standard.
Rather, the standard should accommodate the availability of non-standard and other extended layouts.
(just as the RVI base architecture is a base for extensive experimentation and adaptation)
G) To allow multiple register layouts I propose an application writable field (It could be vcsr) that contains the mapping the code expects.
I propose a 4-bit field that initially has the following encodings defined:
zero (0) - reserved
one (1) - the default mapping (VLEN=SLEN)
two (2) - the v0.9 proposed SLEN<VLEN mapping
three (3) - software manages with either of these modes
four (4) through eleven (11) are initially reserved, with twelve (12) through fifteen (15) user defined.
If the hardware cannot, or is not configured to, run in the written mode, an exception occurs.
This provides a mechanism to ensure software is aware of its operating layout mode and
to allow additional layout expansion.
We can expect that the "official" software ecosystem will initially only support the "standard" layouts,
but will allow customization and, as demand occurs, controlled expansion.
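The proposal in G can be sketched as follows (a hypothetical model; the table contents follow the encodings above, but the function name and the `supported` parameter are my own illustration):

```python
# Illustrative model of the proposed 4-bit layout-mode field and the
# exception on writing a mode the hardware cannot (or is not
# configured to) run.

LAYOUT_MODES = {
    0: "reserved",
    1: "default mapping (VLEN=SLEN)",
    2: "v0.9 SLEN<VLEN mapping",
    3: "software manages with either mode",
    # 4..11 initially reserved, 12..15 user defined
}

class LayoutModeException(Exception):
    pass

def write_layout_mode(requested, supported):
    """Model a write to the field; raise if the mode is unsupported."""
    if not 0 <= requested <= 15:
        raise ValueError("mode field is 4 bits")
    if requested not in supported:
        raise LayoutModeException(
            f"mode {requested} unsupported by this implementation")
    return requested

# A hart configured for both standard layouts:
write_layout_mode(1, supported={1, 2, 3})       # accepted
try:
    write_layout_mode(13, supported={1, 2, 3})  # user-defined, not configured
except LayoutModeException as e:
    print(e)
```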
On 6/11/20 4:57 PM, David Horner wrote:
I think that it's important that SLEN=VLEN is the extension. For the
same reason that TSO memory ordering is the extension and RVWMO is the
default. Or even M is the extension rather than not-M being the
extension. An extension adds capabilities. SLEN=VLEN adds
capabilities. SLEN<VLEN would be an "extension" that takes away capabilities.
Your suggestion for a field containing the mapping seems like a
generalization of the one bit mode we discussed earlier. I like the
idea. SLEN=VLEN would be implemented in the machines I imagine by
reducing VLEN to a single slice. If the performance didn't matter,
fine. If it did, some code rewriting would be needed (casts or a
different algorithm, or maybe even just setting the field differently).
Not too bad. Far better than if the code didn't work at all.
This has the advantage of making explicit the needs and making them
check-able. It makes the potential divergence more manageable, I think,
as almost all cores can support both modes - just sometimes with a loss
of performance, which may or may not need rectifying. It puts this
difference in the same class with many other significant performance
differences, with far less loss than instructions that are "implemented"
by trap and emulate, for example.
I think that sort of thing will happen a lot. Some implementations will
do gathers and scatters at one element per cycle. Others will do them
much faster. Some will do divides far more rapidly than others. And so on.