On 2020-05-16 2:25 p.m., Bill Huffman wrote:
I wasn't at all trying to suggest that we could use the store/load
combination instead of defining cast instructions.
Nor was I intending to imply that you believed these casting instructions would affect the memory system.
Although Krste and, I believe, Andrew did suggest that the memory side of the register file could be leveraged to provide the appropriate rearrangement with respect to fractional loads.
I was only trying to
verify the required operation.
So I apologize that I didn't make that clearer in my response.
You said that it is a sufficient condition. If there's a less
constraining condition that is also sufficient, could you describe it?
To some extent this is only conjecture on my part, a hopeful possibility; especially the intermediate format, which occurs in two parts to provide the full transform.
Something that comes close is a requirement that the residual (tail) space of a casting instruction be indeterminate:
a prohibition of tail-undisturbed, and a variant of tail-agnostic that does not require a poison-value fill.
And a further constraint that vl cannot be changed for this "transformed" register (that is, while any cast is active).
Such an approach would allow a microarch to tag the register that is "cast" with the transform it will need.
Then, when the register is referenced, and only then, the transform would be applied, within that reference only.
This would be a win for a comparison of two registers at a SEW other than that loaded, as
1) both registers could be so tagged, and then
2) the comparison could occur in parallel across all SLEN sections, starting with the most significant parts at the loaded SEW sizes and advancing to lower significance only as required.
3) Of course, the mask register is going to be in ordinal order, and possibly closely aligned with the original SEW load structures, even if it isn't as well aligned with the newer SEW.
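To make the tagging idea concrete, here is a minimal software sketch of a deferred-cast register file. Everything here (the class, its methods, the transforms) is my own illustrative invention, not part of any proposal: a cast only records a pending transform on the register, and the transform is applied lazily when the register is next referenced.

```python
# Hypothetical sketch of the "deferred cast" idea: tag a register with
# the transform it will need, and apply it only on reference.
class LazyCastRegFile:
    def __init__(self):
        self.regs = {}  # name -> (data bytes, pending transform or None)

    def write(self, name, data):
        self.regs[name] = (list(data), None)

    def cast(self, name, transform):
        """Record the transform instead of performing it now,
        composing it with any earlier pending transform."""
        data, pending = self.regs[name]
        if pending is None:
            self.regs[name] = (data, transform)
        else:
            self.regs[name] = (data, lambda d, p=pending: transform(p(d)))

    def read(self, name):
        data, pending = self.regs[name]
        if pending is not None:
            data = pending(data)            # apply on first reference...
            self.regs[name] = (data, None)  # ...and drop the stale tag
        return data

rf = LazyCastRegFile()
rf.write("v0", range(4))
rf.cast("v0", lambda d: d[::-1])  # no data movement yet, just a tag
print(rf.read("v0"))              # transform applied here -> [3, 2, 1, 0]
```

This also shows where the stale-state question bites: any context switch (or vl change) would have to either apply or save the pending tags.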
But if I can think of such things, then how much more the experts!!
If these operations are indeed infrequent, then the impact can be deferred as in the above example, whereas the general programming principle is to discard as little as necessary, with the expectation that it might be used later (optimistic vs. pessimistic).
Of course, such a deferral needs to address how to remove stale state, for example at a context switch.
At that time, a standard store will need to do in-memory formatting at the new SEW, so that's doable.
However, a full register store will not store exactly the in-register bytes but rather the new SEW's representation, so deferral has a higher cost then.
But we are now talking about spilling to memory, so performance is expected to be limited.
In any case, there are possibilities.
hoping either to prove there is not a layout or (possibly) show that
there is one.
I'm not holding my breath, but I am very hopeful that such a viable compromise situation can be formulated.
Losing tail-undisturbed, for example, is a reasonable compromise. Does improved performance warrant the increased validation cost of tail-undefined?
Perhaps tail-undefined is not strictly necessary; the poison fill could also be deferred.
But starting to track changing vl lengths is problematic, and a trap on vl change while a cast is "active" is going to be annoying to some programmers. Worse, SLEN=VLEN implementations will need to enforce it (annoying microarchitecture engineers) if the deferred transform is to be reliable across the ecosystem.
As I mentioned, a deferral solution such as this is similar to the prefix approach.
We will have to wait and see what Krste and co. propose. As I said, I have been surprised before, and pleasantly.
Then it can be contrasted with other approaches, tweaked, or have edge cases addressed.
Until then, like you, I'm hoping.
On 5/16/20 6:06 AM, David Horner wrote:
By the way, this is similar to the "prefix" proposal that applies the
transform to selected sources and/or destinations.
The cost is reduced because the "transform" occurs while the operation
is also executing:
- on SLEN=VLEN machines these "prefixes" are nops
- on SLEN<VLEN microarchs the specific combination of transform and
operation can be optimized, especially for high-use combinations.
The cost is further reduced for some combinations, as an intermediate
state can be used for that op-and-transform pair rather than storing a
result that has to be usable by any operation.
To the extent that these cast operations can apply in this way, the
benefit of the "prefix" over a cast may be limited and may not warrant
a "new" prefix mechanism like #423 (and a specific application, #456).
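The fusion idea can be sketched in software terms. This is a purely illustrative model (the function and names are mine, not from the proposal): the per-source transform is applied on the fly inside the operation, so a transformed register value is never materialized as an architectural result.

```python
# Illustrative sketch of a "prefix": fuse a per-source transform with
# the operation itself, rather than writing a transformed register.
def fused_op(op, sources, transforms):
    """Apply op elementwise, transforming each source on the fly.
    transforms[i] is the pending cast for sources[i]; None = identity."""
    cols = []
    for src, t in zip(sources, transforms):
        cols.append(t(src) if t is not None else src)
    return [op(*elems) for elems in zip(*cols)]

add = lambda a, b: a + b
# On an SLEN=VLEN machine the transforms are the identity (a nop):
print(fused_op(add, [[1, 2], [3, 4]], [None, None]))  # [4, 6]
# On SLEN<VLEN, a rearranging transform is folded into the op itself:
print(fused_op(add, [[1, 2], [3, 4]], [lambda d: d[::-1], None]))  # [5, 5]
```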
On 2020-05-16 8:50 a.m., David Horner via lists.riscv.org wrote:
I would agree that "by definition" this is a sufficient condition to
obtain the instructions that Krste was envisioning: instructions that
are also nops on SLEN=VLEN machines.
That is a sufficient condition to address the byte mismatch of SLEN<VLEN.
However, is it necessary, given that it is a very expensive operation for SLEN<VLEN?
Are there casting instructions that are reasonably low cost on both
SLEN=VLEN and SLEN<VLEN that create an intermediate state that works
for both?
And if there are such operations, do you provide only them (and NOT
the heavy-handed "as if written to memory and back")?
Can two such instructions do the full transition for SLEN< to SLEN=?
If so, is it sufficiently easy to recognize such a pair and fuse them
as a nop on SLEN=VLEN systems?
Can applications alternatively rely on a linkage editor to nop them?
I have no good solution (yet) as the guts of the range of microarch
tricks is not my forte.
But there are others who undoubtedly are mulling over such possibilities.
It would not be a lose-win proposition but a limited win-win.
I look forward to Krste's proposals. I have been surprised before!
On 2020-05-16 2:07 a.m., Bill Huffman wrote:
It seems like the function of a cast instruction is the same as storing to
memory (stride-1) with one SEW and loading back the same number of bytes
with another SEW. Is that a correct understanding?
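That round trip can be modeled directly. The layout function below is a hypothetical stand-in for an SLEN-striped register layout (it is NOT the actual mapping in the spec); the point is only that a cast is from_memory(new SEW) composed with to_memory(old SEW), which degenerates to a byte-for-byte no-op when SLEN=VLEN.

```python
# Model a register as a list of bytes. A hypothetical SLEN layout
# stripes elements round-robin across the VLEN/SLEN sections.
def to_memory(reg, sew, slen, vlen):
    """Return the register's bytes in memory (element-index) order."""
    nsec = vlen // slen
    mem = []
    for i in range(vlen // sew):
        sec, slot = i % nsec, i // nsec
        base = sec * slen + slot * sew
        mem.extend(reg[base:base + sew])
    return mem

def from_memory(mem, sew, slen, vlen):
    """Inverse: place memory-ordered elements into the striped layout."""
    nsec = vlen // slen
    reg = [0] * vlen
    for i in range(vlen // sew):
        sec, slot = i % nsec, i // nsec
        base = sec * slen + slot * sew
        reg[base:base + sew] = mem[i * sew:(i + 1) * sew]
    return reg

def cast(reg, old_sew, new_sew, slen, vlen):
    """Cast = store at old SEW, reload the same bytes at new SEW."""
    return from_memory(to_memory(reg, old_sew, slen, vlen),
                       new_sew, slen, vlen)

reg = list(range(16))                  # VLEN = 16 bytes
print(cast(reg, 2, 4, 16, 16) == reg)  # SLEN=VLEN: cast is a nop -> True
print(cast(reg, 2, 4, 8, 16) == reg)   # SLEN<VLEN: bytes move -> False
```

All sizes are in bytes here; the striping rule is only an assumption chosen to make the SLEN=VLEN/SLEN<VLEN contrast visible.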
On 5/15/20 11:55 AM, Krste Asanovic wrote:
Task Group: Vector Extension
Chair: Krste Asanovic
Co-Chair: Roger Espasa
Number of Attendees: ~20
Current issues on github:
# MLEN=1 change
The new layout of mask registers with fixed MLEN=1 was discussed. The
group was generally in favor of the change, though there is a proposal
in flight to rearrange bits to align with bytes. This might save some
wiring but could increase the bits read/written for the mask.
#434 SLEN=VLEN as optional extension
Most of the time was spent discussing the possible software
fragmentation from having code optimized for SLEN=VLEN versus
SLEN<VLEN, and how to avoid it. The group was keen to prevent possible
fragmentation, so it is going to consider several options:
- providing cast instructions that are mandatory, so at least
SLEN<VLEN code runs correctly on SLEN=VLEN machines.
- consider a different data layout that could allow casting up to ELEN
(<=SLEN); however, these appear to result in an even greater variety of
layouts or dynamic layouts
- invent a microarchitecture that can appear as SLEN=VLEN but
internally restrict datapath communication within SLEN width of
datapath, or prove this is impossible/expensive
The group agreed to declare the current version of the spec as 0.9,
representing a clear stable step for software and implementors.