On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote:
> You state below that you are asking for a way to guarantee that no I-caches have stale data (for a given block -- my words there). I don't think that's really what you're asking for.
Yes it's what I'm asking for.
OK, I can believe you are asking for that instruction (or a variant -- see below), but you also say below you're going to use it in a way that really does replace what FENCE.I (and some IPIs and such) do... which is to bring the I-fetch side up to date with the D-side's latest values. I still don't think that works.
> I think what you're really trying to do is replace FENCE.I and you believe that if you just had an instruction that would blow away all the I-cache copies of a given address, that would be enough to get you there.
No, we already have FENCE.I, and I understand how to use it. I
understand that in a non-coherent system, it may not do anything at
all to the i-cache (eg, if an IODMA channel replaces executable code
in memory, and those writes are not observable by the hart executing
FENCE.I).
I'll skip over this para since I don't really understand FENCE.I and I am going to try really hard to never learn it 🙂.
> My point is that is NOT enough unless you make a whole mess of
other assumptions about the system and somehow clean out the post
I-cache pipes of stale instructions which is very bad architecture. A
cache invalidate instruction is one leg at best of the three (or more)
legs you need to hold this three-legged stool up.
I'm not trying to run self-modifying code or JIT code. I'm trying to
load new code from a non-coherent I/O device, and I want to ensure
there are no other copies floating in i-caches anywhere. I will settle
for an instruction that removes a cache block from the local hart
i-cache, because there is no coherence mechanism. In a coherent
system, an instruction that also removes a cache block in other harts'
i-caches is fine, but I don't care about that use-case; I know others
do care, so it should be defined as part of the CBO.INVAL.I operation,
and this would make it consistent with other CBO.* instructions. IODMA
is likely to be non-coherent in either case.
Huh... ok. I'm not sure that it matters all that much if the D-side updates that create new instructions come from store instructions from a HART or from an I/O device (incoherent or not) -- at least as far as the post I-cache buffer flushing goes. I don't think that matters at all.
There is no need to worry about the post i-cache pipes because the OS
can guarantee that the code in that thread has stopped running and
therefore has nothing in flight. However, it has no way of knowing the
state of the i-caches.
This is where we depart controlled flight. Just because the OS stops running the thread doesn't necessarily guarantee that there isn't something stale lurking in a loop cache or some other exotic structure down-wind of the I-cache that isn't necessarily cleared by the I-cache invalidation instruction. So, while I'm quite certain that your application might satisfy that, I can dream up ones where it doesn't necessarily hold. <spoiler alert, I'm lying just a bit here. There is one subtle point in the JIT stuff in the J-extension that might make this case have to work out, but even there you need more than just the I-cache invalidate instruction and the point remains... but we hold that aside for the moment>
So, in the end, I think by hanging everything on the i-cache invalidate instruction, you're inducing some exceptionally subtle additional requirements on the implementation of that i-cache invalidation instruction that are hard to even pin down or describe.
I think you really have to have some equivalent of ISYNC/ISB/IMPORT.I to cleanly and architecturally close the hole in the post I-cache buffers.
So, yes, I think in the end you really do need to bite off the full J-extension if you want something that is architecture that works everywhere. Just using the I-cache invalidate might work in your application, but I don't think it closes generally.
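To make the disagreement above concrete, here is a minimal toy model in C. It is only a sketch under loud assumptions: the one-word memory, the single I-cache line, the "loop buffer", and every name in it (toy_hart, toy_icache_inval, toy_ifetch_sync) are hypothetical and do not correspond to any RISC-V instruction or real implementation. It only illustrates how a structure downstream of the I-cache could keep serving old code after an I-cache-only invalidate, if an implementation were built that way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy hart: one instruction word in memory, one I-cache line, and one
   post-I-cache buffer (think "loop buffer"). All names are hypothetical. */
typedef struct {
    uint32_t mem;       /* instruction word as seen in memory */
    bool     ic_valid;  /* I-cache line holds a copy */
    uint32_t ic_data;
    bool     lb_valid;  /* post-I-cache buffer holds a copy */
    uint32_t lb_data;
} toy_hart;

/* Fetch: the buffer is consulted first, then the I-cache, then memory. */
static uint32_t toy_fetch(toy_hart *h) {
    if (h->lb_valid) return h->lb_data;
    if (!h->ic_valid) { h->ic_data = h->mem; h->ic_valid = true; }
    h->lb_data = h->ic_data;
    h->lb_valid = true;
    return h->lb_data;
}

/* Non-coherent DMA: memory changes, nothing on the fetch side is told. */
static void toy_dma_write(toy_hart *h, uint32_t v) { h->mem = v; }

/* Assumed I-cache-only invalidate: drops the cache line, nothing else. */
static void toy_icache_inval(toy_hart *h) { h->ic_valid = false; }

/* Assumed ISYNC/ISB-like step: drops state downstream of the I-cache. */
static void toy_ifetch_sync(toy_hart *h) { h->lb_valid = false; }

/* Walks through the scenario discussed in the thread. */
static void toy_demo(void) {
    toy_hart h = { .mem = 0x11111111u };
    assert(toy_fetch(&h) == 0x11111111u);  /* old code warms both levels */
    toy_dma_write(&h, 0x22222222u);        /* new code arrives by DMA */
    toy_icache_inval(&h);
    assert(toy_fetch(&h) == 0x11111111u);  /* stale: buffer still hits */
    toy_icache_inval(&h);
    toy_ifetch_sync(&h);
    assert(toy_fetch(&h) == 0x22222222u);  /* both steps: new code seen */
}
```

Whether a real implementation has such a buffer, and whether stopping the thread is enough to drain it, is exactly the point in dispute; the model takes no side beyond showing the failure mode being described.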
I'm not asking for it, but I don't believe there is an instruction
that flushes the entire i-cache of a hart. I also believe there is no
way to flush all i-caches of a system. I'm not sure if those would be
useful; they are not on my radar, and they would be very disruptive to
performance.
I'm going to skip over this paragraph as well, other than to say that's a different debate for another day and my initial reaction is those instructions are really hard to define cleanly as architecture, have limited usefulness, and are much better left to implementation-specific instructions in the implementations that can actually justify needing them. They are not mainline architecture everyone needs.
> My question wasn't that, but was more along the lines of is this just an academic critique of the architecture as it exists now, or is there some real project that needs this defined right now.
I am starting a new project around IODMA and coherence issues.
Although it is academic, it will exist physically (real logic on an
FPGA) and run real code. This is not so urgent that I need the
instruction yesterday, or even in 12 months, as I can always create my
own instruction. I do not normally do OS-level work, but this is where
such an instruction would be used the most, and consultation should be
made with maintainers of RobotOS (ROS2), Embedded Linux, FreeRTOS,
Zephyr, etc to see how they handle this problem.
OK... good. So I would suggest we stop trying to fast-track anything. We're going to be done with this J group extension, if it kills me, before your 12-month timeline (and it might). I am perfectly happy for whoever from all of those groups to show up at the J meetings and provide their input/tell us what we have that doesn't fit their needs (but that worked fine for ARM and Power for 25+ years).
The problem, to be clear, is inadvertent execution of stale code
because there is no i-cache coherence and there are no i-cache
management instructions.
Yes, though arguably FENCE.I is an i-cache management instruction, just one none of us like to use.
One "workaround" to this problem is for the OS to never re-use a
physical address for new code until it has to wrap around. This is a
"lazy" way that hopes i-cache contents are eventually replaced on
their own. However, it is not a guarantee.
I'm not at all a fan of hack-like non-guarantees.
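For readers unfamiliar with the "lazy" workaround described above, a minimal sketch in C might look like the following. The code_region type and code_region_alloc function are hypothetical names invented for illustration, not taken from any OS mentioned in this thread, and the sketch ignores whether older code images are still live when the region wraps.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocator for a region reserved for loaded code. Addresses
   are handed out in increasing order and reused only after a wrap. */
typedef struct {
    uintptr_t base;   /* start of the code region */
    size_t    size;   /* region size in bytes */
    size_t    next;   /* offset of the next unused byte */
    unsigned  wraps;  /* how many times the region has wrapped */
} code_region;

/* Returns the load address for a new image of len bytes, or 0 if the
   image can never fit in the region. */
static uintptr_t code_region_alloc(code_region *r, size_t len) {
    if (len == 0 || len > r->size) return 0;
    if (len > r->size - r->next) {
        /* Out of fresh addresses: wrap and start reusing old ones. From
           here on, an i-cache may still hold lines for these addresses. */
        r->next = 0;
        r->wraps++;
    }
    uintptr_t addr = r->base + r->next;
    r->next += len;
    return addr;
}
```

The wraps counter marks the moment the scheme stops helping: before the first wrap no address has been reused, and after it the allocator offers nothing beyond the hope that the old lines have already been evicted.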
This problem is not the same as the one the J-group is attempting to solve.
Actually, I think it is. Since we disagree here, please explain the difference.
> If this is just an academic critique, I see no real reason to fast track an I-cache invalidate instruction on its own, especially when that is not a complete solution to the problem and you'll be getting that soon (along with all the other necessary parts) with the J group proposal.
I have no insight into what is coming with the J group proposal.
However, I don't think it should be required to adopt the entire J
extension to get this capability. The J group is concerned with
self-modifying / dynamic generation of code on the fly, which is a
different use case which may care about what is in the current
execution pipeline.
As I say above, I think you will need the J group extension in most of its totality to have a solution that closes in all cases and doesn't rely on subtle unstated requirements/assumptions. I would be loath to try and architect something that tries to do it
all with just the I-cache invalidation instruction.
The J extension isn't that costly in complexity or performance.
> If you have something more than an academic critique here, please share that to the extent possible, but even if you do, the I-cache invalidate on its own isn't enough to provide a full solution, so I'm still not sure we should be fast-tracking anything.
My request for a fast-track wasn't due to a sense of urgency, but due
to a belief this is relatively simple and easy to define. I am not
attempting to fully define it at this point, but trying to see if
there is any support from others.
I agree that an instruction that invalidates the I-caches is well within the known state of the art to define. I just think defining that on its own is useless for the purposes we're trying to get to here.
> We also need to know the expectations and use cases for the others you've seen request it.
I'll use Google to help us out, but I won't carefully read each link below.
Thank you for this. It's 11:50pm though and I need to give up. I won't do a deeper dive on this until later.
A few discussion forums:
There are also several more, including the most recent one (a CMO
github issue, I believe) that spurred this conversation (but not this
thread).
You can also find code that expects to use icache flush instructions:
https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/riscv/flush-icache.c.html
(this link assumes FENCE.I flushes the i-cache, which non-coherent
systems don't have to do, so it is technically incorrect)
----
> My concern is that the I-cache invalidate instruction isn't enough to do good architecture here, so it is unhelpful to ratify that without the rest of the pieces.
You've stated this several times, and I think you have become
entrenched in your position. To you, it seems that "good architecture"
means worrying about flushing the pipeline. However, there are other use-cases you are failing to recognize.
Entrenched position.. yes.. for 25 years across two major architectures (Power and ARM) for deeply held reasons and I will continue to repeat myself on that point.
I really don't think the use cases are different and I don't believe I'm failing to recognize your use case. Instead, I think you are looking at a specific "system design" point that gives you subtle characteristics, that aren't supported in every implementation,
that would let you do what you're hoping to do. Totally fair if you want to set up your system design that way and document that and use your own instructions.
For me, good architecture means the architecture language prevents an implementer from building an implementation that breaks if they don't follow the architecture rules. I think they can't break the extension we're going to propose (it's held for 25+ years).
However, I think I can build an implementation that would break an architecture that has only the I-cache invalidate instruction. To the extent that is possible, I don't think that should ever be architecture.
To me, I'm worried about "good system design", and as you can see by
the Linux link above it has a bug in its assumptions about i-cache
flushes. Instead, as the code links above show, we are already getting
incorrect software showing up.
I don't have time right now to look at all that, and in the end you'll probably have to point a bit more specifically to the places inside those web pages that support what you're trying to get across. But, would it shock me that people are a bit confused now? No. If this isn't clearly defined going in, it's very easy to confuse folks. And just because they may make an assumption about what an I-cache invalidate does, does not mean that that assumption is rational or should be pandered to.
Derek
Guy