Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #cmos


John Ingalls
 

Folks --
This email thread is getting long and my inbox is getting full.  I suggest that interested parties get the J group's proposal slide deck from Derek, and frame their questions and needs specifically relative to that proposal.
-- John


On Fri, Aug 12, 2022 at 8:28 AM Sean Halle <seanhalle@...> wrote:

Hi Guy, thanks for the email thread.  Very interesting.

I am hoping to get a precise understanding of 1) Your high level need, and 2) The details of the semantics you are asking for.

I do need to say up front that I'm not familiar with the current proposals from J Group, nor the details of other proposals on this subject, but this email is more to get precise details of what you're after.

Taking 2 first.. could you correct this interpretation of what, precisely you would like (independent of the use case or reason)?

some instruction (referred to as "target instruction") that:
- supplies an address (the "target addr")
- invalidates any cache line that is in the core's instr cache and contains the target address at the point that instr "takes effect".  (In practice, it takes effect when the instruction has completed update of the i-cache's tags, yes?  So the execution pipeline sees this.. in the mem stage for a 5 stage pipe?)
- Any instructions that appear in code order after the target instruction are stalled (or killed and restarted) until the target instr takes effect?
- no other effect.  At the cycle the target instruction takes effect, if a valid cache line exists (with either read or write permissions) in the same core's d-cache or in any other cache in the system, none are accessed nor affected in any way by this instruction.
- after the instruction takes effect, the next time the core's i-cache performs a read or write on the target cache line.. what?  Does the i-cache fetch only from the DRAM (or LLC), ignoring the core's d-cache and all other d-caches?  Or do the semantics of the target instruction have a side effect that causes behavior in the d-caches as well?  If so, could you say with similar precision, the semantics of interaction between the target instruction and d-caches?
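The interpretation listed above can be sketched as a toy software model. This is only an illustration, assuming a direct-mapped i-cache with 64-byte lines. The names `ToyICache` and `inval_line` are hypothetical and do not come from any proposal. The model covers only the "invalidate the local i-cache line, touch nothing else" part and leaves the open question about refill behaviour unanswered.

```python
LINE_BYTES = 64   # assumed cache-line size
NUM_SETS = 4      # assumed number of direct-mapped sets


class ToyICache:
    """Direct-mapped tag store; models only valid lines and their tags."""

    def __init__(self):
        self.tags = [None] * NUM_SETS  # None means the set holds no valid line

    def _split(self, addr):
        line = addr // LINE_BYTES
        return line % NUM_SETS, line // NUM_SETS  # (set index, tag)

    def fill(self, addr):
        index, tag = self._split(addr)
        self.tags[index] = tag

    def holds(self, addr):
        index, tag = self._split(addr)
        return self.tags[index] == tag

    def inval_line(self, addr):
        """Hypothetical target instruction: drop the line containing addr, if present."""
        index, tag = self._split(addr)
        if self.tags[index] == tag:
            self.tags[index] = None


icache, dcache = ToyICache(), ToyICache()  # the d-cache reuses the same toy model
icache.fill(0x1000)
icache.fill(0x1040)
dcache.fill(0x1000)

icache.inval_line(0x1010)  # 0x1010 lies in the line that starts at 0x1000
```

In this sketch only the i-cache line containing the target address is dropped. The neighbouring i-cache line and the d-cache copy are left untouched, which matches the "no other effect" bullet.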

===

So, with some clarity on the desired semantics, would you be up for saying a bit more about 1) -- the high level need?

My inference is that you are creating hardware that has, what, some alternate mechanism for synchronizing caches?  Say, at the software level?  If so, I'm guessing that what you want is the bare minimum logic in the caches and pipelines?  So, the goal is minimal logic, and minimal time spent to invalidate a single i-cache line.. because you have something else going on that handles the updates from d-caches to.. DRAM/LLC?  So all you need is something that just invalidates an i-cache line, then this other thing going on ensures that the next time that i-cache line is accessed, the contents in DRAM/LLC will be correct?  Something like that?

If so, then semantics that are designed for OoO cores, or support for JITs, or hardware with coherent caches, etc, are too heavy weight?  That's why you're proposing the above alternate semantics?

Thanks Guy, and Derek, et al, really interesting discussion :-)

Sean

P.S. If I am way off base, and this is not a constructive addition to the thread, I apologise, please ignore in that case.


On Fri, Aug 12, 2022 at 6:15 AM Guy Lemieux <guy.lemieux@...> wrote:
Derek, let's try to move the discussion forward instead of just back and forth.

If the J extension will solve this problem, please describe how? You've been very opaque -- is that because it hasn't been addressed yet by the committee?

Let's summarize this thread:

CBO.INVAL.I for icache invalidation is necessary. Since it will be used in a loop, it is important not to lock up the pipeline for performance reasons.

You believe CBO.INVAL.I must flush instruction pipes, otherwise such an instruction would be architecturally incomplete. To me, you are trying to turn a CBO into a fence instruction. As I understand it, CBOs do not execute synchronously/block, meaning they can be queued up somewhere in the system (but they should follow the memory system ordering semantics).

Perhaps we should simply add new FENCE.I instructions, or new semantics to the existing instruction, which flush instruction pipes and sync any outstanding CBO.INVAL.I operations, but do not dump the entire i-cache in noncoherent systems?

Such an instruction may be used ONCE at the end of the loop that issues all of the CBO.INVAL.I instructions, thus maintaining performance and imposing correctness.
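A minimal sketch of that usage pattern, in Python as a trace model rather than real instructions. It assumes a 64-byte cache block. `cbo_inval_i` and `fence_i_sync` are hypothetical stand-ins for the proposed CBO.INVAL.I and the proposed fence; neither exists in a ratified extension.

```python
LINE_BYTES = 64  # assumed cache-block size

issued = []  # trace of operations, standing in for real instructions


def cbo_inval_i(addr):
    """Stand-in for the proposed CBO.INVAL.I; may be queued, does not block."""
    issued.append(("cbo.inval.i", addr))


def fence_i_sync():
    """Stand-in for the proposed fence that waits for all queued invalidates."""
    issued.append(("fence", None))


def invalidate_code_range(start, size):
    """Invalidate every block overlapping [start, start + size), then fence once."""
    first = start - (start % LINE_BYTES)  # round down to a block boundary
    for addr in range(first, start + size, LINE_BYTES):
        cbo_inval_i(addr)
    fence_i_sync()  # single sync after the loop


invalidate_code_range(0x2010, 200)
```

The point of the shape is that the per-block operations carry no pipeline flush, so the cost of synchronization is paid once per range rather than once per block.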

The commentary in the FENCE.I instruction suggests there is encoding room for alternatives which use imm[11:0], rs1, and rd operands which "are reserved for finer-grain fences in future extensions". Here is one way to use that encoding space to add the required FENCE.I semantics:

-- FENCE.I with imm[11:0]==0x0 and rs1!=0: flushes instr pipelines and fences against only those outstanding CBO.INVAL.I that specified the cache block named in rs1 (ie, it does not dump the full i-cache in non-coherent systems; in coherent systems it does a fine-grained FENCE.I against any visible updates to that cache block)

-- FENCE.I with imm[11:0]==0xFFF and rs1==0: flushes instr pipelines and fences against ALL outstanding CBO.INVAL.I (ie, does not dump full i-cache in noncoherent systems; in coherent systems it behaves as FENCE.I with imm[11:0]==0).
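To make the encoding space concrete, here is a small sketch that packs the I-type fields of FENCE.I (MISC-MEM opcode, funct3 = 001) into a 32-bit word. The helper name `encode_fence_i` is hypothetical, and the two non-zero encodings are only the variants proposed above, not anything ratified. The choice of x10 (a0) for rs1 is arbitrary.

```python
MISC_MEM = 0b0001111      # major opcode shared by FENCE and FENCE.I
FUNCT3_FENCE_I = 0b001


def encode_fence_i(imm=0, rs1=0, rd=0):
    """Pack I-type fields into a 32-bit FENCE.I-style instruction word."""
    assert 0 <= imm < (1 << 12) and 0 <= rs1 < 32 and 0 <= rd < 32
    return (imm << 20) | (rs1 << 15) | (FUNCT3_FENCE_I << 12) | (rd << 7) | MISC_MEM


standard = encode_fence_i()                      # existing FENCE.I, all fields zero
per_block = encode_fence_i(imm=0x000, rs1=10)    # proposed: fence the block named in x10
all_pending = encode_fence_i(imm=0xFFF, rs1=0)   # proposed: fence all outstanding CBO.INVAL.I

print(f"{standard:#010x} {per_block:#010x} {all_pending:#010x}")
```

Both proposed variants differ from the existing FENCE.I word only in the fields the spec commentary describes as reserved, so an implementation could tell the three apart from imm[11:0] and rs1 alone.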

Guy


On Thu, Aug 11, 2022 at 10:20 PM Derek Williams <striker@...> wrote:

Thank you for this email Guy. It clarifies many things. 


From: Mark Himelstein <markhimelstein@...>
Sent: Thursday, August 11, 2022 9:01 AM
To: tech-cmo@... Group Moderators <tech-cmo@...>; Guy Lemieux <guy.lemieux@...>; tech-privileged <tech-privileged@...>
Cc: Derek Williams <striker@...>; Andrew Waterman <andrew@...>; allen.baum@... <allen.baum@...>; Martin Maas <mmaas@...>; John Ingalls <john.ingalls@...>; David Kruckemyer <dkruckemyer@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs
 
+tech-privileged

On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote:
On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote:
> You state below that you are asking for a way to guarantee that no I-caches have stale data (for a given block -- my words there). I don't think that's really what you're asking for.

Yes it's what I'm asking for.

OK, I can believe you are asking for that instruction (or a variant -- see below), but you also say below you're going to use it in a way that really does replace what FENCE.I (and some IPIs and such) do... which is to bring the I-fetch side up to date with the D-side's latest values.  I still don't think that works.

> I think what you're really trying to do is replace FENCE.I and you believe that if you just had an instruction that would blow away all the I-cache copies of a given address, that would be enough to get you there.

No, we already have FENCE.I, and I understand how to use it. I
understand that in a non-coherent system, it may not do anything at
all to the i-cache (eg, if an IODMA channel replaces executable code
in memory, and those writes are not observable by the hart executing
FENCE.I).

I'll skip over this para since I don't really understand FENCE.I and I am going to try really hard to never learn it 🙂.

  > My point is that is NOT enough unless you make a whole mess of
other assumptions about the system and somehow clean out the post
I-cache pipes of stale instructions which is very bad architecture. A
cache invalidate instruction is one leg at best of the three (or more)
legs you need to hold this three-legged stool up.

I'm not trying to run self-modifying code or JIT code. I'm trying to
load new code from a non-coherent I/O device, and I want to ensure
there are no other copies floating in i-caches anywhere. I will settle
for an instruction that removes a cache block from the local hart
i-cache, because there is no coherence mechanism. In a coherent
system, an instruction that also removes a cache block in other harts'
i-caches is fine, but I don't care about that use-case; I know others
do care, so it should be defined as part of the CBO.INVAL.I operation,
and this would make it consistent with other CBO.* instructions. IODMA
is likely to be non-coherent in either case.

Huh... ok. I'm not sure that it matters all that much if the D-side updates to create new instructions come from store instructions from a HART or from an I/O device (incoherent or not) -- at least as far as the post I-cache buffer flushing goes. I don't think that matters at all.

There is no need to worry about the post i-cache pipes because the OS
can guarantee that the code in that thread has stopped running and
therefore has nothing in flight. However, it has no way of knowing the
state of the i-caches.

This is where we depart controlled flight. Just because the OS stops running the thread doesn't necessarily guarantee that there isn't something stale lurking in a loop cache or some other exotic structure down-wind of the I-cache that isn't necessarily cleared by the I-cache invalidation instruction. So, while I'm quite certain that your application might, I can dream up ones where that doesn't necessarily hold. <spoiler alert, I'm lying just a bit here. There is one subtle point in the JIT stuff in the j-extension that might make this case have to work out, but even there you need more than just the I-cache invalidate instruction and the point remains... but we hold that aside for the moment>
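The concern about structures down-wind of the I-cache can be illustrated with a toy front-end model. This is a sketch under stated assumptions: a loop buffer that sits after the i-cache and is not cleared by the invalidate. The names `ToyFrontEnd`, `inval_icache`, and `sync_pipeline` are hypothetical, and no claim is made that any particular implementation behaves this way.

```python
class ToyFrontEnd:
    """Memory -> i-cache -> loop buffer, each level keyed by instruction address."""

    def __init__(self, memory):
        self.memory = memory     # backing store: address -> instruction text
        self.icache = {}
        self.loop_buffer = {}    # assumed structure downstream of the i-cache

    def fetch(self, addr):
        if addr in self.loop_buffer:      # hit downstream: i-cache is not consulted
            return self.loop_buffer[addr]
        if addr not in self.icache:       # miss: refill from memory
            self.icache[addr] = self.memory[addr]
        self.loop_buffer[addr] = self.icache[addr]
        return self.icache[addr]

    def inval_icache(self, addr):
        """Hypothetical i-cache-only invalidate; leaves the loop buffer alone."""
        self.icache.pop(addr, None)

    def sync_pipeline(self):
        """Hypothetical ISYNC/ISB-like step that empties post-i-cache buffers."""
        self.loop_buffer.clear()


memory = {0x100: "old insn"}
fe = ToyFrontEnd(memory)
fe.fetch(0x100)                 # warms both the i-cache and the loop buffer
memory[0x100] = "new insn"      # e.g. a non-coherent DMA write

fe.inval_icache(0x100)
stale = fe.fetch(0x100)         # still served from the loop buffer
fe.sync_pipeline()
fresh = fe.fetch(0x100)         # now refilled from memory
```

In this model the invalidate alone still lets the old instruction be fetched, and only the additional synchronizing step makes the new contents visible.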

So, in the end, I think by hanging everything on the i-cache invalidate instruction, you're inducing some exceptionally subtle additional requirements on the implementation of that i-cache invalidation instruction that are hard to even pin down or describe. I think you really have to have some equivalent of ISYNC/ISB/IMPORT.I to cleanly and architecturally close the hole in the post I-cache buffers.

So, yes, I think in the end you really do need to bite off the full J-extension if you want something that is architecture that works everywhere. Just using the I-cache invalidate might work in your application, but I don't think it closes generally. 

I'm not asking for it, but I don't believe there is an instruction
that flushes the entire i-cache of a hart. I also believe there is no
way to flush all i-caches of a system. I'm not sure if those would be
useful; they are not on my radar, and they would be very disruptive to
performance.

I'm going to skip this paragraph over as well other than to say that's a different debate for another day and my initial reaction is those instructions are really hard to define cleanly as architecture, have limited usefulness, and are much better left to implementation specific instructions in the implementations that can actually justify needing them. They are not mainline architecture everyone needs. 

> My question wasn't that, but was more along the lines of is this just an academic critique of the architecture as it exists now, or is there some real project that needs this defined right now.

I am starting a new project around IODMA and coherence issues.
Although it is academic, it will exist physically (real logic on an
FPGA) and run real code. This is not so urgent that I need the
instruction yesterday, or even in 12 months, as I can always create my
own instruction. I do not normally do OS-level work, but this is where
such an instruction would be used the most, and consultation should be
made with maintainers of RobotOS (ROS2), Embedded Linux, FreeRTOS,
Zephyr, etc to see how they handle this problem.


OK... good. So I would suggest we stop trying to fast-track anything. We're going to be done with this J group extension if it kills me before your 12-month timeline (and it might). I am perfectly happy for whomever from all of those groups to show up at the J meetings and provide their input/tell us what we have that doesn't fit their needs (but that worked fine for ARM and Power for 25+ years).

The problem, to be clear, is inadvertent execution of stale code
because there is no i-cache coherence and there are no i-cache
management instructions.

Yes, though arguably FENCE.I is an i-cache management instruction, just one none of us like to use. 

One "workaround" to this problem is for the OS to never re-use a
physical address for new code until it has to wrap around. This is a
"lazy" way that hopes i-cache contents are eventually replaced on
their own. However, it is not a guarantee.
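For illustration only, the "lazy" workaround might look like a wrap-around allocator for code regions. The class name `WrapAroundCodeAllocator` and the addresses are hypothetical. Nothing here is taken from an actual OS, and the sketch only shows when address reuse begins, not how an OS would place code.

```python
class WrapAroundCodeAllocator:
    """Hands out code regions at increasing addresses; reuses one only after wrap-around."""

    def __init__(self, base, size):
        self.base, self.size = base, size
        self.next_offset = 0
        self.wraps = 0  # times the allocator has started reusing addresses

    def alloc(self, nbytes):
        if nbytes > self.size:
            raise ValueError("request larger than the code region")
        if self.next_offset + nbytes > self.size:
            self.next_offset = 0  # wrap: from here on, stale i-cache lines are possible
            self.wraps += 1
        addr = self.base + self.next_offset
        self.next_offset += nbytes
        return addr


allocator = WrapAroundCodeAllocator(base=0x8000_0000, size=0x3000)
addrs = [allocator.alloc(0x1000) for _ in range(4)]
```

The fourth allocation lands on the first address again, which is exactly the point where the scheme stops protecting against stale i-cache contents.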

I'm not at all a fan of hack-like non-guarantees.

This problem is not the same as the one the J-group is attempting to solve.

Actually, I think it is.  Since we disagree here, please explain the difference.

> If this is just an academic critique, I see no real reason to fast track an I-cache invalidate instruction on its own, especially when that is not a complete solution to the problem and you'll be getting that soon (along with all the other necessary parts) with the J group proposal.

I have no insight into what is coming with the J group proposal.
However, I don't think it should be required to adopt the entire J
extension to get this capability. The J group is concerned with
self-modifying / dynamic generation of code on the fly, which is a
different use case which may care about what is in the current
execution pipeline.

As I say above, I think you will need the J group extension in most of its totality to have a solution that closes in all cases and doesn't rely on subtle unstated requirements/assumptions. I would be loath to try and architect something that tries to do it all with just the I-cache invalidation instruction.

The J extension isn't that costly in complexity or performance. 

> If you have something more than an academic critique here, please share that to the extent possible, but even if you do, the I-cache invalidate on its own isn't enough to provide a full solution, so I'm still not sure we should be fast-tracking anything.

My request for a fast-track wasn't due to a sense of urgency, but due
to a belief this is relatively simple and easy to define. I am not
attempting to fully define it at this point, but trying to see if
there is any support from others.

I agree that an instruction that invalidates the I-caches is well within the known state of the art to define. I just think defining that on its own is useless for the purposes we're trying to get to here.

> We also need to know the expectations and use cases for the others you've seen request it.

I'll use Google to help us out, but I won't carefully read each link below.

Thank you for this. It's 11:50pm though and I need to give up. I won't do a deeper dive on this until later.

A few discussion forums:




There are also several more, including the most recent one (a CMO
github issue, I believe) that spurred this conversation (but not this
thread).

You can also find code that expects to use icache flush instructions:


https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/riscv/flush-icache.c.html

(this link assumes FENCE.I flushes the i-cache, which non-coherent
systems don't have to do, so it is technically incorrect)

----

> My concern is that the I-cache invalidate instruction isn't enough to do good architecture here, so it is unhelpful to ratify that without the rest of the pieces.

You've stated this several times, and I think you have become
entrenched in your position. To you, it seems that "good architecture"
means worrying about flushing the pipeline. However, there are other use-cases you are failing to recognize.

Entrenched position.. yes.. for 25 years across two major architectures (Power and ARM) for deeply held reasons and I will continue to repeat myself on that point. 

I really don't think the use cases are different and I don't believe I'm failing to recognize your use case. Instead, I think you are looking at a specific "system design" point that gives you subtle characteristics, that aren't supported in every implementation, that would let you do what you're hoping to do. Totally fair if you want to set up your system design that way and document that and use your own instructions. 

For me, good architecture means the architecture language prevents an implementer from building an implementation that breaks if they don't follow the architecture rules. I think the extension we're going to propose they can't break (it's held for 25+ years). However, I think I can build an implementation that would break the I-cache invalidate instruction only architecture. To the extent that is possible, I don't think that should ever be architecture.

To me, I'm worried about "good system design", and as you can see by
the Linux link above it has a bug in its assumptions about i-cache
flushes. Instead, as the code links above show, we are already getting
incorrect software showing up.

I don't have time right now to look at all that, and in the end you'll probably have to point a bit more specifically to the places inside those web pages that support what you're trying to get across. But, would it shock me that people are a bit confused now? No. If this isn't clearly defined going in, it's very easy to confuse folks. And just because they may make an assumption about what an I-cache invalidate does, does not mean that that assumption is rational or should be pandered to.

Derek 


Guy




