[RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #cmos


mark
 


On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote:
On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote:
> You say below you state you are asking for a way to guarantee that no I-caches have stale data (for a given block -- my words there). I don't think that's really what you're asking for.

Yes it's what I'm asking for.

> I think what you're really trying to do is replace FENCE.I and you believe that if you just had an instruction that would blow away all the I-cache copies of a given address, that would be enough to get you there.

No, we already have FENCE.I, and I understand how to use it. I
understand that in a non-coherent system, it may not do anything at
all to the i-cache (eg, if an IODMA channel replaces executable code
in memory, and those writes are not observable by the hart executing
FENCE.I).

  > My point is that is NOT enough unless you make a whole mess of
other assumptions about the system and somehow clean out the post
I-cache pipes of stale instructions which is very bad architecture. A
cache invalidate instruction is one leg at best of the three (or more)
legs you need to hold this three-legged stool up.

I'm not trying to run self-modifying code or JIT code. I'm trying to
load new code from a non-coherent I/O device, and I want to ensure
there are no other copies floating in i-caches anywhere. I will settle
for an instruction that removes a cache block from the local hart
i-cache, because there is no coherence mechanism. In a coherent
system, an instruction that also removes a cache block in other harts'
i-caches is fine, but I don't care about that use-case; I know others
do care, so it should be defined as part of the CBO.INVAL.I operation,
and this would make it consistent with other CBO.* instructions. IODMA
is likely to be non-coherent in either case.

There is no need to worry about the post i-cache pipes because the OS
can guarantee that the code in that thread has stopped running and
therefore has nothing in flight. However, it has no way of knowing the
state of the i-caches.

I'm not asking for it, but I don't believe there is an instruction
that flushes the entire i-cache of a hart. I also believe there is no
way to flush all i-caches of a system. I'm not sure if those would be
useful; they are not on my radar, and they would be very disruptive to
performance.

> My question wasn't that, but was more along the lines of is this just an academic critique of the architecture as it exists now, or is there some real project that needs this defined right now.

I am starting a new project around IODMA and coherence issues.
Although it is academic, it will exist physically (real logic on an
FPGA) and run real code. This is not so urgent that I need the
instruction yesterday, or even in 12 months, as I can always create my
own instruction. I do not normally do OS-level work, but this is where
such an instruction would be used the most, and consultation should be
made with maintainers of RobotOS (ROS2), Embedded Linux, FreeRTOS,
Zephyr, etc to see how they handle this problem.

The problem, to be clear, is inadvertent execution of stale code
because there is no i-cache coherence and there are no i-cache
management instructions.
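
To make the failure mode concrete, here is a minimal Python sketch of a toy model. It is not real hardware and not any RISC-V API: `ToyHart`, `dma_write`, `inval_block`, and the 64-byte block size are all hypothetical names and values chosen for illustration. It only shows that a private, non-snooping i-cache keeps serving the old block after a non-coherent DMA write, until that block is explicitly invalidated.

```python
BLOCK = 64  # assumed cache-block size in bytes (illustrative only)


class ToyHart:
    """One hart with a private, non-coherent i-cache (toy model)."""

    def __init__(self, memory):
        self.memory = memory  # shared main memory: block address -> code
        self.icache = {}      # private i-cache: block address -> cached copy

    def fetch(self, addr):
        block = addr - addr % BLOCK
        if block not in self.icache:       # miss: refill from memory
            self.icache[block] = self.memory[block]
        return self.icache[block]          # hit: memory is not consulted

    def inval_block(self, addr):
        """Hypothetical stand-in for the per-block invalidate requested here."""
        self.icache.pop(addr - addr % BLOCK, None)


def dma_write(memory, addr, code):
    """Non-coherent IODMA: memory is updated, no i-cache is snooped."""
    memory[addr - addr % BLOCK] = code


memory = {0x1000: "old code"}
hart = ToyHart(memory)
first = hart.fetch(0x1000)             # brings the block into the i-cache
dma_write(memory, 0x1000, "new code")  # new code arrives from the I/O device
stale = hart.fetch(0x1000)             # still served from the i-cache
hart.inval_block(0x1000)               # drop only the affected block
fresh = hart.fetch(0x1000)             # refilled from memory
```

In this model the second fetch returns the old contents even though memory already holds the new code, which is the inadvertent execution of stale code described above.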

One "workaround" to this problem is for the OS to never re-use a
physical address for new code until it has to wrap around. This is a
"lazy" way that hopes i-cache contents are eventually replaced on
their own. However, it is not a guarantee.

This problem is not the same as the one the J-group is attempting to solve.

> If this is just an academic critique, I see no real reason to fast track an I-cache invalidate instruction on its own, especially when that is not a complete solution to the problem and you'll be getting that soon (along with all the other necessary parts) with the J group proposal.

I have no insight into what is coming with the J group proposal.
However, I don't think it should be required to adopt the entire J
extension to get this capability. The J group is concerned with
self-modifying / dynamic generation of code on the fly, which is a
different use case which may care about what is in the current
execution pipeline.

> If you have something more than an academic critique here, please share that to the extent possible, but even if you do, the I-cache invalidate on its own isn't enough to provide a full solution, so I'm still not sure we should be fast-tracking anything.

My request for a fast-track wasn't due to a sense of urgency, but due
to a belief this is relatively simple and easy to define. I am not
attempting to fully define it at this point, but trying to see if
there is any support from others.

> We also need to know the expectations and use cases for the others you've seen request it.

I'll use Google to help us out, but I won't carefully read each link below.

A few discussion forums:

https://lists.riscv.org/g/tech-j-ext/topic/risc_v_tech_cmo_first/80495338?p=

https://groups.google.com/a/groups.riscv.org/g/hw-dev/c/ZUoEwt65gno

https://patchwork.kernel.org/project/linux-mm/patch/20190624054311.30256-17-hch@.../

There are also several more, including the most recent one (a CMO
github issue, I believe) that spurred this conversation (but not this
thread).

You can also find code that expects to use icache flush instructions:

https://elixir.bootlin.com/linux/v4.15.5/source/arch/riscv/include/asm/cacheflush.h

https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/riscv/flush-icache.c.html

https://github.com/torvalds/linux/blob/master/arch/riscv/mm/cacheflush.c
(this link assumes FENCE.I flushes the i-cache, which non-coherent
systems don't have to do, so it is technically incorrect)

----

> My concern is that the I-cache invalidate instruction isn't enough to do good architecture here, so it is unhelpful to ratify that without the rest of the pieces.

You've stated this several times, and I think you have become
entrenched in your position. To you, it seems that "good architecture"
means worrying about flushing the pipeline. However, there are other
use-cases you are failing to recognize.

To me, I'm worried about "good system design", and as you can see from
the Linux link above, it has a bug in its assumptions about i-cache
flushes. As the code links above show, incorrect software is already
showing up.

Guy






Krste Asanovic
 

On Thu, 11 Aug 2022 07:01:58 -0700, "mark" <markhimelstein@...> said:
| +tech-privileged 
| On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote:

| On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote:
|| You say below you state you are asking for a way to guarantee that no I-caches have stale data (for a given block -- my words there). I don't think that's really what you're
| asking for.

| Yes it's what I'm asking for.

|| I think what you're really trying to do is replace FENCE.I and you believe that if you just had an instruction that would blow away all the I-cache copies of a given
| address, that would be enough to get you there.

| No, we already have FENCE.I, and I understand how to use it. I
| understand that in a non-coherent system, it may not do anything at
| all to the i-cache (eg, if an IODMA channel replaces executable code
| in memory, and those writes are not observable by the hart executing
| FENCE.I).

Assuming you have a way of knowing when the IODMA channel has made all
its writes visible to the local hart (e.g., an interrupt on
completion), then a FENCE.I should make any writes made by any agent
in the memory system visible to the local hart.

If you have an incoherent I-cache, then most likely the implementation
will have to flush the I-cache as well as the instruction pipeline to
implement FENCE.I correctly.

I believe this handles your use case below.
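
A rough Python sketch of the sequence described above, under stated assumptions: the toy `ToyHart` class and its `fence_i` method are hypothetical, and `fence_i` models only one possible implementation on an incoherent i-cache (discarding every cached block). The DMA completion interrupt is assumed to have been received before the fence is executed.

```python
class ToyHart:
    """Toy hart with an incoherent i-cache, keyed by block address."""

    def __init__(self, memory):
        self.memory = memory
        self.icache = {}

    def fetch(self, block):
        if block not in self.icache:       # miss: refill from memory
            self.icache[block] = self.memory[block]
        return self.icache[block]

    def fence_i(self):
        """Assumed implementation: flush the whole i-cache.

        Returns the number of blocks discarded, to make the cost visible.
        """
        dropped = len(self.icache)
        self.icache.clear()
        return dropped


memory = {0x1000: "old loader", 0x2000: "kernel text", 0x3000: "libc text"}
hart = ToyHart(memory)
for block in memory:                # warm the i-cache with all three blocks
    hart.fetch(block)

memory[0x1000] = "new loader"       # IODMA write; completion assumed signalled
dropped = hart.fence_i()            # everything cached is discarded
reloaded = hart.fetch(0x1000)       # next fetch observes the DMA write
```

In this model the fetch after the fence sees the new code, at the price of also discarding the two unrelated blocks.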

[...]

|| My question wasn't that, but was more along the lines of is this just an academic critique of the architecture as it exists now, or is there some real project that needs
| this defined right now.

| I am starting a new project around IODMA and coherence issues.
| Although it is academic, it will exist physically (real logic on an
| FPGA) and run real code. This is not so urgent that I need the
| instruction yesterday, or even in 12 months, as I can always create my
| own instruction. I do not normally do OS-level work, but this is where
| such an instruction would be used the most, and consultation should be
| made with maintainers of RobotOS (ROS2), Embedded Linux, FreeRTOS,
| Zephyr, etc to see how they handle this problem.

| The problem, to be clear, is inadvertent execution of stale code
| because there is no i-cache coherence and there are no i-cache
| management instructions.

| One "workaround" to this problem is for the OS to never re-use a
| physical address for new code until it has to wrap around. This is a
| "lazy" way that hopes i-cache contents are eventually replaced on
| their own. However, it is not a guarantee.

[...]

Krste


Guy Lemieux <guy.lemieux@...>
 

Hi Krste,

Dumping the entire i-cache via FENCE.I is different. I am requesting invalidation of a single cache block from the i-cache.

The Zifencei spec recognizes Zifencei is expensive to implement:

"on some systems, FENCE.I will be expensive to implement and alternate mechanisms are being discussed in the memory model task group. In particular, for designs that have an incoherent instruction cache and an incoherent data cache, or where the instruction cache refill does not snoop a coherent data cache, both caches must be completely flushed when a FENCE.I instruction is encountered. This problem is exacerbated when there are multiple levels of I and D cache in front of a unified cache or outer memory system."

In addition to being "expensive to implement", it is also rather costly at runtime.

-----

Since Zifencei is an optional extension, it makes sense that other optional things might be used to replace it. The spec itself suggests that another mechanism may come about in the future, namely: "Currently, this instruction is the only standard mechanism to ensure that stores visible to a hart will also be visible to its instruction fetches."

------

As an aside, the FENCE.I instruction has been misinterpreted many times, and there are many posts correcting people (including correcting me) about its operation. This is partly because the spec is ambiguously written, and partly because it is unexpected that it could have an impact on an entire i-cache.

The heart of the ambiguity is in the introductory paragraph of the Zifencei spec:

"This chapter defines the “Zifencei” extension ... explicit synchronization between writes to instruction memory and instruction fetches on the same hart."

This could be interpreted as "between (writes to instruction memory and instruction fetches) on the same hart", or it could be interpreted as "between writes to instruction memory and (instruction fetches on the same hart)". If the latter, then what is the other thing that is being done to warrant the expression "on the same hart"?

Some of the subsequent text can be read in either context, thus propagating the ambiguity. In particular, the key line of the spec is ambiguous: "FENCE.I does not ensure that other RISC-V harts’ instruction fetches will observe the local hart’s stores in a multiprocessor system." In this case, the spec should read: "Executing FENCE.I on a local hart does not ensure that other RISC-V harts' instruction fetches will observe the local hart's stores." However, an equally valid but opposite interpretation would be: "Executing FENCE.I on a multiprocessor system does not ensure that other RISC-V harts' instruction fetches will observe a local hart's stores."

At some point, the Zifencei spec should be updated to be more clear.

Thanks,
Guy







On Thu, Aug 11, 2022 at 7:43 AM <krste@...> wrote:
[...]


striker@...
 


Thank you for this email Guy. It clarifies many things. 


From: Mark Himelstein <markhimelstein@...>
Sent: Thursday, August 11, 2022 9:01 AM
To: tech-cmo@... Group Moderators <tech-cmo@...>; Guy Lemieux <guy.lemieux@...>; tech-privileged <tech-privileged@...>
Cc: Derek Williams <striker@...>; Andrew Waterman <andrew@...>; allen.baum@... <allen.baum@...>; Martin Maas <mmaas@...>; John Ingalls <john.ingalls@...>; David Kruckemyer <dkruckemyer@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs
 

On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote:
On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote:
> You say below you state you are asking for a way to guarantee that no I-caches have stale data (for a given block -- my words there). I don't think that's really what you're asking for.

Yes it's what I'm asking for.

OK, I can believe you are asking for that instruction (or a variant -- see below), but you also say below you're going to use it in a way that really does replace what FENCE.I (and some IPIs and such) do... which is to bring the I-fetch side up to date with the D-side's latest values. I still don't think that works.

> I think what you're really trying to do is replace FENCE.I and you believe that if you just had an instruction that would blow away all the I-cache copies of a given address, that would be enough to get you there.

No, we already have FENCE.I, and I understand how to use it. I
understand that in a non-coherent system, it may not do anything at
all to the i-cache (eg, if an IODMA channel replaces executable code
in memory, and those writes are not observable by the hart executing
FENCE.I).

I'll skip over this para since I don't really understand FENCE.I and I am going to try really hard to never learn it 🙂.

  > My point is that is NOT enough unless you make a whole mess of
other assumptions about the system and somehow clean out the post
I-cache pipes of stale instructions which is very bad architecture. A
cache invalidate instruction is one leg at best of the three (or more)
legs you need to hold this three-legged stool up.

I'm not trying to run self-modifying code or JIT code. I'm trying to
load new code from a non-coherent I/O device, and I want to ensure
there are no other copies floating in i-caches anywhere. I will settle
for an instruction that removes a cache block from the local hart
i-cache, because there is no coherence mechanism. In a coherent
system, an instruction that also removes a cache block in other harts'
i-caches is fine, but I don't care about that use-case; I know others
do care, so it should be defined as part of the CBO.INVAL.I operation,
and this would make it consistent with other CBO.* instructions. IODMA
is likely to be non-coherent in either case.

Huh... ok. I'm not sure that it matters all that much if the D-side updates to create new instructions come from store instructions from a HART or from an I/O device (incoherent or not) -- at least as far as the post I-cache buffer flushing goes. I don't think that matters at all.

There is no need to worry about the post i-cache pipes because the OS
can guarantee that the code in that thread has stopped running and
therefore has nothing in flight. However, it has no way of knowing the
state of the i-caches.

This is where we depart controlled flight. Just because the OS stops running the thread doesn't necessarily guarantee that there isn't something stale lurking in a loop cache or some other exotic structure down-wind of the I-cache that isn't necessarily cleared by the I-cache invalidation instruction. So, while I'm quite certain that your application might, I can dream up ones where that doesn't necessarily hold. <spoiler alert, I'm lying just a bit here. There is one subtle point in the JIT stuff in the j-extension that might make this case have to work out, but even there you need more than just the I-cache invalidate instruction and the point remains... but we hold that aside for the moment>

So, in the end, I think by hanging everything on the i-cache invalidate instruction, you're inducing some exceptionally subtle additional requirements on the implementation of that i-cache invalidation instruction that are hard to even pin down or describe. I think you really have to have some equivalent of ISYNC/ISB/IMPORT.I to cleanly and architecturally close the hole in the post I-cache buffers.

So, yes, I think in the end you really do need to bite off the full J-extension if you want something that is architecture that works everywhere. Just using the I-cache invalidate might work in your application, but I don't think it closes generally. 
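
The concern about structures down-wind of the I-cache can be pictured with a small Python sketch. Everything in it is hypothetical: `ToyFrontEnd`, its `loop_buffer`, and `sync_instruction_stream` (a stand-in for an ISYNC/ISB/IMPORT.I-style operation) are illustrative names, not a description of any real implementation or of the J extension.

```python
class ToyFrontEnd:
    """Toy front end: an i-cache feeding a post-i-cache loop buffer."""

    def __init__(self, memory):
        self.memory = memory
        self.icache = {}
        self.loop_buffer = {}  # hypothetical structure downstream of the i-cache

    def fetch(self, block):
        if block in self.loop_buffer:       # served without touching the i-cache
            return self.loop_buffer[block]
        if block not in self.icache:        # i-cache miss: refill from memory
            self.icache[block] = self.memory[block]
        self.loop_buffer[block] = self.icache[block]
        return self.icache[block]

    def inval_icache_block(self, block):
        """Invalidates the i-cache copy only; the loop buffer is untouched."""
        self.icache.pop(block, None)

    def sync_instruction_stream(self):
        """Hypothetical ISYNC/ISB-like step: discard post-i-cache state."""
        self.loop_buffer.clear()


memory = {0x1000: "old code"}
front_end = ToyFrontEnd(memory)
front_end.fetch(0x1000)                  # old code now in both structures
memory[0x1000] = "new code"              # code replaced in memory
front_end.inval_icache_block(0x1000)
after_inval = front_end.fetch(0x1000)    # loop buffer still supplies old code
front_end.sync_instruction_stream()
after_sync = front_end.fetch(0x1000)     # refetched through the i-cache
```

Whether a given implementation has such a structure, and whether stopping the thread empties it, is exactly the kind of assumption being debated here; the sketch only shows why the invalidate alone may not be sufficient in general.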

I'm not asking for it, but I don't believe there is an instruction
that flushes the entire i-cache of a hart. I also believe there is no
way to flush all i-caches of a system. I'm not sure if those would be
useful; they are not on my radar, and they would be very disruptive to
performance.

I'm going to skip over this paragraph as well, other than to say that's a different debate for another day and my initial reaction is those instructions are really hard to define cleanly as architecture, have limited usefulness, and are much better left to implementation-specific instructions in the implementations that can actually justify needing them. They are not mainline architecture everyone needs.

> My question wasn't that, but was more along the lines of is this just an academic critique of the architecture as it exists now, or is there some real project that needs this defined right now.

I am starting a new project around IODMA and coherence issues.
Although it is academic, it will exist physically (real logic on an
FPGA) and run real code. This is not so urgent that I need the
instruction yesterday, or even in 12 months, as I can always create my
own instruction. I do not normally do OS-level work, but this is where
such an instruction would be used the most, and consultation should be
made with maintainers of RobotOS (ROS2), Embedded Linux, FreeRTOS,
Zephyr, etc to see how they handle this problem.


OK... good. So I would suggest we stop trying to fast-track anything. We're going to be done with this J group extension, if it kills me, before your 12-month timeline (and it might). I am perfectly happy for whomever from all of those groups to show up at the J meetings and provide their input/tell us what we have that doesn't fit their needs (but that worked fine for ARM and Power for 25+ years).

The problem, to be clear, is inadvertent execution of stale code
because there is no i-cache coherence and there are no i-cache
management instructions.

Yes, though arguably FENCE.I is an i-cache management instruction, just one none of us like to use. 

One "workaround" to this problem is for the OS to never re-use a
physical address for new code until it has to wrap around. This is a
"lazy" way that hopes i-cache contents are eventually replaced on
their own. However, it is not a guarantee.

I'm not at all a fan of hack-like non-guarantees.

This problem is not the same as the one the J-group is attempting to solve.

Actually, I think it is. Since we disagree here, please explain the difference.

> If this is just an academic critique, I see no real reason to fast track an I-cache invalidate instruction on its own, especially when that is not a complete solution to the problem and you'll be getting that soon (along with all the other necessary parts) with the J group proposal.

I have no insight into what is coming with the J group proposal.
However, I don't think it should be required to adopt the entire J
extension to get this capability. The J group is concerned with
self-modifying / dynamic generation of code on the fly, which is a
different use case which may care about what is in the current
execution pipeline.

As I say above, I think you will need the J group extension in most of its totality to have a solution that closes in all cases and doesn't rely on subtle unstated requirements/assumptions. I would be loath to try and architect something that tries to do it all with just the I-cache invalidation instruction.

The J extension isn't that costly in complexity or performance.

> If you have something more than an academic critique here, please share that to the extent possible, but even if you do, the I-cache invalidate on its own isn't enough to provide a full solution, so I'm still not sure we should be fast-tracking anything.

My request for a fast-track wasn't due to a sense of urgency, but due
to a belief this is relatively simple and easy to define. I am not
attempting to fully define it at this point, but trying to see if
there is any support from others.

I agree that an instruction that invalidates the I-caches is well within the known state of the art to define. I just think defining that on its own is useless for the purposes we're trying to get to here.

> We also need to know the expectations and use cases for the others you've seen request it.

I'll use Google to help us out, but I won't carefully read each link below.

Thank you for this. It's 11:50pm though and I need to give up. I won't do a deeper dive on this until later.

A few discussion forums:




There are also several more, including the most recent one (a CMO
github issue, I believe) that spurred this conversation (but not this
thread).

You can also find code that expects to use icache flush instructions:


https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/riscv/flush-icache.c.html

(this link assumes FENCE.I flushes the i-cache, which non-coherent
systems don't have to do, so it is technically incorrect)

----

> My concern is that the I-cache invalidate instruction isn't enough to do good architecture here, so it is unhelpful to ratify that without the rest of the pieces.

You've stated this several times, and I think you have become
entrenched in your position. To you, it seems that "good architecture"
means worrying about flushing the pipeline. However, there are other use-cases you are failing to recognize.

Entrenched position.. yes.. for 25 years across two major architectures (Power and ARM) for deeply held reasons and I will continue to repeat myself on that point. 

I really don't think the use cases are different and I don't believe I'm failing to recognize your use case. Instead, I think you are looking at a specific "system design" point that gives you subtle characteristics, that aren't supported in every implementation, that would let you do what you're hoping to do. Totally fair if you want to set up your system design that way and document that and use your own instructions. 

For me, good architecture means the architecture language prevents an implementer from building an implementation that breaks if they don't follow the architecture rules. I think the extension we're going to propose they can't break (it's held for 25+ years). However, I think I can build an implementation that would break the I-cache invalidate instruction only architecture. To the extent that is possible, I don't think that should ever be architecture.

To me, I'm worried about "good system design", and as you can see from
the Linux link above, it has a bug in its assumptions about i-cache
flushes. As the code links above show, incorrect software is already
showing up.

I don't have time right now to look at all that, and in the end you'll probably have to point a bit more specifically to the places inside those web pages that support what you're trying to get across. But, would it shock me that people are a bit confused now? No. If this isn't clearly defined going in, it's very easy to confuse folks. And just because they may make an assumption about what an I-cache invalidate does, does not mean that that assumption is rational or should be pandered to.

Derek 


Guy






Guy Lemieux <guy.lemieux@...>
 

Derek, let's try to move the discussion forward instead of just back and forth.

If the J extension will solve this problem, please describe how? You've been very opaque -- is that because it hasn't been addressed yet by the committee?

Let's summarize this thread:

CBO.INVAL.I for icache invalidation is necessary. Since it will be used in a loop, it is important not to lock up the pipeline for performance reasons.

You believe CBO.INVAL.I must flush instruction pipes, otherwise such an instruction would be architecturally incomplete. To me, you are trying to turn a CBO into a fence instruction. As I understand it, CBOs do not execute synchronously/block, meaning they can be queued up somewhere in the system (but they should follow the memory system ordering semantics).

Perhaps we should simply add new FENCE.I instructions, or new semantics to the existing instruction, which flush instruction pipes and sync any outstanding CBO.INVAL.I operations, but do not dump the entire i-cache in noncoherent systems?

Such an instruction may be used ONCE at the end of the loop that issues all of the CBO.INVAL.I instructions, thus maintaining performance and imposing correctness.
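
A minimal Python sketch of that usage pattern, with loudly stated assumptions: the 64-byte block size is arbitrary, and `inval_block` and `fence` are hypothetical callbacks standing in for the proposed CBO.INVAL.I and the proposed fence variant. Neither exists as a ratified instruction in this thread; the sketch only shows the loop shape (one invalidate per overlapping block, one fence at the end).

```python
def blocks_in_range(start, length, block_size=64):
    """Yield the base address of every cache block overlapping [start, start+length)."""
    if length <= 0:
        return
    first = start - start % block_size
    end = start + length - 1
    last = end - end % block_size
    for base in range(first, last + 1, block_size):
        yield base


def invalidate_code_range(start, length, inval_block, fence, block_size=64):
    """Issue one invalidate per block, then a single fence at the end."""
    count = 0
    for base in blocks_in_range(start, length, block_size):
        inval_block(base)   # hypothetical stand-in for CBO.INVAL.I
        count += 1
    fence()                 # hypothetical stand-in for the proposed fence
    return count


log = []
n = invalidate_code_range(
    0x1010, 0x100,
    inval_block=lambda addr: log.append(("inval", addr)),
    fence=lambda: log.append(("fence",)),
)
```

For the unaligned 256-byte range starting at 0x1010, the loop touches five 64-byte blocks (0x1000 through 0x1100) and the fence is recorded exactly once, after the last invalidate.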

The commentary in the FENCE.I instruction suggests there is encoding room for alternatives which use imm[11:0], rs1, and rd operands which "are reserved for finer-grain fences in future extensions". Here is one way to use that encoding space to add the required FENCE.I semantics:

-- FENCE.I with imm[11:0]==0x0 and rs1!=0: flushes instr pipelines and fences against only those outstanding CBO.INVAL.I that specified the cache block named in rs1 (i.e., it does not dump the full i-cache in non-coherent systems; in coherent systems it does a fine-grained FENCE.I against any visible updates to that cache block)

-- FENCE.I with imm[11:0]==0xFFF and rs1==0: flushes instr pipelines and fences against ALL outstanding CBO.INVAL.I (i.e., does not dump full i-cache in non-coherent systems; in coherent systems it behaves as FENCE.I with imm[11:0]==0).
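
As a sketch of what those two encodings would look like as instruction words, the Python below packs the FENCE.I fields in the standard I-type layout (imm[11:0] in bits 31:20, rs1 in 19:15, funct3=001 in 14:12, rd in 11:7, MISC-MEM opcode 0x0F in 6:0). The helper name `encode_fence_i` is hypothetical, and the two non-zero variants are only the proposal above, not ratified encodings; using a0 (x10) as rs1 is an arbitrary choice for the example.

```python
MISC_MEM_OPCODE = 0x0F   # opcode shared by FENCE and FENCE.I
FUNCT3_FENCE_I = 0b001   # funct3 that selects FENCE.I


def encode_fence_i(imm=0, rs1=0, rd=0):
    """Pack FENCE.I fields into a 32-bit I-type instruction word."""
    if not (0 <= imm < (1 << 12) and 0 <= rs1 < 32 and 0 <= rd < 32):
        raise ValueError("field out of range")
    return ((imm << 20) | (rs1 << 15) | (FUNCT3_FENCE_I << 12)
            | (rd << 7) | MISC_MEM_OPCODE)


A0 = 10  # register x10, used here as the rs1 that names the cache block

standard = encode_fence_i()                       # today's FENCE.I
per_block = encode_fence_i(imm=0x000, rs1=A0)     # proposed: one named block
all_pending = encode_fence_i(imm=0xFFF, rs1=0)    # proposed: all outstanding

words = [f"{w:08x}" for w in (standard, per_block, all_pending)]
# words == ["0000100f", "0005100f", "fff0100f"]
```

Both proposed forms differ from the existing FENCE.I word only in fields the spec currently reserves, which is the encoding room referred to above.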

Guy


On Thu, Aug 11, 2022 at 10:20 PM Derek Williams <striker@...> wrote:

Thank you for this email Guy. It clarifies many things. 


From: Mark Himelstein <markhimelstein@...>
Sent: Thursday, August 11, 2022 9:01 AM
To: tech-cmo@... Group Moderators <tech-cmo@...>; Guy Lemieux <guy.lemieux@...>; tech-privileged <tech-privileged@...>
Cc: Derek Williams <striker@...>; Andrew Waterman <andrew@...>; allen.baum@... <allen.baum@...>; Martin Maas <mmaas@...>; John Ingalls <john.ingalls@...>; David Kruckemyer <dkruckemyer@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs
 
+tech-privileged  On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote: On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote: > You say below you state you are asking for a way to guarantee
ZjQcmQRYFpfptBannerStart
This Message Is From an External Sender
This message came from outside your organization.
 
ZjQcmQRYFpfptBannerEnd

On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote:
On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote:
> You say below you state you are asking for a way to guarantee that no I-caches have stale data (for a given block -- my words there). I don't think that's really what you're asking for.

Yes it's what I'm asking for.

OK, I can believe you are asking for that instruction (or a variant -- see below), but you also say below you're going to use it in a way that really does replace what FENCE.I (and some IPIs and such) do... which is to bring the I-fetch side up to date with the D-side's latest values.  I'm still don't think that works. 

> I think what you're really trying to do is replace FENCE.I and you believe that if you just had an instruction that would blow away all the I-cache copies of a given address, that would be enough to get you there.

No, we already have FENCE.I, and I understand how to use it. I
understand that in a non-coherent system, it may not do anything at
all to the i-cache (eg, if an IODMA channel replaces executable code
in memory, and those writes are not observable by the hart executing
FENCE.I).

I'll skip over this para since I don't really understand FENCE.I and I am going to try really hard to never learn it 🙂.

  > My point is that is NOT enough unless you make a whole mess of
other assumptions about the system and somehow clean out the post
I-cache pipes of stale instructions which is very bad architecture. A
cache invalidate instruction is one leg at best of the three (or more)
legs you need to hold this three-legged stool up.

I'm not trying to run self-modifying code or JIT code. I'm trying to
load new code from a non-coherent I/O device, and I want to ensure
there are no other copies floating in i-caches anywhere. I will settle
for an instruction that removes a cache block from the local hart
i-cache, because there is no coherence mechanism. In a coherent
system, an instruction that also removes a cache block in other harts'
i-caches is fine, but I don't care about that use-case; I know others
do care, so it should be defined as part of the CBO.INVAL.I operation,
and this would make it consistent with other CBO.* instructions. IODMA
is likely to be non-coherent in either case.

Huh... ok. I'm not sure that it matters all that much if the D-side updates that create new instructions come from store instructions from a HART or from an I/O device (incoherent or not) -- at least as far as the post I-cache buffer flushing goes. I don't think that matters at all.

There is no need to worry about the post i-cache pipes because the OS
can guarantee that the code in that thread has stopped running and
therefore has nothing in flight. However, it has no way of knowing the
state of the i-caches.

This is where we depart controlled flight. Just because the OS stops running the thread doesn't necessarily guarantee that there isn't something stale lurking in a loop cache or some other exotic structure down-wind of the I-cache that isn't necessarily cleared by the I-cache invalidation instruction. So, while I'm quite certain that your application might work, I can dream up ones where that doesn't necessarily hold. <spoiler alert, I'm lying just a bit here. There is one subtle point in the JIT stuff in the J-extension that might make this case have to work out, but even there you need more than just the I-cache invalidate instruction and the point remains... but we hold that aside for the moment>

So, in the end, I think by hanging everything on the i-cache invalidate instruction, you're inducing some exceptionally subtle additional requirements on the implementation of that i-cache invalidation instruction that are hard to even pin down or describe. I think you really have to have some equivalent of ISYNC/ISB/IMPORT.I to cleanly and architecturally close the hole in the post I-cache buffers.

So, yes, I think in the end you really do need to bite off the full J-extension if you want something that is architecture that works everywhere. Just using the I-cache invalidate might work in your application, but I don't think it closes generally. 
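
To make the "down-wind of the I-cache" concern concrete, here is a toy Python model. It is only a sketch under loud assumptions: the `ToyFrontEnd` class, its `loop_buffer`, and the `icache_inval` / `fetch_sync` methods are hypothetical names invented for illustration, and the model simply assumes a design in which invalidating the i-cache does not clear the loop buffer. Real implementations may behave differently.

```python
class ToyFrontEnd:
    """Toy instruction front end: an i-cache plus a loop buffer behind it."""

    def __init__(self, memory):
        self.memory = memory      # addr -> instruction (stands in for DRAM)
        self.icache = {}          # addr -> cached instruction
        self.loop_buffer = {}     # addr -> instruction captured after the i-cache

    def fetch(self, addr):
        # The loop buffer is consulted first, so it can hide i-cache changes.
        if addr in self.loop_buffer:
            return self.loop_buffer[addr]
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]   # miss: fill from memory
        insn = self.icache[addr]
        self.loop_buffer[addr] = insn
        return insn

    def icache_inval(self, addr):
        # Hypothetical i-cache-only invalidate; the loop buffer is untouched.
        self.icache.pop(addr, None)

    def fetch_sync(self):
        # Hypothetical stand-in for an ISYNC/ISB-style operation.
        self.loop_buffer.clear()


memory = {0x100: "old"}
fe = ToyFrontEnd(memory)
first = fe.fetch(0x100)

memory[0x100] = "new"          # new code arrives in memory
fe.icache_inval(0x100)
after_inval = fe.fetch(0x100)  # still served from the loop buffer

fe.fetch_sync()
after_sync = fe.fetch(0x100)   # refetched through the now-empty i-cache
print(first, after_inval, after_sync)
```

In this toy design the invalidate alone leaves the stale instruction reachable, and only the additional synchronizing step closes the hole, which is the shape of the argument above.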

I'm not asking for it, but I don't believe there is an instruction
that flushes the entire i-cache of a hart. I also believe there is no
way to flush all i-caches of a system. I'm not sure if those would be
useful; they are not on my radar, and they would be very disruptive to
performance.

I'm going to skip over this paragraph as well, other than to say that's a different debate for another day and my initial reaction is those instructions are really hard to define cleanly as architecture, have limited usefulness, and are much better left to implementation-specific instructions in the implementations that can actually justify needing them. They are not mainline architecture everyone needs.

> My question wasn't that, but was more along the lines of is this just an academic critique of the architecture as it exists now, or is there some real project that needs this defined right now.

I am starting a new project around IODMA and coherence issues.
Although it is academic, it will exist physically (real logic on an
FPGA) and run real code. This is not so urgent that I need the
instruction yesterday, or even in 12 months, as I can always create my
own instruction. I do not normally do OS-level work, but this is where
such an instruction would be used the most, and consultation should be
made with maintainers of RobotOS (ROS2), Embedded Linux, FreeRTOS,
Zephyr, etc to see how they handle this problem.


OK... good. So I would suggest we stop trying to fast-track anything. We're going to be done with this J group extension if it kills me before your 12-month timeline (and it might). I am perfectly happy for whoever from all of those groups to show up at the J meetings and provide their input/tell us what we have that doesn't fit their needs (but that worked fine for ARM and Power for 25+ years).

The problem, to be clear, is inadvertent execution of stale code
because there is no i-cache coherence and there are no i-cache
management instructions.

Yes, though arguably FENCE.I is an i-cache management instruction, just one none of us like to use. 
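
The stale-code problem described above can be illustrated with a toy Python model. This is a sketch, not a description of any real hardware: `ToyICache`, `dma_write`, and `inval_block` are hypothetical names, the 64-byte block size is an assumption, and `inval_block` only stands in for the proposed (not ratified) CBO.INVAL.I.

```python
BLOCK = 64  # assumed cache-block size in bytes


class ToyICache:
    """Toy non-coherent i-cache: it never observes writes made to memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # block base address -> private copy of the block

    def fetch(self, addr):
        base = addr - addr % BLOCK
        if base not in self.lines:  # miss: fill the whole block from memory
            self.lines[base] = bytes(self.memory[base:base + BLOCK])
        return self.lines[base][addr - base]

    def inval_block(self, addr):
        # Hypothetical stand-in for the proposed CBO.INVAL.I.
        self.lines.pop(addr - addr % BLOCK, None)


def dma_write(memory, addr, data):
    # Non-coherent I/O DMA: writes memory directly, bypassing the cache.
    memory[addr:addr + len(data)] = data


memory = bytearray(256)
memory[0:4] = b"\x01\x01\x01\x01"      # "old code"
icache = ToyICache(memory)

old = icache.fetch(0)                   # fills the block
dma_write(memory, 0, b"\x02\x02\x02\x02")  # device loads "new code"
stale = icache.fetch(0)                 # hit on the stale copy
icache.inval_block(0)
fresh = icache.fetch(0)                 # miss, refilled from memory
print(old, stale, fresh)
```

The model covers only the cache itself; it says nothing about structures behind the cache, which is the part still in dispute in this thread.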

One "workaround" to this problem is for the OS to never re-use a
physical address for new code until it has to wrap around. This is a
"lazy" way that hopes i-cache contents are eventually replaced on
their own. However, it is not a guarantee.

I'm not at all a fan of hack-like non-guarantees.

This problem is not the same as the one the J-group is attempting to solve.

Actually, I think it is.  Since we disagree here, please explain the difference.

> If this is just an academic critique, I see no real reason to fast track an I-cache invalidate instruction on its own, especially when that is not a complete solution to the problem and you'll be getting that soon (along with all the other necessary parts) with the J group proposal.

I have no insight into what is coming with the J group proposal.
However, I don't think it should be required to adopt the entire J
extension to get this capability. The J group is concerned with
self-modifying / dynamic generation of code on the fly, which is a
different use case which may care about what is in the current
execution pipeline.

As I say above, I think you will need the J group extension in most of its totality to have a solution that closes in all cases and doesn't rely on subtle unstated requirements/assumptions. I would be loath to try and architect something that tries to do it all with just the I-cache invalidation instruction.

The J extension isn't that costly in complexity or performance. 

> If you have something more than an academic critique here, please share that to the extent possible, but even if you do, the I-cache invalidate on its own isn't enough to provide a full solution, so I'm still not sure we should be fast-tracking anything.

My request for a fast-track wasn't due to a sense of urgency, but due
to a belief this is relatively simple and easy to define. I am not
attempting to fully define it at this point, but trying to see if
there is any support from others.

I agree that an instruction that invalidates the I-caches is well within the known state of the art to define. I just think defining that on its own is useless for the purposes we're trying to get to here.

> We also need to know the expectations and use cases for the others you've seen request it.

I'll use Google to help us out, but I won't carefully read each link below.

Thank you for this. It's 11:50pm though and I need to give up. I won't do a deeper dive on this until later.

A few discussion forums:




There are also several more, including the most recent one (a CMO
github issue, I believe) that spurred this conversation (but not this
thread).

You can also find code that expects to use icache flush instructions:


https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/riscv/flush-icache.c.html

(this link assumes FENCE.I flushes the i-cache, which non-coherent
systems don't have to do, so it is technically incorrect)

----

> My concern is that the I-cache invalidate instruction isn't enough to do good architecture here, so it is unhelpful to ratify that without the rest of the pieces.

You've stated this several times, and I think you have become
entrenched in your position. To you, it seems that "good architecture"
means worrying about flushing the pipeline. However, there are other use-cases you are failing to recognize.

Entrenched position.. yes.. for 25 years across two major architectures (Power and ARM) for deeply held reasons and I will continue to repeat myself on that point. 

I really don't think the use cases are different and I don't believe I'm failing to recognize your use case. Instead, I think you are looking at a specific "system design" point that gives you subtle characteristics, that aren't supported in every implementation, that would let you do what you're hoping to do. Totally fair if you want to set up your system design that way and document that and use your own instructions. 

For me, good architecture means the architecture language prevents an implementer from building an implementation that breaks if they don't follow the architecture rules. I think the extension we're going to propose they can't break (it's held for 25+ years). However, I think I can build an implementation that would break the I-cache invalidate instruction only architecture. To the extent that is possible, I don't think that should ever be architecture.

To me, I'm worried about "good system design", and as you can see by
the Linux link above it has a bug in its assumptions about i-cache
flushes. As the code links above show, we are already seeing
incorrect software show up.

I don't have time right now to look at all that, and in the end you'll probably have to point a bit more specifically to the places inside those web pages that support what you're trying to get across. But, would it shock me that people are a bit confused now? No. If this isn't clearly defined going in, it's very easy to confuse folks. And just because they may make an assumption about what an I-cache invalidate does, does not mean that that assumption is rational or should be pandered to.

Derek 


Guy






John Ingalls
 

Guy --

Have you read Derek's slide deck that he prepared for the J group on this?  I believe it contains the answers and solutions that you are looking for.

-- John

On Fri, Aug 12, 2022, 6:15 AM Guy Lemieux <guy.lemieux@...> wrote:
Derek, let's try to move the discussion forward instead of just back and forth.

If the J extension will solve this problem, please describe how? You've been very opaque -- is that because it hasn't been addressed yet by the committee?

Let's summarize this thread:

CBO.INVAL.I for icache invalidation is necessary. Since it will be used in a loop, it is important not to lock up the pipeline for performance reasons.

You believe CBO.INVAL.I must flush instruction pipes, otherwise such an instruction would be architecturally incomplete. To me, you are trying to turn a CBO into a fence instruction. As I understand it, CBOs do not execute synchronously/block, meaning they can be queued up somewhere in the system (but they should follow the memory system ordering semantics).

Perhaps we should simply add new FENCE.I instructions, or new semantics to the existing instruction, which flush instruction pipes and sync any outstanding CBO.INVAL.I operations, but do not dump the entire i-cache in noncoherent systems?

Such an instruction may be used ONCE at the end of the loop that issues all of the CBO.INVAL.I instructions, thus maintaining performance and imposing correctness.
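
The usage pattern described above (many per-block invalidates, then one synchronizing fence) can be sketched in Python as a list of operations. Everything here is illustrative: `invalidate_range`, the op names `"cbo.inval.i"` and `"fence.i.variant"`, and the 64-byte block size are assumptions, since neither instruction exists in this form today. The sketch assumes `length > 0`.

```python
BLOCK = 64  # assumed cache-block size in bytes


def invalidate_range(start, length, block=BLOCK):
    """Return the op sequence to invalidate [start, start + length).

    One hypothetical per-block invalidate is emitted for every cache block
    the range touches, followed by a single hypothetical fence at the end.
    """
    ops = []
    first = start - start % block  # align down to the first touched block
    for base in range(first, start + length, block):
        ops.append(("cbo.inval.i", base))   # may be queued, not blocking
    ops.append(("fence.i.variant", None))   # one sync point for the whole loop
    return ops


ops = invalidate_range(0x1010, 0x100)
for name, addr in ops:
    print(name, hex(addr) if addr is not None else "")
```

For the unaligned range starting at 0x1010 with length 0x100, the range touches five 64-byte blocks (0x1000 through 0x1100), so the sequence is five invalidates plus one fence.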

The commentary in the FENCE.I instruction suggests there is encoding room for alternatives which use imm[11:0], rs1, and rd operands which "are reserved for finer-grain fences in future extensions". Here is one way to use that encoding space to add the required FENCE.I semantics:

-- FENCE.I with imm[11:0]==0x0 and rs1!=0: flushes instr pipelines and fences against only those outstanding CBO.INVAL.I that specified the cache block named in rs1 (ie, it does not dump the full i-cache in non-coherent systems; in coherent systems it does a fine-grained FENCE.I against any visible updates to that cache block)

-- FENCE.I with imm[11:0]==0xFFF and rs1==0: flushes instr pipelines and fences against ALL outstanding CBO.INVAL.I (ie, does not dump full i-cache in noncoherent systems; in coherent systems it behaves as FENCE.I with imm[11:0]==0).
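
For reference, the two variants above would occupy the following bit patterns if they reuse the existing FENCE.I layout (MISC-MEM opcode 0x0F, funct3 = 001, with imm[11:0], rs1, and rd in the usual I-type positions). The helper name `encode_fence_i` is made up for this sketch, and the meanings attached to the non-zero fields are only the proposal in this email, not anything ratified.

```python
MISC_MEM = 0x0F        # major opcode shared by FENCE and FENCE.I
FUNCT3_FENCE_I = 0b001


def encode_fence_i(imm=0, rs1=0, rd=0):
    """Encode a FENCE.I-format instruction word (I-type field layout)."""
    assert 0 <= imm <= 0xFFF and 0 <= rs1 < 32 and 0 <= rd < 32
    return (imm << 20) | (rs1 << 15) | (FUNCT3_FENCE_I << 12) | (rd << 7) | MISC_MEM


plain = encode_fence_i()                 # today's FENCE.I
per_block = encode_fence_i(rs1=10)       # proposed: block named in a0 (x10)
all_pending = encode_fence_i(imm=0xFFF)  # proposed: all outstanding invalidates
print(hex(plain), hex(per_block), hex(all_pending))
```

The first value is the standard FENCE.I word, 0x0000100f; the other two differ from it only in the currently reserved rs1 and imm fields.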

Guy


On Thu, Aug 11, 2022 at 10:20 PM Derek Williams <striker@...> wrote:

Thank you for this email Guy. It clarifies many things. 







Sean Halle
 


Hi Guy, thanks for the email thread.  Very interesting.

I am hoping to get a precise understanding of 1) Your high level need, and 2) The details of the semantics you are asking for.

I do need to say up front that I'm not familiar with the current proposals from J Group, nor the details of other proposals on this subject, but this email is more to get precise details of what you're after.

Taking 2 first.. could you correct this interpretation of what, precisely, you would like (independent of the use case or reason)?

some instruction (referred to as "target instruction") that:
- supplies an address (the "target addr")
- invalidates any cache line that is in the core's instr cache and contains the target address at the point that instr "takes effect".  (In practice, it takes effect when the instruction has completed update of the i-cache's tags, yes?  So the execution pipeline sees this.. in the mem stage for a 5 stage pipe?)
- Any instructions that appear in code order after the target instruction are stalled (or killed and restarted) until the target instr takes effect?
- no other effect.  At the cycle the target instruction takes effect, if a valid cache line exists (with either read or write permissions) in the same core's d-cache or in any other cache in the system, none are accessed nor affected in any way by this instruction.
- after the instruction takes effect, the next time the core's i-cache performs a read or write on the target cache line.. what?  Does the i-cache fetch only from the DRAM (or LLC), ignoring the core's d-cache and all other d-caches?  Or do the semantics of the target instruction have a side effect that causes behavior in the d-caches as well?  If so, could you say with similar precision, the semantics of interaction between the target instruction and d-caches?

===

So, with some clarity on the desired semantics, would you be up for saying a bit more about 1) -- the high level need?

My inference is that you are creating hardware that has, what, some alternate mechanism for synchronizing caches?  Say, at the software level?  If so, I'm guessing that what you want is the bare minimum logic in the caches and pipelines?  So, the goal is minimal logic, and minimal time spent to invalidate a single i-cache line.. because you have something else going on that handles the updates from d-caches to.. DRAM/LLC?  So all you need is something that just invalidates an i-cache line, then this other thing going on ensures that the next time that i-cache line is accessed, the contents in DRAM/LLC will be correct?  Something like that?

If so, then semantics that are designed for OoO cores, or support for JITs, or hardware with coherent caches, etc, are too heavyweight?  That's why you're proposing the above alternate semantics?

Thanks Guy, and Derek, et al, really interesting discussion :-)

Sean

P.S. If I am way off base, and this is not a constructive addition to the thread, I apologise, please ignore in that case.


On Thu, Aug 11, 2022 at 10:20 PM Derek Williams <striker@...> wrote:

Thank you for this email Guy. It clarifies many things. 


From: Mark Himelstein <markhimelstein@...>
Sent: Thursday, August 11, 2022 9:01 AM
To: tech-cmo@... Group Moderators <tech-cmo@...>; Guy Lemieux <guy.lemieux@...>; tech-privileged <tech-privileged@...>
Cc: Derek Williams <striker@...>; Andrew Waterman <andrew@...>; allen.baum@... <allen.baum@...>; Martin Maas <mmaas@...>; John Ingalls <john.ingalls@...>; David Kruckemyer <dkruckemyer@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs
 
+tech-privileged  On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote: On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote: > You say below you state you are asking for a way to guarantee
ZjQcmQRYFpfptBannerStart
This Message Is From an External Sender
This message came from outside your organization.
 
ZjQcmQRYFpfptBannerEnd

On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote:
On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote:
> You say below you state you are asking for a way to guarantee that no I-caches have stale data (for a given block -- my words there). I don't think that's really what you're asking for.

Yes it's what I'm asking for.

OK, I can believe you are asking for that instruction (or a variant -- see below), but you also say below you're going to use it in a way that really does replace what FENCE.I (and some IPIs and such) do... which is to bring the I-fetch side up to date with the D-side's latest values.  I'm still don't think that works. 

> I think what you're really trying to do is replace FENCE.I and you believe that if you just had an instruction that would blow away all the I-cache copies of a given address, that would be enough to get you there.

No, we already have FENCE.I, and I understand how to use it. I
understand that in a non-coherent system, it may not do anything at
all to the i-cache (eg, if an IODMA channel replaces executable code
in memory, and those writes are not observable by the hart executing
FENCE.I).

I'll skip over this para since I don't really understand FENCE.I and I am going to try really hard to never learn it 🙂.

  > My point is that is NOT enough unless you make a whole mess of
other assumptions about the system and somehow clean out the post
I-cache pipes of stale instructions which is very bad architecture. A
cache invalidate instruction is one leg at best of the three (or more)
legs you need to hold this three-legged stool up.

I'm not trying to run self-modifying code or JIT code. I'm trying to
load new code from a non-coherent I/O device, and I want to ensure
there are no other copies floating in i-caches anywhere. I will settle
for an instruction that removes a cache block from the local hart
i-cache, because there is no coherence mechanism. In a coherent
system, an instruction that also removes a cache block in other harts'
i-caches is fine, but I don't care about that use-case; I know others
do care, so it should be defined as part of the CBO.INVAL.I operation,
and this would make it consistent with other CBO.* instructions. IODMA
is likely to be non-coherent in either case.

Huh... ok. I'm not sure that it matters all that much if the D-side updates that create new instructions come from store instructions from a HART or from an I/O device (incoherent or not) -- at least as far as the post I-cache buffer flushing goes. I don't think that matters at all. 

There is no need to worry about the post i-cache pipes because the OS
can guarantee that the code in that thread has stopped running and
therefore has nothing in flight. However, it has no way of knowing the
state of the i-caches.

This is where we depart controlled flight. Just because the OS stops running the thread doesn't necessarily guarantee that there isn't something stale lurking in a loop cache or some other exotic structure down-wind of the I-cache that isn't necessarily cleared by the I-cache invalidation instruction. So, while I'm quite certain that your application might be fine, I can dream up ones where that doesn't necessarily hold. <spoiler alert, I'm lying just a bit here. There is one subtle point in the JIT stuff in the j-extension that might make this case have to work out, but even there you need more than just the I-cache invalidate instruction and the point remains... but we hold that aside for the moment> 

So, in the end, I think by hanging everything on the i-cache invalidate instruction, you're inducing some exceptionally subtle additional requirements on the implementation of that i-cache invalidation instruction that are hard to even pin down or describe. I think you really have to have some equivalent of ISYNC/ISB/IMPORT.I to cleanly and architecturally close the hole in the post I-cache buffers. 
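
To make the loop-cache concern concrete, here is a toy Python model of a fetch path. It is only a sketch: the ToyFrontEnd class, its loop_buffer, and the sync_fetch_pipe() method are hypothetical names made up for illustration, and sync_fetch_pipe() is only a loose stand-in for what an ISYNC/ISB/IMPORT.I-style instruction would do. It does not describe any real implementation.

```python
class ToyFrontEnd:
    """Toy fetch path: memory -> I-cache -> loop buffer -> decode."""

    def __init__(self, memory):
        self.memory = memory      # addr -> instruction word (shared dict)
        self.icache = {}          # addr -> cached instruction word
        self.loop_buffer = {}     # structure "down-wind" of the I-cache

    def fetch(self, addr):
        # The loop buffer is consulted first, so it can hide I-cache state.
        if addr in self.loop_buffer:
            return self.loop_buffer[addr]
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        insn = self.icache[addr]
        self.loop_buffer[addr] = insn
        return insn

    def icache_invalidate(self, addr):
        # Drops only the I-cache copy; the loop buffer is left untouched.
        self.icache.pop(addr, None)

    def sync_fetch_pipe(self):
        # Hypothetical stand-in for an ISYNC/ISB/IMPORT.I-like operation.
        self.loop_buffer.clear()
```

In this model, if memory changes after a fetch, calling icache_invalidate() alone still lets fetch() return the old word out of the loop buffer; only after sync_fetch_pipe() does the new word show up. Whether a real implementation has such a structure, and whether stopping the thread drains it, is exactly the point under debate.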

So, yes, I think in the end you really do need to bite off the full J-extension if you want something that is architecture that works everywhere. Just using the I-cache invalidate might work in your application, but I don't think it closes generally. 

I'm not asking for it, but I don't believe there is an instruction
that flushes the entire i-cache of a hart. I also believe there is no
way to flush all i-caches of a system. I'm not sure if those would be
useful; they are not on my radar, and they would be very disruptive to
performance.

I'm going to skip over this paragraph as well, other than to say that's a different debate for another day and my initial reaction is those instructions are really hard to define cleanly as architecture, have limited usefulness, and are much better left to implementation-specific instructions in the implementations that can actually justify needing them. They are not mainline architecture everyone needs. 

> My question wasn't that, but was more along the lines of is this just an academic critique of the architecture as it exists now, or is there some real project that needs this defined right now.

I am starting a new project around IODMA and coherence issues.
Although it is academic, it will exist physically (real logic on an
FPGA) and run real code. This is not so urgent that I need the
instruction yesterday, or even in 12 months, as I can always create my
own instruction. I do not normally do OS-level work, but this is where
such an instruction would be used the most, and consultation should be
made with maintainers of RobotOS (ROS2), Embedded Linux, FreeRTOS,
Zephyr, etc to see how they handle this problem.


OK... good. So I would suggest we stop trying to fast-track anything. We're going to be done with this J group extension if it kills me before your 12-month timeline (and it might). I am perfectly happy for whomever from all of those groups to show up at the J meetings and provide their input/tell us what we have that doesn't fit their needs (but that worked fine for ARM and Power for 25+ years). 

The problem, to be clear, is inadvertent execution of stale code
because there is no i-cache coherence and there are no i-cache
management instructions.

Yes, though arguably FENCE.I is an i-cache management instruction, just one none of us like to use. 

One "workaround" to this problem is for the OS to never re-use a
physical address for new code until it has to wrap around. This is a
"lazy" way that hopes i-cache contents are eventually replaced on
their own. However, it is not a guarantee.

I'm not at all a fan of hack-like non-guarantees. 

This problem is not the same as the one the J-group is attempting to solve.

Actually, I think it is.  Since we disagree here, please explain the difference.

> If this is just an academic critique, I see no real reason to fast track an I-cache invalidate instruction on its own, especially when that is not a complete solution to the problem and you'll be getting that soon (along with all the other necessary parts) with the J group proposal.

I have no insight into what is coming with the J group proposal.
However, I don't think it should be required to adopt the entire J
extension to get this capability. The J group is concerned with
self-modifying / dynamic generation of code on the fly, which is a
different use case which may care about what is in the current
execution pipeline.

As I say above, I think you will need the J group extension in most of its  totality to have a solution that closes in all cases and doesn't rely on subtle unstated requirements/assumptions. I would be loath to try and architect something that tries to do it all with just the I-cache invalidation instruction. 

The J extension isn't that costly in complexity or performance. 

> If you have something more than an academic critique here, please share that to the extent possible, but even if you do, the I-cache invalidate on its own isn't enough to provide a full solution, so I'm still not sure we should be fast-tracking anything.

My request for a fast-track wasn't due to a sense of urgency, but due
to a belief this is relatively simple and easy to define. I am not
attempting to fully define it at this point, but trying to see if
there is any support from others.

I agree that an instruction that invalidates the I-caches is well within the known state of the art to define. I just think defining that on its own is useless for the purposes we're trying to get to here. 

> We also need to know the expectations and use cases for the others you've seen request it.

I'll use Google to help us out, but I won't carefully read each link below.

Thank you for this. It's 11:50pm though and I need to give up. I won't do a deeper dive on this until later.

A few discussion forums:




There are also several more, including the most recent one (a CMO
github issue, I believe) that spurred this conversation (but not this
thread).

You can also find code that expects to use icache flush instructions:


https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/riscv/flush-icache.c.html

(this link assumes FENCE.I flushes the i-cache, which non-coherent
systems don't have to do, so it is technically incorrect)

----

> My concern is that the I-cache invalidate instruction isn't enough to do good architecture here, so it is unhelpful to ratify that without the rest of the pieces.

You've stated this several times, and I think you have become
entrenched in your position. To you, it seems that "good architecture"
means worrying about flushing the pipeline. However, there are other use-cases you are failing to recognize.

Entrenched position.. yes.. for 25 years across two major architectures (Power and ARM) for deeply held reasons and I will continue to repeat myself on that point. 

I really don't think the use cases are different and I don't believe I'm failing to recognize your use case. Instead, I think you are looking at a specific "system design" point that gives you subtle characteristics, that aren't supported in every implementation, that would let you do what you're hoping to do. Totally fair if you want to set up your system design that way and document that and use your own instructions. 

For me, good architecture means the architecture language prevents an implementer from building an implementation that breaks if they don't follow the architecture rules. I think the extension we're going to propose they can't break (it's held for 25+ years). However, I think I can build an implementation that would break the I-cache invalidate instruction only architecture. To the extent that is possible, I don't think that should ever be architecture.

For my part, I'm worried about "good system design". As you can see
from the Linux link above, that code has a bug in its assumptions
about i-cache flushes. As the code links above show, incorrect
software is already showing up.

I don't have time right now to look at all that, and in the end you'll probably have to point a bit more specifically to the places inside those web pages that support what you're trying to get across. But, would it shock me that people are a bit confused now? No. If this isn't clearly defined going in, it's very easy to confuse folks. And just because they may make an assumption about what an I-cache invalidate does, does not mean that that assumption is rational or should be pandered to. 

Derek 


Guy






John Ingalls
 

Folks --
This email thread is getting long and my inbox is getting full.  I suggest that interested parties get the J group's proposal slide deck from Derek, and frame their questions and needs specifically relative to that proposal.
-- John


On Fri, Aug 12, 2022 at 8:28 AM Sean Halle <seanhalle@...> wrote:

Hi Guy, thanks for the email thread.  Very interesting.

I am hoping to get a precise understanding of 1) Your high level need, and 2) The details of the semantics you are asking for.

I do need to say up front that I'm not familiar with the current proposals from J Group, nor the details of other proposals on this subject, but this email is more to get precise details of what you're after.

Taking 2 first.. could you correct this interpretation of what, precisely you would like (independent of the use case or reason)?

some instruction (referred to as "target instruction") that:
- supplies an address (the "target addr")
- invalidates any cache line that is in the core's instr cache and contains the target address at the point that instr "takes effect".  (In practice, it takes effect when the instruction has completed update of the i-cache's tags, yes?  So the execution pipeline sees this.. in the mem stage for a 5 stage pipe?)
- Any instructions that appear in code order after the target instruction are stalled (or killed and restarted) until the target instr takes effect?
- no other effect.  At the cycle the target instruction takes effect, if a valid cache line exists (with either read or write permissions) in the same core's d-cache or in any other cache in the system, none are accessed nor affected in any way by this instruction.
- after the instruction takes effect, the next time the core's i-cache performs a read or write on the target cache line.. what?  Does the i-cache fetch only from the DRAM (or LLC), ignoring the core's d-cache and all other d-caches?  Or do the semantics of the target instruction have a side effect that causes behavior in the d-caches as well?  If so, could you say with similar precision, the semantics of interaction between the target instruction and d-caches?
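
One way to pin down the interpretation above is a small Python sketch. Everything in it is an assumption for illustration: the 64-byte line size, the Core class, and modelling each cache as a set of line base addresses. It only captures the "invalidate the local i-cache line, no other effect" bullets and deliberately leaves the last bullet (where the refill comes from) open.

```python
LINE_BYTES = 64  # assumed cache line size, for illustration only


def line_base(addr):
    """Return the base address of the cache line containing addr."""
    return addr & ~(LINE_BYTES - 1)


class Core:
    """Toy core: each cache is modelled as a set of valid line bases."""

    def __init__(self):
        self.icache = set()
        self.dcache = set()

    def invalidate_icache_line(self, target_addr):
        # Removes only the local i-cache line that contains target_addr.
        # The d-cache (and any other cache) is neither read nor modified.
        self.icache.discard(line_base(target_addr))
```

With lines 0x1000 and 0x1040 valid in the i-cache and 0x1000 valid in the d-cache, invalidating target address 0x1008 leaves only 0x1040 in the i-cache and the d-cache untouched.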

===

So, with some clarity on the desired semantics, would you be up for saying a bit more about 1) -- the high level need?

My inference is that you are creating hardware that has, what, some alternate mechanism for synchronizing caches?  Say, at the software level?  If so, I'm guessing that what you want is the bare minimum logic in the caches and pipelines?  So, the goal is minimal logic, and minimal time spent to invalidate a single i-cache line.. because you have something else going on that handles the updates from d-caches to.. DRAM/LLC?  So all you need is something that just invalidates an i-cache line, then this other thing going on ensures that the next time that i-cache line is accessed, the contents in DRAM/LLC will be correct?  Something like that?

If so, then semantics that are designed for OoO cores, or support for JITs, or hardware with coherent caches, etc, are too heavy weight?  That's why you're proposing the above alternate semantics?

Thanks Guy, and Derek, et al, really interesting discussion :-)

Sean

P.S. If I am way off base, and this is not a constructive addition to the thread, I apologise, please ignore in that case.


On Fri, Aug 12, 2022 at 6:15 AM Guy Lemieux <guy.lemieux@...> wrote:
Derek, let's try to move the discussion forward instead of just back and forth.

If the J extension will solve this problem, please describe how? You've been very opaque -- is that because it hasn't been addressed yet by the committee?

Let's summarize this thread:

CBO.INVAL.I for icache invalidation is necessary. Since it will be used in a loop, it is important not to lock up the pipeline for performance reasons.

You believe CBO.INVAL.I must flush instruction pipes, otherwise such an instruction would be architecturally incomplete. To me, you are trying to turn a CBO into a fence instruction. As I understand it, CBOs do not execute synchronously/block, meaning they can be queued up somewhere in the system (but they should follow the memory system ordering semantics).

Perhaps we should simply add new FENCE.I instructions, or new semantics to the existing instruction, which flush instruction pipes and sync any outstanding CBO.INVAL.I operations, but do not dump the entire i-cache in noncoherent systems?

Such an instruction may be used ONCE at the end of the loop that issues all of the CBO.INVAL.I instructions, thus maintaining performance and imposing correctness.
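
As a rough illustration of that usage pattern, the Python sketch below walks a code region one cache block at a time and issues a single fence at the end. The function name and the inval_block/fence callbacks are hypothetical placeholders: CBO.INVAL.I does not exist today, so the callbacks merely stand in for whatever instructions end up being defined.

```python
def invalidate_code_range(start, length, block_bytes, inval_block, fence):
    """Call inval_block once per cache block overlapping the region
    [start, start + length), then call fence exactly once.

    inval_block and fence are caller-supplied stand-ins for the proposed
    per-block invalidate and the closing fence. Returns the number of
    blocks visited.
    """
    if length <= 0:
        return 0
    first = start - (start % block_bytes)   # align down to a block boundary
    end = start + length
    count = 0
    for base in range(first, end, block_bytes):
        inval_block(base)                   # may be queued, not synchronous
        count += 1
    fence()                                 # one fence after the whole loop
    return count
```

For example, a 0x80-byte region starting at 0x1010 with 64-byte blocks touches the blocks at 0x1000, 0x1040 and 0x1080, followed by one fence.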

The commentary in the FENCE.I instruction suggests there is encoding room for alternatives which use imm[11:0], rs1, and rd operands which "are reserved for finer-grain fences in future extensions". Here is one way to use that encoding space to add the required FENCE.I semantics:

-- FENCE.I with imm[11:0]==0x0 and rs1!=0: flushes instr pipelines and fences against only those outstanding CBO.INVAL.I that specified the cache block named in rs1 (ie, it does not dump the full i-cache in non-coherent systems; in coherent systems it does a fine-grained FENCE.I against any visible updates to that cache block)

-- FENCE.I with imm[11:0]==0xFFF and rs1==0: flushes instr pipelines and fences against ALL outstanding CBO.INVAL.I (ie, does not dump full i-cache in noncoherent systems; in coherent systems it behaves as FENCE.I with imm[11:0]==0).
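
For reference, the bit patterns these two variants would occupy can be worked out from the existing FENCE.I layout (MISC-MEM opcode 0x0F, funct3 = 001, I-type fields). The Python helper below is a sketch: encode_fence_i is a made-up name, and the two non-zero encodings are only what the proposal above would imply, not anything ratified.

```python
MISC_MEM_OPCODE = 0x0F   # major opcode shared by FENCE and FENCE.I
FENCE_I_FUNCT3 = 0b001   # funct3 value that selects FENCE.I


def encode_fence_i(imm=0, rs1=0, rd=0):
    """Pack I-type fields into a 32-bit FENCE.I-style instruction word."""
    if not 0 <= imm <= 0xFFF:
        raise ValueError("imm must fit in 12 bits")
    if not 0 <= rs1 <= 31 or not 0 <= rd <= 31:
        raise ValueError("register numbers must be in 0..31")
    return ((imm << 20) | (rs1 << 15) | (FENCE_I_FUNCT3 << 12)
            | (rd << 7) | MISC_MEM_OPCODE)
```

With all fields zero this gives 0x0000100F, the standard FENCE.I. The first proposed variant with rs1 = x10 would be 0x0005100F, and the second (imm = 0xFFF, rs1 = x0) would be 0xFFF0100F.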

Guy


On Thu, Aug 11, 2022 at 10:20 PM Derek Williams <striker@...> wrote:

Thank you for this email Guy. It clarifies many things. 



striker@...
 


OK Guy, I agree, let's try to move this forward. For that to happen you're going to need to do a fair amount of reading.

Smiles, I've been anything but opaque on this subject, please see the attached 107-page presentation.  The people in the J group wish I were more opaque 🙂. Yes, you'll need to read it all. 

It will thoroughly explain how we deal with I/D consistency (please know that at the margins there are inaccuracies and things have changed somewhat since this draft, but the main thrust is correct. Please bring me your point questions as to content and I'll happily explain). There's another presentation that is shorter and I need to polish up a bit and send out that reflects what we're going to be doing on the barriers more precisely than this one does. 

For the record, I do NOT, repeat NOT, for a minute believe in anything as dumb as the notion of having the I-cache invalidate instruction flush the instruction pipes (I'm not willing to call it CBO.INVAL.I because I'm pretty sure wherever it winds up -- a free standing extension or in the J extension -- it won't be landing in the existing CBO extension since we don't want to have to buy off all of CBO to do I/D consistency).  

I believe the opposite: that you need another instruction to clear out the post I-cache pipes. I'm certainly NOT trying to turn the I-cache invalidation into a FENCE, and even the instruction I want (IMPORT.I -- analogous to ISYNC/ISB in Power/ARM) isn't really a memory operations fence.

The notion of having new FENCE.I instructions is an even worse idea than architecting the I-cache invalidation in isolation, so I'll just ignore all of that in the note other than to say NO. 

Again, there is no real rush or urgency here, so we can work out all of these issues over time and there's no need to fast-track an incomplete answer.

Derek Williams 


From: Guy Lemieux <guy.lemieux@...>
Sent: Friday, August 12, 2022 8:14 AM
To: Derek Williams <striker@...>
Cc: Mark Himelstein <markhimelstein@...>; tech-cmo@... Group Moderators <tech-cmo@...>; tech-privileged <tech-privileged@...>; Andrew Waterman <andrew@...>; allen.baum@... <allen.baum@...>; Martin Maas <mmaas@...>; John Ingalls <john.ingalls@...>; David Kruckemyer <dkruckemyer@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs
 
Derek, let's try to move the discussion forward instead of just back and forth.

If the J extension will solve this problem, please describe how? You've been very opaque -- is that because it hasn't been addressed yet by the committee?

Let's summarize this thread:

CBO.INVAL.I for icache invalidation is necessary. Since it will be used in a loop, it is important not to lock up the pipeline for performance reasons.

You believe CBO.INVAL.I must flush instruction pipes, otherwise such an instruction would be architecturally incomplete. To me, you are trying to turn a CBO into a fence instruction. As I understand it, CBOs do not execute synchronously/block, meaning they can be queued up somewhere in the system (but they should follow the memory system ordering semantics).

Perhaps we should simply add new FENCE.I instructions, or new semantics to the existing instruction, which flush instruction pipes and sync any outstanding CBO.INVAL.I operations, but do not dump the entire i-cache in noncoherent systems?

Such an instruction may be used ONCE at the end of the loop that issues all of the CBO.INVAL.I instructions, thus maintaining performance and imposing correctness.

The commentary in the FENCE.I instruction suggests there is encoding room for alternatives which use imm[11:0], rs1, and rd operands which "are reserved for finer-grain fences in future extensions". Here is one way to use that encoding space to add the required FENCE.I semantics:

-- FENCE.I with imm[11:0]==0x0 and rs1!=0: flushes instruction pipelines and fences against only those outstanding CBO.INVAL.I that specified the cache block named in rs1 (i.e., it does not dump the full i-cache in non-coherent systems; in coherent systems it does a fine-grained FENCE.I against any visible updates to that cache block)

-- FENCE.I with imm[11:0]==0xFFF and rs1==0: flushes instruction pipelines and fences against ALL outstanding CBO.INVAL.I (i.e., it does not dump the full i-cache in non-coherent systems; in coherent systems it behaves as FENCE.I with imm[11:0]==0).
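
To make the two proposed encodings concrete, here is a minimal C sketch that packs them into 32-bit instruction words using the standard I-type layout of FENCE.I (MISC-MEM major opcode, funct3 = 001). The helper names are made up for illustration, and the sketch assumes that "rs1!=0" means a register number other than x0; neither variant is part of any ratified extension.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_MISC_MEM 0x0Fu /* major opcode shared by FENCE and FENCE.I */
#define FUNCT3_FENCE_I  0x1u

/* Pack an I-type word: imm[11:0] | rs1 | funct3 | rd | opcode. */
static uint32_t encode_fence_i(uint32_t imm12, uint32_t rs1, uint32_t rd)
{
    assert(imm12 <= 0xFFFu && rs1 < 32u && rd < 32u);
    return (imm12 << 20) | (rs1 << 15) | (FUNCT3_FENCE_I << 12) |
           (rd << 7) | OPCODE_MISC_MEM;
}

/* Existing FENCE.I: every operand field is zero. */
static uint32_t fence_i_standard(void)
{
    return encode_fence_i(0, 0, 0);
}

/* Proposed per-block variant (hypothetical): imm == 0, rs1 != x0. */
static uint32_t fence_i_block(uint32_t rs1)
{
    assert(rs1 != 0);
    return encode_fence_i(0, rs1, 0);
}

/* Proposed "all outstanding CBO.INVAL.I" variant (hypothetical):
 * imm == 0xFFF, rs1 == x0. */
static uint32_t fence_i_all_pending(void)
{
    return encode_fence_i(0xFFF, 0, 0);
}
```

Under these assumptions the existing FENCE.I is 0x0000100F, the per-block form with rs1 = a0 (x10) is 0x0005100F, and the "all outstanding" form is 0xFFF0100F, so the three are distinguishable from the operand fields alone.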

Guy


On Thu, Aug 11, 2022 at 10:20 PM Derek Williams <striker@...> wrote:


Thank you for this email Guy. It clarifies many things. 


From: Mark Himelstein <markhimelstein@...>
Sent: Thursday, August 11, 2022 9:01 AM
To: tech-cmo@... Group Moderators <tech-cmo@...>; Guy Lemieux <guy.lemieux@...>; tech-privileged <tech-privileged@...>
Cc: Derek Williams <striker@...>; Andrew Waterman <andrew@...>; allen.baum@... <allen.baum@...>; Martin Maas <mmaas@...>; John Ingalls <john.ingalls@...>; David Kruckemyer <dkruckemyer@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs
 

On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote:
On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote:
> You say below you state you are asking for a way to guarantee that no I-caches have stale data (for a given block -- my words there). I don't think that's really what you're asking for.

Yes it's what I'm asking for.

OK, I can believe you are asking for that instruction (or a variant -- see below), but you also say below you're going to use it in a way that really does replace what FENCE.I (and some IPIs and such) do... which is to bring the I-fetch side up to date with the D-side's latest values. I still don't think that works.

> I think what you're really trying to do is replace FENCE.I and you believe that if you just had an instruction that would blow away all the I-cache copies of a given address, that would be enough to get you there.

No, we already have FENCE.I, and I understand how to use it. I
understand that in a non-coherent system, it may not do anything at
all to the i-cache (eg, if an IODMA channel replaces executable code
in memory, and those writes are not observable by the hart executing
FENCE.I).

I'll skip over this para since I don't really understand FENCE.I and I am going to try really hard to never learn it 🙂.

  > My point is that is NOT enough unless you make a whole mess of
other assumptions about the system and somehow clean out the post
I-cache pipes of stale instructions which is very bad architecture. A
cache invalidate instruction is one leg at best of the three (or more)
legs you need to hold this three-legged stool up.

I'm not trying to run self-modifying code or JIT code. I'm trying to
load new code from a non-coherent I/O device, and I want to ensure
there are no other copies floating in i-caches anywhere. I will settle
for an instruction that removes a cache block from the local hart
i-cache, because there is no coherence mechanism. In a coherent
system, an instruction that also removes a cache block in other harts'
i-caches is fine, but I don't care about that use-case; I know others
do care, so it should be defined as part of the CBO.INVAL.I operation,
and this would make it consistent with other CBO.* instructions. IODMA
is likely to be non-coherent in either case.

Huh... ok. I'm not sure that it matters all that much whether the D-side updates that create new instructions come from store instructions from a HART or from an I/O device (incoherent or not) -- at least as far as the post I-cache buffer flushing goes. I don't think that matters at all.

There is no need to worry about the post i-cache pipes because the OS
can guarantee that the code in that thread has stopped running and
therefore has nothing in flight. However, it has no way of knowing the
state of the i-caches.

This is where we depart controlled flight. Just because the OS stops running the thread doesn't necessarily guarantee that there isn't something stale lurking in a loop cache or some other exotic structure down-wind of the I-cache that isn't necessarily cleared by the I-cache invalidation instruction. So, while I'm quite certain that your application might get away with it, I can dream up ones where that doesn't necessarily hold. <spoiler alert, I'm lying just a bit here. There is one subtle point in the JIT stuff in the j-extension that might make this case have to work out, but even there you need more than just the I-cache invalidate instruction and the point remains... but we hold that aside for the moment>

So, in the end, I think by hanging everything on the i-cache invalidate instruction, you're inducing some exceptionally subtle additional requirements on the implementation of that i-cache invalidation instruction that are hard to even pin down or describe. I think you really have to have some equivalent of ISYNC/ISB/IMPORT.I to cleanly and architecturally close the hole in the post I-cache buffers.

So, yes, I think in the end you really do need to bite off the full J-extension if you want something that is architecture that works everywhere. Just using the I-cache invalidate might work in your application, but I don't think it closes generally. 

I'm not asking for it, but I don't believe there is an instruction
that flushes the entire i-cache of a hart. I also believe there is no
way to flush all i-caches of a system. I'm not sure if those would be
useful; they are not on my radar, and they would be very disruptive to
performance.

I'm going to skip over this paragraph as well, other than to say that's a different debate for another day. My initial reaction is that those instructions are really hard to define cleanly as architecture, have limited usefulness, and are much better left to implementation-specific instructions in the implementations that can actually justify needing them. They are not mainline architecture everyone needs.

> My question wasn't that, but was more along the lines of is this just an academic critique of the architecture as it exists now, or is there some real project that needs this defined right now.

I am starting a new project around IODMA and coherence issues.
Although it is academic, it will exist physically (real logic on an
FPGA) and run real code. This is not so urgent that I need the
instruction yesterday, or even in 12 months, as I can always create my
own instruction. I do not normally do OS-level work, but this is where
such an instruction would be used the most, and consultation should be
made with maintainers of RobotOS (ROS2), Embedded Linux, FreeRTOS,
Zephyr, etc to see how they handle this problem.


OK... good. So I would suggest we stop trying to fast-track anything. We're going to be done with this J group extension, if it kills me, before your 12-month timeline (and it might). I am perfectly happy for whoever from all of those groups to show up at the J meetings and provide their input/tell us what we have that doesn't fit their needs (but that worked fine for ARM and Power for 25+ years).

The problem, to be clear, is inadvertent execution of stale code
because there is no i-cache coherence and there are no i-cache
management instructions.

Yes, though arguably FENCE.I is an i-cache management instruction, just one none of us like to use. 

One "workaround" to this problem is for the OS to never re-use a
physical address for new code until it has to wrap around. This is a
"lazy" way that hopes i-cache contents are eventually replaced on
their own. However, it is not a guarantee.

I'm not at all a fan of hack-like non-guarantees.

This problem is not the same as the one the J-group is attempting to solve.

Actually, I think it is.  Since we disagree here, please explain the difference.

> If this is just an academic critique, I see no real reason to fast-track an I-cache invalidate instruction on its own, especially when that is not a complete solution to the problem and you'll be getting that soon (along with all the other necessary parts) with the J group proposal.

I have no insight into what is coming with the J group proposal.
However, I don't think it should be required to adopt the entire J
extension to get this capability. The J group is concerned with
self-modifying / dynamic generation of code on the fly, which is a
different use case which may care about what is in the current
execution pipeline.

As I say above, I think you will need the J group extension in most of its totality to have a solution that closes in all cases and doesn't rely on subtle unstated requirements/assumptions. I would be loath to try and architect something that tries to do it all with just the I-cache invalidation instruction.

The J extension isn't that costly in complexity or performance. 

> If you have something more than an academic critique here, please share that to the extent possible, but even if you do, the I-cache invalidate on its own isn't enough to provide a full solution, so I'm still not sure we should be fast-tracking anything.

My request for a fast-track wasn't due to a sense of urgency, but due
to a belief this is relatively simple and easy to define. I am not
attempting to fully define it at this point, but trying to see if
there is any support from others.

I agree that an instruction that invalidates the I-caches is well within the known state of the art to define. I just think defining that on its own is useless for the purposes we're trying to get to here.

> We also need to know the expectations and use cases for the others you've seen request it.

I'll use Google to help us out, but I won't carefully read each link below.

Thank you for this. It's 11:50pm though and I need to give up. I won't do a deeper dive on this until later.

A few discussion forums:




There are also several more, including the most recent one (a CMO
github issue, I believe) that spurred this conversation (but not this
thread).

You can also find code that expects to use icache flush instructions:


https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/riscv/flush-icache.c.html

(this link assumes FENCE.I flushes the i-cache, which non-coherent
systems don't have to do, so it is technically incorrect)
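
For readers who have not seen how user-space code typically requests this kind of flush, here is a minimal C sketch. It uses `__builtin___clear_cache`, which is a GCC/Clang builtin; how it maps to hardware is target-specific, and the assumption that on RISC-V Linux it ends up in a C-library routine like the one linked above is mine, not something stated in this thread. The helper name `install_code` is hypothetical. The sketch only covers code written by the hart itself (via `memcpy`); it does nothing for the non-coherent IODMA case discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy freshly produced instruction bytes into dst, then ask the
 * toolchain to make that range visible to instruction fetch.
 * On targets with coherent instruction fetch the builtin may expand
 * to nothing at all. */
static void *install_code(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
    __builtin___clear_cache((char *)dst, (char *)dst + len);
    return dst;
}
```

Whether the underlying implementation actually removes stale lines from an i-cache is exactly the assumption being questioned in this thread, so code like this is only as correct as the platform's flush routine.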

----

> My concern is that the I-cache invalidate instruction isn't enough to do good architecture here, so it is unhelpful to ratify that without the rest of the pieces.

You've stated this several times, and I think you have become
entrenched in your position. To you, it seems that "good architecture"
means worrying about flushing the pipeline. However, there are other use-cases you are failing to recognize.

Entrenched position.. yes.. for 25 years across two major architectures (Power and ARM) for deeply held reasons and I will continue to repeat myself on that point. 

I really don't think the use cases are different and I don't believe I'm failing to recognize your use case. Instead, I think you are looking at a specific "system design" point that gives you subtle characteristics, that aren't supported in every implementation, that would let you do what you're hoping to do. Totally fair if you want to set up your system design that way and document that and use your own instructions. 

For me, good architecture means the architecture language prevents an implementer from building an implementation that breaks if they don't follow the architecture rules. I think the extension we're going to propose is one they can't break (it's held for 25+ years). However, I think I can build an implementation that would break an architecture based only on the I-cache invalidate instruction. To the extent that is possible, I don't think that should ever be architecture.

To me, I'm worried about "good system design". As you can see from
the Linux link above, it has a bug in its assumptions about i-cache
flushes, and as the code links above show, incorrect software is
already showing up.

I don't have time right now to look at all that, and in the end you'll probably have to point a bit more specifically to the places inside those web pages that support what you're trying to get across. But would it shock me that people are a bit confused now? No. If this isn't clearly defined going in, it's very easy to confuse folks. And just because they may make an assumption about what an I-cache invalidate does, does not mean that that assumption is rational or should be pandered to.

Derek 


Guy






Andy Glew (Gmail) <andyglew@...>
 

Derek,  what's the status of taking your slides and turning them into an architecture spec?  Who is actually doing that?  How many years has it been already? 




------ Original Message ------
To "Guy Lemieux" <guy.lemieux@...>
Cc "Mark Himelstein" <markhimelstein@...>; "tech-cmo@... Group Moderators" <tech-cmo@...>; "tech-privileged" <tech-privileged@...>; "Andrew Waterman" <andrew@...>; "allen.baum@..." <allen.baum@...>; "Martin Maas" <mmaas@...>; "John Ingalls" <john.ingalls@...>; "David Kruckemyer" <dkruckemyer@...>
Date 8/12/2022 08:53:24
Subject Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs


OK Guy, I agree, let's try to move this forward. For that to happen you're going to need to do a fair amount of reading.

Smiles. I've been anything but opaque on this subject; please see the attached 107-page presentation. The people in the J group wish I were more opaque 🙂. Yes, you'll need to read it all.

It will thoroughly explain how we deal with I/D consistency. (Please know that at the margins there are inaccuracies and things have changed somewhat since this draft, but the main thrust is correct. Please bring me your point questions as to content and I'll happily explain.) There's another, shorter presentation that I need to polish up a bit and send out; it reflects what we're going to be doing on the barriers more precisely than this one does.

For the record, I do NOT, repeat NOT, for a minute believe in anything as dumb as the notion of having the I-cache invalidate instruction flush the instruction pipes. (I'm not willing to call it CBO.INVAL.I because I'm pretty sure wherever it winds up -- a free-standing extension or in the J extension -- it won't be landing in the existing CBO extension, since we don't want to have to buy off all of CBO to do I/D consistency.)

I believe the opposite: that you need another instruction to clear out the post I-cache pipes. I'm certainly NOT trying to turn the I-cache invalidation into a FENCE, and even the instruction I want (IMPORT.I -- analogous to ISYNC/ISB in Power/ARM) isn't really a memory operations fence.

The notion of having new FENCE.I instructions is an even worse idea than architecting the I-cache invalidation in isolation, so I'll just ignore all of that in the note other than to say NO. 

Again, there is no real rush or urgency here, so we can work out all of these issues over time and there's no need to fast-track an incomplete answer.

Derek Williams 


From: Guy Lemieux <guy.lemieux@...>
Sent: Friday, August 12, 2022 8:14 AM
To: Derek Williams <striker@...>
Cc: Mark Himelstein <markhimelstein@...>; tech-cmo@... Group Moderators <tech-cmo@...>; tech-privileged <tech-privileged@...>; Andrew Waterman <andrew@...>; allen.baum@... <allen.baum@...>; Martin Maas <mmaas@...>; John Ingalls <john.ingalls@...>; David Kruckemyer <dkruckemyer@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs
 






Greg Favor
 

On Fri, Aug 12, 2022 at 10:09 AM Andy Glew (Gmail) <andyglew@...> wrote:
Derek,  what's the status of taking your slides and turning them into an architecture spec?  Who is actually doing that?  How many years has it been already? 

This is a now active effort again in the J group.  Derek has led discussions on selected aspects in some of the recent meetings.

Greg
 


striker@...
 


I'm getting there Andy. We have picked this back up recently and had a few meetings on it. 

How many years has it been? Too bloody many, but this will end this year. That sort of thing happens when I have a teenage daughter to raise and a day job. The RISC I/D job I have to do on the side, when I can, between crises.

Fortunately, I'm entering a work and kiddo downswing that is already starting to allow me to spend more time on this, and it will get even better in the next few weeks.

Who's doing it? Me right now. Once I get a little further here, I'll be in a place where I can rationally draft people to help me write. Sadly, in the state I'm in now, others trying to write/help will make it worse, not better. I need to get the initial stake finished and in the ground.

Derek 

P.S. Glad to see you back... or have you been back and this just got on your radar since it went over to privileged? 



From: tech-cmo@... <tech-cmo@...> on behalf of Andy Glew (Gmail) <andyglew@...>
Sent: Friday, August 12, 2022 12:09 PM
To: tech-cmo@... Group Moderators <tech-cmo@...>; Guy Lemieux <guy.lemieux@...>
Cc: Mark Himelstein <markhimelstein@...>; tech-cmo@... Group Moderators <tech-cmo@...>; tech-privileged <tech-privileged@...>; Andrew Waterman <andrew@...>; allen.baum@... <allen.baum@...>; Martin Maas <mmaas@...>; John Ingalls <john.ingalls@...>; David Kruckemyer <dkruckemyer@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs
 

Derek,  what's the status of taking your slides and turning them into an architecture spec?  Who is actually doing that?  How many years has it been already? 




------ Original Message ------
To "Guy Lemieux" <guy.lemieux@...>
Cc "Mark Himelstein" <markhimelstein@...>; "tech-cmo@... Group Moderators" <tech-cmo@...>; "tech-privileged" <tech-privileged@...>; "Andrew Waterman" <andrew@...>; "allen.baum@..." <allen.baum@...>; "Martin Maas" <mmaas@...>; "John Ingalls" <john.ingalls@...>; "David Kruckemyer" <dkruckemyer@...>
Date 8/12/2022 08:53:24
Subject Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs


OK Guy, I agree, let's try to move this forward. For that to happen you're going to need to do a fair amount of reading.

Smiles, I've been anything but opaque on this subject, please see the attached 107-page presentation.  The people in the J group wish I were more opaque 🙂. Yes, you'll need to read it all. 

It will thoroughly explain how we deal with I/D consistency (please know that at the margins there are inaccuracies and things have changed somewhat since this draft, but the main thrust is correct. Please bring me your point questions as to content and I'll happily explain). There's another presentation that is shorter and I need to polish up a bit and send out that reflects what we're going to be doing on the barriers more precisely than this one does. 

For the record, I do NOT, repeat NOT, for a minute believe in anything as dumb as the notion of having the I-cache invalidate instruction flush the instruction pipes (I'm not willing to call it CBO.INVAL.I because I'm pretty sure wherever it winds up -- a free-standing extension or in the J extension -- it won't be landing in the existing CBO extension since we don't want to have to buy off all of CBO to do I/D consistency).

I believe the opposite: that you need another instruction to clear out the post I-cache pipes. I'm certainly NOT trying to turn the I-cache invalidation into a FENCE, and even the instruction I want (IMPORT.I -- analogous to ISYNC/ISB in Power/ARM) isn't really a memory operations fence.

The notion of having new FENCE.I instructions is an even worse idea than architecting the I-cache invalidation in isolation, so I'll just ignore all of that in the note other than to say NO. 

Again, there is no real rush or urgency here, so we can work out all of these issues over time and there's no need to fast-track an incomplete answer.

Derek Williams 


From: Guy Lemieux <guy.lemieux@...>
Sent: Friday, August 12, 2022 8:14 AM
To: Derek Williams <striker@...>
Cc: Mark Himelstein <markhimelstein@...>; tech-cmo@... Group Moderators <tech-cmo@...>; tech-privileged <tech-privileged@...>; Andrew Waterman <andrew@...>; allen.baum@... <allen.baum@...>; Martin Maas <mmaas@...>; John Ingalls <john.ingalls@...>; David Kruckemyer <dkruckemyer@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs
 
Derek, let's try to move the discussion forward instead of just back and forth.

If the J extension will solve this problem, please describe how? You've been very opaque -- is that because it hasn't been addressed yet by the committee?

Let's summarize this thread:

CBO.INVAL.I for icache invalidation is necessary. Since it will be used in a loop, it is important not to lock up the pipeline for performance reasons.

You believe CBO.INVAL.I must flush instruction pipes, otherwise such an instruction would be architecturally incomplete. To me, you are trying to turn a CBO into a fence instruction. As I understand it, CBOs do not execute synchronously/block, meaning they can be queued up somewhere in the system (but they should follow the memory system ordering semantics).

Perhaps we should simply add new FENCE.I instructions, or new semantics to the existing instruction, which flush instruction pipes and sync any outstanding CBO.INVAL.I operations, but do not dump the entire i-cache in noncoherent systems?

Such an instruction may be used ONCE at the end of the loop that issues all of the CBO.INVAL.I instructions, thus maintaining performance and imposing correctness.
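
As a rough illustration of the loop-then-sync pattern described above, here is a small Python sketch. It only models the sequence of operations and executes nothing on hardware. CBO.INVAL.I is the instruction proposed in this thread (it is not ratified), the 64-byte block size is an assumption, and `invalidate_sequence` is a hypothetical helper name.

```python
def invalidate_sequence(base, length, block_size=64):
    """Return the operations a loader might issue after non-coherent DMA
    writes `length` bytes of new code at address `base`.

    Hypothetical: "cbo.inval.i" is the instruction proposed in this thread,
    and "fence.i (sync)" stands for the proposed finer-grain fence.
    block_size=64 is an assumed cache-block size.
    """
    if length <= 0:
        return []
    first = base - (base % block_size)                      # align down to block start
    last = (base + length - 1) // block_size * block_size   # block holding the last byte
    ops = [("cbo.inval.i", addr) for addr in range(first, last + 1, block_size)]
    ops.append(("fence.i (sync)", None))                    # issued once, after the loop
    return ops
```

The point of the shape is that the per-block operations can be queued without stalling, and only the single trailing operation has to wait for them.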

The commentary in the FENCE.I instruction suggests there is encoding room for alternatives which use imm[11:0], rs1, and rd operands which "are reserved for finer-grain fences in future extensions". Here is one way to use that encoding space to add the required FENCE.I semantics:

-- FENCE.I with imm[11:0]==0x0 and rs1!=0: flushes instruction pipelines and fences against only those outstanding CBO.INVAL.I that specified the cache block named in rs1 (ie, it does not dump the full i-cache in non-coherent systems; in coherent systems it does a fine-grained FENCE.I against any visible updates to that cache block)

-- FENCE.I with imm[11:0]==0xFFF and rs1==0: flushes instruction pipelines and fences against ALL outstanding CBO.INVAL.I (ie, does not dump full i-cache in noncoherent systems; in coherent systems it behaves as FENCE.I with imm[11:0]==0).
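
To make the encoding space concrete, the following Python sketch packs an I-type instruction word using the MISC-MEM opcode and the funct3 value that FENCE.I already uses. Only the all-zero form is the standard FENCE.I; the two non-zero variants are the proposal made in this message and are not part of any ratified spec. The function name `encode_fence_i` is a hypothetical helper.

```python
MISC_MEM = 0b0001111      # major opcode shared by FENCE and FENCE.I
FUNCT3_FENCE_I = 0b001    # funct3 value that selects FENCE.I

def encode_fence_i(imm=0, rs1=0, rd=0):
    """Pack an I-type word: imm[11:0] | rs1 | funct3 | rd | opcode.

    imm=rs1=rd=0 gives the standard FENCE.I. Any other combination is
    reserved today; the non-zero uses here are only the proposal above.
    """
    if not (0 <= imm <= 0xFFF and 0 <= rs1 < 32 and 0 <= rd < 32):
        raise ValueError("field out of range")
    return (imm << 20) | (rs1 << 15) | (FUNCT3_FENCE_I << 12) | (rd << 7) | MISC_MEM
```

For example, `hex(encode_fence_i())` gives `0x100f` (the standard FENCE.I), `hex(encode_fence_i(rs1=10))` gives `0x5100f` (the proposed per-block form with the address in a0), and `hex(encode_fence_i(imm=0xFFF))` gives `0xfff0100f` (the proposed "all outstanding" form).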

Guy


On Thu, Aug 11, 2022 at 10:20 PM Derek Williams <striker@...> wrote:

Thank you for this email Guy. It clarifies many things. 


From: Mark Himelstein <markhimelstein@...>
Sent: Thursday, August 11, 2022 9:01 AM
To: tech-cmo@... Group Moderators <tech-cmo@...>; Guy Lemieux <guy.lemieux@...>; tech-privileged <tech-privileged@...>
Cc: Derek Williams <striker@...>; Andrew Waterman <andrew@...>; allen.baum@... <allen.baum@...>; Martin Maas <mmaas@...>; John Ingalls <john.ingalls@...>; David Kruckemyer <dkruckemyer@...>
Subject: [EXTERNAL] Re: [RISC-V] [tech-cmo] [riscv-CMOs:master] reported: Can CMO extension support icache management? #github #risv #CMOs
 
+tech-privileged

On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote:
On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote:
> You say below you state you are asking for a way to guarantee that no I-caches have stale data (for a given block -- my words there). I don't think that's really what you're asking for.

Yes it's what I'm asking for.

OK, I can believe you are asking for that instruction (or a variant -- see below), but you also say below you're going to use it in a way that really does replace what FENCE.I (and some IPIs and such) do... which is to bring the I-fetch side up to date with the D-side's latest values.  I still don't think that works.

> I think what you're really trying to do is replace FENCE.I and you believe that if you just had an instruction that would blow away all the I-cache copies of a given address, that would be enough to get you there.

No, we already have FENCE.I, and I understand how to use it. I
understand that in a non-coherent system, it may not do anything at
all to the i-cache (eg, if an IODMA channel replaces executable code
in memory, and those writes are not observable by the hart executing
FENCE.I).

I'll skip over this para since I don't really understand FENCE.I and I am going to try really hard to never learn it 🙂.

  > My point is that is NOT enough unless you make a whole mess of
other assumptions about the system and somehow clean out the post
I-cache pipes of stale instructions which is very bad architecture. A
cache invalidate instruction is one leg at best of the three (or more)
legs you need to hold this three-legged stool up.

I'm not trying to run self-modifying code or JIT code. I'm trying to
load new code from a non-coherent I/O device, and I want to ensure
there are no other copies floating in i-caches anywhere. I will settle
for an instruction that removes a cache block from the local hart
i-cache, because there is no coherence mechanism. In a coherent
system, an instruction that also removes a cache block in other harts'
i-caches is fine, but I don't care about that use-case; I know others
do care, so it should be defined as part of the CBO.INVAL.I operation,
and this would make it consistent with other CBO.* instructions. IODMA
is likely to be non-coherent in either case.

Huh... ok. I'm not sure that it matters all that much if the D-side updates that create new instructions come from store instructions from a HART or from an I/O device (incoherent or not) -- at least as far as the post I-cache buffer flushing goes. I don't think that matters at all.

There is no need to worry about the post i-cache pipes because the OS
can guarantee that the code in that thread has stopped running and
therefore has nothing in flight. However, it has no way of knowing the
state of the i-caches.

This is where we depart controlled flight. Just because the OS stops running the thread doesn't necessarily guarantee that there isn't something stale lurking in a loop cache or some other exotic structure down-wind of the I-cache that isn't necessarily cleared by the I-cache invalidation instruction. So, while I'm quite certain that your application might work, I can dream up ones where that doesn't necessarily hold. <spoiler alert, I'm lying just a bit here. There is one subtle point in the JIT stuff in the j-extension that might make this case have to work out, but even there you need more than just the I-cache invalidate instruction and the point remains... but we hold that aside for the moment>

So, in the end, I think by hanging everything on the i-cache invalidate instruction, you're inducing some exceptionally subtle additional requirements on the implementation of that i-cache invalidation instruction that are hard to even pin down or describe. I think you really have to have some equivalent of ISYNC/ISB/IMPORT.I to cleanly and architecturally close the hole in the post I-cache buffers.
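
The hole being described can be shown with a toy model. The Python sketch below is not any real microarchitecture: it assumes a hypothetical front end in which a loop buffer sits down-wind of the I-cache and is not touched by the I-cache invalidate. `flush_pipes` stands in for an ISYNC/ISB/IMPORT.I-style operation, and all class and method names are made up for illustration.

```python
class ToyFrontEnd:
    """Toy fetch path: loop buffer -> I-cache -> memory, keyed by address."""

    def __init__(self, memory):
        self.memory = memory      # backing store, written by the DMA device
        self.icache = {}
        self.loop_buffer = {}     # stands in for any post-I-cache structure

    def fetch(self, addr):
        # The loop buffer is consulted first, so it can hide newer data.
        if addr in self.loop_buffer:
            return self.loop_buffer[addr]
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        self.loop_buffer[addr] = self.icache[addr]
        return self.loop_buffer[addr]

    def icache_invalidate(self, addr):
        # Assumed behaviour: drops the I-cache copy only.
        self.icache.pop(addr, None)

    def flush_pipes(self):
        # Assumed behaviour of a separate pipe-synchronizing instruction.
        self.loop_buffer.clear()
```

In this model, if `memory[0x100]` is fetched as `"old"`, then overwritten with `"new"` by DMA and followed by `icache_invalidate(0x100)`, the next `fetch(0x100)` still returns `"old"`. Only after `flush_pipes()` does the fetch return `"new"`. Whether a given implementation actually behaves this way is exactly the kind of unstated assumption at issue here.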

So, yes, I think in the end you really do need to bite off the full J-extension if you want something that is architecture that works everywhere. Just using the I-cache invalidate might work in your application, but I don't think it closes generally. 

I'm not asking for it, but I don't believe there is an instruction
that flushes the entire i-cache of a hart. I also believe there is no
way to flush all i-caches of a system. I'm not sure if those would be
useful; they are not on my radar, and they would be very disruptive to
performance.

I'm going to skip this paragraph over as well other than to say that's a different debate for another day and my initial reaction is those instructions are really hard to define cleanly as architecture, have limited usefulness, and are much better left to implementation specific instructions in the implementations that can actually justify needing them. They are not mainline architecture everyone needs. 

> My question wasn't that, but was more along the lines of is this just an academic critique of the architecture as it exists now, or is there some real project that needs this defined right now.

I am starting a new project around IODMA and coherence issues.
Although it is academic, it will exist physically (real logic on an
FPGA) and run real code. This is not so urgent that I need the
instruction yesterday, or even in 12 months, as I can always create my
own instruction. I do not normally do OS-level work, but this is where
such an instruction would be used the most, and consultation should be
made with maintainers of RobotOS (ROS2), Embedded Linux, FreeRTOS,
Zephyr, etc to see how they handle this problem.


OK... good. So I would suggest we stop trying to fast-track anything. We're going to be done with this J group extension if it kills me before your 12-month timeline (and it might). I am perfectly happy for whomever from all of those groups to show up at the J meetings and provide their input/tell us what we have that doesn't fit their needs (but that worked fine for ARM and Power for 25+ years).

The problem, to be clear, is inadvertent execution of stale code
because there is no i-cache coherence and there are no i-cache
management instructions.

Yes, though arguably FENCE.I is an i-cache management instruction, just one none of us like to use. 

One "workaround" to this problem is for the OS to never re-use a
physical address for new code until it has to wrap around. This is a
"lazy" way that hopes i-cache contents are eventually replaced on
their own. However, it is not a guarantee.
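
A minimal sketch of that "lazy" workaround, in Python, assuming a single contiguous code region managed by a bump pointer. The class name and the `wrapped` flag are hypothetical and only serve to show where the guarantee ends.

```python
class WrapAroundCodeAllocator:
    """Hand out code addresses in increasing order; reuse only after wrapping."""

    def __init__(self, base, size):
        self.base = base
        self.size = size
        self.next = base
        self.wrapped = False      # True once any address has been reused

    def alloc(self, length):
        if length > self.size:
            raise ValueError("request larger than code region")
        if self.next + length > self.base + self.size:
            # Out of fresh addresses: start over. From here on, stale
            # i-cache contents for reused addresses are possible.
            self.next = self.base
            self.wrapped = True
        addr = self.next
        self.next += length
        return addr
```

Until `wrapped` becomes true, no address is handed out twice, so no i-cache can hold an older version of the code placed there. After the wrap, correctness depends on the old blocks having been evicted by chance.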

I'm not at all a fan of hack-like non-guarantees.

This problem is not the same as the one the J-group is attempting to solve.

Actually, I think it is.  Since we disagree here, please explain the difference.

> If this is just an academic critique, I see no real reason to fast track an I-cache invalidate instruction on its own, especially when that is not a complete solution to the problem and you'll be getting that soon (along with all the other necessary parts) with the J group proposal.

I have no insight into what is coming with the J group proposal.
However, I don't think it should be required to adopt the entire J
extension to get this capability. The J group is concerned with
self-modifying / dynamic generation of code on the fly, which is a
different use case which may care about what is in the current
execution pipeline.

As I say above, I think you will need the J group extension in most of its totality to have a solution that closes in all cases and doesn't rely on subtle unstated requirements/assumptions. I would be loath to try and architect something that tries to do it all with just the I-cache invalidation instruction.

The J extension isn't that costly in complexity or performance. 

> If you have something more than an academic critique here, please share that to the extent possible, but even if you do, the I-cache invalidate on its own isn't enough to provide a full solution, so I'm still not sure we should be fast-tracking anything.

My request for a fast-track wasn't due to a sense of urgency, but due
to a belief this is relatively simple and easy to define. I am not
attempting to fully define it at this point, but trying to see if
there is any support from others.

I agree that an instruction that invalidates the I-caches is well within the known state of the art to define. I just think defining that on its own is useless for the purposes we're trying to get to here.

> We also need to know the expectations and use cases for the others you've seen request it.

I'll use Google to help us out, but I won't carefully read each link below.

Thank you for this. It's 11:50pm though and I need to give up. I won't do a deeper dive on this until later.

A few discussion forums:




There are also several more, including the most recent one (a CMO
github issue, I believe) that spurred this conversation (but not this
thread).

You can also find code that expects to use icache flush instructions:


https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/riscv/flush-icache.c.html

(this link assumes FENCE.I flushes the i-cache, which non-coherent
systems don't have to do, so it is technically incorrect)

----

> My concern is that the I-cache invalidate instruction isn't enough to do good architecture here, so it is unhelpful to ratify that without the rest of the pieces.

You've stated this several times, and I think you have become
entrenched in your position. To you, it seems that "good architecture"
means worrying about flushing the pipeline. However, there are other use-cases you are failing to recognize.

Entrenched position.. yes.. for 25 years across two major architectures (Power and ARM) for deeply held reasons and I will continue to repeat myself on that point. 

I really don't think the use cases are different and I don't believe I'm failing to recognize your use case. Instead, I think you are looking at a specific "system design" point that gives you subtle characteristics, that aren't supported in every implementation, that would let you do what you're hoping to do. Totally fair if you want to set up your system design that way and document that and use your own instructions. 

For me, good architecture means the architecture language prevents an implementer from building an implementation that breaks if they don't follow the architecture rules. I think the extension we're going to propose they can't break (it's held for 25+ years). However, I think I can build an implementation that would break the I-cache invalidate instruction only architecture. To the extent that is possible, I don't think that should ever be architecture.

To me, I'm worried about "good system design", and as you can see by
the Linux link above it has a bug in its assumptions about i-cache
flushes. Instead, as the code links above show, we are already getting
incorrect software show up.

I don't have time right now to look at all that, and in the end you'll probably have to point a bit more specifically to the places inside those web pages that support what you're trying to get across. But, would it shock me that people are a bit confused now? No. If this isn't clearly defined going in, it's very easy to confuse folks. And just because they may make an assumption about what an I-cache invalidate does, does not mean that that assumption is rational or should be pandered to.

Derek 


Guy