Guy Lemieux <guy.lemieux@...>
Dumping the entire i-cache via FENCE.I is different. I am requesting invalidation of a single cache block from the i-cache.
The Zifencei spec itself recognizes that FENCE.I can be expensive to implement:
"on some systems, FENCE.I will be expensive to implement and alternate mechanisms are being discussed in the memory model task group. In particular, for designs that have an incoherent instruction cache and an incoherent data cache, or where the instruction cache refill does not snoop a coherent data cache, both caches must be completely flushed when a FENCE.I instruction is encountered. This problem is exacerbated when there are multiple levels of I and D cache in front of a unified cache or outer memory system."
In addition to being "expensive to implement", it is also rather costly at runtime.
Since Zifencei is an optional extension, it makes sense that other optional things might be used to replace it. The spec itself suggests that another mechanism may come about in the future, namely: "Currently, this instruction is the only standard mechanism to ensure that stores visible to a hart will also be visible to its instruction fetches."
As an aside, the FENCE.I instruction has been misinterpreted many times, and there are many posts correcting people (including correcting me) about its operation. This is partly because the spec is ambiguously written, and partly because it is unexpected that it could have an impact on an entire i-cache.
The heart of the ambiguity is in the introductory paragraph of the Zifencei spec:
"This chapter defines the “Zifencei” extension ... explicit synchronization between writes to instruction memory and instruction fetches on the same hart."
This could be interpreted as "between (writes to instruction memory and instruction fetches) on the same hart", or it could be interpreted as "between writes to instruction memory and (instruction fetches on the same hart)". If the latter, then what is the other thing that is being done to warrant the expression "on the same hart"?
Some of the subsequent text can be read in either context, thus propagating the ambiguity. In particular, the key line of the spec is ambiguous: "FENCE.I does not ensure that other RISC-V harts’ instruction fetches will observe the local hart’s stores in a multiprocessor system." In this case, the spec should read: "Executing FENCE.I on a local hart does not ensure that other RISC-V harts' instruction fetches will observe the local hart's stores." However, an equally valid but opposite interpretation would be: "Executing FENCE.I on a multiprocessor system does not ensure that other RISC-V harts' instruction fetches will observe a local hart's stores."
At some point, the Zifencei spec should be updated to be clearer.
On Thu, Aug 11, 2022 at 7:43 AM <krste@...
>>>>> On Thu, 11 Aug 2022 07:01:58 -0700, "mark" <markhimelstein@...> said:
| On Thu, Aug 11, 2022 at 6:48 AM Guy Lemieux <guy.lemieux@...> wrote:
| On Wed, Aug 10, 2022 at 10:39 PM Derek Williams <striker@...> wrote:
|| You state below that you are asking for a way to guarantee that no I-caches have stale data (for a given block -- my words there). I don't think that's really what you're
| asking for.
| Yes it's what I'm asking for.
|| I think what you're really trying to do is replace FENCE.I and you believe that if you just had an instruction that would blow away all the I-cache copies of a given
| address, that would be enough to get you there.
| No, we already have FENCE.I, and I understand how to use it. I
| understand that in a non-coherent system, it may not do anything at
| all to the i-cache (eg, if an IODMA channel replaces executable code
| in memory, and those writes are not observable by the hart executing
Assuming you have a way of knowing when the IODMA channel has made all
its writes visible to the local hart (e.g., an interrupt on
completion), then a FENCE.I should make any writes made by any agent
in the memory system visible to the local hart.
If you have an incoherent I-cache, then most likely the implementation
will have to flush the I-cache as well as the instruction pipeline to
implement FENCE.I correctly.
I believe this handles your use case below.
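As a rough illustration of the sequence described above (wait for the DMA completion signal, then FENCE.I, then jump to the new code), here is a minimal C sketch. All names (dma_done, dma_complete_isr, run_loaded_code, sample_entry) are hypothetical, the completion interrupt is assumed to exist on the platform, and the toolchain is assumed to have Zifencei enabled. On a non-RISC-V host the fence degrades to a compiler barrier so the sketch still compiles.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag set by the DMA completion interrupt handler. */
static atomic_bool dma_done;

/* Assumed to be wired up as the IODMA "transfer complete" ISR. */
void dma_complete_isr(void)
{
    atomic_store_explicit(&dma_done, true, memory_order_release);
}

/* Make earlier stores visible to this hart's instruction fetches. */
static inline void sync_instruction_fetch(void)
{
#if defined(__riscv)
    /* Assumes Zifencei is part of the target -march string. */
    __asm__ volatile ("fence.i" ::: "memory");
#else
    /* Host build only: compiler barrier, no hardware effect. */
    __asm__ volatile ("" ::: "memory");
#endif
}

typedef int (*entry_fn)(void);

/* Wait for the DMA to finish, then FENCE.I, then run the new code. */
int run_loaded_code(entry_fn entry)
{
    while (!atomic_load_explicit(&dma_done, memory_order_acquire)) {
        /* spin until the completion interrupt has fired */
    }
    sync_instruction_fetch();
    return entry();
}

/* Stand-in for code that the DMA would have written into memory. */
int sample_entry(void)
{
    return 42;
}
```

Note that this only covers the local hart; it says nothing about other harts' instruction fetches.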
|| My question wasn't that, but was more along the lines of is this just an academic critique of the architecture as it exists now, or is there some real project that needs
| this defined right now.
| I am starting a new project around IODMA and coherence issues.
| Although it is academic, it will exist physically (real logic on an
| FPGA) and run real code. This is not so urgent that I need the
| instruction yesterday, or even in 12 months, as I can always create my
| own instruction. I do not normally do OS-level work, but this is where
| such an instruction would be used the most, and consultation should be
| made with maintainers of RobotOS (ROS2), Embedded Linux, FreeRTOS,
| Zephyr, etc to see how they handle this problem.
| The problem, to be clear, is inadvertent execution of stale code
| because there is no i-cache coherence and there are no i-cache
| management instructions.
| One "workaround" to this problem is for the OS to never re-use a
| physical address for new code until it has to wrap around. This is a
| "lazy" way that hopes i-cache contents are eventually replaced on
| their own. However, it is not a guarantee.
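For concreteness, the "lazy" workaround quoted above could look something like the following C sketch: a bump allocator over a region reserved for code, which hands out fresh addresses until it wraps. The code_arena type and code_arena_alloc function are hypothetical and not taken from any particular OS. The arena base is assumed to be nonzero so that 0 can signal failure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical arena of physical addresses reserved for loadable code. */
typedef struct {
    uintptr_t base;   /* start of the region (assumed nonzero) */
    size_t    size;   /* total bytes in the region */
    size_t    next;   /* offset of the next never-used byte */
    unsigned  wraps;  /* how many times addresses were reused */
} code_arena;

/*
 * Return an address for len bytes of new code, or 0 if len can never fit.
 * *wrapped is set to 1 when the allocator had to wrap around, i.e. when
 * previously used addresses are being handed out again and stale i-cache
 * contents become a real possibility.
 */
uintptr_t code_arena_alloc(code_arena *a, size_t len, int *wrapped)
{
    *wrapped = 0;
    if (len == 0 || len > a->size) {
        return 0;
    }
    if (len > a->size - a->next) {
        /* Out of fresh addresses: start over from the base. */
        a->next = 0;
        a->wraps++;
        *wrapped = 1;
    }
    uintptr_t addr = a->base + a->next;
    a->next += len;
    return addr;
}
```

The wrapped flag marks exactly the point where the scheme stops helping: after a wrap, nothing guarantees that the old lines have been evicted from the i-cache.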