Specifying Cache Granule Size in Platform
During the Platform HSC, the topic of specifying an expected cache granule size for a platform was brought up. Below are some thoughts and observations on the topic. The purpose of this email is to provide a basis for discussion.
Philip indicated cbo.zero would be the main target for specifying the size. If we went down that path, I think it'd be important to specify the size for cbo.flush as well.
If the granule size is determined only at run time, software inherently needs solutions for a number of scenarios. Some cases to think about:
1. Userland drivers
2. VM migration
3. Object layout in memory for optimization purposes
4. Inconsistent granule sizes across different harts in the system
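To make case 3 concrete, here is a minimal C sketch of why a layout chosen at compile time interacts with a granule size that is only known at run time. Everything named here (`ASSUMED_GRANULE`, `struct padded_counter`, `layout_is_safe`) is hypothetical and only for illustration; the 64-byte value is an assumption, not something any spec mandates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical granule size baked into the binary at build time. */
#define ASSUMED_GRANULE 64u

/* Each counter is aligned and padded to one assumed granule, so that two
 * counters never share a granule if the assumption holds. */
struct padded_counter {
    _Alignas(ASSUMED_GRANULE) uint64_t value;
};

/* Returns 1 if the compile-time layout still keeps objects in separate
 * granules for the granule size reported at run time. A smaller
 * power-of-two granule is fine (each object spans whole granules); a
 * larger one means neighbouring objects share a granule. */
static int layout_is_safe(size_t runtime_granule)
{
    /* Granule sizes are assumed to be powers of two. */
    assert(runtime_granule != 0 &&
           (runtime_granule & (runtime_granule - 1)) == 0);
    return runtime_granule <= ASSUMED_GRANULE;
}
```

With a platform-specified size the check is unnecessary; with run-time discovery, software built for 64 bytes would need a fallback path on a hart that reports, say, 128 bytes.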
Ved pointed out that it's important to be able to interrogate the cache hierarchy for these attributes, and to know, on a per-hart basis, which cache granule size each hart deals with.
We can certainly engineer solutions for the various cases in either scenario, but I think the discussion around specifying the granule size should focus on how we'd deal with those solutions going forward. Please offer any other scenarios that come to mind. Lastly, I apologize if I missed any particular commentary on this subject.
A few comments to add on here:
- The expectation in the CMO group was that there *could* be different cache block sizes for different operations. As Aaron points out, one may have one block size for Zicboz instructions, and another for Zicbom instructions (and maybe even another for Zicbop instructions, too).
- v1.0 of these specs requires that the block size be fixed across the system.
- The block size is discoverable via a to-be-specified standard discovery structure, but the thinking was that software at all privilege levels could access that information (though such information may be virtualized as necessary).
- Though not formally specified (yet), the working definition of "cache block" is the NAPOT region that a CMO instruction operates on, independent of the implemented cache line size; so in some sense, a cache block abstracts away cache lines. As I mentioned, that's a working definition, and there are of course issues that arise when there is a mismatch between the abstract block size and the actual line size (and one of the items on the to-do list for the CMO TG is to address some of these issues). Anyway, the key takeaway is that the terminology for block, line, granule, etc. should be well-defined, and the platform group and the CMO TG should align.
- Note that there is often a close relationship between the cache line size and the coherence granule size (I know, introducing a new term here), and though this may not be the right forum for the discussion, the architecture needs to understand whether different line sizes imply different coherence granule sizes (or more generally, whether one implies constraints on the other).
- Another related issue is LR/SC reservation set size.
- Philosophical thought: Trap and emulate allows one to commit a bunch of sins, and it may be pragmatic for certain platform definitions to be performant only for "common" implementations and to just be compatible for other implementations. For example, the block size for CBO.ZERO could be defined to be 64B in a platform. On harts that support 64B zero writes, executing CBO.ZERO performs the operation without a trap; on harts that do not, executing CBO.ZERO traps to a handler that performs the equivalent of the 64B zero write. (The trapping behavior can be controlled by xenvcfg.CBZE.)
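As a rough illustration of the last point and of the NAPOT working definition above, here is a minimal C sketch of what the emulation step of such a trap handler might do. The names `PLATFORM_CBO_ZERO_BLOCK`, `block_base`, and `emulate_cbo_zero` are hypothetical, and the 64B value is just the example size from the text. A real handler would also have to decode the trapping instruction, read the address register, and honor permissions and ordering, none of which is shown.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical platform-defined block size for CBO.ZERO (example value). */
#define PLATFORM_CBO_ZERO_BLOCK 64u

/* Base address of the naturally aligned power-of-two (NAPOT) block that
 * contains addr. block_size must be a power of two. */
static uintptr_t block_base(uintptr_t addr, uintptr_t block_size)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return addr & ~(block_size - 1);
}

/* Emulates the memory effect of CBO.ZERO for the assumed block size:
 * zeroes the whole block containing addr, not just bytes from addr on. */
static void emulate_cbo_zero(uintptr_t addr)
{
    void *base = (void *)block_base(addr, PLATFORM_CBO_ZERO_BLOCK);
    memset(base, 0, PLATFORM_CBO_ZERO_BLOCK);
}
```

Note that the address passed in need not be block-aligned; the operation applies to the enclosing block, which is why the block size has to be agreed on before software can rely on the result.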
That was a bit more than I expected to write. Hope some of it is helpful....
On Mon, Jan 24, 2022 at 7:15 PM Aaron Durbin <adurbin@...> wrote:
Are we or will we be using any benchmarks here?
I worry that this is based on past experience, and workloads have evolved.
Traversal workloads (Hadoop, CPU encryption, CPU compression, ...) and sparse workloads (uniq query hash tables, sparse matrices, ...) have in many cases become more important than the principles of locality.
I know we are just starting the perf modeling SIG, but I suggest we settle on a tool to identify issues and mark progress over an agreed set of cache/memory configurations. Of course vendors will test on their own hardware. While I understand some basics are simply needed, when I hear David's comments I suggest we pick ways to grade ourselves and our choices objectively.
In my experience, the cost of getting these things wrong (locks and latches, cache prefetch, cache fragmentation, cache occupancy, etc.) can easily exceed any benefit we get from instruction architecture improvements.
These issues cross many group boundaries.
On Mon, Jan 24, 2022 at 10:28 PM David Kruckemyer <dkruckemyer@...> wrote: