Hi all,
A few comments to add here:
- The expectation in the CMO group was that there *could* be different cache block sizes for different operations. As Aaron points out, one may have one block size for Zicboz instructions, and another for Zicbom instructions (and maybe even another for Zicbop instructions, too).
- v1.0 of these specs requires that the block size be fixed across the system.
- The block size is discoverable via a to-be-specified standard discovery structure, but the thinking was that software at all privilege levels could access that information (though such information may be virtualized as necessary).
- Though not formally specified (yet), the working definition of "cache block" is the NAPOT region that a CMO instruction operates on, independent of the implemented cache line size; so in some sense, a cache block abstracts away cache lines (see the block-alignment sketch below the list). As I mentioned, that's a working definition, and there are of course issues that arise when there is a mismatch between the abstract block size and the actual line size (one of the items on the CMO TG's to-do list is to address some of these issues). Anyway, the key takeaway is that the terminology for block, line, granule, etc. should be well-defined, and the platform group and the CMO TG should align.
- Note that there is often a close relationship between the cache line size and the coherence granule size (I know, introducing a new term here), and though this may not be the right forum for the discussion, the architecture needs to understand whether different line sizes imply different coherence granule sizes (or more generally, whether one implies constraints on the other).
- Another related issue is the LR/SC reservation set size.
- Philosophical thought: trap-and-emulate allows one to commit a bunch of sins, and it may be pragmatic for certain platform definitions to be performant only for "common" implementations and merely compatible for other implementations. For example, the block size for CBO.ZERO could be defined to be 64B in a platform. On harts that support 64B zero writes, executing CBO.ZERO performs the operation without a trap; on harts that do not, executing CBO.ZERO traps to a handler that performs the equivalent of the 64B zero write (a rough trap-handler sketch follows below). (The trapping behavior can be controlled by xenvcfg.CBZE.)
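
To make the "NAPOT region" working definition concrete, here's a minimal C sketch: the block base is just the effective address with its low bits cleared to the block size, independent of the underlying line size. The 64B constant and the range-walk loop are purely illustrative; in practice the size would come from the discovery structure, and the loop body would issue the appropriate CMO instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative block size; in practice this comes from the to-be-specified
 * discovery structure and need not be 64B. Must be a power of two (NAPOT). */
#define CBOM_BLOCK_SIZE 64u

/* Base address of the NAPOT cache block containing addr: a CMO instruction
 * given addr operates on [base, base + CBOM_BLOCK_SIZE), regardless of the
 * implemented cache line size. */
static inline uintptr_t cache_block_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(CBOM_BLOCK_SIZE - 1);
}

/* Walk every block overlapping [buf, buf + len), block by block. */
static void cmo_flush_range(uintptr_t buf, size_t len)
{
    if (len == 0)
        return;
    uintptr_t end = buf + len;
    for (uintptr_t p = cache_block_base(buf); p < end; p += CBOM_BLOCK_SIZE) {
        /* A real implementation would issue cbo.flush on p here
         * (inline asm or a compiler intrinsic). */
        (void)p;
    }
}
```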
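
And here's a rough sketch of the trap-and-emulate idea for CBO.ZERO, assuming the handler has already read the rs1 value out of the saved register state. The encoding constant and function shape are my own illustration (please check against the current Zicboz draft), and the sketch deliberately glosses over the address-translation and permission checks a real handler would have to perform.

```c
#include <stdint.h>
#include <string.h>

/* Platform-defined CBO.ZERO block size (64B in the example platform). */
#define CBOZ_BLOCK_SIZE 64u

/* cbo.zero with the rs1 field zeroed out (imm=4, funct3=010, rd=0,
 * opcode=MISC-MEM); verify against the current Zicboz encoding. */
#define CBO_ZERO_MATCH 0x0040200fu
#define CBO_ZERO_MASK  (~(0x1fu << 15))   /* ignore the rs1 field */

/* Fragment of a hypothetical illegal-instruction handler: if the trapping
 * instruction is CBO.ZERO, emulate the 64B zero write and resume after it.
 * rs1_val is assumed to have been fetched from the saved register file;
 * address translation and permission checks are omitted from this sketch.
 * Returns 0 if handled, -1 otherwise. */
static int emulate_cbo_zero(uint32_t insn, uintptr_t rs1_val, uintptr_t *epc)
{
    if ((insn & CBO_ZERO_MASK) != CBO_ZERO_MATCH)
        return -1;                                  /* not CBO.ZERO */

    /* Zero the NAPOT block containing the effective address. */
    uintptr_t base = rs1_val & ~(uintptr_t)(CBOZ_BLOCK_SIZE - 1);
    memset((void *)base, 0, CBOZ_BLOCK_SIZE);

    *epc += 4;                                      /* skip the instruction */
    return 0;
}
```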
That was a bit more than I expected to write. Hope some of it is helpful....
Cheers,
David