Are pages allowed to cross PMA regions?
Andres Amaya Garcia
Hello,
There is something unclear to me after reading the PMA section of the Privileged ISA manual (i.e. Section 3.6). Can a virtual page be mapped to physical addresses that cross PMA regions? For example, is it acceptable to map a 1GB page such that half its physical addresses have the (e.g.) cacheable attribute but the other half are uncacheable? You could ask this about every attribute: vacant, idempotent, etc. This sounds odd, but the ISA does not explicitly allow or forbid it. Is it something that must be supported? If so, are there example use-cases? Thanks for the help!
Andy Glew (Gmail) <andyglew@...>
I cannot say what the RISC-V rule is, but I can provide example use cases for similar issues from other architectures: (1) the legacy MMIO map; (2) a non-legacy MMIO map with larger and larger pages; (3) device vendors that wish to pack all of their device memory into a compact region; (4) security issues.

(1) For example, x86 has an extremely fragmented legacy MMIO map below 1 MB: some regions are 4K granular, some 16K, some 64K...; but OS vendors wanted to use a single large page, whether originally 4M/2M or eventually 1G, to map it, because big mappings reduced TLB pressure, and in particular because they might also want to access DRAM or ROM in that area efficiently. To deal with this, Intel x86 has the ability to "splinter" large TLB entries (2M/4M/1G/...) into smaller 4K entries, and the ability to indicate that subregions of a large TLB entry are not present. E.g. one might have a 4M TLB entry marked such that only [1M,4M) is valid, and accesses in [0,1M) must look up splintered entries in the 4K TLBs. This was done because many if not most or all Intel x86 implementations cache memory types (the things you get from PMAs) in the TLB. BTW, more and more I wish that I had not decided to store memory types in the TLB, since there was plenty of time to do an MTRR lookup on a cache miss. I'm not saying that RISC-V has to do this; I'm just describing a use case.

(1') If you allow such fragmentation of memory attributes, an implementation may choose to separate the TLBs used for translation from a protection lookaside buffer used for protection and memory attributes - call that a PLB, or perhaps an APLB, an attribute and protection lookaside buffer. TLB entries are quite big, since they require both physical and virtual addresses, whereas one may get away with only a few bits per granule, e.g. a 4K granule, with many such granules sharing the same APLB entry. The implementation can hide the APLB, behaving as if there is nothing except TLBs that the OS needs to manage.
(2) Though the RISC-V people may deprecate legacy memory map issues, the same issue arises even if it's not legacy...

(3) Another use case is less legacy-related: I/O device vendors sometimes want to constrain all of the physical memory addresses related to their devices to a single naturally aligned power-of-2 region, but they often have multiple different memory types for a single device. E.g. a GPU might want to have 1 GB or 16 GB of frame buffer memory, mapped something like write-combining, and a far smaller amount of active MMIO memory. E.g. given a base address B which is a multiple of a gigabyte, the I/O device vendor might want [B,B+1G-16K) mapped write-combining, optimized for the frame buffer, and [B+1G-16K,B+1G) mapped non-idempotent uncacheable. There is much less need for this nowadays, since PCI now allows I/O devices to declare a list of their memory requirements, e.g. 1G WC and 16K UC in the example above. PCI then allows the physical addresses associated with the I/O device to be changed, so that the WC memory from this device and others is nicely aligned, as is the MMIO UC. However, not everybody likes the idea of physical addresses being able to change. Moreover, bus bridges between different physical address widths may prefer not to waste physical address ranges.

(4) If you wish to legislate that virtual memory translations cannot cross PMA boundaries, the question is how you enforce it. If the operating system or hypervisor that controls the virtual memory translations is the most privileged software in the system, you can probably do this, risking mainly accidental bugs. However, quite a few secure systems have privilege domains that are more privileged than the operating system or hypervisor, but which do not want to manage the virtual memory translations; rather, they want to allow the operating system or hypervisor to control the page tables as much as possible, for performance reasons. But then, if the operating system or hypervisor has allowed a large-page translation to cross PMA boundaries, there is a correctness problem: it must at least be trapped, and possibly emulated if it is to be transparent.
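[Editor's note: to make (4) concrete, here is a minimal C sketch of the check such a monitor might perform, assuming a hypothetical platform-specific pma_table of sorted, non-overlapping regions. PMA discovery is platform-defined, so every name here is illustrative, not from any real interface.]

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for one PMA region [base, base+size).
 * Real platforms define their PMA layout however they like. */
struct pma_region {
    uint64_t base;
    uint64_t size;
    uint32_t attrs;
};

/* Platform-provided table, assumed sorted and non-overlapping. */
extern const struct pma_region pma_table[];
extern const size_t pma_table_len;

/* True if the physical range [pa, pa+len) lies entirely inside one
 * PMA region, i.e. a page mapping of that range would not straddle a
 * PMA boundary. A monitor enforcing a "no straddling" policy could
 * run this when validating a (large-)page PTE installed by the OS. */
static bool range_in_one_pma_region(uint64_t pa, uint64_t len)
{
    for (size_t i = 0; i < pma_table_len; i++) {
        const struct pma_region *r = &pma_table[i];
        if (len <= r->size && pa >= r->base &&
            pa - r->base <= r->size - len)
            return true;   /* whole range fits in this region */
    }
    return false;
}
```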
From "andres.amaya via lists.riscv.org" <andres.amaya=codasip.com@...>
Date 8/12/2022 07:10:40
Subject [RISC-V] [tech-privileged] Are pages allowed to cross PMA regions? Hello, |
|
Greg Favor
The PMA architecture allows a lot of implementation flexibility - including, for example, having small 4B regions. In that example one could easily have one 4KB page overlap multiple PMA regions.

Conversely, in a typical OS-A class system using demand-paged virtual memory, the implementor will probably choose to have a minimum 4KB granularity to PMA regions. Although this still allows 2MB, 1GB, and 512GB pages to overlap multiple PMA regions. (Which in typical TLB implementations leads to what some would call "atomization" of page mappings into smaller TLB entry mappings.)

In short, if a page overlaps multiple regions, then that needs to be handled properly. Typically any given load/store/ifetch/implicit access that is being checked will fall in one page and in one PMA region - in which case the behavior is obvious. But if that access straddles multiple pages and/or PMA regions, then each byte of the access must pass its MMU and PMA checks for the whole access to be allowed.
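[Editor's note: as a reading aid only (not a description of any particular hardware), Greg's per-byte rule can be modeled as a loop over the bytes of the access. mmu_check_byte, pma_check_byte, and translate are assumed stand-ins for the implementation's actual logic.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed stand-ins for the implementation's MMU and PMA checks. */
extern bool mmu_check_byte(uint64_t va, int access_type);
extern bool pma_check_byte(uint64_t pa, int access_type);
extern uint64_t translate(uint64_t va);   /* VA -> PA */

/* The whole access is allowed only if every byte individually passes
 * both its MMU check and its PMA check. */
static bool access_allowed(uint64_t va, uint64_t len, int access_type)
{
    for (uint64_t i = 0; i < len; i++) {
        uint64_t pa = translate(va + i);
        if (!mmu_check_byte(va + i, access_type) ||
            !pma_check_byte(pa, access_type))
            return false;
    }
    return true;
}
```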
Krste Asanovic <krste@...>

On Fri, 12 Aug 2022 10:35:15 -0700, "Greg Favor" <gfavor@...> said:

| Conversely, in a typical OS-A class system using demand-paged virtual memory, the implementor will probably choose to have a minimum 4KB granularity to PMA regions.

Even in a RISC-V OS-A platform, the implementor might be stuck with using IP peripherals where PMAs vary at sub-page granularity.

| In short, if a page overlaps multiple regions, then that needs to be handled properly. [...] each byte of the access must pass its MMU and PMA checks for the whole access to be allowed.

Yes. We have some text for this in some places, but these concepts should really be factored out somewhere central.

Krste
Andy Glew (Gmail) <andyglew@...>
It would be nice if it were architecturally defined/permitted for such straddling accesses to be performed a byte at a time. That would make the trap-and-emulate handler easier to code.

If not a byte at a time, then in the largest possible NAPOT sizes that the access can be decomposed into.

But anything coarser-grained than a byte (or than whatever the finest PMA granule is) either requires the trap-and-emulate handler to probe permissions to guarantee that the transactions it emits are not themselves straddling, or you have to be ready to handle such traps and emulations nested, or at least tail-recursively.
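[Editor's note: a minimal sketch of the NAPOT decomposition Andy describes, using nothing beyond standard C. Each emitted chunk is naturally aligned and power-of-two sized, so it cannot itself straddle any boundary coarser than its own size.]

```c
#include <stdint.h>
#include <stdio.h>

/* Decompose [addr, addr+len) into the largest naturally aligned
 * power-of-two (NAPOT) chunks. */
static void emit_napot_chunks(uint64_t addr, uint64_t len)
{
    while (len > 0) {
        /* Largest power of two dividing the current address;
         * for addr == 0 any size is aligned, so cap at 2^63. */
        uint64_t align = addr ? (addr & -addr) : ((uint64_t)1 << 63);
        /* Shrink to the largest power of two not exceeding len;
         * smaller powers of two still divide the address. */
        uint64_t chunk = align;
        while (chunk > len)
            chunk >>= 1;
        printf("access [%#llx, %#llx)\n",
               (unsigned long long)addr,
               (unsigned long long)(addr + chunk));
        addr += chunk;
        len -= chunk;
    }
}
```

For example, emit_napot_chunks(0x1003, 6) emits a 1-byte access at 0x1003, a 4-byte access at 0x1004, and a 1-byte access at 0x1008.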
Greg Favor
That could be OK for accesses to idempotent memory, but it would likely be problematic for a non-idempotent location (e.g. a memory-mapped I/O register), and byte accesses to a word MMIO register might not even be allowed by the PMAs for that location.
Allen Baum <allen.baum@...>

There are at least 3 potential boundaries: MMU pages, PMP regions, and PMA regions. All bytes of an access must be contained within a single PMP region. The operative word there is "access", because a misaligned load/store may be (and typically is) split into two separate accesses. Ordering of those accesses is not spec'ed, so it's possible to get various exceptions with either the lower or upper part of the load/store (or both). When that happens on a store, the trap may occur after either the low or high half has been written (non-deterministically even, so it's a bear to test).

I don't know if that specific rule applies to PMAs or MMU page crossings, but if a misaligned access is split into two (or more, eventually) accesses that don't cross a boundary, then it's moot; you treat them individually. That split is hard to avoid. But an implementation isn't required to split a misaligned access, and outside of the PMP spec I don't think that case is mentioned. An implementation is free to always trap on a misaligned access and perform it byte-by-byte (while ensuring no interrupt can occur in the middle, lest someone see a stale value). I believe it is also legal to handle it entirely in HW except when it crosses various boundaries (e.g. cacheline, page, etc.), and signal a misaligned exception if it does. Or even signal a misaligned exception depending on the phase of the moon (or other non-architectural state).

Personally, I'd be really happy if we could tighten those rules up a lot.
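[Editor's note: for illustration only, a sketch of the byte-by-byte trap-and-emulate path Allen mentions. The helper names are hypothetical, not from any real SBI or firmware interface; it relies on the fact that M-mode trap entry clears mstatus.MIE, so no interrupt on this hart can observe a half-written value. Other observers, e.g. other harts, are a separate problem.]

```c
#include <stdint.h>

/* Hypothetical firmware helpers. */
extern uint64_t translate_for_store(uint64_t va);     /* walks tables */
extern void store_byte_phys(uint64_t pa, uint8_t b);  /* checked store */

/* Emulate a misaligned store byte-by-byte from an M-mode trap handler.
 * Trap entry has already cleared mstatus.MIE, so this hart cannot take
 * an interrupt between the partial stores. Byte order assumes the
 * usual little-endian RISC-V memory system. Each byte is translated
 * separately, since the access may also cross a page boundary. */
static void emulate_misaligned_store(uint64_t va, uint64_t val, int len)
{
    for (int i = 0; i < len; i++) {
        uint64_t pa = translate_for_store(va + i);
        store_byte_phys(pa, (uint8_t)(val >> (8 * i)));
    }
}
```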
>In particular, a portion of a misaligned store that passes the PMP check may become visible, even if another portion fails the PMP check I had no idea this was in the spec - so I'm glad you added that comment Allen. yes - between MMU pages, PMP regions and PMA regions it's all pretty complex. In systems with an MMU do people typically also implement the PMP? And if so why? As the granularity of PMA and PMP regions are implementation defined - I'm wondering if a nice simplification would be to specify them both with 64-byte granularity, and 64-byte alignment to match the cache-block size for the CMOs. At least then the PMAs can't cross the boundary of a TLB page. Tariq On Sat, 13 Aug 2022 at 09:02, Allen Baum <allen.baum@...> wrote:
--
Tariq Kurd | Lead IP Architect | Codasip UK Design Centre | www.codasip.com |
Krste Asanovic <krste@...>

On Mon, 15 Aug 2022 10:14:59 +0200, Tariq Kurd <tariq.kurd@...> said:

| In systems with an MMU, do people typically also implement the PMP? And if so, why?

Yes. To contain <M-mode code running on the hart (including implicit references such as page-table walkers). M-mode+PMP can provide a monitor that isolates and multiplexes multiple S-mode stacks, as in the Keystone enclave work.

| As the granularity of PMA and PMP regions is implementation-defined, I'm wondering if a nice simplification would be to specify them both with 64-byte granularity and 64-byte alignment, to match the cache-block size for the CMOs.

For TLBs, the important simplification is that PMP/PMA regions aren't <4KiB in granularity, as then existing TLB entries can be used to cache permissions. Having PMP/PMA granules larger than a page is fine, as these would only be checked on a TLB miss. If <page, then the easiest solution is to not cache those regions in the TLB, forcing a TLB miss+check on every access, for example. Of course, other alternative microarch schemes are possible.

Krste
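[Editor's note: one way to picture Krste's TLB-fill rule, as a sketch with hypothetical pmp_region_of/pma_region_of lookups. It assumes regions are contiguous address ranges, so checking the first and last byte of the frame suffices.]

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Hypothetical match logic: index of the region containing pa,
 * or -1 if no region matches. */
extern int pmp_region_of(uint64_t pa);
extern int pma_region_of(uint64_t pa);

/* At TLB-fill time: if the whole 4KiB frame falls in one PMP region
 * and one PMA region, their attributes can be folded into the TLB
 * entry; otherwise don't cache the translation, forcing a miss (and
 * a fresh check) on every access to that page. */
static bool frame_cacheable_in_tlb(uint64_t frame_pa)
{
    int pmp = pmp_region_of(frame_pa);
    int pma = pma_region_of(frame_pa);
    return pmp >= 0 && pma >= 0 &&
           pmp_region_of(frame_pa + PAGE_SIZE - 1) == pmp &&
           pma_region_of(frame_pa + PAGE_SIZE - 1) == pma;
}
```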
>For TLBs, the important simplification is PMP/PMA aren't <4KiB in >granularity, as then existing TLB entires can be used to cache >permissions. Yes - this makes a lot of sense. What about the case where the software updates the PMP entries though? This would then require an sfence.vma to clear the micro-TLBs as the PMP permissions may be out-of-date. The architecture doesn't require this, so can we add this requirement? How is this typically done? Tariq On Tue, 16 Aug 2022 at 00:41, <krste@...> wrote:
>For TLBs, the important simplification is PMP/PMA aren't <4KiB in >granularity, as then existing TLB entires can be used to cache >permissions. Yes - this makes a lot of sense. What about the case where the software updates the PMP entries though? This would then require an sfence.vma to clear the micro-TLBs as the PMP permissions may be out-of-date. The architecture doesn't require this, so can we add this requirement? How is this typically done? I've found this text now, so please disregard my previous email: "Hence, when the PMP settings are modified, M-mode software must synchronize the PMP settings with the virtual memory system and any PMP or address-translation caches. This is accomplished by executing an SFENCE.VMA instruction with rs1=x0 and rs2=x0, after the PMP CSRs are written." Thanks Tariq
Andres Amaya Garcia
Thank you all for the valuable input! Once again, thanks for the help!