Virtualization of "main memory" and "I/O" regions
Josh Scheid
The proposal appears to address the VSEE case, and only in the "I/O to memory" direction, and just for the channel 0 implicit strong ordering / no fence case. It's not clear to me that that is the only problem. A solution of this nature might be needed, but it seems best addressed as part of the current virt-mem TG activities surrounding attributes in PTEs (e.g., a "TSO" or "Channel0IO" attribute override).
Cross-device "fence io, io" gets ignored if the devices, or at least those pages, are memory, even with this addition. Same for mixed memory-I/O such as producer-consumer patterns that might explicitly think they want "fence w,o", "fence i,r", etc.
I think we can steer clear of the broader "SW portability across ordering models" problem, i.e., the problem of how to run SW that assumes a stronger ordering model and therefore omits particular FENCEs. It seems more tractable to focus on explicit FENCEs for now.
I think the general description of the problem is how to handle code written for one attribute type (I/O or regular memory R/W) being used with the other, either because the code is generic (e.g., a library) or because it is running small-"v" virtualized (in any mode other than M). Which SW should be certain of its use case and so correctly use the minimal fence set? Which SW should be conservative and explicitly specify both (e.g., always "IR" instead of "I" or "R")? Or should the architecture have per-page, per-level, or global controls to strengthen fences, or should the platform specify that implementations should not differentiate (i.e., always treat "I" or "R" as "IR")?
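As a sketch of the minimal-versus-conservative choice (the status-register pointer is hypothetical, and a GCC/Clang RISC-V target is assumed):

#include <stdint.h>

/* Driver code that is certain its target is device I/O can use the
 * minimal predecessor set: order the device input ("i") before later
 * memory reads. */
static inline uint32_t read_status_minimal(volatile uint32_t *status)
{
    uint32_t s = *status;                          /* MMIO status read */
    __asm__ volatile ("fence i, r" ::: "memory");
    return s;
}

/* Generic or library code that may be handed either a real I/O mapping
 * or a memory-backed (virtualized) one specifies both sets, so the
 * ordering holds either way. */
static inline uint32_t read_status_conservative(volatile uint32_t *status)
{
    uint32_t s = *status;
    __asm__ volatile ("fence ir, r" ::: "memory"); /* "IR" instead of just "I" */
    return s;
}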
For context, the flexibility of the FENCE arguments presents unique challenges to RISC-V as compared to other architectures. In the short term, I don't think we need to solve these problems any better than other architectures do. That certainly helps in bounding the "problem" definition.
-Josh
andrew@...
On Mon, Feb 8, 2021 at 8:38 AM John Ingalls <john.ingalls@...> wrote:
Good catch. I agree with the problem and solution. I'll at least ask, though: could the hypervisor host side-step this problem entirely by only "providing" relaxed-ordering I/O regions to the guest? I realize that this would have to omit the default strongly-ordered channels 0 (point-to-point) and 1 (global) from the Privileged ISA, and I suppose that guests could be written relying on that default behavior.
The existing Linux platform expects strongly ordered point-to-point I/O, so while convenient for the HW, I don't think it's a workable solution.
John Ingalls
Good catch. I agree with the problem and solution.
I'll at least ask, though: could the hypervisor host side-step this problem entirely by only "providing" relaxed-ordering I/O regions to the guest? I realize that this would have to omit the default strongly-ordered channels 0 (point-to-point) and 1 (global) from the Privileged ISA, and I suppose that guests could be written relying on that default behavior.
andrew@...
On Wed, Feb 3, 2021 at 1:52 PM David Kruckemyer <dkruckemyer@...> wrote:
Hi all,
The FENCE instruction exposes the difference between accesses to main memory and accesses to I/O devices via the predecessor and successor sets; however, the distinction is really only defined by PMAs, which describe "main memory" and "I/O" regions. So how does the architecture support virtualization of those regions so that the FENCE instruction behaves appropriately?
John Hauser and I spoke about this topic tonight. We thought it useful to divide this problem into two cases: first, where the I/O access is trapped and emulated by the hypervisor, and second, the case that you raise below. (You might've already reasoned through the first case, but we thought it would be helpful to others to make it explicit.)
If the I/O access is trapped and emulated, the concern is that the emulation code might take actions that don't honor the FENCEs used by the guest OS: e.g., accessing main memory instead of I/O. In this case, it suffices to place a full FENCE before and after the emulation routine (though full FENCEs might not be necessary in some cases). Since the hypervisor has control, this is trivial. The performance impact will be nonzero, but it will be dwarfed by the other overheads involved.
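A minimal sketch of that emulation path, assuming a hypothetical handler invoked after the guest's MMIO access traps (virtual_device_write is an assumed device-model helper, not an existing API):

#include <stdint.h>

/* Hypothetical device-model helper provided by the hypervisor. */
extern void virtual_device_write(uintptr_t guest_addr, uint64_t data);

/* Emulate a trapped guest MMIO store. Full fences bracket the emulation
 * so that any ordering the guest expected around the trapping access
 * still holds, even though the emulation itself uses ordinary memory
 * accesses. */
static void emulate_guest_mmio_store(uintptr_t guest_addr, uint64_t data)
{
    __asm__ volatile ("fence iorw, iorw" ::: "memory");  /* full fence before */
    virtual_device_write(guest_addr, data);
    __asm__ volatile ("fence iorw, iorw" ::: "memory");  /* full fence after  */
}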
Suppose, for example, that a hypervisor virtualizes the memory system for its guest OS, mapping some guest "I/O" regions to hypervisor "main memory" regions. The guest believes that portions of its address space are I/O, and executes the following sequence:
ST [x] // to guest I/O region, hypervisor main memory region
FENCE O,O
ST [y] // to guest I/O region, hypervisor I/O region
One could presume that, since the PMA for [x] indicates that [x] is "main memory," the store to [y] could be performed before the store to [x]. If ST [y] initiates a side-effect, like a DMA read, and the expectation is that ST [x] is observable to the side-effect, problems may ensue. Is it the responsibility of the hypervisor to ensure the correct behavior, e.g. trap and emulate accesses to [x], or something else? If so, how can a hypervisor reasonably handle all the various combinations of ordering efficiently?
Some RISC-V platforms (including the ones you're probably thinking about) require strongly ordered point-to-point I/O, in which case the FENCE won't even be present. This is a blessing disguised as a curse. The bad news is that we can't hook into the FENCE instruction. The good news is we can prune part of the design space a priori :)
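To illustrate why there is no FENCE to hook into: on such a platform, a driver can rely on channel 0 strong ordering and issue back-to-back MMIO stores with no fence at all (the register pointers below are hypothetical):

#include <stdint.h>

/* On a platform with strongly ordered point-to-point I/O (channel 0),
 * these two MMIO stores to the same device are observed in program
 * order with no FENCE between them. If a hypervisor silently backs
 * this page with main memory, they become ordinary memory stores with
 * no ordering guarantee between them as seen by the device, and there
 * is no FENCE for the hypervisor or hardware to strengthen. */
static inline void configure_then_start(volatile uint32_t *cfg_reg,
                                        volatile uint32_t *go_reg)
{
    *cfg_reg = 0x3;   /* device configuration write                          */
    *go_reg  = 0x1;   /* start command; must be seen after the config write  */
}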
Instead, we need to strengthen the memory-ordering constraints for certain memory accesses. Obviously, we could just trap them and insert FENCEs, but the overhead would probably be unacceptable. We think a viable alternative is to augment the hypervisor's G-stage translation scheme to add an option to treat accesses to some pages as I/O accesses for memory-ordering purposes. (We propose appropriating the G bit in the PTE for this purpose, since it's otherwise unused for G-stage translation, though there are other options. (Recall there are no free bits in Sv32 PTEs.))
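To make the PTE option concrete, here is a sketch of how a hypervisor might build such a G-stage leaf PTE, assuming the standard Sv39 encoding (G is bit 5, PPN starts at bit 10) and assuming this repurposing were adopted; it is only a proposal, not ratified architecture:

#include <stdint.h>

/* Standard RISC-V PTE flag bits. */
#define PTE_V   (1ULL << 0)
#define PTE_R   (1ULL << 1)
#define PTE_W   (1ULL << 2)
#define PTE_G   (1ULL << 5)   /* otherwise unused for G-stage translation */

/* Build a G-stage leaf PTE whose G bit marks the page as "treat guest
 * accesses as I/O for memory-ordering purposes", per the proposal.
 * Purely illustrative: no current hardware assigns this meaning. */
static inline uint64_t make_strongly_ordered_gstage_pte(uint64_t ppn)
{
    return (ppn << 10) | PTE_V | PTE_R | PTE_W | PTE_G;
}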
For the microarchitects in the audience, I'll briefly note that implementing this functionality is straightforward, since it need not be a super-high-performance path. It would suffice to turn all accesses to those pages into aqrl accesses. Since the aqrl property isn't known until after address translation, additional pipeline flushes will be necessary in some designs, but that's OK--it's still far faster than trapping the access. (And, needless to say, this property is usually highly predictable as a function of the program counter.)
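For intuition about the strengthening itself: conceptually, a naturally aligned 32-bit store to such a page could be performed as an AMO with both aq and rl set, which is ordered against all surrounding memory accesses. This is only a model of the intended ordering, not how an implementation would need to realize it:

#include <stdint.h>

/* Perform a 32-bit store with aqrl ordering by using amoswap.w.aqrl and
 * discarding the old value. Illustrative only: a real implementation
 * would strengthen the ordering of the plain store internally rather
 * than execute a different instruction. */
static inline void store_as_aqrl(volatile uint32_t *addr, uint32_t value)
{
    uint32_t old;
    __asm__ volatile ("amoswap.w.aqrl %0, %2, (%1)"
                      : "=&r"(old)
                      : "r"(addr), "r"(value)
                      : "memory");
    (void)old;
}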
In addition, what happens in the case that the "guest" is S-mode and the "hypervisor" is M-mode?
I think this case reasonably falls into the "don't do that" category, unless the machine has some other mechanism to provide strong ordering.
Perhaps there's an implied "don't do that" in the architecture; in which case, should there be, at a minimum, some commentary text to that effect?
Or perhaps there should be a hardware mechanism that "strengthens" fences?
As mentioned above, this option isn't sufficient because of strongly ordered I/O regions in RVWMO systems.
Thanks,
David
David Kruckemyer
Hi all,
The FENCE instruction exposes the difference between accesses to main memory and accesses to I/O devices via the predecessor and successor sets; however, the distinction is really only defined by PMAs, which describe "main memory" and "I/O" regions. So how does the architecture support virtualization of those regions so that the FENCE instruction behaves appropriately?
Suppose, for example, that a hypervisor virtualizes the memory system for its guest OS, mapping some guest "I/O" regions to hypervisor "main memory" regions. The guest believes that portions of its address space are I/O, and executes the following sequence:
ST [x] // to guest I/O region, hypervisor main memory region
FENCE O,O
ST [y] // to guest I/O region, hypervisor I/O region
One could presume that, since the PMA for [x] indicates that [x] is "main memory," the store to [y] could be performed before the store to [x]. If ST [y] initiates a side-effect, like a DMA read, and the expectation is that ST [x] is observable to the side-effect, problems may ensue. Is it the responsibility of the hypervisor to ensure the correct behavior, e.g. trap and emulate accesses to [x], or something else? If so, how can a hypervisor reasonably handle all the various combinations of ordering efficiently?
In addition, what happens in the case that the "guest" is S-mode and the "hypervisor" is M-mode?
Perhaps there's an implied "don't do that" in the architecture; in which case, should there be, at a minimum, some commentary text to that effect?
Or perhaps there should be a hardware mechanism that "strengthens" fences?
Thanks,
David