Re: Virtualization of "main memory" and "I/O" regions

John Ingalls

Good catch.  I agree with the problem and solution.
I'll at least ask, though: could the hypervisor host side-step this problem entirely by only "providing" relaxed-ordering I/O regions to the guest?  I realize that this would have to omit the default strongly-ordered regions 0 (point-to-point) and 1 (global) from the Privileged ISA, and I suppose that guests could be written relying on that default behavior.

On Mon, Feb 8, 2021 at 1:04 AM Andrew Waterman <andrew@...> wrote:

On Wed, Feb 3, 2021 at 1:52 PM David Kruckemyer <dkruckemyer@...> wrote:
Hi all,

The FENCE instruction exposes the difference between accesses to main memory and accesses to I/O devices via the predecessor and successor sets; however, the distinction is really only defined by PMAs, which describe "main memory" and "I/O" regions. So how does the architecture support virtualization of those regions so that the FENCE instruction behaves appropriately?

John Hauser and I spoke about this topic tonight.  We thought it useful to divide this problem into two cases: first, where the I/O access is trapped and emulated by the hypervisor, and second, the case that you raise below.  (You might've already reasoned through the first case, but we thought it would be helpful to others to make it explicit.)

If the I/O access is trapped and emulated, the concern is that the emulation code might take actions that don't honor the FENCEs used by the guest OS: e.g., accessing main memory instead of I/O.  In this case, it suffices to place a full FENCE before and after the emulation routine (though full FENCEs might not be necessary in some cases).  Since the hypervisor has control, this is trivial.  The performance impact will be real, but it will be dwarfed by the other overheads involved.

Suppose, for example, that a hypervisor virtualizes the memory system for its guest OS, mapping some guest "I/O" regions to hypervisor "main memory" regions. The guest believes that portions of its address space are I/O, and executes the following sequence:

ST [x] // to guest I/O region, hypervisor main memory region
ST [y] // to guest I/O region, hypervisor I/O region

One could presume that, since the PMA for [x] indicates that [x] is "main memory," the store to [y] may be performed before the store to [x]. If ST [y] initiates a side effect, like a DMA read, and the expectation is that ST [x] is visible to that side effect, problems may ensue. Is it the responsibility of the hypervisor to ensure correct behavior, e.g., trap and emulate accesses to [x], or something else? If so, how can a hypervisor reasonably handle all the various ordering combinations efficiently?

Some RISC-V platforms (including the ones you're probably thinking about) require strongly ordered point-to-point I/O, in which case the FENCE won't even be present.  This is a blessing disguised as a curse.  The bad news is that we can't hook into the FENCE instruction.  The good news is we can prune part of the design space a priori :)

Instead, we need to strengthen the memory-ordering constraints for certain memory accesses.  Obviously, we could just trap them and insert FENCEs, but the overhead would probably be unacceptable.  We think a viable alternative is to augment the hypervisor's G-stage translation scheme to add an option to treat accesses to some pages as I/O accesses for memory-ordering purposes.  (We propose appropriating the G bit in the PTE for this purpose, since it's otherwise unused for G-stage translation, though there are other options.  Recall there are no free bits in Sv32 PTEs.)

For the microarchitects in the audience, I'll briefly note that implementing this functionality is straightforward, since it need not be a super-high-performance path.  It would suffice to turn all accesses to those pages into aqrl accesses.  Since the aqrl property isn't known until after address translation, additional pipeline flushes will be necessary in some designs, but that's OK--it's still far faster than trapping the access.  (And, needless to say, this property is usually highly predictable as a function of the program counter.)

In addition, what happens in the case that the "guest" is S-mode and the "hypervisor" is M-mode?

I think this case reasonably falls into the "don't do that" category, unless the machine has some other mechanism to provide strong ordering.

Perhaps there's an implied "don't do that" in the architecture; in which case, should there be, at a minimum, some commentary text to that effect?

Or perhaps there should be a hardware mechanism that "strengthens" fences?

As mentioned above, this option isn't sufficient because of strongly ordered I/O regions in RVWMO systems.

