Re: Virtualization of "main memory" and "I/O" regions
andrew@...
On Wed, Feb 3, 2021 at 1:52 PM David Kruckemyer <dkruckemyer@...> wrote:
John Hauser and I spoke about this topic tonight. We thought it useful to divide this problem into two cases: first, where the I/O access is trapped and emulated by the hypervisor, and second, the case that you raise below. (You might've already reasoned through the first case, but we thought it would be helpful to others to make it explicit.) If the I/O access is trapped and emulated, the concern is that the emulation code might take actions that don't honor the FENCEs used by the guest OS: e.g., accessing main memory instead of I/O. In this case, it suffices to place a full FENCE before and after the emulation routine (though of course full FENCEs might not be necessary in some cases). Since the hypervisor has control, this is trivial. The performance impact will be nonzero, but it will be dwarfed by the other overheads involved in trap-and-emulate.
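To make the first case concrete, here's a minimal sketch of what bracketing the emulation routine with full FENCEs might look like. All the names (`emulate_mmio_store`, the device-model state) are hypothetical, not from any real hypervisor; on non-RISC-V hosts a compiler barrier stands in for the real `fence` instruction so the sketch compiles anywhere.

```c
#include <stdint.h>

/* Full two-way barrier. On RISC-V this is "fence iorw,iorw"; elsewhere a
 * compiler barrier stands in so the sketch still compiles. */
static inline void fence_full(void)
{
#if defined(__riscv)
    __asm__ volatile("fence iorw,iorw" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");
#endif
}

/* Stand-in for the device model's backing state: it lives in ordinary
 * cacheable main memory, not in a real I/O region. */
static uint32_t dev_reg;

/* Hypothetical emulation routine; note it accesses main memory. */
static void device_model_store(uint64_t addr, uint32_t val)
{
    (void)addr;
    dev_reg = val;
}

/* Bracket the emulation with full FENCEs so that the guest's I/O-ordering
 * expectations hold even though the emulation touches main memory. */
void emulate_mmio_store(uint64_t guest_addr, uint32_t val)
{
    fence_full();                        /* order all prior guest accesses */
    device_model_store(guest_addr, val);
    fence_full();                        /* order all subsequent accesses  */
}
```

As noted above, the two full FENCEs are the conservative choice; a hypervisor that knows more about the device model's behavior could weaken them.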
Some RISC-V platforms (including the ones you're probably thinking about) require strongly ordered point-to-point I/O, in which case the FENCE won't even be present. This is a blessing disguised as a curse. The bad news is that we can't hook into the FENCE instruction; the good news is we can prune part of the design space a priori :)

Instead, we need to strengthen the memory-ordering constraints for certain memory accesses. Obviously, we could just trap them and insert FENCEs, but the overhead would probably be unacceptable. We think a viable alternative is to augment the hypervisor's G-stage translation scheme with an option to treat accesses to some pages as I/O accesses for memory-ordering purposes. We propose appropriating the G bit in the PTE for this purpose, since it's otherwise unused for G-stage translation, though there are other options. (Recall there are no free bits in Sv32 PTEs.)

For the microarchitects in the audience, I'll briefly note that implementing this functionality is straightforward, since it need not be a super-high-performance path. It would suffice to turn all accesses to those pages into aqrl accesses. Since the aqrl property isn't known until after address translation, additional pipeline flushes will be necessary in some designs, but that's OK--it's still far faster than trapping the access. (And, needless to say, this property is usually highly predictable as a function of the program counter.)
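For clarity, here's a rough sketch of how a G-stage walker might consume the repurposed G bit. The struct and function names are invented for illustration; the bit positions and PPN placement follow the Sv39 leaf-PTE layout in the privileged spec, and the `order_as_io` flag stands for "perform this access with aq+rl semantics" downstream.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sv39 leaf-PTE bits (per the RISC-V privileged spec). */
#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_G (1ULL << 5)   /* unused for G-stage translation; repurposed
                               here to mean "order this page like I/O"   */

/* Hypothetical translation result. */
struct gstage_xlate {
    uint64_t paddr;
    bool     order_as_io;   /* if set, issue the access as aq+rl */
};

/* Given a valid Sv39 leaf G-stage PTE and a page offset, produce the
 * physical address and note whether the access must be strongly ordered. */
bool gstage_translate(uint64_t leaf_pte, uint64_t offset,
                      struct gstage_xlate *out)
{
    if (!(leaf_pte & PTE_V))
        return false;       /* real hardware would raise a guest-page fault */

    /* PPN occupies PTE bits 10 and up; shift it into a page address. */
    out->paddr = ((leaf_pte >> 10) << 12) | (offset & 0xfff);
    out->order_as_io = (leaf_pte & PTE_G) != 0;
    return true;
}
```

The point of the sketch is just that the "treat as I/O" decision falls out of the ordinary leaf-PTE check, which is why the hardware cost is low: the flag rides along with the translation, and only pages the hypervisor marks pay the aqrl penalty.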
I think this case reasonably falls into the "don't do that" category, unless the machine has some other mechanism to provide strong ordering.
As mentioned above, this option isn't sufficient because of strongly ordered I/O regions in RVWMO systems.