Proposal: Supervisor Timer CSR and Virtual Supervisor Timer CSR


Siqi Zhao
 

Hi Everyone,

 

This is an updated version of our previous proposal on the clock source and clock event source. We have aligned our ideas with the latest hypervisor extension specs, removed the redundant parts, and uncovered some issues that arise if this proposal is implemented on top of the current design. The attached document gives an analysis together with the proposed new CSRs.

 

BTW, we have learned recently that there is ongoing work on an improved, virtualization-aware version of the PLIC. Are there any documents available?

 

Regards,

Siqi


Greg Favor
 

I haven't gone through all of sections 3.1 and 3.2 yet, but it seems like 3.1 starts off on the wrong foot.  It states that "the current RISC-V spec states that vsie.STIE is an alias of hie.VSTIE".  I believe this (and the subsequent observations) are flawed for a combination of reasons.

First, the Priv spec says "When bit 6 of hideleg is zero, vsip.STIP and vsie.STIE are read-only zeros. Else, vsip.STIP and vsie.STIE are aliases of hip.VSTIP and hie.VSTIE."   Not quite the same as vsie.STIE and hie.VSTIE always being aliases of each other.  This will matter down below.

Second, note that the Priv spec says, regarding vsie.STIE, that:
"The vsip and vsie registers are VSXLEN-bit read/write registers that are VS-mode’s versions of supervisor CSRs sip and sie, formatted as shown in Figures 5.22 and 5.23 respectively. When V=1, vsip and vsie substitute for the usual sip and sie, so instructions that normally read or modify sip/sie actually access vsip/vsie instead."
In other words, vsie.STIE substitutes for sie.STIE when (and only when) V=1.
And sie.STIE is not an alias or anything else of hie.VSTIE.  The spec says that "the nonzero bits in sie and hie are always mutually exclusive".

Thirdly, note that hie.VSTIE is the enable for hip.VSTIP and hip.VSTIP "is the logical-OR of hvip.VSTIP and any other platform-specific timer interrupt signal directed to VS-level".  A vstimecmp interrupt would fall in that latter category.

Fourthly, when hideleg bit 6 is zero, then hip.VSTIP=1 is directed to be serviced by HS-mode.  But when hideleg bit 6 is one, then hip.VSTIP=1 is directed to be serviced by VS-mode - and then (and only then) is vsip.STIP an alias of hip.VSTIP (and vsie.STIE an alias of hie.VSTIE).

So, in this latter case, since a vstimecmp interrupt factors into hip.VSTIP, it then also factors into vsip.STIP.  And when the vstimecmp interrupt is recognized, it is taken in VS-mode and not in HS-mode (i.e. not into the hypervisor) due to hideleg.

Conversely, if hideleg directs the interrupt to the hypervisor, then vsie.STIE and vsip.STIP are not aliases of hie.VSTIE and hip.VSTIP.  And hence clearing hie.VSTIE does not affect vsie.STIE (vsie.STIE, in fact, is a read-only zero).

So it seems like the stated problem case in section 3.1 cannot arise.
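
The aliasing rule quoted above can be written down as a small behavioural model. This is only a sketch in Python, not text from the Priv spec or the proposal: the function names are made up for illustration, and the "platform-specific timer signal" argument stands in for a vstimecmp match as suggested above.

```python
VSTI = 1 << 6  # bit 6: hideleg delegation bit, hip.VSTIP and hie.VSTIE


def hip_vstip(hvip: int, platform_vs_timer: bool) -> int:
    """hip.VSTIP: logical OR of hvip.VSTIP and any platform VS-level timer signal."""
    return 1 if (hvip & VSTI) or platform_vs_timer else 0


def read_vsie_stie(hideleg: int, hie: int) -> int:
    """vsie.STIE: read-only zero unless hideleg bit 6 is set, else alias of hie.VSTIE."""
    if not hideleg & VSTI:
        return 0
    return 1 if hie & VSTI else 0


def read_vsip_stip(hideleg: int, hip: int) -> int:
    """vsip.STIP: read-only zero unless hideleg bit 6 is set, else alias of hip.VSTIP."""
    if not hideleg & VSTI:
        return 0
    return 1 if hip & VSTI else 0
```

With hideleg bit 6 clear, both reads return zero regardless of hie and hip, which is the case the section 3.1 scenario overlooks.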

Greg


On Wed, Sep 2, 2020 at 8:24 PM zhaosiqi (A) via lists.riscv.org <zhaosiqi3=huawei.com@...> wrote:



Greg Favor
 

A few other small nits:

- Besides by S-mode and M-mode, the stimecmp CSR is accessible by HS-mode; and is accessible by VS-mode as well (during which the vstimecmp CSR contents substitute for stimecmp contents).

- The vstimecmp CSR is indirectly read/write accessible during VS-mode (as well as directly during M-mode and HS-mode).

- The *stimecmp CSRs are shown as having bits [11:0], and those bits are shown as TBD.  They should instead have the same format as mtimecmp, i.e. a full 64-bit unsigned value.

- Section 2.2 mistakenly says that "instructions that access stimecmp when V==0 access vstimecmp instead".  That should be for when V==1.

- From M/S/HS/VS modes, access to stimecmp does not trap.  But it does trap from U/VU modes.

- From M/S/HS modes, access to vstimecmp does not trap.  But it does trap from VS/U/VU modes.

- Lastly, I believe what section 4 asks for is already provided via hideleg bit 6.
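
The access rules in the nits above can be summarised as a lookup. The following Python sketch is illustrative only; the `access` function and the mode strings are assumptions made for this example, and the CSRs themselves exist only in the proposal.

```python
# Modes from which an access does not trap, as listed in the nits above.
STIMECMP_OK = {"M", "S", "HS", "VS"}
VSTIMECMP_OK = {"M", "S", "HS"}


def access(csr: str, mode: str) -> str:
    """Return the name of the CSR actually accessed, or "trap"."""
    if csr == "stimecmp":
        if mode not in STIMECMP_OK:
            return "trap"
        # When V=1, vstimecmp substitutes for stimecmp.
        return "vstimecmp" if mode == "VS" else "stimecmp"
    if csr == "vstimecmp":
        return "vstimecmp" if mode in VSTIMECMP_OK else "trap"
    raise ValueError(f"unknown CSR: {csr}")
```

For example, `access("stimecmp", "VS")` yields `"vstimecmp"`, while `access("vstimecmp", "VS")` yields `"trap"`.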

Greg






John Hauser
 

zhaosiqi (Siqi) wrote:
> BTW, we have learned recently that there is an on-going work on an
> improved version of the PLIC which is virtualization-aware, is there
> any documents available?

There is an informal group that is working on a proposal for a RISC-V
"Advanced Interrupt Architecture", which includes replacements for
the current PLIC. No document is available yet because the group's
proposal isn't complete yet; the document is still being written. As
soon as the authors are satisfied with it, the proposal will be shared
with everyone so it can begin receiving wider evaluation and feedback.
I expect that will happen around the end of 2020 or early in 2021.

(Yes, I know, everybody wants it to be sooner. It's already being done
as fast as time allows.)

Regards,

- John Hauser


Phil McCoy
 

One of the main reasons for making mtime/mtimecmp memory-mapped rather than CSRs is to support systems where the CPU(s) do not run at a constant clock frequency.  In such systems, the mtime counter must reside in a different clock domain (and often voltage domain) from the CPU.  If the *timecmp registers are implemented as CSRs, the 64-bit compare value must be passed from the CPU clock domain to the mtime clock domain, which is a costly overhead for the CPU design.  (The alternative of passing the mtime value into the CPU clock domain is much worse for power, since it changes much more frequently than the compare value.)

Another alternative is for the CPU to translate the CSR write instruction into a memory write.  This is impractical for a CPU IP, because the CPU designers do not know the SoC memory map (which might vary between implementations/customers, or even be programmable at runtime).  There is also the overhead of handling interactions with PMP/PMA, etc.

How feasible is it to generate timer-interrupts for VS-mode software from a counter in the CPU clock domain?  The hypervisor could apply an appropriate scaling factor to align with wall-clock time, perhaps using an htimescale register somewhat analogous to htimedelta.
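
As a rough illustration of the scaling idea, here is a Python sketch of deriving a guest-visible time value from a CPU-domain cycle counter. Everything about it is an assumption: the thread does not define htimescale, so the 32.32 fixed-point format, `make_scale` and `scaled_time` are hypothetical.

```python
FRAC_BITS = 32          # assumed fixed-point format; not defined anywhere in the thread
MASK64 = (1 << 64) - 1


def make_scale(timebase_hz: int, cpu_hz: int) -> int:
    """Fixed-point ratio of timebase ticks per CPU cycle (hypothetical htimescale value)."""
    return (timebase_hz << FRAC_BITS) // cpu_hz


def scaled_time(cycles: int, scale: int, htimedelta: int = 0) -> int:
    """Guest-visible time: scaled cycle count plus htimedelta, wrapped to 64 bits."""
    return (((cycles * scale) >> FRAC_BITS) + htimedelta) & MASK64
```

A scale computed for one CPU frequency is only correct while the CPU stays at that frequency: with a scale built for 1 GHz, one second of cycles at 500 MHz is reported as roughly half a second of timebase ticks.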


Phil McCoy
 

Another option is to create memory-mapped registers for stimecmp/vstimecmp at addresses that are accessible to S-mode/VS-mode software.


Allen Baum
 

Scale factors don't work if the clock is constantly changing, e.g. under DVFS.

On Wed, Sep 9, 2020 at 10:36 AM <pnm@...> wrote:
Another option is to create memory-mapped registers for stimecmp/vstimecmp at addresses that are accessible to S-mode/VS-mode software.


Phil McCoy
 

Fair enough.  I was thinking of a system where the DVFS would be under the control of M-mode software, but in some chips it could be done by a more autonomous DVFS controller of some sort.

Sounds like memory-mapped stimecmp/vstimecmp registers might be the best solution.  I don't think anything in the existing spec would prevent this, but it would be nice to have it standardized (at least to the extent that mtime/mtimecmp are standardized...)

Cheers,
Phil


Siqi Zhao
 

Hi Greg,

Thanks for the comments.

It seems the proposal is not explicit enough about the type of interrupt. To answer your question: with vstimecmp, there is in fact a new type of interrupt. It is triggered by vstimecmp but received when the hart is at V==0, and might be called SGTI (Supervisor Guest Timer Interrupt, in the same spirit as SGEI). The proposal didn't really distinguish this interrupt from VSTI. With this new interrupt type, this is how things conceptually work: the HS-mode code first receives an SGTI triggered by vstimecmp and, as a consequence, generates a VSTI for VS-mode to handle.

With the current specs, hip.VSTIP is used to represent the pending state for both VSTI and SGTI, which can be made to work as shown in the document, since once SGTI is pending, the natural next step is to make a VSTI pending for the guest. What causes the issue is that the enable bits are also shared in the existing specs. I believe your comments were prompted by another shared aspect, the delegation bits. The current specs provide only bit 6 in hideleg, which can't be used to control delegation for two types of interrupt.

A better solution might be to introduce a new interrupt type called SGTI, with corresponding pending and delegation bits. M-mode delegates SGTI to HS-mode. HS-mode still delegates VSTI to VS-mode. When an interrupt is triggered by vstimecmp, the hart sets SGTI pending and generates a trap. The hypervisor then sets VSTI pending and runs the vCPU.

Hope this explains things.

Regards,
Siqi


Greg Favor
 

Siqi,

Thanks for your clarification.  If I understand correctly, your goal (analogous to SGEI as you mention) is to enable a pending virtual timer interrupt, even for a VM not currently context switched in, to cause an interrupt to the hypervisor so that it can context switch in that VM.  (As you recognize, for virtual external interrupts this is based on the hgeip / hgeie registers and hip.SGEIP / hie.SGEIE register bits.)

In contrast, for the VM currently switched in, a virtual timer interrupt can be directly reflected - based on hideleg - in either hip.VSTIP or vsip.STIP, i.e. the hypervisor can choose whether the pending interrupt goes to itself or directly to the guest.  (In the latter case the hypervisor can also poll hip.VSTIP.)  There's no need for an SGEI-like mechanism here.

The problem is that while a VM is context switched out, its vstimecmp is also switched out.  Or, put differently, if there are N VMs currently assigned to a hart and one wants hardware to inform the hypervisor when a VM's virtual timer "fires", then there would need to be N vstimecmp registers, N associated htimedelta registers, N time comparators, and one would need the equivalent of the hgeip/hgeie registers (i.e. hgtip/hgtie), and hip.SGTIP and hie.SGTIE bits.  In essence, the full analogue of what the H-extension provides for virtual external interrupts.

This obviously is a lot more expensive than the analogous support for virtual external interrupts.  One can also observe that other architectures (using ARMv8 as an example) don't provide hardware support for sending virtual timer interrupts to the hypervisor for VMs not currently context-switched in.  Instead the hypervisor can keep track of the stimecmp values for its VMs, recognize when one of them "fires", and context switch in that VM.  (I can't say specifically how ARMv8 hypervisors deal with this, but one could imagine a scheme similar to what RISC-V imagines for multiplexing OS-level timers onto the one hardware M-mode timer.)
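
One way to picture the software multiplexing described above is a priority queue of per-VM deadlines. The Python sketch below is an illustration only, not how any existing hypervisor does it; `VTimerQueue` and its methods are hypothetical names, and deadlines are assumed to be already expressed in host time.

```python
import heapq


class VTimerQueue:
    """Per-hart queue of timer deadlines for VMs that are switched out."""

    def __init__(self):
        self._heap = []  # entries are (host_deadline, vm_id)

    def arm(self, vm_id, host_deadline):
        """Record the deadline of a VM that is being switched out."""
        heapq.heappush(self._heap, (host_deadline, vm_id))

    def next_deadline(self):
        """Earliest deadline, i.e. the value to program into the one hardware timer."""
        return self._heap[0][0] if self._heap else None

    def expired(self, now):
        """Pop and return the VMs whose timers have fired, earliest first."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            fired.append(heapq.heappop(self._heap)[1])
        return fired
```

On each hypervisor timer interrupt, the handler would call `expired(now)`, mark those VMs as having a pending virtual timer interrupt, and reprogram the hardware timer with `next_deadline()`.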

Am I addressing your intended goal or do you have in mind a different goal behind your suggested arch changes?

Greg

On Thu, Sep 10, 2020 at 12:14 AM zhaosiqi (A) via lists.riscv.org <zhaosiqi3=huawei.com@...> wrote:


Siqi Zhao
 

Hi Greg,

I understand your concern. In fact, the goal is not to provide more than one vstimecmp CSR.

You are right that when a VM is context switched out, its vstimecmp CSR is also switched out. However, there are times when the VM, or a vCPU, is not executing but its CSR context is still 'bound' to a hart. In other words, the vs- CSRs (including vstimecmp) still contain values for that vCPU while the hart is executing hypervisor code. For example, when the hypervisor is handling an exception caused by the VM, such as a guest page fault, there is no need to switch the vCPU context out.

During this time, since vstimecmp still contains the value for that vCPU, it can fire interrupts. When such an interrupt is received by the hart, it should obviously be the hypervisor that handles it, hence the VGTI. Without VGTI, this interrupt will be delayed until the hypervisor returns to the vCPU. This delay can be long because the hypervisor may not immediately return to the vCPU, e.g. it may well decide that the vCPU needs to yield and schedule something else. Of course, the hypervisor can check the value in vstimecmp at strategic points and make the proper decision, but that makes the software more complex because those checks might be tricky to insert.

So VGTI is purely for the prompt handling of the vstimecmp interrupt when the vCPU is still bound to a hart but not executing.

Lastly, when the hypervisor finally decides to switch a vCPU out, vstimecmp gets saved and the hypervisor uses its own timer, which goes to stimecmp, to track the timer set by that vCPU. In this way the single vstimecmp is multiplexed among vCPUs, and there is no need for more than one vstimecmp.
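
The switch-out step just described might look roughly like the Python sketch below. It assumes that vstimecmp is compared against the guest's view of time (host time plus htimedelta), so a guest deadline corresponds to the host deadline vstimecmp minus htimedelta. The dictionaries standing in for CSRs and vCPU state, and the function names, are hypothetical.

```python
MASK64 = (1 << 64) - 1


def guest_deadline_to_host(vstimecmp: int, htimedelta: int) -> int:
    """Host time at which the guest deadline is reached (64-bit wraparound)."""
    return (vstimecmp - htimedelta) & MASK64


def switch_out(vcpu: dict, csrs: dict, hyp_deadline: int) -> int:
    """Save the vCPU's vstimecmp and track its deadline with the hypervisor's stimecmp."""
    vcpu["vstimecmp"] = csrs["vstimecmp"]
    host_deadline = guest_deadline_to_host(vcpu["vstimecmp"], vcpu["htimedelta"])
    # The hypervisor's own timer must fire at whichever deadline comes first.
    csrs["stimecmp"] = min(hyp_deadline, host_deadline)
    return host_deadline
```

The `min` keeps the hypervisor's own deadline (for example, its scheduling tick) alive while also covering the switched-out vCPU.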

Regards,
Siqi


Greg Favor
 

Siqi,

Thanks.  That narrows down where we are differing.  

If the hypervisor has set hideleg bit 6 (so that VS-level timer interrupts can be received by VS-mode while V=1), then any new pending VS-level timer interrupt won't be taken until the hypervisor returns back into the VM.  But at that moment the hypervisor is busy servicing whatever caused the exit from the VM and into the hypervisor.  Even if the hypervisor got interrupted by the pending VS-level interrupt, it would defer doing anything about it until it reached a point where it could consider returning to this VM.

Then, at junctures like that, the hypervisor may decide to context switch to another VM instead of returning into this one.  But the hypervisor code that considers whether to switch away to another VM would be the natural place for it to check (via hip.VSTIP) whether the current VM has a pending VS-level timer interrupt and decide whether it should stay on the current VM.  This check should not need to appear in many places in hypervisor code.  I believe it would be in the one or few pieces of code that decide whether to return to the current VM or to context switch to a different VM (i.e. this decision-making process should not be spread across many places in the code).
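
A minimal Python sketch of such a decision point follows. The policy shown (always stay on the current vCPU when its timer interrupt is pending) is just one possible choice for illustration, and `choose_next` is a hypothetical name.

```python
HIP_VSTIP = 1 << 6  # pending VS-level timer interrupt bit in hip


def choose_next(current, runnable, hip: int):
    """Pick the vCPU to run next at the hypervisor's scheduling decision point."""
    if hip & HIP_VSTIP:
        # The bound vCPU has a pending timer interrupt: return to it promptly.
        return current
    # Otherwise follow the normal policy; here, simply the first runnable vCPU.
    return runnable[0] if runnable else current
```

Because the check lives in the one function that makes the scheduling decision, nothing else in the hypervisor needs to know about the pending timer.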

Conversely, if you had some form of VGTI, then when the hypervisor receives the VS-level timer interrupt, it would generally have to defer doing something about it until it reaches a suitable juncture where it can consider whether to return to this (or another) VM.  At which point the code at that juncture could simply poll hip.VSTIP instead of having to remember that a VGTI was received earlier.

I'm guessing this may not get us to a point of agreement yet :), but hopefully we're narrowing in.

Greg

P.S. If necessary, we can put this to the people doing two of the RISC-V hypervisor ports.


On Thu, Sep 10, 2020 at 8:04 PM zhaosiqi (A) via lists.riscv.org <zhaosiqi3=huawei.com@...> wrote:


John Hauser
 

zhaosiqi (Siqi) wrote:
> So VGTI is purely for the prompt handling of the vstimecmp interrupt
> when the vCPU is still bound to a hart but not executing.

I believe a hypervisor can get the same effect by saving and clearing
bit 6 of hideleg on entry to a trap handler in HS mode. On trap exit,
restore the saved value of bit 6 of hideleg. (Usually, it should be
possible just to save and restore all of hideleg.) While bit 6 of
hideleg is zero, a VS-level timer interrupt will trap to HS mode,
assuming bit 6 (VSTIE) of hie is also set to enable the trap.
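
The save/clear/restore sequence can be modelled as below. This is a simplified Python sketch rather than real trap-handler code: global interrupt enables (sstatus.SIE, vsstatus.SIE) are ignored, a dictionary stands in for the CSRs, and all function names are hypothetical.

```python
VSTI = 1 << 6  # bit 6 of hideleg, hie and hip


def vs_timer_target(hideleg: int, hie: int, hip: int, v: bool):
    """Mode that takes a VS-level timer interrupt right now, or None if it stays pending."""
    if not (hip & VSTI and hie & VSTI):
        return None
    if hideleg & VSTI:
        # Delegated to VS-mode: only taken while V=1.
        return "VS" if v else None
    return "HS"


def trap_entry(csrs: dict) -> int:
    """On entry to the HS-mode trap handler: save hideleg and clear bit 6."""
    saved = csrs["hideleg"]
    csrs["hideleg"] = saved & ~VSTI
    return saved


def trap_exit(csrs: dict, saved: int) -> None:
    """On trap exit: restore the saved hideleg."""
    csrs["hideleg"] = saved
```

While the handler runs with bit 6 cleared, a timer interrupt for the still-bound vCPU traps to HS-mode; after `trap_exit` it is delivered directly to VS-mode again.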

- John Hauser