[tech-privileged] hypervisor extension: seL4 experience and feedback
Andy Glew Si5
> As Andy already pointed out, RDINSTRET could be quite useful for other purposes as well (e.g., record-and-replay or redundant execution). Would it be possible to add a filter or mask so that user-mode or kernel-mode retired instructions could be counted separately?
I like the filter/mask idea, as I will explain below, but I think it belongs more to generic performance event counters, not RDINSTRET or RDCYCLE. I think those instructions should do one thing and do it well. If they can be configured, then it will be harder to use them locally, e.g. in a library, without knowledge of the global setting.
As for filtering of generic performance counters:
x86 EMON has generic filtering:
Therefore you can count cache misses, instructions retired, instructions speculatively decoded, etc., for any of user/kernel/hypervisor/any.
Further filtering:
Note: “instructions retired” vs “speculative instructions” is not generic, since there are many possible places where one can count speculative instructions. Similarly speculative cache misses.
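To make the filter/mask idea concrete, here is a minimal sketch of a generic event counter with a privilege-mode mask. It is a toy software model, not any real PMU interface: the names `Mode`, `FilteredCounter`, and the event strings are assumptions made up for illustration.

```python
from enum import IntFlag


class Mode(IntFlag):
    """Privilege modes used as filter bits (hypothetical encoding)."""
    USER = 1
    KERNEL = 2
    HYPERVISOR = 4
    ANY = USER | KERNEL | HYPERVISOR


class FilteredCounter:
    """Toy event counter that only counts one event in the selected modes."""

    def __init__(self, event, mode_mask):
        self.event = event
        self.mode_mask = mode_mask
        self.value = 0

    def observe(self, event, mode):
        # Count only when both the event type and the privilege mode match.
        if event == self.event and (mode & self.mode_mask):
            self.value += 1


# A made-up stream of (event, mode) pairs standing in for hardware activity.
trace = [
    ("instret", Mode.USER),
    ("instret", Mode.KERNEL),
    ("cache_miss", Mode.USER),
    ("instret", Mode.USER),
    ("instret", Mode.HYPERVISOR),
]

user_instret = FilteredCounter("instret", Mode.USER)
any_instret = FilteredCounter("instret", Mode.ANY)
for event, mode in trace:
    user_instret.observe(event, mode)
    any_instret.observe(event, mode)

print(user_instret.value, any_instret.value)
```

Each configured counter answers one question at a time, which is why such counters end up being a globally managed resource.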
These are all great things, great for performance analysis. But there are never enough performance counters to count everything in one pass. So they need to be managed globally. Which, as Jack Dennis (static dataflow guy) says “violates software engineering modularity”.
Providing fixed well-characterized definitions of RDCYCLE and RDINSTRET allows at least these events to be used locally, e.g. for usage aware algorithms, within functions and classes. Without having to mess with a global management infrastructure.
-----Original Message-----
Hi John,
See my responses inline below.
Regards, Yanyan
On Thu, 2020-02-27 at 13:54 -0800, John Hauser wrote:
> Hi Gernot and Yanyan,
>
> It's been a couple of months since you first sent (Dec. 4) your document reporting your experience adapting the seL4 microkernel to draft 0.4 of the RISC-V hypervisor extension, with some questions about the then-current 0.5 draft. I earlier responded in detail to your feedback from sections 4 and 5 of your document. I'd like to respond finally to a couple remaining issues raised in sections 6 and 7.
Thanks very much for your installments, which clarify things and help us to understand the extension.
> > Q6: How are the two instructions, RDCYCLE and RDINSTRET, treated by the hypervisor extension? Are they going to return the cycles consumed and instructions retired by the current running VM only?
>
> Without additional "delta" registers like RDTIME's htimedelta, the expectation currently is that bits CY and IR in hcounteren for the cycle and instret counters will normally be set to zero. The hypervisor thus gets to emulate these counters for the virtual machine, adjusting the global cycle and instret counts as necessary.
So, it is expected that the instructions return the cycles consumed and instructions retired by the calling VM. However, it is up to the hypervisor to decide the accuracy of the values returned.
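The emulation described above can be sketched as a small software model. This is only an illustration under stated assumptions: `Hart`, `VirtualInstret`, and the method names are hypothetical, and the model ignores the instructions retired by the trap handler itself, which is one place where the hypervisor decides how accurate the returned value is.

```python
class Hart:
    """Toy hart with a single global, free-running instret counter."""

    def __init__(self):
        self.instret = 0

    def run(self, n):
        # Retire n instructions, regardless of who is running.
        self.instret += n


class VirtualInstret:
    """Per-VM view of instret kept by a hypothetical hypervisor."""

    def __init__(self, hart):
        self.hart = hart
        self.accumulated = 0        # retired in earlier time slices of this VM
        self.entry_snapshot = None  # global instret at the last VM entry

    def vm_entry(self):
        self.entry_snapshot = self.hart.instret

    def vm_exit(self):
        self.accumulated += self.hart.instret - self.entry_snapshot
        self.entry_snapshot = None

    def emulate_rdinstret(self):
        # Called when the guest's counter read traps to the hypervisor
        # while this VM is the one currently running.
        return self.accumulated + (self.hart.instret - self.entry_snapshot)


hart = Hart()
vm_a, vm_b = VirtualInstret(hart), VirtualInstret(hart)

vm_a.vm_entry(); hart.run(100); vm_a.vm_exit()   # VM A runs
hart.run(20)                                     # hypervisor work
vm_b.vm_entry(); hart.run(50); vm_b.vm_exit()    # VM B runs
vm_a.vm_entry(); hart.run(30)                    # VM A runs again
a_view = vm_a.emulate_rdinstret()

print(a_view, hart.instret)
```

VM A sees only the instructions retired during its own time slices, while the global counter keeps counting everything.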
> It's perfectly reasonable to question whether emulating the cycle and instret counters will be too expensive in practice. The official line for now is that emulation should be tolerable. RDCYCLE and RDINSTRET are expected to be used only for performance measurements, and should not be executed too frequently.
I agree that the trap-and-emulate will work, and the performance may be acceptable if the registers are accessed infrequently.
As Andy already pointed out, RDINSTRET could be quite useful for other purposes as well (e.g., record-and-replay or redundant execution). Would it be possible to add a filter or mask so that user-mode or kernel-mode retired instructions could be counted separately?
A related question is the accuracy of RDINSTRET. Are over-counting or under-counting allowed for certain conditions? What is the degree of freedom an implementation could have to interpret the meaning and accuracy of the RDINSTRET instruction?
> > The v0.5 draft states that the accesses to the VS CSRs in VS-mode cause illegal instructions, so nested virtualization could be built on trap-and-emulate. Similarly, accesses to HS-mode CSRs from the second-level hypervisor also need to be trapped and emulated. This approach naturally raises concerns about the overhead of trapping, decoding, and handling the CSR accesses. As Arm and x86 already added hardware support for nested virtualisation, are we anticipating similar hardware support in RISC-V?
>
> Additional optional hardware for nested hypervisors is being considered. More about this may come out later in 2020 or next year. Right now, other components that are needed for a server-class RISC-V platform are probably a higher priority.
Good to know that nested virtualisation is being considered. I understand there are higher priority tasks.
> Regards,
>
> - John Hauser
Shen, Yanyan (Data61, Kensington NSW) <yanyan.shen@...>
Hi John,
See my responses inline below.
Regards, Yanyan
On Thu, 2020-02-27 at 13:54 -0800, John Hauser wrote:
> Hi Gernot and Yanyan,
Thanks very much for your installments, which clarify things and help us to understand the extension.
> Q6: How are the two instructions, RDCYCLE and RDINSTRET, treated
> Without additional "delta" registers like RDTIME's htimedelta,
So, it is expected that the instructions return the cycles consumed and instructions retired by the calling VM. However, it is up to the hypervisor to decide the accuracy of the values returned.
> It's perfectly reasonable to question whether emulating the cycle and
I agree that the trap-and-emulate will work, and the performance may be acceptable if the registers are accessed infrequently.
As Andy already pointed out, RDINSTRET could be quite useful for other purposes as well (e.g., record-and-replay or redundant execution). Would it be possible to add a filter or mask so that user-mode or kernel-mode retired instructions could be counted separately?
A related question is the accuracy of RDINSTRET. Are over-counting or under-counting allowed for certain conditions? What is the degree of freedom an implementation could have to interpret the meaning and accuracy of the RDINSTRET instruction?
> The v0.5 draft states that the accesses to the VS CSRs in VS-mode
> Additional optional hardware for nested hypervisors is being
Good to know that nested virtualisation is being considered. I understand there are higher priority tasks.
Yea, I remember Non-Stop folks explaining how they were going to considerably simplify their implementation by not relying on lockstep, but instead relying on counting retired instructions.
But they had to define very carefully what a retired instruction was, and I’m pretty sure that definition doesn’t match RISC-V. E.g., if you get any kind of trap on an access that is later replayed, you want that access counted only once. You also want to be able to account for trap handlers separately, etc.
-Allen
On Feb 28, 2020, at 2:58 PM, Andy Glew Si5 <andy.glew@...> wrote:
Andy Glew Si5
Let me withdraw the part about RDTSC - I confused RISC-V RDCYCLE and RDTSC.
However, my point about people using instruction retired count in real life for real functionality remains.
From: Andy Glew <andy.glew@...>
Sent: Friday, February 28, 2020 14:57
To: 'John Hauser' <jh.riscv@...>; 'tech-privileged@...' <tech-privileged@...>
Subject: RE: [RISC-V] [tech-privileged] [tech-privileged] hypervisor extension: seL4 experience and feedback
Intel's RDTSC is used not just for performance measurements, but also as timestamps, not just for databases, but also for enough generic Linux code that Intel was forced to ensure that RDTSC was globally synchronized in multiprocessor, and I think also in multiple CPU chip, systems.
> RDCYCLE and RDINSTRET are expected to be used only for performance measurements, and should not be executed too frequently.
Intel's equivalent of RDINSTRET is used by fault-tolerant code. Not really realtime lockstep fault tolerance, but the sort of fault tolerance that Tandem used to do. Banking level fault tolerance. Checkpoint restart.
Let me sketch such a system:
Multiprocessor UNIX, but no shared memory communication. Message passing only.
For every message passed that communicates between processes, record the system call, information provided/returned, and the instruction count. This is essentially a recovery log.
Periodically checkpoint.
After a failure, restore from checkpoint. Start executing. Arrange to stop when instruction count reaches the instruction count of the next item in the log. (Do some checks.) Insert the data. Repeat until you have consumed the log.
Instruction count provides a well-characterized place where external interactions can be replayed into a process to get it to the state where it can pick up.
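The checkpoint-and-replay scheme sketched above can be modelled in a few lines. This is a toy illustration, not the Tandem design or any real system: `Process`, `record_run`, and `replay` are hypothetical names, the "process" is a deterministic state update, and the model assumes execution is fully deterministic between message arrivals.

```python
class Process:
    """Deterministic toy process whose state depends on its instruction count."""

    MOD = 1_000_003

    def __init__(self, state=0, instret=0):
        self.state = state
        self.instret = instret

    def step(self):
        # Retire one instruction.
        self.state = (self.state * 31 + self.instret) % self.MOD
        self.instret += 1

    def receive(self, data):
        # An external message is folded into the state.
        self.state = (self.state + data) % self.MOD

    def checkpoint(self):
        return (self.state, self.instret)


def record_run(arrivals, checkpoint_at, total):
    """Run live; log every message with the instruction count at delivery."""
    p, log, ckpt = Process(), [], None
    while p.instret < total:
        if p.instret == checkpoint_at:
            ckpt = p.checkpoint()
        if p.instret in arrivals:
            log.append((p.instret, arrivals[p.instret]))
            p.receive(arrivals[p.instret])
        p.step()
    return p.state, ckpt, log


def replay(ckpt, log, total):
    """Restore the checkpoint and re-inject logged messages at the same counts."""
    p = Process(*ckpt)
    for count, data in (e for e in log if e[0] >= p.instret):
        while p.instret < count:
            p.step()
        assert p.instret == count  # the "do some checks" step
        p.receive(data)
    while p.instret < total:
        p.step()
    return p.state


final, ckpt, log = record_run({5: 42, 12: 7}, checkpoint_at=8, total=20)
replayed = replay(ckpt, log, total=20)
print(replayed == final)
```

The replay only reproduces the original state if the instruction count means exactly the same thing in both runs, which is why the precise definition of a retired instruction matters for this use.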
You can do this to replay shared memory interactions, but there are far too many possible interaction points to be practical in general. Constraining is possible, but is fragile, because you can always break the rules by accident.
There are other ways to accomplish the same thing. E.g. you could just count system calls ... but that doesn't work if you have message passing at user level. You could instrument the message passing libraries. But it's really nice if it works for arbitrary binaries (at least as long as they are not using shared memory). Spin loops are always an issue.
But, anyway, although there are other ways to do the same thing, this sort of thing is the most elegant way I have seen to do it. Modulo the no shared memory restriction. And it is real, not hypothetical.
I.e. reading the time and reading the instruction count have both been used for real functionality, not just for performance measurement.
-----Original Message-----
Hi Gernot and Yanyan,
It's been a couple of months since you first sent (Dec. 4) your document reporting your experience adapting the seL4 microkernel to draft 0.4 of the RISC-V hypervisor extension, with some questions about the then-current 0.5 draft. I earlier responded in detail to your feedback from sections 4 and 5 of your document. I'd like to respond finally to a couple remaining issues raised in sections 6 and 7.
> Q6: How are the two instructions, RDCYCLE and RDINSTRET, treated by the hypervisor extension? Are they going to return the cycles consumed and instructions retired by the current running VM only?
Without additional "delta" registers like RDTIME's htimedelta, the expectation currently is that bits CY and IR in hcounteren for the cycle and instret counters will normally be set to zero. The hypervisor thus gets to emulate these counters for the virtual machine, adjusting the global cycle and instret counts as necessary.
It's perfectly reasonable to question whether emulating the cycle and instret counters will be too expensive in practice. The official line for now is that emulation should be tolerable. RDCYCLE and RDINSTRET are expected to be used only for performance measurements, and should not be executed too frequently.
> The v0.5 draft states that the accesses to the VS CSRs in VS-mode cause illegal instructions, so nested virtualization could be built on trap-and-emulate. Similarly, accesses to HS-mode CSRs from the second-level hypervisor also need to be trapped and emulated. This approach naturally raises concerns about the overhead of trapping, decoding, and handling the CSR accesses. As Arm and x86 already added hardware support for nested virtualisation, are we anticipating similar hardware support in RISC-V?
Additional optional hardware for nested hypervisors is being considered. More about this may come out later in 2020 or next year. Right now, other components that are needed for a server-class RISC-V platform are probably a higher priority.
Regards,
- John Hauser