Re: [tech-privileged] hypervisor extension: seL4 experience and feedback
Andy Glew Si5
Let me withdraw the part about RDTSC - I confused RISC-V RDCYCLE and RDTSC.
However, my point about people using instruction retired count in real life for real functionality remains.
From: Andy Glew <andy.glew@...>
Sent: Friday, February 28, 2020 14:57
To: 'John Hauser' <jh.riscv@...>; 'tech-privileged@...' <tech-privileged@...>
Subject: RE: [RISC-V] [tech-privileged] [tech-privileged] hypervisor extension: seL4 experience and feedback
Intel's RDTSC is used not just for performance measurements but also for timestamps, and not only in databases: enough generic Linux code depends on it that Intel was forced to ensure that RDTSC is globally synchronized in multiprocessor systems and, I think, also in systems with multiple CPU chips.
> RDCYCLE and RDINSTRET are expected to be used only for performance measurements, and should not be executed too frequently.
Intel's equivalent of RDINSTRET is used by fault-tolerant code. Not really realtime lockstep fault tolerance, but the sort of fault tolerance that Tandem used to do. Banking level fault tolerance. Checkpoint restart.
Let me sketch such a system:
Multiprocessor UNIX, but no shared memory communication. Message passing only.
For every message passed that communicates between processes, record the system call, information provided/returned, and the instruction count. This is essentially a recovery log.
After a failure, restore from checkpoint. Start executing. Arrange to stop when instruction count reaches the instruction count of the next item in the log. (Do some checks.) Insert the data. Repeat until you have consumed the log.
Instruction count provides a well-characterized place where external interactions can be replayed into a process to get it to the state where it can pick up.
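The log-and-replay scheme sketched above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how any real system was implemented: `Process`, `LogEntry`, `run_live`, and `replay` are hypothetical names, the "process" is a deterministic state machine whose only nondeterministic input is message delivery, and `instret` stands in for the hardware instruction-retired count.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    instret: int   # instruction count at which the message was delivered
    payload: int   # data the process received at that point


@dataclass
class Process:
    """Toy deterministic process (hypothetical model, not a real API)."""
    state: int = 0
    instret: int = 0

    def step(self):
        # One "instruction": any deterministic function of the state will do.
        self.state = (self.state * 31 + 7) % 1_000_003
        self.instret += 1

    def deliver(self, payload):
        # External input: the only source of nondeterminism in the model.
        self.state ^= payload


def run_live(proc, arrivals, stop_at):
    """Run normally, logging each delivered message with its instruction count.

    `arrivals` maps an instruction count to the payload that arrives there.
    """
    log = []
    while proc.instret < stop_at:
        if proc.instret in arrivals:
            payload = arrivals[proc.instret]
            log.append(LogEntry(proc.instret, payload))
            proc.deliver(payload)
        proc.step()
    return log


def replay(checkpoint, log, stop_at):
    """Restore from a checkpoint and re-inject logged messages at the same counts."""
    proc = Process(checkpoint.state, checkpoint.instret)
    for entry in log:
        # Run until the instruction count matches the next log entry.
        while proc.instret < entry.instret:
            proc.step()
        # "Do some checks": we must not have overshot the injection point.
        assert proc.instret == entry.instret
        proc.deliver(entry.payload)
    while proc.instret < stop_at:
        proc.step()
    return proc
```

In this model, a checkpoint taken at instruction count 50 plus the log of everything delivered afterwards is enough to reconstruct the live process state exactly, because the instruction count pins down where each external input has to be inserted.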
You can do this to replay shared memory interactions, but there are far too many possible interaction points to be practical in general. Constraining is possible, but is fragile, because you can always break the rules by accident.
There are other ways to accomplish the same thing. E.g. you could just count system calls ... but that doesn't work if you have message passing at user level. You could instrument the message passing libraries. But it's really nice if it works for arbitrary binaries (at least as long as they are not using shared memory). Spin loops are always an issue.
But, anyway, although there are other ways to do the same thing, this sort of thing is the most elegant way I have seen to do it. Modulo the no shared memory restriction. And it is real, not hypothetical.
I.e. reading the time and reading the instruction count have both been used for real functionality, not just for performance measurement.
Hi Gernot and Yanyan,
It's been a couple of months since you first sent (Dec. 4) your document reporting your experience adapting the seL4 microkernel to draft 0.4 of the RISC-V hypervisor extension, with some questions about the then-current 0.5 draft. I earlier responded in detail to your feedback from sections 4 and 5 of your document. I'd now like to respond to a couple of remaining issues raised in sections 6 and 7.
> Q6: How are the two instructions, RDCYCLE and RDINSTRET, treated by
> the hypervisor extension? Are they going to return the cycles consumed
> and instructions retired by the current running VM only?
Without additional "delta" registers like RDTIME's htimedelta, the expectation currently is that bits CY and IR in hcounteren for the cycle and instret counters will normally be set to zero. The hypervisor thus gets to emulate these counters for the virtual machine, adjusting the global cycle and instret counts as necessary.
It's perfectly reasonable to question whether emulating the cycle and instret counters will be too expensive in practice. The official line for now is that emulation should be tolerable. RDCYCLE and RDINSTRET are expected to be used only for performance measurements, and should not be executed too frequently.
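As a rough illustration of what such emulation might involve, the following models one way a hypervisor could keep a per-VM counter when hcounteren.CY or hcounteren.IR is zero and guest reads trap. This is a sketch under assumptions, not something the draft prescribes: `VirtualCounter` and its methods are hypothetical names, and the `host_count` arguments stand for values the hypervisor reads from the real cycle or instret counter.

```python
MASK64 = (1 << 64) - 1  # the counters are modeled as 64 bits wide, wrapping


class VirtualCounter:
    """Per-VM view of cycle or instret (hypothetical hypervisor bookkeeping)."""

    def __init__(self):
        self.accumulated = 0        # counts charged to this VM so far
        self.entry_snapshot = None  # host counter at last VM entry, if running

    def vm_enter(self, host_count):
        # Remember where the global counter stood when the VM was scheduled.
        self.entry_snapshot = host_count

    def vm_exit(self, host_count):
        # Charge the VM only for the counts that elapsed while it ran.
        delta = (host_count - self.entry_snapshot) & MASK64
        self.accumulated = (self.accumulated + delta) & MASK64
        self.entry_snapshot = None

    def emulate_read(self, host_count):
        """Value to hand back for a trapped guest RDCYCLE or RDINSTRET."""
        if self.entry_snapshot is None:
            return self.accumulated
        delta = (host_count - self.entry_snapshot) & MASK64
        return (self.accumulated + delta) & MASK64
```

The model ignores the counts consumed by the trap handler itself, which a real hypervisor would probably want to subtract as well; that adjustment, and the trap on every guest read, is where the emulation cost questioned above comes from.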
> The v0.5 draft states that the accesses to the VS CSRs in VS-mode
> cause illegal instructions, so nested virtualization could be built on
> trap-and-emulate. Similarly, accesses to HS-mode CSRs from the
> second-level hypervisor also need to be trapped and emulated. This
> approach naturally raises concerns about the overhead of trapping,
> decoding, and handling the CSR accesses. As Arm and x86 already added
> hardware support for nested virtualisation, are we anticipating
> similar hardware support in RISC-V?
Additional optional hardware for nested hypervisors is being considered. More about this may come out later in 2020 or next year.
Right now, other components that are needed for a server-class RISC-V platform are probably a higher priority.
- John Hauser