
RISC-V Hypervisor Updates

Anup Patel
 

Hi All,

We have updated QEMU RISC-V, KVM RISC-V and Xvisor RISC-V for the RISC-V
H-Extension v0.6 spec.

The QEMU repo with RISC-V H-Extension v0.6 support can be found here:
https://github.com/kvm-riscv/qemu.git

To try KVM RISC-V, refer to:
https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-QEMU
https://github.com/kvm-riscv/linux.git
https://github.com/kvm-riscv/kvmtool.git

To try Xvisor RISC-V, refer to:
https://github.com/avpatel/xvisor-next/blob/master/docs/riscv/riscv-virt-qemu.txt
https://github.com/avpatel/xvisor-next.git

Regards,
Anup


proposal to add "virtual instruction exception" to the hypervisor extension

John Hauser
 

Hello tech-privileged guys,

I've created a pull request for the RISC-V privileged spec in response
to requests from our hypervisor software authors:
https://github.com/riscv/riscv-isa-manual/pull/518

For those with an interest, please review.

This change would add to the hypervisor extension a new kind of
exception, "virtual instruction exception". The following is copied
from new text added to the hypervisor chapter:

--------------------

When V = 1, a virtual instruction trap (not an illegal instruction
trap) is taken for:

- attempts to access a counter CSR when the corresponding bit in
hcounteren is 0 and the same bit in mcounteren is 1;

- attempts to execute WFI, unless the instruction completes within an
implementation-specific, bounded time;

- attempts to execute a virtual-machine load/store instruction, HLV,
HLVX, or HSV;

- in VS-mode, attempts to execute an HFENCE instruction or to access
an implemented hypervisor CSR or VS CSR;

- in VS-mode, attempts to execute SRET when hstatus.VTSR = 1; or

- in VS-mode, attempts to execute an SFENCE instruction or to access
satp, when hstatus.VTVM = 1.

On a virtual instruction trap, mtval or stval is written the same as
for an illegal instruction trap.

\begin{commentary}
When V = 1, circumstances that might otherwise cause an illegal
instruction trap instead cause a virtual instruction trap if a
hypervisor is normally expected to emulate the instruction. Notably,
for VS-mode this includes the hypervisor instructions (HLV, HLVX, HSV,
and HFENCE) and accesses to the hypervisor-level CSRs, all of which
must be emulated for nested hypervisors. A hypervisor that does not
support nested hypervisors should convert many virtual instruction
traps into illegal instruction exceptions for the guest virtual
machine.

Machine level is expected ordinarily to delegate virtual instruction
traps directly to HS-level, whereas illegal instruction traps are
likely to be processed first in M-mode before being conditionally
delegated (by software) to HS-level. Consequently, virtual instruction
traps are expected typically to be handled faster than illegal
instruction traps.
\end{commentary}

--------------------

Regards,

- John Hauser
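
The list of cases above can be read as a small decision procedure. The following C sketch is only an illustration of the quoted draft text, not code from any hypervisor or simulator: the enum, struct, and function names are all hypothetical, and the WFI timing condition is reduced to a boolean input. A `false` result means only that the draft list does not call for a virtual instruction trap; the instruction may still execute normally or raise some other exception.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the draft rules; every name here is illustrative. */
enum vmode { MODE_VS, MODE_VU };   /* the two privilege modes with V = 1 */

enum guest_op {
    OP_COUNTER_CSR,      /* access to a counter CSR */
    OP_WFI,
    OP_HLV_HLVX_HSV,     /* virtual-machine load/store instructions */
    OP_HFENCE,
    OP_HYP_OR_VS_CSR,    /* implemented hypervisor CSR or VS CSR */
    OP_SRET,
    OP_SFENCE_OR_SATP    /* SFENCE instruction or satp access */
};

struct trap_inputs {
    bool hcounteren_bit;   /* bit for the accessed counter in hcounteren */
    bool mcounteren_bit;   /* same bit in mcounteren */
    bool wfi_bounded;      /* WFI completes within the bounded time */
    bool vtsr;             /* hstatus.VTSR */
    bool vtvm;             /* hstatus.VTVM */
};

/* Returns true when the draft text calls for a virtual instruction trap. */
bool takes_virtual_instruction_trap(enum vmode mode, enum guest_op op,
                                    const struct trap_inputs *in)
{
    assert(in != NULL);
    switch (op) {
    case OP_COUNTER_CSR:
        return !in->hcounteren_bit && in->mcounteren_bit;
    case OP_WFI:
        return !in->wfi_bounded;
    case OP_HLV_HLVX_HSV:
        return true;                 /* in both VS-mode and VU-mode */
    case OP_HFENCE:
    case OP_HYP_OR_VS_CSR:
        return mode == MODE_VS;
    case OP_SRET:
        return mode == MODE_VS && in->vtsr;
    case OP_SFENCE_OR_SATP:
        return mode == MODE_VS && in->vtvm;
    }
    return false;
}
```

The `mode == MODE_VS` tests mark the cases that the draft restricts to VS-mode.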


Re: proposal to add "virtual instruction exception" to the hypervisor extension

Jonathan Behrens <behrensj@...>
 

I like this plan. The one comment I have is that it seems unnecessarily opinionated about which operations trigger virtual instruction traps vs illegal instruction traps when run in VU mode. I think we should err more on the side of having things trigger virtual instruction traps everywhere that it is unlikely to require M-mode emulation. To give one example of where the current design might go wrong: analogously to HLV and friends, HFENCE from U-mode might be allowed via a CSR in the future, in which case it would now require hypervisor emulation when a nested VM tried to run it in VU-mode.

Jonathan






Re: proposal to add "virtual instruction exception" to the hypervisor extension

John Hauser
 

Jonathan Behrens wrote:
> I like this plan. The one comment I have is that it seems unnecessarily
> opinionated about which operations trigger virtual instruction traps vs
> illegal instruction traps when run in VU mode. I think we should err more
> on the side of having things trigger virtual instruction traps everywhere
> that it is unlikely to require M-mode emulation. To give one example of
> where the current design might go wrong: analogously to HLV and friends,
> HFENCE from U-mode might be allowed via a CSR in the future, in which case
> it would now require hypervisor emulation when a nested VM tried to run it
> in VU-mode.

Point taken. Would you like to take a stab at listing every case you
think should be included, besides HFENCE?

- John Hauser


Re: proposal to add "virtual instruction exception" to the hypervisor extension

Jonathan Behrens <behrensj@...>
 

It mostly just comes down to crossing out "In VS-mode" in your list of cases for trapping when V=1:

  - attempts to execute an HFENCE instruction or to access
    an implemented hypervisor CSR or VS CSR;

  - attempts to execute SRET when hstatus.VTSR = 1 or in VU-mode;

  - attempts to execute an SFENCE instruction or to access
    satp, when hstatus.VTVM = 1 or in VU-mode.

A few more cases that I'm less sure make sense:

  - attempts to access an unimplemented hypervisor CSR

  - attempts to access a supervisor CSR in VU-mode

  - attempts to execute MRET or access an M-mode CSR

Jonathan






Re: proposal to add "virtual instruction exception" to the hypervisor extension

Paolo Bonzini
 

On 05/05/20 23:15, John Hauser wrote:
> - attempts to execute WFI, unless the instruction completes within an
> implementation-specific, bounded time;

It would be great to have this controlled by a bit in the hstatus CSR.
For example a hypervisor that does not overcommit CPUs will probably
want to delegate WFI to the guest. Guest-delegated interrupts then
would not incur the overhead of world switching, while host-handled
interrupts would exit to the hypervisor anyway.

(It may even make sense to add this to mstatus, but I wouldn't care much
about it right now except perhaps when choosing which bit to use in
hstatus).

Paolo


Re: proposal to add "virtual instruction exception" to the hypervisor extension

Anup Patel
 

-----Original Message-----
From: tech-privileged@... <tech-privileged@...> On
Behalf Of Paolo Bonzini
Sent: 06 May 2020 14:34
To: John Hauser <jh.riscv@...>; tech-privileged@...
Subject: Re: [RISC-V] [tech-privileged] proposal to add "virtual instruction
exception" to the hypervisor extension

> On 05/05/20 23:15, John Hauser wrote:
>> - attempts to execute WFI, unless the instruction completes within an
>> implementation-specific, bounded time;
> It would be great to have this controlled by a bit in the hstatus CSR.
> For example a hypervisor that does not overcommit CPUs will probably want
> to delegate WFI to the guest. Guest-delegated interrupts then would not
> incur the overhead of world switching, while host-handled interrupts would
> exit to the hypervisor anyway.
>
> (It may even make sense to add this to mstatus, but I wouldn't care much
> about it right now except perhaps when choosing which bit to use in
> hstatus).

Good suggestion.

An HSTATUS.VTW bit for trapping WFI when executed with V=1 would
be good (just like the HSTATUS.VTSR and HSTATUS.VTVM bits)?

Regards,
Anup


Re: proposal to add "virtual instruction exception" to the hypervisor extension

John Hauser
 

Paolo Bonzini wrote:
> On 05/05/20 23:15, John Hauser wrote:
>> - attempts to execute WFI, unless the instruction completes within an
>> implementation-specific, bounded time;
> It would be great to have this controlled by a bit in the hstatus CSR.
> For example a hypervisor that does not overcommit CPUs will probably
> want to delegate WFI to the guest. Guest-delegated interrupts then
> would not incur the overhead of world switching, while host-handled
> interrupts would exit to the hypervisor anyway.

I'm afraid this makes no sense to me. Ordinary user-mode applications
don't execute WFI. In U mode, WFI is useful only in connection
with support for user-level interrupts provided by the may-never-be-
standardized N extension.

And if a guest OS itself executes WFI, I find it hard to believe it
expects the WFI to trap as an illegal instruction, much less that it
has a performance-sensitive response coded in its illegal instruction
trap handler.

So I don't understand "delegate WFI to the guest". What's the scenario
where a guest OS expects WFI to trap?

The only thing I can think of is when the guest is a nested
hypervisor. But I'm not sure that specifically focusing on optimizing
WFI for a nested hypervisor is really on the agenda. Why not make many
other improvements in support of nested hypervisors, besides this one
thing? We intend eventually to propose a whole package of optional
added support for nested hypervisors. But outside of that add-on
extension, nested hypervisors aren't catered to beyond a bare-bones
level of support.

- John Hauser


Re: proposal to add "virtual instruction exception" to the hypervisor extension

Paolo Bonzini
 

On 06/05/20 18:56, John Hauser wrote:
> Paolo Bonzini wrote:
>> On 05/05/20 23:15, John Hauser wrote:
>>> - attempts to execute WFI, unless the instruction completes within an
>>> implementation-specific, bounded time;
>> It would be great to have this controlled by a bit in the hstatus CSR.
>> For example a hypervisor that does not overcommit CPUs will probably
>> want to delegate WFI to the guest. Guest-delegated interrupts then
>> would not incur the overhead of world switching, while host-handled
>> interrupts would exit to the hypervisor anyway.
> I'm afraid this makes no sense to me. Ordinary user-mode applications
> don't execute WFI.

But this would be a VS-mode application, not an ordinary user-mode
application. I'm not sure why U-mode matters. Anyway...

> And if a guest OS itself executes WFI, I find it hard to believe it
> expects the WFI to trap as an illegal instruction, much less that it
> has a performance-sensitive response coded in its illegal instruction
> trap handler.
>
> So I don't understand "delegate WFI to the guest". What's the scenario
> where a guest OS expects WFI to trap?

... I guess this is the misunderstanding. I'm not proposing to delegate
the WFI trap to the guest, but rather *the WFI instruction*: HS-mode
could optionally let WFI run in VS-mode, even if it wouldn't complete
within a bounded time. This is because, if you don't overcommit CPUs,
there's no advantage in getting out of VS-mode and doing the wait for
interrupts in HS-mode.

When an interrupt arrives, it would either be delivered to VS-mode or
cause an HS-mode trap, depending on the contents of hideleg.

Nested virtualization does not matter.

Is this clearer?

Paolo


Re: hstatus.VTW for WFI

John Hauser
 

Paolo Bonzini wrote:
> ... I guess this is the misunderstanding. I'm not proposing to delegate
> the WFI trap to the guest, but rather *the WFI instruction*: HS-mode
> could optionally let WFI run in VS-mode, even if it wouldn't complete
> within a bounded time. This is because, if you don't overcommit CPUs,
> there's no advantage in getting out of VS-mode and doing the wait for
> interrupts in HS-mode.

Okay, I get it now. You're proposing we bring back hstatus.VTW,
the HS-mode analog to mstatus.TW. (We had the VTW bit originally in
hstatus, but dropped it long ago.)

Anyone else have an opinion about that?

- John Hauser
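
As a rough illustration of the idea, the sketch below models a VTW bit that works like mstatus.TW one level down: with VTW clear, a guest WFI simply waits in VS-mode; with VTW set, a WFI that does not complete within the bounded time traps to the hypervisor. This is an assumption about how such a bit would behave, not text from the spec, and the function names and the CPU-count policy are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum wfi_result { WFI_WAITS_IN_GUEST, WFI_TRAPS_TO_HS };

/* Assumed VTW semantics, modeled on mstatus.TW (hypothetical helper). */
enum wfi_result guest_wfi(bool vtw, bool completes_in_bounded_time)
{
    if (vtw && !completes_in_bounded_time)
        return WFI_TRAPS_TO_HS;
    return WFI_WAITS_IN_GUEST;
}

/* Hypothetical policy: trap WFI only when virtual CPUs outnumber physical ones. */
bool choose_vtw(unsigned virtual_cpus, unsigned physical_cpus)
{
    assert(physical_cpus > 0);
    return virtual_cpus > physical_cpus;
}
```

A hypervisor that does not overcommit CPUs would get `choose_vtw(...) == false` and so leave WFI to the guest, which is the scenario Paolo describes.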


Microarchitectural state flush for timing-channel prevention

Gernot <gernot.heiser@...>
 

Dear Privspec members,

You may recall that I had argued for an operation to flush microarchitectural state in order to allow the OS to prevent timing channels. I believe the need for this was not disputed, but the open question is what is the right abstraction.

I’m not offering an answer to the abstraction question ;-) However, I’d like to share our recent work with the Ariane folks at ETH, where we implemented such a flush and evaluated its effectiveness as well as overhead. The results are extremely encouraging, and written up here: https://arxiv.org/abs/2005.02193.

In a nutshell:
- we could show that all channels could be closed
- we learned that you have to be really thorough and invalidate more than just cache-line tags
- we found the overhead negligible (given that a switch of security-domain is happening at no more than a millisecond rate).

The details are somewhat specific to the Ariane, which is a simple, in-order pipeline and presently has a write-through L1D. On a more high-performance design there would be more to flush and the cost would be higher. However, I do not expect this to change the overall picture.

Gernot


Question on the new hvip register

Siqi Zhao
 

Hi,

 

Reading through the hypervisor extension v0.6, I noticed the new register called hvip. The spec says that this register is intended for the hypervisor to write to indicate pending interrupts for VS-mode. However, as I understood the older version of the hypervisor extension, this purpose was already fulfilled by writing to the vsip register. Why make another register? Am I missing something?

 

Regards,

Siqi


Re: Question on the new hvip register

John Hauser
 

Siqi (zhaosiqi3@...) wrote:
> Reading through the hypervisor extension v0.6, I noticed the new
> register called hvip. The spec says that this register is intended
> for the hypervisor to write to indicate pending interrupts for
> VS-mode. However, as I understood the older version of the
> hypervisor extension, this purpose was already fulfilled by
> writing to the vsip register. Why make another register? Am I missing
> something?

The same question was raised not long ago on the RISC-V ISA Dev mailing
list. The following is clipped from that conversation.

--------------------

Jose Martins wrote:
> Actually, the purpose of this new hvip register is not entirely clear
> to me. In v0.5 it was possible to inject virtual interrupts in vs mode
> through hip.

As a general rule, simple injection of interrupts into VS mode can now
be done only through hvip. Technically, VSSIP is an exception, because
its alias in hip is writable, but my advice is to ignore that quirk.

As you know, in draft 0.5 of the hypervisor extension, hip.VSEIP
was defined with an underlying software-writable bit, analogous to
mip.SEIP. At the time, this seemed the most consistent way to handle
injection of interrupts into VS mode. However, it caused a headache
for software, as I explained in December in a message to a RISC-V
working group:

> (But beware: If, in the future, there are hardware sources for the
> value of hip.VSEIP, a simple save and restore of hip could cause the
> software-writable VSEIP bit to become spuriously set when it wasn't
> before. The safe way to save hip would be as follows: After reading
> the CSR, if the software-writable VSEIP bit is not supposed to be set,
> clear the VSEIP bit from the hip value before saving it in your VCPU
> control structure. If the hypervisor may ever set the
> software-writable VSEIP bit, it must keep track of the intended state
> of this bit, and furthermore this knowledge must be accessible to the
> code that saves the CSRs on context switches.)

When contemplating additional hardware support for nested hypervisors,
it was realized there could be times when the software doing a context
swap might not know the location of the intended state of the software-
writable VSEIP bit, making it impossible to correctly swap hip.
Adding hvip solved this problem, by putting the information in a known
accessible place and eliminating a CSR bit with odd semantics. (The
odd semantics still exist for mip.SEIP, but with fewer consequences for
software.)

--------------------

- John Hauser
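
To make the save/restore hazard concrete, here is a small host-side C model of the draft 0.5 behavior described above. It assumes, as an illustration only, that a read of hip.VSEIP returns the OR of the software-writable bit and a hardware source; the struct and function names are hypothetical, and VSEIP is taken to be bit 10 of hip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HIP_VSEIP (UINT64_C(1) << 10)  /* assumed bit position of VSEIP */

/* Hypothetical model of draft 0.5 hip: software bit plus a hardware source. */
struct hip_model {
    uint64_t sw_bits;    /* software-writable state */
    bool     hw_vseip;   /* external hardware source for VSEIP */
};

uint64_t hip_read(const struct hip_model *h)
{
    return h->sw_bits | (h->hw_vseip ? HIP_VSEIP : 0);
}

void hip_write(struct hip_model *h, uint64_t value)
{
    h->sw_bits = value;  /* a write lands in the software-writable bits */
}

/* Safe save: drop VSEIP unless the hypervisor itself meant it to be set. */
uint64_t hip_value_to_save(uint64_t hip_value, bool sw_vseip_intended)
{
    return sw_vseip_intended ? hip_value : (hip_value & ~HIP_VSEIP);
}

/* Self-check contrasting a naive save/restore with the safe one. */
void demo_hip_save_restore(void)
{
    struct hip_model h = { 0, true };      /* only hardware asserts VSEIP */
    uint64_t naive = hip_read(&h);
    uint64_t safe  = hip_value_to_save(naive, false);

    hip_write(&h, naive);                  /* naive restore */
    assert(h.sw_bits & HIP_VSEIP);         /* software bit spuriously set */

    hip_write(&h, safe);                   /* safe restore */
    assert((h.sw_bits & HIP_VSEIP) == 0);  /* software bit stays clear */
}
```

The `sw_vseip_intended` argument is exactly the piece of bookkeeping that the context-switch code may not be able to locate; with hvip the intended state lives in its own CSR instead.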


Re: Question on the new hvip register

John Hauser
 

Siqi (zhaosiqi3@...) wrote:
> Reading through the hypervisor extension v0.6, I noticed the new
> register called hvip. The spec says that this register is intended
> for the hypervisor to write to indicate pending interrupts for
> VS-mode. However, as I understood the older version of the
> hypervisor extension, this purpose was already fulfilled by
> writing to the vsip register. Why make another register? Am I missing
> something?

Also, sorry, failed to notice you said vsip, not hip.

The previous draft of the hypervisor extension, 0.5, never allowed
writing directly to vsip from VS mode, except for vsip.SSIP. Any claim
that it did is a misunderstanding of the intention. But that's history
now; no longer important.

Regards,

- John Hauser


Re: hstatus.VTW for WFI

John Hauser
 

I wrote:
> Okay, I get it now. You're proposing we bring back hstatus.VTW,
> the HS-mode analog to mstatus.TW. (We had the VTW bit originally in
> hstatus, but dropped it long ago.)

I've created a pull request for the RISC-V privileged manual:
https://github.com/riscv/riscv-isa-manual/pull/523

Comments welcome.

- John Hauser


Non-idempotent PMA and table walk accesses

David Kruckemyer
 

Hi all,

I have a simple question: does the architecture allow table walk accesses (reads or writes) to regions with the non-idempotent PMA?

The architecture doesn't explicitly disallow it, so the answer is probably "yes." However, I'm having a hard time understanding a system design in which such a table walk would be practical. Can someone provide a practical use-case for walking non-idempotent locations?

If no such use-case exists, would people object to imposing a restriction on table walk accesses to locations with the non-idempotent PMA? Or at least a comment strongly suggesting that platforms won't support that behavior?

Cheers,
David


Re: Non-idempotent PMA and table walk accesses

Andrew Waterman
 



On Mon, May 18, 2020 at 2:58 PM David Kruckemyer <dkruckemyer@...> wrote:
> Hi all,
>
> I have a simple question: does the architecture allow table walk accesses (reads or writes) to regions with the non-idempotent PMA?
>
> The architecture doesn't explicitly disallow it, so the answer is probably "yes." However, I'm having a hard time understanding a system design in which such a table walk would be practical. Can someone provide a practical use-case for walking non-idempotent locations?
>
> If no such use-case exists, would people object to imposing a restriction on table walk accesses to locations with the non-idempotent PMA? Or at least a comment strongly suggesting that platforms won't support that behavior?

The specification machinery exists to allow implementations to impose such a restriction: "For systems with page-based virtual memory, I/O and memory regions can specify which combinations of hardware page-table reads and hardware page-table writes are supported."

I'd support adding a note that permitting page-table accesses to non-idempotent regions is discouraged.  Banning it seems a little harsh, though I see where you're coming from.
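
One way to picture the machinery quoted above: each PMA region carries flags saying whether hardware page-table reads and writes are supported, and the page-table walker consults them before touching memory. The C sketch below is a hypothetical software model of that check (the struct layout, field names, and function are illustrative, not taken from the spec or any implementation); a platform following the suggested note would simply clear both flags for non-idempotent regions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PMA table entry; field names are illustrative. */
struct pma_region {
    uint64_t base;
    uint64_t size;
    bool idempotent;     /* informational here; the flags below decide */
    bool ptw_read_ok;    /* hardware page-table reads supported */
    bool ptw_write_ok;   /* hardware page-table writes supported */
};

/* Returns true if a page-table-walk access to addr is supported. */
bool ptw_access_supported(const struct pma_region *regions, size_t count,
                          uint64_t addr, bool is_write)
{
    assert(regions != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        const struct pma_region *r = &regions[i];
        if (addr >= r->base && addr - r->base < r->size)
            return is_write ? r->ptw_write_ok : r->ptw_read_ok;
    }
    return false;   /* no region covers addr: treat as unsupported */
}
```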




Re: Non-idempotent PMA and table walk accesses

Nikhil Rishiyur
 

Although I haven't seen any such implementation, I would imagine that a non-idempotent region that was, say, counting accesses to each address as a side-effect of each access may be a "benign" kind of non-idempotency for PTWs.

Nikhil



Re: Non-idempotent PMA and table walk accesses

David Kruckemyer
 

That sounds a bit like a performance counter to me, but it does raise an interesting question whether "idempotent" in the architectural sense is idempotent in a mathematical sense (i.e. operations are repeatable with the same result) or in a broader sense (e.g. inclusive of any side-effects even if the values at the location don't change).

I've always assumed the "non-idempotent" attribute meant that a read may not return the last value written or that repeated reads may not return the same value, not that the behavior included side-effects that are observable elsewhere. What is the consensus regarding this?

Cheers,
David
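
The two readings of "non-idempotent" can be contrasted with two toy device models. This C sketch is purely illustrative and every name in it is hypothetical: the first device changes what a repeated read returns (the narrow reading), while the second always returns the same value but leaves a side effect that is observable elsewhere (the broader reading, as in Nikhil's access-counting example).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device models; all names are illustrative. */
struct clear_on_read { uint32_t status; };
struct counting_mem  { uint32_t value; uint32_t access_count; };

/* Reading returns the status and clears it: a repeated read sees a new value. */
uint32_t cor_read(struct clear_on_read *d)
{
    uint32_t v = d->status;
    d->status = 0;
    return v;
}

/* Reading returns a stable value but bumps a counter visible elsewhere. */
uint32_t cnt_read(struct counting_mem *m)
{
    m->access_count++;
    return m->value;
}

/* Self-check of the two behaviors. */
void check_models(void)
{
    struct clear_on_read d = { 5 };
    struct counting_mem  m = { 7, 0 };

    assert(cor_read(&d) == 5 && cor_read(&d) == 0);  /* values differ */
    assert(cnt_read(&m) == 7 && cnt_read(&m) == 7);  /* values repeat */
    assert(m.access_count == 2);                     /* side effect remains */
}
```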




Re: Non-idempotent PMA and table walk accesses

Bill Huffman
 


On 5/18/20 5:10 PM, David Kruckemyer wrote:
> That sounds a bit like a performance counter to me, but it does raise an interesting question whether "idempotent" in the architectural sense is idempotent in a mathematical sense (i.e. operations are repeatable with the same result) or in a broader sense (e.g. inclusive of any side-effects even if the values at the location don't change).
>
> I've always assumed the "non-idempotent" attribute meant that a read may not return the last value written or that repeated reads may not return the same value, not that the behavior included side-effects that are observable elsewhere. What is the consensus regarding this?
>
> Cheers,
> David

I've always assumed that it included any side-effects that mattered to the program.  It obviously does not include bringing the demise of a chip nearer with tiny amounts of electromigration.  I don't think it includes incrementing performance counters or shifting the results of predictors either.  Not sure how many things actually fit between your definition and mine.  Perhaps not many in real implementations.

      Bill



