
Re: [RISC-V] [tech-tee] [RISC-V] [tech-privileged] comments on PMP enhancements

Allen Baum
 

Yes, the reason I labelled it separate M&L was because the existing "legacy" proposal essentially combines the functionality into a single "L" bit.

> But: if an entry is created (necessarily by M-mode) that is locked and such that M-mode can no longer touch that region but S/U mode can - that is either intentional or M-mode code is buggy, in my opinion.

> Such an entry is impossible under the existing standard.  There's no way to encode it.

Of course - I was referring to the sep M&L proposal (e.g. with MML=1, L=1, M=0).  Sorry for the confusion - too many proposals floating around.
The enhanced proposal can make a region inaccessible to M, accessible to S/U - but doesn't allow it to be locked.
I do understand that if you want to ensure that a locked rule can't be undone, you need to ensure that all entries with lower entry numbers than it are also locked.
That's another way of saying that we shouldn't be inventing new modes to prevent this; we just need to avoid obviously buggy code (I don't know if any of the proposals actually do something like that, so don't take that as criticism).
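In rough C, that check amounts to the following sketch (the pmpcfg read helper is hypothetical; only the standard L bit position is assumed):

    #include <stdbool.h>
    #include <stdint.h>

    #define PMPCFG_L (1u << 7)          /* standard lock bit in pmpNcfg */

    extern uint8_t read_pmpcfg(int i);  /* platform-specific CSR access */

    /* A locked entry can still be shadowed later unless every
     * lower-numbered (higher-priority) entry is locked as well. */
    bool pmp_entry_is_effectively_locked(int idx)
    {
        for (int i = 0; i <= idx; i++)
            if (!(read_pmpcfg(i) & PMPCFG_L))
                return false;
        return true;
    }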

> Mr. Kurd's proposal does not similarly lock down all of M mode's executable regions against modification nor prevent the creation of new PMP entries for executable regions.

Well, that would be a pretty simple modification of his proposal, then: disallow creation of new Mode X regions when MML=1.
If that modification was made, what drawbacks remain?
I do understand that this might require a particular BootRom coding sequence; I'd like to explore that to see if it would be unreasonable or not.
I don't see an equivalent of the shared data regions that the sep M&L and enhanced proposals have.
There is something similar that allows S/U to have execution privileges in a shared access region and only if it is unlocked, and I don't know if that should be a concern or not.

I'd still like to simplify either proposal by removing DMC and replacing it with "any entry locked".
Once you've locked a region, there is no need to access unmapped regions, since you can do that from the locked region.
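In rough C pseudocode, the replacement rule being suggested is (helper name hypothetical; this is a sketch, not text from any proposal document):

    #include <stdbool.h>

    extern bool any_pmp_entry_locked(void);   /* scans the pmpNcfg.L bits */

    /* Default for an M-mode access that matches no PMP entry: open until
     * the first entry is locked, closed afterwards - i.e. the effect of
     * DMC without a dedicated control bit. */
    bool m_mode_default_allowed(void)
    {
        return !any_pmp_entry_locked();
    }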



Re: comments on PMP enhancements

John Hauser
 

Hi Allen,

The point I was trying to make about locked PMP entries, but failed to
communicate before, is this:

When a system starts up after reset, PMP is enforced according to a
certain set of rules. For two of the proposals, the initial rules
(MML = 0 or MSL = 0) are the same as currently standardized. For Mr.
Kurd's separate M & L proposal, when MML = 0, entries created with
M = 0 follow the currently standardized rules as they should.

After reset, an early stage of boot software might create a locked
PMP entry under the initial PMP rules, intending to restrict M-mode's
access to a certain region of memory. The early boot stage might be
oblivious to the PMP enhancements or (perhaps more likely) it might
want to be compatible with later boot stages that may or may not be
aware of the PMP enhancements.

If a later stage then raises the level of PMP security by setting
MML = 1 or MSL > 0, existing PMP entries are now enforced according
to a new set of rules. For two of the proposals, the new rule for an
existing locked entry is that S/U mode has no access and M mode has
restricted access. But for Mr. Kurd's separate M & L proposal, the
new rule is that S/U mode gets restricted access and M mode gets _no_
access. This is what I find questionable. The original intention
under the initial rules was almost surely to give M mode restricted
access, which has now been revoked. And the worst part is, although
M mode itself no longer has access, it cannot deny access to S/U mode,
because the entry is locked.

Here's what happens, step by step:

1. Machine starts from reset.

2. Early boot stage creates a locked PMP entry under initial PMP
rules, intending to restrict M mode's access to memory.

3. Subsequent boot stage enables PMP enhancement, by setting MML = 1
(or MSL > 0).

Under the separate M & L proposal, at this point M mode has _no_ access
to the memory covered by the PMP entry created in step 2, but S/U mode
_does_ have access. And this situation is unchangeable because the
entry is locked.

Under the task group's working proposal or my four-security-level
proposal, M mode still has restricted access. It is S/U mode that
loses access. I argue that this is better, and the separate M & L
proposal is "wrong" on this point.
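To make the failure concrete, here is a rough sketch of steps 1-3 as CSR writes. The PMP bit positions are the standard ones; the mseccfg CSR number and the position of its MML bit are assumptions about the draft proposals, shown only for illustration.

    #include <stdint.h>

    #define PMP_R     (1u << 0)
    #define PMP_W     (1u << 1)
    #define PMP_X     (1u << 2)
    #define PMP_NAPOT (3u << 3)
    #define PMP_L     (1u << 7)

    static inline void write_pmp0(uintptr_t addr_word, uintptr_t cfg)
    {
        asm volatile("csrw pmpaddr0, %0" :: "r"(addr_word));
        asm volatile("csrw pmpcfg0, %0"  :: "r"(cfg));
    }

    void early_boot_stage(void)
    {
        /* Step 2: under the *initial* rules, lock an 8 KiB NAPOT region
         * at an illustrative address as read-only for all modes,
         * intending to restrict M-mode's own access to it. */
        write_pmp0((0x20000000u >> 2) | 0x3FFu, PMP_L | PMP_NAPOT | PMP_R);
    }

    void later_boot_stage(void)
    {
        /* Step 3: raise the security level (assumed mseccfg address and
         * MML bit position).  Under the separate M & L reading, the entry
         * above now denies M mode entirely while still granting S/U mode
         * access, and it cannot be revoked because the entry is locked. */
        asm volatile("csrsi 0x747, 1");
    }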

On a different topic, I wrote:

> Mr. Kurd's proposal does not similarly lock down all of M mode's
> executable regions against modification nor prevent the creation of new
> PMP entries for executable regions. That is the shortcoming to which I
> was referring.

You:

> Well, that would be a pretty simple modification of his proposal, then:
> disallow creation of new Mode X regions when MML=1.
> If that modification was made, what drawbacks remain?

Preventing the creation of new regions with execute permission is only
one piece of the lockdown of M-mode executable regions. The other is
to lock all existing PMP entries that give M mode execute permission,
so their address ranges can't be modified. The task group's working
proposal does that by locking all M-mode-only PMP entries, something
I replicated in my topmost security level, MSL = 3. Currently, the
separate M & L proposal doesn't do this piece either.

Here's what I wrote before about both shortcomings I identified in the
separate M & L proposal:

> I believe both of these flaws can be fixed, but only at the expense of
> the simple separation of the L and M bits. In fact, you start to get a
> design that looks more like my proposal.

And that's what I want to emphasize: Why patch a proposal with several
tweaks when, as I see it, there is another, cleaner proposal on hand
that already covers all the same needs?

You:

> I'd still like to simplify either proposal by removing DMC and replacing it
> with "any entry locked".
> Once you've locked a region, there is no need to access unmapped regions,
> since you can do that from the locked region.

That's conceivable, but has the disadvantage of sometimes requiring
one or more additional PMP entries. Thus, your desire to save one
flip-flop and a few gates for DMC could sometimes cost 40 or 50
flip-flops and corresponding logic to have at least one more PMP entry.
My own assessment is that, in any RISC-V core that implements standard
PMP, that one flip-flop and small bit of logic is not worth fretting
over.

Regards,

- John Hauser


Huawei review of different PMP enhancement schemes

Mr Tariq Kurd <tariq.kurd@...>
 

Hi everyone,

 

We have spent a considerable amount of time reviewing the different proposals and have come to some conclusions.

 

1. The PMP enhancement proposal can meet our needs with the following modifications

- DMC (default memory closed)

- DPL (delay PMP locking)

- Shared executable regions

--> This feature is missing from all the proposals except for the separate M&L proposal.

 

I've attached another modified version of the PMP enhancement proposal, this time including shared X/X and RX/X regions in the reserved programming encodings (and I've changed the name of the document to include “shared X”).

 

 

We want this to save code size, so we can share (e.g.) the C runtime library between the OS and application code.

 

Allen - I think that the "any region locked" proposal doesn't work for us, because we want to be able to remove access to any unmapped memory without locking a region. Sorry, it’s a nice idea.

 

Joe - The suggestion of running code in U-mode during the boot process to avoid locking regions when MML=1 is interesting, and certainly possible. However it's a lot more work to set up a handler and system calls back to the handler - it's like a light OS just for the boot process. DPL to delay the locking is a much simpler solution.

 

2. John Hauser's proposal is better than (1)

- the programming model is simpler - it's hard to get my head around the state changes when setting MML=1, but it's easier to follow John's

- the modes cover what we need to do

--> Shared executable regions are still missing - I think we also need to use the W and WX permissions as the PMP enhancement proposal does (it’s not clear if these are reserved or not – they are reserved in the spreadsheet but not in the document).

 

When MSL = 1, 2 or 3 and RWX=W   -> shared region with X permission

When MSL = 1, 2 or 3 and RWX=WX -> shared region with M-mode RX and S/U-mode X permission

 

This approach gives us the choice of whether to lock the regions or not, which is good although we think that locked shared X regions are probably sufficient.
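For reference, the two encodings above would be built like this in the standard pmpNcfg bit layout (the interpretation as shared-X regions is the proposal under discussion, not ratified behaviour; the lock/PL field is left out because it differs between the proposals):

    #include <stdint.h>

    #define PMP_R (1u << 0)
    #define PMP_W (1u << 1)
    #define PMP_X (1u << 2)

    /* RWX = 010: shared region, executable by both M and S/U mode. */
    static inline uint8_t shared_x_cfg(void)       { return PMP_W; }

    /* RWX = 011: shared region, M-mode RX and S/U-mode X. */
    static inline uint8_t shared_mrx_sux_cfg(void) { return PMP_W | PMP_X; }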

 

3. Separate M&L doesn't offer many benefits

- it was a simple proposal to see what would happen by separating out the bits.

- it naturally has shared executable regions

 

In summary

 

- PMP enhancement + DPL + DMC + shared X is ok for us

- John’s 4-level scheme + shared X is better for us

 

Tariq

 

PS: if we do develop John's proposal further, then we think there are some corner cases to resolve. The programming model is a bit strange because some of the configuration bits become read-only at different times, which is not good for software that will try to program a configuration and then quietly get a different configuration.

 

The two obvious examples are

- trying to program PL=1 or 3 when MSL=0

- trying to program an X region when MSL=3

I think in these cases PL[0] and the X bit, respectively, are ignored when writing the configuration. We would prefer an exception, to signal that the software programming the PMP is clearly wrong and give us a chance to debug it.

 

 

 

 



Re: comments on PMP enhancements

Allen Baum
 

It is just one extra bit if it's combined with an existing CSR; otherwise it's extra decoding, scan, and DVT logic (not to mention an extra read mux and write enable, regardless of whether it's merged into another CSR).
My issue with "just one extra bit" is state explosion, which makes more corner cases and validation problems - just what you want to avoid for security.
Practically speaking, I'm trying to come up with a case where making DMC contingent on any entry being locked will add an entry.
The only one I can think of is that the initial bootrom image is never mapped, so once any entry is locked, it is unreachable.
Even there, the next stage must be validated and (eventually) locked, so any bootrom code required beyond that point could be included in the next stage 
(that initial bootrom should be pretty small), or locking could be delayed until all the entries that need access to it are defined (including the case where bootrom is contiguous with the next stage).
Can you give an example of a boot sequence that would need an extra entry? 
 





Re: Huawei review of different PMP enhancement schemes

John Hauser
 

Tariq Kurd wrote:
> 2. John Hauser's proposal is better than (1)
>
> - the programming model is simpler - it's hard to get my head around
>   the state changes when setting MML=1, but it's easier to follow
>   John's
>
> - the modes cover what we need to do
>
> --> Shared executable regions are still missing - I think we also
> need to use the W and WX permissions as the PMP enhancement proposal
> does (it's not clear if these are reserved or not - they are reserved
> in the spreadsheet but not in the document).

Hi Tariq,

I meant for the encodings with W = 1 and R = 0 to continue to be
reserved, as the spreadsheet indicates, but you're right that I forgot
to say so in my document.

I have a different suggestion for adding support for shared executable
regions. I've attached a new spreadsheet file with a new tab showing
my upgraded proposal, version 0.3. You can easily see the changes
by switching between the tabs labeled "4level 0.2" and "4level 0.3".
In my newest version, PMP entries with PL = 0, W = 0, and X = 1
(executable but not writable in S/U mode) are now readable/executable
in M mode. Previously all entries with PL = 0 were readable/writable
for M mode.

When MSL = 3, all PMP entries that give execute permission to M mode
are locked. And new entries that would give execute permission to
M mode cannot be created when MSL = 3, the same as before.

I haven't updated my document yet, but I thought we could debate my
version 0.3 just from the spreadsheet.

I look forward to all feedback.

- John Hauser


Re: Huawei review of different PMP enhancement schemes

Allen Baum
 

Just as I have been asking why DMC is necessary, I have to ask why the DPL bit is necessary.
If there is code that wants to reorder PMP entries while DPL is 1 but the lock bits are set - why don't you instead simply not program any lock bits until you get to the point at which you would have changed DPL from 1->0? As the doc mentions: "It is noted that this style of boot flow does not prevent the PMP being unlocked again by software, and so the security is lower than if the regions remain locked."
If you are executing code that has not been authenticated while existing entries are unlocked (or the L bit is set but hasn't taken effect), then you have a security issue.
The DPL bit doesn't fix that, therefore it seems to me that the sequence above (separate "lock everything that needs locking" phase) gives you equivalent security.
Also note that DPL is really two bits when implemented, since it has 3 states (initially 0, has been set to 1, has transitioned back to 0 and is now locked).

Can someone show a sequence that has higher security with DPL compared to a sequence that sets all the lock bits at the point that DPL would have been cleared?
Ditto for DMC: can someone show a sequence (and memory map) that causes an extra entry to be required if default memory closed is instead defined as "any entry is locked"?
If someone doesn't demonstrate one (that can't be easily modified to avoid the problem with equivalent security), I can't support either.





Re: [RISC-V] [tech-tee] [RISC-V] [tech-privileged] Huawei review of different PMP enhancement schemes

Mr Tariq Kurd <tariq.kurd@...>
 

> why don't you instead simply not program any lock bits until you get to the point that you would have changed DPL from 1->0?

 

Because we can't program the permissions we need without locking the entry: the two functions - changing permissions and locking - are mixed up, and we need them to be separated. DPL is a cheap method of doing this; John Hauser's approach is better.

 

> Also note that DPL is really two bits when implemented, since it has 3 states (initially 0, has been set to 1, has transitioned back to 0 and is now locked).

 

Yes it’s clunky but it has to reset to being disabled to match the current standard.

 

> Can someone show a sequence that has higher security with DPL compared to a sequence that sets all the lock bits at the point that DPL would have been cleared?

It's more related to the fact that we simply can't program the permissions we need without locking the region unless we have DPL, which is why we currently have a non-standard PMP implementation with separate RWX bits for M-mode and U-mode; we'd like to move to a standard implementation.

 

> Ditto for DMC: can someone show a sequence (and memory map) that causes an extra entry to be required if default memory closed is defined as "any entry is locked"?

At reset the boot vector could be hijacked through fault injection or hardware modification of the die. Having memory closed at reset (except for a small section of the boot ROM) is required to prevent this attack, so that the core can ONLY boot from the boot ROM. To achieve this without DMC, we would need the highest-numbered entry to reset to a state covering all of memory with no access, burning a PMP entry to do it. So we need this in place at reset.
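For illustration, the values such a burned entry would hold look like this; the point is that they must already be in effect at reset (i.e. hard-wired), and the entry index assumes a 16-entry RV64 implementation.

    #include <stdint.h>

    #define PMP_NAPOT (3u << 3)
    #define PMP_L     (1u << 7)

    /* Highest-numbered entry: a locked NAPOT region spanning the whole
     * address space with RWX = 000, so any access that matches no other
     * (lower-numbered, higher-priority) entry is denied. */
    void show_catch_all_entry(void)
    {
        /* all-ones pmpaddr + NAPOT mode = full address range */
        asm volatile("csrw pmpaddr15, %0" :: "r"(~0ul));
        /* entry 15 lives in byte 7 of pmpcfg2 on RV64 */
        asm volatile("csrs pmpcfg2, %0"
                     :: "r"((uint64_t)(PMP_L | PMP_NAPOT) << 56));
    }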

 

Tariq

 




Re: [tech-privileged] hypervisor extension: seL4 experience and feedback

John Hauser
 

Hi Gernot and Yanyan,

It's been a couple of months since you first sent (Dec. 4) your
document reporting your experience adapting the seL4 microkernel to
draft 0.4 of the RISC-V hypervisor extension, with some questions about
the then-current 0.5 draft. I earlier responded in detail to your
feedback from sections 4 and 5 of your document. I'd like to respond
finally to a couple remaining issues raised in sections 6 and 7.

> Q6: How are the two instructions, RDCYCLE and RDINSTRET, treated
> by the hypervisor extension? Are they going to return the cycles
> consumed and instructions retired by the current running VM only?

Without additional "delta" registers like RDTIME's htimedelta,
the expectation currently is that bits CY and IR in hcounteren for
the cycle and instret counters will normally be set to zero. The
hypervisor thus gets to emulate these counters for the virtual machine,
adjusting the global cycle and instret counts as necessary.

It's perfectly reasonable to question whether emulating the cycle and
instret counters will be too expensive in practice. The official line
for now is that emulation should be tolerable. RDCYCLE and RDINSTRET
are expected to be used only for performance measurements, and should
not be executed too frequently.
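For what it's worth, the emulation path being described looks roughly like this (guest-state structure and helper names are hypothetical; the exact trap cause for a blocked counter read depends on the draft):

    #include <stdint.h>

    struct vcpu {
        uint64_t cycle_delta;   /* per-VM adjustment kept by the hypervisor */
        uint64_t gprs[32];      /* guest integer registers */
    };

    static inline uint64_t read_cycle(void)
    {
        uint64_t v;
        asm volatile("rdcycle %0" : "=r"(v));
        return v;
    }

    /* Called from the HS-mode trap handler once the faulting instruction
     * has been decoded as "rdcycle rd"; RDINSTRET is handled the same way
     * with its own delta. */
    void emulate_rdcycle(struct vcpu *vcpu, unsigned rd)
    {
        if (rd != 0)
            vcpu->gprs[rd] = read_cycle() + vcpu->cycle_delta;
        /* the caller then advances the guest pc past the instruction */
    }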

> The v0.5 draft states that the accesses to the VS CSRs in VS-mode
> cause illegal instructions, so nested virtualization could be built
> on trap-and-emulate. Similarly, accesses to HS-mode CSRs from the
> second-level hypervisor also need to be trapped and emulated. This
> approach naturally raises concerns about the overhead of trapping,
> decoding, and handling the CSR accesses. As Arm and x86 already
> added hardware support for nested virtualisation, are we anticipating
> similar hardware support in RISC-V?

Additional optional hardware for nested hypervisors is being
considered. More about this may come out later in 2020 or next year.
Right now, other components that are needed for a server-class RISC-V
platform are probably a higher priority.

Regards,

- John Hauser


Re: Huawei review of different PMP enhancement schemes

Nick Kossifidis
 

Some thoughts on the various proposals on the spreadsheet (v0.3):


M&L proposal:

The purpose of the M bit is not clear. I get that the idea is to be able to
mark a rule that applies to M mode without having that rule also locked,
but, for example, the combination L,M = 1,0 when MML = 0 doesn't follow
that principle: it marks a rule as locked and also as enforced on M-mode
even though M = 0. When MML = 1 we get unlocked M-mode-only regions when
L,M = 0,1, but we also get locked S/U-only regions when L,M = 1,0, which
doesn't make much sense (I think John also brought this up).


4level0.2:

With MSL=0 we get the current PMP behavior and with MSL=3 we get mostly
the same behavior as with MML=1 on the group's proposal, only with one
extra bit being used on pmpcfg and a redundant encoding (PL=2 and PL=3
are the same thing). Also it's possible to have a shared region that's
executable by S/U mode and RW by M mode, which is not possible with the
group's proposal as is.

With MSL=2 we get rid of the restriction of not being able to add new
executable M-mode-only regions, however that can be achieved by using
non-locked M-mode-only regions that are also available on MSL=2 (with
PL=3) since there is no such restriction defined for them. In other
words non-locked M-mode-only regions allow for this restriction to be
bypassed anyway.

With MSL=1 we get rid of the restriction of not allowing M-mode to
execute a region without a matching rule. However both locked and
non-locked M-mode-only regions allow for this restriction to be bypassed
on MSL=2 anyway since M-mode can just add such a rule and execute the
region, it's even worse with non-locked rules since afterwards M-mode
can also remove the rule and no one will ever know it happened. So to me
MSL=1 is redundant, I don't see any use for it. It's also obviously
redundant when DMC=1 but I'll come back to DMC later on.

So basically the extras we get are:
a) It's possible to have a region that's executable by S/U and RW by
M-mode for MSL > 0
b) It's possible to have removable M-mode-only rules when 0 < MSL < 3


4level0.3:

This is dangerous! With this revision it's possible to have a region
that's rw by S/U mode and executable by M mode when PL=0, which allows
an attacker to perform the attack described in the group's proposal
and is exactly what we are trying to prevent. This is possible at all
security levels, by the way, even with MSL=3. It's also more complicated,
since PL=0 at MSL=3 encodes both locked and non-locked rules. Finally,
when MSL=3 and PL=3 we get removable M-mode-only, non-executable
regions, at the highest security level. In terms of security it's a
regression over revision 0.2, not an improvement.


Regarding DMC:

As shown above, restricting M-mode from executing memory regions without
a matching rule only makes sense if it's not possible to add such a
rule (one that allows execution). If it's possible to add a rule that
applies to M-mode, then any restrictions regarding regions without a
matching rule are a few instructions away from being bypassed. The same
applies when restricting r/w/x on M-mode with the DMC bit. In both
proposals DMC can be easily bypassed.

Even if we incorporate DMC on the group's proposal we 'll still be able
to add a rule that gives r/w privileges on M-mode, although this rule
will be a locked one so it'll at least be possible to detect this event.

However DMC to me is orthogonal to the various scenarios we discuss and
given that it's possible to reset the hart with a pre-defined set of PMP
rules, it makes sense to have such a mechanism. That's why my initial
reaction to Tariq's proposal regarding DMC, was to propose to him to
submit a separate proposal for this.


What we discussed on this week's TEE TG call:

a) Incorporate mseccfg.DMC to the group's proposal. It'll be a sticky
bit so when it gets set it can only be unset through hard-reset.

b) Allow for M-mode-only rules to be removable temporarily for debugging
/ flexibility purposes during boot (since this approach weakens PMP it
can't be defined as a security feature), with a big disclaimer/warning
in place, through the proposed DPL bit on mseccfg. This is also going to
be defined as an optional feature.

c) Add another bit for locking DPL, it'll only be possible to lock DPL
to 0 (disabled).

d) Use the remaining 2 encodings L=1,R=0,W=1,X=0 and L=1,R=0,W=1,X=1
when MML=1 to define a locked shared region that's executable by both M
and S/U mode but not writable by anyone (when X is set it's also
readable by M-mode), as Tariq proposed. The use case for this is to
share code between M-mode and S/U-mode, e.g. to support vendor-specific
extensions with custom assembly, without having to go through an ecall
(similar to Linux's VDSO).

e) Get rid of the security exception, use normal access faults instead.
S/U mode can use SBI to request more info from M-mode if needed (since
S/U can't access PMP registers to figure it out).


Regards,
Nick


Re: Huawei review of different PMP enhancement schemes

John Hauser
 

Nick Kossifidis wrote:
> 4level0.3:
>
> This is dangerous! With this revision it's possible to have a region
> that's rw by S/U mode and executable by M mode when PL=0, [...]

I agree that would be dangerous, but I intentionally excluded that
possibility, so I don't understand. What is the exact encoding that
you think allows this, when MSL > 0?

> Finally
> when MSL=3 and PL=3 we get removable M-mode-only, non-executable
> regions, at the highest security level. In terms of security it's a
> regression over revision 0.2, not an improvement.

That detail could easily be changed, if that's the only remaining
complaint about the security.

- John Hauser


Re: Huawei review of different PMP enhancement schemes

Jonathan Behrens <behrensj@...>
 

John Hauser wrote:
> Nick Kossifidis wrote:
> > Finally
> > when MSL=3 and PL=3 we get removable M-mode-only, non-executable
> > regions, at the highest security level. In terms of security it's a
> > regression over revision 0.2, not an improvement.
>
> That detail could easily be changed, if that's the only remaining
> complaint about the security.

I don't understand how having extra bit patterns for the PMP config registers compromises security. Isn't it pretty much a given that the values loaded into the PMP address registers and PMP config registers (and all other security-relevant CSRs: mtvec, satp, mideleg, etc.) must be correct? If having an "M-mode-only, non-executable region" doesn't match your security goals, then don't program one?

Nick Kossifidis wrote:
> As shown above, restricting M-mode from executing memory regions without
> a matching rule only makes sense if it's not possible to add such a
> rule (one that allows execution). If it's possible to add a rule that
> applies to M-mode, then any restrictions regarding regions without a
> matching rule are a few instructions away from being bypassed.

The restriction still makes sense as a form of defense in depth. Plus, "a few instructions" at elevated privilege is a rather high bar: that is all it takes to escape from a JavaScript sandbox, to escalate from user mode to kernel mode, or to break out of a VM. Yet all of those isolation mechanisms provide very real security, because even the bugs they do have still make it rather hard to execute specific desired instructions.

Jonathan



Re: [tech-privileged] hypervisor extension: seL4 experience and feedback

Andy Glew Si5
 

Intel's RDTSC is used not just for performance measurements but also for timestamps, and not just by databases but by enough generic Linux code that Intel was forced to ensure that RDTSC was globally synchronized in multiprocessor systems, and I think also in systems with multiple CPU chips.

 

> RDCYCLE and RDINSTRET are expected to be used only for performance measurements, and should not be executed too frequently.

 

Intel's equivalent of RDINSTRET is used by fault-tolerant code.  Not really realtime lockstep fault tolerance, but the sort of fault tolerance that Tandem used to do. Banking level fault tolerance. Checkpoint restart.

 

Let me sketch such a system:

 

Multiprocessor UNIX, but no shared memory communication. Message passing only.

 

For every message passed that communicates between processes, record the system call, information provided/returned, and the instruction count. This is essentially a recovery log.

 

Periodically checkpoint.

 

After a failure, restore from checkpoint. Start executing. Arrange to stop when instruction count reaches the instruction count of the next item in the log.  (Do some checks.) Insert the data. Repeat until you have consumed the log.
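In pseudocode, the recovery loop being described is roughly (all types and helpers are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>

    struct log_record {
        uint64_t instret;   /* retired-instruction count at the interaction */
        /* ... system call number, data provided/returned ... */
    };

    extern void restore_checkpoint(void);
    extern bool next_record(struct log_record *rec);
    extern void run_until_instret(uint64_t target);  /* stop the process at an
                                                        exact retired count */
    extern void inject(const struct log_record *rec);

    void recover(void)
    {
        struct log_record rec;

        restore_checkpoint();
        while (next_record(&rec)) {
            run_until_instret(rec.instret);   /* deterministic replay point */
            /* (do some checks) */
            inject(&rec);                     /* re-deliver the logged input */
        }
    }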

 

Instruction count provides a well-characterized place where external interactions can be replayed into a process to get it to the state where it can pick up.

 

You can do this to replay shared memory interactions, but there are far too many possible interaction points to be practical in general. Constraining is possible, but is fragile, because you can always break the rules by accident.

 

There are other ways to accomplish the same thing. E.g. you could just count system calls ... but that doesn't work if you have message passing at user level.  You could instrument the message passing libraries. But it's really nice if it works for arbitrary binaries (at least as long as they are not using shared memory).  Spin loops are always an issue.

 

But, anyway, although there are other ways to do the same thing, this sort of thing is the most elegant way I have seen to do it. Modulo the no shared memory restriction. And it is real, not hypothetical.

 

I.e. reading the time and reading the instruction count have been used for real functionality, not just for performance measurement.

 

 

 

 



Re: [tech-privileged] hypervisor extension: seL4 experience and feedback

Andy Glew Si5
 

Let me withdraw the part about RDTSC - I confused RISC-V RDCYCLE and RDTSC.

 

However, my point about people using instruction retired count in real life for real functionality remains.

 



Re: [tech-privileged] hypervisor extension: seL4 experience and feedback

Allen Baum
 

Yea, I remember NonStop folks explaining how they were going to considerably simplify their implementation by not relying on lockstep, but instead on counting retired instructions.

But they had to very carefully define what a retired instruction was, and I'm pretty sure that doesn't match RISC-V. E.g. if you get any kind of trap on an access that is later replayed, it should only get counted once. You also want to be able to account for trap handlers separately, etc.

-Allen



Re: [RISC-V] [tech-tee] Huawei review of different PMP enhancement schemes

Nick Kossifidis
 

On 2020-02-28 21:14, John Hauser wrote:
> Nick Kossifidis wrote:
> > 4level0.3:
> >
> > This is dangerous! With this revision it's possible to have a region
> > that's rw by S/U mode and executable by M mode when PL=0, [...]
>
> I agree that would be dangerous, but I intentionally excluded that
> possibility, so I don't understand.  What is the exact encoding that
> you think allows this, when MSL > 0?

You've excluded the possibility of having a region that's writeable by S/U and executable by M mode at the same time. However, it's possible for S/U to put code in an R/W region and then request that M-mode mark that region as R-X with PL=0, in which case this will also make the region executable by M-mode. The restriction on adding new regions that are executable by M-mode only applies at MSL=3, not at MSL < 3. Depending on the boot flow it's also possible for such a region to be added while MSL < 3 and still affect the system once MSL=3, at which point it won't be possible to remove it. Also, at MSL=3 it's not clear to me whether this restriction applies to rules with PL=0: you say that when a rule with PL=0 is X or R-X it becomes locked, but does this apply only to existing rules, or is it possible to add new ones? From what I understand it's otherwise possible to add new rules with PL=0. If that's the case, isn't it a bit over-complicated to have PL=0 encode both locked and non-locked rules at the same time at MSL=3, and/or to allow some rules with PL=0 to be registered but not others?

In contrast, the group's proposal makes any region marked with L=0 inaccessible to M-mode once MML is set, so even if a rule providing X privileges exists before MML is set, it won't affect M-mode after MML is set. Also, it's not possible to define a shared region before MML gets set, since the combination RW=01 is reserved in the current spec (which is the same as MML=0) and PMP registers are WARL. Once MML is set, it's only possible to add non-executable shared regions; the only way to add a shared region that is executable by both M and S/U mode, using the remaining encodings (still without using any extra bits on pmpcfg) as Tariq suggested, will be through setting the DPL bit first, which is going to be an optional feature as we discussed in the group and will come with a proper warning / disclaimer.
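
To make the two rules just mentioned concrete, here is a minimal sketch; it captures only what is stated above (RW=01 is reserved and WARL-adjusted while MML=0, and once MML is set an entry with L=0 grants M-mode no access), not the full rule table of any proposal, and all names are invented for illustration.

#include <stdbool.h>

/* Invented representation of one pmpcfg entry's permission bits. */
struct pmp_entry {
    bool r, w, x, l;
};

/* While MML=0 the combination R=0,W=1 is reserved; since pmpcfg is
 * WARL, one legal implementation choice (shown only as an example)
 * is to drop the W bit, so no "shared" encoding can be staged early. */
static void pmp_warl_adjust_mml0(struct pmp_entry *e)
{
    if (!e->r && e->w)
        e->w = false;
}

/* Once MML is set, an entry with L=0 gives M-mode no access at all,
 * so a stale S/U executable rule cannot carry over to M-mode. */
static bool mml1_entry_grants_m_access(const struct pmp_entry *e)
{
    return e->l;
}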

>> Finally, when MSL=3 and PL=3 we get removable M-mode-only, non-executable
>> regions, at the highest security level. In terms of security it's a
>> regression over revision 0.2, not an improvement.
> That detail could easily be changed, if that's the only remaining
> complaint about the security.

It's not just the security issues mentioned: the redundant encodings, and the fact that it uses an extra bit on pmpcfg even though we can get what we want without doing so (leaving more options for future use), are also concerns. The overall complexity of the 4level proposal was also brought up by others on our last TEE TG call, and since we reached a consensus on integrating Tariq's proposal into the group's proposal, I'm going with that approach instead.

Regards,
Nick


Re: Huawei review of different PMP enhancement schemes

John Hauser
 

Nick Kossifidis wrote:
> You've excluded the possibility of having a region that's writeable by
> S/U and executable by M mode at the same time. However, it's possible
> for S/U to put code in an R/W region and then request that M-mode mark
> that region as R-X with PL=0, in which case the region also becomes
> executable by M-mode. The restriction against adding new regions that
> are executable by M-mode only applies at MSL=3, not at MSL < 3. [...]
All of the other proposals, including the one you favor, have this
exact same property when MML = 0.  As I wrote in my document, and
have tried repeatedly to make clear, the settings of MSL below 3 are
intended to be used only in circumstances where MML would be 0 under
the other proposals.

Instead of setting MML = 1, my proposal requires that a programmer set
MSL = 3. I cannot believe that such a substitution is beyond the grasp
of programmers who are otherwise entrusted to write security-conscious
code.

- John Hauser


Re: [tech-privileged] hypervisor extension: seL4 experience and feedback

Shen, Yanyan (Data61, Kensington NSW) <yanyan.shen@...>
 

Hi John,

See my responses inline below.


Regards,
Yanyan


On Thu, 2020-02-27 at 13:54 -0800, John Hauser wrote:
> Hi Gernot and Yanyan,
>
> It's been a couple of months since you first sent (Dec. 4) your
> document reporting your experience adapting the seL4 microkernel to
> draft 0.4 of the RISC-V hypervisor extension, with some questions about
> the then-current 0.5 draft.  I earlier responded in detail to your
> feedback from sections 4 and 5 of your document.  I'd like to respond
> finally to a couple remaining issues raised in sections 6 and 7.

Thanks very much for your installments, which clarify things and help
us to understand the extension.


>> Q6: How are the two instructions, RDCYCLE and RDINSTRET, treated
>> by the hypervisor extension? Are they going to return the cycles
>> consumed and instructions retired by the current running VM only?
>
> Without additional "delta" registers like RDTIME's htimedelta,
> the expectation currently is that bits CY and IR in hcounteren for
> the cycle and instret counters will normally be set to zero.  The
> hypervisor thus gets to emulate these counters for the virtual machine,
> adjusting the global cycle and instret counts as necessary.

So, it is expected that the instructions return the cycles consumed and
instructions retired by the calling VM. However, it is up to the
hypervisor to decide the accuracy of the values returned.

> It's perfectly reasonable to question whether emulating the cycle and
> instret counters will be too expensive in practice.  The official line
> for now is that emulation should be tolerable.  RDCYCLE and RDINSTRET
> are expected to be used only for performance measurements, and should
> not be executed too frequently.

I agree that trap-and-emulate will work, and the performance may be
acceptable if the registers are accessed infrequently.

As Andy already pointed out, RDINSTRET could be quite useful for other
purposes as well (e.g., record-and-replay or redundant execution).
Would it be possible to add a filter or mask so that user-mode and
kernel-mode retired instructions could be counted separately?

A related question is the accuracy of RDINSTRET. Are over-counting or
under-counting allowed under certain conditions? How much freedom does
an implementation have in interpreting the meaning and accuracy of the
RDINSTRET instruction?


>> The v0.5 draft states that the accesses to the VS CSRs in VS-mode
>> cause illegal instructions, so nested virtualization could be built
>> on trap-and-emulate. Similarly, accesses to HS-mode CSRs from the
>> second-level hypervisor also need to be trapped and emulated. This
>> approach naturally raises concerns about the overhead of trapping,
>> decoding, and handling the CSR accesses. As Arm and x86 already
>> added hardware support for nested virtualisation, are we anticipating
>> similar hardware support in RISC-V?
>
> Additional optional hardware for nested hypervisors is being
> considered.  More about this may come out later in 2020 or next year.
> Right now, other components that are needed for a server-class RISC-V
> platform are probably a higher priority.

Good to know that nested virtualisation is being considered. I
understand there are higher priority tasks.


> Regards,
>
>     - John Hauser



Re: [tech-privileged] hypervisor extension: seL4 experience and feedback

Andy Glew Si5
 

> As Andy already pointed out, RDINSTRET could be quite useful for other purposes as well (e.g., record-and-replay or redundant execution). Would it be possible to add a filter or mask so that user-mode or kernel-mode retired instructions could be counted separately?

 

I like the filter/mask idea, as I will explain below, but I think it belongs more to generic performance event counters, not RDINSTRET or RDCYCLE.  I think those instructions should do one thing and one thing well. If they can be configured, then it will be harder to use them locally, e.g. in a library, without knowledge of the global setting.

 

As for filtering of generic performance counters:

 

x86 EMON has generic filtering:

  • Any event count can be qualified according to user/kernel, and obviously both.  I would hope that hypervisor/guest qualification was added with VT.

Therefore you can count cache misses, instructions retired, instructions speculatively decoded, and so on, qualified as user/kernel/hypervisor/any.

 

Further filtering:

  • Thresholding: in any given cycle the counter is incremented only if >N events of the specified type have occurred
    • e.g. only if the superscalar width is two or above.
  • Edge detect: e.g. to determine the average duration of idle periods.
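
For reference, x86's per-event select registers (IA32_PERFEVTSELx) pack these qualifications into one control word; the field layout below is quoted from memory as an illustration only, so consult the Intel SDM for the authoritative definition.

/* Approximate IA32_PERFEVTSELx fields (illustrative, not authoritative). */
#define EVTSEL_EVENT_MASK  0x000000ffULL  /* event select                  */
#define EVTSEL_UMASK_MASK  0x0000ff00ULL  /* unit mask                     */
#define EVTSEL_USR         (1ULL << 16)   /* count while in user mode      */
#define EVTSEL_OS          (1ULL << 17)   /* count while in kernel mode    */
#define EVTSEL_EDGE        (1ULL << 18)   /* edge detect                   */
#define EVTSEL_INT         (1ULL << 20)   /* interrupt on overflow         */
#define EVTSEL_EN          (1ULL << 22)   /* enable this counter           */
#define EVTSEL_INV         (1ULL << 23)   /* invert the counter-mask test  */
#define EVTSEL_CMASK_MASK  0xff000000ULL  /* threshold: count cycles with  */
                                          /* at least CMASK events         */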

 

Note: “instructions retired” vs. “speculative instructions” is not a generic distinction, since there are many possible places where one can count speculative instructions; similarly for speculative cache misses.

 

These are all great things, great for performance analysis.  But there are never enough performance counters to count everything in one pass, so they need to be managed globally, which, as Jack Dennis (the static dataflow guy) says, "violates software engineering modularity".

 

Providing fixed well-characterized definitions of RDCYCLE and RDINSTRET allows at least these events to be used locally, e.g. for usage aware algorithms, within functions and classes. Without having to mess with a global management infrastructure.
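
As an example of that kind of local use, a library routine can bracket its own work with rdcycle and adapt, without touching any global counter configuration. This is only an illustrative sketch for RV64 with GCC/Clang inline assembly; the item type and the two helper routines are hypothetical placeholders.

#include <stdint.h>
#include <stddef.h>

struct item;                                   /* opaque payload type       */
void process_item_cheaply(struct item *it);    /* hypothetical helpers,     */
void process_item_precisely(struct item *it);  /* declared but not defined  */

static inline uint64_t rdcycle(void)
{
    uint64_t c;
    __asm__ volatile ("rdcycle %0" : "=r"(c));
    return c;
}

/* A usage-aware algorithm: fall back to the cheap path once the work
 * done so far exceeds a locally chosen cycle budget. */
void process(struct item *items, size_t n, uint64_t cycle_budget)
{
    uint64_t start = rdcycle();

    for (size_t i = 0; i < n; i++) {
        if (rdcycle() - start > cycle_budget)
            process_item_cheaply(&items[i]);
        else
            process_item_precisely(&items[i]);
    }
}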

 

 

 

 

 

 

 


Proposal for accelerating nested virtualization on RISC-V

Anup Patel
 

A clarification is needed in the RISC-V H-Extension spec regarding the
scope of the HSTATUS.VTVM bit. Currently, as per the spec, all virtual
memory management instructions (both SFENCEs and HFENCEs) trap to
HS-mode when HSTATUS.VTVM == 1 and V == 1. Rather, only SFENCEs need
to be trapped to HS-mode when HSTATUS.VTVM == 1 and V == 1, because
HFENCEs are only defined for HS-mode (i.e. V == 0).

To better describe nested virtualization, we define the following dummy
privilege modes:
Host HS-mode
  Host hypervisor kernel will run in this mode
  Software in this mode will actually run in HW HS-mode
Host U-mode
  Host hypervisor user-space will run in this mode
  Software in this mode will actually run in HW U-mode
Guest HS-mode
  Guest hypervisor kernel will run in this mode
  Software in this mode will actually run in HW VS-mode
Guest U-mode => HW VU-mode
  Guest hypervisor user-space will run in this mode
  Software in this mode will actually run in HW VU-mode
Guest VS-mode => HW VS-mode
  Software in this mode will actually run in HW VS-mode
Guest VU-mode => HW VU-mode
  Software in this mode will actually run in HW VU-mode

A high-level software approach for nested virtualization in RISC-V
can be as follows:
1. The Host HS-mode (Host hypervisor) will enable HSTATUS.VTSR to
   emulate the SRET instruction for the Guest. This emulation will
   involve a CSR world-switch when switching from Guest HS/U-mode
   to/from Guest VS/VU-mode.
2. Virtual interrupts will be injected into Guest VS/VU-mode after
   doing the CSR world-switch (in point 1 above) from Guest HS/U-mode
   to Guest VS/VU-mode.
3. All accesses to "h<xyz>" and "vs<xyz>" CSRs from the Guest will trap
   to Host HS-mode (Host hypervisor), where:
   a) These CSRs will be emulated for Guest HS-mode
   b) For Guest U-mode and Guest VS/VU-mode, the trap will
      be forwarded to Guest HS-mode
4. The Host HS-mode (Host hypervisor) will manage two Stage2 page
   tables:
   a) A regular Stage2 page table for Guest HS/U-mode
   b) A shadow Stage2 page table for Guest VS/VU-mode. Of course,
      Host HS-mode (Host hypervisor) will have to do a software walk
      of the Guest HS-mode HGATP page table when populating mappings
      in the shadow Stage2 page table, which will contain mappings
      that are the combined effect of the Guest HS-mode HGATP page
      table and the regular Stage2 page table.
5. All HFENCEs will trap to Host HS-mode, where the Host HS-mode
   (Host hypervisor) will:
   a) Trap-and-emulate HFENCE.VVMA and HFENCE.GVMA for Guest HS-mode
   b) Redirect HFENCE.VVMA and HFENCE.GVMA traps from Guest VS-mode
      to Guest HS-mode irrespective of Guest HS-mode HSTATUS.VTVM
6. All HLV/HSV instructions from Guest HS/U-mode and Guest VS/VU-mode
   will trap to Host HS-mode (Host hypervisor), where:
   a) HLV/HSV instructions from Guest HS/U-mode will be emulated
      by Host HS-mode (Host hypervisor)
   b) HLV/HSV instructions from Guest VS/VU-mode will be forwarded
      to Guest HS-mode by Host HS-mode (Host hypervisor)
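
To make the dispatch in points 3, 5, and 6 concrete, here is a minimal, hypothetical sketch of the Host HS-mode trap handler; every type, enum value, and helper below is a placeholder invented for illustration (none of it comes from the spec or an existing hypervisor), and trap-cause decoding is abstracted behind classify_trapped_insn().

#include <stdbool.h>

struct vcpu;                                  /* opaque per-vCPU state */

enum trapped_insn_kind {
    TRAPPED_OTHER,
    TRAPPED_H_OR_VS_CSR_ACCESS,               /* point 3 */
    TRAPPED_HFENCE,                           /* point 5 */
    TRAPPED_HLV_HSV,                          /* point 6 */
};

/* Placeholder helpers a real hypervisor would implement. */
enum trapped_insn_kind classify_trapped_insn(unsigned long insn);
bool vcpu_in_guest_hs(const struct vcpu *vcpu);
bool vcpu_in_guest_u(const struct vcpu *vcpu);
void emulate_trapped_insn(struct vcpu *vcpu, unsigned long insn);
void forward_trap_to_guest_hs(struct vcpu *vcpu, unsigned long cause);

void host_hs_trap_handler(struct vcpu *vcpu, unsigned long cause,
                          unsigned long insn)
{
    switch (classify_trapped_insn(insn)) {
    case TRAPPED_H_OR_VS_CSR_ACCESS:                 /* point 3 */
        if (vcpu_in_guest_hs(vcpu))
            emulate_trapped_insn(vcpu, insn);        /* 3a */
        else
            forward_trap_to_guest_hs(vcpu, cause);   /* 3b */
        break;
    case TRAPPED_HFENCE:                             /* point 5 */
        if (vcpu_in_guest_hs(vcpu))
            emulate_trapped_insn(vcpu, insn);        /* 5a */
        else
            forward_trap_to_guest_hs(vcpu, cause);   /* 5b */
        break;
    case TRAPPED_HLV_HSV:                            /* point 6 */
        if (vcpu_in_guest_hs(vcpu) || vcpu_in_guest_u(vcpu))
            emulate_trapped_insn(vcpu, insn);        /* 6a */
        else
            forward_trap_to_guest_hs(vcpu, cause);   /* 6b */
        break;
    default:
        /* guest page faults, interrupts, etc. are not covered here */
        break;
    }
}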

Please point out any case that is not considered in the above high-level
software approach for nested virtualization.

Based on the above high-level software approach, we propose a way to
accelerate nested virtualization performance by reducing "h<xyz>" and
"vs<xyz>" CSR access traps from VS-mode to HS-mode (point 3 above).

As per our proposal, we convert "h<xyz>" and "vs<xyz>" CSR accesses
from VS-mode into memory accesses relative to a nested context base
(or <nested_context_base>).

The enable bit (or <nested_enable>) for the above-described CSR access
conversion and the <nested_context_base> can be specified via a new
HNESTED CSR.

<nested_enable> = HNESTED[0]
<nested_context_base> = HNESTED[XLEN-1:1] << log2(XLEN / 8)

Note: the <nested_context_base> address is always machine-word aligned
Note: <nested_enable> = 0 means "h<xyz>" and "vs<xyz>" accesses trap to
HS-mode without any CSR access conversion

Various "h<xyz>" and "vs<xyz>" CSRs are accessed at <csr_nested_offset>
relative to <nested_context_base>, based on their CSR number, as follows:

CSR number 0x2xx:
  <csr_nested_offset> = 0x0000 + ((CSR_number & 0xff) * (XLEN / 8))
CSR number 0x6xx:
  <csr_nested_offset> = 0x1000 + ((CSR_number & 0xff) * (XLEN / 8))
CSR number 0xAxx:
  <csr_nested_offset> = 0x2000 + ((CSR_number & 0xff) * (XLEN / 8))
CSR number 0xExx:
  <csr_nested_offset> = 0x3000 + ((CSR_number & 0xff) * (XLEN / 8))
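
For illustration, here is a minimal sketch of the proposed HNESTED decoding and offset computation; XLEN is fixed to 64 for the example, and the function names are invented rather than taken from any spec.

#include <stdint.h>
#include <stddef.h>

#define XLEN 64                       /* RV64 assumed for this sketch */

/* <nested_enable> = HNESTED[0] */
static inline int nested_enable(uint64_t hnested)
{
    return hnested & 1;
}

/* <nested_context_base> = HNESTED[XLEN-1:1] << log2(XLEN/8), i.e. a
 * machine-word-aligned base address (shift by 3 on RV64). */
static inline uint64_t nested_context_base(uint64_t hnested)
{
    return (hnested >> 1) << 3;
}

/* Offset of a "h<xyz>"/"vs<xyz>" CSR within the nested context, using
 * the 0x2xx/0x6xx/0xAxx/0xExx blocks listed above.  Returns (size_t)-1
 * for CSR numbers outside those blocks. */
static inline size_t csr_nested_offset(unsigned csr_number)
{
    size_t block;

    switch (csr_number & 0xf00) {
    case 0x200: block = 0x0000; break;
    case 0x600: block = 0x1000; break;
    case 0xa00: block = 0x2000; break;
    case 0xe00: block = 0x3000; break;
    default:    return (size_t)-1;
    }
    return block + (csr_number & 0xff) * (XLEN / 8);
}

For example, a CSR in the 0x6xx block whose low byte is 0x80 would map to offset 0x1000 + 0x80 * 8 = 0x1400 on RV64.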

VS-mode accesses to some of the "h<xyz>" CSRs cannot be converted
into memory accesses due to the nature of these CSRs. These include the
HGEIP and HGEIE CSRs (any other CSRs?).

Accesses to the HNESTED CSR (described above) from VS-mode are also
converted into memory accesses when <nested_enable> = 1, because the
HNESTED CSR can be safely emulated using nested acceleration.

Best Regards,
Anup Patel


Re: Proposal for accelerating nested virtualization on RISC-V

Jonathan Behrens <behrensj@...>
 

Your description of un-accelerated nested virtualization seems workable to me. I'm less sure of the proposal to avoid trapping on h<xyz> and vs<xyz> accesses. Aren't you going to run into issues with any WARL CSR that has hardwired bits?

I'd like to point out another performance pitfall with trap-and-emulate that I've mentioned before but might not be obvious from reading your proposal: the illegal instruction traps triggered by the guest trying to use hypervisor CSRs or run hypervisor instructions will not trap directly to HS-mode. Rather they will be routed to M-mode and then get forwarded to HS-mode, which has about two times higher overhead (forwarding a trap is at least as expensive as emulating most instructions). It is also quite avoidable by adding a bit to let M-mode delegate traps from legal but privileged instructions executed in U/VS/VU modes.

Jonathan

