
Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

Anup,

Thanks.  Comments below.

Greg

On Mon, Aug 3, 2020 at 9:25 PM Anup Patel <Anup.Patel@...> wrote:

Hi Greg,

 

A few comments on your proposal (https://lists.riscv.org/g/tech-privileged/message/205):

 

1. BIT[31] is not required because we already have the MCOUNTINHIBIT CSR.


This is up in the air for inclusion or not in the proposal.  As solely a bit that software can set/clear to start/stop a counter, the argument for having this bit is weak.  Without it, though, SBI calls that write to the mhpmevent CSR for a counter would need some way to recognize when the associated bit in mcountinhibit needs to be set or cleared.  With this Active bit in mhpmevent itself, no special support is needed (i.e. writing event_info into the upper part of mhpmevent takes care of whatever bits are there).

The argument for this bit in mhpmevent grows when one allows for hardware setting and clearing of the bit.  For example, hardware could do so in response to a cross-trigger from the debug Trigger Module (e.g. to start counting when a certain instruction is executed and to stop counting when another address is executed), or in response to another counter overflowing after N occurrences of some event.  In essence, this supports counting more complex types of event conditions, particularly in debug scenarios and less so in straight perf mon scenarios.

Currently cross-trigger capabilities like these aren't standardized but, irrespective of whether they get standardized or not, having a standard Active bit provides the framework for a design to have whatever mechanisms it desires.  And note that hardware manipulation of mcountinhibit bits would be a change to the architectural definition of mcountinhibit.  This isn't a forcing issue, but having this Active bit in mhpmevent sidesteps that issue.

But even with all this, it is still up in the air whether people want or don't want to standardize this separate counter control bit as part of a counter extension.  We'll see where people fall on this.
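
To make the software-side argument concrete, here is a minimal Python sketch of the two SBI code paths. It is illustrative only: the `Hart` class, the function names, and the use of bit 31 for Active (taken from the BIT[31] numbering in the comment above, with 1 assumed to mean "counting") are assumptions, not part of any ratified spec.

```python
ACTIVE_BIT = 1 << 31  # assumed position and polarity of the proposed Active bit


class Hart:
    """Toy model of the CSRs involved; not real hardware state."""

    def __init__(self):
        self.mhpmevent = {}     # counter index -> event selector value
        self.mcountinhibit = 0  # bit n set => counter n does not count


def start_with_inhibit(hart, idx, event):
    # Without an Active bit, the SBI code has two steps: program the
    # event, then clear the counter's bit in mcountinhibit.
    hart.mhpmevent[idx] = event
    hart.mcountinhibit &= ~(1 << idx)


def start_with_active(hart, idx, event):
    # With an Active bit, the value written to mhpmevent already says
    # "run", so a single CSR write is enough.
    hart.mhpmevent[idx] = event | ACTIVE_BIT


def is_counting(hart, idx, uses_active_bit):
    if uses_active_bit:
        return bool(hart.mhpmevent.get(idx, 0) & ACTIVE_BIT)
    return not ((hart.mcountinhibit >> idx) & 1)
```

In the second path the SBI implementation never has to interpret the upper bits of the caller's value, which is the "no special support" point made above.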
 

2. BIT[28] contradicts the CSR number semantics of the HPMCOUNTER CSRs, because currently all HPMCOUNTER CSRs are “User-Read-Only”.


Good point.  To support this feature (which some others have also been requesting) will require defining an alias CSR for each hpmcounter CSR that is "User-Read-Write".

Having two User aliases of the same CSR is conceptually not pretty, but this is simple and seems like a necessary evil for supporting this feature.

Like above, we'll have to see if the interest in this feature is significant enough to warrant adding read/write hpmcounter aliases.
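
For readers less familiar with the convention behind this point: in the privileged spec's CSR address mapping, address bits [11:10] equal to 0b11 mark a register as read-only. The small Python sketch below checks that for hpmcounter3 (0xC03) and, for contrast, for fcsr (0x003), an existing user read/write CSR. The helper name is made up for illustration, and no address for the proposed aliases is implied.

```python
def csr_is_read_only(addr: int) -> bool:
    """True if the 12-bit CSR address falls in a read-only region.

    By convention, csr[11:10] == 0b11 means read-only; any other
    value means read/write.
    """
    return ((addr >> 10) & 0b11) == 0b11


HPMCOUNTER3 = 0xC03  # user-level counter, read-only by address
FCSR = 0x003         # user-level floating-point CSR, read/write

print(csr_is_read_only(HPMCOUNTER3), csr_is_read_only(FCSR))
```

A writable user-level alias would therefore have to live at an address whose top two bits are not 0b11, which is why a second CSR number per counter is needed.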
 

3. We need to align the “event_info” definition in the SBI PMU Extension to take into account your proposed bits in the MHPMEVENT CSRs.


In my mind event_info simply fills in all the higher bits of mhpmevent that are not written by event_idx - which I believe was to be the default code path in the SBI PMU code.  (This, of course, applies for future implementations that choose to organize their mhpmevent registers in this simple manner.  Implementations are free to organize their mhpmevent CSR differently and supply corresponding implementation-specific SBI code.)  In other words (for RV64):

mhpmevent[19:16] = event_idx.type
mhpmevent[15:0]  = event_idx.code
mhpmevent[63:20] = event_info[43:0]
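
The mapping above can be sketched in a few lines of Python. This is only an illustration of the RV64 layout as written in this email; the function name and the range checks are assumptions, and real SBI code would of course be implementation-specific.

```python
def pack_mhpmevent(event_type: int, event_code: int, event_info: int) -> int:
    """Build an RV64 mhpmevent value from the SBI PMU arguments.

    Layout (from the mapping above):
      [63:20] = event_info[43:0]
      [19:16] = event_idx.type
      [15:0]  = event_idx.code
    """
    if not 0 <= event_type < (1 << 4):
        raise ValueError("event_idx.type must fit in 4 bits")
    if not 0 <= event_code < (1 << 16):
        raise ValueError("event_idx.code must fit in 16 bits")
    if not 0 <= event_info < (1 << 44):
        raise ValueError("event_info must fit in 44 bits")
    return (event_info << 20) | (event_type << 16) | event_code


value = pack_mhpmevent(event_type=0x1, event_code=0x0002, event_info=0x3)
print(hex(value))
```

Because event_info lands directly in the upper bits, any control bits defined there (such as the ones discussed in items 1 and 2) are written without the SBI code needing to know their meaning.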

Greg
 

 

Regards,

Anup



Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

Email accidentally sent early.  Let me finish the email and then I'll send it again.

Greg



Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

Anup,

Thanks.  Comments below.

Greg

On Mon, Aug 3, 2020 at 9:25 PM Anup Patel <Anup.Patel@...> wrote:

Hi Greg,

 

A few comments on your proposal (https://lists.riscv.org/g/tech-privileged/message/205):

1. BIT[31] is not required because we already have the MCOUNTINHIBIT CSR.


This is up in the air for inclusion or not in the proposal.  As solely a bit that software can set/clear to start/stop a counter, the argument for having this bit is weak.  Without it, though, SBI calls that write to the mhpmevent CSR for a counter would need some way to recognize when the associated bit in mcountinhibit needs to be set or cleared.  With this bit in mhpmevent itself, no special support is needed (i.e. writing event_info into the upper part of mhpmevent takes care of whatever bits are there).

The argument for this bit in mhpmevent grows when one allows for hardware setting and clearing of the bit.  For example, hardware could do so in response to a cross-trigger from the debug Trigger Module (e.g. to start counting when a certain instruction is executed and to stop counting when another address is executed), or in response to another counter overflowing after N occurrences of some event.  Currently cross-trigger capabilities like this aren't standardized but, irrespective of whether they get standardized or not, having a standard Active bit provides the framework for a design to have whatever mechanisms it desires.


2. BIT[28] contradicts the CSR number semantics of the HPMCOUNTER CSRs, because currently all HPMCOUNTER CSRs are “User-Read-Only”.

3. We need to align the “event_info” definition in the SBI PMU Extension to take into account your proposed bits in the MHPMEVENT CSRs.

 

Regards,

Anup

 



Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Anup Patel
 

Hi Greg,

 

A few comments on your proposal (https://lists.riscv.org/g/tech-privileged/message/205):

1. BIT[31] is not required because we already have the MCOUNTINHIBIT CSR.

2. BIT[28] contradicts the CSR number semantics of the HPMCOUNTER CSRs, because currently all HPMCOUNTER CSRs are “User-Read-Only”.

3. We need to align the “event_info” definition in the SBI PMU Extension to take into account your proposed bits in the MHPMEVENT CSRs.

 

Regards,

Anup

 



CSR address for debug scontext and hcontext

Ernie Edgar
 

Hello,

Background:

You may be aware that the RISC-V Debug Specification 0.13 defines two CSRs, mcontext and scontext, that can be used to qualify hardware breakpoints in a particular OS process or thread.  A modified S-mode OS kernel writes the process ID to scontext when switching processes.  Breakpoint hardware can be set to trigger only when the process ID in scontext matches the desired process.

Using ASID instead of scontext to qualify breakpoints has been suggested. However, many systems do not implement ASID or only implement a narrow field, forcing the OS to recycle ASID values.  This makes ASID useless for breakpoint qualification.

For those familiar with ARM, the equivalent registers in that architecture are CONTEXTIDR_EL1 and CONTEXTIDR_EL2.

Problem:

Scontext is defined in the ratified Debug Spec at CSR 0x7aa which is in the "Machine Standard read/write debug CSR" region and so is, by convention, inaccessible from S-mode.

The Debug Spec was ratified before work on the hypervisor had gotten very far, so Debug Spec 0.13 does not provide full support for hypervisor-based systems.  Among the missing items is a definition for an "hcontext" register to qualify breakpoints in a particular virtual machine.  An argument could be made to use VMID for this, but the discussion above about ASID qualification would also apply to VMID.

Proposed Solution:

The Debug Task Group would like to suggest allocating a range of CSR addresses in one of the Supervisor Standard read/write regions and in one of the Hypervisor Standard read/write regions to use for debug registers.  Our suggestion is 0x5A0-0x5AF for S-mode and 0x6A0-0x6AF for HS-mode, complementing the 0x7A0-0x7AF already defined for M-mode debug registers.  Allocating more than just one address gives the Debug TG flexibility for the future.
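
As a quick check of the address argument, the Python sketch below decodes the privilege field of a CSR address. It relies on the privileged spec's convention that address bits [9:8] encode the lowest privilege level allowed to access the CSR; the helper and table names are illustrative only.

```python
# csr[9:8] -> lowest privilege level that may access the CSR
PRIVILEGE = {0: "user", 1: "supervisor", 2: "hypervisor", 3: "machine"}


def csr_lowest_privilege(addr: int) -> str:
    """Return the lowest privilege level implied by a 12-bit CSR address."""
    return PRIVILEGE[(addr >> 8) & 0b11]


for name, addr in [("scontext (Debug Spec 0.13)", 0x7AA),
                   ("proposed S-mode debug range", 0x5A0),
                   ("proposed HS-mode debug range", 0x6A0)]:
    print(f"{name}: 0x{addr:03X} -> {csr_lowest_privilege(addr)}")
```

The current scontext address decodes as machine-level, which is the problem described above, while the two suggested ranges decode as supervisor-level and hypervisor-level respectively.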

Thanks,
Ernie Edgar
RISC-V Debug Task Group



Re: Proposal for Custom Values in satp

Bill Huffman
 


On 8/3/20 7:01 AM, Jonathan Behrens wrote:

>For RV32, values with satp[31] clear and satp[30:0] non-zero are reserved.  I propose that values with satp[31] clear and satp[30:29]=0x3 be defined as Custom.

Are 7 bits enough for ASID?


It depends on how you are using them. For x86-64, Linux actually only uses 4 ASID bits (out of the 12 available) because it assigns them per-core and recycles them aggressively. However, if you instead try to have globally unique ASIDs then you might need far more than 7 bits.

Jonathan

Agreed, but the proposal doesn't assume that a custom implementation will use the bits of satp in the same way the priv spec uses them.  The bits may well be used in a different fashion.

Of course, as with instruction opcodes, an implementation is also free to use reserved encodings if the implementers are willing to risk a conflict with future standard extensions.

      Bill



Re: Proposal for Custom Values in satp

Jonathan Behrens <behrensj@...>
 

>For RV32, values with satp[31] clear and satp[30:0] non-zero are reserved.  I propose that values with satp[31] clear and satp[30:29]=0x3 be defined as Custom.

Are 7 bits enough for ASID?


It depends on how you are using them. For x86-64, Linux actually only uses 4 ASID bits (out of the 12 available) because it assigns them per-core and recycles them aggressively. However, if you instead try to have globally unique ASIDs then you might need far more than 7 bits.

Jonathan


Re: Proposal for Custom Values in satp

Andrea Mondelli
 

>For RV32, values with satp[31] clear and satp[30:0] non-zero are reserved.  I propose that values with satp[31] clear and satp[30:29]=0x3 be defined as Custom.

Are 7 bits enough for ASID?

 


Re: Proposal for Custom Values in satp

Bill Huffman
 

Looks appropriate to me.     Bill




Re: Proposal for Custom Values in satp

Andrew Waterman
 

I've written the idea up so it doesn't get lost, but others should still feel free to comment.  In the meantime, can you sanity-check my patch?  https://github.com/riscv/riscv-isa-manual/commit/f7710a02da497a721095a0252041122a6d0e0a6c






Re: Proposal for Custom Values in satp

Allen Baum
 

That sounds like a no brainer good idea.

-Allen




Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

Alan,

I'm fine with taking the lead on this architecture extension.  But it should follow a proper process as directed by the TSC.  Thus far this would mean getting a new TG created or doing something less formal under an existing TG.  But for smaller extension proposals like this there is a need for a proper lighter-weight and faster process.  The need for this is recognized, and I suspect such a process will be promulgated by the TSC some time soon.

So I suggest we pause for a short bit, and then see if we can follow that expedited process once it is available.  In the meantime I/we can prepare what we can in advance.  (I don't think this will represent a material slowdown in getting to a frozen spec and then to ratification.)

Greg




Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

alankao
 

Hi all,

Although some discussions remain unresolved, they have little to do with what we should do for the next step.  I believe Greg's proposal is superior to the original one in the starting thread because:

1.  It reuses `hpmevents` for most of the functions that we all agree RISC-V needs, instead of adding a bunch of new registers.
2.  It is H-ext-aware.

I suggest Greg take the lead and start a PR in the ISA repo; I can help review and evaluate the effort to patch existing software.

Thanks,
Alan


Re: Proposal for Custom Values in satp

Andrew Waterman
 

I support this proposal.





Proposal for Custom Values in satp

Bill Huffman
 

The satp register has reserved values.  Some implementers will, no doubt, want to define non-standard behavior based on satp.  I would like to propose that we define some of the reserved values in satp as Custom now so that those who do so won't head in diverging directions.

For RV64, there are 11 reserved values of the Mode field.  I propose that the encodings 14 and 15 be defined as Custom.

For RV32, values with satp[31] clear and satp[30:0] non-zero are reserved.  I propose that values with satp[31] clear and satp[30:29]=0x3 be defined as Custom.

The RV32 encoding is not perfect.  To have the largest contiguous space free, we need to pick bits at the top or bottom.  Top bits encroach on what's now ASID space while bottom bits encroach on what's now PPN space.  The former seemed a bit better.
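
The two proposed encodings can be written down as a short Python sketch. It assumes the standard satp layouts (RV64: MODE in bits [63:60]; RV32: MODE in bit 31, ASID in bits [30:22], PPN in bits [21:0]); the function names are illustrative only.

```python
def satp_is_custom_rv64(satp: int) -> bool:
    """RV64: MODE is satp[63:60]; encodings 14 and 15 would be Custom."""
    mode = (satp >> 60) & 0xF
    return mode in (14, 15)


def satp_is_custom_rv32(satp: int) -> bool:
    """RV32: Custom when satp[31] is clear and satp[30:29] == 0b11."""
    mode = (satp >> 31) & 1
    top_two = (satp >> 29) & 0b11
    return mode == 0 and top_two == 0b11


print(satp_is_custom_rv64(14 << 60))    # MODE = 14
print(satp_is_custom_rv32(0x6000_0000)) # bit 31 clear, bits 30:29 set
print(satp_is_custom_rv32(0xE000_0000)) # bit 31 set, so a standard mode
```

On RV32, bits [30:29] are the top two bits of the 9-bit ASID field, which is what "encroaching on ASID space" refers to: if a custom scheme kept the standard field layout, 7 ASID bits would remain.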

       Bill



Re: RISC-V Hypervisor Updates

Andrew Waterman
 

Thanks for the update, Anup!






RISC-V H-Extension Nested MMU Test-suite

Anup Patel
 

Hi All,

We now have a simple Nested MMU (i.e. Two-stage MMU) test-suite
available as part of the Xvisor white-box testing framework. This test-suite
runs in HS-mode and does nested MMU testing using the HSV/HLV
instructions. This means Nested MMU (i.e. Two-stage MMU) testing
is achieved without creating any Guest/VM on Xvisor.

To run the Xvisor nested MMU test-suite we need only the OpenSBI
firmware and the Xvisor binary. Currently, the Xvisor nested MMU test-suite
works on both QEMU and Spike.

Refer to the following READMEs to try the Xvisor nested MMU test-suite:
https://github.com/avpatel/xvisor-next/blob/master/docs/riscv/riscv64-nested-mmu-test-qemu.txt
https://github.com/avpatel/xvisor-next/blob/master/docs/riscv/riscv64-nested-mmu-test-spike.txt
https://github.com/avpatel/xvisor-next/blob/master/docs/riscv/riscv32-nested-mmu-test-qemu.txt

In the future, the Xvisor Nested MMU test-suite will also help in
implementing nested hypervisors.

Regards,
Anup


RISC-V Hypervisor Updates

Anup Patel
 

Hi All,

We have updated Spike, QEMU RISC-V, KVM RISC-V and Xvisor RISC-V for
the RISC-V H-Extension v0.6.1 spec.

QEMU RISC-V is our default development vehicle for RISC-V hypervisor
software (because it is quite fast), whereas Spike can be quite useful to CPU
designers/architects for experimenting and generating instruction traces
of RISC-V hypervisors.

The QEMU repo with RISC-V H-Extension v0.6.1 support can be found here:
https://github.com/kvm-riscv/qemu.git

To try KVM RISC-V, refer to:
https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-QEMU
https://github.com/kvm-riscv/howto/wiki/KVM-RISCV64-on-Spike
https://github.com/kvm-riscv/linux.git
https://github.com/kvm-riscv/kvmtool.git

To try Xvisor RISC-V, refer to:
https://github.com/avpatel/xvisor-next/blob/master/docs/riscv/riscv64-qemu.txt
https://github.com/avpatel/xvisor-next/blob/master/docs/riscv/riscv64-spike.txt
https://github.com/avpatel/xvisor-next.git

Regards,
Anup


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Andy Glew Si5
 

You seem to be missing the whole point: the x86 PerfMon/EMON event filtering is generic.
> For us, the low 8 bits select between one of 256 "types" of events
Ditto for Intel: in fact, x86 PERFEVTSEL has precisely an 8-bit field to select the type of event. That may have increased, although it still seems to be only eight bits in the current manuals that I just downloaded.


> the upper 8 bits provide pretty flexible event-specific filtering
Similarly, Intel PERFEVTSEL has an 8-bit UMASK for event-specific filtering.
 

However, there are further fields that define generic event filtering: filters that you get for free, without having to design them on a per-"event type" basis.


(If you care about implementation, the UMASK field stands for "Unit Mask" and is propagated to whatever hardware unit is actually performing the measurement.  The other filter bits live at the performance counter, and therefore apply to all events.)




The CMASK comparison is applicable to, and relevant to, any event that can increment by more than one per clock cycle.

The E edge trigger is relevant to, and applicable to, any event that occurs in bursts of back-to-back events in adjacent clock cycles.
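
The following Python sketch is a simplified software model of these two generic filters, applied to a made-up trace of per-cycle event counts. The function names and the trace are assumptions for illustration; real PERFEVTSEL registers have additional fields (for example an invert bit) that are not modeled here.

```python
def count_with_cmask(per_cycle, cmask):
    """Model of the CMASK threshold filter.

    With cmask == 0 the counter adds the raw per-cycle increment.
    With cmask > 0 it adds 1 for every cycle whose increment is at
    least cmask, turning an event count into a cycle count.
    """
    if cmask == 0:
        return sum(per_cycle)
    return sum(1 for n in per_cycle if n >= cmask)


def count_edges(per_cycle, cmask=1):
    """Model of the edge-detect filter: count bursts, not cycles.

    A burst starts in a cycle that satisfies the threshold when the
    previous cycle did not.
    """
    edges = 0
    previous = False
    for n in per_cycle:
        current = n >= cmask
        if current and not previous:
            edges += 1
        previous = current
    return edges


# Hypothetical trace: events observed in each of 8 clock cycles.
trace = [0, 2, 3, 0, 1, 1, 0, 4]
print(count_with_cmask(trace, 0))  # total events
print(count_with_cmask(trace, 2))  # cycles with at least 2 events
print(count_edges(trace))          # number of bursts
```

Neither function needs to know which event produced the trace, which is the sense in which these filters are generic.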

> That has turned out to very nicely support a very large variety of events in a very manageable way hardware-wise - in contrast to having many hundreds (or more) individual events.
Ditto with Intel.  A small number of events, filtered and transformed in several different ways.


The main difference is that all of your filtering is event-specific, and therefore you can't write portable code that takes advantage of it, whereas most of the Intel filtering is generic, so you can write portable code that takes advantage of it.


There is, of course, some event-specific filtering.  I also observed patterns in that event-specific filtering that I think would be quite usefully standardized, like the part about masking different port combinations.  However, that is not quite as generic as a comparison threshold that applies to every event that can increment by more than one in a clock cycle.

The "push-out" profiling feature could also be generic, counting the length of intervals in which no event occurs, for any event.  I did not do this in P6 because push-out profiling requires an extra counter, even if it's only a few bits, like six.
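
As a rough software model of that sentence, the Python sketch below measures the lengths of event-free intervals in a made-up per-cycle trace. The function name and trace are hypothetical; the `run` variable plays the role of the small extra counter mentioned above.

```python
def idle_intervals(per_cycle):
    """Return the lengths of maximal runs of cycles with no event."""
    gaps = []
    run = 0  # extra counter: cycles since the last event
    for n in per_cycle:
        if n == 0:
            run += 1
        else:
            if run:
                gaps.append(run)
            run = 0
    if run:
        gaps.append(run)  # trace ended inside an idle interval
    return gaps


# Hypothetical trace of per-cycle event counts.
trace = [1, 0, 0, 0, 1, 0, 1, 0, 0]
print(idle_intervals(trace))
```

Like the threshold and edge filters, this works on any event stream without knowing what the event is.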

Similarly, you could tweak the edge-detect filter to smooth over a few clock cycles.  Again, that would be generic.

From: Greg Favor <gfavor@...>
Sent: Tuesday, July 21, 2020 6:16PM
To: Andy Glew <andy.glew@...>
Cc: Alankao <alankao@...>, Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/21/2020 6:16 PM, Greg Favor wrote:
A general comment about all the fancy forms of event filtering that can be nice to have:

The most basic one of general applicability is mode-specific filtering.  Past that one could try to define some further general filtering capabilities that aren't too specialized, but one quickly gets into having interesting filtering features specific to an event or class of events.

We take the view (in our design) that the lower 16 bits of mhpmevent are used for event selection in the broad sense of the word.  For us, the low 8 bits select between one of 256 "types" of events and then the upper 8 bits provide pretty flexible event-specific filtering.  That has turned out to very nicely support a very large variety of events in a very manageable way hardware-wise - in contrast to having many hundreds (or more) individual events.  But that is just our own implementation of event selection.
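
As a rough illustration of the split described above, the sketch below decodes the low 16 bits of an mhpmevent value into an 8-bit event type and an 8-bit event-specific filter. The function and field names are hypothetical; only the 8/8 split comes from the message, and the exact bit positions are an assumption.

```python
def decode_event_select(mhpmevent: int) -> tuple[int, int]:
    """Split the low 16 bits of mhpmevent into (event_type, event_filter).

    Assumed layout (illustrative only): bits 7:0 select one of 256
    event types, bits 15:8 carry event-specific filter bits.
    """
    event_type = mhpmevent & 0xFF
    event_filter = (mhpmevent >> 8) & 0xFF
    return event_type, event_filter


# Example: type 0x34 with filter bits 0x12; upper bits are ignored here.
print(decode_event_select(0x1234))  # (52, 18)
```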

My point is that someone else can do similar or not so similar things in their own design with whatever types of general or event-specific filtering features that they might desire.  Trying to standardize that stuff can be tricky to say the least.

For now, at least, I think we should just let people decide what events and event filtering they feel are valuable in their designs.  We should only try to standardize any filtering features that are broadly applicable and valuable to have.  For myself, that results in just proposing mode-specific event filtering.

Greg


On Tue, Jul 21, 2020 at 3:53 PM Andy Glew Si5 <andy.glew@...> wrote:
I have NOT been working on a RISC-V performance monitoring proposal, but I've been involved with performance monitoring, first as a user and then as an architect, for many years and at several companies.

I would like to draw this group's attention to some features of Intel x86 performance monitoring  that turned out pretty successful.

First, you're already talking about hardware performance monitoring interrupts for statistical profiling. Good. A few comments on that below.

---+ Performance event filtering and transformation logic before counting

Next, I think one of the best bang for buck performance monitoring features of  x86 EMON performance monitoring is the performance counter event filtering. RISC-V has only the most primitive version of this.

Every x86 performance counter has per counter event select logic.

In addition to that logic, there is a mask that specifies what modes to count in - User, OS, hypervisor. I see that some of the RISC-V proposals also have that. Good.

But more filtering is also provided:  

Each counter has a "Counter Mask" CMASK - really, a threshold. When non-zero, this is compared to the count of the selected event in any given cycle. If >= CMASK, the counter is incremented by 1; if less, no increment.

=> This comparison allows a HW event to be used to profile things like "number of cycles in which 1, 2 or more, 3 or more ... events happened", e.g. how often you are able to achieve superscalar execution. In vectors, it might count how many vector elements are masked in or out. If you have events that correspond to buffer occupancy, you can profile to see where the buffer is full or not.

INV - a bit that allows the CMASK comparison to be inverted.

=> so that you can count event > threshold, and event < threshold.
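
A minimal sketch of the CMASK/INV behavior described above, written as a software model rather than hardware. The function names are made up for illustration, and the treatment of INV when CMASK is zero (ignored here) is an assumption, not something stated in this thread.

```python
def filtered_increment(events_this_cycle: int, cmask: int, inv: bool = False) -> int:
    """Return how much the counter advances in one cycle."""
    if cmask == 0:
        # Threshold disabled: count the raw events (INV ignored; assumption).
        return events_this_cycle
    meets_threshold = events_this_cycle >= cmask
    if inv:
        # INV flips the comparison: count cycles below the threshold.
        meets_threshold = not meets_threshold
    return 1 if meets_threshold else 0


def count_over_trace(per_cycle_events, cmask: int, inv: bool = False) -> int:
    """Accumulate the filtered count over a per-cycle event trace."""
    return sum(filtered_increment(n, cmask, inv) for n in per_cycle_events)


retired = [0, 1, 3, 2, 0, 4]                      # instructions retired per cycle
print(count_over_trace(retired, cmask=0))            # 10: raw event count
print(count_over_trace(retired, cmask=2))            # 3: cycles retiring 2 or more
print(count_over_trace(retired, cmask=1, inv=True))  # 2: cycles retiring nothing
```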

I would really have liked to have >, <, and == threshold, and also the ability to increment by 1 if exceeding the threshold, or by the actual count that exceeds the threshold. The former allows you to find where you are getting good superscalar behavior or not; the latter allows you to determine the average when exceeding the threshold or not. When I defined this I had to save hardware.

This masked comparison allows you to get more different types of events, for events that occur more than once per cycle. That's pretty good, but it doesn't help you with scarce events, events that only occur once every four or eight or N cycles. Later, Intel added what I call "push-out" profiling: when the comparison condition is met, e.g. when no instruction retires, a counter that increments by one every clock cycle starts; when the condition changes, the value of that counter is what is recorded, and is naturally subject to all of the usual filtering. That was too much hardware for me to add in 1991, but it proved very useful.
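
The push-out idea can be sketched in software as measuring run lengths of a condition. This is only a toy model under my own assumptions (a boolean condition per cycle, and a run that is still open at the end of the trace is also recorded); the function name is hypothetical.

```python
def pushout_intervals(condition_per_cycle):
    """Return the length of each run of consecutive cycles in which the condition holds."""
    lengths = []
    run = 0
    for met in condition_per_cycle:
        if met:
            run += 1             # interval counter ticks once per cycle
        elif run:
            lengths.append(run)  # condition changed: record the interval, reset
            run = 0
    if run:
        lengths.append(run)      # trace ended while the condition still held
    return lengths


retired = [1, 0, 0, 0, 2, 0, 1, 0, 0]
no_retire = [n == 0 for n in retired]
print(pushout_intervals(no_retire))  # [3, 1, 2]
```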

My first instinct is always to minimize hardware cost for performance monitoring hardware. The nice thing about the filtering logic at the performance counters is that it removed the hardware cost from the individual unit, like the cache, and left it in the performance monitoring unit. (The question of centralized versus decentralized performance counters is always an issue. Suffice it to say that Intel P6 had centralized performance counters, to save hardware; Pentium 4 went to a fully distributed performance counter architecture, but users hated it, so Intel returned to the centralized model, at least architecturally, although microarchitectural implementations might be decentralized.)

More filtering logic: each counter has an E (edge select) bit. This counts when the condition described by the privilege level mask and CMASK comparison changes. Using such edge filtering, you can determine the average length of bursts, e.g. the average length of a period in which you have not been able to execute any instruction, and so on. Simple filters can give you average lengths and occupancies; fancier and more expensive stuff is necessary to actually determine a distribution.
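
To make the "average length of bursts" point concrete: one counter counts the cycles in which the condition holds, a second counter with edge select counts how many times the condition starts, and the ratio is the average burst length. The sketch below assumes the edge filter counts rising edges only, which is my simplification; the function name is hypothetical.

```python
def average_burst_length(condition_per_cycle) -> float:
    """Average run length of a condition = active cycles / rising edges."""
    active_cycles = 0   # what a plain (E=0) counter would accumulate
    rising_edges = 0    # what an edge-select (E=1) counter would accumulate
    previous = False
    for met in condition_per_cycle:
        met = bool(met)
        if met:
            active_cycles += 1
            if not previous:
                rising_edges += 1
        previous = met
    return active_cycles / rising_edges if rising_edges else 0.0


stalled = [0, 1, 1, 1, 0, 1, 0, 1, 1]
print(average_burst_length(stalled))  # 2.0 (6 stall cycles in 3 bursts)
```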

Certain events themselves are sent to the performance counters as bitmasks. E.g. the utilization of the execution unit ports as a bitmask - on the original P6 { ALU0/MUL/DIV/FMUL/FADD, ALU1/LEA, LD, STA, STD }, fancier on modern machines. By controlling the UMASK field of the filter control logic for each performance counter, you could specify to count all instruction dispatches, or just loads, and so on. Changing the UMASK field allowed you to profile to find out which parts of the machine were being used and which not. (This proved successful enough to get me and the software guy who eventually started using it an achievement award.)
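
A small sketch of UMASK-style selection over a port bitmask. The bit assignments below are invented for illustration (they are not the real P6 encoding), and the model counts a cycle once if any selected port dispatched, which is an assumption about the counting rule.

```python
# Hypothetical bit positions, following the port order listed above.
PORT_ALU0 = 1 << 0   # ALU0/MUL/DIV/FMUL/FADD
PORT_ALU1 = 1 << 1   # ALU1/LEA
PORT_LD = 1 << 2     # load
PORT_STA = 1 << 3    # store address
PORT_STD = 1 << 4    # store data
ALL_PORTS = PORT_ALU0 | PORT_ALU1 | PORT_LD | PORT_STA | PORT_STD


def count_dispatch_cycles(port_trace, umask: int) -> int:
    """Count cycles in which at least one port selected by umask dispatched."""
    return sum(1 for ports_used in port_trace if ports_used & umask)


trace = [PORT_ALU0 | PORT_LD, PORT_LD, PORT_STA | PORT_STD, 0, PORT_ALU1 | PORT_LD]
print(count_dispatch_cycles(trace, ALL_PORTS))            # 4: any dispatch
print(count_dispatch_cycles(trace, PORT_LD))              # 3: loads only
print(count_dispatch_cycles(trace, PORT_STA | PORT_STD))  # 1: stores only
```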

If I were to do it over again I would have a generic POPCNT as part of the filter logic, as well as the comparison.

Finally, simple filter stuff:

INT - Performance interrupt enable

PC - pin control - this predated me: it toggled an external pin when the performance counter overflowed.  The very first EMON event sampling took that external pin and wired it back to the NMI pin of the CPU.  Obviously, it is better to have internal logic for performance monitoring interrupts. Nevertheless, there is still a need for externally visible performance event sampling, e.g. freeze performance events outside the CPU, in the fabric, or in I/O devices.  Exactly what those are is obviously implementation dependent, but it's still good to have a standard way of controlling such implementation dependent features. I call this the "pin architecture", and IMHO maintaining such system compatibility was as much a factor in Intel's success as instruction set architecture.

---+ Performance counter freeze

There are always several performance counters. At least two per privilege level. At least a pair, so you can compute things like cache miss rates and other ratios. But not necessarily dedicated to any specific privilege level, because that would be wasteful: you can study things a hell of a lot more quickly if you can use all of the performance counters, when other modes are not using them.

When you have several performance counters, you are often measuring things together. You therefore need the ability to freeze them all at the same time. This means that you need to have all of the enable bits for all of the counters, or at least a subset, in the same CSR. If you want, the enable bit can be in both the per counter control registers and in a central CSR - i.e. there can be multiple views of the same bit in different CSRs.
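
The "multiple views of the same bit" idea can be modeled as one piece of shared state reachable through two access paths. This is a toy software model with hypothetical names, not a description of any existing CSR layout.

```python
class CounterBank:
    """Toy model: one enable bit per counter, visible through two 'registers'."""

    def __init__(self, num_counters: int):
        self.counts = [0] * num_counters
        self._enable = 0  # bit i set => counter i is counting (single storage)

    # Central view: all enable bits in one register, so one write freezes everything.
    def write_global_enable(self, value: int) -> None:
        self._enable = value & ((1 << len(self.counts)) - 1)

    def read_global_enable(self) -> int:
        return self._enable

    # Per-counter view: the same bit, seen through the counter's own control register.
    def write_counter_enable(self, index: int, bit: int) -> None:
        if bit:
            self._enable |= 1 << index
        else:
            self._enable &= ~(1 << index)

    def read_counter_enable(self, index: int) -> int:
        return (self._enable >> index) & 1

    def tick(self, events_per_counter) -> None:
        """Advance one cycle; only enabled counters accumulate."""
        for i, n in enumerate(events_per_counter):
            if self.read_counter_enable(i):
                self.counts[i] += n


bank = CounterBank(4)
bank.write_counter_enable(0, 1)
bank.write_counter_enable(2, 1)
bank.tick([1, 1, 1, 1])
bank.write_global_enable(0)   # freeze all counters with a single write
bank.tick([1, 1, 1, 1])
print(bank.counts)            # [1, 0, 1, 0]
```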

Performance analysts would really like the ability to freeze multiple CPUs' performance counters at the same time. This is one motivation for that pin control signal.

---+ Precise performance monitoring interrupts - You can only wish!

When you are doing performance counter event interrupt based sampling, it would be really nice if the interrupt occurred exactly at the instruction that had the event.

If you can do that, great. However, it takes extra hardware to do that. Also, some events do not in any way correspond to a retired instruction - think events that occur on speculative instructions that never graduate/retire. Again, you can create special registers that record, say, the program counter of such a speculative instruction, but that is again extra hardware.

IMHO there is zero chance that all implementations, particularly the lowest cost implementations, will make all performance events precise.

At the very least there should be a way of discovering whether an event is precise or not.   

Machine check architectures have the same issue.


---+ What should the priority of the hardware performance monitoring interrupts be?

One of the best things I did for Intel was punt on this issue: because I was also the architect in charge of the APIC interrupt controller, I provided a special LVT interrupt register just for the performance monitoring interrupt.

This allowed the performance monitoring interrupt to use all of the APIC features, all of those that made sense. For example, the performance monitoring interrupt that Linux uses is just a normal interrupt of some priority. But as I mentioned above, the very first usage of performance monitoring interrupts used NMI, and was therefore able to profile code that had interrupts blocked. The interrupt could be directed to SMM, the not very good baby virtual machine monitor, allowing performance monitoring to be done independent of the operating system. Very nice, when you don't have source code for the operating system. And so on. I can't remember, but it is possible that the interrupt could be directed to processors other than the local processor. However, that would not subsume the externally visible pin control, because the hardware pin can be a lot less expensive and a lot more precise than signaling an inter-processor interrupt.

I used a similar approach for machine check interrupts, which could also be directed to the operating system, NMI, SMM, hypervisor,…

By the way: I think Greg Favor said that x86's performance monitoring interrupts are level sensitive. That is not strictly speaking true: whether they are level sensitive or not is programmed into the APIC local vector table. You can make them level sensitive or edge triggered.

Obviously, however, when there are multiple performance counters sharing the same interrupt, you need to know which counter overflowed. Hence the sticky bits that Greg noticed in the manuals.
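
A minimal sketch of how sticky overflow bits let a shared interrupt handler find out which counters overflowed. The class and method names are hypothetical and do not correspond to any particular x86 or RISC-V register.

```python
class OverflowStatus:
    """Toy model of a sticky per-counter overflow status register."""

    def __init__(self):
        self.sticky = 0

    def record_overflow(self, counter_index: int) -> None:
        # Hardware side: set the bit and leave it set until software clears it.
        self.sticky |= 1 << counter_index

    def pending(self):
        # Handler side: list every counter whose overflow has not been acknowledged.
        return [i for i in range(self.sticky.bit_length()) if (self.sticky >> i) & 1]

    def clear(self, counter_index: int) -> None:
        self.sticky &= ~(1 << counter_index)


status = OverflowStatus()
status.record_overflow(1)
status.record_overflow(3)
print(status.pending())   # [1, 3]: two counters raised the shared interrupt
status.clear(1)
print(status.pending())   # [3]
```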


---+ Fancier stuff

The above is mostly about getting the most out of simple performance counters: providing filter logic so that you can get the most insight out of the limited number of events; providing enables in a central place so that you can freeze multiple counters at the same time; allowing the performance counter interrupts to be directed not just to different privilege levels but at different interrupt priorities including NMI, and possibly also to external hardware.

There's a lot more stuff that can be done to help performance monitoring. Unfortunately, I have always worked at a place where I had to reduce  the performance monitoring hardware cost as much as possible.   I am sure, however, that many of you are familiar with fancier performance monitoring features, such as

+ Histogram counting (allowing you to count distributions without making multiple runs)
    => the CMASK comparators allow a very simple form of this, assuming you have enough performance counters. Actual histogram counters can do this more cheaply.

+ Cycle attribution - defining performance events so that you can actually say things like "X% of cycles are spent waiting for memory".
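
On the histogram item above: with K counters programmed with thresholds 1..K on the same event, each counter reports "cycles with at least k events", and differencing adjacent counters recovers the distribution. The sketch below shows only that arithmetic; the function name is hypothetical.

```python
def histogram_from_thresholds(threshold_counts):
    """Convert 'cycles with >= k events' counts (k = 1..K) into per-bin counts.

    Result[k-1] is the number of cycles with exactly k events, except the
    last bin, which means 'K or more' because no higher threshold was measured.
    """
    exact = []
    for k, at_least_k in enumerate(threshold_counts):
        has_next = k + 1 < len(threshold_counts)
        at_least_next = threshold_counts[k + 1] if has_next else 0
        exact.append(at_least_k - at_least_next)
    return exact


# Per-cycle events [0, 1, 3, 2, 0, 4, 1] give: >=1 in 5 cycles, >=2 in 3, >=3 in 2.
print(histogram_from_thresholds([5, 3, 2]))  # [2, 1, 2]
```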

IMHO the single most important "advanced" performance monitoring feature is what I call "longitudinal profiling": AMD Instruction Based Sampling (IBS), DEC ProfileMe, ARM SPE (Statistical Profiling Extension). The basic idea is to set a bit on some randomly selected instruction packet somewhere high up in the pipeline, e.g. at instruction fetch, and then let that bit flow down the pipeline, sampling things as it goes. E.g. you might sample the cache miss latency or address, or whether it produced a stall in interaction with a different marked instruction. This sort of profiling is quite expensive, e.g. requiring a bit in many places in the pipeline, as well as registers to record the sample data, but it provides a lot of insight: it can give you distributions and averages, and it can tell you what interactions between instructions are causing problems.

However, if RISC-V cannot yet afford to do longitudinal profiling, the performance counter filter logic that I described above is low hanging fruit, much cheaper.




From: Alankao <alankao@...>
Sent: Monday, July 20, 2020 5:43PM
To: Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

On 7/20/2020 5:43 PM, alankao wrote:
It was just so poorly rendered in my mail client, so please forgive my spam.

Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies
> each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one by Greg in the SBI PMU extension threads in the Unix Platform Spec TG. I have a bunch of questions. What became of your proposal? Was it discussed in public? Did you manage to implement your idea as a working HW/S-mode SW/U-mode SW solution? If so, we can compete with each other by benchmarking the LoC of the perf patch (assuming you do it on Linux) and the system overhead of running a long perf sample.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a
> bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via
> a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different
> register.

I have no bias here as long as the HPM interrupt can be triggered. But it seems to me that you assume the HPM registers are XLEN wide, whereas actually they are not (yet?). The spec says they should be 64 bits wide, although obviously nobody implements or remembers that.
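
For readers following the overflow question, here is a small sketch of the first option (overflow on the 0x7fff...f -> 0x8000...0 transition) for a 64-bit counter. The helper names and the preload convention are assumptions for illustration, not part of either proposal.

```python
WIDTH = 64
MSB = 1 << (WIDTH - 1)
MASK = (1 << WIDTH) - 1


def step(counter: int, increment: int):
    """Advance the counter; report overflow when the top bit goes from 0 to 1."""
    new_value = (counter + increment) & MASK
    overflowed = (counter & MSB) == 0 and (new_value & MSB) != 0
    return new_value, overflowed


def preload_for_period(period: int) -> int:
    """Initial value so that overflow is signaled after `period` events."""
    return MSB - period


value = preload_for_period(1000)
value, ovf = step(value, 999)
print(ovf)                  # False: one event short of the sampling period
value, ovf = step(value, 1)
print(ovf)                  # True: crossed 0x7fff...f -> 0x8000...0
value, ovf = step(value, 5)
print(value & (MSB - 1))    # 5: events since overflow, read from the same register
```

The last line reflects the convenience argued for in the quoted text: after the overflow the register still holds a meaningful unsigned count that a single read can recover.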

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in
> the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its
> purpose.

Which architecture/OS are you referring to here? 

Through this discussion, we will understand which idea the community prefers: adding CSRs, overloading the existing hpmevents, or some balanced compromise. I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.

Best,
Alan

I apologize for some of the language errors that occur far too frequently in my email. I use speech recognition much of the time, and far too often do not catch misrecognition errors. This can be quite embarrassing, amusing, and/or confusing. Typical errors are not spelling but homonyms, words that sound the same - e.g. "cash" instead of "cache".
