A proposal to enhance RISC-V HPM (Hardware Performance Monitor)


alankao
 

Hi all,

This proposal is a refinement and update of a previous thread: https://lists.riscv.org/g/tech-privileged-archive/message/488.  We noticed the current activity regarding HPM in the Linux community and in the Unix Platform Spec Task Group, so we would again like to call attention to the fundamental problems of RISC-V HPM.  In brief, we plan to add 6 new CSRs to address these problems.  Please check https://github.com/NonerKao/PMU_enhancement_proposal/blob/master/proposal.md for a simple perf example and the details.

Thanks!


Brian Grayson
 

I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.

In your proposal, the control for counter3, for example, is in mcounteren (for interrupt enablement), mcounterovf (for overflow status), mcountermask (for which modes to count), and of course mhpmevent3. In my proposal, all of these would be contained in the existing per-counter configuration register, mhpmevent*. There is not much of a difference between these proposals, except that in your proposal, programming a counter may require read-modify-write'ing 5 registers (those 4 control plus the counter itself), whereas in my proposal it requires writing (and not read-modify-write'ing) only two registers. In practice, programming 4 counters would require 20 RMW's (20 csrr's and 20 csrw's and some glue ALU instructions in the general case -- some of those would be avoidable of course) in your proposal vs. 8 simple writes in mine -- four to set the event config, four to set/clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software-under-test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, and not one of functionality.
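The access counts quoted above can be reproduced with a small model. This is only a sketch of the worst case described in the paragraph: the register names come from the thread, while the function names are hypothetical and nothing here touches real CSRs.

```python
# Hypothetical access-count model; not real CSR code.
# Shared control CSRs named in the thread, each holding one bit or field per counter.
SHARED_CSRS = ["mcounteren", "mcounterovf", "mcountermask"]

def accesses_shared_layout(num_counters):
    """Worst case: every register touched is read, modified, and written back."""
    regs_per_counter = len(SHARED_CSRS) + 2  # plus mhpmeventN and the counter itself
    reads = writes = num_counters * regs_per_counter
    return reads, writes

def accesses_per_counter_layout(num_counters):
    """All control lives in mhpmeventN, so plain writes suffice."""
    reads = 0
    writes = num_counters * 2  # mhpmeventN plus mhpmcounterN
    return reads, writes

print(accesses_shared_layout(4))       # (20, 20)
print(accesses_per_counter_layout(4))  # (0, 8)
```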

From my count, roughly 8 or so bits would be needed in mhpmevent* to specify interrupt enable, masking mode, et al. These could be allocated from the top bits, allowing 32+ bits for ordinary event specification.

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.
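The two overflow conventions can be compared in a short model. This is a sketch assuming a 64-bit counter; the helper names and the sampling period of 1000 are made up for illustration. It shows that with the MSB convention the elapsed count after overflow is a plain unsigned subtraction, while the wrap convention needs modular arithmetic.

```python
XLEN = 64                      # assumed counter width
MASK = (1 << XLEN) - 1
MSB = 1 << (XLEN - 1)

def step(count, inc=1):
    """Advance an XLEN-bit counter, wrapping modulo 2**XLEN."""
    return (count + inc) & MASK

def overflow_on_msb(old, new):
    """Convention 1: overflow when the top bit goes from 0 to 1."""
    return (old & MSB) == 0 and (new & MSB) != 0

def overflow_on_wrap(old, new):
    """Convention 2: overflow when the counter wraps past all-ones."""
    return new < old

# Preload so that overflow fires after `period` events (hypothetical value).
period = 1000
preload_msb = MSB - period
preload_wrap = (MASK + 1) - period

after_msb = step(preload_msb, period)    # lands on 0x8000_0000_0000_0000
after_wrap = step(preload_wrap, period)  # lands on 0

print(overflow_on_msb(preload_msb, after_msb))    # True
print(overflow_on_wrap(preload_wrap, after_wrap)) # True
print(after_msb - preload_msb)                    # 1000, plain subtraction
print((after_wrap - preload_wrap) & MASK)         # 1000, needs the modular mask
```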

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.

Brian


Allen Baum
 

I had thought about reserving config bits in hpmevent as well, but thought that it would not be backwards compatible.

In practice, it might be (in practice, I suspect the MSU bits are already there...)


alankao
 

Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The idea of overloading hpmevent looks like the one in the SBI PMU extension threads by Greg in the Unix Platform Spec TG. I have a bunch of questions.  What became of your proposal? Was it discussed in public? Did you manage to implement your idea as a working HW/S-mode SW/U-mode SW solution? If so, we can compete by benchmarking the LoC of the perf patch (assuming you do it on Linux) and the system overhead of a long perf sampling run.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.

I have no bias here as long as the HPM interrupt can be triggered. But it seems to me that you assume the HPM registers are XLEN bits wide, when actually they are not (yet?).  The spec says they should be 64 bits wide, although obviously nobody implements or remembers that.

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

Which architecture/OS are you referring to here?

Through this discussion, we will learn which idea the community prefers: adding CSRs, overloading the existing hpmevents, or some balanced compromise.  I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.

Best,
Alan


Greg Favor
 

It's nice to see people starting to get serious about addressing this long-standing issue.  We (Ventana) also worked out a proposal for these issues, one that is more in the vein of what Brian mentioned (for the reasons he mentioned, plus additional reasons).  With this proposal:

- All per-counter state resides in the associated mhpmevent CSR.

- This proposal avoids adding any new CSR registers that would need to be context switched by a hypervisor.  Not mixing counter state from different counters into common registers also greatly simplifies this.

- There is just one added CSR - that is a shadow copy of the overflow status of all counters collected together in one place.

- There are per-mode filter bits that comprehend the H-extension.  Full inter-mode security (i.e. lower privilege modes should not be able to count and observe events in higher privilege modes) is mediated by each mode as it receives software calls to configure a counter from below and calls up (ultimately to M-mode through OpenSBI) to perform the mhpmevent CSR write.

- Providing hpmcounter write access (when enabled) to lower privilege modes is properly handled, even with the H-extension, by keying off of read access being mediated by the three *counteren CSRs.

- Nothing extra is needed hardware-wise for virtualization of this Counter extension.

- Lastly, to Brian's point re counter marking, our proposal includes a per-counter Active bit that software can use to enable/disable active counting of events.  (We also use it to activate/deactivate counting when various hardware conditions occur - but that's a story for another day.)  This state context switches along with the rest of a counter's state.  (A new CSR could be provided to enable non-M-mode software to directly control this bit.  This proposal doesn't include such, but that could be a point of discussion.)

Here is a summary of the proposal:

The following bits are added to 'mhpmevent' (with proposed bit positions):

bit [31]  Active      -  If set, then counting of events is enabled
bit [30]  Overflow    -  Sticky overflow status bit that is set when counter overflows
bit [29]  IntrEnable  -  If set, an interrupt request is raised while Overflow=1
bit [28]  CntrWrEn    -  If set, then lower privilege modes that can read hpmcounter can also write it

bit [27]  Mdisable    -  If set, then counting of events in M-mode is disabled
bit [26]  HSdisable   -  If set, then counting of events in S/HS-mode is disabled
bit [25]  Udisable    -  If set, then counting of events in U-mode is disabled
bit [24]  VSdisable   -  If set, then counting of events in VS-mode is disabled
bit [23]  VUdisable   -  If set, then counting of events in VU-mode is disabled
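The bit assignments above can be written down as a small encode/decode model. This is a sketch only: the bit positions are the ones proposed in this message, but the helper names are hypothetical, and treating bits [22:0] as the event selector is an assumption, since the proposal does not say where the selector lives.

```python
# Proposed control bits in mhpmevent (positions as listed above).
ACTIVE = 1 << 31
OVERFLOW = 1 << 30
INTR_ENABLE = 1 << 29
CNTR_WR_EN = 1 << 28
MODE_DISABLE = {"M": 1 << 27, "HS": 1 << 26, "U": 1 << 25,
                "VS": 1 << 24, "VU": 1 << 23}

# Assumption for illustration: the event selector occupies bits [22:0].
EVENT_MASK = (1 << 23) - 1

def encode(event_id, flags=0, disabled_modes=()):
    """Build one mhpmevent value from an event id, flag bits, and mode filters."""
    if event_id & ~EVENT_MASK:
        raise ValueError("event id collides with the control bits")
    value = event_id | flags
    for mode in disabled_modes:
        value |= MODE_DISABLE[mode]
    return value

def counts_in(value, mode):
    """True if the counter is Active and not filtered out for this mode."""
    return bool(value & ACTIVE) and not (value & MODE_DISABLE[mode])

def interrupt_requested(value):
    """An interrupt request is raised while Overflow=1 and IntrEnable=1."""
    return bool(value & OVERFLOW) and bool(value & INTR_ENABLE)

# Count hypothetical event 0x2A in every mode except M and S/HS.
cfg = encode(0x2A, ACTIVE | INTR_ENABLE, disabled_modes=("M", "HS"))
print(hex(cfg))               # 0xac00002a
print(counts_in(cfg, "U"))    # True
print(counts_in(cfg, "M"))    # False
```

Because the whole configuration is a single value, programming the counter is one plain write of `cfg` with no read-modify-write step.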

Notes:
- Since hpmcounter values are unsigned values, overflow (to be consistent) is defined as unsigned overflow.  (This matches x86 and ARMv8.)  Note that there is no loss of information after an overflow since the counter wraps around and keeps counting while the sticky Overflow bit is set.  (For a 64-bit counter it will be an awfully long time before another overflow could possibly occur.)

- A single level-sensitive "overflow interrupt request" signal is asserted while any Overflow bits are set.  This goes to whatever interrupt controller is present in the system (whether as some form of hart-local interrupt or as a global interrupt).  (This proposal doesn't try to introduce per-privilege mode overflow interrupt request signals.  ARMv8 doesn't have this and I don't think (?) x86 does either.)

The following one CSR is added:  hpmoverflow

- This is a 32-bit register that contains shadow copies of the Overflow bits in the 32 mhpmevent CSRs.  hpmoverflow bit X corresponds to mhpmeventX.  This register enables overflow interrupt handler software to quickly and easily determine which counter(s) have overflowed (and to directly clear the overflow bits as they are serviced).

- This register is readable and write-one-to-clear.  This read/write access is subject to the same *counteren CSRs that mediate access to the hpmcounter CSRs by each privilege mode.  In other words, the same "visibility" controls apply both to the hpmcounters and to their associated hpmoverflow Overflow bits.  Bits that should not be visible are RAZ/WI (read-as-zero / write-ignored).
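The access rules for hpmoverflow can be modeled in a few lines. This is a sketch under stated assumptions: the class and method names are hypothetical, and `visible_mask` stands in for whatever effective *counteren value applies to the accessing privilege mode.

```python
class HpmOverflow:
    """Toy model of the proposed hpmoverflow CSR (32 shadow Overflow bits)."""

    def __init__(self):
        self.bits = 0

    def set_overflow(self, counter):
        """Hardware side: counter N overflowed, so sticky bit N is set."""
        self.bits |= 1 << counter

    def read(self, visible_mask):
        """Bits hidden from the current mode read as zero (RAZ)."""
        return self.bits & visible_mask

    def write(self, value, visible_mask):
        """Write-one-to-clear; writes to hidden bits are ignored (WI)."""
        self.bits &= ~(value & visible_mask) & 0xFFFFFFFF

ovf = HpmOverflow()
ovf.set_overflow(3)
ovf.set_overflow(5)

counteren = 1 << 3                   # assumed: only counter 3 is visible to this mode
print(bin(ovf.read(counteren)))      # 0b1000
ovf.write(0xFFFFFFFF, counteren)     # clears bit 3 only
print(bin(ovf.bits))                 # 0b100000
```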

Greg


alankao
 

Hi Greg,

Questions:
- Is Active (bit[31]) any different from the inhibit register, functionally speaking?
- Assume that we are making this HPM an extension (maybe Zmhpm, Zshpm?). How is it possible that no extra registers are needed together with the H Extension?  At least we need the counteren.
- Did you implement this proposal as a solution on which perf really works? As mentioned in the original post, we (Andes) released implementations in both hardware and software two years ago.

The main difference between our proposal and yours is the way we implement the essential HPM functionalities.  I resist the idea of overloading hpmevents purely because we have been working the other way (adding CSRs). After reviewing the existing code and the perf_event framework, I don't think there will be any trouble developing perf based on your proposal.  Also, thank you for covering the H extension here.

Alan


Bill Huffman
 

Brian,

I'm curious whether the 'marked' bit is a significant improvement on the mcountinhibit CSR, which could be swapped to enable counters for multiple processes.

      Bill


Greg Favor
 

Alan,

Hi. Comments below.

On Mon, Jul 20, 2020 at 7:39 PM alankao <alankao@...> wrote:
> Hi Greg,
>
> Questions:
> - Is Active (bit[31]) any different from the inhibit register, functionally speaking?

On the surface it isn't, but in practice it is or wants to be for the following reasons:

- One wants the 'Active' state to be with all the other state of a counter so that it can all be context switched together by a hypervisor, as needed, when context switching a VM.  Having it (and all the other state bits) in mhpmevent means that they are context-switched "for free" when hpmcounter and mhpmevent are saved/restored.  Also, mixing all the 'Active' bits together in a common CSR (like mcountinhibit) complicates context-switching a subset of counters (since one has to explicitly insert and extract the relevant bits from that CSR).

- New/extra OpenSBI calls would be needed to support reading/writing such state that is in other places besides mhpmevent.

- When one brings into the picture setting and clearing the Active bit in response to hardware events (e.g. overflow by another counter or firing of a debug trigger in the Trigger Module), that can't be the current mcountinhibit bits (without changing the definition of that CSR).  In general, one can allow both hardware and software to control the activation and deactivation of active counting by a counter by setting/clearing one common bit that represents the 'active' state of the counter (and in a place that is naturally context-switched along with the rest of the counter state).  (Also, to be clear, this proposal isn't trying to standardize hardware control of Active bits, but it does provide a simple standardized basis for someone wanting to add in their own hardware counter control.)

- mcountinhibit is an M-mode-only CSR.  Any support for lower modes to directly enable/disable counting would require another/new CSR.
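The context-switch argument in the list above can be illustrated with a toy model. This is a sketch only: CSRs are modeled as Python lists and integers, the function names are hypothetical, and the set of counters owned by a VM is made up. It contrasts saving a subset of counters whose state is fully per-counter with saving the same subset when one control bit per counter sits in a shared CSR such as mcountinhibit.

```python
def save_per_counter(mhpmevent, mhpmcounter, owned):
    """Per-counter state: copy out whole registers, no bit manipulation."""
    return {n: (mhpmevent[n], mhpmcounter[n]) for n in owned}

def restore_per_counter(mhpmevent, mhpmcounter, saved):
    for n, (event, count) in saved.items():
        mhpmevent[n] = event
        mhpmcounter[n] = count

def owned_mask(owned):
    mask = 0
    for n in owned:
        mask |= 1 << n
    return mask

def save_shared(mcountinhibit, owned):
    """Shared CSR: the owned bits must be extracted explicitly."""
    return mcountinhibit & owned_mask(owned)

def restore_shared(mcountinhibit, saved_bits, owned):
    """Shared CSR: the owned bits must be inserted without disturbing the rest."""
    mask = owned_mask(owned)
    return (mcountinhibit & ~mask) | (saved_bits & mask)

# A VM that owns counters 3 and 4 (hypothetical assignment).
owned = [3, 4]
events = [0] * 32
counts = [0] * 32
events[3], counts[3] = 0x8000002A, 1234
saved = save_per_counter(events, counts, owned)

inhibit = 0b111000                       # counters 3, 4, 5 currently inhibited
vm_bits = save_shared(inhibit, owned)    # 0b011000
inhibit = 0b100000                       # another VM ran; only counter 5 inhibited
inhibit = restore_shared(inhibit, vm_bits, owned)
print(bin(inhibit))                      # 0b111000
```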

> - Assume that we are making this HPM an extension (maybe Zmhpm, Zshpm?). How is it possible that no extra registers are needed together with the H Extension?  At least we need the counteren.

The mcounteren, scounteren, and hcounteren CSR's already exist (between the base Privileged spec and the current H-extension draft).  Nothing additional is needed for this counter extension.

Greg


Greg Favor
 

Bill,

Hopefully my last email also answers your question.

Greg


Brian Grayson
 

Hi, Alan.

My proposal is still a work in progress, hence has not been shared publicly, but is significantly based on a proven architecture with about 30 years in the field and a few billion shipping cores, if not more -- the PowerPC performance monitor implementation. I did the in-house Linux kernel patches and tool support for it about two decades ago at Motorola :) so I used to know it quite well, and can see how a similar approach solves some of the current problems that we all have encountered with the current RISC-V approach. I am fairly new to the RISC-V ecosystem, so I was not aware of the work that you have done in the past; thanks for the pointer to that.

The SBI PMU extension is more about the API between perf (or another tool) and the M-mode software, that is, what the tool communicates and how the M-mode software interprets it, and not about actually changing the hardware interpretation of mhpmevent bits; at least that was my understanding.

I am glad that so many of us are converging on all the same fundamental needs!

Brian

On Mon, Jul 20, 2020 at 7:38 PM alankao <alankao@...> wrote:
Hi Brian,

I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>
Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension threads in Unix Platform Spec TG by Greg. I have a bunch of questions.  How was your proposal later? Was it discussed in public? Did you manage to implement your idea into a working HW/S-mode SW/U-mode SW solution? If so, we can compete with each other by real benchmarking the LoC of the perf patch (assuming you do it on Linux) and the system overhead running a long perf sample.


Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.
I have no bias here as long as the HPM interrupt can be triggered. But somehow it seems to me that you assume the HPM registers are XLEN-width but actually they are not (yet?).  The spec says they should be 64-bit width although obviously nobody implements nor remember that.

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.
Which architecture/OS are you referring to here? 

Through this discussion, we will learn which idea the community prefers: adding CSRs, overloading the existing hpmevents, or some balanced compromise.  I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.

Best,
Alan


Brian Grayson
 

The 'marked bit' in my proposal is different from a per-counter active bit, and may be a bit hard to explain well, but I'll try.

To me, there are two very different performance monitor approaches for system software:

- Linux-style, where one uses a tool like perf, and perfmon state is saved and restored on context switches. In this case, a 'marked bit' is not needed, as one can just control everything on a per-process basis

- embedded bare-metal whole-system performance monitoring, where there is no support like perf. This is where the 'marked bit' becomes more obvious, where basically one configures the counters to be free-running, except for the masking. So for example, consider an embedded application where there is the bread-and-butter ordinary work, but there is also some kind of exceptional/unordinary work (bad packet checksum, new route to establish, network link up/down, etc as networking examples). One could set and clear the marked bit on entry and exit from these routines, allowing easy profiling of everything, or of just the ordinary work, or of just the exceptional work, by using the marked bit. The same could be done by having each entry/exit point reprogram N counters, or altering a global mcountinhibit, but both of those approaches fall short. The first one forces a recompile of your system software whenever you want to change events, or a swapin/swapout of perfmon state (just like a context switch) when entering/leaving these routines, while the marked bit just requires setting a single bit; the second one (using mcountinhibit) forces one to choose between counting or not counting, and doesn't allow you to count marked-bit activities on one counter, and non-marked-bit activities on another counter, and all activities (both marked and unmarked) on a third counter.
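
The three-counter scenario above (marked only, unmarked only, everything) can be sketched as a small software model in C. This is only an illustration under assumptions: the `mark_filter` enum, the `counter` struct, and the way the marked bit is toggled are all hypothetical and do not correspond to any defined RISC-V CSR layout.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical per-counter filter on the 'marked' bit. */
enum mark_filter { COUNT_ALL, COUNT_MARKED, COUNT_UNMARKED };

struct counter {
    enum mark_filter filter;
    uint64_t value;
};

/* Account one event on every counter whose filter matches the current
 * state of the marked bit. */
static void count_event(struct counter *ctrs, int n, bool marked)
{
    for (int i = 0; i < n; i++) {
        bool hit = ctrs[i].filter == COUNT_ALL ||
                   (ctrs[i].filter == COUNT_MARKED && marked) ||
                   (ctrs[i].filter == COUNT_UNMARKED && !marked);
        if (hit)
            ctrs[i].value++;
    }
}

/* Simulated workload: 'ordinary' events run with the bit clear,
 * 'exceptional' events run with the bit set. The counters themselves
 * are never reprogrammed; only the single bit changes. */
static void run_workload(struct counter *ctrs, int n,
                         int ordinary, int exceptional)
{
    bool marked = false;            /* stands in for the proposed state bit */
    for (int i = 0; i < ordinary; i++)
        count_event(ctrs, n, marked);

    marked = true;                  /* entry to the exceptional routine */
    for (int i = 0; i < exceptional; i++)
        count_event(ctrs, n, marked);
}
```

A single run with three counters configured as all/marked/unmarked yields the total and both partial counts at once, which is what a global inhibit cannot provide.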

If all of the embedded customers will be using a perf-style approach with context-switching of perfmon registers, I think I can agree that the marked bit is not as useful, but I don't think that will be true for all of our customers.

Brian


On Mon, Jul 20, 2020 at 11:30 PM Greg Favor <gfavor@...> wrote:
Bill,

Hopefully my last email also answers your question.

Greg

On Mon, Jul 20, 2020 at 8:04 PM Bill Huffman <huffman@...> wrote:

Brian,

I'm curious whether the 'marked' bit is a significant improvement on the mcountinhibit CSR, which could be swapped to enable counters for multiple processes.

      Bill

On 7/20/20 1:54 PM, Brian Grayson wrote:
I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.

In your proposal, the control for counter3, for example, is in mcounteren (for interrupt enablement), mcounterovf (for overflow status), mcountermask (for which modes to count), and of course mhpmevent3. In my proposal, all of these would be contained in the existing per-counter configuration register, mhpmevent*. There is not much of a difference between these proposals, except that in your proposal, programming a counter may require read-modify-write'ing 5 registers (those 4 control plus the counter itself), whereas in my proposal it requires writing (and not read-modify-write'ing) only two registers. In practice, programming 4 counters would require 20 RMW's (20 csrr's and 20 csrw's and some glue ALU instructions in the general case -- some of those would be avoidable of course) in your proposal vs. 8 simple writes in mine -- four to set the event config, four to set/clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software-under-test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, and not one of functionality.

From my count, roughly 8 or so bits would be needed in mhpmevent* to specify interrupt enable, masking mode, et al. These could be allocated from the top bits, allowing 32+ bits for ordinary event specification.
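
As a rough illustration of the single-register idea, here is a minimal C sketch that packs control bits into the top 8 bits of a 64-bit mhpmevent value and leaves the rest for event selection. The individual bit names and positions are assumptions made up for this example; the thread does not define them.

```c
#include <stdint.h>

/* Hypothetical layout: control fields in the top 8 bits,
 * event selector in the remaining 56 bits. */
#define HPMEVT_OVF_IRQ_EN  (1ull << 63)   /* interrupt on overflow (assumed) */
#define HPMEVT_COUNT_M     (1ull << 62)   /* count in M-mode (assumed) */
#define HPMEVT_COUNT_S     (1ull << 61)   /* count in S-mode (assumed) */
#define HPMEVT_COUNT_U     (1ull << 60)   /* count in U-mode (assumed) */
#define HPMEVT_CTRL_MASK   (0xFFull << 56)
#define HPMEVT_EVENT_MASK  (~HPMEVT_CTRL_MASK)

/* Build the full register value, so that programming the event
 * configuration is one plain write rather than a read-modify-write. */
static uint64_t hpmevent_pack(uint64_t event, uint64_t ctrl_bits)
{
    return (ctrl_bits & HPMEVT_CTRL_MASK) | (event & HPMEVT_EVENT_MASK);
}

/* Extract the event selector from a packed value. */
static uint64_t hpmevent_event(uint64_t value)
{
    return value & HPMEVT_EVENT_MASK;
}
```

For example, `hpmevent_pack(0x1234, HPMEVT_OVF_IRQ_EN | HPMEVT_COUNT_U)` gives one value carrying the event selector, the interrupt enable, and the mode mask together.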

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.
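
A small C sketch of the first convention (overflow signalled when the top bit becomes set) may make the arithmetic argument concrete. The helper names are hypothetical, a 64-bit counter is assumed, and `period` is assumed to be between 1 and 2^63.

```c
#include <stdint.h>
#include <stdbool.h>

/* Convention under discussion: overflow is signalled when bit 63 becomes set
 * (the 0x7fff... -> 0x8000... transition). */
static bool overflowed_msb(uint64_t counter)
{
    return (counter >> 63) & 1;
}

/* Preload value so that bit 63 becomes set after exactly `period` events. */
static uint64_t preload_msb(uint64_t period)
{
    return (1ull << 63) - period;
}

/* Events counted since the preload. This stays valid before and after the
 * overflow, using only a single read of the counter and no extra status bit. */
static uint64_t events_since_preload(uint64_t counter, uint64_t period)
{
    return counter - preload_msb(period);
}
```

With the wrap-to-zero convention, by contrast, the counter value alone cannot distinguish "just preloaded" from "already wrapped", so the extra overflow bit has to be read from somewhere else.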

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.

Brian


Anup Patel
 

Hi All,

 

The proposed SBI PMU extension is a set of APIs between M-mode and HS-mode (also between HS-mode and VS-mode) such that no RISC-V spec changes are required for existing HPMCOUNTERs.

 

The high-level view of SBI PMU extension is as follows:

  1. The SBI PMU calls are only for discovering and configuring HARDWARE/SOFTWARE counters and events
  2. The HARDWARE/SOFTWARE counters will be read directly from S-mode software without any SBI calls
  3. There are two suggested approaches for overflow handling:
    1. No overflow interrupt. The S-mode software will use periodic timer interrupts to track counter overflows. This is an imprecise software approach to detecting overflow
    2. Per-HART edge-triggered interrupt routed through the PLIC (or some other platform interrupt controller). The per-HART interrupt will have to be routed through the PLIC because we don’t have interrupts defined in the MIP/MIE CSRs for counter overflows. Making this interrupt edge-triggered means S-mode software will not need to clear the interrupt using an SBI call
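
The first approach (periodic polling, no overflow interrupt) can be sketched in C as a software-extended counter. This is a minimal sketch under assumptions: the `soft_counter` struct and its `width` field are hypothetical, and the polling interval is assumed to be short enough that the hardware counter wraps at most once between polls.

```c
#include <stdint.h>

/* Software-extended counter: accumulates deltas read from a hardware
 * counter that may implement fewer than 64 bits. */
struct soft_counter {
    uint64_t total;     /* accumulated 64-bit count */
    uint64_t last_raw;  /* raw hardware value seen at the previous poll */
    unsigned width;     /* assumed implemented counter width in bits (1..64) */
};

/* Called from the periodic timer interrupt with the freshly read raw value.
 * Modular subtraction absorbs a single wrap between two polls. */
static void soft_counter_poll(struct soft_counter *sc, uint64_t raw)
{
    uint64_t mask = sc->width >= 64 ? UINT64_MAX
                                    : ((1ull << sc->width) - 1);
    sc->total += (raw - sc->last_raw) & mask;
    sc->last_raw = raw & mask;
}
```

The accumulated total stays correct across a wrap, but software only learns about the overflow at the next timer tick, which is the imprecision mentioned above.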

 

The SBI PMU extension does not prevent anyone from defining a new RISC-V PMU extension. However, whatever new RISC-V PMU extension is defined should consider the H-extension and also provide dedicated CSRs for S-mode.

 

In the future, when a new RISC-V PMU extension is available in the RISC-V privileged spec, the SBI PMU extension will continue to exist and will mostly be used for SOFTWARE counters provided by the SBI implementation (OpenSBI or hypervisors) to the S-mode software.

 

Regards,

Anup

 

From: tech-privileged@... <tech-privileged@...> On Behalf Of Brian Grayson
Sent: 21 July 2020 10:45
To: alankao <alankao@...>
Cc: tech-privileged@...
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

 

Hi, Alan.

 

My proposal is still a work in progress, hence has not been shared publicly, but is significantly based on a proven architecture with about 30 years in the field and a few billion shipping cores, if not more -- the PowerPC performance monitor implementation. I did the in-house Linux kernel patches and tool support for it about two decades ago at Motorola :) so I used to know it quite well, and can see how a similar approach solves some of the current problems that we all have encountered with the current RISC-V approach. I am fairly new to the RISC-V ecosystem, so I was not aware of the work that you have done in the past; thanks for the pointer to that.

 

The SBI PMU extension is more about the API between what perf (or another tool) communicates and how the M-mode software interprets it, and not about actually changing the hardware interpretation of mhpmevent bits; at least that was my understanding.

 

I am glad that so many of us are converging on all the same fundamental needs!

 

Brian

 


Allen Baum
 

It's late; I'm missing something.
How is "mark" being set/cleared on entry/exit different than "active" in Greg's proposal being set/cleared on entry/exit ?


Greg Favor
 

Ah, I see.  The 'marked' bit is state associated with and managed by the code running, not associated with a counter.  Then a counter could be configured (via its event selection) to count a selected event while the current 'marked' bit is set or not set.

As you note, for everyone using a perf-style approach, this 'marked' bit is not so useful.  And for the other bare-metal embedded customers that might desire this 'marked' bit, this bit of state needs to be added to some other existing or new CSR that is distinct from the current hpmcounter/mhpmevent CSRs.  That sounds like a separate (small) extension, orthogonal to the current discussion, targeted at this embedded segment of people.

Greg


Greg Favor
 

Regarding overflow interrupts as edge-sensitive interrupts:

It seems like this would require that there not be any overflow status bit in the mhpmevent CSRs (or any alternative CSR); otherwise this bit would need to be cleared by software, which is equivalent to the "clearing a serviced overflow interrupt" that the proposal below is trying to avoid.  Dropping the status bit seems generally undesirable, and it raises the question of how the overflow interrupt handler would figure out which counter(s) overflowed.  (Or are you imagining 32 separate per-counter overflow interrupt requests to a PLIC?)

In contrast, both x86 and ARMv8 have explicit counter overflow status bits that are the basis for generating a shared level-sensitive interrupt request and these bits must be cleared by handler software.

Lastly, note that in my proposal the handler (say in S/HS mode) can directly clear the Overflow status bits for the counter overflows that it has serviced - via the hpmoverflow CSR.  That avoids the SBI call that you rightly are wanting to avoid.
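
A minimal C sketch of such a handler, modelled in software: the `sim_hpmoverflow` variable stands in for the proposed hpmoverflow CSR, and both the one-bit-per-counter layout and the way bits are cleared are assumptions for illustration, since the thread does not specify them.

```c
#include <stdint.h>

/* Simulated stand-in for the proposed hpmoverflow CSR
 * (assumed layout: bit i set means counter i has overflowed). */
static uint32_t sim_hpmoverflow;

/* Hypothetical bookkeeping: how many overflows were serviced per counter. */
static unsigned serviced[32];

/* Shared, level-sensitive overflow interrupt handler. */
static void overflow_irq_handler(void)
{
    /* One read tells the handler which counters overflowed. */
    uint32_t pending = sim_hpmoverflow;

    for (unsigned i = 0; i < 32; i++) {
        if (pending & (1u << i))
            serviced[i]++;          /* e.g. record a sample, reload the counter */
    }

    /* Clear only the bits that were serviced. A bit that became set after
     * the read above stays set, so the level-sensitive request remains
     * asserted and the handler runs again. */
    sim_hpmoverflow &= ~pending;
}
```

Because the status bits are directly accessible to the handler's privilege mode, no SBI call is needed to acknowledge the interrupt.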

Greg


Anup Patel
 

Hi Greg,

 

For per-HART edge-sensitive interrupts, we can avoid a per-counter overflow status bit by keeping track of the last read value of each counter and comparing against it in the overflow interrupt handler.
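
The last-read comparison can be sketched in C as follows. This is only a sketch under assumptions: the function name and the array-based interface are hypothetical, counters are assumed to count upward and wrap to zero, and a counter that wraps and then climbs past its previous value before the handler runs would go unnoticed.

```c
#include <stdint.h>

/* Compare the current counter values against the values recorded at the
 * previous read. A counter whose value went down is taken to have wrapped.
 * Returns a bitmask of wrapped counters (bit i for counter i) and records
 * the current values for the next comparison. Handles at most 32 counters. */
static uint32_t find_wrapped(const uint64_t *now, uint64_t *last, unsigned n)
{
    uint32_t wrapped = 0;
    for (unsigned i = 0; i < n && i < 32; i++) {
        if (now[i] < last[i])
            wrapped |= 1u << i;
        last[i] = now[i];
    }
    return wrapped;
}
```

The handler for the single per-HART interrupt would read every active counter, call a helper like this, and then service only the counters flagged in the returned mask.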

 

I am suggesting one edge-sensitive interrupt for each HART routed through PLIC so that we don’t need too many PLIC interrupt lines.

 

Regards,

Anup

 


On Mon, Jul 20, 2020 at 7:38 PM alankao <alankao@...> wrote:

Hi Brian,

I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The idea of overloading hpmevent looks like the one Greg raised in the SBI PMU extension threads in the Unix Platform Spec TG. I have a bunch of questions. What happened to your proposal afterwards? Was it discussed in public? Did you manage to implement your idea as a working HW/S-mode SW/U-mode SW solution? If so, we can compare the two approaches with real benchmarks: the LoC of the perf patch (assuming you do it on Linux) and the system overhead of running a long perf sample.

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.

I have no bias here as long as the HPM interrupt can be triggered. But it seems to me that you assume the HPM registers are XLEN bits wide, when actually they are not (yet?). The spec says they should be 64 bits wide, although apparently nobody implements or remembers that.
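
The two overflow conventions under discussion can be compared with a minimal C sketch. It assumes a 64-bit counter; the function names are hypothetical and do not come from either proposal. Under the first convention an overflow is signalled when the most significant bit goes from 0 to 1, under the second when the counter wraps past zero. In both cases the number of events since a preload can be recovered with a single unsigned subtraction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convention A (hypothetical name): overflow when the MSB goes 0 -> 1,
 * i.e. 0x7fffffffffffffff -> 0x8000000000000000. */
static bool overflow_msb(uint64_t before, uint64_t after)
{
    return !(before >> 63) && (after >> 63);
}

/* Convention B (hypothetical name): overflow when the counter wraps,
 * i.e. 0xffffffffffffffff -> 0x0000000000000000. */
static bool overflow_wrap(uint64_t before, uint64_t after)
{
    return after < before;
}

/* Preload values so that the overflow fires after `period` events.
 * For convention A, period must not exceed 2^63. */
static uint64_t preload_msb(uint64_t period)
{
    return (UINT64_C(1) << 63) - period;
}

static uint64_t preload_wrap(uint64_t period)
{
    return UINT64_C(0) - period;
}

/* Events counted since the preload; unsigned (modulo 2^64) subtraction
 * gives the right answer under either convention. */
static uint64_t events_since(uint64_t preload, uint64_t now)
{
    return now - preload;
}
```

With convention A, a counter preloaded with `preload_msb(1000)` still holds an ordinary unsigned value after it overflows, which is the single-read property mentioned in the quoted paragraph.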

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

Which architecture/OS are you referring to here? 

Through this discussion, we will learn which idea the community prefers: adding CSRs, overloading the existing hpmevents, or some balanced compromise.  I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.

Best,
Alan


Bill Huffman
 

Brian,

I think the advantages you explain are exactly the reason why I was asking whether the marked bit was a significant improvement.  I think the functionality is extremely similar to swapping mcountinhibit CSR values.  For example, there can be one counter that counts marked activities and another that counts non-marked activities.  This would be done by having two counters set up the same way, except that the inhibit bit of one counter is set in one mcountinhibit value and the inhibit bit of the other counter is set in the other value.

It seems to me that the advantage of the marked bit is that there isn't a register to swap.  But the bit in the status register still needs to be changed at those same times.  So, I'm seeing only tiny differences - such as that, with the marked bit, there's no need for a location to save a mcountinhibit value when it's not being used.  And presumably supervisor can change the marked bit in status.  On the other hand, the mcountinhibit method allows the effect of more than two values of the marked bit by having more than two mcountinhibit values.

So, it doesn't seem to me a very large improvement once mcountinhibit already exists, even for the embedded cases where context-switching of perfmon registers is not done.
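
The mcountinhibit-swapping scheme described above can be modelled in a few lines of C. This is only a software simulation under stated assumptions: `struct hpm_model`, `hpm_event`, and `swap_inhibit` are hypothetical names, all listening counters are assumed to be programmed with the same event, and `swap_inhibit` stands in for a single CSR swap instruction. The one detail taken from the existing privileged spec is that a set bit i in mcountinhibit stops counter i.

```c
#include <stdint.h>

/* Simulated per-hart HPM state (hypothetical model, not real CSR access). */
struct hpm_model {
    uint32_t mcountinhibit;   /* bit i set => counter i does not increment */
    uint64_t counter[32];
};

/* Deliver one event to every counter in `listening` that is not inhibited.
 * Only the HPM counters 3..31 are considered here. */
static void hpm_event(struct hpm_model *h, uint32_t listening)
{
    for (unsigned i = 3; i < 32; i++) {
        uint32_t bit = UINT32_C(1) << i;
        if ((listening & bit) && !(h->mcountinhibit & bit))
            h->counter[i]++;
    }
}

/* Models an atomic CSR swap: install a new inhibit value, return the old. */
static uint32_t swap_inhibit(struct hpm_model *h, uint32_t new_value)
{
    uint32_t old = h->mcountinhibit;
    h->mcountinhibit = new_value;
    return old;
}
```

For example, with counters 3, 4, and 5 listening, an "ordinary" inhibit value of `1u << 4` and an "exceptional" value of `1u << 3`, counter 3 sees only ordinary work, counter 4 only exceptional work, and counter 5 (never inhibited) sees everything. The routine that swaps the values does not need to know what the individual bits mean.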

      Bill


On 7/20/20 10:25 PM, Brian Grayson wrote:

The 'marked bit' in my proposal is different from a per-counter active bit, and may be a bit hard to explain well, but I'll try.

To me, there are two very different performance monitor approaches for system software:

- Linux-style, where one uses a tool like perf, and perfmon state is saved and restored on context switches. In this case, a 'marked bit' is not needed, as one can just control everything on a per-process basis

- embedded bare-metal whole-system performance monitoring, where there is no support like perf. This is where the 'marked bit' becomes more obvious, where basically one configures the counters to be free-running, except for the masking. So for example, consider an embedded application where there is the bread-and-butter ordinary work, but there is also some kind of exceptional/unordinary work (bad packet checksum, new route to establish, network link up/down, etc as networking examples). One could set and clear the marked bit on entry and exit from these routines, allowing easy profiling of everything, or of just the ordinary work, or of just the exceptional work, by using the marked bit. The same could be done by having each entry/exit point reprogram N counters, or altering a global mcountinhibit, but both of those approaches fall short. The first one forces a recompile of your system software whenever you want to change events, or a swapin/swapout of perfmon state (just like a context switch) when entering/leaving these routines, while the marked bit just requires setting a single bit; the second one (using mcountinhibit) forces one to choose between counting or not counting, and doesn't allow you to count marked-bit activities on one counter, and non-marked-bit activities on another counter, and all activities (both marked and unmarked) on a third counter.
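
A minimal C model of the marked-bit filtering described in the second bullet, assuming each counter carries a small filter field selecting "all", "marked only", or "unmarked only". The type and function names are hypothetical, and the model says nothing about where the bit would live in hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-counter filter on the hart-wide marked bit (hypothetical encoding). */
enum mark_filter { COUNT_ALL, COUNT_MARKED, COUNT_UNMARKED };

struct mark_counter {
    enum mark_filter filter;
    uint64_t value;
};

struct mark_hart {
    bool marked;                  /* the single software-controlled bit */
    struct mark_counter ctr[3];   /* free-running counters, same event */
};

/* One event occurs: each counter decides for itself whether to count it. */
static void mark_event(struct mark_hart *h)
{
    for (int i = 0; i < 3; i++) {
        struct mark_counter *c = &h->ctr[i];
        bool take = c->filter == COUNT_ALL ||
                    (c->filter == COUNT_MARKED && h->marked) ||
                    (c->filter == COUNT_UNMARKED && !h->marked);
        if (take)
            c->value++;
    }
}

/* Run `n` events with the marked bit set or clear, as a routine would do
 * on entry; the routine never touches the counter configuration. */
static void mark_run(struct mark_hart *h, bool marked, unsigned n)
{
    h->marked = marked;
    for (unsigned i = 0; i < n; i++)
        mark_event(h);
}
```

In this model the entry/exit code only flips `marked`; which counters react, and how, is decided once when the counters are configured.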

If all of the embedded customers will be using a perf-style approach with context-switching of perfmon registers, I think I can agree that the marked bit is not as useful, but I don't think that will be true for all of our customers.

Brian

On Mon, Jul 20, 2020 at 11:30 PM Greg Favor <gfavor@...> wrote:
Bill,

Hopefully my last email also answers your question.

Greg

On Mon, Jul 20, 2020 at 8:04 PM Bill Huffman <huffman@...> wrote:

Brian,

I'm curious whether the 'marked' bit is a significant improvement on the mcountinhibit CSR, which could be swapped to enable counters for multiple processes.

      Bill

On 7/20/20 1:54 PM, Brian Grayson wrote:

I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.

In your proposal, the control for counter3, for example, is in mcounteren (for interrupt enablement), mcounterovf (for overflow status), mcountermask (for which modes to count), and of course mhpmevent3. In my proposal, all of these would be contained in the existing per-counter configuration register, mhpmevent*. There is not much of a difference between these proposals, except that in your proposal, programming a counter may require read-modify-write'ing 5 registers (those 4 control plus the counter itself), whereas in my proposal it requires writing (and not read-modify-write'ing) only two registers. In practice, programming 4 counters would require 20 RMW's (20 csrr's and 20 csrw's and some glue ALU instructions in the general case -- some of those would be avoidable of course) in your proposal vs. 8 simple writes in mine -- four to set the event config, four to set/clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software-under-test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, and not one of functionality.

From my count, roughly 8 or so bits would be needed in mhpmevent* to specify interrupt enable, masking mode, et al. These could be allocated from the top bits, allowing 32+ bits for ordinary event specification.
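
To make the per-counter packing concrete, here is a C sketch of one possible layout. The bit positions, macro names, and helper functions are all assumptions made for illustration; the message above only says that roughly 8 control bits could be taken from the top of mhpmevent*, leaving 32+ bits for the event selector. The simulated `hpm_program` also counts CSR writes to show the two-writes-per-counter cost.

```c
#include <stdint.h>

/* Hypothetical control bits at the top of a 64-bit mhpmevent value. */
#define HPM_IRQ_EN     (UINT64_C(1) << 63)  /* interrupt on overflow */
#define HPM_OVF        (UINT64_C(1) << 62)  /* overflow status */
#define HPM_COUNT_M    (UINT64_C(1) << 61)  /* count in M-mode */
#define HPM_COUNT_S    (UINT64_C(1) << 60)  /* count in S-mode */
#define HPM_COUNT_U    (UINT64_C(1) << 59)  /* count in U-mode */
#define HPM_EVENT_MASK UINT64_C(0xffffffff) /* low 32 bits: event selector */

/* Build a complete mhpmevent value; no read-modify-write is needed. */
static uint64_t hpmevent_pack(uint32_t event, uint64_t flags)
{
    return (flags & ~HPM_EVENT_MASK) | event;
}

static uint32_t hpmevent_event(uint64_t value)
{
    return (uint32_t)(value & HPM_EVENT_MASK);
}

/* Simulated CSR file that counts writes (not real CSR access). */
struct hpm_sim {
    uint64_t mhpmevent[32];
    uint64_t mhpmcounter[32];
    unsigned csr_writes;
};

/* Program one counter with exactly two plain writes. */
static void hpm_program(struct hpm_sim *s, unsigned idx, uint32_t event,
                        uint64_t flags, uint64_t initial)
{
    s->mhpmevent[idx] = hpmevent_pack(event, flags);
    s->csr_writes++;
    s->mhpmcounter[idx] = initial;
    s->csr_writes++;
}
```

Programming counters 3 through 6 this way costs 8 simulated writes, matching the figure given earlier for four counters.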

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.

Brian

On Sun, Jul 19, 2020 at 10:47 PM alankao <alankao@...> wrote:
Hi all,

This proposal is a refinement and update of a previous thread: https://lists.riscv.org/g/tech-privileged-archive/message/488.  We noticed the current activities regarding HPM in Linux community and in Unix Platform Spec Task Group, so again we would like to call for attention to the fundamental problems of RISC-V HPM.  In brief, we plan to add 6 new CSRs to address these problems.  Please check https://github.com/NonerKao/PMU_enhancement_proposal/blob/master/proposal.md for a simple perf example and the details.

Thanks!


Greg Favor
 

What I ultimately understood from Brian's emails is that this 'marked' bit conceptually is not a state bit associated with each counter, but a state bit maintained separately by software (as it transitions into and out of regions of code that it wants to be viewed as "marked" or not).  Then counters can be configured (via their event selection) to count either marked events or unmarked events or both.

With that view, the 'marked' bit wants to be changeable directly by the software without having to call into M-mode and without those bits of software having to be aware of which counters were configured to count marked events, or non-marked events, or both.  This is why one doesn't want to be trying to use mcountinhibit bits.

In any case, I'm just trying to represent what I understand to be Brian's request.  But as he also acknowledged, the primary use case is probably bare-metal embedded systems and may not be more generally relevant.

As far as the 'Active' bit in my proposal, it allows for both software and hardware events to set/clear a counter's Active bit.  That's more interesting when one has hardware cross-trigger events from the debug Trigger Module and from specific counter overflows.  But since this all wants to go together, I would be fine with removing the Active bit from my proposal.  This simple set of cross-trigger capabilities (to create rich events that can be counted and to create richer debug triggers and richer trace control) is best treated as a separate (future) extension proposal that may or may not catch enough people's interest.  (We're all for standardizing these richer capabilities in some form, but if that doesn't happen then we (Ventana) will implement this as our own custom stuff.)

Greg




Bill Huffman
 


On 7/21/20 11:56 AM, Greg Favor wrote:

What I ultimately understood from Brian's emails is that this 'marked' bit conceptually is not a state bit associated with each counter, but a state bit maintained separately by software (as it transitions into and out of regions of code that it wants to be viewed as "marked" or not).  Then counters can be configured (via their event selection) to count either marked events or unmarked events or both.

With that view, the 'marked' bit wants to be changeable directly by the software without having to call into M-mode and without those bits of software having to be aware of which counters were configured to count marked events, or non-marked events, or both.  This is why one doesn't want to be trying to use mcountinhibit bits.
I agree about calling M-mode.  But I think the software that swaps entire mcountinhibit registers doesn't have to know anything about the bits in them either.

In any case, I'm just trying to represent what I understand to be Brian's request.  But as he also acknowledged, the primary use case is probably bare-metal embedded systems and may not be more generally relevant.

As far as the 'Active' bit in my proposal, it allows for both software and hardware events to set/clear a counter's Active bit.  That's more interesting when one has hardware cross-trigger events from the debug Trigger Module and from specific counter overflows.  But since this all wants to go together, I would be fine with removing the Active bit from my proposal.  This simple set of cross-trigger capabilities (to create rich events that can be counted and to create richer debug triggers and richer trace control) is best treated as a separate (future) extension proposal that may or may not catch enough people's interest.  (We're all for standardizing these richer capabilities in some form, but if that doesn't happen then we (Ventana) will implement this as our own custom stuff.)

Greg

I'm not trying to argue for or against any particular thing - including the Active bit - at this point.  Just wanting a full understanding....

      Bill

