Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

Bill,

Hopefully my last email also answers your question.

Greg


On Mon, Jul 20, 2020 at 8:04 PM Bill Huffman <huffman@...> wrote:

Brian,

I'm curious whether the 'marked' bit is a significant improvement on the mcountinhibit CSR, which could be swapped to enable counters for multiple processes.

      Bill

On 7/20/20 1:54 PM, Brian Grayson wrote:

I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.

In your proposal, the control for counter3, for example, is in mcounteren (for interrupt enablement), mcounterovf (for overflow status), mcountermask (for which modes to count), and of course mhpmevent3. In my proposal, all of these would be contained in the existing per-counter configuration register, mhpmevent*. There is not much of a difference between these proposals, except that in your proposal, programming a counter may require read-modify-write'ing 5 registers (those 4 control plus the counter itself), whereas in my proposal it requires writing (and not read-modify-write'ing) only two registers. In practice, programming 4 counters would require 20 RMW's (20 csrr's and 20 csrw's and some glue ALU instructions in the general case -- some of those would be avoidable of course) in your proposal vs. 8 simple writes in mine -- four to set the event config, four to set/clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software-under-test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, and not one of functionality.

By my count, roughly 8 bits would be needed in mhpmevent* to specify interrupt enable, masking mode, and so on. These could be allocated from the top bits, allowing 32+ bits for ordinary event specification.

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.

Brian

On Sun, Jul 19, 2020 at 10:47 PM alankao <alankao@...> wrote:
Hi all,

This proposal is a refinement and update of a previous thread: https://lists.riscv.org/g/tech-privileged-archive/message/488.  We noticed the current activities regarding HPM in the Linux community and in the Unix Platform Spec Task Group, so again we would like to call attention to the fundamental problems of RISC-V HPM.  In brief, we plan to add 6 new CSRs to address these problems.  Please check https://github.com/NonerKao/PMU_enhancement_proposal/blob/master/proposal.md for a simple perf example and the details.

Thanks!


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

Alan,

Hi. Comments below.

On Mon, Jul 20, 2020 at 7:39 PM alankao <alankao@...> wrote:
Hi Greg,

Questions:
- Is Active (bit[31]) any different from the inhibit register, functionally speaking?

On the surface it isn't, but in practice it is (or wants to be) for the following reasons:

- One wants the 'Active' state to be with all the other state of a counter so that it can all be context switched together by a hypervisor, as needed, when context switching a VM.  Having it (and all the other state bits) in mhpmevent means that they are context-switched "for free" when hpmcounter and mhpmevent are saved/restored.  Also, mixing all the 'Active' bits together in a common CSR (like mcountinhibit) complicates context-switching a subset of counters (since one has to explicitly insert and extract the relevant bits from that CSR).

- New/extra OpenSBI calls would be needed to support reading/writing such state if it lived anywhere other than mhpmevent.

- When one brings into the picture setting and clearing the Active bit in response to hardware events (e.g. overflow by another counter or firing of a debug trigger in the Trigger Module), that can't be the current mcountinhibit bits (without changing the definition of that CSR).  In general, one can allow both hardware and software to control the activation and deactivation of active counting by a counter by setting/clearing one common bit that represents the 'active' state of the counter (and in a place that is naturally context-switched along with the rest of the counter state).  (Also, to be clear, this proposal isn't trying to standardize hardware control of Active bits, but it does provide a simple standardized basis for someone wanting to add in their own hardware counter control.)

- mcountinhibit is an M-mode-only CSR.  Any support for lower modes to directly enable/disable counting would require another/new CSR.
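
To make the first point concrete, here is a minimal save/restore sketch (direct inline asm on the standard mhpmevent3/mhpmcounter3 CSR names; the surrounding structure is purely illustrative):

    struct hpm_counter_state {
        unsigned long event;   /* mhpmevent3: event code plus Active, Overflow, etc. */
        unsigned long count;   /* mhpmcounter3                                       */
    };

    /* Saving mhpmevent3 also saves all the per-counter control bits "for free". */
    static void hpm_ctx_save(struct hpm_counter_state *s)
    {
        __asm__ __volatile__ ("csrr %0, mhpmevent3"   : "=r"(s->event));
        __asm__ __volatile__ ("csrr %0, mhpmcounter3" : "=r"(s->count));
    }

    static void hpm_ctx_restore(const struct hpm_counter_state *s)
    {
        __asm__ __volatile__ ("csrw mhpmevent3, %0"   :: "r"(s->event));
        __asm__ __volatile__ ("csrw mhpmcounter3, %0" :: "r"(s->count));
    }

With an mcountinhibit-style shared CSR, the same save/restore would additionally have to extract and re-insert this counter's bit from a register shared with all the other counters.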

- Assume that we are making this HPM an extension (maybe Zmhpm, Zshpm?). How is it possible that no extra registers are needed together with the H extension?  At least we need the counteren.

The mcounteren, scounteren, and hcounteren CSRs already exist (between the base Privileged spec and the current H-extension draft).  Nothing additional is needed for this counter extension.

Greg


Re: Caching and sfence'ing (or not) of satp Bare mode "translations"

Greg Favor
 

Comments below:

On Mon, Jul 20, 2020 at 8:14 PM Bill Huffman <huffman@...> wrote:

Hi Greg,

My sense is that the transitions from SvXX to Bare and from Bare back to the same SvXX that was previously in force are special transitions.  One reason, it seems to me, is the extreme simplicity of Bare compared with other modes.  It's easier to switch.

Switches to/from Bare mode should be rare.  Typically one will switch from Bare mode to a translated mode as part of booting up an OS (e.g. Linux), and then will remain in that mode (until, say, the system crashes and must reboot).  Further, all such switches are performed under full software control.

Switching to/from M-mode on the other hand is frequent and often hardware initiated.  Also, any sfence.vma on M-mode exit would have to be after the exit (in potentially arbitrary code that happens to be at the return address).

Hence sfence'ing M-mode entry/exit is impractical, and it is something that needs to be performant, whereas Bare mode entry/exit is rare and software-managed.

If we require sfence.vma after a switch to or from Bare, does that also mean we have to require one after a switch to or from M-mode?  If no, why is it different? 

If a high-performance design caches "translations" in all modes of operation (including M-mode) in the TLBs, then M-mode translations must be distinguished from S/HS/U mode translations, which must be distinguished from VS/VU mode translations.  That is a small set of three translation regimes (to use an ARMv8 term) for hardware to support and handle properly.

If one has to also distinguish Bare and non-Bare modes within the S/HS/U translation regime, that effectively becomes two separate translation regimes.  Similarly, with the H-extension and two-stage translations inside VM's, the VS/VU regime needs to become four regimes (the four combinations of Bare and non-Bare stage 1 and stage 2 modes).  Consequently TLB entries and surrounding logic now need to distinguish between and properly handle seven translation regimes.  All to handle rare cases.

That, like most things, is doable, but isn't the whole point of a RISC architecture to reduce hardware cost and complexity and shift that to software where the software burden is small and the performance cost is minimal?

Greg

P.S. One could imagine instead doing a data-dependent implicit sfence.vma operation on CSR writes to the *atp registers, but besides being data-dependent (which RISC-V avoids when it comes to having data-dependent exceptions), that is a rather CISC'y thing to do - which goes back to my preceding point.
 

If yes, it will cost more to switch briefly to M-mode than I'd want it to.

      Bill

On 7/20/20 7:08 PM, Greg Favor wrote:

I would like to get people's views on the question of when an sfence.vma is required after changing the satp.mode field (to see what support there is for the following change/clarification in the Privileged spec).

Currently an sfence.vma is required when changing between SvXX modes, but is not required when changing to/from Bare mode.

In both cases there is an implicit wholesale change to the page tables, i.e. the translation of any given address generally has suddenly changed.

For some designs (that cache Bare mode translations in the TLBs for the sake of caching the PMA and PMP info for an address), having software be required to do an sfence.vma can simplify the hardware.

So the question is whether there should be architectural consistency in requiring sfence'ing after changing satp.mode (i.e. all mode changes require an sfence), versus having some mode cases being treated differently (i.e. changes to/from Bare mode not requiring an sfence)?

My (and Ventana's) bias is towards the former - for our sake and for future high-performance CPU designs by others.  But I'm interested to see if others feel similarly or not.

Greg


Re: Caching and sfence'ing (or not) of satp Bare mode "translations"

Bill Huffman
 

Hi Greg,

My sense is that the transitions from SvXX to Bare and from Bare back to the same SvXX that was previously in force are special transitions.  One reason, it seems to me, is the extreme simplicity of Bare compared with other modes.  It's easier to switch.

If we require sfence.vma after a switch to or from Bare, does that also mean we have to require one after a switch to or from M-mode?  If no, why is it different?  If yes, it will cost more to switch briefly to M-mode than I'd want it to.

      Bill

On 7/20/20 7:08 PM, Greg Favor wrote:


I would like to get people's views on the question of when an sfence.vma is required after changing the satp.mode field (to see what support there is for the following change/clarification in the Privileged spec).

Currently an sfence.vma is required when changing between SvXX modes, but is not required when changing to/from Bare mode.

In both cases there is an implicit wholesale change to the page tables, i.e. the translation of any given address generally has suddenly changed.

For some designs (that cache Bare mode translations in the TLBs for the sake of caching the PMA and PMP info for an address), having software be required to do an sfence.vma can simplify the hardware.

So the question is whether there should be architectural consistency in requiring sfence'ing after changing satp.mode (i.e. all mode changes require an sfence), versus having some mode cases being treated differently (i.e. changes to/from Bare mode not requiring an sfence)?

My (and Ventana's) bias is towards the former - for our sake and for future high-performance CPU designs by others.  But I'm interested to see if others feel similarly or not.

Greg


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Bill Huffman
 

Brian,

I'm curious whether the 'marked' bit is a significant improvement on the mcountinhibit CSR, which could be swapped to enable counters for multiple processes.

      Bill

On 7/20/20 1:54 PM, Brian Grayson wrote:


I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.

In your proposal, the control for counter3, for example, is in mcounteren (for interrupt enablement), mcounterovf (for overflow status), mcountermask (for which modes to count), and of course mhpmevent3. In my proposal, all of these would be contained in the existing per-counter configuration register, mhpmevent*. There is not much of a difference between these proposals, except that in your proposal, programming a counter may require read-modify-write'ing 5 registers (those 4 control plus the counter itself), whereas in my proposal it requires writing (and not read-modify-write'ing) only two registers. In practice, programming 4 counters would require 20 RMW's (20 csrr's and 20 csrw's and some glue ALU instructions in the general case -- some of those would be avoidable of course) in your proposal vs. 8 simple writes in mine -- four to set the event config, four to set/clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software-under-test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, and not one of functionality.

By my count, roughly 8 bits would be needed in mhpmevent* to specify interrupt enable, masking mode, and so on. These could be allocated from the top bits, allowing 32+ bits for ordinary event specification.

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.

Brian

On Sun, Jul 19, 2020 at 10:47 PM alankao <alankao@...> wrote:
Hi all,

This proposal is a refinement and update of a previous thread: https://lists.riscv.org/g/tech-privileged-archive/message/488.  We noticed the current activities regarding HPM in the Linux community and in the Unix Platform Spec Task Group, so again we would like to call attention to the fundamental problems of RISC-V HPM.  In brief, we plan to add 6 new CSRs to address these problems.  Please check https://github.com/NonerKao/PMU_enhancement_proposal/blob/master/proposal.md for a simple perf example and the details.

Thanks!


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

alankao
 

Hi Greg,

Questions:
- Is Active (bit[31]) any different from the inhibit register, functionally speaking?
- Assume that we are making this HPM an extension (maybe Zmhpm, Zshpm?). How is it possible that no extra registers are needed together with the H extension?  At least we need the counteren.
- Did you implement this proposal in a solution where perf really works? As mentioned in the original post, we (Andes) released implementations in both hardware and software two years ago.

The main difference between our proposal and yours is the way we implement the essential HPM functionality.  I resisted the idea of overloading hpmevent purely because we have been working the other way (adding CSRs).  After reviewing the existing code and the perf_event framework, I don't think there will be any trouble developing perf support based on your proposal.  Also, thank you for covering the H extension here.

Alan


Caching and sfence'ing (or not) of satp Bare mode "translations"

Greg Favor
 

I would like to get people's views on the question of when an sfence.vma is required after changing the satp.mode field (to see what support there is for the following change/clarification in the Privileged spec).

Currently an sfence.vma is required when changing between SvXX modes, but is not required when changing to/from Bare mode.

In both cases there is an implicit wholesale change to the page tables, i.e. the translation of any given address generally has suddenly changed.

For some designs (that cache Bare mode translations in the TLBs for the sake of caching the PMA and PMP info for an address), having software be required to do an sfence.vma can simplify the hardware.

So the question is whether there should be architectural consistency in requiring sfence'ing after changing satp.mode (i.e. all mode changes require an sfence), versus having some mode cases being treated differently (i.e. changes to/from Bare mode not requiring an sfence)?

My (and Ventana's) bias is towards the former - for our sake and for future high-performance CPU designs by others.  But I'm interested to see if others feel similarly or not.

Greg


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Greg Favor
 

It's nice to see people starting to get serious about addressing this long-standing issue.  We (Ventana) also worked out a proposal for these issues - one more in the vein of what Brian described (for the reasons he mentioned, plus additional ones).  With this proposal:

- All per-counter state resides in the associated mhpmevent CSR.

- This proposal avoids adding any new CSRs that would need to be context switched by a hypervisor.  Not mixing state from different counters into common registers also greatly simplifies this.

- There is just one added CSR - that is a shadow copy of the overflow status of all counters collected together in one place.

- There are per-mode filter bits that comprehend the H-extension.  Full inter-mode security (i.e. lower privilege modes should not be able to count and observe higher modes) is mediated by each mode as it receives software calls to configure a counter from below and calls up (ultimately to M-mode through OpenSBI) to perform the mhpmevent CSR write.

- Providing hpmcounter write access (when enabled) to lower privilege modes is properly handled, even with the H-extension, by keying off of read access being mediated by the three *counteren CSRs.

- Nothing extra is needed hardware-wise for virtualization of this Counter extension.

- Lastly, to Brian's point re counter marking, our proposal includes a per-counter Active bit that software can use to enable/disable active counting of events.  (We also use it to activate/deactivate counting when various hardware conditions occur - but that's a story for another day.)  This state context switches along with the rest of a counter's state.  (A new CSR could be provided to enable non-M-mode software to directly control this bit.  This proposal doesn't include such, but that could be a point of discussion.)

Here is a summary of the proposal:

The following bits are added to 'mhpmevent' (with proposed bit positions):

bit [31]  Active     -  If set, counting of events is enabled
bit [30]  Overflow   -  Sticky overflow status bit that is set when the counter overflows
bit [29]  IntrEnable -  If set, an interrupt request is raised while Overflow=1
bit [28]  CntrWrEn   -  If set, lower privilege modes that can read hpmcounter can also write it
bit [27]  Mdisable   -  If set, counting of events in M-mode is disabled
bit [26]  HSdisable  -  If set, counting of events in S/HS-mode is disabled
bit [25]  Udisable   -  If set, counting of events in U-mode is disabled
bit [24]  VSdisable  -  If set, counting of events in VS-mode is disabled
bit [23]  VUdisable  -  If set, counting of events in VU-mode is disabled
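
Expressed as C masks (bit positions exactly as listed above; the macro names themselves are only illustrative):

    /* Proposed mhpmevent control fields */
    #define MHPMEVENT_ACTIVE      (1UL << 31)   /* counting enabled                */
    #define MHPMEVENT_OVERFLOW    (1UL << 30)   /* sticky overflow status          */
    #define MHPMEVENT_INTR_EN     (1UL << 29)   /* interrupt request while OVF=1   */
    #define MHPMEVENT_CNTR_WR_EN  (1UL << 28)   /* lower modes may write counter   */
    #define MHPMEVENT_M_DISABLE   (1UL << 27)   /* don't count in M-mode           */
    #define MHPMEVENT_HS_DISABLE  (1UL << 26)   /* don't count in S/HS-mode        */
    #define MHPMEVENT_U_DISABLE   (1UL << 25)   /* don't count in U-mode           */
    #define MHPMEVENT_VS_DISABLE  (1UL << 24)   /* don't count in VS-mode          */
    #define MHPMEVENT_VU_DISABLE  (1UL << 23)   /* don't count in VU-mode          */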

Notes:
- Since hpmcounter values are unsigned values, overflow (to be consistent) is defined as unsigned overflow.  (This matches x86 and ARMv8.)  Note that there is no loss of information after an overflow since the counter wraps around and keeps counting while the sticky Overflow bit is set.  (For a 64-bit counter it will be an awfully long time before another overflow could possibly occur.)

- A single level-sensitive "overflow interrupt request" signal is asserted while any Overflow bits are set.  This goes to whatever interrupt controller is present in the system (whether as some form of hart-local interrupt or as a global interrupt).  (This proposal doesn't try to introduce per-privilege mode overflow interrupt request signals.  ARMv8 doesn't have this and I don't think (?) x86 does either.)

One new CSR is added:  hpmoverflow

- This is a 32-bit register that contains shadow copies of the Overflow bits in the 32 mhpmevent CSRs.  hpmoverflow bit X corresponds to mhpmeventX.  This register enables overflow interrupt handler software to quickly and easily determine which counter(s) have overflowed (and to directly clear the Overflow bits as they are serviced).

- This register is readable and write-one-to-clear.  This read/write access is subject to the same *counteren CSRs that mediate access to the hpmcounter CSRs by each privilege mode.  In other words, the same "visibility" controls apply both to the hpmcounters and to their associated hpmoverflow Overflow bits.  Bits that should not be visible are RAZ/WI (read-as-zero / write-ignored).
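
A rough sketch of the overflow interrupt handler this enables (Linux-style csr_read()/csr_write() macros assumed; hpmoverflow has no assigned CSR number yet, and handle_counter_overflow() is a placeholder):

    void hpm_overflow_irq(void)
    {
        unsigned long pending = csr_read(hpmoverflow);   /* bit X <-> counter X       */

        while (pending) {
            int idx = __builtin_ctzl(pending);           /* lowest overflowed counter */
            pending &= pending - 1;

            handle_counter_overflow(idx);                /* e.g. take a perf sample   */
            csr_write(hpmoverflow, 1UL << idx);          /* write-one-to-clear        */
        }
    }

Once all Overflow bits have been cleared, the level-sensitive interrupt request described above deasserts.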

Greg


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

alankao
 

My previous reply was rendered so poorly in my mail client - please forgive the repeated message.

Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies
> each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension threads in the Unix Platform Spec TG by Greg. I have a bunch of questions.  What became of your proposal? Was it discussed in public? Did you manage to implement your idea as a working HW/S-mode SW/U-mode SW solution? If so, we can compare them by benchmarking the LoC of the perf patch (assuming you do it on Linux) and the system overhead of running a long perf sample.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a
> bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via
> a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different
> register.

I have no bias here as long as the HPM interrupt can be triggered. But it seems to me that you assume the HPM registers are XLEN bits wide, when actually they are not (yet?).  The spec says they should be 64 bits wide, although obviously nobody implements or remembers that.

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in
> the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its
> purpose.

Which architecture/OS are you referring to here? 

Through this discussion, we will learn which approach the community prefers: adding CSRs, overloading the existing hpmevent registers, or some balanced compromise.  I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really look like.

Best,
Alan


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Allen Baum
 

I had thought about reserving config bits in hpmevent as well, but thought that it would not be backward compatible.

In practice, it might be (I suspect the MSU bits are already there...).

On Mon, Jul 20, 2020 at 1:54 PM Brian Grayson <brian.grayson@...> wrote:
I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.

In your proposal, the control for counter3, for example, is in mcounteren (for interrupt enablement), mcounterovf (for overflow status), mcountermask (for which modes to count), and of course mhpmevent3. In my proposal, all of these would be contained in the existing per-counter configuration register, mhpmevent*. There is not much of a difference between these proposals, except that in your proposal, programming a counter may require read-modify-write'ing 5 registers (those 4 control plus the counter itself), whereas in my proposal it requires writing (and not read-modify-write'ing) only two registers. In practice, programming 4 counters would require 20 RMW's (20 csrr's and 20 csrw's and some glue ALU instructions in the general case -- some of those would be avoidable of course) in your proposal vs. 8 simple writes in mine -- four to set the event config, four to set/clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software-under-test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, and not one of functionality.

By my count, roughly 8 bits would be needed in mhpmevent* to specify interrupt enable, masking mode, and so on. These could be allocated from the top bits, allowing 32+ bits for ordinary event specification.

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.

Brian

On Sun, Jul 19, 2020 at 10:47 PM alankao <alankao@...> wrote:
Hi all,

This proposal is a refinement and update of a previous thread: https://lists.riscv.org/g/tech-privileged-archive/message/488.  We noticed the current activities regarding HPM in the Linux community and in the Unix Platform Spec Task Group, so again we would like to call attention to the fundamental problems of RISC-V HPM.  In brief, we plan to add 6 new CSRs to address these problems.  Please check https://github.com/NonerKao/PMU_enhancement_proposal/blob/master/proposal.md for a simple perf example and the details.

Thanks!


Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Brian Grayson
 

I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.

In your proposal, the control for counter3, for example, is in mcounteren (for interrupt enablement), mcounterovf (for overflow status), mcountermask (for which modes to count), and of course mhpmevent3. In my proposal, all of these would be contained in the existing per-counter configuration register, mhpmevent*. There is not much of a difference between these proposals, except that in your proposal, programming a counter may require read-modify-write'ing 5 registers (those 4 control plus the counter itself), whereas in my proposal it requires writing (and not read-modify-write'ing) only two registers. In practice, programming 4 counters would require 20 RMW's (20 csrr's and 20 csrw's and some glue ALU instructions in the general case -- some of those would be avoidable of course) in your proposal vs. 8 simple writes in mine -- four to set the event config, four to set/clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software-under-test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, and not one of functionality.
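
As a concrete sketch of the difference (macro-style csr_write()/csr_set()/csr_clear() helpers in the Linux kernel style are assumed; mcounterovf and mcountermask are the CSR names from the proposal under discussion, and MODE_MASK3 is an illustrative placeholder):

    /* Split-control style: bits for counter 3 live in registers shared with
     * the other counters, so each needs a set/clear (read-modify-write) to
     * leave the other counters' bits untouched. */
    static void arm_counter3_split(unsigned long event_sel)
    {
        csr_write(mhpmevent3,   event_sel);
        csr_set  (mcounteren,   1UL << 3);      /* interrupt/count enable */
        csr_clear(mcounterovf,  1UL << 3);      /* clear overflow status  */
        csr_set  (mcountermask, MODE_MASK3);    /* which modes to count   */
        csr_write(mhpmcounter3, 0);
    }

    /* Unified style: all control for counter 3 lives in mhpmevent3. */
    static void arm_counter3_unified(unsigned long event_and_ctrl)
    {
        csr_write(mhpmevent3,   event_and_ctrl);  /* event + enable/filter bits */
        csr_write(mhpmcounter3, 0);
    }

Even in this best case, the split form touches five CSRs per counter versus two plain writes in the unified form.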

By my count, roughly 8 bits would be needed in mhpmevent* to specify interrupt enable, masking mode, and so on. These could be allocated from the top bits, allowing 32+ bits for ordinary event specification.

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.
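
Either way, as long as the value is read as a single unsigned XLEN-wide quantity, the arithmetic stays simple (illustrative only; reading hpmcounter3 from a lower mode assumes the *counteren bits permit it):

    static inline unsigned long read_hpmcounter3(void)
    {
        unsigned long v;
        __asm__ __volatile__ ("csrr %0, hpmcounter3" : "=r"(v));
        return v;
    }

    static void example(void)
    {
        unsigned long before = read_hpmcounter3();
        /* ... code under test ... */
        unsigned long delta = read_hpmcounter3() - before;  /* correct modulo 2^XLEN */

        /* With overflow defined at the sign-bit transition, "has overflowed"
         * is just a test of the top bit of that same single read:           */
        int overflowed = (long)read_hpmcounter3() < 0;
    }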

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.

Brian

On Sun, Jul 19, 2020 at 10:47 PM alankao <alankao@...> wrote:
Hi all,

This proposal is a refinement and update of a previous thread: https://lists.riscv.org/g/tech-privileged-archive/message/488.  We noticed the current activities regarding HPM in the Linux community and in the Unix Platform Spec Task Group, so again we would like to call attention to the fundamental problems of RISC-V HPM.  In brief, we plan to add 6 new CSRs to address these problems.  Please check https://github.com/NonerKao/PMU_enhancement_proposal/blob/master/proposal.md for a simple perf example and the details.

Thanks!


A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

alankao
 

Hi all,

This proposal is a refinement and update of a previous thread: https://lists.riscv.org/g/tech-privileged-archive/message/488.  We noticed the current activities regarding HPM in the Linux community and in the Unix Platform Spec Task Group, so again we would like to call attention to the fundamental problems of RISC-V HPM.  In brief, we plan to add 6 new CSRs to address these problems.  Please check https://github.com/NonerKao/PMU_enhancement_proposal/blob/master/proposal.md for a simple perf example and the details.

Thanks!


Re: mcycle behavior during stalled wfi

Krste Asanovic
 

mcycle is considered part of the HPM facility.

If the clock to the core is not running, mcycle does not need to increment.

If part of the core is being clocked (including just mcycle), then mcycle can increment.

Sometimes the mcycle counter hardware will also be used to implement mtime, in which case one might not want it to stop incrementing during WFI if it's providing the timer interrupt.

The spec allows either interpretation.

The intended use of mcycle is for timing short runs of code with low overhead and high fidelity, and for combining with other HPM values for analysis of code performance.

mtime should be used for wall-clock timing, and might have higher overhead and possibly lower fidelity.
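
For the short-run, low-overhead case, that usage looks roughly like the following (RV64 assumed, and user-level access to the cycle shadow counter assumed to be granted via mcounteren/scounteren; the function being timed is a placeholder):

    static inline unsigned long rdcycle(void)
    {
        unsigned long c;
        __asm__ __volatile__ ("rdcycle %0" : "=r"(c));
        return c;
    }

    static unsigned long time_block(void (*fn)(void))
    {
        unsigned long t0 = rdcycle();
        fn();                                 /* the short code sequence under test   */
        return rdcycle() - t0;                /* cycles: low overhead, high fidelity  */
    }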

Krste



On Jul 16, 2020, at 12:05 AM, Arjan Bink <Arjan.Bink@...> wrote:

Hi,

The mcycle CSR is described in the RISC-V Privileged Architecture spec as:

The mcycle CSR counts the number of clock cycles executed by the processor core on which the hart is running.
What does 'clock cycles executed by the processor' mean in the context of a WFI instruction? For example, if a core is stalled on a WFI (waiting for e.g. an interrupt to become pending), should mcycle keep incrementing even if for example the remainder of the core's pipeline is clock gated?

Best regards,
Arjan


mcycle behavior during stalled wfi

Arjan Bink
 

Hi,

The mcycle CSR is described in the RISC-V Privileged Architecture spec as:

The mcycle CSR counts the number of clock cycles executed by the processor core on which the hart is running.
What does 'clock cycles executed by the processor' mean in the context of a WFI instruction? For example, if a core is stalled on a WFI (waiting for e.g. an interrupt to become pending), should mcycle keep incrementing even if for example the remainder of the core's pipeline is clock gated?

Best regards,
Arjan


Re: xTVAL Compliance restriction proposal

Roger Espasa
 

Yes: X (our implementation conforms to this constraint or implements XLEN bits)

roger.


On Sun, Jun 14, 2020 at 11:39 PM Allen Baum <allen.baum@...> wrote:
When address-related exceptions occur, the xTVAL CSR is written with the faulting effective address.  However, the spec also says:

  Implementations may convert some invalid address patterns into other invalid addresses prior to writing them to xTVAL.

The purpose of this statement is to allow implementations to reduce the number of xTVAL storage bits when addresses are smaller than XLEN.  (If your implementation implements all XLEN bits of mtval, this point is moot, and you can skip to the last paragraph.)

We propose to constrain this behavior to simplify the task of compliance-testing the xTVAL registers.  In particular, allowing implementations to arbitrarily convert any invalid address to any other invalid address is impractical to test.

Some implementations, including the Rocket core, make xTVAL a two's complement value that's just 1-bit wider than the largest virtual address.  This allows xTVAL to represent all valid virtual addresses with the minimal number of physical storage bits, yet still allows it to distinguish properly sign-extended VAs from improperly sign-extended VAs.  

The scheme replaces improperly sign-extended upper bits with the inverse of the properly sign-extended address[VASIZE-1] bit. Or, in pseudocode:

  xTVAL[XLEN-1:VASIZE] = (address[XLEN-1:VASIZE] == sign_extend(address[VASIZE-1]))
    ?  sign_extend(address[VASIZE-1])
    : ~sign_extend(address[VASIZE-1]);

 The effect of this scheme is that valid virtual addresses are always preserved, whereas invalid virtual addresses always remain invalid, even though their MSBs are discarded.
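
Modeled in C (XLEN=64; the function name and the explicit vasize parameter are just for illustration, and vasize is assumed to be less than 64):

    #include <stdint.h>

    /* Recode a faulting address into the architectural xTVAL value when only
     * VASIZE+1 bits of physical storage are implemented: properly sign-extended
     * VAs are preserved exactly; improperly extended VAs get their upper bits
     * replaced by the inverse of the proper extension, so they stay invalid. */
    static uint64_t xtval_recode(uint64_t addr, unsigned vasize)
    {
        uint64_t upper_mask = ~0ULL << vasize;              /* bits [63:VASIZE]  */
        uint64_t sign       = (addr >> (vasize - 1)) & 1;   /* address[VASIZE-1] */
        uint64_t proper     = sign ? upper_mask : 0;        /* proper extension  */

        if ((addr & upper_mask) == proper)
            return addr;                                     /* valid: preserved */
        return (addr & ~upper_mask) | (proper ^ upper_mask); /* inverse extension */
    }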

We are proposing to restrict xTVAL implementations with fewer than XLEN physical storage bits to support only this invalid address conversion scheme.

-->   If your implementation conforms to this scheme, or if your xTVAL register can represent all XLEN bits unconditionally,  please select (reply with X following) "Yes."  
-->  If your implementation doesn't conform to this scheme, but still conforms to the ISA spec, please select (reply with X following) "No", and carefully explain in the comments field how your implementation recodes invalid addresses.  

"No" responses without a sufficient explanation in the comments will not be considered.

Yes: ___ (our implementation conforms to this constraint or implements XLEN bits)
No:  ___ (our implementation is ISA compliant but doesn't conform to this constraint)

Comment (if you select "no" above): 








Re: xTVAL Compliance restriction proposal

Greg Favor
 

Also a good implementation approach - adding a second variation to what WARL compliance testing would probably need to allow for.

At the end of the day, with both the "1-bit" and "2-bit" schemes, one is setting those 1-2 bits appropriately - based on the current modes and judging whether the 64-bit address being captured is "valid" or "invalid".

Greg

On Tue, Jun 30, 2020 at 4:29 PM Andrew Waterman <andrew@...> wrote:
The Rocket approach works here if you set the width of mtval to 1+max(VAbits,PAbits) and then sign-extend from the most-significant implemented bit.  The extra 1 bit allows for zero-extension of PAs.

On Tue, Jun 30, 2020 at 6:19 PM Allen Baum <allen.baum@...> wrote:
You said:
  Sign extend if translated
 Zero extend otherwise (I don't think that is quite the same as Bare+M-mode because of MPRV, and I'm not sure how hypervisor modes affect that)

Then you said that the extension would be from the highest address bit-  but if VA>PA, isn’t that effectively zero extending the PA even in bare mode? That seems to contradict the first statement.

As usual, I’m probably interpreting something  which might be interpreted more than one way in exactly the wrong way- that is my superpower.  What am I getting wrong? Compliance wants to know!


-Allen

On Jun 30, 2020, at 12:45 PM, Greg Favor <gfavor@...> wrote:

Coming back to designs where the implemented PA size is greater than the supported VA size (for example, 48b VA's and 50b+ PA's), and in contrast to always sign-extending when VA size is greater than PA size, the situation is a little more complicated when PA size is greater than VA size:

On hardware and software writes one wants to zero-extend if under M-mode or a Bare translation mode, or sign-extend if under a translated mode.  In other words that one extra storage bit is loaded with either zero or sign extension of the highest supported address bit.

Compliance testing should keep this class of scenarios (i.e. PAsize > VAsize) in mind.

Greg


Re: xTVAL Compliance restriction proposal

Greg Favor
 

On Tue, Jun 30, 2020 at 4:19 PM Allen Baum <allen.baum@...> wrote:
You said:
  Sign extend if translated
 Zero extend otherwise (I don't think that is quite the same as Bare+M-mode because of MPRV, and I'm not sure how hypervisor modes affect that)

One could view the effect of MPRV as changing the mode that is in effect for a memory access.  For *tval purposes that end result is all that matters.  Ditto for the new-ish HLV/HLVX/HSV instructions.

I don't think the new/added hypervisor VS/VU modes change much - since they are supposed to behave like S/U modes.

For two-stage translations what matters is the stage that is causing an exception and what that stage's translation mode is.

So I believe my admittedly terse statement is correct (albeit without all the extra verbiage describing these various cases).  It is the relevant and final/effective privilege/translation modes that matter.
 

Then you said that the extension would be from the highest address bit-  but if VA>PA, isn’t that effectively zero extending the PA even in bare mode? That seems to contradict the first statement.

I agree that if VA>PA, then always sign-extending from the implemented address msb works.  My statement was addressing the case of PA>VA - in which sign-extending is not always correct.

Greg


As usual, I’m probably interpreting something  which might be interpreted more than one way in exactly the wrong way- that is my superpower.  What am I getting wrong? Compliance wants to know!


-Allen

On Jun 30, 2020, at 12:45 PM, Greg Favor <gfavor@...> wrote:

Coming back to designs where the implemented PA size is greater than the supported VA size (for example, 48b VA's and 50b+ PA's), and in contrast to always sign-extending when VA size is greater than PA size, the situation is a little more complicated when PA size is greater than VA size:

On hardware and software writes one wants to zero-extend if under M-mode or a Bare translation mode, or sign-extend if under a translated mode.  In other words that one extra storage bit is loaded with either zero or sign extension of the highest supported address bit.

Compliance testing should keep this class of scenarios (i.e. PAsize > VAsize) in mind.

Greg


Re: xTVAL Compliance restriction proposal

Andrew Waterman
 

The Rocket approach works here if you set the width of mtval to 1+max(VAbits,PAbits) and then sign-extend from the most-significant implemented bit.  The extra 1 bit allows for zero-extension of PAs.
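
In the same spirit, a sketch of how a read of mtval could reconstruct the architectural value from 1+max(VAbits,PAbits) stored bits (names and the storage_bits parameter are illustrative; storage_bits is assumed to be less than 64):

    #include <stdint.h>

    /* The extra top storage bit is written as the VA's sign for translated
     * accesses and as 0 for M-mode/Bare (physical) addresses, so a plain
     * sign-extension from the most-significant implemented bit reproduces
     * either a sign-extended VA or a zero-extended PA. */
    static uint64_t mtval_read(uint64_t stored, unsigned storage_bits)
    {
        uint64_t msb = (stored >> (storage_bits - 1)) & 1;
        uint64_t ext = msb ? ~0ULL << storage_bits : 0;
        return (stored & ~(~0ULL << storage_bits)) | ext;
    }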


On Tue, Jun 30, 2020 at 6:19 PM Allen Baum <allen.baum@...> wrote:
You said:
  Sign extend if translated
 Zero extend otherwise (I don't think that is quite the same as Bare+M-mode because of MPRV, and I'm not sure how hypervisor modes affect that)

Then you said that the extension would be from the highest address bit-  but if VA>PA, isn’t that effectively zero extending the PA even in bare mode? That seems to contradict the first statement.

As usual, I’m probably interpreting something  which might be interpreted more than one way in exactly the wrong way- that is my superpower.  What am I getting wrong? Compliance wants to know!


-Allen

On Jun 30, 2020, at 12:45 PM, Greg Favor <gfavor@...> wrote:

Coming back to designs where the implemented PA size is greater than the supported VA size (for example, 48b VA's and 50b+ PA's), and in contrast to always sign-extending when VA size is greater than PA size, the situation is a little more complicated when PA size is greater than VA size:

On hardware and software writes one wants to zero-extend if under M-mode or a Bare translation mode, or sign-extend if under a translated mode.  In other words that one extra storage bit is loaded with either zero or sign extension of the highest supported address bit.

Compliance testing should keep this class of scenarios (i.e. PAsize > VAsize) in mind.

Greg


Re: xTVAL Compliance restriction proposal

Allen Baum
 

You said:
  Sign extend if translated
 Zero extend otherwise (I don't think that is quite the same as Bare+M-mode because of MPRV, and I'm not sure how hypervisor modes affect that)

Then you said that the extension would be from the highest address bit-  but if VA>PA, isn’t that effectively zero extending the PA even in bare mode? That seems to contradict the first statement.

As usual, I’m probably interpreting something  which might be interpreted more than one way in exactly the wrong way- that is my superpower.  What am I getting wrong? Compliance wants to know!


-Allen

On Jun 30, 2020, at 12:45 PM, Greg Favor <gfavor@...> wrote:

Coming back to designs where the implemented PA size is greater than the supported VA size (for example, 48b VA's and 50b+ PA's), and in contrast to always sign-extending when VA size is greater than PA size, the situation is a little more complicated when PA size is greater than VA size:

On hardware and software writes one wants to zero-extend if under M-mode or a Bare translation mode, or sign-extend if under a translated mode.  In other words that one extra storage bit is loaded with either zero or sign extension of the highest supported address bit.

Compliance testing should keep this class of scenarios (i.e. PAsize > VAsize) in mind.

Greg
