Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Bill Huffman

On 7/21/20 11:56 AM, Greg Favor wrote:

What I ultimately understood from Brian's emails is that this 'marked' bit conceptually is not a state bit associated with each counter, but a state bit maintained separately by software (as it transitions into and out of regions of code that it wants to be viewed as "marked" or not).  Then counters can be configured (via their event selection) to count either marked events or unmarked events or both.

With that view, the 'marked' bit wants to be changeable directly by the software without having to call into M-mode and without those bits of software having to be aware of which counters were configured to count marked events, or non-marked events, or both.  This is why one doesn't want to be trying to use mcountinhibit bits.

I agree about not having to call into M-mode.  But I think the software that swaps entire mcountinhibit registers doesn't have to know anything about the bits in them either.

In any case, I'm just trying to represent what I understand to be Brian's request.  But as he also acknowledged, the primary use case is probably bare-metal embedded systems and may not be more generally relevant.

As far as the 'Active' bit in my proposal, it allows for both software and hardware events to set/clear a counter's Active bit.  That's more interesting when one has hardware cross-trigger events from the debug Trigger Module and from specific counter overflows.  But since this all wants to go together, I would be fine with removing the Active bit from my proposal.  This simple set of cross-trigger capabilities (to create rich events that can be counted and to create richer debug triggers and richer trace control) is best treated as a separate (future) extension proposal that may or may not catch enough people's interest.  (We're all for standardizing these richer capabilities in some form, but if that doesn't happen then we (Ventana) will implement this as our own custom stuff.)


I'm not trying to argue for or against any particular thing - including the Active bit - at this point.  Just wanting a full understanding....


On Tue, Jul 21, 2020 at 11:08 AM Bill Huffman <huffman@...> wrote:


I think the advantages you explain are exactly the reason why I was asking whether the marked bit was a significant improvement.  I think the functionality is extremely similar to swapping mcountinhibit CSR values.  For example, there can be one counter that counts marked activities and another that counts non-marked activities.  This would be done by having two counters set up the same way, except that the inhibit bit of one counter is set in one mcountinhibit value and the inhibit bit of the other counter is set in the other value.

It seems to me that the advantage of the marked bit is that there isn't a register to swap.  But the bit in the status register still needs to be changed at those same times.  So, I'm seeing only tiny differences - such as that, with the marked bit, there's no need for a location to save a mcountinhibit value when it's not being used.  And presumably supervisor can change the marked bit in status.  On the other hand, the mcountinhibit method allows the effect of more than two values of the marked bit by having more than two mcountinhibit values.
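A minimal sketch of the swap scheme described above, written as a plain C model rather than real CSR accesses. It assumes three counters (3, 4 and 5) programmed with the same event; the names `inhibit_model`, `swap_inhibit`, `inhibit_event` and the two `INHIBIT_WHILE_*` values are hypothetical and only illustrate the idea. The one architectural fact relied on is that a set bit i in mcountinhibit stops counter i from incrementing.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_HPM 32u                 /* one mcountinhibit bit per counter index */
#define BIT(i)  (UINT32_C(1) << (i))

/* Hypothetical precomputed values: counter 3 counts only marked regions,
 * counter 4 counts only unmarked regions, counter 5 is never inhibited. */
#define INHIBIT_WHILE_MARKED   BIT(4)
#define INHIBIT_WHILE_UNMARKED BIT(3)

struct inhibit_model {
    uint32_t mcountinhibit;         /* bit i set: counter i does not count */
    uint64_t count[NUM_HPM];
};

/* Region entry/exit writes a whole value and gets the old one back; the
 * caller does not need to know what the individual bits mean. */
static uint32_t swap_inhibit(struct inhibit_model *m, uint32_t next)
{
    uint32_t prev = m->mcountinhibit;
    m->mcountinhibit = next;
    return prev;
}

/* Model n occurrences of the event that counters 3..5 are all programmed for. */
static void inhibit_event(struct inhibit_model *m, uint64_t n)
{
    for (unsigned i = 3; i <= 5; i++)
        if (!(m->mcountinhibit & BIT(i)))
            m->count[i] += n;
}

static uint64_t inhibit_read(const struct inhibit_model *m, unsigned idx)
{
    assert(idx < NUM_HPM);
    return m->count[idx];
}
```

A third precomputed value would give a third kind of region, which is the "more than two values" point; the cost is that each region needs somewhere to keep the value it is not currently using.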

So, it doesn't seem to me a very large improvement once mcountinhibit already exists, even for the embedded cases where context-switching of perfmon registers is not done.


On 7/20/20 10:25 PM, Brian Grayson wrote:

The 'marked bit' in my proposal is different from a per-counter active bit, and may be a bit hard to explain well, but I'll try.

To me, there are two very different performance monitor approaches for system software:

- Linux-style, where one uses a tool like perf, and perfmon state is saved and restored on context switches. In this case, a 'marked bit' is not needed, as one can just control everything on a per-process basis.

- embedded bare-metal whole-system performance monitoring, where there is no support like perf. This is where the usefulness of the 'marked bit' becomes more obvious: one configures the counters to be free-running, except for the masking. For example, consider an embedded application where there is the bread-and-butter ordinary work, but also some kind of exceptional work (bad packet checksum, new route to establish, network link up/down, etc., as networking examples). One could set and clear the marked bit on entry to and exit from these routines, allowing easy profiling of everything, or of just the ordinary work, or of just the exceptional work. The same could be done by having each entry/exit point reprogram N counters, or by altering a global mcountinhibit, but both of those approaches fall short. The first forces either a recompile of your system software whenever you want to change events, or a swap-in/swap-out of perfmon state (just like a context switch) when entering/leaving these routines, while the marked bit just requires setting a single bit. The second (using mcountinhibit) forces one to choose between counting and not counting, and doesn't allow you to count marked-bit activities on one counter, non-marked-bit activities on another counter, and all activities (both marked and unmarked) on a third counter.
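The bare-metal case above can be sketched as a small C model. This is only an illustration under stated assumptions: the thread does not define an encoding, so `mark_filter`, `hpm_model`, `set_marked`, `hpm_set_filter` and `hpm_event` are hypothetical names, and the model stands in for hardware counters that all select the same event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-counter filter selected through the event configuration. */
enum mark_filter { COUNT_MARKED, COUNT_UNMARKED, COUNT_BOTH };

#define NUM_COUNTERS 3

struct hpm_model {
    bool marked;                           /* the single software-owned bit */
    enum mark_filter filter[NUM_COUNTERS];
    uint64_t count[NUM_COUNTERS];
};

static void hpm_set_filter(struct hpm_model *m, int idx, enum mark_filter f)
{
    assert(idx >= 0 && idx < NUM_COUNTERS);
    m->filter[idx] = f;
}

/* Entry/exit hooks of an exceptional routine only flip this one bit; they
 * never need to know how the counters are configured. */
static void set_marked(struct hpm_model *m, bool on)
{
    m->marked = on;
}

/* Model n occurrences of the selected event under the current marked state. */
static void hpm_event(struct hpm_model *m, uint64_t n)
{
    for (int i = 0; i < NUM_COUNTERS; i++) {
        bool counts = m->filter[i] == COUNT_BOTH ||
                      (m->filter[i] == COUNT_MARKED && m->marked) ||
                      (m->filter[i] == COUNT_UNMARKED && !m->marked);
        if (counts)
            m->count[i] += n;
    }
}
```

With counter 0 set to `COUNT_MARKED`, counter 1 to `COUNT_UNMARKED` and counter 2 to `COUNT_BOTH`, a single run yields the exceptional work, the ordinary work and the total side by side, and changing which events are profiled does not touch the instrumented routines.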

If all of the embedded customers will be using a perf-style approach with context-switching of perfmon registers, I think I can agree that the marked bit is not as useful, but I don't think that will be true for all of our customers.


On Mon, Jul 20, 2020 at 11:30 PM Greg Favor <gfavor@...> wrote:

Hopefully my last email also answers your question.


On Mon, Jul 20, 2020 at 8:04 PM Bill Huffman <huffman@...> wrote:


I'm curious whether the 'marked' bit is a significant improvement on the mcountinhibit CSR, which could be swapped to enable counters for multiple processes.


On 7/20/20 1:54 PM, Brian Grayson wrote:

I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.

In your proposal, the control for counter3, for example, is in mcounteren (for interrupt enablement), mcounterovf (for overflow status), mcountermask (for which modes to count), and of course mhpmevent3. In my proposal, all of these would be contained in the existing per-counter configuration register, mhpmevent*. There is not much of a difference between these proposals, except that in your proposal, programming a counter may require read-modify-write'ing 5 registers (those 4 control plus the counter itself), whereas in my proposal it requires writing (and not read-modify-write'ing) only two registers. In practice, programming 4 counters would require 20 RMW's (20 csrr's and 20 csrw's and some glue ALU instructions in the general case -- some of those would be avoidable of course) in your proposal vs. 8 simple writes in mine -- four to set the event config, four to set/clear the counter registers. (I am a big fan of low-overhead perfmon interactions, to reduce the impact on the software-under-test.) With a shadow-CSR approach, both methods could even be supported on a single implementation, since it is really just a matter of where the bits are stored and how they are accessed, and not one of functionality.
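The access-count comparison above is simple arithmetic, sketched here in C for the general (worst) case. The helper names `split_layout_ops` and `unified_layout_ops` are hypothetical; the per-counter figures (five read-modify-writes versus two plain writes) are the ones stated in the paragraph, and the limit of 29 reflects mhpmcounter3 through mhpmcounter31.

```c
#include <assert.h>

#define MAX_HPM_COUNTERS 29u   /* mhpmcounter3 .. mhpmcounter31 */

struct csr_ops {
    unsigned reads;            /* csrr-style accesses  */
    unsigned writes;           /* csrw-style accesses  */
};

/* Control spread over four shared CSRs plus the counter itself:
 * five read-modify-writes per counter, i.e. one read and one write each. */
static struct csr_ops split_layout_ops(unsigned counters)
{
    assert(counters <= MAX_HPM_COUNTERS);
    const unsigned regs_per_counter = 5;
    struct csr_ops ops = { counters * regs_per_counter,
                           counters * regs_per_counter };
    return ops;
}

/* Control folded into mhpmevent*: one plain write for the event
 * configuration and one for the counter value, no reads. */
static struct csr_ops unified_layout_ops(unsigned counters)
{
    assert(counters <= MAX_HPM_COUNTERS);
    struct csr_ops ops = { 0, counters * 2 };
    return ops;
}
```

For four counters this gives 20 reads plus 20 writes against 8 writes, before any of the avoidable accesses in the split layout are optimized away.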

From my count, roughly 8 or so bits would be needed in mhpmevent* to specify interrupt enable, masking mode, et al. These could be allocated from the top bits, allowing 32+ bits for ordinary event specification.
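As an illustration only, one way eight top bits could be carved out of a 64-bit mhpmevent* value is sketched below. Every field name and bit position here is an assumption made for the example (the thread does not specify a layout), and the sketch assumes RV64, where the remaining 56 bits stay available for event selection.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: bits 63:56 control, bits 55:0 event selector. */
#define HPMEVENT_CTRL_SHIFT 56
#define HPMEVENT_EVENT_MASK ((UINT64_C(1) << HPMEVENT_CTRL_SHIFT) - 1)

/* Hypothetical control bits within the 8-bit control field. */
#define CTRL_IRQ_EN  0x80u   /* raise an interrupt on overflow */
#define CTRL_COUNT_M 0x40u   /* count while in M-mode          */
#define CTRL_COUNT_S 0x20u   /* count while in S-mode          */
#define CTRL_COUNT_U 0x10u   /* count while in U-mode          */

/* Build the value for a single csrw to mhpmevent*. */
static uint64_t hpmevent_pack(uint8_t ctrl, uint64_t event)
{
    assert((event & ~HPMEVENT_EVENT_MASK) == 0);  /* event must fit in 56 bits */
    return ((uint64_t)ctrl << HPMEVENT_CTRL_SHIFT) | event;
}

static uint8_t hpmevent_ctrl(uint64_t value)
{
    return (uint8_t)(value >> HPMEVENT_CTRL_SHIFT);
}

static uint64_t hpmevent_event(uint64_t value)
{
    return value & HPMEVENT_EVENT_MASK;
}
```

Because the whole configuration is one value, software can compute it ahead of time and program a counter with a plain write instead of a read-modify-write.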

Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different register.
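A minimal sketch of the first option (overflow when bit 63 goes from 0 to 1), assuming a 64-bit counter and a software-chosen preload value. The helper names `preload_for_period`, `overflowed` and `events_since` are hypothetical; the point is only that one unsigned read still carries the full count after the overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OVF_BIT (UINT64_C(1) << 63)

/* Preload value such that bit 63 becomes set after exactly 'period' events. */
static uint64_t preload_for_period(uint64_t period)
{
    assert(period > 0 && period <= OVF_BIT);
    return OVF_BIT - period;
}

/* The overflow indication is simply the top bit of the counter itself. */
static bool overflowed(uint64_t counter)
{
    return (counter & OVF_BIT) != 0;
}

/* Events counted since the preload, from a single read of the counter.
 * Valid both before and after the overflow point, as long as the counter
 * has not wrapped past 2^64. */
static uint64_t events_since(uint64_t counter, uint64_t preload)
{
    return counter - preload;
}
```

With the other convention (wrap from all-ones to zero), the same bookkeeping needs the overflow flag from a second register to reconstruct the count.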

Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. If one allows masked counting based on whether this bit is set or clear, it allows additional filtering on when to count. For example, in a system with multiple processes, one set of processes could have the marked bit set, and the remaining processes could keep the marked bit clear. One can then count "total instructions in marked processes" and "total instructions in unmarked processes" in a single run on two different counters. In complicated systems, this can be enormously helpful. This is of course a bit intrusive in the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its purpose.

As you point out, interrupts and overflows are the key show-stoppers with the current performance monitor architecture, and those are the most important to get added.


On Sun, Jul 19, 2020 at 10:47 PM alankao <alankao@...> wrote:
Hi all,

This proposal is a refinement and update of a previous thread:  We noticed the current activities regarding HPM in the Linux community and in the Unix Platform Spec Task Group, so again we would like to call for attention to the fundamental problems of RISC-V HPM.  In brief, we plan to add 6 new CSRs to address these problems.  Please check for a simple perf example and the details.

