Re: A proposal to enhance RISC-V HPM (Hardware Performance Monitor)

Andy Glew Si5

I have NOT been working on a RISC-V performance monitoring proposal, but I've been involved with performance monitoring, first as a user and then as an architect, for many years and at several companies.

I would like to draw this group's attention to some features of Intel x86 performance monitoring that turned out to be pretty successful.

First, you're already talking about hardware performance monitoring interrupts for statistical profiling. Good. A few comments on that below.

---+ Performance event filtering and transformation logic before counting

Next, I think one of the best bang-for-the-buck features of x86 EMON performance monitoring is the performance counter event filtering. RISC-V has only the most primitive version of this.

Every x86 performance counter has per counter event select logic.

In addition to that logic, there is a mask that specifies what modes to count in - User, OS, hypervisor. I see that some of the RISC-V proposals also have that. Good.

But more filtering is also provided:  

Each counter  has a "Counter Mask" CMASK - really, a threshold. When non-zero, this is compared to the count of the selected event in any given cycle. If >= CMASK, the counter is incremented by 1; if less, no increment.

=> This comparison allows a HW event to be used to profile things like "number of cycles in which 1, 2 or more, 3 or more ... events happened", e.g. how often you are able to achieve superscalar execution. In vectors, it might count how many vector elements are masked in or out. If you have events that correspond to buffer occupancy, you can profile to see where the buffer is full or not.

INV - a bit that allows the CMASK comparison to be inverted.

=> so that you can count cycles with event >= threshold, and cycles with event < threshold.
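
A minimal sketch of the CMASK/INV filter as described above, not of any actual Intel implementation. The function names and the trace are made up for illustration, and the sketch assumes that CMASK = 0 means "no threshold, add the raw per-cycle count".

```python
def filtered_increment(event_count, cmask=0, inv=False):
    """Return how much the counter advances in one cycle.

    event_count: times the selected event fired this cycle.
    cmask:       threshold; 0 is assumed to mean "add the raw count".
    inv:         invert the comparison (event_count < cmask).
    """
    if cmask == 0:
        return event_count
    met = event_count >= cmask
    if inv:
        met = not met
    return 1 if met else 0


def run_counter(per_cycle_events, cmask=0, inv=False):
    """Accumulate the filtered increments over a whole trace."""
    return sum(filtered_increment(n, cmask, inv) for n in per_cycle_events)


# Hypothetical trace: instructions retired in each of 8 cycles.
retired = [0, 2, 1, 0, 3, 2, 0, 1]

total_retired = run_counter(retired)                      # raw event count: 9
cycles_ge_2 = run_counter(retired, cmask=2)               # cycles retiring 2 or more: 3
cycles_stalled = run_counter(retired, cmask=1, inv=True)  # cycles retiring nothing: 3
```

The same event thus yields three different measurements purely by reprogramming the filter at the counter.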

I would really have liked to have >, <, and == threshold. And also the ability to increment by 1 if exceeding the threshold, or by the actual count that exceeds the threshold. The former allows you to find where you are getting good superscalar behavior or not; the latter allows you to determine the average when exceeding the threshold or not. I did not define this because I had to save hardware.

This CMASK comparison allows you to get more types of measurement out of events that can occur more than once per cycle. That's pretty good, but it doesn't help you with sparse events, events that only occur once every four or eight or N cycles. Later, Intel added what I call "push-out" profiling: when the comparison condition is met, e.g. when no instruction retires, a counter that increments once every clock cycle starts; when the condition changes, the value of that counter is what is recorded, and it is naturally subject to all of the normal filtering. That was too much hardware for me to add in 1991, but it proved very useful.
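
One way to read the "push-out" idea is sketched below: a duration counter runs while the condition holds, and its value is emitted as an event when the condition ends, so the durations can then be thresholded like any other event. This is an interpretation of the description above, not a model of Intel's hardware; the names and the trace are hypothetical, and flushing a burst still open at the end of the trace is a choice made only for this sketch.

```python
def push_out_durations(condition_per_cycle):
    """Return the length of each consecutive run of cycles where the condition held."""
    durations = []
    run = 0
    for met in condition_per_cycle:
        if met:
            run += 1              # duration counter ticks once per cycle
        elif run:
            durations.append(run)  # condition changed: record the duration
            run = 0
    if run:
        durations.append(run)      # sketch-only: flush a burst open at trace end
    return durations


# Hypothetical trace: instructions retired per cycle.
retired = [0, 0, 1, 0, 0, 0, 2, 1, 0, 1]
no_retire = [n == 0 for n in retired]

stall_lengths = push_out_durations(no_retire)              # [2, 3, 1]
long_stalls = sum(1 for d in stall_lengths if d >= 2)      # threshold on duration: 2
```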

My first instinct is always to minimize hardware cost for performance monitoring hardware. The nice thing about the filtering logic at the performance counters is that it removed the hardware cost from the individual units, like the cache, and left it in the performance monitoring unit. (The question of centralized versus decentralized performance counters is always an issue. Suffice it to say that Intel P6 had centralized performance counters, to save hardware; Pentium 4 went to a fully distributed performance counter architecture, but users hated it, so Intel returned to the centralized model, at least architecturally, although microarchitectural implementations might be decentralized.)

More filtering logic: each counter has an E (edge select) bit. This counts when the condition described by the privilege level mask and CMASK comparison changes. Using such edge filtering, you can determine the average length of bursts, e.g. the average length of a period in which you have not been able to execute any instruction, and so on. Simple filters can give you average lengths and occupancies; fancier and more expensive stuff is necessary to actually determine a distribution.
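
A rough sketch of how edge filtering gives an average burst length: one counter counts cycles in which the condition holds, a second counts the edges, and the ratio is the average. The sketch assumes the edge bit counts only false-to-true transitions; names and trace are hypothetical.

```python
def count_cycles_and_edges(condition_per_cycle):
    """Return (cycles with the condition true, number of false->true edges)."""
    cycles = 0
    edges = 0
    previous = False
    for current in condition_per_cycle:
        if current:
            cycles += 1
            if not previous:
                edges += 1    # a new burst starts here
        previous = current
    return cycles, edges


# Hypothetical trace: instructions retired per cycle.
retired = [0, 0, 1, 0, 0, 0, 2, 1, 0, 1]
stalled = [n == 0 for n in retired]   # i.e. CMASK=1 with INV set

stall_cycles, stall_bursts = count_cycles_and_edges(stalled)
average_stall_length = stall_cycles / stall_bursts   # 6 cycles / 3 bursts = 2.0
```

This gives the mean only; the individual burst lengths (2, 3 and 1 here) are not recoverable from the two counts.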

Certain events themselves are sent to the performance counters as bitmasks. E.g. the utilization of the execution unit ports as a bitmask - on the original P6 { ALU0/MUL/DIV/FMUL/FADD, ALU1/LEA, LD, STA, STD }, fancier on modern machines. By controlling the UMASK field of the filter control logic for each performance counter, you could specify to count all instruction dispatches, or just loads, and so on. Changing the UMASK field allowed you to profile to find out which parts of the machine were being used and which not. (This proved successful enough to get me and the software guy who eventually started using it an achievement award.)

If I were to do it over again I would have a generic POPCNT as part of the filter logic, as well as the comparison.
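
A small sketch of UMASK selection over a port-utilization bitmask, with the POPCNT option wished for above. The bit assignments, names, and trace are assumptions made for illustration; they are not the actual P6 encodings.

```python
# Hypothetical bit positions for the five P6-style dispatch ports.
PORT_ALU0 = 1 << 0
PORT_ALU1 = 1 << 1
PORT_LD = 1 << 2
PORT_STA = 1 << 3
PORT_STD = 1 << 4
ALL_PORTS = PORT_ALU0 | PORT_ALU1 | PORT_LD | PORT_STA | PORT_STD


def umask_increment(event_bits, umask, use_popcount=False):
    """Per-cycle increment for a bitmask event filtered by UMASK.

    Without popcount: 1 if any selected bit is set, else 0.
    With popcount:    the number of selected bits that are set.
    """
    selected = event_bits & umask
    if use_popcount:
        return bin(selected).count("1")
    return 1 if selected else 0


# Hypothetical trace: which ports dispatched in each of 5 cycles.
dispatch = [
    PORT_ALU0 | PORT_LD,
    PORT_LD | PORT_STA | PORT_STD,
    0,
    PORT_ALU0 | PORT_ALU1,
    PORT_LD,
]

load_cycles = sum(umask_increment(b, PORT_LD) for b in dispatch)        # 3
busy_cycles = sum(umask_increment(b, ALL_PORTS) for b in dispatch)      # 4
total_dispatches = sum(umask_increment(b, ALL_PORTS, use_popcount=True)
                       for b in dispatch)                               # 8
```

The popcount output could feed the threshold comparison, so one bitmask event covers both "which ports" and "how many per cycle".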

Finally, simple filter stuff:

INT - Performance interrupt enable

PC - pin control - this predated me: it toggled an external pin when the performance counter overflowed.  The very first EMON event sampling took that external pin and wired it back to the NMI pin of the CPU.  Obviously, it is better to have internal logic for performance monitoring interrupts. Nevertheless, there is still a need for externally visible performance event sampling, e.g. freeze performance events outside the CPU, in the fabric, or in I/O devices.  Exactly what those are is obviously implementation dependent, but it's still good to have a standard way of controlling such implementation dependent features. I call this the "pin architecture", and IMHO maintaining such system compatibility was as much a factor in Intel's success as instruction set architecture.

---+ Performance counter freeze

There are always several performance counters. At least two per privilege level. At least a pair, so you can compute things like cache miss rates and other ratios. But not necessarily dedicated to any specific privilege level, because that would be wasteful: you can study things a hell of a lot more quickly if you can use all of the performance counters when other modes are not using them.

When you have several performance counters, you are often measuring things together. You therefore need the ability to freeze them all at the same time. This means that you need to have all of the enable bits for all of the counters, or at least a subset, in the same CSR. If you want, the enable bit can be in both the per-counter control registers and in a central CSR - i.e. there can be multiple views of the same bit in different CSRs.
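
A toy model of the "multiple views of the same bit" arrangement: the enable bits are stored once, readable and writable per counter, and also writable all together through a central register. The class and method names are hypothetical and do not correspond to any existing RISC-V CSR.

```python
class CounterBank:
    """Toy model of N counters whose enable bits live in one shared register."""

    def __init__(self, num_counters):
        self.num_counters = num_counters
        self.global_enable = 0            # bit i enables counter i
        self.counts = [0] * num_counters

    # Per-counter view of the enable bit (same storage as global_enable).
    def get_enable(self, index):
        return bool((self.global_enable >> index) & 1)

    def set_enable(self, index, enabled):
        if enabled:
            self.global_enable |= 1 << index
        else:
            self.global_enable &= ~(1 << index)

    # Central view: one write starts or freezes every counter together.
    def write_global_enable(self, value):
        self.global_enable = value & ((1 << self.num_counters) - 1)

    def tick(self, increments):
        """Advance one cycle; increments[i] is counter i's filtered event count."""
        for i, inc in enumerate(increments):
            if self.get_enable(i):
                self.counts[i] += inc


bank = CounterBank(2)
bank.write_global_enable(0b11)   # start both counters in the same cycle
bank.tick([1, 4])                # e.g. cache misses, cache accesses
bank.tick([0, 3])
bank.write_global_enable(0)      # freeze both at once
bank.tick([1, 1])                # ignored: counters are frozen
miss_ratio = bank.counts[0] / bank.counts[1]
```

Because both counters stop on the same write, the ratio is computed over exactly the same window.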

Performance analysts would really like the ability to freeze multiple CPUs' performance counters at the same time. This is one motivation for that pin control signal.

---+ Precise performance monitoring interrupts - You can only wish!

When you are doing performance counter event interrupt based sampling, it would be really nice if the interrupt occurred exactly at the instruction that had the event.

If you can do that, great. However, it takes extra hardware to do that. Also, some events do not in any way correspond to a retired instruction - think events that occur on speculative instructions that never graduate/retire. Again, you can create special registers that record, say, the program counter of such a speculative instruction, but that is again extra hardware.

IMHO there is zero chance that all implementations, particularly the lowest-cost implementations, will make all performance events precise.

At the very least there should be a way of discovering whether an event is precise or not.   

Machine check architectures have the same issue.

---+ What should the priority of the hardware performance monitoring interrupts be?

One of the best things I did for Intel was punt on this issue: because I was also the architect in charge of the APIC interrupt controller, I provided a special LVT interrupt register just for the performance monitoring interrupt.

This allowed the performance monitoring interrupt to use all of the APIC features, or at least all of those that made sense. For example, the performance monitoring interrupt that Linux uses is just a normal interrupt of some priority. But as I mentioned above, the very first usage of performance monitoring interrupts used NMI, and was therefore able to profile code that had interrupts blocked. The interrupt could also be directed to SMM, the not very good baby virtual machine monitor, allowing performance monitoring to be done independent of the operating system. Very nice, when you don't have source code for the operating system. And so on. I can't remember, but it is possible that the interrupt could be directed to processors other than the local processor. However, that would not subsume the externally visible pin control, because the hardware pin can be a lot less expensive and a lot more precise than signaling an inter-processor interrupt.

I used a similar approach for machine check interrupts, which could also be directed to the operating system, NMI, SMM, hypervisor,…

By the way: I think Greg Favor said that x86's performance monitoring interrupts are level sensitive. That is not strictly speaking true: whether they are level sensitive or not is programmed into the APIC local vector table. You can make it level sensitive or edge triggered.

Obviously, however, when there are multiple performance counters sharing the same interrupt, you need to know which counter overflowed. Hence the sticky bits that Greg noticed in the manuals.
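
A sketch of why sticky overflow bits are needed when several counters signal one shared interrupt: the handler reads the status word to learn which counters overflowed, then clears exactly those bits. The class and function names are hypothetical, not the x86 or RISC-V register names.

```python
class OverflowStatus:
    """Toy sticky overflow status register, one bit per counter."""

    def __init__(self):
        self.bits = 0

    def record_overflow(self, index):
        self.bits |= 1 << index   # sticky: stays set until software clears it

    def clear(self, mask):
        self.bits &= ~mask


def handle_shared_interrupt(status):
    """Return the indices of counters that overflowed, then clear those bits."""
    pending = status.bits
    overflowed = [i for i in range(pending.bit_length()) if (pending >> i) & 1]
    status.clear(pending)         # clear only the bits we actually observed
    return overflowed


status = OverflowStatus()
status.record_overflow(0)
status.record_overflow(2)         # two counters overflow before the handler runs
first = handle_shared_interrupt(status)    # [0, 2]
second = handle_shared_interrupt(status)   # []
```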

---+ Fancier stuff

The above is mostly about getting the most out of simple performance counters: providing filter logic so that you can get the most insight out of the limited number of events; providing enables in a central place so that you can freeze multiple counters at the same time; allowing the performance counter interrupts to be directed not just to different privilege levels but to different interrupt priorities, including NMI, and possibly also to external hardware.

There's a lot more stuff that can be done to help performance monitoring. Unfortunately, I have always worked at a place where I had to reduce  the performance monitoring hardware cost as much as possible.   I am sure, however, that many of you are familiar with fancier performance monitoring features, such as

+ Histogram counting (allowing you to count distributions without making multiple runs)
    => the CMASK comparators allow a very simple form of this, assuming you have enough performance counters. Actual histogram counters can do this more cheaply.

+ Cycle attribution - defining performance events so that you can actually say things like "X% of cycles are spent waiting for memory".
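
The CMASK-based histogram mentioned in the first item above can be sketched as follows: program one counter per threshold, and difference adjacent counts to get "exactly k events per cycle". The sketch assumes a total cycle count is also available (for the zero bucket) and that every counter observes the same run; names and trace are hypothetical.

```python
def cycles_at_least(per_cycle_events, threshold):
    """What a counter programmed with CMASK = threshold would report."""
    return sum(1 for n in per_cycle_events if n >= threshold)


def histogram_from_thresholds(per_cycle_events, max_count):
    """Build {k: cycles with exactly k events} from threshold counts only."""
    total_cycles = len(per_cycle_events)
    # at_least[k - 1] holds the count for threshold k, for k = 1 .. max_count + 1.
    at_least = [cycles_at_least(per_cycle_events, k)
                for k in range(1, max_count + 2)]
    histogram = {0: total_cycles - at_least[0]}
    for k in range(1, max_count + 1):
        histogram[k] = at_least[k - 1] - at_least[k]
    return histogram


# Hypothetical trace: instructions retired per cycle on a 3-wide machine.
retired = [0, 2, 1, 0, 3, 2, 0, 1]
retire_histogram = histogram_from_thresholds(retired, max_count=3)
# {0: 3, 1: 2, 2: 2, 3: 1}
```

This takes one counter per bucket, which is why a dedicated histogram counter can be cheaper.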

IMHO the single most important "advanced" performance monitoring feature is what I call "longitudinal profiling". Examples are AMD Instruction Based Sampling (IBS), DEC ProfileMe, and ARM SPE (Statistical Profiling Extension). The basic idea is to set a bit on some randomly selected instruction packet somewhere high up in the pipeline, e.g. at instruction fetch, and then let that bit flow down the pipeline, sampling things as it goes. E.g. you might sample the cache miss latency or address, or whether it produced a stall in interaction with a different marked instruction. This sort of profiling is quite expensive, e.g. requiring a bit in many places in the pipeline, as well as registers to record the sample data, but it provides a lot of insight: it can give you distributions and averages, and it can tell you what interactions between instructions are causing problems.

However, if RISC-V cannot yet afford to do longitudinal profiling, the performance counter filter logic that I described above is low hanging fruit, much cheaper.

From: Alankao <alankao@...>
Sent: Monday, July 20, 2020 5:43PM
To: Tech-Privileged <tech-privileged@...>
Subject: Re: [RISC-V] [tech-privileged] A proposal to enhance RISC-V HPM (Hardware Performance Monitor)


On 7/20/2020 5:43 PM, alankao wrote:
It was just so poorly rendered in my mail client, so please forgive my spam.

Hi Brian,

> I have been working on a similar proposal myself, with overflow, interrupts, masking, and delegation. One of the key differences in my proposal is that it unifies
> each counter's configuration control into a per-counter register, by using mhpmevent* but with some fields reserved/assigned a meaning.  <elaborating>

Thanks for sharing your experience and the elaboration. The overloading-hpmevent idea looks like the one in the SBI PMU extension threads in Unix Platform Spec TG by Greg. I have a bunch of questions. What happened to your proposal later? Was it discussed in public? Did you manage to implement your idea as a working HW/S-mode SW/U-mode SW solution? If so, we can compete with each other by actually benchmarking the LoC of the perf patch (assuming you do it on Linux) and the system overhead of running a long perf sample.

> Another potential discussion point is, does overflow happen at 0x7fffffffffffffff -> 0x8000000000000000, or at 0xffffffffffffffff -> 0x0000000000000000? I have a
> bias towards the former so that even after overflow, the count is wholly contained in an XLEN-wide register treated as an unsigned number and accessible via
> a single read, which makes arithmetic convenient, but I know some people prefer to (or are used to?) have the overflow bit as a 33rd or 65th bit in a different
> register.

I have no bias here as long as the HPM interrupt can be triggered. But somehow it seems to me that you assume the HPM registers are XLEN-width, but actually they are not (yet?). The spec says they should be 64 bits wide, although obviously nobody implements or remembers that.
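
For readers comparing the two overflow conventions in the quoted text, here is a small sketch of the first one (overflow signaled when the top bit becomes set), assuming 64-bit counters. The helper names are hypothetical; the point is only that the count past the overflow stays readable from the same register.

```python
COUNTER_BITS = 64                      # assumed width, per the spec remark above
MSB = 1 << (COUNTER_BITS - 1)
MASK = (1 << COUNTER_BITS) - 1


def preload_for_period(period):
    """Initial value so the top bit sets after `period` events (1 <= period <= 2**63)."""
    return MSB - period


def add_events(value, n):
    """Advance the counter by n events, wrapping at the register width."""
    return (value + n) & MASK


def has_overflowed(value):
    return bool(value & MSB)


def events_since_overflow(value):
    """Events counted after the overflow point, read from the same register."""
    return value & (MSB - 1)


counter = preload_for_period(1000)
counter = add_events(counter, 999)     # now 0x7fffffffffffffff
before = has_overflowed(counter)       # False
counter = add_events(counter, 1)       # now 0x8000000000000000
after = has_overflowed(counter)        # True
counter = add_events(counter, 5)       # events that arrive before the handler runs
skid = events_since_overflow(counter)  # 5
```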

> Lastly, a feature I have enjoyed using in the past (on another ISA) is the concept of a 'marked' bit in the mstatus register. ... This is of course a bit intrusive in
> the architecture, as it requires adding a bit to mstatus, but the rest of the kernel just needs to save and restore this bit on context switches, without knowing its
> purpose.

Which architecture/OS are you referring to here? 

Through this discussion, we will understand which idea the community prefers: adding CSRs, overloading existing hpmevents, or some balanced compromise. I believe the ultimate goal of this thread should be determining what the RISC-V HPM should really be like.


I apologize for some of the language errors that occur far too frequently in my email. I use speech recognition much of the time, and far too often do not catch misrecognition errors. This can be quite embarrassing, amusing, and/or confusing. Typical errors are not spelling but homonyms, words that sound the same - e.g. "cash" instead of "cache".
